NetGPT: An AI-Native Network Architecture for Provisioning Beyond Personalized Generative Services
arXiv:2307.06148v2 [cs.LG] 23 Jul 2023

Abstract—Large language models (LLMs) have achieved tremendous success in empowering our daily life with generative information. The personalization of LLMs could further contribute to their applications due to better alignment with human intents. Towards personalized generative services, a collaborative cloud-edge methodology is promising, as it facilitates the effective orchestration of heterogeneous distributed communication and computing resources. In this article, after discussing the pros and cons of several candidate cloud-edge collaboration techniques, we put forward NetGPT to capably synergize appropriate LLMs at the edge and the cloud based on their computing capacity. In addition, edge LLMs could efficiently leverage location-based information for personalized prompt completion, thus benefiting the interaction with the cloud LLM. After deploying representative open-source LLMs (e.g., the GPT-2-base model and the LLaMA model) at the edge and the cloud, we present the feasibility of NetGPT on the basis of low-rank adaptation-based light-weight fine-tuning. Subsequently, we highlight the essential changes required for an artificial intelligence (AI)-native network architecture towards NetGPT, with an emphasis on deeper integration of communication and computing resources and careful calibration of the logical AI workflow. Furthermore, we demonstrate several benefits of NetGPT that come as by-products, as the edge LLMs' capability to predict trends and infer intents promises a unified solution for intelligent network management & orchestration. We argue that NetGPT is a promising AI-native network architecture for provisioning beyond personalized generative services.

Y. Chen and R. Li are with Zhejiang University, Hangzhou 310027, China (email: {cyx00, lirongpeng}@zju.edu.cn). C. Peng and J. Wu are with Huawei Technologies Co., Ltd., Shanghai 210026, China (email: {pengchenghui,wujianjun}@huawei.com). Z. Zhao and H. Zhang are with Zhejiang Lab, Hangzhou 310012, China, as well as Zhejiang University, Hangzhou 310027, China (email: {zhaozf,honggangzhang}@zhejianglab.com). E. Hossain is with University of Manitoba, Winnipeg, Manitoba, Canada (email: ekram.hossain@umanitoba.ca).

I. INTRODUCTION

With the remarkable success of deep learning, spanning from decision-making in AlphaGo to human-level interaction like ChatGPT, it is anticipated that artificial intelligence (AI) will be embodied in 6G networks. Along with the enhanced edge computing capabilities, AI could benefit the effective orchestration of network resources and improve the quality of service (QoS). Correspondingly, the investigation of efficient AI-based service provisioning has attracted intense research interest. On the other hand, the application of one AI model is often limited to certain scenarios or tasks. In this context, large language models (LLMs) (e.g., the generative pre-trained transformer, GPT [1]) could perform well in various natural language processing (NLP) and computer vision tasks. However, fine-tuning is still a prerequisite to align pre-trained LLMs to follow human intents [2] and yield personalized outputs. Therefore, it might be cost-ineffective to deploy LLMs on centralized clouds only, as this inevitably requires multiple copies of complete model parameters to support different purposes.

In order to boost the personalization of LLMs, a collaborative cloud-edge methodology is essential [3]. Compared to cloud-only LLM deployment, such cloud-edge collaboration enjoys multi-fold merits. Firstly, it provides more freedom for edge servers to deploy various fine-tuned LLMs and adapt to environmental differences, thus making service personalization and customization possible. Meanwhile, it contributes to bridging data-abundant generative devices with more adjacent servers. Therefore, it could reduce the latency and save the communication overhead of uploading data to more remote cloud servers. Incorporating generative LLMs into edge networks promises to facilitate the effective utilization of communication and computing (C&C) resources.

As illustrated in Fig. 1, there are several distinctive ways to implement cloud-edge collaboration for the deployment of LLMs (e.g., local fine-tuning, model splitting). Specifically, by offloading cloud-trained LLMs, local edge servers tailor them to accommodate personalized and customized services based on user preferences and specified scenarios. In this regard, federated learning or parallel training can be leveraged as a complement to facilitate the tuning [2], [4]. However, such an approach might face severe implementation issues in practice, as repetitive fine-tuning of complete LLMs implies a significant computational burden, and distributed deployment of proprietary LLMs might also raise intellectual property concerns from model developers. Meanwhile, force-fitting an entire LLM onto the edge possibly strains the limited computing resources of edge servers and makes the cost of edge computing unacceptable, while offloading the LLMs incurs significant communication overhead. Alternatively, splitting LLMs between cloud and edge servers [4], by deploying some layers of the large-scale deep neural network (DNN) at the edge while leaving the remaining layers to the cloud, can effectively balance the disproportionate computing resources of edge and cloud servers. Within model splitting, how to effectively partition the DNN between the edge and the cloud is one of the most challenging issues, as the partition should minimize the end-to-end latency while maintaining a sufficiently small model size for the edge servers [4]. Such a model partitioning can be even more intricate,
[Fig. 1 panels: (a) LLM fine-tuning (offload & fine-tune), (b) LLM splitting, (c) LLM synergy; each panel spans the cloud, edge and terminal across the control, user and computing planes.]
Fig. 1. An illustration of candidate means to realize the cloud-edge collaboration for NetGPT.
given billions of parameters in a typical LLM. Moreover, the wide adoption of residual links in LLMs possibly limits the choice of suitable partitioning points. Besides, the LLMs might leak private details from the training data [5]. In other words, it might be challenging to directly adopt either local fine-tuning or model splitting as an implementation means of the collaborative cloud-edge methodology.

In this article, we put forward NetGPT, which aims to respect the cloud-edge resource imbalance and synergize different sizes of functional LLMs at the edge and the cloud. Specifically, in apparent contrast to an AI-exogenous network with decoupled C&C resources, NetGPT could leverage converged C&C to deploy smaller LLMs at the edge and larger ones in the cloud, and meaningfully realize collaborative cloud-edge computing to provision personalized content-generation services. Besides, NetGPT incorporates a logical AI workflow that could be developed to determine performance-consistent communication links. For example, in NetGPT, the performance-driven communication link could terminate at the edge to accelerate the response, assuming the availability of satisfactory edge LLM-induced content. Otherwise, inspired by the idea of prompt learning [6], the LLMs at the edge can infer the context and actively append (or fill in) some local or personalized information, so as to acquire a more comprehensive result from the cloud. Furthermore, as a by-product, the edge LLMs contribute to a unified solution for intelligent network management & orchestration (e.g., user intent inference and popularity prediction). Therefore, NetGPT coincides with the trend to deeply integrate C&C, and represents an LLM-empowered AI-native architecture.

II. IMPLEMENTATION SHOWCASE OF NETGPT

As illustrated in Fig. 2, we present a synergistic cloud-edge framework to accomplish personalized generative services, by leveraging distinctive pre-trained LLMs for cloud and edge (e.g., base station [BS]) deployment. In particular, limited by the availability of open-source LLMs, we select and deploy the LLaMA-7B model [7] and the GPT-2-base model, which consist of approximately 6.7 and 0.1 billion parameters, at the cloud and the edge, respectively. However, it should be noted that NetGPT allows the utilization of other LLMs as well. On this basis, we delve into the implementation details of cloud-edge LLM synergy towards NetGPT in an incremental manner. In particular, we start with a quick and general overview of the transformer, the basis of LLMs, and enumerate the detailed DNN structures of two LLMs (i.e., the LLaMA-7B model and the GPT-2-base model). Then, we discuss effective means to fine-tune these LLMs on computation-limited devices, and manifest the effectiveness for location-based personalized generative services.

A. An Overview of Transformer

The transformer has been extensively adopted as the foundation model to constitute a multi-layer decoder in LLMs. The fundamental concept behind the transformer involves the construction of the DNN structure through the use of multi-layer self-attention and feed-forward neural networks (FNNs). In particular, self-attention relies on an attention head parameterized by query, key and value matrices (i.e., Q, K and V) to compute the internal correlation within an input sequence, by deriving and assigning different weights to different positions within the sequence. Meanwhile, the FNN applies a non-linear transformation to the representation of each position. Additionally, techniques such as layer normalization are employed to mitigate the issue of vanishing gradients.

B. DNN Structure of LLMs at the Edge and Cloud

1) DNN structure of GPT-2-base model: The GPT-2-base model, which is the smallest version of the GPT-2 series, encompasses 12 stacked layers of the original transformer structure (i.e., an 8-head self-attention sublayer and an FNN sublayer). A fixed absolute positional encoding of sine and cosine positions is employed to pre-transform the input sequence. In addition, GPT-2 leverages a rectified linear unit (ReLU) activation function (i.e., f_ReLU(x) = max(0, x)). Due to its relatively strong performance and minimal computational requirements, it is appropriate for deployment at the network edge.
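As a concrete illustration, a single masked self-attention head of the decoder-style transformer described above can be sketched in a few lines of NumPy. This is a minimal sketch of the mechanism, not the exact GPT-2 implementation; the dimensions are chosen arbitrarily for illustration.

```python
import numpy as np

def attention_head(X, Wq, Wk, Wv):
    """Single masked self-attention head of a decoder-style transformer.

    X: (seq_len, d_model) input representations.
    Wq, Wk, Wv: projections producing the query, key and value matrices.
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv              # (seq_len, d_head) each
    d_head = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_head)            # pairwise position correlations
    # Causal mask: each position attends only to itself and earlier positions.
    mask = np.triu(np.ones_like(scores), k=1).astype(bool)
    scores = np.where(mask, -np.inf, scores)
    # Row-wise softmax assigns different weights to different positions.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                             # weighted sum of values

rng = np.random.default_rng(0)
seq_len, d_model, d_head = 4, 8, 2
X = rng.normal(size=(seq_len, d_model))
Wq, Wk, Wv = (rng.normal(size=(d_model, d_head)) for _ in range(3))
out = attention_head(X, Wq, Wk, Wv)
print(out.shape)  # (4, 2)
```

Because of the causal mask, the first position can only attend to itself, so its output equals its own value vector; a full transformer layer would follow this with the position-wise FNN and layer normalization mentioned above.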
[Fig. 2: a fine-tuned GPT-2-base at the edge extends the concise prompt into a comprehensive prompt for the responsive LLaMA-7B at the cloud; a modified transformer block and its inputs are detailed alongside.]
2) DNN structure of LLaMA model: LLaMA, which is trained on a large set of unlabeled data and is ideal for fine-tuning for downstream tasks, features various parameter versions as well [7]. Compared to GPT-3, LLaMA incorporates several specific enhancements to maintain similar performance while significantly reducing the number of parameters [7]. Firstly, in order to enhance training stability, LLaMA normalizes the input of each sub-layer instead of normalizing the output. Moreover, it adopts the root mean square layer normalization (RMSNorm) function [8] as a simplified replacement for layer normalization, by employing the root mean square (RMS) rather than the standard deviation. Additionally, RMSNorm introduces a learnable scaling factor that enables adaptive feature scaling. Thus, it contributes to improving normalization effects for diverse features that have distinctive value ranges. Secondly, LLaMA replaces the ReLU activation function with the Swish-gated linear unit (SwiGLU) [9], which combines the Swish function (i.e., f_Swish(x) = x · σ(βx) with σ(x) = 1/(1 + e^(−x)) and a trainable parameter β) with the GLU (i.e., f_GLU(x) = x · σ(Wx + b) with trainable parameters W and b), thereby possibly activating neurons according to the input in a more selective manner and imparting smoothness to effectively capture intricate non-linear relationships. Lastly, LLaMA introduces rotary position embedding (RoPE) [10], which encodes positional information with a pre-defined rotation matrix and naturally incorporates explicit relative position dependency into the self-attention formulation. Compared to absolute position encoding, which assigns a distinct encoded representation to each position in the sequence, the relative position encoding taken by RoPE enables more effective modeling of long-range dependencies within the contextual information. Thereby, RoPE aligns with intuitive understanding and exhibits superior performance in practice.

C. Fine-Tuning Techniques

1) Low-rank adaptation-based light-weight fine-tuning: As LLaMA lacks the capability to generate responsive text [7], extra fine-tuning is still required. However, direct fine-tuning of an LLM such as LLaMA requires significant computational resources. For example, it demands 112 GB of video random access memory (VRAM) to fine-tune the LLaMA-7B model, far more than the capacity of an NVIDIA A100 Tensor Core GPU. Therefore, we leverage the low-rank adaptation (LoRA) technique [11] to achieve parameter-efficient fine-tuning on consumer-level hardware. In particular, in order to fine-tune a complete parameter matrix W ∈ R^(d_in×d_out), LoRA adds a bypass pathway to simulate the matrix update ∆W by using two low-rank parameter matrices A ∈ R^(d_in×r) and B ∈ R^(r×d_out) with intrinsic rank r. In other words, under the condition that r ≪ min(d_in, d_out), LoRA transforms the large parameter matrix ∆W into lower-dimensional ones with ∆W ≈ AB. Our experiment shows that it only costs 28 GB of VRAM to fine-tune the LLaMA-7B model, without significantly elongating the training duration. Additionally, the required storage space for fine-tuning could be greatly reduced from 12.55 GB to 50 MB¹. On the basis of LoRA, we can utilize the Stanford Alpaca dataset [12] to fine-tune the LLaMA-7B model and obtain a qualified responsive LLaMA-7B model.

¹Such statistics are obtained under the configuration that r = 8 and a learning rate-related scaling factor equals 16.
2) LLM-instructed data collection: In order to implement a personalized edge LLM, it is crucial to grant the GPT-2-base model the capability to extend a "concise" prompt by appending location-based information. Basically, the positioning information can be conveniently obtained according to the locations of affiliated BSs stored in the 5G access and mobility management function (AMF). Meanwhile, in order to complement more comprehensive information, we take the self-instruct approach [13] and leverage OpenAI's Text-Davinci-003 model to generate useful text samples. In particular, for each location, we use a set of manually-written location-related prompts to interact with OpenAI's Text-Davinci-003 model, and leverage the generated response texts as the "comprehensive prompt". Correspondingly, a series of mappings between a "concise prompt" and a "comprehensive prompt" can be collected. Considering the size and task complexity of the edge LLM, we collect a dataset comprising approximately 4,000 samples for directly fine-tuning the GPT-2-base model towards a prompt-completion model. Notably, for scenarios where stronger generality is required, edge LLMs can be enhanced with a larger-scale LLM, and fine-tuning techniques such as LoRA can be employed as well.

D. Performance Showcase

Fig. 3 demonstrates the performance of NetGPT. In particular, as illustrated in the left part of Fig. 3, the edge LLM is capable of complementing the "concise prompt"; the chart highlights the most frequently used words for generating each corresponding "comprehensive prompt". Furthermore, different BSs add location-based personalized information so as to satisfy distinctive requirements. Subsequently, the edge LLM processes the user-submitted "concise prompt" and feeds the complemented prompt to the cloud. Next, a more complete generative response can be anticipated. It can be observed from the right part of Fig. 3 that NetGPT yields different location-based responses, which manifests its capability to handle personalized generative services through effective cloud-edge synergy.

III. AI-NATIVE NETWORK ARCHITECTURE TOWARDS NETGPT

We argue that NetGPT provides the opportunity to transform cellular networks into an AI-native networking architecture, which provisions personalized, networked and inclusive intelligence for end users and grants users more privilege to access generative services anytime and anywhere. Nevertheless, such a transformation does come at a cost. It requires substantial changes, far more than installing racks of servers at the edge location and local break-out of the traffic for edge processing. In particular, compared with conventional connectivity-oriented communication systems, wherein a typical service establishes connections between two specific terminals and the communication source and destination are clearly defined by end users, NetGPT needs to establish generative performance-driven connections in a more implicit manner. Moreover, as NetGPT involves more frequent data collection and processing modules for training personalized LLM models, computing resources shall be consistently scheduled to accomplish NetGPT. In other words, as shown in Fig. 4, NetGPT necessitates the design of deeply converged C&C in radio access networks (RANs). On top of these novel features, a logical AI workflow shall be devised to establish (beyond) generative service orchestration.

A. Converged C&C in RAN

In order to effectively organize heterogeneous resources, which may simultaneously cover terrestrial and non-terrestrial communications, the RAN for NetGPT has to first provide a control plane (CP) for seamless connectivity control to ubiquitously support prompting and generative information transmission in the user plane (UP). Such elements can be developed in accordance with 5G and 5G-Advanced techniques. In addition, it is worthwhile to introduce an independent computing plane (CmP) to coordinate computing resources and perform AI-related functionalities, so as to facilitate the deployment and update of generative services.

B. Data Processing and Privacy Protection

As discussed in Section II, data processing (e.g., data collection and fine-tuning) is heavily leveraged to lay the very foundation for producing generative LLM models. Besides collecting and storing data, it is essential to introduce data desensitization modules as key data processing services, so as to avoid privacy risks and protect the privacy embedded in the data. Meanwhile, data policy enforcement modules, which handle data according to regulatory as well as non-regulatory rules (e.g., geographic restrictions), will be executed by default to ensure the integrity and legitimacy of data processing. Moreover, contingent on the regulation and data usage policy, it is also feasible to devise some data processing model libraries and expose their capabilities with appropriate access control for entities to utilize the data services.

C. Personalized Profiling

In order to create a highly customized NetGPT, location-oriented profiling shall be significantly enhanced to support the definition and operation of personalized generative AI services. For example, local venue and facility information can be specially gathered to train edge LLMs. On the other hand, user service nodes (USNs) can contain customized services at the end-user level as well, so as to meet diversified customer requirements. Meanwhile, such profiling could further support establishing user profiles and characterizing connected terminals.

D. C&C Resource Management

As part of the provisioned services in future cellular networks, the orchestration of resources for NetGPT shares some similarities with that for other network services, including establishing connectivity and allocating computing resources. However, it also poses additional challenges, since the scope of resources spans distributed nodes from the cloud to the terminal. Therefore, a novel protocol stack needs to be carried on radio control signaling (e.g., RRC) or radio data protocols (e.g., PDCP or SDAP) to transmit AI-generative messages and implement model update & distribution.
Fig. 3. A performance showcase of NetGPT for personalized generative services with edge LLM-assisted prompt completion.
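The LLM-instructed data collection behind this showcase (Section II-C2) can be sketched as follows. This is a minimal sketch: `query_teacher_llm` is a stand-in for the call to OpenAI's Text-Davinci-003 model, and the prompt templates and locations are invented placeholders, not the manually-written prompts actually used.

```python
# Self-instruct-style collection of (concise, comprehensive) prompt pairs
# for fine-tuning the edge prompt-completion model.

def query_teacher_llm(prompt: str) -> str:
    # Placeholder: a real implementation would query the teacher LLM here.
    return f"{prompt} Nearby venues and facilities: ..."

# Hypothetical manually-written, location-related prompt templates.
TEMPLATES = [
    "I am near {location}. Recommend a restaurant.",
    "I am near {location}. Plan a one-day trip.",
]

def collect_pairs(locations):
    """Build concise -> comprehensive prompt mappings per location."""
    pairs = []
    for loc in locations:
        for template in TEMPLATES:
            concise = template.format(location=loc)
            comprehensive = query_teacher_llm(concise)
            pairs.append({"concise": concise, "comprehensive": comprehensive})
    return pairs

dataset = collect_pairs(["the West Lake", "the central station"])
print(len(dataset))  # 4 samples here; the paper collects roughly 4,000
```

Each collected pair maps a user-style concise prompt to a location-enriched comprehensive prompt, which is exactly the supervision the GPT-2-base prompt-completion model is fine-tuned on.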
[Fig. 4 components: a computing plane (CmP) with data processing (cleansing, transformation, storage, desensitization, governance and security), profiling (consumer preferences, venues and facilities, network-wide configuration) and an AI model library (e.g., SpeechT5, GPT-2, LayoutLM, BLIP, Stable Diffusion, ...); a user plane (UP) spanning the core network with prompting cache, validation, prompting enhancement, data collection & transmission and generative text delivery; and a control plane (CP) with C&C resource management (access control, authentication, security measures, audit logging, mobility management, resource allocation and scheduling). C&C coherent workflow for edge-cloud collaboration: 1. terminal authentication; 2. resource orchestration; 3. signaling configuration; 4. concise prompt or training data delivery; 5. concise prompt request & processing; 6. calling edge LLM; 7. comprehensive prompt generation; 8. calling cloud LLM; 9. complete response generation; 10. response transmission.]
Fig. 4. The illustration of an AI-Native Network Architecture and Logical AI Workflow for NetGPT.
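The per-request part of the Fig. 4 workflow (steps 4-10) can be sketched as a simple orchestration function. This is a sketch under stated assumptions: `edge_llm` and `cloud_llm` are stand-ins for the fine-tuned GPT-2-base and responsive LLaMA-7B models, and `is_satisfactory` abstracts the performance check that may terminate the link at the edge, as discussed in Section I.

```python
# Sketch of workflow steps 4-10, assuming terminal authentication,
# resource orchestration and signaling configuration (steps 1-3)
# have already been completed.

def edge_llm(concise_prompt: str) -> str:
    # Stand-in for the fine-tuned GPT-2-base prompt-completion model.
    return concise_prompt + " [location-based context appended at the edge]"

def cloud_llm(comprehensive_prompt: str) -> str:
    # Stand-in for the responsive LLaMA-7B model at the cloud.
    return "Response to: " + comprehensive_prompt

def is_satisfactory(text: str) -> bool:
    # Stand-in for the check that terminates the performance-driven
    # link at the edge when edge-generated content already suffices.
    return False

def handle_request(concise_prompt: str) -> str:
    # Steps 4-7: the terminal's concise prompt is processed at the edge,
    # which generates a comprehensive prompt.
    comprehensive = edge_llm(concise_prompt)
    if is_satisfactory(comprehensive):
        return comprehensive                 # terminate at the edge
    # Steps 8-10: the cloud LLM produces the complete response,
    # which is transmitted back to the terminal.
    return cloud_llm(comprehensive)

print(handle_request("Recommend a restaurant."))
```

The branch on `is_satisfactory` captures the architectural point: the communication link is driven by generative performance rather than by a fixed source-destination pair.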
[Fig. 5 components: a masked self-attention edge LLM fine-tuned on a dataset of Netflix audience behavior (.csv file), with labels formatted as a list of [appear:0/1 <END>].]
Fig. 5. Edge LLM for popularity prediction: From data-sample template, fine-tuning to prediction accuracy.
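A data-sample template along the lines of Fig. 5 might be sketched as follows. This is an illustrative guess at the format: the column names and the textualization are assumptions; only the CSV source and the "appear:0/1 <END>" label shape come from the figure.

```python
import csv
import io

# Hypothetical rows of a Netflix audience-behavior CSV; the real
# dataset's columns are not specified in the text.
CSV_TEXT = """user,title,watched
u1,Show A,1
u1,Show B,0
u2,Show A,1
"""

def to_samples(csv_text: str):
    """Turn CSV rows into text samples an edge LLM can be fine-tuned on,
    with labels shaped like the figure's 'appear:0/1 <END>' template."""
    samples = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        prompt = f"user {row['user']} views {row['title']} ->"
        label = f"appear:{row['watched']} <END>"
        samples.append((prompt, label))
    return samples

for prompt, label in to_samples(CSV_TEXT):
    print(prompt, label)
```

Casting popularity prediction as next-token generation in this way is what lets the same masked self-attention edge LLM serve both prompt completion and network-management inference.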