
[Figure 1 (two panels): "Stage 1: initialize early-exit layers" shows a pre-trained standard LLM (data input, Transformer-layer backbone, final-exit layer) augmented with added early-exit (EE) layers; "Stage 2: tune early-exit layers" shows the added modules being trained, with one loss per exit, via a forward pass through the backbone and a backward pass through the exits.]

Figure 1: An outline of EE-Tuning, the proposed two-stage procedure that converts a pre-trained standard
LLM into a well-trained early-exit LLM.

downstream tasks through early exiting while maintaining comparable or even better benchmark scores, or
higher speedup if slight degeneration of output quality is acceptable. We thoroughly investigate the effects of
various design choices and provide practical guidelines for maximizing the performance of EE-Tuning. The
source code of EE-Tuning is available at https://github.com/pan-x-c/EE-LLM.

Related works. The idea of EE-Tuning, i.e. augmenting a pre-trained neural network with early-exit layers
that are tuned in a parameter-efficient manner, is not completely new. This strategy has been adopted in
some prior works for model architectures tailored to classification tasks, e.g. the encoder-only BERT model
[52, 29, 19] or others [33, 22, 3, 9]. However, there is no guarantee that results and conclusions from
these works can safely transfer to the case of decoder-only Transformers tailored to autoregressive sequence
generation, which is the focus of our work. Another recent work [50] proposed to initialize the parameters of early-exit LLMs from pre-trained standard LLMs, but followed this with full-parameter training.
Closest to our setting and training methodology is the recent work [12], which investigated generative LLMs
of sizes up to 355M, and only considered linear exit heads that are randomly initialized. Moreover, that work
proposed to use multiple exits for improving the final output of full-model inference, rather than accelerating
inference via early exiting. In contrast, our implementation of the proposed EE-Tuning method (1) is unified and systematic, with support for a wide range of configurations; (2) is scalable, thanks to its full compatibility with 3D parallelism; and (3) has been shown via extensive experiments to produce early-exit LLMs that achieve outstanding acceleration during autoregressive inference.

2 Methodology
This section elaborates on our methodology and implementation for obtaining a well-trained early-exit LLM
via EE-Tuning, the two-stage procedure visualized in Figure 1.
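To make the two stages concrete, the following PyTorch sketch illustrates the procedure under simplifying assumptions; it is not the actual EE-LLM implementation. In particular, EarlyExitHead, ee_tune, and the backbone.hidden_states helper are hypothetical names introduced for illustration, the exit heads are simple layer-norm-plus-linear modules initialized randomly, and details such as alternative initialization schemes and 3D parallelism are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EarlyExitHead(nn.Module):
    """An added early-exit layer: maps an intermediate hidden state to vocabulary logits."""
    def __init__(self, d_model: int, vocab_size: int):
        super().__init__()
        self.ln = nn.LayerNorm(d_model)
        self.proj = nn.Linear(d_model, vocab_size, bias=False)

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        return self.proj(self.ln(hidden))

def ee_tune(backbone, exit_indices, dataloader, d_model, vocab_size, num_steps=1000):
    # Stage 1: initialize one early-exit head per chosen backbone depth.
    # (Random initialization here; copying existing pre-trained modules is another option.)
    heads = nn.ModuleList([EarlyExitHead(d_model, vocab_size) for _ in exit_indices])

    # Stage 2: tune only the added heads. The pre-trained backbone is frozen,
    # so the backward pass touches just the early-exit parameters.
    for p in backbone.parameters():
        p.requires_grad_(False)
    opt = torch.optim.AdamW(heads.parameters(), lr=1e-4)

    for _, tokens in zip(range(num_steps), dataloader):
        with torch.no_grad():
            # Assumed helper: returns the list of per-layer hidden states for a batch of tokens.
            hiddens = backbone.hidden_states(tokens)
        # One language modeling loss per early exit, summed for the update.
        loss = sum(
            F.cross_entropy(
                head(hiddens[i])[:, :-1].reshape(-1, vocab_size),
                tokens[:, 1:].reshape(-1),
            )
            for i, head in zip(exit_indices, heads)
        )
        opt.zero_grad()
        loss.backward()
        opt.step()
    return heads
```

Freezing the backbone and computing its hidden states under no_grad keeps the tuning parameter-efficient: only the added exit modules receive gradients, which is what allows the second stage to be far cheaper than full-parameter training.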

Preliminaries. Modern LLMs are mostly based on the Transformer architecture [51]. We focus on the
decoder-only generative pre-training (GPT) Transformer architecture [35, 36], although many techniques
presented in this work can be generalized to broader settings. As visualized in Figure 1, a GPT Transformer
is composed of an initial layer for input processing, a stack of Transformer layers as the backbone, and an
output layer that converts the final hidden states to logits on the vocabulary, which can be used for generating
new tokens. Each Transformer layer consists of an attention module and a multi-layer perceptron (MLP),
with layer normalization [1] and residual connections [15] applied in multiple places. The final output layer
includes an optional layer normalization module, followed by a large output embedding matrix.
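As a rough illustration of this layout (not the actual EE-LLM implementation), the sketch below wires together the three components just described: an input embedding layer, a stack of pre-layer-norm Transformer layers with residual connections, and a final layer norm followed by the output embedding matrix. Hyperparameter defaults are arbitrary, and positional information is omitted for brevity.

```python
import torch
import torch.nn as nn

class TransformerLayer(nn.Module):
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Causal mask: each position attends only to itself and earlier positions.
        T = x.size(1)
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), diagonal=1)
        h = self.ln1(x)
        x = x + self.attn(h, h, h, attn_mask=mask, need_weights=False)[0]  # attention + residual
        x = x + self.mlp(self.ln2(x))                                      # MLP + residual
        return x

class GPT(nn.Module):
    def __init__(self, vocab_size: int, d_model: int = 256, n_heads: int = 4, n_layers: int = 8):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)   # input processing (positions omitted)
        self.layers = nn.ModuleList(
            [TransformerLayer(d_model, n_heads) for _ in range(n_layers)]
        )                                                # backbone
        self.final_ln = nn.LayerNorm(d_model)            # optional final layer norm
        self.out_embed = nn.Linear(d_model, vocab_size, bias=False)  # output embedding matrix

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        x = self.embed(tokens)
        for layer in self.layers:
            x = layer(x)
        return self.out_embed(self.final_ln(x))          # logits over the vocabulary
```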
A GPT Transformer can be trained in an unsupervised manner by optimizing the language modeling loss on an unlabeled corpus. More specifically, given model parameters $\theta$ and a sequence of tokens $x = (x_1, x_2, \ldots, x_T)$, the autoregressive language modeling loss, i.e. the negative log-likelihood of next-token prediction, is defined as $\mathcal{L}(x; \theta) := -\log P(x; \theta) = -\sum_{t \in [T]} \log P(x_t \mid x_1, \ldots, x_{t-1}; \theta)$.
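The snippet below is a small rendering of this loss in PyTorch, using the illustrative GPT sketch from the previous snippet. It applies the standard shift-by-one cross-entropy formulation and, unlike the definition above, averages over predicted tokens rather than summing.

```python
import torch
import torch.nn.functional as F

def lm_loss(logits: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
    """Autoregressive LM loss: -sum_t log P(x_t | x_1, ..., x_{t-1}; theta),
    averaged here over predicted tokens rather than summed."""
    # logits[:, t] are the model's predictions for tokens[:, t + 1],
    # so drop the last logit and the first target token.
    shift_logits = logits[:, :-1, :]
    shift_targets = tokens[:, 1:]
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_targets.reshape(-1),
    )

# Example usage with the GPT sketch above (vocabulary size is arbitrary):
# model = GPT(vocab_size=32000)
# tokens = torch.randint(0, 32000, (2, 128))   # a batch of token sequences
# loss = lm_loss(model(tokens), tokens)
# loss.backward()
```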
