
Pretrained Transformers as Universal Computation Engines

Kevin Lu, Aditya Grover, Pieter Abbeel, Igor Mordatch

Can the learned self-attention mechanism generalize across modalities? (Photo: Quanta Magazine)
We take GPT-2 and freeze the self-attention and feedforward parameters.
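
As a rough sketch of what this freezing looks like in practice, the snippet below marks the self-attention and feedforward (MLP) weights inside each GPT-2 block as non-trainable. It assumes the Hugging Face transformers implementation of GPT-2; the paper's actual training code may differ, and which remaining parameters (input embeddings, layer norms, output layer) are fine-tuned is a separate choice.

```python
# Minimal sketch, assuming the Hugging Face `transformers` GPT-2 implementation.
from transformers import GPT2Model

model = GPT2Model.from_pretrained("gpt2")

# Freeze the self-attention and feedforward (MLP) sub-modules of every block;
# everything else (embeddings, layer norms, etc.) stays trainable here.
for block in model.h:
    for param in block.attn.parameters():
        param.requires_grad = False
    for param in block.mlp.parameters():
        param.requires_grad = False

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable parameters: {trainable:,} / {total:,}")
```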

Despite tokens coming from new modalities, we find the frozen model exhibits strong classification performance.

[Figure: modality-specific encoders → frozen "universal" processor → application-specific outputs]
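
A minimal sketch of that pipeline is below: a small trainable encoder projects tokens from the new modality into the transformer's embedding space, the frozen GPT-2 core processes the sequence, and a trainable head produces class logits. It again assumes the Hugging Face GPT2Model; the names FrozenTransformerClassifier, input_dim, and num_classes are illustrative, not from the paper.

```python
# Hypothetical frozen-core classifier, not the authors' exact setup.
import torch
import torch.nn as nn
from transformers import GPT2Model


class FrozenTransformerClassifier(nn.Module):
    def __init__(self, input_dim: int, num_classes: int):
        super().__init__()
        self.core = GPT2Model.from_pretrained("gpt2")
        # Freeze the pretrained self-attention and feedforward parameters.
        for block in self.core.h:
            for p in block.attn.parameters():
                p.requires_grad = False
            for p in block.mlp.parameters():
                p.requires_grad = False
        hidden = self.core.config.n_embd
        # Trainable modality-specific encoder and application-specific head.
        self.encoder = nn.Linear(input_dim, hidden)
        self.head = nn.Linear(hidden, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, input_dim) tokens from the new modality.
        h = self.core(inputs_embeds=self.encoder(x)).last_hidden_state
        # Classify from the final token's representation.
        return self.head(h[:, -1])


# Example: sequences of 16 tokens with 8 input features, 10 classes.
logits = FrozenTransformerClassifier(input_dim=8, num_classes=10)(torch.randn(2, 16, 8))
```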
