Episode 575

You Do Not Fully Utilize Transformer's Representation Capacity

February 19, 2025 · 20:56

🤗 Upvotes: 25 | cs.LG, cs.CL

Authors:
Gleb Gerasimov, Yaroslav Aksenov, Nikita Balagansky, Viacheslav Sinii, Daniil Gavrilov

Title:
You Do Not Fully Utilize Transformer's Representation Capacity

Arxiv:
http://arxiv.org/abs/2502.09245v1

Abstract:
In contrast to RNNs, which compress previous tokens into a single hidden state, Transformers can attend to all previous tokens directly. However, standard Transformers only use representations from the immediately preceding layer. In this paper, we show that this design choice causes representation collapse and leads to suboptimal performance. To address this issue, we introduce Layer-Integrated Memory (LIMe), a simple yet powerful approach that preserves the model's overall memory footprint while expanding its representational capacity by allowing access to hidden states from earlier layers. Through extensive experiments across various architectures and different lookup mechanisms, we demonstrate consistent performance improvements on a wide range of tasks. Moreover, our analysis of the learned representation dynamics and our exploration of depthwise circuits reveal how LIMe integrates information across layers, pointing to promising directions for future research.
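The abstract's core idea, letting a layer read hidden states from earlier layers rather than only the one directly beneath it, can be illustrated with a small sketch. This is not the paper's implementation: the abstract does not specify the lookup mechanism, so the softmax-weighted mixing below, along with the names `softmax`, `mix_layers`, `layer_states`, and `logits`, are assumptions chosen purely for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mix_layers(layer_states, logits):
    """Blend stored hidden states from several layers into one input.

    layer_states: list of per-layer matrices, each shaped [tokens][features].
    logits: one mixing score per stored layer (assumed to be learned).
    Returns a [tokens][features] matrix: the weighted sum over layers.
    """
    if len(layer_states) != len(logits):
        raise ValueError("need exactly one logit per stored layer")
    weights = softmax(logits)
    n_tokens = len(layer_states[0])
    n_features = len(layer_states[0][0])
    mixed = [[0.0] * n_features for _ in range(n_tokens)]
    for weight, state in zip(weights, layer_states):
        for t in range(n_tokens):
            for f in range(n_features):
                mixed[t][f] += weight * state[t][f]
    return mixed


# Two stored layers, one token, two features.
layer_states = [[[0.0, 2.0]], [[4.0, 6.0]]]

# Equal scores: the next layer sees the average of both layers.
averaged = mix_layers(layer_states, [0.0, 0.0])

# A very low score for the earlier layer: only the latest layer is used,
# which matches the standard Transformer behavior the abstract describes.
latest_only = mix_layers(layer_states, [-1e9, 0.0])
```

In this sketch, the standard design is the special case where all weight falls on the most recent layer. Because the mixing reuses hidden states that are already computed, it adds only one score per stored layer, which is in the spirit of the abstract's claim about preserving the overall memory footprint.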
