Episode 158

Star Attention: Efficient LLM Inference over Long Sequences

November 27, 2024 · 20:34

🤗 Paper Upvotes: 32 | cs.CL, cs.AI, cs.LG

Authors:
Shantanu Acharya, Fei Jia, Boris Ginsburg

Title:
Star Attention: Efficient LLM Inference over Long Sequences

Arxiv:
http://arxiv.org/abs/2411.17116v1

Abstract:
Inference with Transformer-based Large Language Models (LLMs) on long sequences is both costly and slow due to the quadratic complexity of the self-attention mechanism. We introduce Star Attention, a two-phase block-sparse approximation that improves computational efficiency by sharding attention across multiple hosts while minimizing communication overhead. In the first phase, the context is processed using blockwise-local attention across hosts, in parallel. In the second phase, query and response tokens attend to all prior cached tokens through sequence-global attention. Star Attention integrates seamlessly with most Transformer-based LLMs trained with global attention, reducing memory requirements and inference time by up to 11x while preserving 95-100% of accuracy.
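The two phases described in the abstract can be sketched for a single attention head with NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`attend`, `phase1_blockwise_local`, `phase2_global`) and `block_size` are hypothetical, each "host" is simulated by a loop iteration, the query's attention over its own tokens is left out, and combining per-host results via log-sum-exp weights is an assumed way to realize the sequence-global attention with little communication. Anything the abstract does not mention is omitted.

```python
import numpy as np

def attend(q, k, v, mask=None):
    """Scaled dot-product attention; also returns the log-sum-exp of each row of scores."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    if mask is not None:
        scores = np.where(mask, scores, -np.inf)
    m = scores.max(axis=-1, keepdims=True)
    w = np.exp(scores - m)
    s = w.sum(axis=-1, keepdims=True)
    out = (w / s) @ v
    lse = (m + np.log(s)).squeeze(-1)
    return out, lse

def phase1_blockwise_local(q_ctx, k_ctx, v_ctx, block_size):
    """Phase 1 (sketch): each simulated host attends causally within its own context block."""
    outputs, kv_cache = [], []
    for start in range(0, len(q_ctx), block_size):
        sl = slice(start, start + block_size)
        q, k, v = q_ctx[sl], k_ctx[sl], v_ctx[sl]
        causal = np.tril(np.ones((len(q), len(k)), dtype=bool))
        out, _ = attend(q, k, v, causal)
        outputs.append(out)
        kv_cache.append((k, v))  # keys/values stay on the host that owns the block
    return np.concatenate(outputs), kv_cache

def phase2_global(q_query, kv_cache):
    """Phase 2 (sketch): query tokens attend to every cached block.

    Each host returns its local output and log-sum-exp; reweighting by the
    log-sum-exp values reproduces a softmax over all cached tokens (assumed scheme).
    """
    outs, lses = zip(*(attend(q_query, k, v) for k, v in kv_cache))
    outs = np.stack(outs)   # (blocks, n_query, d)
    lses = np.stack(lses)   # (blocks, n_query)
    weights = np.exp(lses - lses.max(axis=0))
    weights /= weights.sum(axis=0)
    return (weights[..., None] * outs).sum(axis=0)

# Toy usage: 12 context tokens split into 3 blocks of 4, then 3 query tokens.
rng = np.random.default_rng(0)
d = 8
q_ctx, k_ctx, v_ctx = (rng.standard_normal((12, d)) for _ in range(3))
q_query = rng.standard_normal((3, d))

ctx_out, cache = phase1_blockwise_local(q_ctx, k_ctx, v_ctx, block_size=4)
answer = phase2_global(q_query, cache)
```

In this toy setting, only the per-host outputs and one log-sum-exp scalar per query token need to be exchanged in phase 2, which is one way to read the abstract's claim of minimal communication overhead. In phase 1 each context token sees only its own block, so the score matrix is block-diagonal rather than full.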
