Are you hitting the context window limits of your LLMs when deploying them in streaming applications like chatbots or long-form summarization?
Dealing with the memory overhead of caching all previous tokens?
Let’s take a look at StreamingLLM, a framework that lets LLMs generate over effectively unlimited sequence lengths without sacrificing efficiency.
Sounds too good to be true? Keep reading :D
The problem
Limited Context Window: Pre-trained LLMs have a fixed attention window size.
Memory Bottleneck: Caching Key and Value states (KV) for all previous tokens during decoding consumes excessive memory and increases latency.
Window Attention Failure: Window attention (caching only the most recent tokens) breaks down as soon as the initial tokens are evicted from the cache.
The solution: “Attention Sinks”
LLMs allocate a significant amount of attention score to initial tokens, even if those tokens are semantically unimportant. These tokens are called "attention sinks."
This happens because of the Softmax function: attention scores must sum to 1, so the model has to put that probability mass somewhere, even when the query doesn't strongly relate to any previous token.
Initial tokens are prime candidates because they are visible to all subsequent tokens during training.
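To make that concrete, here is the standard attention softmax (notation mine, not from the post): for a query at position $i$, the weights over all visible tokens must sum to 1, so some mass always lands somewhere, and the always-visible initial tokens end up collecting it.

$$
\alpha_{ij} = \frac{\exp\!\left(q_i \cdot k_j / \sqrt{d}\right)}{\sum_{l=1}^{i} \exp\!\left(q_i \cdot k_l / \sqrt{d}\right)},
\qquad \sum_{j=1}^{i} \alpha_{ij} = 1 .
$$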
StreamingLLM offers a refreshingly simple solution (a rough code sketch follows this list):
Maintain Attention Sinks: Keep the KV states of a few initial tokens in addition to the rolling window of recent tokens.
Rolling KV Cache: Continue using a sliding window to cache the most recent tokens' KV states.
Positional Encoding within the Cache: When computing attention, assign position embeddings based on each token's position inside the cache, not its position in the original text.
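Here is a minimal PyTorch sketch of that cache policy, assuming KV states shaped [batch, heads, seq_len, head_dim]; SinkKVCache, num_sinks, and window_size are names I made up for illustration, not the authors' implementation (the paper keeps roughly 4 sink tokens).

```python
import torch

class SinkKVCache:
    """Rolling KV cache that always keeps the first few tokens (attention sinks)."""

    def __init__(self, num_sinks: int = 4, window_size: int = 1020):
        self.num_sinks = num_sinks
        self.window_size = window_size
        self.keys = None     # [batch, heads, cached_len, head_dim]
        self.values = None

    def append(self, k: torch.Tensor, v: torch.Tensor) -> None:
        # Add the newly generated token's KV states, then enforce the budget.
        if self.keys is None:
            self.keys, self.values = k, v
        else:
            self.keys = torch.cat([self.keys, k], dim=2)
            self.values = torch.cat([self.values, v], dim=2)
        self._evict()

    def _evict(self) -> None:
        # Keep the first `num_sinks` tokens plus the most recent `window_size`
        # tokens; everything in between is dropped.
        cached = self.keys.size(2)
        if cached <= self.num_sinks + self.window_size:
            return
        self.keys = torch.cat(
            [self.keys[:, :, : self.num_sinks], self.keys[:, :, -self.window_size :]], dim=2
        )
        self.values = torch.cat(
            [self.values[:, :, : self.num_sinks], self.values[:, :, -self.window_size :]], dim=2
        )

    def positions(self) -> torch.Tensor:
        # Positions are assigned *within the cache*, not in the original text,
        # so indices never exceed num_sinks + window_size.
        return torch.arange(self.keys.size(2))
```

The eviction rule is the whole trick: the first num_sinks entries never leave, the middle of the stream gets dropped, and position ids are handed out by cache slot rather than by each token's original index.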
Benefits of StreamingLLM:
Infinite Context Length: Allows LLMs to process sequences far exceeding their pre-training window size.
Efficiency: Significantly faster than sliding window re-computation (up to 22.2x speedup in the paper's experiments).
Stable Performance: Maintains perplexity levels comparable to dense attention within its pre-training window.
No Fine-Tuning Needed (in most cases): Works "out-of-the-box" with popular LLMs!
Experiments and Results:
Stable Language Modeling: Reliably models streams of 4 million tokens and beyond with stable perplexity.
Complementary to Context Extension: Can be combined with context-extension techniques (which permit a larger rolling KV cache) to further widen the usable window.
Practical Implications for ML Engineers:
Real-World Chatbots: Deploy chatbots that keep generating fluently across day-long, multi-round conversations without restarts (the model still only attends to the recent window plus the sinks, so it won't literally remember everything said hours earlier).
Long-Document Summarization: Stream extremely long documents through the model in a single pass instead of chunking them by hand.
Efficient Inference: Keep memory use and per-token latency roughly constant no matter how long the stream gets, as the toy loop below illustrates.
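To make the constant-memory point concrete, here is a tiny self-contained toy (mine, not the paper's code) that applies the same keep-the-sinks, roll-the-window policy to a one-million-token stream; the cache never grows past NUM_SINKS + WINDOW entries.

```python
from collections import deque

NUM_SINKS = 4        # initial tokens kept forever (the attention sinks)
WINDOW = 1020        # most recent tokens kept in the rolling window

sinks = []                       # stand-ins for the first few tokens' KV states
recent = deque(maxlen=WINDOW)    # rolling window; deque evicts the oldest entry for us

for t in range(1_000_000):       # simulate a one-million-token stream
    entry = f"kv_{t}"            # stand-in for a real (key, value) pair
    if len(sinks) < NUM_SINKS:
        sinks.append(entry)
    else:
        recent.append(entry)

print(len(sinks) + len(recent))  # always 1024, however long the stream runs
```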
Key Takeaways:
If you want LLMs that handle long streams without paying for expensive APIs, and you'd rather not fine-tune your models just to stretch the context, StreamingLLM is imho quite easy to apply :). Let me know what you think!