Continual Learning via Sparse Memory Finetuning

Mar 04, 2026


TLDR

Replaces standard Transformer FFN layers with “Memory Layers” (key-value pools) and updates only a tiny fraction of parameters (slots) during fine-tuning.
Uses TF-IDF ranking to identify memory slots specific to new data, masking out slots responsible for general pre-training knowledge.
On QA tasks, this method learns new facts about as well as Full Finetuning and LoRA but forgets far less (e.g., an 11% drop in held-out performance vs. 89% for Full FT).
Catastrophic forgetting is framed as a parameter interference problem: isolating "fact-specific" parameters from "general-capability" parameters, and updating only the former, removes most of the interference.
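To make the slot-selection idea in the TLDR concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the paper's implementation: the function names (`tfidf_slot_scores`, `top_t_mask`, `masked_sgd_step`), the smoothed IDF formula, the use of plain SGD, and the toy access counts are all hypothetical. The idea it shows is that a slot accessed often on the new batch but rarely on generic background data scores high and gets updated, while every other slot stays frozen.

```python
import numpy as np


def tfidf_slot_scores(batch_counts, background_counts):
    """Score each memory slot by how specific it is to the new batch.

    batch_counts: shape (num_slots,), accesses per slot on the new batch.
    background_counts: shape (num_background_batches, num_slots), accesses
        per slot on batches of generic data (assumed stand-in for
        pre-training data).
    """
    batch_counts = np.asarray(batch_counts, dtype=np.float64)
    background_counts = np.asarray(background_counts, dtype=np.float64)
    # TF: share of this batch's memory accesses that hit each slot.
    tf = batch_counts / max(batch_counts.sum(), 1.0)
    # DF: number of background batches that touched each slot at least once.
    num_background = background_counts.shape[0]
    df = (background_counts > 0).sum(axis=0)
    # Smoothed IDF (an assumed variant): a slot used by every background
    # batch gets log(1) = 0, so it is never ranked as batch-specific.
    idf = np.log((num_background + 1.0) / (df + 1.0))
    return tf * idf


def top_t_mask(scores, t):
    """Boolean mask that is True for the t highest-scoring slots."""
    mask = np.zeros(scores.shape, dtype=bool)
    mask[np.argsort(-scores, kind="stable")[:t]] = True
    return mask


def masked_sgd_step(values, grads, mask, lr):
    """Plain SGD on the memory value table, restricted to selected rows."""
    return values - lr * grads * mask[:, None]


# Toy example: 5 slots, 4 background batches (all numbers are made up).
batch_counts = [10, 0, 4, 6, 0]
background_counts = [
    [9, 1, 0, 0, 2],
    [8, 0, 0, 1, 3],
    [7, 2, 0, 0, 1],
    [9, 0, 0, 0, 0],
]
scores = tfidf_slot_scores(batch_counts, background_counts)
mask = top_t_mask(scores, t=2)
print(mask)
```

Slot 0 is the most heavily accessed slot on the new batch, but it is also used by every background batch, so its score is zero and it stays frozen. Slots 2 and 3 are the ones selected for updating.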

Introduction

The primary blocker to continual learning in production LLMs is catastrophic forgetting. When a model is updated on a stream of new data (e.g., breaking news, user-specific corrections), the gradient updates modify parameters shared across all tasks. Optimizing for the new distribution pushes weights away from the optima of previous distributions.
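The interference argument can be seen in a deliberately tiny toy problem. The sketch below is an assumption-laden illustration, not anything from the paper: the two-parameter model, the quadratic losses, and the names `old_loss`, `new_loss`, and `finetune` are all made up. Parameter `w[0]` is shared by the old and new tasks, while `w[1]` is used only by the new task.

```python
import numpy as np


def old_loss(w, a=1.0):
    # The old task depends only on the shared parameter w[0].
    return 0.5 * (w[0] - a) ** 2


def new_loss(w, t=3.0):
    # The new task can be fit through either parameter: w[0] + w[1] = t.
    return 0.5 * (w[0] + w[1] - t) ** 2


def finetune(w, mask, t=3.0, lr=0.1, steps=200):
    """Gradient descent on the new task; mask selects trainable parameters."""
    w = np.array(w, dtype=np.float64)
    mask = np.asarray(mask, dtype=np.float64)
    for _ in range(steps):
        residual = w[0] + w[1] - t
        grad = residual * np.ones(2)  # d(new_loss)/dw is the same for both
        w -= lr * grad * mask
    return w


w_start = [1.0, 0.0]  # already optimal for the old task
w_full = finetune(w_start, mask=[1, 1])    # update everything
w_sparse = finetune(w_start, mask=[0, 1])  # update only the new-task parameter

print("full:  ", old_loss(w_full), new_loss(w_full))
print("sparse:", old_loss(w_sparse), new_loss(w_sparse))
```

Both runs fit the new task, but the full update drags the shared parameter away from the old task's optimum, so the old-task loss rises. The masked update leaves `w[0]` untouched and the old-task loss stays at zero.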

Continue reading this post for free, courtesy of Ludovico Bessi.


Machine Learning At Scale
