Machine Learning At Scale

Continual Learning via Sparse Memory Finetuning

Ludovico Bessi
Mar 04, 2026

TLDR

  • Replaces standard Transformer FFN layers with “Memory Layers” (key-value pools) and updates only a tiny fraction of parameters (slots) during fine-tuning.

  • Uses TF-IDF ranking to identify memory slots specific to new data, masking out slots responsible for general pre-training knowledge.

  • On QA tasks, this method learns new facts about as well as full finetuning and LoRA while forgetting far less (e.g., an 11% drop in held-out performance vs. 89% for full finetuning).

  • Catastrophic forgetting is framed as a parameter interference problem; isolating “fact-specific” parameters from “general-capability” parameters largely mitigates it.
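
To make the slot-selection idea from the TLDR concrete, here is a minimal sketch in plain Python. It assumes one common TF-IDF form (access frequency on the new batch times a smoothed log inverse of how many background batches touch the slot); the exact weighting used in the paper may differ. The function names, the toy access lists, and the `top_t` parameter are all hypothetical and exist only for illustration.

```python
import math
from collections import Counter


def tfidf_slot_scores(batch_accesses, background_batches):
    """Score each memory slot by how specific it is to the new batch.

    batch_accesses: slot indices accessed while processing the new batch.
    background_batches: one list of accessed slot indices per background
        (pre-training-like) batch. Both are illustrative inputs.
    """
    counts = Counter(batch_accesses)
    total = sum(counts.values())
    n_background = len(background_batches)

    # Document frequency: in how many background batches each slot appears.
    doc_freq = Counter()
    for accesses in background_batches:
        doc_freq.update(set(accesses))

    scores = {}
    for slot, count in counts.items():
        tf = count / total
        # Smoothed IDF (an assumption): slots used by every background
        # batch get a score of zero.
        idf = math.log((n_background + 1) / (doc_freq[slot] + 1))
        scores[slot] = tf * idf
    return scores


def select_trainable_slots(scores, top_t):
    """Keep the top_t highest-scoring slots; ties break by slot index."""
    ranked = sorted(scores, key=lambda s: (-scores[s], s))
    return set(ranked[:top_t])


def masked_sgd_step(values, grads, trainable, lr=0.1):
    """Apply an SGD step only to the rows listed in `trainable`."""
    return [
        [v - lr * g for v, g in zip(row, grad_row)] if i in trainable else list(row)
        for i, (row, grad_row) in enumerate(zip(values, grads))
    ]


# Toy example: slot 7 is hit often by the new data and never by the
# background batches; slot 2 is hit by every background batch.
new_batch = [7, 7, 7, 2, 2, 5]
background = [[2, 5, 1], [2, 3], [2, 5, 4]]

scores = tfidf_slot_scores(new_batch, background)
trainable = select_trainable_slots(scores, top_t=1)

values = [[1.0, 1.0] for _ in range(8)]
grads = [[1.0, 1.0] for _ in range(8)]
updated = masked_sgd_step(values, grads, trainable, lr=0.5)
```

In this toy run only slot 7 is selected, so its row is the only one that changes. Slot 2 is accessed by the new batch too, but because every background batch also uses it, its score is zero and it stays frozen.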

Introduction

The primary blocker to continual learning in production LLMs is catastrophic forgetting. When a model is updated on a stream of new data (e.g., breaking news, user-specific corrections), the gradient updates modify parameters shared across all tasks. Optimizing for the new distribution pushes weights away from the optima of previous distributions.
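
The interference described above can be reproduced with a deliberately tiny example. The sketch below is an illustration only, not anything from the post: a single shared weight `w` is fit to a hypothetical task A, then finetuned on a hypothetical task B with a different optimum, and the loss on task A is measured before and after.

```python
def mse(w, data):
    """Mean squared error of the one-parameter model y = w * x."""
    return sum((w * x - y) ** 2 for x, y in data) / len(data)


def sgd(w, data, lr=0.1, steps=200):
    """Full-batch gradient descent on the MSE above."""
    for _ in range(steps):
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w


# Two toy tasks that share the single parameter w (made-up data).
task_a = [(1.0, 2.0), (2.0, 4.0)]    # optimum at w = 2
task_b = [(1.0, -1.0), (2.0, -2.0)]  # optimum at w = -1

w_after_a = sgd(0.0, task_a)
loss_a_before = mse(w_after_a, task_a)

# Finetuning on task B moves the shared weight away from task A's optimum.
w_after_b = sgd(w_after_a, task_b)
loss_a_after = mse(w_after_b, task_a)
```

After training on task A the loss on A is essentially zero; after finetuning on task B the same weight sits near -1 and the loss on A rises to about 22.5. With only one shared parameter there is nowhere to store B without overwriting A, which is the situation that updating a small, task-specific subset of parameters is meant to avoid.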
