MoRAM logoLittle by Little: Continual Learning via Incremental Mixture of Rank-1 Associative Memory Experts

1University of New South Wales, 2CSIRO
* Corresponding author.
MoRAM abstract: a token enters a model, which interacts with a growing memory pool — retrieves top-k atoms via self-activation, and appends new rank-1 atoms little by little.

Abstract

Continual learning (CL) with large pre-trained models aims to incrementally acquire knowledge without catastrophic forgetting. Existing LoRA-based Mixture-of-Experts (MoE) methods expand capacity by adding isolated new experts while freezing old ones, but still suffer from redundancy, interference, routing ambiguity, and consequent forgetting.

We investigate the issues stemming from coarse-grained expert granularity. Coarse-grained experts (e.g., high-rank LoRA) encode low-specialty information, leading to expert duplication/interference and routing degradation/confusion as experts accumulate.

In this work, we propose MoRAM (Mixture of Rank-1 Associative Memory). Grounded in the view that weight matrices act as linear associative memories, MoRAM achieves CL as gradual incrementing of reusable atomic rank-1 experts as memory. Each rank-1 adapter acts as a fine-grained MoE expert or an associative memory unit. By viewing rank-1 adapters as key–value memory pairs, we eliminate explicit MoE-LoRA routers with self-activation, where each memory atom evaluates its relevance via its intrinsic key. The inference process thus becomes a robust, content-addressable retrieval over the incrementally accumulated memory.

Extensive experiments on CLIP and LLMs show that MoRAM significantly outperforms state-of-the-art methods, achieving a better plasticity–stability trade-off, stronger generalization, and reduced forgetting.

"Little by little, we gave you everything you ever dreamed of …" — Little by little, Oasis.

Architecture overview: per-module memory banks grow across tasks via sparse self-activated routing.

Current Pitfalls

Top — MoE-LoRA. Adding a new rank-\(r\) expert per task is convenient, but the expert is treated as an indivisible block. Three coupled pitfalls follow: (①) intra-expert inter-rank interference — within an expert, only a few of the \(r\) ranks match the input while the rest co-fire as noise; (②) inter-expert redundancy — to mask ①, the router learns to co-activate a duplicate expert on the same input, wasting capacity; (③) routing collapse — at inference, overlapping experts and an external router make routing ambiguous, driving catastrophic forgetting on prior tasks.

Bottom — MoRAM. Each task appends \(r\) rank-1 atoms to the memory bank little by little; the LoRA update is reframed as a linear associative memory \(\Delta\mathbf{W}=\sum_{i}w_{i}\mathbf{B}_{i}\mathbf{A}_{i}^{\top}\) where each atom is a key–value entry. Self-activation \(s_{i}=\mathbf{A}_{i}\!\cdot\!\mathbf{x}\) scores every atom in parallel; a sparse top-\(k\) mask keeps only the most relevant; threshold pruning \(s_{i}\!\geq\!\delta\) drops the weakest survivors. The surviving sparse mixture mitigates each pitfall one-for-one: (①) per-atom specialization — rank-1 atoms either fire or stay silent, eliminating within-expert noise; (②) knowledge reuse — atoms learned on earlier tasks re-fire on related new inputs, so no duplicate experts are needed; (③) forgetting mitigation — frozen \(\mathbf{A}_{i},\mathbf{B}_{i}\) keep prior atoms reachable by their own keys, so capacity grows without rewriting old memory.

Animated comparison: MoE-LoRA’s three pitfalls (intra-expert interference, inter-expert redundancy, routing collapse) vs. MoRAM’s three benefits (per-atom specialization, knowledge reuse, forgetting mitigation).

Figure 1. Animated comparison of MoE-LoRA pitfalls vs. MoRAM solutions. See the paper for full analysis.

Method

Overview

Each new task freezes all earlier rank-1 atoms and adds \(r\) new ones. A sparse self-activated mixture over the full bank sets input-dependent weights so only a few ranks activate at once. Figure 2 sketches growth from task \(t\) to \(t+1\): (a,c) conceptually, (b,d) the mixture computation. §3.2–§3.3 fill in the math.

Figure 2: MoRAM overview for tasks t and t+1

Figure 2. Freeze past ranks, add \(r\) new ranks, sparse mixture over all atoms. (a,c) overview; (b,d) tasks \(t\) and \(t+1\).

Weights as Linear Associative Memory (§3.2)

To mitigate the coarse granularity and routing ambiguity of dense rank-\(r\) updates, the paper reframes LoRA not as monolithic blocks but as a linear associative memory over the weight matrix.

Definition 1 (informal). A matrix \(\mathbf{W}\in\mathbb{R}^{d_{\text{out}}\times d_{\text{in}}}\) of rank \(m\) is viewed as \(m\) atomic key–value pairs \(\{(\mathbf{k}_{i},\mathbf{v}_{i})\}_{i=1}^{m}\) with \(\mathbf{k}_{i}\in\mathbb{R}^{d_{\text{in}}}\), \(\mathbf{v}_{i}\in\mathbb{R}^{d_{\text{out}}}\), and \(\mathbf{W}\approx\sum_{i=1}^{m}\mathbf{v}_{i}\mathbf{k}_{i}^{\top}\). For a hidden state \(\mathbf{x}\), the product \(\mathbf{W}\mathbf{x}\) acts like a content-addressable read: each inner product \(\mathbf{k}_{i}^{\top}\mathbf{x}\) scores relevance to slot \(i\) and gates the retrieved value \(\mathbf{v}_{i}\).

\[ \mathbf{y}=\mathbf{W}\mathbf{x}\approx\sum_{i=1}^{m}\mathbf{v}_{i}\,(\mathbf{k}_{i}^{\top}\mathbf{x}). \tag{2} \]

Remark. Unlike self-attention—where keys and values are dynamic functions of the input—here \(\mathbf{k}_{i}\) and \(\mathbf{v}_{i}\) are static parameters of the matrix, encoding patterns acquired during pre-training.

Mixture of Rank-1 Memory Experts (§3.3.1)

Fine-grained rank-1 memory augmentation. The low-rank update \(\Delta\mathbf{W}\) is not one rank-\(r\) slab but a sum of \(r\) rank-1 key–value pairs: row \(\mathbf{A}_{i,:}\) acts as the Key (relevance to the input) and column \(\mathbf{B}_{:,i}\) as the Value (stored correction). Adaptation becomes memory expansion—a flexible set of atoms rather than a single rigid adapter.

\[ \Delta\mathbf{W}\,\mathbf{x}=\sum_{i=1}^{r}\mathbf{B}_{:,i}\,(\mathbf{A}_{i,:}\mathbf{x}). \tag{3} \]

Standard LoRA (and MoE-LoRA at the adapter level) still densely mixes rank-1 components, which encourages interference on mismatched inputs and routing collapse by ignoring the natural role of \(\mathbf{A}\) as content-based retrieval keys in favor of indiscriminate activation or extra routers.

MoRAM formulation. Adaptation parameters are a growing set of atoms \(\mathcal{M}_{t}=\{(\mathbf{B}_{:,i},\mathbf{A}_{i,:})\}_{i=1}^{r_{t}}\). For task \(t\), the effective update is a sparse, input-dependent mixture with weights \(\mathrm{w}_{i}\in\mathbb{R}\) (retrieval confidence for atom \(i\)):

\[ \Delta\mathbf{W}^{t}=\sum_{i=1}^{r_{t}} \mathrm{w}_{i}\,\mathbf{B}_{:,i}\mathbf{A}_{i,:}. \tag{4} \]

This lets the model freeze old atoms for stability while adding new ones for plasticity, or combine several memories when concepts overlap—without an auxiliary router module.

Self-Activation for MoRAM Routing (§3.3.2)

Mixing weights come from each rank-1 adapter’s own key \(\mathbf{A}_{i,:}\), not from a separate MoE router over \((\mathbf{A}_{i,:},\mathbf{B}_{:,i})\) pairs—routing is content-addressable retrieval over static memory keys, which avoids extra router parameters and the forgetting they can introduce.

Self-activated relevance scoring. For hidden state \(\mathbf{x}\in\mathbb{R}^{d_{\text{in}}}\) and \(r_{t}\) accumulated atoms after task \(t\), the raw score \(s_{i}\) for atom \(i\) is the key response \(\mathbf{A}_{i,:}\mathbf{x}\), \(\ell_{2}\)-normalized across all atoms:

\[ s_{i}=\frac{\mathbf{A}_{i,:}\mathbf{x}}{\sqrt{\sum_{j=1}^{r_{t}}(\mathbf{A}_{j,:}\mathbf{x})^{2}}}\,. \tag{5} \]

The numerator is alignment with key \(i\); the denominator stabilizes scale across the memory bank. The paper finds this intrinsic scoring competitive with external routers (their Table 4).

Sparse Expert Routing and Mixture (§3.3.3)

While Eq. (5) measures relevance, naive dense activation leads to low specialization and induces interference and computational overhead; MoRAM therefore uses sparse routing to sharpen the mixture.

Sparse rank selection. To limit interference and cost, we enforce sparsity via top-\(k\) masking: only the \(k\) largest scores stay active; others are masked to \(-\infty\) before softmax so at most \(k\) of the \(r_{t}\) atoms receive gradient.

\[ [\mathrm{TopK}(\mathbf{s},k)]_{i}=\begin{cases} s_{i}, & \text{if } s_{i}\in\text{top-}k(\mathbf{s}),\\ -\infty, & \text{otherwise.} \end{cases} \tag{6} \]

Sharpness enhancement. To further encourage rank specialization and concentrate the update on the most relevant ranks, mixture weights use temperature-scaled softmax on the masked scores with \(\tau_{\text{MoRAM}}\); lower \(\tau_{\text{MoRAM}}\) sharpens which experts activate in the forward pass and routes gradients more selectively in the backward pass.

\[ \mathrm{w}_{i}=\mathrm{softmax}\!\left(\frac{\mathrm{TopK}(\mathbf{s},k)}{\tau_{\text{MoRAM}}}\right)_{\! i}. \tag{7} \]

Threshold-based expert selection. At inference, a cutoff \(\delta\) on normalized scores filters weak experts among the top-\(k\) set, reducing compute and noise:

\[ \mathrm{w}_{i}:=\mathbf{1}\{s_{i}\geq\delta\}\odot \mathrm{w}_{i}\,. \tag{8} \]

This yields a highly sparse, input-dependent set comprising only the most significant memory experts.

Experiments and results

Benchmarks

X-TAIL is a cross-domain, task-incremental protocol for CLIP-style models: a sequence of image-classification domains. We report Transfer (future domains before training), Average (mean over stages and domains), and Last (all seen domains after the full run)

TRACE benchmarks continual learning in LLMs with eight mixed tasks (e.g. reasoning, summarization, code). We report Overall Performance (OP) after the final task and Backward Transfer (BWT) for forgetting; higher OP and lower BWT are better.

X-TAIL (CLIP)

Comparisons on X-TAIL for each domain in terms of Transfer, Average, and Last accuracy (%). Section labels and the rightmost Average column follow the paper. MoRAM rows are highlighted.

Method Aircraft Caltech DTD EuroSAT Flowers Food MNIST OxPet Cars SUN397 Average
CLIP
Zero-shot 23.5 76.8 37.3 36.7 63.6 84.0 46.7 86.7 66.1 63.7 58.5
Fine-tune 39.6 84.7 70.0 94.7 97.0 85.8 97.6 93.4 81.0 74.7 81.9
Transfer
CLIP zero-shot 76.8 37.3 36.7 63.6 84.0 46.7 86.7 66.1 63.7 62.4
LwF 66.6 26.9 19.5 51.0 78.4 26.6 68.9 35.5 56.1 47.7
WiSE-FT 70.1 31.9 25.3 56.3 79.8 29.9 74.9 45.6 56.8 52.3
iCaRL 71.7 35.0 43.0 63.4 86.9 43.9 87.8 63.7 60.0 61.7
ZSCL 73.3 32.6 36.8 62.1 83.8 42.1 83.6 56.5 60.2 59.0
MoE-Adapter 71.0 34.9 19.2 63.0 86.6 20.0 87.2 63.7 58.6 56.0
RAIL-Primal 76.8 37.3 36.7 63.6 84.0 46.7 86.7 66.1 63.7 62.4
CoDyRA 74.3 36.8 44.2 69.9 83.5 42.8 88.9 64.6 63.4 63.2
MoRAM 74.5 38.1 46.9 65.3 82.9 45.8 88.2 65.1 62.9 63.3
Average
LwF 24.7 79.7 38.3 36.9 63.9 81.0 36.5 71.9 42.7 56.7 53.2
WiSE-FT 27.1 76.5 40.9 31.3 68.7 81.6 31.4 74.7 51.7 58.4 54.2
iCaRL 25.4 72.1 37.5 51.6 65.1 87.1 59.1 88.0 63.7 60.1 61.0
ZSCL 36.0 75.0 40.7 40.5 71.0 85.3 46.3 83.3 60.7 61.5 60.0
MoE-Adapter 43.6 77.9 52.1 34.7 75.9 86.3 45.2 87.4 66.6 60.2 63.0
RAIL-Primal 42.4 89.8 55.7 68.5 84.0 83.3 65.3 85.8 67.9 64.5 70.7
CoDyRA 41.4 81.0 58.7 77.8 83.4 84.6 64.5 90.4 67.2 64.4 71.3
MoRAM 44.1 81.6 64.6 79.6 83.9 84.4 66.5 89.7 68.4 64.1 72.7
Last
LwF 25.5 72.1 38.9 55.4 65.5 87.3 81.9 88.6 63.6 61.5 64.0
WiSE-FT 21.8 76.8 42.9 20.8 77.5 84.9 30.7 76.6 75.8 72.5 58.0
iCaRL 25.5 72.1 38.9 55.4 65.5 87.3 81.9 88.6 63.6 61.5 64.0
ZSCL 33.1 75.3 43.5 35.2 74.6 87.4 50.4 84.2 77.3 73.4 63.4
MoE-Adapter 43.2 78.7 57.6 32.8 79.4 86.0 86.7 87.8 78.2 74.2 70.5
RAIL-Primal 41.7 94.0 66.0 86.4 97.2 82.4 93.1 83.6 75.0 71.3 79.1
CoDyRA 37.7 81.5 65.1 89.9 91.4 85.5 96.8 93.3 77.3 73.5 79.2
MoRAM 37.7 81.5 70.7 92.4 95.0 86.0 97.6 92.6 81.0 74.7 80.9

TRACE (LLMs)

Comparison on the TRACE benchmark: Overall Performance (OP, higher is better) and Backward Transfer (BWT, lower is better). Values are mean ± standard deviation over three runs. The MoRAM column is highlighted.

FIX(ICL) SeqLoRA OGD GEM EWC L2P DualPrompt HiDeLoRA O-LoRA TreeLoRA MoRAM
meta-llama / LLaMA-2-7B-Chat
OP 38.94 ± 0.3 34.3 ± 1.2 42.09 ± 1.6 40.08 ± 1.6 42.36 ± 1.2 36.23 ± 0.8 37.69 ± 1.2 41.60 ± 0.8 42.78 ± 0.8 43.52 ± 1.0 44.54 ± 0.9
BWT 18.5 ± 0.8 8.06 ± 1.2 6.77 ± 1.2 5.97 ± 0.8 8.25 ± 0.8 8.03 ± 0.8 7.12 ± 0.4 7.16 ± 0.4 3.46 ± 0.4 1.37 ± 0.3
google / Gemma-2B-it
OP 32.3 ± 0.2 31.89 ± 0.8 32.85 ± 1.4 26.48 ± 1.5 28.35 ± 1.6 31.14 ± 1.2 32.42 ± 1.0 33.25 ± 0.9 33.73 ± 0.8 33.41 ± 0.9 36.27 ± 0.7
BWT 15.28 ± 0.4 12.27 ± 0.9 18.25 ± 0.9 16.96 ± 1.2 15.77 ± 0.7 14.25 ± 0.5 13.66 ± 0.5 12.36 ± 0.4 8.50 ± 0.5 2.74 ± 0.4
meta-llama / LLaMA-3-1B-Instruct
OP 31.16 ± 0.4 29.73 ± 1.6 30.12 ± 2.0 32.19 ± 2.0 31.96 ± 1.6 29.38 ± 1.2 30.76 ± 1.2 33.73 ± 1.2 32.94 ± 0.8 36.14 ± 0.7 37.77 ± 0.8
BWT 17.03 ± 1.2 15.2 ± 1.6 10.74 ± 1.6 11.62 ± 1.2 13.57 ± 0.8 11.34 ± 0.8 12.36 ± 0.8 12.89 ± 1.2 7.36 ± 0.8 3.12 ± 0.8

Visualization of rank activations

Because MoRAM routes through per-atom mixture weights, we can read off each rank-1 atom’s contribution as a rank activation map: a heatmap whose rows are rank indices (grouped by task, separated by dashed lines) and columns are image patches, with colour intensity proportional to activation value. Maps below are extracted from the \(K\) projection in CLIP’s image-encoder attention (layer 8) during continual fine-tuning on X-TAIL. Coloured boxes highlight semantically informative patch columns. Together the three figures illustrate atom specialization, forgetting mitigation, and knowledge reuse. For full analysis and protocol, see §4.2 and Appendix A.12 in the paper.

Rank activations across Tasks 1–2 (Fig. 3)

(a) Task 1 (Aircraft) input evaluated after learning Task 1 only — 16 ranks available. A handful of ranks (e.g. rank 12) fire strongly on specific patches while others stay near zero, showing clear specialization. (b) The same Task 1 input re-evaluated after Task 2 is added (32 ranks total). Task 1 ranks (1–16) retain virtually the same pattern; the newly added Task 2 ranks (17–32) remain mostly silent — evidence that new capacity does not disturb earlier atoms. (c) A Task 2 input evaluated after learning Task 2. Task 2 ranks now show strong, distinct activations, while some Task 1 ranks also fire where visual patterns overlap, indicating early knowledge reuse. Orange boxes highlight patch columns of interest across panels.

Visualization of MoRAM rank activations on Tasks 1 and 2 (Fig. 3)
Fig. 3. Rank activation maps on X-TAIL Tasks 1–2. Rows = rank indices (grouped by task); columns = image patches; colour = activation value.

Forgetting mitigation across the full task stream (Fig. 6)

The same Task 1 (Aircraft) input is tracked as the model grows from 16 ranks to 160 ranks over the full 10-task X-TAIL sequence. (a) After Task 1 (16 ranks). (b) After Task 2 (32 ranks). (c) After Task 10 (160 ranks). Despite the memory bank expanding by 10×, Task 1 ranks retain essentially the same activation pattern — strong, consistent firing on the same patch columns across all three snapshots. Ranks from Tasks 2–10 remain largely inactive on Task 1 data, confirming that the freeze-and-expand design prevents cross-task interference even as capacity grows.

Extended rank-activation visualization: forgetting mitigation (Fig. 6)
Fig. 6. Forgetting mitigation: Task 1 rank activations stay stable from 16 ranks (after Task 1) to 160 ranks (after Task 10).

Knowledge reuse across tasks (Fig. 7)

Blue boxes trace shared visual structure across domains. (a) Task 1 (Aircraft) data after learning Task 1 — blue boxes mark the patch groups where Task 1 ranks respond most. (b) Later-task data (Cars) evaluated with only Task 1 atoms — Task 1 ranks still fire on patches with comparable spatial structure (e.g. object body, background). (c) Task 9 (Cars) data after learning nine tasks (144 ranks). Task 1 ranks (bottom rows, blue box) still activate on patches that share visual structure with Aircraft, while Task 9-specific ranks handle novel object details. This decomposition — reused atoms for shared patterns, fresh atoms for new concepts — emerges naturally from the self-activation mechanism without any explicit reuse objective.

Extended rank-activation visualization: knowledge reuse (Fig. 7)
Fig. 7. Knowledge reuse: Task 1 ranks (blue boxes) continue to activate on later-task data where visual patterns recur, alongside task-specific atoms.

BibTeX

@inproceedings{lu2026little,
  title     = {Little By Little: Continual Learning via Incremental Mixture of Rank-1 Associative Memory Experts},
  author    = {Lu, Haodong and Zhao, Chongyang and Xue, Minhui and Yao, Lina and Moore, Kristen and Gong, Dong},
  booktitle = {Forty-third International Conference on Machine Learning},
  year      = {2026},
  url       = {https://openreview.net/forum?id=P247k4ELcn}
}