Group E: Cold-Mutation Rescue

The fourth research stream, opened 2026-05-14 after Group C / Phase D closed. Phase D had nailed the per-task locality (D1/C8) and per-task depth (D3) findings, but left one stubborn structural blocker: patch-count evolution doesn’t work. Every Group C variant (C3 single +1, C4 head-weight=0, C5a 4× rate, C7 burst of 8, D2 in-niche) ended with the top of the population pinned at the seed count. Group E names the mechanism that breaks the blocker and characterises where it does and doesn’t help.

The hypothesis

Group C’s negative result was that cold structural mutations get culled before they can train. A freshly inserted patch has random indices, random internal weights, and a random downstream contribution. Its host’s fitness drops at insertion, selection happens before the patch matures, and the macro-mutant gets bred out.

Group E’s answer: warm-start the insertion. Instead of inserting a fresh random patch, clone an existing patch’s indices and weights with a small perturbation, halve the parent’s outgoing weights, and give the child an identical, equally halved set of outgoing edges. At insertion the network’s forward pass is unchanged up to that small perturbation (Net2Net-style behavior preservation), so the host’s fitness doesn’t drop. SGD then has gradient signal on the perturbed indices to differentiate the two patches.
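The insertion step described above can be sketched in a few lines. This is a minimal illustration, not the project's implementation: the patch layout (a dict with `idx`, `w_in`, `w_out`), the ReLU activation, and the names `forward`, `warm_insert`, `parent_ratio` and `noise` are all assumptions made for the example. For simplicity `noise` perturbs only the internal weights, not the indices.

```python
import random

def relu(v):
    return v if v > 0.0 else 0.0

def forward(patches, x, n_out):
    """Sum every patch's activation into the outputs via its outgoing weights."""
    out = [0.0] * n_out
    for p in patches:
        act = relu(sum(w * x[i] for i, w in zip(p["idx"], p["w_in"])))
        for k in range(n_out):
            out[k] += p["w_out"][k] * act
    return out

def warm_insert(patches, parent_pos, parent_ratio=0.5, noise=0.0, rng=random):
    """Clone one patch and split its outgoing weights between parent and child."""
    parent = patches[parent_pos]
    child = {
        "idx": list(parent["idx"]),
        # Hypothetical perturbation: Gaussian jitter on the internal weights only.
        "w_in": [w + (rng.gauss(0.0, noise) if noise else 0.0) for w in parent["w_in"]],
        "w_out": [(1.0 - parent_ratio) * w for w in parent["w_out"]],
    }
    parent["w_out"] = [parent_ratio * w for w in parent["w_out"]]
    patches.append(child)
    return child

# Toy network: 5 patches reading 4 inputs each, 3 outputs.
rng = random.Random(0)
x = [rng.random() for _ in range(16)]
patches = [
    {"idx": rng.sample(range(16), 4),
     "w_in": [rng.uniform(-1, 1) for _ in range(4)],
     "w_out": [rng.uniform(-1, 1) for _ in range(3)]}
    for _ in range(5)
]

before = forward(patches, x, 3)
warm_insert(patches, parent_pos=2)   # 0.5/0.5 split, no perturbation
after = forward(patches, x, 3)
max_diff = max(abs(a - b) for a, b in zip(before, after))
print(f"patches: {len(patches)}, max output change: {max_diff:.2e}")
```

With `noise=0.0` the outputs agree up to floating-point rounding, because parent and child compute the same activation and their outgoing weights sum to the original. Any `parent_ratio` preserves the output in this sketch; a nonzero `noise` trades exact preservation for a small difference between the two patches.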

Prediction: with warm-start enabled, macro-mutants survive selection; on niches with accuracy headroom, count growth translates to accuracy gain.

Headline result

E2 — warm-start delivers headroom-driven accuracy gains.

Same 5 niches as Group C, same warm-start mechanism, 900K steps each. Niches sort cleanly by accuracy headroom on both count growth and accuracy gain:

| Niche | E1 (300K) | E2 (900K) | Δ accuracy | E1 patches | E2 patches |
|---|---|---|---|---|---|
| mnist | 96.50% | 96.83% | +0.3pp | 129 | 129 |
| fashion | 86.72% | 86.59% | -0.1pp | 129 | 131 |
| kmnist | 90.01% | 90.29% | +0.3pp | 132 | 136 |
| emnist | 77.48% | 79.07% | +1.6pp | 128 | 151 |
| mixed | 81.46% | 83.20% | +1.7pp | 129 | 139 |

EMNIST (the most capacity-starved niche) grows the most patches (+23) and gains +1.6pp. Mixed (carrying all four tasks, widest aggregate headroom) gains slightly more accuracy (+1.7pp) on fewer added patches (+10). MNIST is saturated: it shows no count growth and only a minimal +0.3pp accuracy change, as the headroom hypothesis predicts.

Population-level averages: EMNIST rose from 128 patches to 152.5, KMNIST to 135.5, and Mixed to 140.9. Saturated niches stayed at 128-131.

Experiment table

| Exp | Setup | Wall time | Result |
|---|---|---|---|
| E0 | EMNIST data sanity check: D1 reproduction on parquet-derived data | 178 s | 77.72% top, within 0.6pp of original-IDX D1; minor parquet train/test ordering drift, no orientation issue |
| E1 | Warm-start, 5 niches × 300K steps, identical to D2 except warm_patch_insertion=true | ~30 min | Survival positive: 4/5 niches show >128-patch individuals in top-3 (D2: 1/5). KMNIST top and rank-2 at 132 patches. Accuracy flat: all niches within noise of D1/D2; halving tax not yet repaid in 300K steps |
| E2 | E1 with 3× training budget (900K steps per niche) | ~20 min | Translation positive on non-saturated niches: EMNIST +1.6pp/+23 patches, mixed +1.7pp/+10 patches. MNIST saturated (+0.3pp / 0 count). Headroom-driven shape confirmed |
| E3 | Split-ratio ablation on Fashion + KMNIST: parent_ratio ∈ {0.5, 0.7, 0.9} | 151 s | Clean negative for asymmetric splits: only 0.5/0.5 grew count. At higher ratios the child is too weak a lineage to specialize. Net2Net’s behavior-preserving 0.5/0.5 is Pareto-optimal. Surfaced and diagnosed the residual cycle bug along the way |
| E4 | EMNIST capacity ceiling: 1.8M steps, single niche | 178 s | Asymptote bounded: 79.90% top / 161 patches; rank-3 by fit hits 80.09% / 156 patches. Doubling budget gave +0.83pp and +10 patches; returns diminishing but not exhausted. Asymptote ~80-81% at ~170-180 patches |
| E5 | Warm-start + depth=32 hidden layer on KMNIST and EMNIST | 110 s | Depth and count evolution are substitutive, not additive. KMNIST: depth caps count growth (128 patches), D3 alone (93.49%) still beats E5 (91.91%). EMNIST: count grows under depth (139p) but depth itself hurts accuracy (74.89% vs E2 79.07%); Group B B34 robust to warm-start |
| E6 | Permuted-MNIST CL, 3 sequential tasks × 300K, pop=1 vs pop=50+warm | 38 s | Negative-leaning for the “population buffers forgetting” hypothesis. pop=50 vs pop=1: +1.5pp final accuracy, −1pp forgetting. Both lose ~45pp to catastrophic forgetting. Ecological speciation as currently implemented is a task-partitioning mechanism, not a forgetting-protection mechanism |
| E7 | Replay-based CL on E6’s setup, 4 conditions: pop=1 with 0/100/1000-per-task buffers, plus pop=50+warm+100-buffer | ~100 s | Replay solves CL. 100 examples/task closes 64% of the forgetting gap (43pp → 15pp); 1000/task closes 86% of it (43pp → 6pp). Replay axis is ~20× the population axis. Population stacks weakly with replay (+2pp) |
| E8 | E7’s pop=1 vs pop=50 conditions extended to N_TASKS ∈ {5, 8} at fixed replay=100 | ~280 s | Replay-100 is partial CL. Forgetting climbs steadily with task count (15→23→27pp). Population effect (+1-3pp) doesn’t widen with difficulty; “population buffers forgetting” is twice-rejected. Strong recency bias in retained accuracy |
| E9 | N=8 fixed, pop=1, buffer size sweep ∈ {100, 300, 1000, 3000} | ~150 s | Buffer, not mechanism. Forgetting falls by roughly 7-8pp with each ~3× buffer increase: 28→20→13→6pp. No diminishing-returns knee. Structural CL mechanisms (task-aware speciation, EWC) unmotivated at this scale; replay alone suffices at predictable memory cost |

E1-E6 ran on a single Apple M4 Pro (14 cores, 48 GB unified memory); E7-E9 ran on a 16-thread i9-9900K. Total compute for E1-E9 across ~12M training steps: under two hours wall.

Three headline scientific findings

1. Warm-start (Net2Net) unblocks patch-count evolution

C3-C7 + D2 had established that cold structural mutations get culled before training matures them. E1’s survival result — 4 of 5 niches with >128-patch individuals in top-3, where D2 had 1 of 5 — closes the loop. The mechanism works in the predicted direction: by making insertion behavior-preserving (parent halved + child clone halved = same downstream signal), the host’s fitness doesn’t drop at insertion, so selection doesn’t immediately cull the larger architecture.

This makes Group C’s biggest negative result reversible. The patch-count axis is evolvable; it just needs a smarter insertion than fresh-random.

2. The translation to accuracy is gated by training time and by niche headroom

E1 grew count but didn’t translate to accuracy. E2 with 3× more steps did — but only on niches with accuracy headroom. The pattern that emerged in E2 is robust:

| Niche state | Count growth | Accuracy gain |
|---|---|---|
| Saturated (MNIST) | No | Minimal (existing-patch tuning) |
| Near-saturated (Fashion) | Small | Noise |
| Mid-room (KMNIST) | Clear (+5-8) | Modest (+0.3pp) |
| Headroom (Mixed) | Big (+11) | +1.7pp |
| Most headroom (EMNIST) | Biggest (+24) | +1.6pp |

The mechanism interacts with ecological speciation in exactly the predicted way: each niche’s own headroom determines whether extra capacity pays off. Where it doesn’t, the count axis is correctly inert — no fitness signal for capacity, selection doesn’t reward it.

3. Depth and warm-start are substitutive, not additive

D3 had shown depth=32 gives +3.3pp on KMNIST. The natural follow-up was to stack: warm-start patches feeding a hidden layer. E5 ran this on KMNIST and EMNIST. The results:

| Niche | D3 (depth, no warm, 300K) | E2 (no depth, warm, 900K) | E5 (both, 900K) |
|---|---|---|---|
| KMNIST | 93.49% / 128p | 90.29% / 136p | 91.91% / 128p |
| EMNIST | 75.56% / 128p | 79.07% / 151p | 74.89% / 139p |

Stacking doesn’t help. On KMNIST the 32-node hidden bottleneck caps useful patch count at 128 (no warm-start mutants survive) and depth-alone still wins; warm-start adds overhead without benefit. On EMNIST depth itself hurts the task (Group B B34’s finding, robust to warm-start), and combined performance is worse than either alone.

The right frame is that each niche wants one of depth or warm-start, not both — another instance of per-task architectural conditionality. A future system that picks between them per niche would Pareto-dominate any single-architecture configuration.
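To make the "pick per niche" idea concrete, the short calculation below compares the best single configuration against a per-niche choice, using only the top accuracies quoted in the D3 / E2 / E5 comparison. It is an illustrative sketch: the config labels and variable names are made up for the example, and the runs are not budget-matched (D3 used 300K steps, E2 and E5 used 900K), so the numbers indicate direction rather than a controlled result.

```python
# Top accuracy (%) per niche and configuration, copied from the comparison above.
results = {
    "KMNIST": {"depth": 93.49, "warm_start": 90.29, "both": 91.91},
    "EMNIST": {"depth": 75.56, "warm_start": 79.07, "both": 74.89},
}

def mean_accuracy(choice):
    """Mean accuracy when niche n runs configuration choice[n]."""
    return sum(results[n][c] for n, c in choice.items()) / len(choice)

# Best single configuration applied to every niche.
configs = ["depth", "warm_start", "both"]
fixed = max(configs, key=lambda c: mean_accuracy({n: c for n in results}))
fixed_mean = mean_accuracy({n: fixed for n in results})

# Per-niche selection: each niche keeps whichever configuration scored highest.
per_niche = {n: max(accs, key=accs.get) for n, accs in results.items()}
per_niche_mean = mean_accuracy(per_niche)

print("best fixed config:", fixed, round(fixed_mean, 2))
print("per-niche choice:", per_niche, round(per_niche_mean, 2))
```

On these four numbers the per-niche choice selects depth for KMNIST and warm-start for EMNIST, and its mean accuracy is higher than that of any single configuration.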

The continual-learning arc: E6 → E9

E6 opened with a negative result that was strategically important; E7–E9 closed the question.

E6 (negative for the original hypothesis). 3 sequential permuted-MNIST tasks under two conditions: pop=1 baseline vs pop=50 with warm-start. The population condition gave +1.5pp final accuracy and −1pp forgetting — within seed noise. Both conditions still lost ~45pp to catastrophic forgetting. The informal “population diversity buffers forgetting” hypothesis didn’t survive. Ecological speciation is a task-partitioning mechanism, not a forgetting-protection mechanism.
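For readers unfamiliar with the two reported quantities, here is a minimal sketch of how average final accuracy and average forgetting are commonly computed from a task-by-task accuracy matrix. The exact definition used in these experiments is not stated, so treat this as an assumption: forgetting for a task is taken as its accuracy right after training on it minus its accuracy at the end of the sequence, averaged over all tasks (the last task contributes zero). The function name `cl_metrics` and the toy matrix are hypothetical.

```python
def cl_metrics(acc):
    """acc[t][j] = accuracy (%) on task j, measured after training on task t."""
    n = len(acc)
    final = acc[-1]
    avg_final = sum(final) / n
    # Drop for task j: accuracy just after learning it minus accuracy at the end.
    drops = [acc[j][j] - final[j] for j in range(n)]
    avg_forget = sum(drops) / n
    return avg_final, avg_forget

# Toy 3-task run with severe forgetting of the earlier tasks (made-up numbers).
acc = [
    [95.0, 10.0, 10.0],
    [40.0, 95.0, 10.0],
    [30.0, 35.0, 95.0],
]
avg_final, avg_forget = cl_metrics(acc)
print(f"avg final {avg_final:.2f}%, avg forgetting {avg_forget:.2f}pp")
```

Under this definition the two numbers sum to the mean accuracy each task had right after it was learned, which is a quick sanity check when reading result tables.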

E7 (positive for replay). Same 3 tasks; added a balanced replay buffer of either 100 or 1000 examples per prior task.

| Cond. | Condition | avg final | avg forgetting |
|---|---|---|---|
| A | no replay | 52.19% | 42.98pp |
| B | replay=100 | 79.59% | 15.24pp |
| C | replay=1000 | 88.74% | 6.07pp |
| D | pop=50 warm + replay=100 | 81.64% | 13.65pp |

100 examples per prior task, about 0.2% of each task’s training set, closes 64% of the forgetting gap (42.98pp → 15.24pp). 1000 closes 86% of it. The replay axis is ~20× the population axis. Population stacks weakly (+2pp) on top.
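A per-task replay buffer of the kind described for E7 can be sketched as follows. This is an illustration under assumptions, not the experiment's code: the class and function names, the 50/50 mixing probability `replay_prob`, and the reading of "balanced" as "pick a stored task uniformly, then one of its examples" are all hypothetical choices.

```python
import random

class ReplayBuffer:
    """Stores up to `per_task` examples for every finished task."""

    def __init__(self, per_task, seed=0):
        self.per_task = per_task
        self.store = {}                      # task_id -> list of kept examples
        self.rng = random.Random(seed)

    def add_task(self, task_id, examples):
        k = min(self.per_task, len(examples))
        self.store[task_id] = self.rng.sample(examples, k)

    def sample(self):
        # Uniform over tasks first, so every prior task is rehearsed equally often.
        task_id = self.rng.choice(list(self.store))
        return self.rng.choice(self.store[task_id])

def training_stream(current, buffer, steps, replay_prob=0.5, seed=0):
    """Yield one example per step, mixing current-task data with replayed data."""
    rng = random.Random(seed)
    for _ in range(steps):
        if buffer.store and rng.random() < replay_prob:
            yield buffer.sample()
        else:
            yield rng.choice(current)

# Two finished tasks, then training on a third with replay=100 per prior task.
buf = ReplayBuffer(per_task=100)
buf.add_task(0, [("task0", i) for i in range(1000)])
buf.add_task(1, [("task1", i) for i in range(1000)])
task2 = [("task2", i) for i in range(1000)]
stream = list(training_stream(task2, buf, steps=2000))
counts = {tag: sum(1 for t, _ in stream if t == tag) for tag in ("task0", "task1", "task2")}
print(counts)
```

Memory cost in this sketch is `per_task × number_of_finished_tasks` examples, which is the linear-in-task-count cost the replay results trade against forgetting.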

E8 (replay-100 degrades with sequence length). Pushed to N_TASKS ∈ {5, 8} at fixed replay=100. Forgetting climbs steadily with sequence length, though the per-task increment shrinks: 15pp (N=3), 23pp (N=5), 27pp (N=8). The population effect (+1-3pp) doesn’t widen in harder regimes, so the “population buffers forgetting” hypothesis is now twice-rejected, at short and long sequence lengths. Strong recency bias in retained accuracy: most-recent task ≈ 90%, oldest task ≈ 60%.

E9 (buffer scaling resolves the question). Held N=8 fixed, swept buffer size ∈ {100, 300, 1000, 3000}.

| buffer/task | avg_final | avg_forget | task0_drop |
|---|---|---|---|
| 100 | 64.40% | 28.04pp | 36.07pp |
| 300 | 71.75% | 19.87pp | 27.23pp |
| 1000 | 77.02% | 12.88pp | 14.95pp |
| 3000 | 82.57% | 6.04pp | 10.31pp |

Forgetting falls by roughly 7-8pp with each ~3× buffer increase, i.e. close to linearly in the logarithm of buffer size. No diminishing-returns knee in the sampled range. At 3000/task (24K stored examples in total, a bit under half of one task’s training set), forgetting drops to 6pp at N=8, comparable to E7-C (replay=1000) at N=3.
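One way to check the scaling shape against the E9 table is an ordinary least-squares fit of average forgetting against log10(buffer size). The sketch below uses only the four table rows; the helper `fit_line` and the variable names are illustrative, and four points are too few to say anything about behaviour outside the sampled range.

```python
import math

# E9 rows: buffer size per task and average forgetting (pp) at N=8.
buffers = [100, 300, 1000, 3000]
forgetting = [28.04, 19.87, 12.88, 6.04]

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

log_b = [math.log10(b) for b in buffers]
slope, intercept = fit_line(log_b, forgetting)

# Change in forgetting for each 3x increase in buffer size.
per_tripling = slope * math.log10(3)

# Residuals show how far each measured point sits from the fitted line.
residuals = [y - (intercept + slope * x) for x, y in zip(log_b, forgetting)]

print(f"slope: {slope:.2f} pp per 10x buffer")
print(f"per 3x step: {per_tripling:.2f} pp")
print("residuals:", [round(r, 2) for r in residuals])
```

The fitted slope comes out near -15pp per tenfold increase in buffer size (about -7pp per 3× step), with every residual under 1pp, which is what "log-linear with no knee" means for these four points.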

Closing the CL mechanism question. Structural CL mechanisms (task-aware speciation, EWC, parameter isolation) are unmotivated at this scale and task family. Replay alone is sufficient at any practical operating point; the cost is memory linear in task count. The “small replay” CL narrative is real at low task counts (100 examples works at N=3) but degrades to “store most of the data and rehearse it” as N grows. E10 (task-aware speciation) is correspondingly deprioritized — no headroom for it given replay’s coverage.

What Group E resolved and what’s still open

Resolved:

Still open:

What this implies for the broader research direction

Four things shifted in priority based on E1-E9:

  1. The patch-count evolution story is now a positive finding to publish. Group C documented the negative; Group E supplies the mechanism (warm-start) and characterises where it applies (headroom-driven, per-task). The Group B → Group C → Group E arc — manual mapping → emergent discovery → mechanistic resolution — is a coherent research narrative.

  2. The CL question closes via replay. What looked like an open mechanism problem (E6 → “need explicit preservation”) turned out to be a memory-budget problem. The replay-axis effect (E7) dwarfs the population-axis effect by 20×; the scaling story (E8 → E9) is clean log-linear with no structural barriers. Continual learning on this system, on this task family, is solved at predictable cost.

  3. “Population diversity buffers forgetting” is dead. Tested at short (E6) and long (E8) sequence, with and without replay (E7 D-vs-B): the population-as-CL-mechanism effect is consistently 1–3pp regardless of regime. Time to retire the hypothesis.

  4. The interesting unanswered question is now outside CL. The system’s distinguishing claim (online per-example SGD + evolutionary topology) has never been ablated against offline mini-batch SGD on the same architecture. The project’s positioning rests on that claim, and nothing in Group E examined it rigorously. It is a natural candidate for the next stream.