Group E: Cold-Mutation Rescue

The fourth research stream, opened 2026-05-14 after Group C / Phase D closed. Phase D had nailed the per-task locality (D1/C8) and per-task depth (D3) findings, but left one stubborn structural blocker: patch-count evolution doesn’t work. Every Group C variant (C3 single +1, C4 head-weight=0, C5a 4× rate, C7 burst of 8, D2 in-niche) ended with the top of the population pinned at the seed count. Group E names the mechanism that breaks the blocker and characterises where it does and doesn’t help.

The hypothesis

Group C’s negative was that cold structural mutations get culled before they can train. A freshly-inserted patch has random indices, random internal weights, and a random downstream contribution; its host’s fitness drops at insertion; selection happens before the patch matures; the macro-mutant gets bred out.

Group E: warm-start the insertion. Instead of fresh random patches, clone an existing patch’s indices and weights with a small perturbation, halve the parent’s outgoing weights, and give the child identical-but-halved outgoing edges. At insertion the network’s forward pass is unchanged up to that small perturbation (Net2Net behavior preservation), so the host’s fitness doesn’t drop. SGD then has gradient signal on the perturbed indices to differentiate the two patches.
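The insertion described above can be sketched on a toy network. This is a minimal illustration, not the project's implementation: the patch layout (`idx`, `w`, `out`), the `tanh` activation, and the function names `forward` and `warm_split` are all assumptions, and only the weights are perturbed here (not the indices). `parent_ratio` borrows the name of E3's ablation parameter; its signature is hypothetical.

```python
import math
import random

def forward(patches, x, n_out):
    """Each patch reads a few inputs, squashes them, and adds to every output."""
    out = [0.0] * n_out
    for p in patches:
        act = math.tanh(sum(w * x[i] for i, w in zip(p["idx"], p["w"])))
        for k in range(n_out):
            out[k] += act * p["out"][k]
    return out

def warm_split(patches, parent_i, parent_ratio=0.5, sigma=0.0, rng=None):
    """Clone one patch and split its outgoing weights between parent and child.

    Assumed layout: the parent keeps `parent_ratio` of each outgoing weight,
    the child gets the remainder. `sigma` perturbs the child's input weights.
    Returns a new list; the input list is not modified.
    """
    rng = rng or random.Random(0)
    parent = patches[parent_i]
    child = {
        "idx": list(parent["idx"]),
        "w": [w + rng.gauss(0.0, sigma) for w in parent["w"]],
        "out": [(1.0 - parent_ratio) * o for o in parent["out"]],
    }
    scaled_parent = dict(parent, out=[parent_ratio * o for o in parent["out"]])
    return patches[:parent_i] + [scaled_parent] + patches[parent_i + 1:] + [child]

rng = random.Random(42)
x = [rng.uniform(-1, 1) for _ in range(16)]
patches = [
    {"idx": rng.sample(range(16), 4),
     "w": [rng.gauss(0, 1) for _ in range(4)],
     "out": [rng.gauss(0, 1) for _ in range(3)]}
    for _ in range(5)
]

before = forward(patches, x, 3)
after_exact = forward(warm_split(patches, 2), x, 3)
after_perturbed = forward(warm_split(patches, 2, sigma=0.01, rng=rng), x, 3)

# With sigma=0 the split is function-preserving up to float rounding.
max_drift_exact = max(abs(a - b) for a, b in zip(before, after_exact))
# With a small sigma the output moves slightly, which is the signal SGD uses.
max_drift_perturbed = max(abs(a - b) for a, b in zip(before, after_perturbed))
```

In this sketch any `parent_ratio` preserves the output when `sigma=0`, since the two shares always sum to the original weight. Under that assumption, E3's preference for 0.5/0.5 would be about how the two lineages train afterwards rather than about preservation at insertion.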

Prediction: with warm-start enabled, macro-mutants survive selection; on niches with accuracy headroom, count growth translates to accuracy gain.

The pages

Group E Journal — chronological narrative.
Group E Experiments — structured records (E0, E1, E2, E3, E4, E5, E6, E7, E8, E9).

Headline result

E2 — warm-start delivers headroom-driven accuracy gains.

Same 5 niches as Group C, same warm-start mechanism as E1, 900K steps each (3× E1’s budget). Niches sort cleanly by accuracy headroom on both count growth and accuracy gain:

Niche	E1 (300K)	E2 (900K)	Δ accuracy	E1 patches	E2 patches
mnist	96.50%	96.83%	+0.3pp	129	129
fashion	86.72%	86.59%	-0.1pp	129	131
kmnist	90.01%	90.29%	+0.3pp	132	136
emnist	77.48%	79.07%	+1.6pp	128	151
mixed	81.46%	83.20%	+1.7pp	129	139

EMNIST (the most capacity-starved niche) grows the most in patch count (+23) and gains +1.6pp. Mixed (carrying all four tasks, widest aggregate headroom) gains slightly more accuracy (+1.7pp) on fewer added patches (+10). MNIST is saturated: it shows no count growth and only +0.3pp, exactly as the headroom hypothesis predicts.

Population-level: EMNIST went from avg 128 patches → avg 152.5. KMNIST → 135.5. Mixed → 140.9. Saturated niches stayed at 128-131.
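The deltas in the E1/E2 table can be recomputed directly from its columns. The snippet below is only a bookkeeping check on the figures above; the variable names are illustrative.

```python
# (E1 accuracy %, E2 accuracy %, E1 top patch count, E2 top patch count)
rows = {
    "mnist":   (96.50, 96.83, 129, 129),
    "fashion": (86.72, 86.59, 129, 131),
    "kmnist":  (90.01, 90.29, 132, 136),
    "emnist":  (77.48, 79.07, 128, 151),
    "mixed":   (81.46, 83.20, 129, 139),
}

# Accuracy gain in percentage points and patch-count growth, E1 -> E2.
deltas = {
    niche: (round(e2 - e1, 2), p2 - p1)
    for niche, (e1, e2, p1, p2) in rows.items()
}

# Niches ordered by accuracy gain, largest first.
ranked = sorted(deltas, key=lambda niche: deltas[niche][0], reverse=True)

for niche in ranked:
    gain, growth = deltas[niche]
    print(f"{niche:8s} {gain:+.2f}pp  {growth:+d} patches")
```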

Experiment table

Exp	Setup	Wall (M4 Pro; E7-E9 on i9-9900K)	Result
E0	EMNIST data sanity check — D1 reproduction on parquet-derived data	178 s	77.72% top — within 0.6pp of original-IDX D1; minor parquet train/test ordering drift, no orientation issue
E1	Warm-start 5 niches × 300K steps, identical to D2 except `warm_patch_insertion=true`	~30 min	Survival positive: 4/5 niches show >128-patch individuals in top-3 (D2: 1/5). KMNIST top and rank-2 at 132 patches. Accuracy flat: all niches within noise of D1/D2 — halving tax not yet repaid in 300K steps
E2	E1 with 3× training budget (900K steps per niche)	~20 min	Translation positive on non-saturated niches: EMNIST +1.6pp/+23 patches, mixed +1.7pp/+10 patches. MNIST saturated (+0.3pp / 0 count). Headroom-driven shape confirmed
E3	Split-ratio ablation on Fashion + KMNIST: parent_ratio ∈ {0.5, 0.7, 0.9}	151 s	Clean negative for asymmetric: only 0.5/0.5 grew count. At higher ratios the child is too weak a lineage to specialize. Net2Net’s behavior-preserving 0.5/0.5 is Pareto-optimal. Surfaced + diagnosed the residual cycle bug along the way
E4	EMNIST capacity ceiling — 1.8M steps, single niche	178 s	Asymptote bounded: 79.90% top / 161 patches; rank-3 by fit hits 80.09% / 156 patches. Doubling budget gave +0.83pp and +10 patches — returns diminishing but not exhausted. Asymptote ~80-81% at ~170-180 patches
E5	Warm-start + depth=32 hidden layer on KMNIST and EMNIST	110 s	Depth and count evolution are substitutive, not additive. KMNIST: depth caps count growth (128 patches), D3 alone (93.49%) still beats E5 (91.91%). EMNIST: count grows under depth (139p) but depth itself hurts accuracy (74.89% vs E2 79.07%) — Group B B34 robust to warm-start
E6	Permuted-MNIST CL, 3 sequential tasks × 300K, pop=1 vs pop=50+warm	38 s	Negative-leaning for the “population buffers forgetting” hypothesis. pop=50 vs pop=1: +1.5pp final accuracy, −1pp forgetting. Both lose ~45pp to catastrophic forgetting. Ecological speciation as currently implemented is a task-partitioning mechanism, not a forgetting-protection mechanism
E7	Replay-based CL on E6’s setup, 4 conditions: pop=1 with 0/100/1000-per-task buffers, plus pop=50+warm+100-buffer	~100 s	Replay solves CL. 100 examples/task closes 64% of the forgetting gap (43pp → 15pp); 1000/task closes 86% of it (43pp → 6pp). Replay axis is ~20× the population axis. Population stacks weakly with replay (+2pp).
E8	E7’s pop=1 vs pop=50 conditions extended to N_TASKS ∈ {5, 8} at fixed replay=100	~280 s	Replay-100 is partial CL. Forgetting climbs with task count, though less than linearly (15→23→27pp at N=3, 5, 8). Population effect (+1-3pp) doesn’t widen with difficulty; “population buffers forgetting” is twice-rejected. Strong recency bias in retained accuracy.
E9	N=8 fixed, pop=1, buffer size sweep ∈ {100, 300, 1000, 3000}	~150 s	Buffer, not mechanism. Forgetting falls by roughly 7pp with each ~3× buffer increase (log-linear): 28→20→13→6pp. No diminishing-returns knee. Structural CL mechanisms (task-aware speciation, EWC) unmotivated at this scale; replay alone suffices at predictable memory cost.

E1-E6 ran on a single Apple M4 Pro (14 cores, 48 GB unified memory); E7-E9 ran on a 16-thread i9-9900K. Total compute for E1-E9, across ~12M training steps: under two hours of wall time.

Three headline scientific findings

1. Warm-start (Net2Net) unblocks patch-count evolution

C3-C7 + D2 had established that cold structural mutations get culled before training matures them. E1’s survival result — 4 of 5 niches with >128-patch individuals in top-3, where D2 had 1 of 5 — closes the loop. The mechanism works in the predicted direction: by making insertion behavior-preserving (parent halved + child clone halved = same downstream signal), the host’s fitness doesn’t drop at insertion, so selection doesn’t immediately cull the larger architecture.

This makes Group C’s biggest negative result reversible. The patch-count axis is evolvable; it just needs a smarter insertion than fresh-random.

2. The translation to accuracy is gated by training time and by niche headroom

E1 grew count but didn’t translate to accuracy. E2 with 3× more steps did — but only on niches with accuracy headroom. The pattern that emerged in E2 is robust:

Niche state	Count growth	Accuracy gain
Saturated (MNIST)	No	Minimal (existing-patch tuning)
Near-saturated (Fashion)	Small	Noise
Mid-room (KMNIST)	Clear (+5-8)	Modest (+0.3pp)
Headroom (Mixed)	Big (+10)	+1.7pp
Most headroom (EMNIST)	Biggest (+23)	+1.6pp

The mechanism interacts with ecological speciation in exactly the predicted way: each niche’s own headroom determines whether extra capacity pays off. Where it doesn’t, the count axis is correctly inert — no fitness signal for capacity, selection doesn’t reward it.

3. Depth and warm-start are substitutive, not additive

D3 had shown depth=32 gives +3.3pp on KMNIST. The natural follow-up was to stack: warm-start patches feeding a hidden layer. E5 ran this on KMNIST and EMNIST. The results:

Niche	D3 (depth, no warm, 300K)	E2 (no depth, warm, 900K)	E5 (both, 900K)
KMNIST	93.49% / 128p	90.29% / 136p	91.91% / 128p
EMNIST	75.56% / 128p	79.07% / 151p	74.89% / 139p

Stacking doesn’t help. On KMNIST the 32-node hidden bottleneck caps useful patch count at 128 (no warm-start mutants survive) and depth-alone still wins; warm-start adds overhead without benefit. On EMNIST depth itself hurts the task (Group B B34’s finding, robust to warm-start), and combined performance is worse than either alone.

The right frame is that each niche wants one of depth or warm-start, not both — another instance of per-task architectural conditionality. A future system that picks between them per niche would Pareto-dominate any single-architecture configuration.
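The per-niche picker idea can be checked against the two niches in the table above. This is a sketch of the comparison only: the configuration labels are shorthand, and the caveat from the table still applies (D3 ran 300K steps, E2 and E5 ran 900K, so the budgets are not matched).

```python
# Top accuracy (%) per niche and configuration, copied from the D3/E2/E5 table.
results = {
    "kmnist": {"D3 depth": 93.49, "E2 warm": 90.29, "E5 both": 91.91},
    "emnist": {"D3 depth": 75.56, "E2 warm": 79.07, "E5 both": 74.89},
}

# Per-niche choice: take whichever configuration scored highest on that niche.
best = {niche: max(cfgs, key=cfgs.get) for niche, cfgs in results.items()}
per_niche_mean = sum(results[n][best[n]] for n in results) / len(results)

# Single-architecture baselines: one configuration applied to every niche.
configs = ["D3 depth", "E2 warm", "E5 both"]
single_means = {
    c: sum(results[n][c] for n in results) / len(results) for c in configs
}

print(best)
print(f"per-niche pick: {per_niche_mean:.2f}%")
for c, m in single_means.items():
    print(f"{c:9s}: {m:.2f}%")
```

On these two niches the picker selects depth for KMNIST and warm-start for EMNIST, and its mean is higher than that of any single configuration. Two niches is a thin basis, so this only illustrates the argument.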

The continual-learning arc: E6 → E9

E6 opened with a negative result that was strategically important; E7–E9 closed the question.

E6 (negative for the original hypothesis). 3 sequential permuted-MNIST tasks under two conditions: pop=1 baseline vs pop=50 with warm-start. The population condition gave +1.5pp final accuracy and −1pp forgetting — within seed noise. Both conditions still lost ~45pp to catastrophic forgetting. The informal “population diversity buffers forgetting” hypothesis didn’t survive. Ecological speciation is a task-partitioning mechanism, not a forgetting-protection mechanism.
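For readers unfamiliar with the metric, the sketch below shows one common way to compute average forgetting. It is an assumption that the experiments use this definition (peak accuracy on a task before the final stage, minus accuracy on it at the end). The accuracy matrix is made-up illustrative data, not a result from E6.

```python
def average_forgetting(acc):
    """acc[t][j] = accuracy (%) on task j after training on task t, for j <= t.

    Forgetting for task j is its best accuracy at any stage before the last,
    minus its accuracy after the final task. The last task is excluded because
    nothing has been trained after it.
    """
    final = acc[-1]
    n_tasks = len(final)
    drops = []
    for j in range(n_tasks - 1):
        peak = max(acc[t][j] for t in range(j, n_tasks - 1))
        drops.append(peak - final[j])
    return sum(drops) / len(drops)

def average_final(acc):
    """Mean accuracy over all tasks after the final training stage."""
    return sum(acc[-1]) / len(acc[-1])

# Hypothetical 3-task run (illustrative numbers only).
acc = [
    [97.0],
    [60.0, 97.0],
    [45.0, 62.0, 97.0],
]

forgetting = average_forgetting(acc)
final_accuracy = average_final(acc)
```

The recency pattern reported later (the newest task retained best, the oldest worst) shows up in such a matrix as the last row increasing from left to right.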

E7 (positive for replay). Same 3 tasks; added a balanced replay buffer of either 100 or 1000 examples per prior task.

Condition	avg final	avg forgetting
no replay	52.19%	42.98pp
replay=100	79.59%	15.24pp
replay=1000	88.74%	6.07pp
pop=50 warm + replay=100	81.64%	13.65pp

100 examples per prior task (0.2% of each task’s training set) closes 64% of the forgetting gap. 1000 closes 86% of it (43pp → 6pp). The replay axis is ~20× the population axis. Population stacks weakly (+2pp) on top.
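A balanced per-task replay buffer of the kind E7 describes can be sketched as follows. The class name, method names, and the round-robin sampling rule are assumptions for illustration; the source only states that the buffer holds a fixed number of examples per prior task.

```python
import random

class BalancedReplayBuffer:
    """Keeps at most `per_task` stored examples for each finished task."""

    def __init__(self, per_task, seed=0):
        self.per_task = per_task
        self.rng = random.Random(seed)
        self.store = {}  # task_id -> list of retained examples

    def add_task(self, task_id, examples):
        # Retain a uniform random subset once the task has been trained.
        k = min(self.per_task, len(examples))
        self.store[task_id] = self.rng.sample(examples, k)

    def sample(self, n):
        # Cycle over tasks so every prior task is equally represented.
        tasks = sorted(self.store)
        if not tasks:
            return []
        return [
            self.rng.choice(self.store[tasks[i % len(tasks)]])
            for i in range(n)
        ]

    def __len__(self):
        return sum(len(v) for v in self.store.values())

# Toy usage: three tasks with 1000 placeholder examples each (hypothetical sizes).
buf = BalancedReplayBuffer(per_task=100)
for task in range(3):
    buf.add_task(task, [(task, i) for i in range(1000)])

batch = buf.sample(30)
per_task_counts = {t: sum(1 for task, _ in batch if task == t) for t in range(3)}
```

Memory grows as `per_task` times the number of tasks seen, which is the cost the later buffer-scaling experiments trade against forgetting.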

E8 (replay-100 degrades with sequence length). Pushed to N_TASKS ∈ {5, 8} at fixed replay=100. Forgetting climbs with task count, though less than linearly: 15pp (N=3), 23pp (N=5), 27pp (N=8). The population effect (+1-3pp) doesn’t widen at harder regimes; the “population buffers forgetting” hypothesis is now twice-rejected, at short and long sequence. Strong recency bias in retained accuracy: most-recent task ≈ 90%, oldest task ≈ 60%.

E9 (buffer scaling resolves the question). Held N=8 fixed, swept buffer size ∈ {100, 300, 1000, 3000}.

buffer/task	avg_final	avg_forget	task0_drop
100	64.40%	28.04pp	36.07pp
300	71.75%	19.87pp	27.23pp
1000	77.02%	12.88pp	14.95pp
3000	82.57%	6.04pp	10.31pp

Forgetting falls by roughly 7pp with each ~3× buffer increase, i.e. log-linearly in buffer size. No diminishing-returns knee in the sampled range. At 3000/task (24K examples in total, about half of one task’s training set), forgetting drops to 6pp at N=8, comparable to E7-C at N=3.
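One way to read the scaling in the E9 table is to fit average forgetting against the logarithm of the buffer size. The least-squares fit below uses only the four table rows; it is a descriptive check on four points, not a model the source proposes.

```python
import math

buffers = [100, 300, 1000, 3000]          # examples per task
forgetting = [28.04, 19.87, 12.88, 6.04]  # avg_forget in pp, from the E9 table

# Ordinary least squares of forgetting on ln(buffer).
xs = [math.log(b) for b in buffers]
mean_x = sum(xs) / len(xs)
mean_y = sum(forgetting) / len(forgetting)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, forgetting))
sxx = sum((x - mean_x) ** 2 for x in xs)
slope = sxy / sxx
intercept = mean_y - slope * mean_x

# Change in forgetting per 3x and per 10x increase in buffer size.
per_tripling = slope * math.log(3)
per_decade = slope * math.log(10)

# Step-to-step ratios, to compare against a "fixed fraction per step" reading.
step_ratios = [b / a for a, b in zip(forgetting, forgetting[1:])]

print(f"per 3x buffer:  {per_tripling:+.2f}pp")
print(f"per 10x buffer: {per_decade:+.2f}pp")
print("step ratios:", [round(r, 2) for r in step_ratios])
```

The fitted slope comes out at roughly 7pp less forgetting per tripling of the buffer, with successive steps close to evenly spaced in absolute terms.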

Closing the CL mechanism question. Structural CL mechanisms (task-aware speciation, EWC, parameter isolation) are unmotivated at this scale and task family. Replay alone is sufficient at any practical operating point; the cost is memory linear in task count. The “small replay” CL narrative is real at low task counts (100 examples works at N=3) but degrades to “store most of the data and rehearse it” as N grows. E10 (task-aware speciation) is correspondingly deprioritized — no headroom for it given replay’s coverage.

What Group E resolved and what’s still open

Resolved:

Cold-mutation problem (E1 + E2) — warm-start Net2Net insertion lets macro-mutants survive selection; given enough training time, count growth translates to accuracy on niches with headroom.
Split-ratio choice (E3) — 0.5/0.5 (canonical Net2Net behavior-preserving) is Pareto-optimal. Asymmetric splits make the child too weak to specialize.
EMNIST capacity ceiling (E4) — ~80-81% at ~170-180 patches on this LR/seed configuration. Diminishing returns past 150 patches.
Depth vs warm-start interaction (E5): substitutive, not additive. Per-niche architecture choice matters; stacking never beat the better single mechanism (on KMNIST it landed between the two, on EMNIST below both).
The cycle bug (Group C / Phase D leftover) — root-caused as a sanitize-iteration completeness issue under dense SCCs. Fix: aggressive sanitize that disables all inter-cycle edges per pass instead of one. 15K+ cycle breaks in E5 all resolved in the first call; the Individual::from_genome defensive sanitize now reports 0 late-cycle breaks. 7 unit tests lock the fix in.
Continual learning (E6 → E9, resolved). Replay alone suffices: at any practical operating point, scaling buffer size linearly with task count keeps forgetting bounded. Structural CL mechanisms are unmotivated at this scale. Population diversity contributes a stable +1–3pp modifier independent of regime.
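The cycle-bug fix in the list above amounts to disabling every offending edge in one pass rather than one edge per pass. The sketch below illustrates that idea with a stand-in technique, removal of all depth-first-search back edges, which is not necessarily what the project's sanitize does. The function names and the edge-list representation are hypothetical.

```python
def sanitize_cycles(n_nodes, edges):
    """Disable every DFS back edge in one pass; the kept edges form a DAG."""
    adj = {u: [] for u in range(n_nodes)}
    for u, v in edges:
        adj[u].append(v)

    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the DFS stack, finished
    color = [WHITE] * n_nodes
    disabled = set()

    for root in range(n_nodes):
        if color[root] != WHITE:
            continue
        color[root] = GREY
        stack = [(root, iter(adj[root]))]
        while stack:
            u, neighbours = stack[-1]
            descended = False
            for v in neighbours:
                if color[v] == GREY:
                    disabled.add((u, v))  # edge back into the active path
                elif color[v] == WHITE:
                    color[v] = GREY
                    stack.append((v, iter(adj[v])))
                    descended = True
                    break
            if not descended:
                color[u] = BLACK
                stack.pop()

    kept = [e for e in edges if e not in disabled]
    return kept, sorted(disabled)

def is_acyclic(n_nodes, edges):
    """Kahn's algorithm: the graph is a DAG iff every node can be peeled off."""
    indeg = [0] * n_nodes
    adj = {u: [] for u in range(n_nodes)}
    for u, v in edges:
        adj[u].append(v)
        indeg[v] += 1
    ready = [u for u in range(n_nodes) if indeg[u] == 0]
    seen = 0
    while ready:
        u = ready.pop()
        seen += 1
        for v in adj[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                ready.append(v)
    return seen == n_nodes

# A dense strongly connected component on nodes 0..3, plus one edge leaving it.
edges = [(u, v) for u in range(4) for v in range(4) if u != v] + [(3, 4)]
kept, disabled = sanitize_cycles(5, edges)
```

Because a graph without DFS back edges has no cycles, a second call on `kept` finds nothing left to disable, which mirrors the "all resolved in the first call" behaviour reported for the real fix.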

Still open:

Cross-niche transfer. Take E2’s best EMNIST individual, transplant its patch geometry into a fresh genome, train on KMNIST. Does the geometry transfer as a useful prior or does it have to re-evolve from scratch?
Larger-scale warm-start. 256-patch or 512-patch seed + warm-start at extended budgets, see if warm-start can match Group C’s C5d 512-patch seed result via evolution rather than seeding.
Online-learning ablation. The system’s distinguishing positioning is online per-example SGD; what does that buy over offline mini-batch SGD on the same architecture? An untested load-bearing claim.

What this implies for the broader research direction

Four things shifted in priority based on E1-E9:

The patch-count evolution story is now a positive finding to publish. Group C documented the negative; Group E supplies the mechanism (warm-start) and characterises where it applies (headroom-driven, per-task). The Group B → Group C → Group E arc — manual mapping → emergent discovery → mechanistic resolution — is a coherent research narrative.
The CL question closes via replay. What looked like an open mechanism problem (E6 → “need explicit preservation”) turned out to be a memory-budget problem. The replay-axis effect (E7) dwarfs the population-axis effect by 20×; the scaling story (E8 → E9) is clean log-linear with no structural barriers. Continual learning on this system, on this task family, is solved at predictable cost.
“Population diversity buffers forgetting” is dead. Tested at short (E6) and long (E8) sequence, with and without replay (E7 D-vs-B): the population-as-CL-mechanism effect is consistently 1–3pp regardless of regime. Time to retire the hypothesis.
The interesting unanswered question is now outside CL. The system’s distinguishing claim — online per-example SGD + evolutionary topology — has never been ablated against offline mini-batch SGD on the same architecture. That’s the foundational load-bearing claim that the project’s positioning rests on, and as far as Group E has surfaced, it has not been examined rigorously. A natural next stream candidate.