Group E: Cold-Mutation Rescue
The fourth research stream, opened 2026-05-14 after Group C / Phase D closed. Phase D had nailed the per-task locality (D1/C8) and per-task depth (D3) findings, but left one stubborn structural blocker: patch-count evolution doesn’t work. Every Group C variant (C3 single +1, C4 head-weight=0, C5a 4× rate, C7 burst of 8, D2 in-niche) ended with the top of the population pinned at the seed count. Group E names the mechanism that breaks the blocker and characterises where it does and doesn’t help.
The hypothesis
Group C’s negative was that cold structural mutations get culled before they can train. A freshly-inserted patch has random indices, random internal weights, and a random downstream contribution; its host’s fitness drops at insertion; selection happens before the patch matures; the macro-mutant gets bred out.
Group E: warm-start the insertion. Instead of fresh random patches, clone an existing patch’s indices and weights with a small perturbation, halve the parent’s outgoing weights, and add identical-but-halved outgoing edges for the child. At insertion the network’s forward pass is unchanged up to that small perturbation (Net2Net behavior preservation), so the host’s fitness doesn’t drop. SGD then has gradient signal on the perturbed indices to differentiate the two patches.
Prediction: with warm-start enabled, macro-mutants survive selection; on niches with accuracy headroom, count growth translates to accuracy gain.
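The behavior-preservation argument above can be checked on a toy model. The sketch below is illustrative only: the patch layout (`indices`, `weights`, a single scalar `out` weight), the ReLU, and the names `forward` and `warm_insert` are assumptions, not the project's genome format. For simplicity it perturbs only the cloned weights, not the indices. It clones a patch, splits the outgoing weight between parent and child by `parent_ratio`, and confirms the output is unchanged when the perturbation is zero.

```python
import random

def forward(patches, x):
    """Toy forward pass: each patch reads x at its indices, applies ReLU,
    and contributes to the output through one outgoing weight."""
    total = 0.0
    for p in patches:
        pre = sum(w * x[i] for i, w in zip(p["indices"], p["weights"]))
        total += p["out"] * max(0.0, pre)
    return total

def warm_insert(patches, parent_idx, parent_ratio=0.5, noise=0.0, rng=None):
    """Return a new patch list with a warm-started clone of one patch.

    The parent keeps parent_ratio of its outgoing weight and the child
    receives the remainder, so the two contributions sum to the original
    whenever the child's internal weights equal the parent's (noise == 0).
    """
    rng = rng or random.Random(0)
    parent = patches[parent_idx]
    child = {
        "indices": list(parent["indices"]),
        "weights": [w + rng.gauss(0.0, noise) for w in parent["weights"]],
        "out": parent["out"] * (1.0 - parent_ratio),
    }
    shrunk_parent = dict(parent, out=parent["out"] * parent_ratio)
    return patches[:parent_idx] + [shrunk_parent] + patches[parent_idx + 1:] + [child]

# Hypothetical 8-patch network over a 16-dimensional input.
rng = random.Random(1)
x = [rng.uniform(-1, 1) for _ in range(16)]
patches = [
    {
        "indices": [rng.randrange(16) for _ in range(4)],
        "weights": [rng.uniform(-1, 1) for _ in range(4)],
        "out": rng.uniform(-1, 1),
    }
    for _ in range(8)
]
before = forward(patches, x)
after_exact = forward(warm_insert(patches, 3), x)
after_noisy = forward(warm_insert(patches, 3, noise=0.01), x)
print(before, after_exact, after_noisy)
```

In this toy, any split ratio preserves the output at insertion, so preservation alone does not single out 0.5/0.5; the ratio only changes how much downstream weight, and hence gradient, the child starts with.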
The pages
- Group E Journal — chronological narrative.
- Group E Experiments — structured records (E0, E1, E2, E3, E4, E5, E6, E7, E8, E9).
Headline result
E2 — warm-start delivers headroom-driven accuracy gains.
Same 5 niches as Group C, same warm-start mechanism, 900K steps each. Niches sort cleanly by accuracy headroom on both count growth and accuracy gain:
| Niche | E1 (300K) | E2 (900K) | Δ accuracy | E1 patches | E2 patches |
|---|---|---|---|---|---|
| mnist | 96.50% | 96.83% | +0.3pp | 129 | 129 |
| fashion | 86.72% | 86.59% | -0.1pp | 129 | 131 |
| kmnist | 90.01% | 90.29% | +0.3pp | 132 | 136 |
| emnist | 77.48% | 79.07% | +1.6pp | 128 | 151 |
| mixed | 81.46% | 83.20% | +1.7pp | 129 | 139 |
EMNIST (the most capacity-starved niche) grows the most patches (+23) and gains +1.6pp. Mixed (carrying all four tasks, widest aggregate headroom) gains slightly more accuracy (+1.7pp) on +10 patches. MNIST is saturated: it shows no count growth and only +0.3pp accuracy, as the headroom hypothesis predicts.
Population-level: EMNIST went from avg 128 patches → avg 152.5. KMNIST → 135.5. Mixed → 140.9. Saturated niches stayed at 128-131.
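The ordering claimed for the top individuals can be recomputed from the table. This is a small arithmetic check rather than project code; the tuple layout and the `deltas` helper are illustrative names.

```python
# (niche, E1 accuracy %, E2 accuracy %, E1 patches, E2 patches), copied from the table.
rows = [
    ("mnist", 96.50, 96.83, 129, 129),
    ("fashion", 86.72, 86.59, 129, 131),
    ("kmnist", 90.01, 90.29, 132, 136),
    ("emnist", 77.48, 79.07, 128, 151),
    ("mixed", 81.46, 83.20, 129, 139),
]

def deltas(rows):
    """Map niche -> (accuracy change in pp, patch-count change)."""
    return {name: (round(e2 - e1, 2), p2 - p1) for name, e1, e2, p1, p2 in rows}

d = deltas(rows)
# Rank niches by how many patches the top individual gained between E1 and E2.
by_count_growth = sorted(d, key=lambda name: d[name][1], reverse=True)
for name in by_count_growth:
    acc_pp, patches_gained = d[name]
    print(f"{name:8s} {acc_pp:+.2f}pp {patches_gained:+d} patches")
```

Ranking by patch growth puts EMNIST and Mixed first, and those are also the only two niches with accuracy gains above 1pp.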
Experiment table
| Exp | Setup | Wall (M4 Pro) | Result |
|---|---|---|---|
| E0 | EMNIST data sanity check — D1 reproduction on parquet-derived data | 178 s | 77.72% top — within 0.6pp of original-IDX D1; minor parquet train/test ordering drift, no orientation issue |
| E1 | Warm-start 5 niches × 300K steps, identical to D2 except warm_patch_insertion=true | ~30 min | Survival positive: 4/5 niches show >128-patch individuals in top-3 (D2: 1/5). KMNIST top and rank-2 at 132 patches. Accuracy flat: all niches within noise of D1/D2; halving tax not yet repaid in 300K steps |
| E2 | E1 with 3× training budget (900K steps per niche) | ~20 min | Translation positive on non-saturated niches: EMNIST +1.6pp/+23 patches, mixed +1.7pp/+10 patches. MNIST saturated (+0.3pp / 0 count). Headroom-driven shape confirmed |
| E3 | Split-ratio ablation on Fashion + KMNIST: parent_ratio ∈ {0.5, 0.7, 0.9} | 151 s | Clean negative for asymmetric: only 0.5/0.5 grew count. At higher ratios the child is too weak a lineage to specialize. Net2Net’s behavior-preserving 0.5/0.5 is Pareto-optimal. Surfaced + diagnosed the residual cycle bug along the way |
| E4 | EMNIST capacity ceiling — 1.8M steps, single niche | 178 s | Asymptote bounded: 79.90% top / 161 patches; rank-3 by fit hits 80.09% / 156 patches. Doubling budget gave +0.83pp and +10 patches — returns diminishing but not exhausted. Asymptote ~80-81% at ~170-180 patches |
| E5 | Warm-start + depth=32 hidden layer on KMNIST and EMNIST | 110 s | Depth and count evolution are substitutive, not additive. KMNIST: depth caps count growth (128 patches), D3 alone (93.49%) still beats E5 (91.91%). EMNIST: count grows under depth (139p) but depth itself hurts accuracy (74.89% vs E2 79.07%) — Group B B34 robust to warm-start |
| E6 | Permuted-MNIST CL, 3 sequential tasks × 300K, pop=1 vs pop=50+warm | 38 s | Negative-leaning for the “population buffers forgetting” hypothesis. pop=50 vs pop=1: +1.5pp final accuracy, −1pp forgetting. Both lose ~45pp to catastrophic forgetting. Ecological speciation as currently implemented is a task-partitioning mechanism, not a forgetting-protection mechanism |
| E7 | Replay-based CL on E6’s setup, 4 conditions: pop=1 with 0/100/1000-per-task buffers, plus pop=50+warm+100-buffer | ~100 s | Replay solves CL. 100 examples/task closes 64% of the forgetting gap (43pp → 15pp); 1000/task closes 86% of it (43pp → 6pp). Replay axis is ~20× the population axis. Population stacks weakly with replay (+2pp). |
| E8 | E7’s pop=1 vs pop=50 conditions extended to N_TASKS ∈ {5, 8} at fixed replay=100 | ~280 s | Replay-100 is partial CL. Forgetting climbs ~linearly with task count (15→23→27pp). Population effect (+1-3pp) doesn’t widen with difficulty — “population buffers forgetting” is twice-rejected. Strong recency bias in retained accuracy. |
| E9 | N=8 fixed, pop=1, buffer size sweep ∈ {100, 300, 1000, 3000} | ~150 s | Buffer, not mechanism. Forgetting falls by roughly 7 to 8pp with each ~3× buffer increase: 28→20→13→6pp. No diminishing-returns knee. Structural CL mechanisms (task-aware speciation, EWC) unmotivated at this scale; replay alone suffices at predictable memory cost. |
Group E runs E1-E6 were on a single Apple M4 Pro (14 cores, 48 GB unified memory); E7-E9 ran on a 16-thread i9-9900K, so the wall times listed for E7-E9 are not M4 Pro figures. Total compute for E1-E9 across ~12M training steps: under two hours wall.
Three headline scientific findings
1. Warm-start (Net2Net) unblocks patch-count evolution
C3-C7 + D2 had established that cold structural mutations get culled before training matures them. E1’s survival result — 4 of 5 niches with >128-patch individuals in top-3, where D2 had 1 of 5 — closes the loop. The mechanism works in the predicted direction: by making insertion behavior-preserving (parent halved + child clone halved = same downstream signal), the host’s fitness doesn’t drop at insertion, so selection doesn’t immediately cull the larger architecture.
This makes Group C’s biggest negative result reversible. The patch-count axis is evolvable; it just needs a smarter insertion than fresh-random.
2. The translation to accuracy is gated by training time and by niche headroom
E1 grew count but didn’t translate to accuracy. E2 with 3× more steps did — but only on niches with accuracy headroom. The pattern that emerged in E2 is robust:
| Niche state | Count growth | Accuracy gain |
|---|---|---|
| Saturated (MNIST) | No | Minimal (existing-patch tuning) |
| Near-saturated (Fashion) | Small | Noise |
| Mid-room (KMNIST) | Clear (+5-8) | Modest (+0.3pp) |
| Headroom (Mixed) | Big (+11) | +1.7pp |
| Most headroom (EMNIST) | Biggest (+24) | +1.6pp |
The mechanism interacts with ecological speciation in exactly the predicted way: each niche’s own headroom determines whether extra capacity pays off. Where it doesn’t, the count axis is correctly inert — no fitness signal for capacity, selection doesn’t reward it.
3. Depth and warm-start are substitutive, not additive
D3 had shown depth=32 gives +3.3pp on KMNIST. The natural follow-up was to stack: warm-start patches feeding a hidden layer. E5 ran this on KMNIST and EMNIST. The results:
| Niche | D3 (depth, no warm, 300K) | E2 (no depth, warm, 900K) | E5 (both, 900K) |
|---|---|---|---|
| KMNIST | 93.49% / 128p | 90.29% / 136p | 91.91% / 128p |
| EMNIST | 75.56% / 128p | 79.07% / 151p | 74.89% / 139p |
Stacking doesn’t help. On KMNIST the 32-node hidden bottleneck caps useful patch count at 128 (no warm-start mutants survive) and depth-alone still wins; warm-start adds overhead without benefit. On EMNIST depth itself hurts the task (Group B B34’s finding, robust to warm-start), and combined performance is worse than either alone.
The right frame is that each niche wants one of depth or warm-start, not both — another instance of per-task architectural conditionality. A future system that picks between them per niche would Pareto-dominate any single-architecture configuration.
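The per-niche selection idea can be illustrated with the numbers from the table above. This is a sketch of the selection arithmetic only, not an implemented system; `pick_per_niche` and `mean_accuracy` are hypothetical helpers. Note that the D3 column was trained for 300K steps and the other two for 900K, so the comparison is not budget-matched.

```python
# Top accuracy (%) per niche and configuration, copied from the table.
results = {
    "kmnist": {"depth": 93.49, "warm": 90.29, "both": 91.91},
    "emnist": {"depth": 75.56, "warm": 79.07, "both": 74.89},
}

def pick_per_niche(results):
    """Choose the best-scoring configuration independently for each niche."""
    return {niche: max(cfgs, key=cfgs.get) for niche, cfgs in results.items()}

def mean_accuracy(results, choice):
    """Average accuracy when `choice` maps each niche to a configuration name."""
    return sum(results[niche][choice[niche]] for niche in results) / len(results)

picks = pick_per_niche(results)
per_niche_mean = mean_accuracy(results, picks)
fixed_means = {
    cfg: mean_accuracy(results, {niche: cfg for niche in results})
    for cfg in ("depth", "warm", "both")
}
print(picks)
print(round(per_niche_mean, 2), {k: round(v, 2) for k, v in fixed_means.items()})
```

On these two niches the per-niche choice averages higher than any single fixed configuration, which is what the Pareto-dominance claim amounts to at this sample size.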
The continual-learning arc: E6 → E9
E6 opened with a negative result that was strategically important; E7–E9 closed the question.
E6 (negative for the original hypothesis). 3 sequential permuted-MNIST tasks under two conditions: pop=1 baseline vs pop=50 with warm-start. The population condition gave +1.5pp final accuracy and −1pp forgetting — within seed noise. Both conditions still lost ~45pp to catastrophic forgetting. The informal “population diversity buffers forgetting” hypothesis didn’t survive. Ecological speciation is a task-partitioning mechanism, not a forgetting-protection mechanism.
E7 (positive for replay). Same 3 tasks; added a balanced replay buffer of either 100 or 1000 examples per prior task.
| Condition | avg final | avg forgetting |
|---|---|---|
| no replay | 52.19% | 42.98pp |
| replay=100 | 79.59% | 15.24pp |
| replay=1000 | 88.74% | 6.07pp |
| pop=50 warm + replay=100 | 81.64% | 13.65pp |
100 examples per prior task (0.2% of each task’s training set) closes 64% of the gap. 1000 closes 86% of it, leaving about 6pp of forgetting. The replay axis is ~20× the population axis. Population stacks weakly (+2pp) on top.
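A balanced per-task replay buffer of the kind E7 describes could look like the following. This is a minimal sketch under stated assumptions: the class name `BalancedReplayBuffer`, the use of reservoir sampling to choose which examples are kept, and the task-uniform sampling rule are all illustrative choices, since the source only specifies a fixed number of stored examples per prior task.

```python
import random
from collections import defaultdict

class BalancedReplayBuffer:
    """Keeps at most `per_task` examples for every task seen so far."""

    def __init__(self, per_task, seed=0):
        self.per_task = per_task
        self.store = defaultdict(list)   # task_id -> retained examples
        self.seen = defaultdict(int)     # task_id -> examples offered so far
        self.rng = random.Random(seed)

    def add(self, task_id, example):
        """Reservoir sampling: each offered example of a task ends up
        retained with equal probability (an assumption of this sketch)."""
        self.seen[task_id] += 1
        slot = self.store[task_id]
        if len(slot) < self.per_task:
            slot.append(example)
        else:
            j = self.rng.randrange(self.seen[task_id])
            if j < self.per_task:
                slot[j] = example

    def sample(self, n, exclude_task=None):
        """Draw n replay examples, picking the task uniformly first so that
        every prior task is equally represented regardless of its age."""
        tasks = [t for t in self.store if t != exclude_task and self.store[t]]
        if not tasks:
            return []
        return [self.rng.choice(self.store[self.rng.choice(tasks)]) for _ in range(n)]

# Usage: fill during tasks 0 and 1, then replay only task 0 while on task 1.
buf = BalancedReplayBuffer(per_task=100, seed=1)
for task_id in (0, 1):
    for i in range(500):
        buf.add(task_id, (task_id, i))
replay_batch = buf.sample(20, exclude_task=1)
```

Memory grows as `per_task` times the number of tasks, which is the linear-in-task-count cost the E9 discussion refers to.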
E8 (replay-100 degrades with sequence length). Pushed to N_TASKS ∈ {5, 8} at fixed replay=100. Forgetting climbs roughly linearly: 15pp (N=3), 23pp (N=5), 27pp (N=8). The population effect (+1-3pp) doesn’t widen at harder regimes — the “population buffers forgetting” hypothesis is now twice-rejected, at short and long sequence. Strong recency bias in retained accuracy: most-recent task ≈ 90%, oldest task ≈ 60%.
E9 (buffer scaling resolves the question). Held N=8 fixed, swept buffer size ∈ {100, 300, 1000, 3000}.
| buffer/task | avg_final | avg_forget | task0_drop |
|---|---|---|---|
| 100 | 64.40% | 28.04pp | 36.07pp |
| 300 | 71.75% | 19.87pp | 27.23pp |
| 1000 | 77.02% | 12.88pp | 14.95pp |
| 3000 | 82.57% | 6.04pp | 10.31pp |
Forgetting drops by roughly 7 to 8pp with each ~3× buffer increase, with no diminishing-returns knee in the sampled range. At 3000/task (24K examples in total at N=8, about half of one task’s training set), forgetting drops to 6pp, comparable to E7-C at N=3.
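The buffer-scaling trend can be summarised with a least-squares line of forgetting against log10(buffer size). This is a descriptive fit over the four table rows only, added here as an illustration; it is not part of the original analysis and says nothing about buffer sizes outside the sampled range.

```python
import math

# E9 table: buffer size per task and average forgetting (pp) at N=8.
buffers = [100, 300, 1000, 3000]
forgetting = [28.04, 19.87, 12.88, 6.04]

def fit_loglinear(xs, ys):
    """Ordinary least squares for y = slope * log10(x) + intercept."""
    lx = [math.log10(x) for x in xs]
    n = len(lx)
    mean_x, mean_y = sum(lx) / n, sum(ys) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(lx, ys))
    sxx = sum((a - mean_x) ** 2 for a in lx)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

slope, intercept = fit_loglinear(buffers, forgetting)
per_tripling = slope * math.log10(3)  # change in forgetting per 3x buffer
residuals = [
    y - (slope * math.log10(x) + intercept) for x, y in zip(buffers, forgetting)
]
print(f"slope per decade: {slope:.2f}pp, per 3x: {per_tripling:.2f}pp")
print("max |residual|:", round(max(abs(r) for r in residuals), 2))
```

A roughly constant pp drop per multiplicative step in buffer size is what a straight line in log-buffer space means; the small residuals are consistent with the absence of a knee in this range.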
Closing the CL mechanism question. Structural CL mechanisms (task-aware speciation, EWC, parameter isolation) are unmotivated at this scale and task family. Replay alone is sufficient at any practical operating point; the cost is memory linear in task count. The “small replay” CL narrative is real at low task counts (100 examples works at N=3) but degrades to “store most of the data and rehearse it” as N grows. E10 (task-aware speciation) is correspondingly deprioritized — no headroom for it given replay’s coverage.
What Group E resolved and what’s still open
Resolved:
- Cold-mutation problem (E1 + E2) — warm-start Net2Net insertion lets macro-mutants survive selection; given enough training time, count growth translates to accuracy on niches with headroom.
- Split-ratio choice (E3) — 0.5/0.5 (canonical Net2Net behavior-preserving) is Pareto-optimal. Asymmetric splits make the child too weak to specialize.
- EMNIST capacity ceiling (E4) — ~80-81% at ~170-180 patches on this LR/seed configuration. Diminishing returns past 150 patches.
- Depth vs warm-start interaction (E5): substitutive, not additive. Per-niche architecture choice matters; stacking them never beats the better of the two alone, and on EMNIST it is worse than either.
- The cycle bug (Group C / Phase D leftover): root-caused as a sanitize-iteration completeness issue under dense SCCs. Fix: aggressive sanitize that disables all inter-cycle edges per pass instead of one. 15K+ cycle breaks in E5 all resolved in the first call; the Individual::from_genome defensive sanitize now reports 0 late-cycle breaks. 7 unit tests lock the fix in.
- Continual learning (E6 → E9, resolved). Replay alone suffices: at any practical operating point, scaling buffer size linearly with task count keeps forgetting bounded. Structural CL mechanisms are unmotivated at this scale. Population diversity contributes a stable +1–3pp modifier independent of regime.
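For the cycle-bug item in the list above, the difference between breaking one edge per pass and breaking every offending edge in a single pass can be sketched as follows. This is one possible reading, not the project's sanitize routine: the function name `break_cycles`, the `[src, dst, enabled]` edge layout, and the choice of "disable every DFS back edge" as the rule are all assumptions. Removing all back edges found by one depth-first traversal leaves only tree, forward, and cross edges, which cannot form a cycle.

```python
def break_cycles(num_nodes, edges):
    """Disable every back edge found by one DFS over the enabled edges.

    edges is a list of [src, dst, enabled] lists and is mutated in place.
    Returns the number of edges disabled. A second call on the result
    returns 0, because a graph without DFS back edges is acyclic.
    """
    adjacency = [[] for _ in range(num_nodes)]
    for edge_id, (src, _dst, enabled) in enumerate(edges):
        if enabled:
            adjacency[src].append(edge_id)

    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on the DFS stack, finished
    color = [WHITE] * num_nodes
    broken = 0
    for root in range(num_nodes):
        if color[root] != WHITE:
            continue
        color[root] = GRAY
        stack = [(root, iter(adjacency[root]))]
        while stack:
            node, pending = stack[-1]
            edge_id = next(pending, None)
            if edge_id is None:
                color[node] = BLACK
                stack.pop()
                continue
            dst = edges[edge_id][1]
            if color[dst] == GRAY:
                # Target is an ancestor on the stack: this edge closes a cycle.
                edges[edge_id][2] = False
                broken += 1
            elif color[dst] == WHITE:
                color[dst] = GRAY
                stack.append((dst, iter(adjacency[dst])))
    return broken

# A dense strongly connected component: every ordered pair of 5 nodes.
dense = [[i, j, True] for i in range(5) for j in range(5) if i != j]
first_pass = break_cycles(5, dense)
second_pass = break_cycles(5, dense)
print(first_pass, second_pass)
```

The dense case is where a one-edge-per-pass loop needs many iterations; the single traversal above resolves the whole component at once, at the cost of disabling more edges than a minimal fix might.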
Still open:
- Cross-niche transfer. Take E2’s best EMNIST individual, transplant its patch geometry into a fresh genome, train on KMNIST. Does the geometry transfer as a useful prior or does it have to re-evolve from scratch?
- Larger-scale warm-start. 256-patch or 512-patch seed + warm-start at extended budgets, see if warm-start can match Group C’s C5d 512-patch seed result via evolution rather than seeding.
- Online-learning ablation. The system’s distinguishing positioning is online per-example SGD; what does that buy over offline mini-batch SGD on the same architecture? An untested load-bearing claim.
What this implies for the broader research direction
Four things shifted in priority based on E1-E9:
- The patch-count evolution story is now a positive finding to publish. Group C documented the negative; Group E supplies the mechanism (warm-start) and characterises where it applies (headroom-driven, per-task). The Group B → Group C → Group E arc (manual mapping → emergent discovery → mechanistic resolution) is a coherent research narrative.
- The CL question closes via replay. What looked like an open mechanism problem (E6 → “need explicit preservation”) turned out to be a memory-budget problem. The replay-axis effect (E7) dwarfs the population-axis effect by 20×; the scaling story (E8 → E9) is clean log-linear with no structural barriers. Continual learning on this system, on this task family, is solved at predictable cost.
- “Population diversity buffers forgetting” is dead. Tested at short (E6) and long (E8) sequence, with and without replay (E7 D-vs-B): the population-as-CL-mechanism effect is consistently 1–3pp regardless of regime. Time to retire the hypothesis.
- The interesting unanswered question is now outside CL. The system’s distinguishing claim (online per-example SGD + evolutionary topology) has never been ablated against offline mini-batch SGD on the same architecture. That’s the foundational load-bearing claim that the project’s positioning rests on, and as far as Group E has surfaced, it has not been examined rigorously. A natural next stream candidate.