Group E — Journal
2026-05-14 — opening
Group E opens with a single tightly-scoped target: unblock patch-count evolution by replacing cold (random-init) patch insertions with warm (Net2Net-style) insertions copied from a successful sibling in the same genome.
The blocker is well-characterized at this point. C3 (single +1 at 5% rate), C4 (head-weight=0 behavior preservation), C5a (4× insertion rate), C7 (macro burst of 8), and D2 (in-niche competition + bursts) all show the same pattern: bursts fire, avg_patches ticks upward across the population, but best_patches stays at or near the seed count. The mutants don’t reach the top tier because their fitness drops at insertion and can’t recover before the next evolution boundary.
The Net2Net move is the natural fix because it is behavior-preserving at insertion by construction. C4’s “head weight = 0” approach was also behavior-preserving, but it left the patch contributing nothing useful, consuming compute capacity for no gain. Net2Net instead inserts a patch that immediately contributes the same signal as its parent, so the host’s fitness is unchanged at insertion. SGD then has gradient on the perturbed indices to differentiate the two patches.
Predict: best_patches climbs visibly in at least one niche (EMNIST most likely). Counter-prediction worth noting: if warm-start also fails, that’s evidence the obstacle is selection cadence (10K-step evolve_interval might just not be enough training time even for warm patches to differentiate from their parents and earn their keep). That would point us toward E4 (grace-period selection).
Side note: running on M4 today rather than the usual i9. 14 cores, 4MB L2 per core, 48GB RAM. Rayon adapts automatically; the batch-size heuristic in main.rs (500 if conn_count > 5000) may want re-tuning but the binary I’m writing uses BATCH_SIZE=100 like D2, so this shouldn’t matter for E1.
Implementation notes
add_patch_warm lives in genome/mutation.rs next to the existing cold variant. A warm_patch_insertion: bool flag on MutationConfig routes both add_patch_prob and add_patch_burst_prob to the warm version when true; default false preserves all earlier experiments. The implementation:
- Filters parent candidates to patch nodes with at least one enabled outgoing connection (a parent with all-disabled output can’t pass on a useful signal).
- Falls back to cold add_patch_matcher if no eligible parent exists (initial generation has no patches yet).
- Perturbs each cloned index independently with 50% probability, shifting ±2 in row and ±2 in column, clamped to image bounds and deduped against the patch’s own pixel set.
- Snapshots the parent’s enabled outgoing connections, halves them in place, then adds matching child→target connections at the halved weight. Net downstream contribution unchanged at insertion.
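A minimal sketch of the halving step in the last bullet, assuming a flat connection list. The `Conn` struct, its fields, and both function names are hypothetical stand-ins for illustration, not the actual types in genome/mutation.rs. It shows why the summed downstream contribution is unchanged when the child's activation equals the parent's at insertion.

```rust
/// Hypothetical stand-in for a genome connection; field names are assumptions.
#[derive(Clone, Debug, PartialEq)]
pub struct Conn {
    pub from: usize,
    pub to: usize,
    pub weight: f32,
    pub enabled: bool,
}

/// Halve the parent's enabled outgoing weights in place and give the child
/// matching connections at the halved weight. Returns how many were added.
pub fn split_outgoing(conns: &mut Vec<Conn>, parent: usize, child: usize) -> usize {
    // Snapshot indices first so the connections appended below are not revisited.
    let snapshot: Vec<usize> = (0..conns.len())
        .filter(|&i| conns[i].from == parent && conns[i].enabled)
        .collect();
    for &i in &snapshot {
        conns[i].weight *= 0.5;
        let halved = conns[i].weight;
        let to = conns[i].to;
        conns.push(Conn { from: child, to, weight: halved, enabled: true });
    }
    snapshot.len()
}

/// Summed contribution into `target`, given each source node's activation.
pub fn contribution(conns: &[Conn], activations: &[f32], target: usize) -> f32 {
    conns
        .iter()
        .filter(|c| c.enabled && c.to == target)
        .map(|c| c.weight * activations[c.from])
        .sum()
}
```

With equal parent and child activations, `contribution` returns the same value before and after `split_outgoing`. Once the child's indices are perturbed its activation can differ from the parent's, so in this sketch the equality is exact only for an unperturbed clone.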
M4 setup
Data: MNIST and Fashion symlinked from ~/Development/Recon419A/online-learning/data/. KMNIST and EMNIST-balanced not present locally; CODH (the canonical KMNIST mirror) was unreachable from this network. Pulled HuggingFace parquet versions (tanganke/kmnist, claudiogsc/emnist_balanced) and wrote /tmp/parquet_to_idx.py to convert to MNIST-style IDX files. The converter writes KMNIST row-major and EMNIST column-major to match the existing mixed.rs transpose() convention.
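For reference, a sketch of the IDX3 layout the converter has to produce, written in Rust rather than as a port of /tmp/parquet_to_idx.py. The function name and the `transpose` flag are assumptions for illustration; the header layout (magic bytes 00 00 08 03, then big-endian u32 dimensions) is the standard MNIST-style `idx3-ubyte` format.

```rust
/// Encode 8-bit images as an IDX3 (`idx3-ubyte`) byte buffer.
/// Each input image is row-major with `rows * cols` pixels.
/// `transpose = true` writes each image column-major (the EMNIST case above).
pub fn encode_idx3(images: &[Vec<u8>], rows: usize, cols: usize, transpose: bool) -> Vec<u8> {
    let mut out = Vec::with_capacity(16 + images.len() * rows * cols);
    // Magic: two zero bytes, 0x08 = unsigned byte, 0x03 = three dimensions.
    out.extend_from_slice(&[0x00, 0x00, 0x08, 0x03]);
    // Dimension sizes are big-endian u32: count, rows, cols.
    for &dim in &[images.len(), rows, cols] {
        out.extend_from_slice(&(dim as u32).to_be_bytes());
    }
    for img in images {
        assert_eq!(img.len(), rows * cols, "image has wrong pixel count");
        if transpose {
            // Walk columns first so the stored bytes are column-major.
            for c in 0..cols {
                for r in 0..rows {
                    out.push(img[r * cols + c]);
                }
            }
        } else {
            out.extend_from_slice(img);
        }
    }
    out
}
```

For the square 28×28 images used here the header dimensions are the same either way; only the pixel order inside each image changes.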
The EMNIST parquet appears to be a slightly different distribution from the original NIST IDX files Group B used: EMNIST accuracy lands ~0.8pp below D1 (77.48% vs 78.3%). Probably a differing train/test split or normalization at the parquet-conversion step. Downstream experiments should validate against the original IDX if possible. Flagging this but not fixing it for E1, since the warm-start signal is on patch count, not absolute accuracy.
2026-05-14 — E1 results: warm-start unblocks the top tier in 4/5 niches
Full 5-niche run took ~30 minutes wall on M4. Output at e1_output.txt. Headline result is that warm-start mutants survive in the top tier of selection, where D2’s cold mutants were culled back to seed count.
Patch count, top-3 by fitness
| Niche | D2 (cold) | E1 (warm) |
|---|---|---|
| mnist | 129, –, – | 129, 130, 129 |
| fashion | 129, –, – | 129, 136, 128 |
| kmnist | 128, –, – | 132, 132, 129 |
| emnist | 128, 132, 128 | 128, 129, 128 |
| mixed | 129, –, – | 129, 133, 129 |
KMNIST is the clean win: the top individual is at 132 patches and rank-2 is also at 132. That’s not one survivor; selection consistently preferred the larger architecture across lineages. Fashion’s rank-2 at 136 patches with fitness 0.8664 vs the 129-patch champion’s 0.8665 is the next-strongest: an 8-patch macro-mutant tied for the lead.
D2’s only above-seed top-3 individual was EMNIST’s rank-2 at 132. E1 has above-seed top-3 individuals in 4 of 5 niches — mnist, fashion, kmnist, mixed. The mechanism does work.
Accuracy, top individual
| Niche | D1/D2 | E1 | Δ |
|---|---|---|---|
| mnist | 96.8% | 96.50% | -0.3pp |
| fashion | 86.9% | 86.72% | -0.2pp |
| kmnist | 90.2-90.4% | 90.01% | -0.3pp |
| emnist | 77.5-78.3% | 77.48% | -0.0/-0.8 |
| mixed | 81.3% | 81.46% | +0.2pp |
All within noise. Warm-start grows count; it does not (yet) translate to accuracy. Two candidate explanations:
- Training-time budget. Mutants added at gen 20 only get ~10 generations × 10K steps to differentiate from their parent. The 50/50 head split halves each patch’s downstream contribution at insertion; with only 100K steps remaining, the child’s perturbed indices can’t have re-grown enough specialization to make the joint pair beat the parent’s pre-split signal. Predict E2 (longer steps) closes this gap.
- Net2Net halving is locally Pareto-neutral but globally fragile. The +4-patch macro-mutant pays a fixed 2× “halving tax” on its parent lineage at insertion, then has to re-earn it. On saturated niches (MNIST, Fashion) there’s no fitness headroom to repay. On non-saturated (KMNIST, EMNIST) there is, but it costs training time.
The EMNIST anomaly
D2 predicted EMNIST should be the niche where extra patches help most (lowest accuracy, most headroom). E1 shows the opposite: EMNIST’s top-3 stayed at 128/129/128 while every other niche grew. Hypotheses ranked by plausibility:
- Parquet vs original-IDX discrepancy. 47-class softmax is more sensitive to label-distribution drift than 10-class; if the parquet’s class balance or pixel normalization differs from Group B’s reference IDX, EMNIST is the niche most affected. The −0.0 to −0.8pp accuracy shortfall is consistent.
- 47-class CE penalty on halving. Each patch contributes to 47 logits, not 10. Halving the contribution may produce a sharper transient fitness dip than on the 10-class niches, so EMNIST’s mutants get culled more aggressively even under warm-start.
- Statistical noise. N=50 with one run.
E3 will need to verify the parquet conversion against an original-IDX reference if possible. If the orientation/normalization checks out, the 47-class CE explanation moves up the rank.
What E1 establishes
- The cold-mutation problem is unblocked at the survival level. Warm-start macro-mutants live in the top tier of 4/5 niches; D2’s cold mutants did not.
- The accuracy translation is gated by training time, not survival. Within 300K steps the new patches haven’t trained up enough to deliver accuracy gain. Decoupling these — show survival → show accuracy with more steps — would close the loop.
- The Net2Net halving rule may be too conservative. Halving guarantees behavior preservation at insertion, but doubles the post-insertion training burden. A less aggressive split (e.g., parent keeps 0.8×, child gets 0.2×) trades insertion stability for faster post-insertion gain. Worth testing.
Next steps
- E2: same setup as E1 with TOTAL_STEPS = 900K. Predict KMNIST/Fashion accuracy climbs and EMNIST patch count breaks past 128.
- E3: ablation on the split ratio. Try (0.5, 0.5), (0.7, 0.3), (0.9, 0.1) with otherwise-identical config. The asymmetric splits violate exact Net2Net behavior preservation but may give the child a smaller “halving tax” to repay.
- E0 (data sanity): cross-check the parquet EMNIST against a known-good reference. If we can’t recover the original IDX, at least run a fixed-architecture EMNIST baseline on the parquet data and compare to Group B’s baseline numbers.
2026-05-14 — E0: parquet EMNIST is ~0.6pp below D1 reference
Ran the C8/D1 no-mutation baseline (group_e_emnist_check) on the parquet-derived EMNIST: 300K steps, 128 patches, pop 50, identical hyperparams to D1.
Top-3 on EMNIST test set: 77.72%, 77.44%, 77.65%. D1/C8 reference: 78.3%.
Drift is ~0.6pp — below the >5pp degradation that an orientation/normalization error would produce, and well within the noise expected from a single-seed run with potentially different train/test ordering. Most likely cause: the claudiogsc/emnist_balanced parquet stores the original NIST examples in a different order than the canonical emnist-balanced-train-images-idx3-ubyte IDX, so my train_fraction = 5/6 split slices a different test subsample.
Implication for E1’s EMNIST anomaly: E1’s EMNIST top was 77.48%. E0’s no-mutation baseline at 77.72%. Difference: −0.24pp, within run-to-run noise. So warm-start did not damage EMNIST accuracy — it left it unchanged. The “EMNIST didn’t grow” finding (top-3 patches stayed at 128/129/128) remains real on the patch-count axis, but the accuracy reading is essentially tied with the no-mutation control. The E1 EMNIST result is not “warm-start hurts EMNIST” — it is “warm-start helps every niche on patch count except EMNIST, and accuracy on all niches is unchanged at this step budget.”
Why might EMNIST be the niche where warm-start fails to grow count, even when other niches succeed? Three remaining candidates:
- 47-class CE sensitivity — each patch contributes to 47 logits rather than 10, so the transient fitness disturbance during insertion is amplified.
- Higher effective entropy — EMNIST is harder so the population’s fitness sits lower; the gap between top and middle is wider; macro-mutants are easier to displace.
- N=50 single-seed noise — possible but progressively less plausible as data accumulates.
Moving past E0 — the parquet drift is bounded and not the explanation. Predict E2 (900K steps) confirms the same pattern: KMNIST and Fashion accuracy climb, MNIST stays saturated, EMNIST may or may not show count growth at the extended budget.
2026-05-14 — E2: longer training translates count to accuracy, with a clean headroom pattern
Re-ran the E1 configuration with TOTAL_STEPS = 900_000. ~19 minutes per niche wall on M4, ~95 minutes total. Output at e2_output.txt.
Top individual per niche
| Niche | D1/D2 ref | E1 (300K) | E2 (900K) | Δ vs E1 | E1 patches | E2 patches |
|---|---|---|---|---|---|---|
| mnist | 96.8% | 96.50% | 96.83% | +0.3pp | 129 | 129 |
| fashion | 86.9% | 86.72% | 86.59% | −0.1pp | 129 | 131 |
| kmnist | 90.2-90.4% | 90.01% | 90.29% | +0.3pp | 132 | 136 |
| emnist | 77.5-78.3% | 77.48% | 79.07% | +1.6pp | 128 | 151 |
| mixed | 81.3% | 81.46% | 83.20% | +1.7pp | 129 | 139 |
Population-level patch counts (avg over 50 individuals at end of run)
| Niche | E1 avg | E2 avg |
|---|---|---|
| mnist | 129.6 | 129.2 |
| fashion | 128.5 | 131.5 |
| kmnist | 129.9 | 135.5 |
| emnist | 128.6 | 152.5 |
| mixed | 129.5 | 140.9 |
The headroom pattern
A clean rank order falls out when niches are sorted by their accuracy headroom (distance from saturation):
- MNIST (97%, saturated): count didn’t grow (avg dropped from 129.6 to 129.2 — noise). Accuracy gain +0.3pp from existing patches training longer, not from new patches.
- Fashion (87%, near-saturated): small count growth (+3 avg), accuracy noise.
- KMNIST (90%, mid-room): clear count growth (+5.6 avg, top to 136), modest accuracy gain (+0.3pp).
- Mixed (81%, headroom): big count growth (+11.4 avg, top to 139), big accuracy gain (+1.7pp).
- EMNIST (77%, most headroom): biggest count growth (+23.9 avg, top to 151), with an accuracy gain (+1.6pp) essentially tied with Mixed for the largest.
This is the predicted shape exactly. Where there is fitness signal for extra capacity, warm-start delivers it. Where there isn’t (saturated niches), extra count is selection-neutral and doesn’t appear.
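The deltas quoted in the list can be recomputed from the two tables above. This is only a bookkeeping sketch: the numbers are copied from the journal tables, and the constant and function names are made up for illustration.

```rust
/// (niche, E1 avg patches, E2 avg patches, E1 top acc %, E2 top acc %),
/// copied from the E1/E2 tables in this entry.
pub const RUNS: [(&str, f64, f64, f64, f64); 5] = [
    ("mnist", 129.6, 129.2, 96.50, 96.83),
    ("fashion", 128.5, 131.5, 86.72, 86.59),
    ("kmnist", 129.9, 135.5, 90.01, 90.29),
    ("emnist", 128.6, 152.5, 77.48, 79.07),
    ("mixed", 129.5, 140.9, 81.46, 83.20),
];

/// Per-niche (name, avg patch growth, accuracy gain in pp),
/// sorted by patch growth from smallest to largest.
pub fn growth_ranking() -> Vec<(&'static str, f64, f64)> {
    let mut rows: Vec<(&'static str, f64, f64)> = RUNS
        .iter()
        .map(|&(name, p1, p2, a1, a2)| (name, p2 - p1, a2 - a1))
        .collect();
    rows.sort_by(|a, b| a.1.partial_cmp(&b.1).unwrap());
    rows
}
```

Sorting by count growth gives mnist, fashion, kmnist, mixed, emnist. On the accuracy column Mixed (+1.74pp) edges out EMNIST (+1.59pp), so the two highest-headroom niches are effectively tied there.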
E1 → E2 takeaways
- The “training time gates accuracy translation” hypothesis is confirmed. E1 grew count without accuracy; E2 with 3× more steps grows count more and translates it to accuracy on the niches with headroom. The Net2Net halving tax exists but is repayable given enough post-insertion training.
- The EMNIST anomaly from E1 was a budget artifact. With 300K steps EMNIST’s warm-start mutants couldn’t keep up in selection. With 900K steps EMNIST grows count harder than any other niche (avg 152.5 vs starting 128, i.e. 19% population-level growth). This validates the original “EMNIST is most under-capacitied” prediction that motivated D2.
- MNIST is genuinely saturated at 128 patches. Tripling training time gave +0.3pp on MNIST with no count growth. That’s the signature of a niche that doesn’t want more patches: selection doesn’t reward them. Group B B25’s finding that MNIST gains from depth, not more patches, lines up: depth is the next capacity axis, not patch count.
- Mixed niche shows the cleanest result. +1.74pp accuracy at +10 patches against a baseline that was worse than the per-task averages. Mixed has the widest fitness headroom because it’s the only niche carrying all four tasks; warm-start exploits that.
- Sanitize stayed at 0/0/0 across 4.5M steps. No invariant violations from warm-start at any budget. The mechanism is well-behaved.
Where this leaves Group E
E1 + E2 together establish:
- Warm-start (Net2Net-style) unblocks the cold-mutation problem (E1) and the unblock translates into measurable accuracy gain on non-saturated niches given enough training time (E2).
- The mechanism interacts well with ecological speciation: each niche’s own headroom determines whether count growth pays off.
- Two of D1/D2’s predicted patterns (“EMNIST should grow most”; “MNIST should grow least”) were inverted in E1 due to budget, and are restored in E2.
Updated next-step priority
E3 (split-ratio ablation) is now a refinement question, not a load-bearing one. The Net2Net 0.5/0.5 split works; the question is whether asymmetric splits (0.7/0.3, 0.9/0.1) get the same translation in less wall time. Worth running because the answer affects how quickly downstream experiments can iterate — but no longer central to whether warm-start “works.”
Three new questions surface from E2 that probably matter more than E3:
- E4 (capacity ceiling sweep): where does EMNIST’s accuracy plateau if we let it keep growing? E2 shows 151 patches at 79%. With seed=128 + warm-start at 1.8M steps, does it find 175 patches at 80%? 200 at 81%? Knowing the asymptote bounds the “value of warm-start for this task.”
- E5 (warm-start + depth): D3 showed depth=32 hidden layer gives +3.3pp on KMNIST. Combine warm-start patches with the depth=32 architecture — does count evolution under depth find a smaller or larger patch count optimum?
- E6 (continual learning): hand E2’s per-niche populations Permuted-MNIST or rotated-MNIST as a stream; do warm-start mutants buffer against forgetting better than D1’s fixed-count populations? This is the bridge to the continual-learning literature from the earlier strategy conversation.
Will run E3 next anyway because the binary is built and CPU is free, but expect it to be a refinement signal not a load-bearing result.
2026-05-14 — cycle bug resurfaces in E3 (parent_ratio=0.7), defensive patch
First E3 cell (Fashion, warm_parent_ratio=0.5) completed clean — 86.63% test, 129 patches, sanitize stats 0/0/0. Then the second cell (Fashion, warm_parent_ratio=0.7) panicked during gen 1 at phenotype.rs:209 with the same “no entry found for key” signature as the D2 cycle bug.
Backtrace (with RUST_BACKTRACE=1):
synth::network::phenotype::Network::from_genome (line 213)
synth::population::individual::Individual::from_genome (line 14)
synth::population::population::Population::evolve (line 159)
So it’s a crossover+mutate cycle that survives the Phase D genome.sanitize() at the end of mutate(). Two interesting facts:
- The bug appeared at parent_ratio=0.7 but not at 0.5. Same seed family (SEED + 17 derivation), same niche (Fashion), same hyperparameters except the warm split ratio. The split ratio only affects head-weight values during add_patch_warm; it does not change the topology produced. So the topology distribution is identical between the two cells; what differs is the RNG cascade after warm_parent_ratio is consulted, which shifts subsequent random draws.
- E1 and E2 ran 4.5M steps clean without tripping this. Either the bug is genuinely rare (D2’s 1-in-1.5M-steps rate would fit a Poisson tail at this budget) or the warm-start path makes it slightly more likely than cold via some interaction.
Applied fix: defensive genome.sanitize() call inside Individual::from_genome, right before Network::from_genome. Cheap (one Kahn pass on what should already be a DAG), and absorbs any residual invariant gap. Documented in the code that this is a guard against a known unresolved bug in mutate’s sanitize path.
What this defers: identifying the exact crossover/mutate sequence that produces a cycle the inner sanitize misses. The sanitize logic looks correct on paper — it iteratively disables one inter-cycle edge per pass, up to 256 passes, breaking out when sorted.len() == nodes.len(). The most plausible miss is a subtle ordering issue between the dangling-prune step and the cycle-break step, or a connection that’s marked enabled but references a node that was just removed. Worth a focused root-cause investigation when we’re not mid-experiment.
Re-running E3 with the defensive sanitize. Expect it to complete cleanly with non-zero sanitize stats in at least the parent_ratio=0.7 and 0.9 cells.
2026-05-14 — E3 results: 0.5/0.5 is Pareto-optimal, my hypothesis was inverted
E3 retry completed cleanly across all 6 cells. Defensive sanitize fired 1849 times total — the cycle bug was real and frequent. Findings:
Asymmetric splits do not unlock more patch growth. Only the 0.5/0.5 cells produced top individuals with 129 patches; both 0.7/0.3 and 0.9/0.1 stayed at 128 on both niches. My hypothesis (higher parent_ratio reduces halving tax) was inverted: at parent_ratio > 0.5, the child gets too little weight (30% or 10% of original) to be a viable lineage. Selection treats it as overhead and never rewards count growth.
Test accuracy is flat across ratios (~0.3pp variance, within seed noise). The Net2Net 0.5/0.5 ratio is the right design choice — its behavior-preservation guarantee isn’t just convenient, it’s optimal because it gives the child enough downstream influence to specialize.
Surprising decoupling on Fashion: 0.7/0.3 had highest training-window fitness (0.8903) but lower test (86.37%) than 0.5/0.5 (0.8773 / 86.63%). Possibly a population diversity effect — asymmetric splits create more parent-child correlation, narrowing the population’s coverage of the data distribution.
Cycle bug is seed-dominated, not ratio-dominated. Counts across cells: 0/621/137 (Fashion) and 0/0/1091 (KMNIST). The bug fires when the RNG cascade hits a crossover pattern that closes a cycle in the enabled subgraph; the defensive Individual::from_genome sanitize handles all of them transparently.
Where E1-E3 leaves Group E
The core hypothesis is now settled:
- Warm-start (Net2Net at 0.5/0.5) unblocks the cold-mutation problem. E1 demonstrated survival; E2 demonstrated accuracy translation given enough training time; E3 showed the 0.5/0.5 ratio is Pareto-optimal among the tested alternatives.
- The mechanism is task-conditional via ecological speciation. EMNIST (most under-capacitied) gains most; MNIST (saturated) gains nothing on patch count. The niche’s own headroom determines the outcome — selection only rewards extra capacity where there’s accuracy headroom to gain.
- There is a sanitize bug that the mutate-end pass misses. The defensive Individual::from_genome sanitize handles it but adds a per-individual cost. Worth root-causing.
The four open follow-ups, in rough priority order:
- Root-cause the cycle bug. Every future run pays a small per-individual sanitize cost. Worth identifying the exact crossover/mutation sequence so the bug can be fixed at its source, not at its consumer.
- E4: EMNIST capacity ceiling. E2 showed 79.07% at 151 patches. Where does it plateau at 1.8M steps? Bounds the value of warm-start on the highest-headroom niche.
- E5: warm-start + depth. D3 showed depth=32 gives +3.3pp on KMNIST. Combine with warm-start; does count evolution find a smaller or larger optimum under depth? Tests whether the patch-count axis interacts with the depth axis.
- Permuted-MNIST continual learning. Tests the “population diversity buffers catastrophic forgetting” claim against a standard CL benchmark. The directionally most-valuable experiment per the earlier strategy conversation.
2026-05-14 — E4, E5, cycle-bug root-cause + fix, E6, and tests landed
All four follow-ups closed in a ~50-minute push on M4 Pro. Throughput note: my E2 wall-time estimates were off by 4-5×; M4 is doing ~10K aggregate steps/second on the pop=50 patches-only topology. E2’s 4.5M total steps actually ran in ~20 min, not the 95 I journaled. Recalibrated mental model: a “300K-step single-niche cell” is ~30s of wall time, not 5 min.
E4 (EMNIST capacity ceiling at 1.8M, 178s wall)
Top individual: 79.90% at 161 patches. Top test accuracy (rank-3 by fit): 80.09% at 156 patches. Compared to E2 (900K, 79.07% / 151 patches): +0.83pp accuracy and +10 patches for 2× the budget. Returns diminishing but not exhausted. Best estimate for the asymptote on this LR/seed configuration: ~80-81% at ~170-180 patches.
A new pattern surfaced: top-by-test-accuracy is not top-by-fitness above the saturation point. The rolling-window training fitness loses discriminative power above ~80%; multiple near-tied individuals at the top have visibly different test rankings. Future high-budget runs should report top-by-fit AND top-by-test as separate readings.
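A small sketch of reporting the two readings separately. `Summary`, its fields, and `top_by` are hypothetical names, not types from the codebase.

```rust
use std::cmp::Ordering;

/// Hypothetical per-individual summary; field names are assumptions.
#[derive(Clone, Debug, PartialEq)]
pub struct Summary {
    pub patches: usize,
    /// Rolling-window training fitness.
    pub fitness: f64,
    /// Held-out test accuracy in percent.
    pub test_acc: f64,
}

/// Return the individual maximizing `key`, or None for an empty population.
pub fn top_by<F: Fn(&Summary) -> f64>(pop: &[Summary], key: F) -> Option<&Summary> {
    pop.iter()
        .max_by(|a, b| key(*a).partial_cmp(&key(*b)).unwrap_or(Ordering::Equal))
}
```

Calling `top_by(&pop, |s| s.fitness)` and `top_by(&pop, |s| s.test_acc)` on the same population can return different individuals, which is the E4 situation (161 patches by fitness, 156 patches by test accuracy).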
E5 (warm-start + depth=32, 110s wall for 2 cells)
| Niche | D3 (depth, no warm, 300K) | E2 (no depth, warm, 900K) | E5 (depth+warm, 900K) |
|---|---|---|---|
| KMNIST | 93.49% / 128p / 6669c | 90.29% / 136p / 10550c | 91.91% / 128p / 6669c |
| EMNIST | 75.56% / 128p / 6669c | 79.07% / 151p / 11705c | 74.89% / 139p / 7021c |
Two distinct patterns:
- KMNIST: depth caps count growth. With depth=32, patches stay at 128 (no warm-start mutants survive selection). D3-alone beats E5. Conclusion: when depth gives the right inductive bias, warm-start adds overhead without benefit.
- EMNIST: count grows under depth (128 → 139) but depth itself hurts accuracy (E2 79.07% → E5 74.89%). Group B B34’s depth-hurts-EMNIST finding is robust to warm-start.
The headline: depth and count evolution are substitutive, not additive. Each task wants one or the other, not both. The ecological speciation framing remains right — different niches want different architectures — and a future system should pick between depth and warm-start per niche, not stack them.
Cycle-bug root-cause + fix
E3’s defensive sanitize was working but I hadn’t proven it. Added a separate LATE_CYCLE_BREAKS atomic that counts cycles caught in the Individual::from_genome sanitize call (i.e., cycles that escaped the prior mutate() sanitize). Re-ran E3 with instrumentation: 1849 total cycle breaks, 116 late — 6.3% of cycles slipped past mutate’s sanitize. Proof that the bug existed.
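The instrumentation can be sketched with two process-wide atomics. Only the name LATE_CYCLE_BREAKS comes from the journal; `CYCLE_BREAKS`, `record_breaks`, and `late_fraction` are assumptions made for this illustration.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Cycle breaks made by any sanitize call (hypothetical companion counter).
pub static CYCLE_BREAKS: AtomicUsize = AtomicUsize::new(0);
/// Cycle breaks made by the defensive call, i.e. cycles that escaped
/// the sanitize at the end of mutate().
pub static LATE_CYCLE_BREAKS: AtomicUsize = AtomicUsize::new(0);

/// Record `broken` cycle breaks; `late` marks the defensive call site.
pub fn record_breaks(broken: usize, late: bool) {
    // Relaxed is enough: these are statistics, not synchronization.
    CYCLE_BREAKS.fetch_add(broken, Ordering::Relaxed);
    if late {
        LATE_CYCLE_BREAKS.fetch_add(broken, Ordering::Relaxed);
    }
}

/// Fraction of all cycle breaks that happened late (0.0 if none recorded).
pub fn late_fraction() -> f64 {
    let total = CYCLE_BREAKS.load(Ordering::Relaxed);
    if total == 0 {
        return 0.0;
    }
    LATE_CYCLE_BREAKS.load(Ordering::Relaxed) as f64 / total as f64
}
```

With the E3 counts (1849 total, 116 late) `late_fraction()` comes out at about 0.063, matching the 6.3% above.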
Then E5 launched on a depth=32 genome and panicked immediately at gen 1, even with the defensive sanitize. Some cycle was escaping both sanitize calls. Theory: the depth=32 topology is denser (more cross-layer edges potentially involved in crossover-induced cycles), and the original “disable one edge per pass × 256-pass cap” sanitize was hitting the cap before fully cleaning a multi-edge SCC.
Fix: changed sanitize to “disable ALL inter-cycle edges per pass × 16-pass cap.” Algorithmically equivalent — both versions break cycles until no inter-cycle enabled edges remain — but the new version makes progress on every pass instead of one edge at a time. Two passes typically suffice.
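A rough sketch of the "disable all per pass" shape, under stated assumptions: the `Edge` type and function names are hypothetical, and "inter-cycle edge" is read here as "enabled edge whose two endpoints are both left unsorted by Kahn's algorithm". That reading is deliberately coarse (it also catches edges merely downstream of a cycle) and may not match the real sanitize exactly.

```rust
use std::collections::{HashMap, HashSet, VecDeque};

/// Hypothetical stand-in for a genome connection.
#[derive(Clone, Debug, PartialEq)]
pub struct Edge {
    pub from: u32,
    pub to: u32,
    pub enabled: bool,
}

/// Kahn's algorithm over enabled edges. Returns the nodes it could not
/// order; an empty set means the enabled subgraph is a DAG.
fn unsorted_nodes(nodes: &[u32], edges: &[Edge]) -> HashSet<u32> {
    let mut indegree: HashMap<u32, usize> = nodes.iter().map(|&n| (n, 0)).collect();
    for e in edges.iter().filter(|e| e.enabled) {
        if let Some(d) = indegree.get_mut(&e.to) {
            *d += 1;
        }
    }
    let mut queue: VecDeque<u32> =
        nodes.iter().copied().filter(|n| indegree[n] == 0).collect();
    let mut remaining: HashSet<u32> = nodes.iter().copied().collect();
    while let Some(n) = queue.pop_front() {
        remaining.remove(&n);
        for e in edges.iter().filter(|e| e.enabled && e.from == n) {
            if let Some(d) = indegree.get_mut(&e.to) {
                *d -= 1;
                if *d == 0 {
                    queue.push_back(e.to);
                }
            }
        }
    }
    remaining
}

/// True when the enabled subgraph has no cycle.
pub fn is_dag(nodes: &[u32], edges: &[Edge]) -> bool {
    unsorted_nodes(nodes, edges).is_empty()
}

/// Drop dangling edges, then on each pass disable every enabled edge whose
/// endpoints are both still unsorted. Returns the number of edges disabled.
pub fn sanitize(nodes: &[u32], edges: &mut Vec<Edge>, max_passes: usize) -> usize {
    let known: HashSet<u32> = nodes.iter().copied().collect();
    edges.retain(|e| known.contains(&e.from) && known.contains(&e.to));
    let mut disabled = 0;
    for _ in 0..max_passes {
        let stuck = unsorted_nodes(nodes, edges.as_slice());
        if stuck.is_empty() {
            break;
        }
        for e in edges.iter_mut() {
            if e.enabled && stuck.contains(&e.from) && stuck.contains(&e.to) {
                e.enabled = false;
                disabled += 1;
            }
        }
    }
    disabled
}
```

In this version the first pass removes every edge among the stuck nodes and the second pass only confirms the graph is sorted, so the pass cap is a safety net rather than a budget.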
Empirical validation: E5 retry produced 15,072 cycle breaks across the 2-niche run with 0 late breaks. The aggressive sanitize fully resolves cycles in the call where they’re first detected.
Locked the fix in with 7 unit tests covering:
- DAG sanitize is a no-op (correctness)
- 3-cycle gets broken (basic functionality)
- Idempotency: sanitize twice = sanitize once (the load-bearing property)
- Two independent cycles both get cleaned
- Dense 5-node SCC fully resolved (regression for E5’s depth-genome failure mode)
- Dangling connections dropped
- Self-loop disabled
All 7 pass. The Individual::from_genome defensive sanitize stays in place (cheap, belt-and-braces), but the late-cycle counter is now 0 across all subsequent runs.
E6 (Permuted-MNIST continual learning, 38s wall)
Tests “population diversity buffers catastrophic forgetting” directly. 3 sequential permuted-MNIST tasks, 300K steps each.
| Condition | Final avg acc | Avg forgetting |
|---|---|---|
| A: pop=1 no warm | 50.07% | 45.18pp |
| B: pop=50 warm | 51.53% | 44.22pp |
Population gives +1.5pp final accuracy and −1pp forgetting — within noise. Both conditions still catastrophically forget about half their prior-task knowledge.
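The journal does not spell out how the two columns are computed. The sketch below assumes one common convention: final average is the mean accuracy over all tasks after the last task, and forgetting for a task is its best accuracy at any evaluation point from its own training onward minus its final accuracy, averaged over every task including the last. Function names and the accuracy-matrix layout are assumptions.

```rust
/// `acc[t][i]` = accuracy (%) on task `i` after training through task `t`.
/// Entries with i > t (tasks not yet seen) are ignored by both functions.
pub fn final_average(acc: &[Vec<f64>]) -> f64 {
    let last = &acc[acc.len() - 1];
    last.iter().sum::<f64>() / last.len() as f64
}

/// Mean over tasks of (peak accuracy since the task was trained) minus
/// (accuracy after the final task). The last task contributes zero.
pub fn average_forgetting(acc: &[Vec<f64>]) -> f64 {
    let n = acc.len();
    let mut total = 0.0;
    for i in 0..n {
        let peak = (i..n).map(|t| acc[t][i]).fold(f64::NEG_INFINITY, f64::max);
        total += peak - acc[n - 1][i];
    }
    total / n as f64
}
```

For a made-up 3-task matrix with rows `[96, 0, 0]`, `[40, 95, 0]`, `[20, 35, 96]`, this gives a final average near 50.3% and average forgetting near 45.3pp. Those inputs are illustrative only, not measured E6 values.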
This is a negative-leaning result for the informal hypothesis. Population dynamics on top of gradient descent do not preserve historical task knowledge in any meaningful way. Selection during task k drives the population toward task-k specialists; prior-task knowledge survives only incidentally.
Important implication for the broader research direction (re: the earlier “AGI will probably be online” conversation): if continual learning is the goal, the current ecological-speciation mechanism is not a forgetting-protection substrate. Three credible additions that would protect against forgetting:
- Replay buffer — sample from prior-task data during current-task training.
- Task-aware speciation — at task boundaries, freeze best-of-task individuals in a protected niche.
- EWC-style regularization — penalize weight updates by per-weight Fisher information from prior tasks.
Ecological speciation as currently implemented is a task-partitioning mechanism (different niches handle different tasks via independent populations), not a forgetting-protection mechanism (single niche seeing tasks sequentially).
Where E1-E6 leaves Group E
The Group E hypothesis (warm-start unblocks cold-mutation problem) is confirmed:
- E1: warm-start mutants survive selection where cold mutants get culled.
- E2: with enough training time, count growth translates to accuracy on non-saturated niches (EMNIST +1.6pp, Mixed +1.7pp).
- E3: 0.5/0.5 (Net2Net) is the Pareto-optimal split ratio.
- E4: returns are diminishing but not exhausted at 1.8M steps; ~80-81% looks like the EMNIST asymptote.
Two limit findings:
- E5: warm-start does not stack with depth. They are substitutive, not additive. Per-niche architecture choice matters.
- E6: population dynamics alone do not buffer catastrophic forgetting. Need an explicit mechanism.
Methodological wins:
- Aggressive sanitize fully resolves the cycle bug (7 tests locking in the fix).
- M4 Pro throughput: ~10K aggregate steps/sec on the small-genome workload; 300K-single-niche cells take ~30s.
Updated next directions
The continual-learning negative in E6 reframes the research priorities. The interesting next experiments, ranked:
- E7: replay-based continual learning. Take E6’s setup and add a replay buffer that samples from prior-task data during current-task training. This is the easy-to-implement baseline that the CL literature has used for 30+ years. If a small replay buffer (say 100 examples per prior task) closes most of the 45pp forgetting gap, we have a real CL system. If even with replay we plateau well below per-task ceiling, we have a harder structural problem.
- E8: task-aware speciation for CL. Modify the niche structure to handle task transitions: at task boundary, save the best-of-task individuals to a “memory niche” that isn’t subject to current-task selection. Re-eval them periodically. Tests whether explicit preservation works where implicit (population diversity alone) didn’t.
- Cross-niche transfer (originally queued, still relevant). Take E2’s per-niche populations, transplant individuals across niches, measure recovery time. Tests whether the per-task evolved architectures function as priors for related tasks.
- Larger-scale warm-start sweep. 256-patch seed + warm-start at 1.8M steps on KMNIST + Fashion. Pushes the count axis past the seed-doubling threshold to see if warm-start can match Group C’s C5d 512-patch seed result while evolving the count rather than seeding it.
The E4-E6 push took about 50 minutes of wall time on M4 Pro and a total of ~3M+ training steps. Throughput allows rapid iteration. The bottleneck is design and analysis, not compute.
2026-05-18 — E7: replay solves the catastrophic-forgetting gap
Resuming after the 4-day E6 → E7 gap. Implemented a replay buffer mechanism on top of the E6 binary structure and ran the natural 4-cell comparison: pop=1 with no replay, pop=1 with 100/task buffers, pop=1 with 1000/task buffers, pop=50 (warm) with 100/task buffers. Three sequential permuted-MNIST tasks, 300K steps each, same perm seeds as E6 so condition A is a near-reproduction of E6’s pop=1 baseline.
The result is conclusive and reframes the CL direction.
| Condition | avg final | avg forgetting | task-0 retention |
|---|---|---|---|
| A no replay | 52.19% | 42.98pp | 23.6% |
| B replay=100 | 79.59% | 15.24pp | 67.8% |
| C replay=1000 | 88.74% | 6.07pp | 87.5% |
| D pop=50 warm + replay=100 | 81.64% | 13.65pp | 75.7% |
A reproduced E6’s pop=1 result within ~2pp seed noise. B, with 100 examples per prior task (0.2% of each task’s training set retained), closed 64% of the forgetting gap. C, with 2% retention, closed 86% of it.
The relative-effect-size story. E6’s pop-axis effect was +1.5pp final accuracy / −1pp forgetting. E7’s replay-axis effect is +27.4pp final / −27.7pp forgetting. The replay axis is ~20× the population axis. The population-diversity-as-forgetting-mechanism story is dead for the foreseeable future — the mechanism worth scaling is replay.
Population is a small modulator on top of replay. D-vs-B (same buffer, +pop) gives +2.05pp final / −1.59pp forgetting. Real but small. At the toy-CL operating point, replay alone solves the problem cleanly enough that the residual headroom is shallow.
Mechanistically. Balanced replay over k+1 datasets makes the training objective at task k “minimize joint loss over all tasks seen so far” — joint training spread out in time. Catastrophic forgetting is what happens when SGD doesn’t see prior-task data; replay is the direct cure. No surprises.
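One way to picture "balanced replay over k+1 datasets" is a schedule that gives the current task and each of the k prior buffers an equal share of steps. The sketch uses deterministic round-robin for clarity; the real binary may sample randomly, and the `Source` enum and `source_for_step` are hypothetical names.

```rust
/// Where a training example is drawn from while training task `k`.
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum Source {
    /// The current task's full training set.
    Current,
    /// The replay buffer of prior task `i` (0-based, i < k).
    Buffer(usize),
}

/// Round-robin over the k + 1 sources so each gets an equal share of steps.
/// `k` is the 0-based index of the current task, i.e. the number of prior tasks.
pub fn source_for_step(step: usize, k: usize) -> Source {
    let slot = step % (k + 1);
    if slot == k {
        Source::Current
    } else {
        Source::Buffer(slot)
    }
}
```

With k = 0 every step is `Source::Current`, so the first task trains exactly as in the no-replay condition.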
Next: scale the task count
E8 is repurposed from “task-aware speciation” to “longer task sequences” — task-aware speciation only has headroom if replay fails somewhere, and the cheapest way to find where it fails is to scale task count. Plan: 5–8 sequential permuted-MNIST tasks at replay=100, pop=1 vs pop=50.
2026-05-18 — E8: replay degrades with sequence length, population effect doesn’t widen
E7 left two questions open: does forgetting stay bounded as task count grows, and does the +2pp population effect get bigger at harder regimes? E8 ran the same B (pop=1, replay=100) and D (pop=50 warm, replay=100) conditions at N_TASKS ∈ {5, 8} and rolled in the E7 N=3 numbers.
| N_tasks | B avg_final | B avg_forget | D avg_final | D avg_forget | D−B final | D−B forget |
|---|---|---|---|---|---|---|
| 3 (E7) | 79.59% | 15.24pp | 81.64% | 13.65pp | +2.05 | −1.59 |
| 5 | 71.29% | 22.52pp | 74.04% | 20.38pp | +2.75 | −2.14 |
| 8 | 65.15% | 27.18pp | 66.43% | 26.69pp | +1.28 | −0.49 |
Replay-100 is partial CL, not full CL. Forgetting keeps climbing with task count under a fixed per-task buffer (15.2pp → 22.5pp → 27.2pp at N=3/5/8), though the per-task increment shrinks. By N=8, task 0 drops to ~62% (from 96.6%). The 100-example buffer is too small to keep task 0 features alive across 7 subsequent rounds of interference. Expected behavior: at task k, current task gets 1/(k+1) of training mass, prior buffers get 1/(k+1) each but stay at 100 examples, so effective signal per prior task drops as task count grows.
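The 1/(k+1) argument can be made concrete with a little arithmetic. This sketch assumes one example per training step and an even split across sources; both are simplifications, and the function names are made up.

```rust
/// Draws each source (current task or one prior buffer) receives while
/// training task `k`, assuming an even 1/(k+1) split of `steps_per_task`.
pub fn draws_per_source(steps_per_task: usize, k: usize) -> f64 {
    steps_per_task as f64 / (k + 1) as f64
}

/// Average number of times each buffered example is revisited during task `k`.
pub fn replays_per_example(steps_per_task: usize, k: usize, buffer_size: usize) -> f64 {
    draws_per_source(steps_per_task, k) / buffer_size as f64
}
```

Under these assumptions, at the last of 8 tasks (k = 7) with 300K steps each prior task gets 37,500 draws. A 100-example buffer is therefore cycled about 375 times, while a 3000-example buffer is cycled about 12.5 times: the same training mass spread over far more distinct examples.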
The population effect does not widen with difficulty. D−B is +1.28pp at N=8, smaller than at N=5 (+2.75pp). Not monotone; just noisy around 1–3pp. The “population diversity buffers forgetting” hypothesis is now twice-rejected — once at short sequence (E6) and once at long sequence (E8). Time to retire it.
Strong recency bias in the final-row matrix. At N=8 condition D: task 7 = 90.8%, task 6 = 74.1%, task 5 = 68.3%, …, task 0 = 60.3%. The shape of forgetting is a smooth decay from most-recent to most-distant, not a cliff. Replay provides graded preservation proportional to total remaining buffer mass.
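For reference, a sketch of how avg_final, avg_forget, and task0_drop can be read off an accuracy matrix. The definition used here is an assumption, not confirmed against the experiment binaries: `acc[i][j]` is accuracy on task j after training task i, forgetting for task j is its accuracy right after it was trained minus its accuracy in the final row, and avg_forget averages over the N−1 prior tasks. The matrix values in `main` are illustrative toy numbers, not experimental results.

```rust
/// acc[i][j] = accuracy on task j measured after training on task i (j <= i).
/// Returns (avg_final, avg_forget, task0_drop). Requires at least 2 tasks.
fn cl_metrics(acc: &[Vec<f64>]) -> (f64, f64, f64) {
    let n = acc.len();
    let last = &acc[n - 1];

    // Mean of the final row: accuracy on every task after the whole sequence.
    let avg_final = last.iter().sum::<f64>() / n as f64;

    // Per-task forgetting: accuracy right after learning task j minus final accuracy.
    // The last task is excluded because nothing was trained after it.
    let drops: Vec<f64> = (0..n - 1).map(|j| acc[j][j] - last[j]).collect();
    let avg_forget = drops.iter().sum::<f64>() / drops.len() as f64;

    (avg_final, avg_forget, drops[0])
}

fn main() {
    // Toy 3-task matrix (illustrative values only).
    let acc = vec![
        vec![0.9, 0.0, 0.0],
        vec![0.8, 0.9, 0.0],
        vec![0.7, 0.8, 0.9],
    ];
    let (avg_final, avg_forget, task0_drop) = cl_metrics(&acc);
    println!(
        "avg_final={:.3} avg_forget={:.3} task0_drop={:.3}",
        avg_final, avg_forget, task0_drop
    );
}
```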
What to test next
Mechanistic question now: is the problem buffer size (fixable by scaling) or compounding interference (which would motivate structural CL mechanisms)? E9 holds N=8 fixed and sweeps buffer sizes. If replay=1000 brings forgetting back into the ~10pp range at N=8, the recipe is “scale buffer with task count” and the story is uninteresting from a mechanism standpoint. If replay=1000 still leaves >20pp forgetting, there’s structural interference replay can’t fix, and explicit mechanisms become well-motivated.
2026-05-18 — E9: replay buffer size sweep — clean log-linear scaling
E8 fork was “buffer or mechanism?” E9 holds N_TASKS=8 fixed, pop=1, sweeps buffer size 100, 300, 1000, 3000 examples per prior task. ~150s total wall time.
| buffer/task | avg_final | avg_forget | task0_drop |
|---|---|---|---|
| 100 | 64.40% | 28.04pp | 36.07pp |
| 300 | 71.75% | 19.87pp | 27.23pp |
| 1000 | 77.02% | 12.88pp | 14.95pp |
| 3000 | 82.57% | 6.04pp | 10.31pp |
Forgetting falls by roughly 7–8pp with each ~3× buffer increase (28.04 → 19.87 → 12.88 → 6.04), i.e. approximately linearly in log buffer size. The scaling is clean and monotone with no diminishing-returns knee in the sampled range. At 3000-per-task buffer (24K examples total, 48% of one task’s training set), forgetting drops to 6pp at N=8, comparable to E7-C (replay=1000 at N=3 → 6pp).
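A quick check of the log-linear reading: an ordinary least-squares fit of avg_forget against log10(buffer) using the four rows of the E9 table. This is a sketch for sanity-checking the trend, not part of the experiment code; `fit_line` is a hypothetical helper, and with only four points the fit should not be extrapolated far outside the sampled range.

```rust
/// Buffer sizes (examples per prior task) and avg_forget (pp) from the E9 table.
const BUFFERS: [f64; 4] = [100.0, 300.0, 1000.0, 3000.0];
const FORGET_PP: [f64; 4] = [28.04, 19.87, 12.88, 6.04];

/// Ordinary least-squares line through (xs, ys); returns (slope, intercept).
fn fit_line(xs: &[f64], ys: &[f64]) -> (f64, f64) {
    let n = xs.len() as f64;
    let mx = xs.iter().sum::<f64>() / n;
    let my = ys.iter().sum::<f64>() / n;
    let sxy: f64 = xs.iter().zip(ys).map(|(x, y)| (x - mx) * (y - my)).sum();
    let sxx: f64 = xs.iter().map(|x| (x - mx).powi(2)).sum();
    let slope = sxy / sxx;
    (slope, my - slope * mx)
}

fn main() {
    let log_b: Vec<f64> = BUFFERS.iter().map(|b| b.log10()).collect();
    let (slope, intercept) = fit_line(&log_b, &FORGET_PP);

    // Slope is in pp per decade of buffer size; rescale to pp per 3x step.
    let per_3x = slope * 3.0_f64.log10();
    println!("slope: {:.2} pp per 10x buffer", slope);
    println!("       {:.2} pp per 3x buffer", per_3x);
    println!("intercept at log10(buffer)=0: {:.2} pp", intercept);
}
```

The fitted slope comes out near −15pp per decade, or about −7pp per 3× step, matching the row-to-row differences in the table.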
The mechanism question resolves negative. There is no compounding interference at N=8 that replay can’t fix; scaling buffer size linearly with task count restores low-forgetting CL. The “structural CL mechanism” line of work (task-aware speciation, EWC, parameter isolation) is unmotivated at this scale and this task family. Replay alone is sufficient if you’re willing to pay the memory cost.
What 3000/task is actually costing. 3000 × 8 = 24K stored examples, about half (48%) of one task’s training set. The “small buffer CL” narrative is real at low task counts (E7-B got 79% at 0.2% retention) but degrades to “store most of the data and rehearse it” as N grows.
task0_drop > avg_forget by ~2–8pp across the sweep (8.0, 7.4, 2.1, and 4.3pp at buffers 100, 300, 1000, and 3000). Task 0 forgets more than the average because it has the most subsequent tasks interfering. Implies future CL work would benefit from asymmetric buffer weighting (more memory for older tasks) rather than uniform allocation.
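One way asymmetric weighting could look, as a purely hypothetical sketch: split a fixed total budget across tasks in proportion to how many tasks remain from each one onward, so older tasks get more memory. The linear-in-age weighting and the name `allocate_by_age` are assumptions for illustration; nothing here has been tested in the experiments.

```rust
/// Split `total` stored examples across `n_tasks` tasks, weighting task t by
/// (n_tasks - t) so that older tasks receive larger buffers.
/// Rounding leftovers are handed out one at a time, oldest task first.
fn allocate_by_age(total: usize, n_tasks: usize) -> Vec<usize> {
    let weights: Vec<usize> = (0..n_tasks).map(|t| n_tasks - t).collect();
    let weight_sum: usize = weights.iter().sum();

    // Floor of each task's proportional share.
    let mut alloc: Vec<usize> = weights.iter().map(|w| total * w / weight_sum).collect();

    // Flooring loses fewer than n_tasks examples in total; give them back.
    let mut leftover = total - alloc.iter().sum::<usize>();
    for slot in alloc.iter_mut() {
        if leftover == 0 {
            break;
        }
        *slot += 1;
        leftover -= 1;
    }
    alloc
}

fn main() {
    // Same total memory as the uniform 3000/task setting at N=8.
    let alloc = allocate_by_age(24_000, 8);
    for (t, size) in alloc.iter().enumerate() {
        println!("task {}: {} examples", t, size);
    }
}
```

At the same 24K total as the uniform 3000/task run, this gives task 0 roughly 5.3K examples and the newest task under 700. Whether that trade actually lowers task0_drop without hurting recent tasks is an open question.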
Where this leaves the CL direction
CL is now characterized completely on this system, on this task family:
- Population dynamics alone: doesn’t work (E6).
- Replay-100 at N=3: works (E7).
- Replay-100 at N=5/N=8: graceful degradation (E8).
- Replay scaling at N=8: works at any practical operating point (E9).
- Population + replay: small (1-3pp) modulator (E7+E8).
The mechanism research question is closed. What’s left is engineering: how to allocate buffer memory optimally, how to deal with truly long sequences (100+ tasks) where a 3000-per-task buffer would total 300K+ stored examples, how the system behaves on tasks with shared structure. None is fundamentally surprising.
Decision point
E7+E8+E9 is a complete battery on the CL question. Three candidate next steps pull in different directions:
- Stay in CL, harder benchmark: move from permuted-MNIST to Split-CIFAR or Sequential-MNIST-Variants. Tests whether the replay finding holds on tasks with non-trivial inter-task structure. Adds 1-2 hours of data-loading infrastructure.
- Test the system’s “online learning” claim directly: compare online per-example SGD + evolutionary topology (Synth) vs offline mini-batch SGD on the same architecture. Tests whether the online framing actually buys anything. This is the foundational research question the project’s positioning rests on, and as far as I can tell has not been examined rigorously.
- Return to ecological speciation’s original framing: heterogeneous-niche setups with related (not orthogonal) data distributions. Test whether replay+ecological speciation Pareto-dominates either alone for multi-task learning.
Checking in with the human at this point.