Raw structured experiment records for the typed-species NEAT integration (Group C) stream. Reproduced exactly as produced.
Structured records for each experiment.
C1: 4-way joint MLP baselines
Date: 2026-05-13
Binary: src/bin/group_c_baselines.rs
Hypothesis: establish the floor that Group C must beat. Dense MLPs trained via the existing seeded-genome + Network::forward/backward on the joint 4-way task (77 classes: MNIST 10 + Fashion 10 + KMNIST 10 + EMNIST-balanced 47).
Setup: DatasetSplit::load with all 4 datasets, train_fraction=5/6 (~244K train / ~48.8K test), 3 epochs of online SGD per seed, lr 0.01→0.001 linear, 3 seeds per architecture. Architectures: [64], [128], [128, 64]. hidden_input_fraction = 1.0 (dense MLP from inputs).
Results (mean ± std over 3 seeds, test accuracy):
| Arch | Overall | MNIST | Fashion | KMNIST | EMNIST | Conn |
|---|---|---|---|---|---|---|
| [64] | 0.758 ± 0.005 | 0.907 ± 0.007 | 0.775 ± 0.016 | 0.798 ± 0.010 | 0.648 ± 0.007 | 55,245 |
| [128] | 0.774 ± 0.006 | 0.918 ± 0.006 | 0.780 ± 0.018 | 0.825 ± 0.007 | 0.665 ± 0.003 | 110,413 |
| [128, 64] | 0.355 ± 0.266 | 0.581 ± 0.412 | 0.437 ± 0.294 | 0.345 ± 0.259 | 0.198 ± 0.202 | 113,741 |
Per-dataset difficulty ordering (consistent across single-hidden arms)
MNIST (~91%) > KMNIST (~82%) > Fashion (~78%) > EMNIST (~65%). EMNIST is the hardest, ~13pp below the next-hardest task (Fashion) and ~17pp below KMNIST, which lines up with its 47-class space (4.7× MNIST’s class count, each class has ~half the training data) and Group B’s per-task findings.
[128, 64] collapse: initialization is the bottleneck
The [128, 64] arm is anomalous. Per-seed: 59.6% (slow recovery from saturated softmax), 6.9% (never escapes), 40.2% (partial recovery). Stddev 26.6pp — by far the largest of any arm.
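The 26.6pp spread quoted above matches the sample standard deviation (n − 1 denominator) of the three per-seed accuracies; the n-denominator version would give about 21.8pp. A minimal sketch of that calculation follows. The function name is illustrative and is not taken from the codebase.

```rust
/// Mean and sample standard deviation (n - 1 denominator) of a slice.
/// Returns None when fewer than two values are given.
pub fn mean_and_sample_std(xs: &[f64]) -> Option<(f64, f64)> {
    if xs.len() < 2 {
        return None;
    }
    let n = xs.len() as f64;
    let mean = xs.iter().sum::<f64>() / n;
    // Sum of squared deviations, divided by (n - 1) for the sample variance.
    let var = xs.iter().map(|x| (x - mean).powi(2)).sum::<f64>() / (n - 1.0);
    Some((mean, var.sqrt()))
}
```

Calling `mean_and_sample_std(&[59.6, 6.9, 40.2])` gives a mean near 35.6 and a standard deviation near 26.7, in line with the 0.355 ± 0.266 row in the results table.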
Cause: Genome::new_seeded initializes connection weights as U(-1, 1). Calibrated for the existing system’s sparse start (hidden_input_fraction = 0.10), this puts σ ≈ 5.8 on the first hidden layer’s pre-activation when used dense. With two hidden layers and U(-1, 1) inter-layer weights, output logits start at σ ≈ 74 — softmax saturates near a single class. Whether SGD ever escapes depends on which class got the lucky logit, and the gradient through 2 saturated layers is small.
This means C1’s [128, 64] number is not a meaningful “depth helps?” data point; it’s a “this init can’t reach depth 2 at this LR” data point. Recording it for completeness and as motivation for either He init in new_seeded or sparse start in the baseline.
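The first-layer figure can be sanity-checked with the usual variance rule for a sum of independent zero-mean weights times inputs. The sketch below is hedged: the input mean-square value is an assumption back-solved from the quoted σ ≈ 5.8, not a measured statistic, and the function names are illustrative. It does not try to reproduce the σ ≈ 74 logit figure, which depends on the activation function.

```rust
/// Std of a pre-activation z = sum_i w_i * x_i for i.i.d. zero-mean weights
/// w_i ~ U(-a, a) independent of the inputs:
/// Var(z) = fan_in * Var(w) * E[x^2], with Var(w) = a^2 / 3.
pub fn preact_std_uniform(fan_in: usize, a: f64, input_mean_sq: f64) -> f64 {
    (fan_in as f64 * (a * a / 3.0) * input_mean_sq).sqrt()
}

/// Same quantity under He-style init, Var(w) = 2 / fan_in, so fan-in cancels.
pub fn preact_std_he(input_mean_sq: f64) -> f64 {
    (2.0 * input_mean_sq).sqrt()
}

/// ASSUMPTION: mean squared pixel value, chosen so the dense case lands near
/// the quoted sigma of about 5.8. It is not taken from the datasets.
pub const ASSUMED_INPUT_MEAN_SQ: f64 = 0.129;
```

Under that assumption, a dense 784-input layer gives σ ≈ 5.8, a 10%-sparse start (about 78 inputs) gives σ ≈ 1.8, and He-style scaling gives σ ≈ 0.5 regardless of fan-in, which is why the same U(-1, 1) draw is tolerable sparse but saturating dense.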
Floor that Group C must beat
For Group C comparisons, the operative floor is ~77% overall test accuracy (single-hidden-layer dense MLP at width 128, 3 epochs). The patch-matcher path got to 96.6% on MNIST alone with no hidden layer at all (C2), so the Group C hypothesis remains live — patches may dominate this dense baseline even before the evolutionary search starts.
C2: Integrated patch-matcher verifier (MNIST)
Date: 2026-05-13
Binary: src/bin/group_c_patch_verify.rs
Hypothesis: with the integration’s mechanical core landed (NodeGene patch slot, PatchTopo in phenotype, forward/backward branch, Genome::new_with_patches), a Group-B-shaped patch genome trained through the integrated forward/backward should hit Group B’s spatial-patch numbers on MNIST (~95-96%).
Setup: 320 spatial 5×5 patches, linear classifier head, 50K MNIST train / 10K test, 3 epochs, lr=0.05 (constant), seed 0xC2.
Result:
| Epoch | Test acc |
|---|---|
| 1 | 0.9628 |
| 2 | 0.9649 |
| 3 | 0.9664 |
Hits Group B’s spatial-patch ceiling on the first epoch. End-to-end integration verified — the patch forward/backward branch is correct.
C3: First patch-evolved population (4-way joint)
Date: 2026-05-13
Binary: src/bin/group_c_evolve.rs
Hypothesis: a population of patch-seeded genomes under NEAT-style evolution (patch index mutation + add_patch + connection ops, no scalar add_node) should reach or beat the dense MLP floor on the 4-way joint task with far fewer parameters.
Setup:
- Population 50, seeded 50/50 spatial vs random-index 5×5 patches (N_SEED_PATCHES = 64).
- 500K steps, batch size 100, LR 0.05 → 0.005 linear, evolve every 10K steps.
- Joint stream 25/25/25/25 across MNIST/Fashion/KMNIST/EMNIST-balanced (77 classes).
- Mutation: add_patch_prob=0.05, mutate_patch_indices_prob=0.30, per_patch_index_swap_prob=0.02, add_connection_prob=0.05, add_node_prob=0 (no scalar splits), remove_connection_prob=0.05.
- Seed 0xC301.
Result (top-5 individuals on test):
| id | fitness | test | MNIST | Fashion | KMNIST | EMNIST | patches | conn |
|---|---|---|---|---|---|---|---|---|
| 654 | 0.7835 | 0.7590 | 0.9075 | 0.8032 | 0.7887 | 0.6407 | 64 | 5,005 |
| 751 | 0.7807 | 0.7553 | 0.9081 | 0.8043 | 0.7858 | 0.6317 | 64 | 5,005 |
| 766 | 0.7805 | 0.7572 | 0.9075 | 0.8033 | 0.7858 | 0.6376 | 64 | 5,005 |
| 771 | 0.7801 | 0.7546 | 0.9022 | 0.8037 | 0.7836 | 0.6345 | 64 | 5,005 |
| 633 | 0.7795 | 0.7572 | 0.9079 | 0.8050 | 0.7856 | 0.6366 | 64 | 5,005 |
Vs C1 baselines:
| Method | Test | Connections | Conn ratio |
|---|---|---|---|
| [64] MLP | 0.758 | 55,245 | 1× |
| [128] MLP | 0.774 | 110,413 | 2× |
| C3 patches | 0.759 | 5,005 | 0.09× |
Patches match the [64] dense MLP’s accuracy with 11× fewer connections and trail [128] by 1.5pp with 22× fewer connections. The patch primitive is dramatically more parameter-efficient than dense layers on the 4-way task.
Per-dataset structure: MNIST 91% > Fashion 80% ≈ KMNIST 79% > EMNIST 64%. Same difficulty ordering as the dense MLP baselines. EMNIST stays hardest by ~15pp.
Key dynamics observation: add_patch additions don’t survive selection
best_patches stayed at 64 for the entire 50-generation run. avg_patches floated between 64.0 and 64.2: add_patch_prob=0.05 fires ~2.5 adds per generation across the 50-individual population (0.05 × 50), yet net growth is zero. New patches enter with random weights and need training time to be useful; in a 10K-step generation window they look strictly worse than mature patches and get culled in the next evolution step.
This is the NEAT-classic problem of “structural mutations look worse short-term,” solved there by behavior-preserving insertion (split keeps weight 1.0 + original on the path). The current add_patch_matcher uses head_weight = N(0, 0.1) for the new patch→target connection, so a new patch immediately adds random noise to the output and hurts fitness.
Fix for C4: initialize the new patch’s outgoing connection weight to 0.0. The patch contributes nothing at insertion time, fitness doesn’t drop, SGD then trains the connection weight upward if the patch’s features are useful, or leaves it near 0 if not. Behavior-preserving insertion, NEAT-style.
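A toy illustration of why a zero outgoing weight is behavior-preserving. This is a sketch with a hypothetical `ToyHead` type standing in for a single patch → target path; it is not the repository's `add_patch_matcher`, and the one-weight SGD step simply applies the generic rule that the gradient of a linear head weight is the target's error signal times the patch's activation.

```rust
/// Toy linear head: one logit computed as sum_p head_w[p] * act[p].
/// Hypothetical stand-in for the patch -> target connections.
pub struct ToyHead {
    pub head_w: Vec<f32>,
}

impl ToyHead {
    /// Logit for the given per-patch activations.
    pub fn logit(&self, acts: &[f32]) -> f32 {
        self.head_w.iter().zip(acts).map(|(w, a)| w * a).sum()
    }

    /// Append a new patch whose outgoing weight starts at `init_w`.
    pub fn add_patch(&mut self, init_w: f32) {
        self.head_w.push(init_w);
    }

    /// One SGD step on the newest head weight, using
    /// dL/dw = delta_target * post_act.
    pub fn sgd_step_newest(&mut self, delta_target: f32, post_act: f32, lr: f32) {
        if let Some(w) = self.head_w.last_mut() {
            *w -= lr * delta_target * post_act;
        }
    }
}
```

With `add_patch(0.0)` the logit is unchanged whatever the new patch's activation is, while a later `sgd_step_newest` call with a nonzero activation still moves the weight away from zero. With a nonzero initial weight, the logit shifts immediately by `init_w * act`.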
What C3 tells us about the typed-species hypothesis
C3 partially confirms the Group B hypothesis: patch index mutation works — selection preserves better patch placements over generations (fitness climbed from 0.66 to 0.81). But patch count mutation doesn’t work yet, so we haven’t tested whether evolution discovers the right number of patches per task. That’s the next experiment’s question.
Next: C4 candidates
- Fix add_patch insertion: head weight = 0.0 (behavior-preserving). Re-run with no other changes; see if best_patches grows.
- Larger seed, let pruning win: start at 128 or 256 patches, raise remove_connection_prob, see if evolution prunes to a smaller optimal set (lottery ticket).
- Ecological speciation: split into 4 pure-task niches + mixed; test whether KMNIST’s niche converges to a different patch geometry than the others.
C4: Behavior-preserving add_patch (head_weight = 0)
Date: 2026-05-13
Binary: src/bin/group_c_evolve.rs (same as C3, with one-line fix to add_patch_matcher)
Hypothesis: setting the new patch → target connection weight to 0 at insertion (instead of N(0, 0.1)) means the new patch contributes exactly 0 to the output, fitness doesn’t drop, and the patch survives long enough for SGD to find a use for it.
Setup: identical to C3 except head_weight = 0.0 in add_patch_matcher. Same seed.
Result (top-5 individuals on test):
| id | fitness | test | MNIST | Fashion | KMNIST | EMNIST | patches | conn |
|---|---|---|---|---|---|---|---|---|
| 776 | 0.7968 | 0.7432 | 0.8931 | 0.8048 | 0.7923 | 0.6045 | 64 | 5,006 |
| 704 | 0.7967 | 0.7513 | 0.8985 | 0.8065 | 0.7985 | 0.6185 | 66 | 5,007 |
| 764 | 0.7959 | 0.7475 | 0.9006 | 0.8043 | 0.7931 | 0.6115 | 65 | 5,008 |
| 759 | 0.7958 | 0.7526 | 0.9015 | 0.8084 | 0.7993 | 0.6188 | 66 | 5,008 |
| 772 | 0.7945 | 0.7503 | 0.9008 | 0.8067 | 0.7939 | 0.6171 | 64 | 5,006 |
Average top-5 test: ~0.749 (vs C3’s ~0.757). Patches moved (64-66 range; avg_patches 64.0 → 64.8) but only slightly. Net result: similar fitness, slightly lower test, patches did drift positive.
Why so little growth?
Three mechanisms throttle patch growth even with behavior-preserving insertion:
- Bootstrap is slow, not blocked. head_weight = 0 doesn’t freeze the patch: ∂L/∂head_weight = δ_target × post_act(patch), and post_act(patch) is nonzero for random He-init weights, so SGD trains head_weight upward whenever the patch’s random feature happens to correlate with the loss surface. But the initial gradient is small (one connection’s worth out of 5,005), so the patch grows from a noise-level contribution slowly.
- NEAT crossover loses disjoint genes from the lesser parent. When a 65-patch parent mates with a 64-patch parent, the new patch is a disjoint gene. In NEAT classic, disjoint genes are inherited only from the fitter parent. If the 65-patch parent is fitter, the patch survives; if not (and an immature added patch usually isn’t fitter), the offspring drops to 64. With 30% cull and tournament-of-3 selection, new-patch carriers need a real fitness advantage to persist, and an immature patch doesn’t yet have one.
- Steady-state arithmetic. Add rate ≈ 0.05 × 50 = 2.5 patches/gen across the population. Breeding overwrites a fraction of new patches each gen. Net per-generation growth is fractional, consistent with the observed +0.8 over 50 generations.
C4’s finding: behavior-preserving insertion is necessary but not sufficient. The NEAT-crossover-keeps-only-fitter-disjoints rule is the bigger blocker — even a non-harmful add gets bred out without a fitness advantage.
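The disjoint-gene rule can be sketched as follows. This is a simplified model of NEAT-classic crossover under stated assumptions: genes are reduced to an innovation id and a weight, the `Genes` alias and `crossover` function are hypothetical, and the caller supplies the coin flip so the example stays deterministic.

```rust
use std::collections::BTreeMap;

/// Minimal genome: innovation id -> weight. Hypothetical; not the repo's types.
pub type Genes = BTreeMap<u32, f32>;

/// Matching genes come from either parent (`pick_fitter()` returns true to
/// take the fitter parent's copy). Disjoint and excess genes are inherited
/// from the fitter parent only, so genes unique to the lesser parent vanish.
pub fn crossover(
    fitter: &Genes,
    lesser: &Genes,
    mut pick_fitter: impl FnMut() -> bool,
) -> Genes {
    let mut child = Genes::new();
    for (&id, &w_fit) in fitter {
        let w = match lesser.get(&id) {
            Some(&w_less) => {
                if pick_fitter() { w_fit } else { w_less }
            }
            None => w_fit, // disjoint gene on the fitter side: kept
        };
        child.insert(id, w);
    }
    child
}
```

If the parent carrying the extra patch gene is passed as `lesser`, the child never contains that gene, which is the loss path described above; it survives only when its carrier is the fitter parent.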
Next: C5a / C5b
Two follow-up experiments to disentangle this:
- C5a: same as C4 with add_patch_prob = 0.20 (4× higher). If patch growth is rate-limited rather than blocked, this should show stronger drift.
- C5b: same as C4 with N_SEED_PATCHES = 128. Tests whether the joint task wants more capacity than 64 patches (in which case test accuracy goes up, and growing patches via evolution is meaningful), or whether 64 is already at saturation (in which case the system is correctly stable and we should be looking at what the patches represent, not how many of them there are).
C5a: Higher add_patch_prob
Date: 2026-05-13
Setup: same as C4, ADD_PATCH_PROB=0.20 (4× C4), seed 50593.
Result (top-5 test):
| id | fitness | test | M | F | K | E | patches |
|---|---|---|---|---|---|---|---|
| 730 | 0.7781 | 0.7504 | 0.8942 | 0.8058 | 0.7945 | 0.6210 | 65 |
| 426 | 0.7778 | 0.7530 | 0.8956 | 0.8078 | 0.7956 | 0.6253 | 65 |
| 695 | 0.7768 | 0.7521 | 0.8957 | 0.8064 | 0.7943 | 0.6243 | 66 |
| 707 | 0.7768 | 0.7527 | 0.8950 | 0.8086 | 0.7969 | 0.6237 | 65 |
| 723 | 0.7760 | 0.7524 | 0.8957 | 0.8092 | 0.7963 | 0.6226 | 67 |
avg_patches went 64.0 → 65.3 (vs C4’s 64.8 and C3’s 64.1). Top-5 test ~75.2% — basically identical to C4 (~74.9%) and C3 (~75.7%) within noise. 4× add_patch_prob produced no meaningful accuracy gain.
The bottleneck isn’t insertion rate. Selection can’t tell a 65-patch individual apart from a 64-patch individual because the +1 patch’s marginal fitness contribution (~1/64 ≈ 1.5%) is below the per-generation fitness noise floor.
C5b: Larger seed (128 patches)
Date: 2026-05-13
Setup: same as C4, N_SEED_PATCHES=128 (2× C4), seed 50609.
Result (top-5 test):
| id | fitness | test | M | F | K | E | patches | conn |
|---|---|---|---|---|---|---|---|---|
| 752 | 0.8406 | 0.8202 | 0.9484 | 0.8275 | 0.8592 | 0.7273 | 128 | 9,933 |
| 784 | 0.8389 | 0.8181 | 0.9465 | 0.8281 | 0.8596 | 0.7223 | 128 | 9,933 |
| 537 | 0.8379 | 0.8244 | 0.9513 | 0.8284 | 0.8640 | 0.7338 | 128 | 9,933 |
| 656 | 0.8368 | 0.8245 | 0.9503 | 0.8273 | 0.8635 | 0.7353 | 128 | 9,933 |
| 769 | 0.8368 | 0.8189 | 0.9470 | 0.8260 | 0.8567 | 0.7269 | 128 | 9,933 |
Top-5 test mean 82.1%, best 82.45%. Beats every prior result.
Vs all prior runs
| Method | Test | Conn | Conn ratio | MNIST | Fashion | KMNIST | EMNIST |
|---|---|---|---|---|---|---|---|
| [64] MLP | 0.758 | 55,245 | 1× | 0.907 | 0.775 | 0.798 | 0.648 |
| [128] MLP | 0.774 | 110,413 | 2× | 0.918 | 0.780 | 0.825 | 0.665 |
| C3 (64p) | 0.759 | 5,005 | 0.09× | 0.908 | 0.803 | 0.789 | 0.641 |
| C4 (64p) | 0.749 | 5,006 | 0.09× | 0.898 | 0.806 | 0.794 | 0.612 |
| C5a (64p, 4×add) | 0.752 | 5,007 | 0.09× | 0.896 | 0.807 | 0.795 | 0.624 |
| C5b (128p) | 0.824 | 9,933 | 0.18× | 0.950 | 0.827 | 0.864 | 0.735 |
C5b is +5.0pp over [128] MLP at 11× fewer connections, and +6.5pp over the 64-patch population at 2× the connections. The biggest per-dataset gains relative to the 64-patch runs in the table: KMNIST +7pp and EMNIST +9-12pp, exactly the tasks with the most headroom.
What C5a + C5b together imply
- Patch count matters: 2× seed → +6.5pp overall, +11pp on EMNIST. The 64-patch baseline was under-capacity.
- Per-generation +1 patch mutation can’t climb this gradient: even at 4× add rate (C5a) net growth stays around +1.3 patches over 50 gens.
- The evolutionary architectural search is too slow given the fitness noise floor. A 1-patch change in a 64-patch network is statistically invisible to selection.
Next: C5c
Seed 256 patches and re-run. If C5c continues climbing, the joint task wants even more capacity and we should be running large initial seeds + pruning (lottery ticket pattern). If C5c plateaus at C5b’s number, 128 is approximately right for this 50-individual / 500K-step / 4-way config.
C5c: Larger seed (256 patches)
Date: 2026-05-13
Setup: same as C4, N_SEED_PATCHES=256, seed 50625.
Result (top-5 test):
| id | fitness | test | M | F | K | E | patches | conn |
|---|---|---|---|---|---|---|---|---|
| 661 | 0.8952 | 0.8577 | 0.9647 | 0.8629 | 0.9026 | 0.7740 | 256 | 19,790 |
| 762 | 0.8942 | 0.8573 | 0.9656 | 0.8615 | 0.9020 | 0.7737 | 256 | 19,790 |
| 732 | 0.8932 | 0.8565 | 0.9655 | 0.8607 | 0.9010 | 0.7726 | 256 | 19,790 |
| 664 | 0.8916 | 0.8585 | 0.9653 | 0.8596 | 0.9054 | 0.7762 | 256 | 19,789 |
| 744 | 0.8907 | 0.8560 | 0.9653 | 0.8620 | 0.9006 | 0.7711 | 256 | 19,790 |
Top-5 mean 85.72%, best 85.85%. Still climbing but the curve is flattening.
Capacity scaling (C3/C4 + C5a/C5b/C5c)
| Patches | Conn | Test | ΔTest from prior | MNIST | Fashion | KMNIST | EMNIST |
|---|---|---|---|---|---|---|---|
| 64 (C3) | 5K | 0.759 | — | 0.908 | 0.803 | 0.789 | 0.641 |
| 128 (C5b) | 10K | 0.824 | +6.5pp | 0.950 | 0.827 | 0.864 | 0.735 |
| 256 (C5c) | 20K | 0.857 | +3.3pp | 0.965 | 0.863 | 0.905 | 0.776 |
Each 2× in patches costs 2× connections and gives ~half the prior gain. Returns are diminishing but not yet near zero. MNIST is essentially saturated at 96.5%; Fashion / KMNIST / EMNIST still have headroom compared to Group B’s single-task ceilings (~88% / ~96% / ~82%).
Connection pruning is dead
best_conn=19790 and avg_conn=19790 for almost all of C5c. With remove_connection_prob=0.05 per-genome per generation (rather than per-connection), only ~125 disable events fire across the entire 50-gen × 50-pop run, against ~19,800 connections. Effective prune rate <1%. The lottery-ticket hypothesis can’t be tested at this prune rate.
Next
- C5d: seed 512 (one more doubling); find the plateau.
- C6: bump remove_connection_prob to a per-connection rate. Test whether evolution can compress a 256/512-seeded population back down.
- C7: macro-mutation: an add_patch_burst that adds 8-16 patches at once. Tests whether the per-+1 fitness-noise problem can be sidestepped by larger jumps.
C5d: Larger seed (512 patches)
Date: 2026-05-13
Setup: same as C4, N_SEED_PATCHES=512, seed 50641.
Result (top-5 test):
| id | fitness | test | M | F | K | E | patches | conn |
|---|---|---|---|---|---|---|---|---|
| 750 | 0.8865 | 0.8709 | 0.9706 | 0.8639 | 0.9220 | 0.7943 | 513 | 39,502 |
| 570 | 0.8838 | 0.8738 | 0.9719 | 0.8697 | 0.9223 | 0.7980 | 513 | 39,502 |
| 784 | 0.8837 | 0.8690 | 0.9685 | 0.8653 | 0.9166 | 0.7928 | 512 | 39,501 |
| 772 | 0.8831 | 0.8716 | 0.9698 | 0.8674 | 0.9199 | 0.7959 | 513 | 39,501 |
| 748 | 0.8828 | 0.8715 | 0.9698 | 0.8674 | 0.9209 | 0.7952 | 513 | 39,502 |
Top-5 mean 87.14%, best 87.38%.
Capacity scaling: gains roughly halve with each doubling
| Patches | Conn | Test | ΔTest from prior |
|---|---|---|---|
| 64 (C3) | 5K | 0.759 | — |
| 128 (C5b) | 10K | 0.824 | +6.5pp |
| 256 (C5c) | 20K | 0.857 | +3.3pp |
| 512 (C5d) | 40K | 0.871 | +1.4pp |
Each doubling roughly halves the gain. Extrapolated asymptote ~88% — the 4-way joint task’s practical ceiling at this LR/steps configuration with a single linear classifier on top of patches. 1024 patches would project to +0.7pp.
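The extrapolation is a geometric series. A minimal sketch, assuming the gain ratio per doubling is exactly 0.5 (an idealization of "roughly halves"; the observed ratios from the table are about 0.51 and 0.42). Function names are illustrative.

```rust
/// Gains between successive accuracies (same units as the input).
pub fn successive_gains(acc: &[f64]) -> Vec<f64> {
    acc.windows(2).map(|w| w[1] - w[0]).collect()
}

/// If each doubling multiplies the previous gain by `ratio` (0 < ratio < 1),
/// the remaining gains sum to last_gain * (ratio + ratio^2 + ...)
/// = last_gain * ratio / (1 - ratio).
pub fn extrapolated_asymptote(last_acc: f64, last_gain: f64, ratio: f64) -> f64 {
    assert!(ratio > 0.0 && ratio < 1.0, "ratio must lie strictly in (0, 1)");
    last_acc + last_gain * ratio / (1.0 - ratio)
}
```

For the table's values (0.759, 0.824, 0.857, 0.871) the gains are 6.5pp, 3.3pp and 1.4pp. With a ratio of 0.5 the next doubling projects to +0.7pp and the series sums to 0.871 + 0.014 = 0.885, in line with the ~88% asymptote quoted above.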
Per-dataset comparison to Group B single-task ceilings
| Task | C5d acc | Group B single-task best | Gap |
|---|---|---|---|
| MNIST | 0.972 | ~0.987 ([128] MLP), ~0.997 ([128,64]) | 1.5pp (need depth) |
| Fashion | 0.870 | ~0.88 | ~1pp (near ceiling) |
| KMNIST | 0.922 | ~0.957 (M=64 multilayer) | 3.5pp (depth or geom) |
| EMNIST | 0.798 | ~0.826 (M=0) | 2.8pp (more patches) |
The under-capacities are well-targeted by Group B’s task-conditional findings: MNIST and KMNIST want depth, EMNIST wants more raw patches. The patches-only architecture has a structural ceiling that depth would help break.
Connection prune signal is dead at default settings
Across all C5* runs, avg_conn is essentially flat: 5005/9933/19790/39502 for seed 64/128/256/512. remove_connection_prob=0.05 per-genome-per-gen produces <1% disable rate over the run. Lottery-ticket evolution can’t function here. Fixed in C6 with a new per_conn_remove_prob (per-connection, per-generation independent draws).
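The difference between the two pruning gates is easiest to see as expected counts. The sketch below assumes that a per-genome gate disables a single connection when it fires, and that the per-connection variant gives every enabled connection its own independent draw; the function names are hypothetical and the random source is passed in by the caller.

```rust
/// Expected disable events over a whole run when the gate is drawn once per
/// genome per generation and disables one connection when it fires.
pub fn expected_disables_per_genome_gate(prob: f64, pop: usize, gens: usize) -> f64 {
    prob * pop as f64 * gens as f64
}

/// Expected disables per individual per generation when each enabled
/// connection gets its own independent draw.
pub fn expected_disables_per_connection(prob: f64, enabled_conns: usize) -> f64 {
    prob * enabled_conns as f64
}

/// Per-connection pruning pass. `draw()` must return a uniform sample in
/// [0, 1). Returns how many connections were disabled.
pub fn prune_per_connection(
    enabled: &mut [bool],
    prob: f64,
    mut draw: impl FnMut() -> f64,
) -> usize {
    let mut disabled = 0;
    // Only currently enabled connections consume a draw.
    for e in enabled.iter_mut().filter(|e| **e) {
        if draw() < prob {
            *e = false;
            disabled += 1;
        }
    }
    disabled
}
```

At 0.05 per genome with population 50 and 50 generations, the first function gives 125 events for the whole run. At 0.005 per connection on 19,790 connections, the second gives about 99 disables per individual per generation.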
C6: Per-connection pruning from seed 256
Date: 2026-05-13
Setup: same as C4 with N_SEED_PATCHES=256 and PER_CONN_REMOVE_PROB=0.005 (per-conn, per-gen independent draws). Seed 50657.
Result (top-5 test):
| id | fitness | test | M | F | K | E | patches | conn |
|---|---|---|---|---|---|---|---|---|
| 611 | 0.8579 | 0.8516 | 0.9636 | 0.8570 | 0.8912 | 0.7681 | 257 | 18,996 |
| 647 | 0.8578 | 0.8521 | 0.9621 | 0.8579 | 0.8893 | 0.7708 | 257 | 18,939 |
| 587 | 0.8542 | 0.8537 | 0.9614 | 0.8582 | 0.8916 | 0.7739 | 256 | 19,221 |
| 565 | 0.8534 | 0.8535 | 0.9627 | 0.8559 | 0.8925 | 0.7734 | 257 | 19,030 |
| 727 | 0.8531 | 0.8509 | 0.9633 | 0.8571 | 0.8887 | 0.7677 | 257 | 19,081 |
Top-5 mean 85.24%, conn 19,053 (vs C5c’s 85.72% / 19,790).
Trade: −0.5pp accuracy for −4% connections over 50 generations. Pruning is mechanistically functional but the gain is small.
Why so little pruning even at 0.5% per-conn-per-gen?
Expected disables per individual per gen: 20K × 0.005 = 100. Observed avg_conn drop: ~25/gen. The other ~75/gen are being resurrected.
Mechanism: NEAT crossover treats enabled as part of the matching ConnectionGene and inherits the value 50/50 from either parent. If parent A has the connection disabled (after pruning) and parent B has it enabled, the child gets one or the other with equal probability. So at every breeding step, ~half of new prunes are undone.
For real lottery-ticket compression you’d want sticky-disabled crossover: a connection is enabled in the child iff it’s enabled in both parents (or in the fitter parent only, etc). That would let pruning accumulate generation over generation. With the current crossover, the effective per-gen prune rate is roughly per_conn_remove_prob × (1 - crossover_resurrection), and the resurrection fraction is large.
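The two inheritance rules for the enabled flag can be contrasted in a few lines. This is a sketch only: `EnabledRule` and `inherit_enabled` are hypothetical names, and the uniform sample `u` is supplied by the caller so the behavior is reproducible.

```rust
/// How a matching gene's `enabled` flag is passed to the child.
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum EnabledRule {
    /// Behaviour described in the text: take either parent's flag 50/50.
    CoinFlip,
    /// Sticky-disabled: enabled in the child only if enabled in both parents.
    BothEnabled,
}

/// `u` is a uniform sample in [0, 1); it is only consulted by `CoinFlip`.
pub fn inherit_enabled(rule: EnabledRule, a_enabled: bool, b_enabled: bool, u: f64) -> bool {
    match rule {
        EnabledRule::CoinFlip => {
            if u < 0.5 { a_enabled } else { b_enabled }
        }
        EnabledRule::BothEnabled => a_enabled && b_enabled,
    }
}
```

When one parent has pruned the connection and the other has not, `CoinFlip` re-enables it in about half of the children, while `BothEnabled` always keeps it disabled, so prunes can accumulate across generations.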
Takeaway
The seed-size lever (C5b/C5c/C5d) absolutely dominates the prune lever (C6) for this configuration. At seed 256 with prune, you get to ~85.2%; at seed 256 with no prune, you get to ~85.7%; at seed 512 with no prune, you get to 87.1%. A bigger initial patch seed beats trying to compress (population size is 50 in every run). Sticky-disabled crossover might change this verdict; deferred.
C7: Macro add_patch_burst from seed 64
Date: 2026-05-13
Setup: same as C4 with ADD_PATCH_BURST_PROB=0.20, ADD_PATCH_BURST_COUNT=8. New mutation add_patch_burst inserts 8 patches at once (all head_weight=0) when the per-generation gate fires. Tests whether the per-+1 fitness-noise floor can be sidestepped with macro architectural jumps.
Result (top-5 test):
| id | fitness | test | M | F | K | E | patches | conn |
|---|---|---|---|---|---|---|---|---|
| 750 | 0.8244 | 0.7574 | 0.8977 | 0.8109 | 0.7906 | 0.6368 | 65 | 5,006 |
| 767 | 0.8233 | 0.7570 | 0.8972 | 0.8094 | 0.7869 | 0.6387 | 65 | 5,008 |
| 732 | 0.8186 | 0.7576 | 0.8968 | 0.8095 | 0.7906 | 0.6385 | 65 | 5,007 |
| 709 | 0.8170 | 0.7585 | 0.8978 | 0.8085 | 0.7909 | 0.6405 | 65 | 5,007 |
| 774 | 0.8165 | 0.7539 | 0.8978 | 0.8083 | 0.7832 | 0.6327 | 64 | 5,005 |
Top-5 mean 75.69% — essentially indistinguishable from C3/C4/C5a (~75-76%).
Crucially, top individuals all stayed at 64-65 patches while avg_patches rose to 77.2 by gen 49. The bursts are firing (avg climbs from 64 → 77) but the resulting macro-mutants don’t reach the top of the population — selection prefers their 64-patch peers.
Why macro adds also fail
Even with head_weight=0 making the insertion behavior-preserving, the 8 new patches are cold — random indices, random internal weights. They contribute nothing useful initially. Their host individual now has 72 patches consuming compute capacity but only 64 trained. In the next 10K-step training window the new patches start to train, but they can’t catch the maturity of the original 64. Fitness of the 72-patch host is at or slightly below the 64-patch peers, so selection prefers the original.
Pattern across all “evolve patch count” attempts
| Experiment | Per-gen patch delta | Top patches | Top test |
|---|---|---|---|
| C3 (add 0.05, head N(0,.1)) | +1 conditional | 64 | 0.759 |
| C4 (add 0.05, head 0) | +1 conditional | 64-66 | 0.749 |
| C5a (add 0.20, head 0) | +1 conditional × 4 rate | 65-67 | 0.752 |
| C7 (burst 8, head 0) | +8 conditional | 64-65 | 0.757 |
| C5b (seed 128, no add) | 0 | 128 | 0.824 |
| C5c (seed 256, no add) | 0 | 256 | 0.857 |
| C5d (seed 512, no add) | 0 | 512 | 0.871 |
Patch count is not evolvable on this 50-gen × 10K-step budget. The structural problem isn’t insertion mechanics — it’s that fresh patches need training time, and selection happens before they get it. Patches that don’t yet contribute drag their host below mature-only peers.
What would unblock patch-count evolution?
Three plausible mechanisms, none implemented:
- Pre-trained patch insertions. Add patches whose indices and weights are sampled from a successful template (e.g., a random translation of an existing patch in the same genome). The new patch contributes immediately because its features are aligned to known-useful ones.
- Speciation that protects newcomers. Ecological niching where individuals with similar topology compete only within their niche. A 72-patch individual competes with other 72-patch individuals, not against 64s. This is exactly what NEAT speciation does in classic implementations, and it’s also Group B’s “ecological niche” hypothesis. Worth testing.
- Longer generation windows. If evolve_interval = 100K instead of 10K, new patches have 10× more training time before they’re judged. May or may not help; it depends on whether 100K steps is enough for an 8-patch cohort to catch the 64-patch crowd.
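The first mechanism in the list, inserting a translated copy of an existing patch, could look roughly like this. The sketch assumes patches store row-major pixel indices into a 28×28 image, which is not confirmed by the text; `translate_patch` is a hypothetical helper, and copying the template's weights alongside the shifted indices is left to the caller.

```rust
/// Image side length used throughout the experiments.
pub const SIDE: usize = 28;

/// Shift every pixel index of a patch by (dr, dc) on a SIDE x SIDE image.
/// Returns None if any pixel would leave the image, so the caller can
/// resample the offset instead of wrapping or clamping.
pub fn translate_patch(indices: &[usize], dr: isize, dc: isize) -> Option<Vec<usize>> {
    indices
        .iter()
        .map(|&i| {
            let r = (i / SIDE) as isize + dr;
            let c = (i % SIDE) as isize + dc;
            if r < 0 || c < 0 || r >= SIDE as isize || c >= SIDE as isize {
                None
            } else {
                Some(r as usize * SIDE + c as usize)
            }
        })
        .collect()
}
```

Rejecting out-of-bounds shifts keeps the patch shape intact, which matters if the copied weights are meant to stay aligned with the same relative pixel positions.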
For the typed-species hypothesis, this means the right way to evolve patch count is by ecological speciation, not direct fitness-driven mutation. C8 should test this.
C8: Ecological speciation across 4 datasets
Date: 2026-05-13
Binary: src/bin/group_c_niches.rs
Hypothesis: with 5 independent niches (4 pure-task + 1 mixed), each trained on its own data distribution and seeded identically (128 patches, half spatial / half random-index), each niche should evolve toward a different patch geometry. In particular, Group B’s KMNIST inversion (spatial locality flipped relative to MNIST) should appear as a difference in evolved patch distributions.
Setup: 5 niches, 300K steps each, pop 50, seed 128 patches, mutate_patch_indices_prob=0.30 per generation. Patch geometry stats computed at log intervals — (row_std, col_std) of per-pixel positions across the whole population, and edge_frac = fraction of patches with at least one pixel within 5 of the image border.
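A sketch of how the three geometry statistics could be computed. Several details are assumptions, since the text does not spell them out: pixel indices are row-major, the standard deviation uses the n denominator, and "within 5 of the border" means a row or column in the outer 5-pixel band. All names are hypothetical.

```rust
pub const SIDE: usize = 28;

/// Population standard deviation (n denominator); 0.0 for empty input.
fn pop_std(xs: &[f64]) -> f64 {
    if xs.is_empty() {
        return 0.0;
    }
    let n = xs.len() as f64;
    let mean = xs.iter().sum::<f64>() / n;
    (xs.iter().map(|x| (x - mean).powi(2)).sum::<f64>() / n).sqrt()
}

/// Row-major indices of a size x size spatial patch with the given corner.
pub fn spatial_patch(top: usize, left: usize, size: usize) -> Vec<usize> {
    (0..size)
        .flat_map(|dr| (0..size).map(move |dc| (top + dr) * SIDE + left + dc))
        .collect()
}

/// (row_std, col_std, edge_frac) over a set of patches.
pub fn geometry_stats(patches: &[Vec<usize>], margin: usize) -> (f64, f64, f64) {
    let mut rows = Vec::new();
    let mut cols = Vec::new();
    let mut edge_patches = 0usize;
    for patch in patches {
        let mut touches_edge = false;
        for &i in patch {
            let (r, c) = (i / SIDE, i % SIDE);
            rows.push(r as f64);
            cols.push(c as f64);
            if r < margin || c < margin || r >= SIDE - margin || c >= SIDE - margin {
                touches_edge = true;
            }
        }
        if touches_edge {
            edge_patches += 1;
        }
    }
    let edge_frac = if patches.is_empty() {
        0.0
    } else {
        edge_patches as f64 / patches.len() as f64
    };
    (pop_std(&rows), pop_std(&cols), edge_frac)
}
```

As a reference point, a coordinate drawn uniformly from 0..27 has standard deviation sqrt((28² − 1)/12) ≈ 8.08, so a population of purely random-index patches should show row_std and col_std near 8.1, while compact spatial patches pull the aggregate lower only insofar as their placements cluster.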
Per-niche accuracy (best individual, own task; 0% on others means zero cross-task transfer since outputs for unseen classes were never trained):
| Niche | Own-task test | C5b joint comparison | Gain from specialization |
|---|---|---|---|
| mnist | 96.8% | 95.0% | +1.8pp |
| fashion | 86.9% | 82.8% | +4.1pp |
| kmnist | 90.2% | 86.4% | +3.8pp |
| emnist | 78.3% | 73.5% | +4.8pp |
| mixed | 78.8% | (—) | (mixed niche, 300K vs C5b’s 500K) |
Pure-task niches beat joint training on their own task by 2-5pp. EMNIST gains the most (47 classes, the most under-capacity task). Cross-task accuracy is identically 0, consistent with main-stream Experiment 16’s zero-transfer finding.
The geometry result — Group B confirmed by evolution
| Niche | row_std | col_std | edge_frac | Group B prediction |
|---|---|---|---|---|
| mnist | 6.53 | 7.03 | 0.700 | spatial +0.6-0.9 *** → keep spatial ✓ |
| fashion | 8.12 | 8.11 | 1.000 | ~null → mixed/distributed ✓ (distributed) |
| kmnist | 8.08 | 8.14 | 1.000 | spatial −1.21 *** → inverted, distributed ✓ |
| emnist | 7.17 | 6.78 | 0.857 | spatial +1.03 *** → keep spatial ✓ (partial) |
| mixed | 7.92 | 8.10 | 1.000 | (averaging) — pulled to distributed by 3/4 |
Initial conditions: each population started with 50% PatchInit::Spatial and 50% PatchInit::Random. Expected initial (row_std, col_std) ≈ (7.45, 7.45); expected initial edge_frac ≈ 0.83.
MNIST drifted down in edge_frac (0.83 → 0.70) and down in row_std (7.45 → 6.53) — selection preserved spatial 5×5 patches over random-index. The KMNIST niche drifted up in edge_frac (0.83 → 1.00) — selection purged spatial patches in favor of random-index. Fashion did the same. EMNIST stayed nearer the spatial side (edge_frac 0.857) consistent with Group B finding spatial locality is positive for EMNIST too.
This is the typed-species hypothesis confirmed in dynamics: niching reproduces Group B’s per-task locality findings without being told what they are. KMNIST is the most frequent outlier in Group B’s transferability tally; KMNIST is also the niche that most aggressively rejects spatial patches in C8.
EMNIST’s anisotropy (row_std=7.17 > col_std=6.78) is curious — could reflect that EMNIST characters have a vertical-stroke bias (printed letters lean vertical) so column-position discriminative information is more spatially concentrated than row. Group B B33 noted EMNIST follows the rectangular-wide preference (+0.98 ***) — both findings consistent with a “vertical-stroke” reading of EMNIST.
What C8 says about the integration story
Group B’s strongest mechanistic claim — that the right patch geometry is task-conditional and KMNIST inverts — was the explicit motivation for Group C. C8 demonstrates that:
- The integration is doing real architectural work. Niching produces visibly different patch populations, not just different connection weights.
- The discovery mechanism is fitness-driven selection over many generations, not direct mutation of geometry parameters. mutate_patch_indices is the engine; selection is what makes different geometries dominant in different niches.
- Manual experimentation in Group B is replaceable by speciation in Group C. What took 35 Group B experiments to map (per-task locality directions) emerges as a population-level property in 30 minutes of niche training.
This satisfies the Group C charter and is the natural stopping point for this phase.
Open questions for later
- Anisotropy as a per-niche fingerprint (row_std vs col_std differing). Worth running multi-seed niches and seeing if anisotropy is reproducible.
- Patch count in niches. Re-test add_patch + add_patch_burst inside niches (where competition is restricted to similar topologies). Group B’s per-task data suggests EMNIST wants more patches than MNIST — would niching let evolution discover that?
- Cross-niche transfer. Take the MNIST niche’s best individual, transplant its patch geometry into a fresh genome, and train on Fashion. Does the geometry transfer cleanly, or does it need to re-evolve?
- What does each patch look like? Per-patch visualization (which 25 pixels does it weight?) could show whether MNIST’s niche has stroke-detector-like patches, KMNIST’s niche has more arbitrary feature detectors, etc. Group B’s mechanistic work (B31) related class-discriminability to evolved geometry — here we could do that empirically by mapping patches to their dominant input weights.
D1: Per-patch introspection (Group C / Phase D)
Date: 2026-05-13
Binary: src/bin/group_c_introspect.rs (niches with patch-viz dump at end)
Hypothesis: each niche’s evolved patches should look different. MNIST should converge on stroke/edge-like patches concentrated near the digit; KMNIST should have spatially scattered, near-random patches. The C8 geometry signature (edge_frac) should manifest visually.
Setup: same 5 niches as C8 (mnist, fashion, kmnist, emnist, mixed), 128 patches seeded 50/50 spatial+random, 300K steps per niche. At the end of each niche, dump three PGM files for the top individual:
- *_patches.pgm: mosaic where each cell is a 28×28 weight map of one patch (signed grayscale, in-patch pixels brightness ∝ weight, out-of-patch pixels dark)
- *_coverage.pgm: 28×28 heatmap of how often each input pixel is referenced across the individual’s patches
- *_popcoverage.pgm: same heatmap aggregated over the full population
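A coverage heatmap of this kind can be produced with a small amount of code. The sketch below is not the dump routine in group_c_introspect.rs; it assumes row-major pixel indices and a binary PGM (P5) with the maximum count scaled to 255, and both function names are hypothetical.

```rust
pub const SIDE: usize = 28;

/// Count how often each pixel is referenced across patches.
/// Indices outside the image are ignored.
pub fn coverage_counts(patches: &[Vec<usize>]) -> Vec<u32> {
    let mut counts = vec![0u32; SIDE * SIDE];
    for patch in patches {
        for &i in patch {
            if i < counts.len() {
                counts[i] += 1;
            }
        }
    }
    counts
}

/// Encode SIDE * SIDE counts as a binary PGM (P5): a short text header
/// followed by one byte per pixel, with the largest count mapped to 255.
pub fn encode_pgm(counts: &[u32]) -> Vec<u8> {
    assert_eq!(counts.len(), SIDE * SIDE, "expected one count per pixel");
    let max = counts.iter().copied().max().unwrap_or(0).max(1) as u64;
    let mut out = format!("P5\n{} {}\n255\n", SIDE, SIDE).into_bytes();
    out.extend(counts.iter().map(|&c| (c as u64 * 255 / max) as u8));
    out
}
```

The result can be written with `std::fs::write(path, encode_pgm(&counts))`. The payload is 784 bytes; the header adds 13 more in this encoding.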
Coverage stats (from group_c_analyze_pgm):
| Niche | Centroid (r, c) | Spread (r, c) | Center % | Edge % | Top-5%-pixel mass |
|---|---|---|---|---|---|
| mnist | (13.97, 13.89) | (6.79, 7.21) | 37.1% | 62.9% | 11.2% |
| fashion | (13.44, 13.47) | (8.05, 8.16) | 23.6% | 76.4% | 10.5% |
| kmnist | (13.66, 13.71) | (8.11, 8.16) | 23.8% | 76.2% | 10.5% |
| emnist | (11.96, 12.30) | (7.19, 6.48) | 38.0% | 62.0% | 11.7% |
| mixed | (13.30, 13.67) | (6.88, 6.75) | 37.1% | 62.9% | 11.9% |
A uniform random distribution would put 25% of mass in the 14×14 center region. MNIST (37%) and EMNIST (38%) concentrate well above uniform — selection preserved spatial 5×5 patches whose pixels cluster in the central region. Fashion (24%) and KMNIST (24%) are at or below uniform — selection drove the population to random-index patches that spread evenly across the image.
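The center-mass figure can be sketched as a simple ratio over the heatmap. This assumes the center region is the middle 14×14 window (rows and columns 7 through 20) of a row-major 28×28 coverage map; how group_c_analyze_pgm defines it exactly is not stated, and the function name is hypothetical.

```rust
pub const SIDE: usize = 28;

/// Fraction of total coverage mass inside the central `inner` x `inner`
/// window. Returns None for a wrongly sized map, an oversized window, or a
/// map with no mass.
pub fn center_mass_fraction(coverage: &[f64], inner: usize) -> Option<f64> {
    if coverage.len() != SIDE * SIDE || inner > SIDE {
        return None;
    }
    let total: f64 = coverage.iter().sum();
    if total <= 0.0 {
        return None;
    }
    let lo = (SIDE - inner) / 2;
    let hi = lo + inner;
    let mut center = 0.0;
    for r in lo..hi {
        for c in lo..hi {
            center += coverage[r * SIDE + c];
        }
    }
    Some(center / total)
}
```

A flat map gives 196/784 = 0.25, the uniform baseline against which the 37-38% (MNIST, EMNIST) and ~24% (Fashion, KMNIST) figures are read.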
This sharpens the C8 result: it’s not just “edge_frac ≈ 1.0 vs 0.7” but a concrete map of where the patches concentrate.
EMNIST anisotropy:
- Centroid (11.96, 12.30): offset from image center (13.5, 13.5) toward top-left
- col_std = 6.48 < row_std = 7.19: significantly tighter horizontally than vertically
EMNIST’s printed letters/digits have central vertical strokes; the discriminative pixels live in a horizontally-tight, slightly-above-center band. Patches concentrate there. This is consistent with Group B B33 (EMNIST follows the rectangular wide-preference: 3×9 wide patches beat 9×3 tall).
Per-task accuracy (rerun confirms C8 — D1 wasn’t intended as a fresh accuracy run but matches within noise):
| Niche | D1 own-task test | C8 own-task test |
|---|---|---|
| mnist | 97.0% | 96.8% |
| fashion | 86.5% | 86.9% |
| kmnist | 90.2% | 90.2% |
| emnist | 77.6% | 78.3% |
| mixed | 81.4% | 78.8% |
What D1 adds beyond C8
C8 showed each niche evolves to a different aggregate geometry (edge_frac). D1 measures what region of the input space each niche’s patches concentrate on, and visualizes the patches themselves. The two findings agree and the EMNIST anisotropy is a new, sharper signal: the niche evolved patches biased toward a specific band of the image consistent with the data’s class-discriminative geometry.
PGM files are in notes/group_c/runs/d1/*.pgm. Mosaic files are ~110KB each (16-col × 8-row grid of 28×28 weight maps with 1-pixel borders); coverage and population-coverage are 784-byte 28×28 heatmaps.
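The quoted mosaic size follows from the grid layout. As a rough check, assuming one byte per pixel and a 1-pixel border between tiles and around the outer edge (the exact border placement is an assumption, and `mosaic_dims` is a hypothetical helper):

```rust
/// Pixel dimensions (width, height) of a tiled mosaic with a border of
/// `border` pixels between tiles and around the outside edge.
fn mosaic_dims(cols: usize, rows: usize, tile: usize, border: usize) -> (usize, usize) {
    (
        cols * (tile + border) + border,
        rows * (tile + border) + border,
    )
}
```

With 16 columns, 8 rows, 28-pixel tiles and 1-pixel borders this gives 465 × 233 = 108,345 pixels, in line with the ~110KB file size once the short PGM header is added.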
D-prep: bugfix pass (silent-no-op connections, sticky-disabled crossover, dead-patch compilation, cycle-breaking sanitize)
Four invariant fixes landed before D2 to keep long-running experiments safe:
- add_connection excludes patch nodes as targets. Patches’ fan-in is via PatchTopo.indices; a ConnectionGene targeting a patch is silently ignored by the forward pass but inflates connection_count. Earlier Group C runs accumulated a small number of these inert connections.
- Sticky-disabled crossover (NEAT-classic 0.75 rule, exposed as EvolutionConfig.disable_inheritance_prob, default 0.0). When a matching connection gene has one parent disabled and the other enabled, the child inherits disabled with this probability. Lets pruning accumulate across generations instead of getting undone by the lesser parent’s enabled = true half the time. Default 0.0 preserves prior behavior.
- Dead-patch compilation skip. Patches with zero enabled outgoing connections are no longer added to PatchTopo; the topo position still computes via the empty conn_ranges loop (yielding 0). Saves a tiny amount of compute and keeps PatchTopo reflecting “live” patches.
- Genome::sanitize() drops connections referencing absent nodes and breaks cycles in the enabled subgraph by iteratively disabling one inter-cycle edge per pass. Called at the end of mutate() as a defensive guard. Without it, D2 panics during phenotype compilation: Kahn’s topo sort leaves cycle-trapped nodes out of topo_order, then a connection referencing one of them indexes into the incomplete node_index and triggers a “no entry found for key” panic at phenotype.rs:209.
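The sticky-disabled rule can be sketched as a pure function over the two parents' enabled flags. This is an illustration under stated assumptions, not the project's crossover code: the function name and parameters are hypothetical, the two random draws are passed in by the caller so the logic stays deterministic, and the fall-through to a 50/50 parent pick is inferred from the note that a probability of 0.0 preserves prior behavior.

```rust
/// Decide the child's `enabled` flag for a matching connection gene.
///
/// * `disable_prob` plays the role of EvolutionConfig.disable_inheritance_prob.
/// * `force_draw` and `pick_draw` are uniform samples in [0, 1) supplied by
///   the caller (hypothetical parameters, used here instead of an RNG).
fn child_enabled(
    a_enabled: bool,
    b_enabled: bool,
    disable_prob: f64,
    force_draw: f64,
    pick_draw: f64,
) -> bool {
    // Sticky rule: if exactly one parent has the gene disabled, force the
    // child to inherit "disabled" with probability `disable_prob`.
    if a_enabled != b_enabled && force_draw < disable_prob {
        return false;
    }
    // Otherwise fall back to picking either parent's flag with equal
    // probability (assumed to be the pre-fix behavior).
    if pick_draw < 0.5 { a_enabled } else { b_enabled }
}
```

With `disable_prob = 0.0` the first branch never fires, so the function reduces to the plain 50/50 pick; with a higher value, a disabled flag from either parent is more likely to survive into the child.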
The cycle introduction mechanism (still incompletely characterized) involves crossover combining matching connection genes from both parents whose enabled/disabled patterns are individually acyclic but together close a cycle. sanitize() is the principled fix; identifying the exact cycle-creating gene combination is open work.
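One way a cycle-breaking pass of this kind could work is sketched below. It is not the body of Genome::sanitize(): the `Conn` struct, `find_cycle_edge` and `break_cycles` are hypothetical names, and the choice of which edge to disable (the first back edge found by a depth-first search) is an assumption, since the notes only say that one edge is disabled per pass.

```rust
use std::collections::HashMap;

/// Minimal stand-in for a connection gene (hypothetical type).
#[derive(Debug, Clone)]
struct Conn {
    from: u32,
    to: u32,
    enabled: bool,
}

type Adjacency = HashMap<u32, Vec<(usize, u32)>>;

/// Depth-first search; returns the index of a back edge if one is reachable.
/// State codes: 0 = unvisited, 1 = on the current DFS path, 2 = finished.
fn dfs(node: u32, adj: &Adjacency, state: &mut HashMap<u32, u8>) -> Option<usize> {
    state.insert(node, 1);
    for &(idx, target) in &adj[&node] {
        let s = state[&target];
        if s == 1 {
            return Some(idx); // edge points back into the current path: a cycle
        }
        if s == 0 {
            if let Some(found) = dfs(target, adj, state) {
                return Some(found);
            }
        }
    }
    state.insert(node, 2);
    None
}

/// Index of one enabled connection that lies on a cycle, if any exists.
fn find_cycle_edge(conns: &[Conn]) -> Option<usize> {
    let mut adj: Adjacency = HashMap::new();
    for (i, c) in conns.iter().enumerate() {
        if c.enabled {
            adj.entry(c.from).or_default().push((i, c.to));
            adj.entry(c.to).or_default();
        }
    }
    let mut state: HashMap<u32, u8> = adj.keys().map(|&n| (n, 0u8)).collect();
    let mut nodes: Vec<u32> = adj.keys().copied().collect();
    nodes.sort(); // deterministic start order
    for start in nodes {
        if state[&start] == 0 {
            if let Some(i) = dfs(start, &adj, &mut state) {
                return Some(i);
            }
        }
    }
    None
}

/// Disable one cycle edge per pass until the enabled subgraph is acyclic.
/// Returns how many edges were disabled.
fn break_cycles(conns: &mut [Conn]) -> usize {
    let mut broken = 0;
    while let Some(i) = find_cycle_edge(conns) {
        conns[i].enabled = false;
        broken += 1;
    }
    broken
}
```

Each pass disables an edge that is guaranteed to sit on a cycle, so the loop terminates after at most one pass per enabled connection, and a topological sort over the remaining enabled edges then covers every node.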
D2: Patch-count evolution inside niches
Date: 2026-05-13
Binary: src/bin/group_c_niche_growth.rs
Hypothesis: when competitors share topology and task (inside an ecological niche), the +1-patch marginal fitness signal might escape the joint-task fitness noise floor that blocked C3/C4/C5a/C7. Predict EMNIST grows patches the most (most under-capacitated at 128); MNIST grows least (saturated).
Setup: same 5 niches as C8/D1 with add_patch_prob=0.10, add_patch_burst_prob=0.05 (burst count = 4), seed 128 patches. 300K steps per niche, otherwise default Group C config.
Result (top-3 by fitness per niche):
| Niche | Top fit | Top test | M | F | K | E | patches | Cycles broken |
|---|---|---|---|---|---|---|---|---|
| mnist | 0.9697 | 0.968 own | 0.968 | — | — | — | 129 | 0 |
| fashion | 0.8837 | 0.869 own | — | 0.869 | — | — | 129 | 0 |
| kmnist | 0.9225 | 0.904 own | — | — | 0.904 | — | 128 | 1 |
| emnist | 0.7966 | 0.775 own | — | — | — | 0.775 | 128 | 0 |
| mixed | 0.8127 | 0.813 joint | 0.940 | 0.829 | 0.841 | 0.722 | 129 | 0 |
avg_patches across the run: 128.0 → 128.3-129.0 in all niches (slight upward drift over 30 generations).
EMNIST’s top-3 had patch counts 128, 132, 128 — i.e., a rank-2 individual reached 132 patches and didn’t get culled. In C7 (joint task, same add_patch_burst config), top individuals all stayed at 64 with macro-mutants culled out. In-niche competition relaxes that culling enough to keep a 132-patch macro-mutant alive in the top tier — but not enough to make it the best individual.
Cycle-breaker firings: a single cycle was broken across the entire 5-niche, 1.5M-step run, in the KMNIST niche. The cycle bug is rare but catastrophic (it panics phenotype compilation when it hits, as observed in D2 v1-v3). sanitize() makes it harmless.
Per-task accuracy vs C8/D1 baseline (own task, no add_patch):
| Niche | C8/D1 baseline | D2 (with add_patch) | Δ |
|---|---|---|---|
| mnist | 96.8% | 96.8% | ≈0 |
| fashion | 86.9% | 86.9% | ≈0 |
| kmnist | 90.2% | 90.4% | +0.2pp |
| emnist | 78.3% | 77.5% | −0.8pp |
Within noise across all niches. The patch-add mutations are firing (avg_patches drifts up by ~0.3-1.0 over the run), but the new patches don’t move test accuracy: they enter, slightly drag fitness during training, and either get culled or persist as low-contribution patches.
Interpretation: niching doesn’t unblock patch-count evolution either
The C4-C7 finding generalizes: in-niche competition is not sufficient. The blocker isn’t really about “fitness noise floor at the joint task” — it’s specifically about training time. New patches need many thousands of steps to train up to usefulness, and selection happens before that, even when the comparison pool has similar topology.
What in-niche competition does relax: the 132-patch EMNIST individual reached rank 2 (vs being culled in C7’s joint task). Macro-mutants aren’t immediately killed in niches, but they also don’t reach the top.
The path to actually-evolvable patch count likely needs one of:
- Longer evolve intervals (50K+ steps between selections) — give new patches time to mature.
- Pre-trained patch insertions (e.g., translate-and-copy an existing successful patch).
- Network-level training-step counter tied to “patch maturity” used as a tiebreak in selection.
None of these are implemented. Recording D2 as a clean negative on the niching-unblocks-count hypothesis.
What D2 does establish
- The sanitize() defense (in particular the cycle-breaker) is load-bearing for long-running patch-add experiments. Without it, the run panics; with it, the run completes cleanly with a single cycle broken.
- Initial seed continues to dominate patch count for the practical purpose of reaching test accuracy. In-niche evolution affects geometry (D1/C8) and stabilizes index choice, but not count.
D3: Depth + niching
Date: 2026-05-13
Binary: src/bin/group_c_depth.rs
Hypothesis: insert a 32-node ReLU hidden layer between patches and outputs, run the C8 niches. Group B B25 found KMNIST gains +2.78pp from depth (with proper LR schedule); B34 found EMNIST loses even with proper schedule. Test whether the integrated, niched system reproduces these per-task depth findings.
Setup: Genome::new_with_patches(.., hidden_size = 32, ..) extension adds a 32-node ReLU hidden layer between patches and outputs. Patches → hidden (fully connected, He init), hidden → outputs (fully connected), bias → hidden and outputs. 128 patches seed (same as D1/C8), 300K steps per niche, same mutation config as D1 (add_patch_prob=0.0, patch-index evolution only).
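For scale, the usual He initialization draws weights with standard deviation sqrt(2 / fan_in). The sketch below only computes that scale; whether the project samples from a normal or a uniform distribution, and whether fan_in counts only the 128 patch inputs, are assumptions, and `he_std` is a hypothetical helper.

```rust
/// Standard deviation used by He initialization for a layer whose units
/// each receive `fan_in` incoming connections.
fn he_std(fan_in: usize) -> f64 {
    (2.0 / fan_in as f64).sqrt()
}
```

With 128 patches feeding each hidden node this gives a standard deviation of 0.125, far smaller than the spread of the U(-1, 1) default described for Genome::new_seeded.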
Result (top individual per niche, test accuracy):
| Niche | C8/D1 baseline | D3 (depth=32) | Δ | Group B prediction |
|---|---|---|---|---|
| mnist | 96.8% | 96.78% | ≈0 | null on saturated MNIST (B32) ✓ |
| fashion | 86.9% | 86.71% | ≈0 | (untested in Group B at proper schedule) |
| kmnist | 90.2% | 93.49% | +3.3pp | B25: +2.78pp ✓ |
| emnist | 78.3% | 75.56% | −2.7pp | B34: −1.11pp ✓ (sign matches, magnitude larger) |
| mixed | 81.4% | 78.57% | −2.8pp | averaging |
This is the cleanest Group B replication so far. Three of the four single-task niches have a Group B depth finding, and all three signs match (KMNIST positive, EMNIST negative, MNIST null). Fashion is new data and stays flat at its saturated ceiling. KMNIST’s +3.3pp is about 0.5pp above Group B’s +2.78pp; EMNIST’s −2.7pp has the same sign as B34’s −1.11pp but is larger in magnitude.
Connection efficiency: depth shrinks the network
D1 (no depth) at 128 patches: 9,933 connections. D3 (depth=32) at 128 patches: 6,669 connections (~33% fewer).
Architecture math:
- D1: 128 patches × 77 outputs + 77 bias→output = 9,856 + 77 = 9,933 connections.
- D3: 128 patches × 32 hidden + 32 hidden × 77 outputs + 32 bias→hidden + 77 bias→output = 4,096 + 2,464 + 32 + 77 = 6,669.
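The architecture math above can be checked mechanically. The two helpers below are hypothetical and simply encode the counting rules from the bullets (fully connected layers plus one bias connection per hidden and output node).

```rust
/// Connections when patches feed the outputs directly:
/// patch->output weights plus one bias connection per output.
fn connections_flat(patches: usize, outputs: usize) -> usize {
    patches * outputs + outputs
}

/// Connections with one fully connected hidden layer in between:
/// patch->hidden, hidden->output, bias->hidden and bias->output.
fn connections_with_hidden(patches: usize, hidden: usize, outputs: usize) -> usize {
    patches * hidden + hidden * outputs + hidden + outputs
}

/// Relative reduction in connection count when moving from `before` to `after`.
fn relative_reduction(before: usize, after: usize) -> f64 {
    (before as f64 - after as f64) / before as f64
}
```

For 128 patches, 32 hidden nodes and 77 outputs these return 9,933 and 6,669, a reduction of roughly 32.9%. The hidden layer is cheaper whenever it is narrow enough that patches × hidden + hidden × outputs + hidden stays below patches × outputs.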
So depth not only helps KMNIST but also shrinks the network. KMNIST gets +3.3pp accuracy with 33% fewer connections, a Pareto win. (EMNIST gets −2.7pp accuracy with the same 33% reduction, so for EMNIST depth trades accuracy for size rather than improving both.)
The mixed niche illustrates the ecological argument
| Method | Mixed test |
|---|---|
| D1 no depth | 81.4% |
| D3 depth=32 | 78.6% |
Adding depth uniformly to the mixed niche hurts by 2.8pp. The intuition: a single 32-node hidden layer is one fixed architectural decision. It helps KMNIST and hurts EMNIST. On the mixed task with both, the net is negative.
This is the ecological-speciation argument in concrete form: per-task depth selection is one of the things ecological niches can do but a single network can’t. D3’s mixed niche underperforms D1’s mixed niche; D3’s KMNIST niche beats D1’s. Niching captures task-conditional architectural value that homogeneous training can’t.
Summary of Phase D
| Experiment | Question | Answer |
|---|---|---|
| D1 | What do the per-niche evolved patches look like? | Per-niche spatial concentration map quantifies where each niche’s patches live; EMNIST shows an off-center anisotropy band consistent with vertical-stroke discrimination |
| D2 | Does niching unblock patch-count evolution? | No. Top individuals stay near the seed (128); macro-mutants survive in niches but don’t reach the top |
| D3 | Does niching reproduce Group B’s per-task depth findings? | Yes, cleanly. KMNIST +3.3pp, EMNIST −2.7pp, MNIST null, mixed −2.8pp |
Plus four bugfixes (add_connection patch exclusion, sticky-disabled crossover, dead-patch compilation skip, cycle-breaking sanitize).
Group B’s two strongest cross-task findings (locality direction from C8/D1; depth direction from D3) are now both reproduced by the typed-species NEAT integration as emergent niche-level behaviors. The two structural blockers identified are (a) patch-count evolution remains hard at this generation budget regardless of niching or mutation flavor (C3-C7, D2), and (b) rare cycle bugs in NEAT crossover need a sanitize defense.