Group C Experiments

Raw structured experiment records for the typed-species NEAT integration (Group C) stream. Reproduced exactly as produced.

Structured records for each experiment.

C1: 4-way joint MLP baselines

Date: 2026-05-13 Binary: src/bin/group_c_baselines.rs Hypothesis: establish the floor that Group C must beat. Dense MLPs trained via the existing seeded-genome + Network::forward/backward on the joint 4-way task (77 classes: MNIST 10 + Fashion 10 + KMNIST 10 + EMNIST-balanced 47).

Setup: DatasetSplit::load with all 4 datasets, train_fraction=5/6 (~244K train / ~48.8K test), 3 epochs of online SGD per seed, lr 0.01→0.001 linear, 3 seeds per architecture. Architectures: [64], [128], [128, 64]. hidden_input_fraction = 1.0 (dense MLP from inputs).

Results (mean ± std over 3 seeds, test accuracy):

Arch	Overall	MNIST	Fashion	KMNIST	EMNIST	Conn
`[64]`	0.758 ± 0.005	0.907 ± 0.007	0.775 ± 0.016	0.798 ± 0.010	0.648 ± 0.007	55,245
`[128]`	0.774 ± 0.006	0.918 ± 0.006	0.780 ± 0.018	0.825 ± 0.007	0.665 ± 0.003	110,413
`[128, 64]`	0.355 ± 0.266	0.581 ± 0.412	0.437 ± 0.294	0.345 ± 0.259	0.198 ± 0.202	113,741

Per-dataset difficulty ordering (consistent across single-hidden arms)

MNIST (~91%) > KMNIST (~82%) > Fashion (~78%) > EMNIST (~65%). EMNIST is the hardest by ~17pp, which lines up with its 47-class space (2.5× MNIST’s class count, each class has ~half the training data) and Group B’s per-task findings.

`[128, 64]` collapse: initialization is the bottleneck

The [128, 64] arm is anomalous. Per-seed: 59.6% (slow recovery from saturated softmax), 6.9% (never escapes), 40.2% (partial recovery). Stddev 26.6pp — by far the largest of any arm.

Cause: Genome::new_seeded initializes connection weights as U(-1, 1). Calibrated for the existing system’s sparse start (hidden_input_fraction = 0.10), this puts σ ≈ 5.8 on the first hidden layer’s pre-activation when used dense. With two hidden layers and U(-1, 1) inter-layer weights, output logits start at σ ≈ 74 — softmax saturates near a single class. Whether SGD ever escapes depends on which class got the lucky logit, and the gradient through 2 saturated layers is small.

This means C1’s [128, 64] number is not a meaningful “depth helps?” data point; it’s a “this init can’t reach depth 2 at this LR” data point. Recording it for completeness and as motivation for either He init in new_seeded or sparse start in the baseline.

Floor that Group C must beat

For Group C comparisons, the operative floor is ~77% overall test accuracy (single-hidden-layer dense MLP at width 128, 3 epochs). The patch-matcher path got to 96.6% on MNIST alone with no hidden layer at all (C2), so the Group C hypothesis remains live — patches may dominate this dense baseline even before the evolutionary search starts.

C2: Integrated patch-matcher verifier (MNIST)

Date: 2026-05-13 Binary: src/bin/group_c_patch_verify.rs Hypothesis: with the integration’s mechanical core landed (NodeGene patch slot, PatchTopo in phenotype, forward/backward branch, Genome::new_with_patches), a Group-B-shaped patch genome trained through the integrated forward/backward should hit Group B’s spatial-patch numbers on MNIST (~95-96%).

Setup: 320 spatial 5×5 patches, linear classifier head, 50K MNIST train / 10K test, 3 epochs, lr=0.05 (constant), seed 0xC2.

Result:

Epoch	Test acc
1	0.9628
2	0.9649
3	0.9664

Hits Group B’s spatial-patch ceiling on the first epoch. End-to-end integration verified — the patch forward/backward branch is correct.

C3: First patch-evolved population (4-way joint)

Date: 2026-05-13 Binary: src/bin/group_c_evolve.rs Hypothesis: a population of patch-seeded genomes under NEAT-style evolution (patch index mutation + add_patch + connection ops, no scalar add_node) should reach or beat the dense MLP floor on the 4-way joint task with far fewer parameters.

Setup:

Population 50, seeded 50/50 spatial vs random-index 5×5 patches (N_SEED_PATCHES = 64).
500K steps, batch size 100, LR 0.05 → 0.005 linear, evolve every 10K steps.
Joint stream 25/25/25/25 across MNIST/Fashion/KMNIST/EMNIST-balanced (77 classes).
Mutation: add_patch_prob=0.05, mutate_patch_indices_prob=0.30, per_patch_index_swap_prob=0.02, add_connection_prob=0.05, add_node_prob=0 (no scalar splits), remove_connection_prob=0.05.
Seed 0xC301.

Result (top-5 individuals on test):

id	fitness	test	MNIST	Fashion	KMNIST	EMNIST	patches	conn
654	0.7835	0.7590	0.9075	0.8032	0.7887	0.6407	64	5,005
751	0.7807	0.7553	0.9081	0.8043	0.7858	0.6317	64	5,005
766	0.7805	0.7572	0.9075	0.8033	0.7858	0.6376	64	5,005
771	0.7801	0.7546	0.9022	0.8037	0.7836	0.6345	64	5,005
633	0.7795	0.7572	0.9079	0.8050	0.7856	0.6366	64	5,005

Vs C1 baselines:

Method	Test	Connections	Conn ratio
`[64]` MLP	0.758	55,245	1×
`[128]` MLP	0.774	110,413	2×
C3 patches	0.759	5,005	0.09×

Patches match the [64] dense MLP’s accuracy with 11× fewer connections and trail [128] by 1.5pp with 22× fewer connections. The patch primitive is dramatically more parameter-efficient than dense layers on the 4-way task.

Per-dataset structure: MNIST 91% > Fashion 80% ≈ KMNIST 79% > EMNIST 64%. Same difficulty ordering as the dense MLP baselines. EMNIST stays hardest by ~15pp.

Key dynamics observation: add_patch additions don’t survive selection

best_patches stayed at 64 for the entire 50-generation run. avg_patches floated between 64.0 and 64.2 — add_patch_prob=0.05 fires ~1 add per individual per generation on average, yet net growth is zero. New patches enter with random weights and need training time to be useful; in a 10K-step generation window they look strictly worse than mature patches and get culled in the next evolution step.

This is the NEAT-classic problem of “structural mutations look worse short-term,” solved there by behavior-preserving insertion (split keeps weight 1.0 + original on the path). The current add_patch_matcher uses head_weight = N(0, 0.1) for the new patch→target connection, so a new patch immediately adds random noise to the output and hurts fitness.

Fix for C4: initialize the new patch’s outgoing connection weight to 0.0. The patch contributes nothing at insertion time, fitness doesn’t drop, SGD then trains the connection weight upward if the patch’s features are useful, or leaves it near 0 if not. Behavior-preserving insertion, NEAT-style.

What C3 tells us about the typed-species hypothesis

C3 partially confirms the Group B hypothesis: patch index mutation works — selection preserves better patch placements over generations (fitness climbed from 0.66 to 0.81). But patch count mutation doesn’t work yet, so we haven’t tested whether evolution discovers the right number of patches per task. That’s the next experiment’s question.

Next: C4 candidates

Fix add_patch insertion — head weight = 0.0 (behavior-preserving). Re-run with no other changes; see if best_patches grows.
Larger seed, let pruning win — start at 128 or 256 patches, raise remove_connection_prob, see if evolution prunes to a smaller optimal set (lottery ticket).
Ecological speciation — split into 4 pure-task niches + mixed; test whether KMNIST’s niche converges to a different patch geometry than the others.

C4: Behavior-preserving add_patch (head_weight = 0)

Date: 2026-05-13 Binary: src/bin/group_c_evolve.rs (same as C3, with one-line fix to add_patch_matcher) Hypothesis: setting the new patch → target connection weight to 0 at insertion (instead of N(0, 0.1)) means the new patch contributes exactly 0 to the output, fitness doesn’t drop, and the patch survives long enough for SGD to find a use for it.

Setup: identical to C3 except head_weight = 0.0 in add_patch_matcher. Same seed.

Result (top-5 individuals on test):

id	fitness	test	MNIST	Fashion	KMNIST	EMNIST	patches	conn
776	0.7968	0.7432	0.8931	0.8048	0.7923	0.6045	64	5,006
704	0.7967	0.7513	0.8985	0.8065	0.7985	0.6185	66	5,007
764	0.7959	0.7475	0.9006	0.8043	0.7931	0.6115	65	5,008
759	0.7958	0.7526	0.9015	0.8084	0.7993	0.6188	66	5,008
772	0.7945	0.7503	0.9008	0.8067	0.7939	0.6171	64	5,006

Average top-5 test: ~0.749 (vs C3’s ~0.757). Patches moved (64-66 range; avg_patches 64.0 → 64.8) but only slightly. Net result: similar fitness, slightly lower test, patches did drift positive.

Why so little growth?

Three mechanisms throttle patch growth even with behavior-preserving insertion:

Bootstrap is slow, not blocked. head_weight = 0 doesn’t freeze the patch: ∂L/∂head_weight = δ_target × post_act(patch), and post_act(patch) is nonzero for random He-init weights, so SGD trains head_weight upward whenever the patch’s random feature happens to correlate with the loss surface. But the initial gradient is small (one connection’s worth out of 5,005), so the patch grows from a noise-level contribution slowly.
NEAT crossover loses disjoint genes from the lesser parent. When a 65-patch parent mates with a 64-patch parent, the new patch is a disjoint gene. In NEAT classic, disjoint genes are inherited only from the fitter parent. If the 65-patch parent is fitter, the patch survives; if not (and an immature added patch usually isn’t fitter), the offspring drops to 64. With 30% cull and tournament-of-3 selection, new-patch carriers need a real fitness advantage to persist, and an immature patch doesn’t yet have one.
Steady-state arithmetic. Add rate ≈ 0.05 × 50 = 2.5 patches/gen across the population. Breeding overwrites a fraction of new patches each gen. Net per-generation growth is fractional, consistent with the observed +0.8 over 50 generations.

C4’s finding: behavior-preserving insertion is necessary but not sufficient. The NEAT-crossover-keeps-only-fitter-disjoints rule is the bigger blocker — even a non-harmful add gets bred out without a fitness advantage.

Next: C5a / C5b

Two follow-up experiments to disentangle this:

C5a — same as C4 with add_patch_prob = 0.20 (4× higher). If patch growth is rate-limited rather than blocked, this should show stronger drift.
C5b — same as C4 with N_SEED_PATCHES = 128. Tests whether the joint task wants more capacity than 64 patches (in which case test accuracy goes up, and growing patches via evolution is meaningful), or whether 64 is already at saturation (in which case the system is correctly stable and we should be looking at what the patches represent, not how many of them there are).

C5a: Higher add_patch_prob

Date: 2026-05-13 Setup: same as C4, ADD_PATCH_PROB=0.20 (4× C4), seed 50593.

Result (top-5 test):

id	fitness	test	M	F	K	E	patches
730	0.7781	0.7504	0.8942	0.8058	0.7945	0.6210	65
426	0.7778	0.7530	0.8956	0.8078	0.7956	0.6253	65
695	0.7768	0.7521	0.8957	0.8064	0.7943	0.6243	66
707	0.7768	0.7527	0.8950	0.8086	0.7969	0.6237	65
723	0.7760	0.7524	0.8957	0.8092	0.7963	0.6226	67

avg_patches went 64.0 → 65.3 (vs C4’s 64.8 and C3’s 64.1). Top-5 test ~75.2% — basically identical to C4 (~74.9%) and C3 (~75.7%) within noise. 4× add_patch_prob produced no meaningful accuracy gain.

The bottleneck isn’t insertion rate. Selection can’t tell a 65-patch individual apart from a 64-patch individual because the +1 patch’s marginal fitness contribution (~1/64 ≈ 1.5%) is below the per-generation fitness noise floor.

C5b: Larger seed (128 patches)

Date: 2026-05-13 Setup: same as C4, N_SEED_PATCHES=128 (2× C4), seed 50609.

Result (top-5 test):

id	fitness	test	M	F	K	E	patches	conn
752	0.8406	0.8202	0.9484	0.8275	0.8592	0.7273	128	9,933
784	0.8389	0.8181	0.9465	0.8281	0.8596	0.7223	128	9,933
537	0.8379	0.8244	0.9513	0.8284	0.8640	0.7338	128	9,933
656	0.8368	0.8245	0.9503	0.8273	0.8635	0.7353	128	9,933
769	0.8368	0.8189	0.9470	0.8260	0.8567	0.7269	128	9,933

Top-5 test mean 82.1%, best 82.45%. Beats every prior result.

Vs all prior runs

Method	Test	Conn	Conn ratio	MNIST	Fashion	KMNIST	EMNIST
`[64]` MLP	0.758	55,245	1×	0.907	0.775	0.798	0.648
`[128]` MLP	0.774	110,413	2×	0.918	0.780	0.825	0.665
C3 (64p)	0.759	5,005	0.09×	0.908	0.803	0.789	0.641
C4 (64p)	0.749	5,006	0.09×	0.898	0.806	0.794	0.612
C5a (64p, 4×add)	0.752	5,007	0.09×	0.896	0.807	0.795	0.624
C5b (128p)	0.824	9,933	0.18×	0.950	0.827	0.864	0.735

C5b is +5.0pp over [128] MLP at 11× fewer connections, and +6.5pp over the 64-patch population at 2× the connections. The biggest per-dataset gains: KMNIST +6pp and EMNIST +9-11pp, exactly the tasks with the most headroom.

What C5a + C5b together imply

Patch count matters: 2× seed → +6.5pp overall, +11pp on EMNIST. The 64-patch baseline was under-capacitied.
Per-generation +1 patch mutation can’t climb this gradient: even at 4× add rate (C5a) net growth stays around +1.3 patches over 50 gens.
The evolutionary architectural search is too slow given the fitness noise floor. A 1-patch change in a 64-patch network is statistically invisible to selection.

Next: C5c

Seed 256 patches and re-run. If C5c continues climbing, the joint task wants even more capacity and we should be running large initial seeds + pruning (lottery ticket pattern). If C5c plateaus at C5b’s number, 128 is approximately right for this 50-individual / 500K-step / 4-way config.

C5c: Larger seed (256 patches)

Date: 2026-05-13 Setup: same as C4, N_SEED_PATCHES=256, seed 50625.

Result (top-5 test):

id	fitness	test	M	F	K	E	patches	conn
661	0.8952	0.8577	0.9647	0.8629	0.9026	0.7740	256	19,790
762	0.8942	0.8573	0.9656	0.8615	0.9020	0.7737	256	19,790
732	0.8932	0.8565	0.9655	0.8607	0.9010	0.7726	256	19,790
664	0.8916	0.8585	0.9653	0.8596	0.9054	0.7762	256	19,789
744	0.8907	0.8560	0.9653	0.8620	0.9006	0.7711	256	19,790

Top-5 mean 85.72%, best 85.85%. Still climbing but the curve is flattening.

Capacity scaling (C3/C4 + C5a/C5b/C5c)

Patches	Conn	Test	ΔTest from prior	MNIST	Fashion	KMNIST	EMNIST
64 (C3)	5K	0.759	—	0.908	0.803	0.789	0.641
128 (C5b)	10K	0.824	+6.5pp	0.950	0.827	0.864	0.735
256 (C5c)	20K	0.857	+3.6pp	0.965	0.863	0.905	0.776

Each 2× in patches costs 2× connections and gives ~half the prior gain. Returns are diminishing but not yet near zero. MNIST is essentially saturated at 96.5%; Fashion / KMNIST / EMNIST still have headroom compared to Group B’s single-task ceilings (~88% / ~96% / ~82%).

Connection pruning is dead

best_conn=19790 and avg_conn=19790 for almost all of C5c. With remove_connection_prob=0.05 per-genome per generation (rather than per-connection), only ~125 disable events fire across the entire 50-gen × 50-pop run, against ~19,800 connections. Effective prune rate <1%. The lottery-ticket hypothesis can’t be tested at this prune rate.

C5d — seed 512 (one more doubling); find the plateau.
C6 — bump remove_connection_prob to a per-connection rate. Test whether evolution can compress a 256/512-seeded population back down.
C7 — macro-mutation: an add_patch_burst that adds 8-16 patches at once. Tests whether the per-+1 fitness-noise problem can be sidestepped by larger jumps.

C5d: Larger seed (512 patches)

Date: 2026-05-13 Setup: same as C4, N_SEED_PATCHES=512, seed 50641.

Result (top-5 test):

id	fitness	test	M	F	K	E	patches	conn
750	0.8865	0.8709	0.9706	0.8639	0.9220	0.7943	513	39,502
570	0.8838	0.8738	0.9719	0.8697	0.9223	0.7980	513	39,502
784	0.8837	0.8690	0.9685	0.8653	0.9166	0.7928	512	39,501
772	0.8831	0.8716	0.9698	0.8674	0.9199	0.7959	513	39,501
748	0.8828	0.8715	0.9698	0.8674	0.9209	0.7952	513	39,502

Top-5 mean 87.14%, best 87.38%.

Capacity scaling — log-linear with halving gains

Patches	Conn	Test	ΔTest from prior
64 (C3)	5K	0.759	—
128 (C5b)	10K	0.824	+6.5pp
256 (C5c)	20K	0.857	+3.3pp
512 (C5d)	40K	0.871	+1.4pp

Each doubling roughly halves the gain. Extrapolated asymptote ~88% — the 4-way joint task’s practical ceiling at this LR/steps configuration with a single linear classifier on top of patches. 1024 patches would project to +0.7pp.

Per-dataset comparison to Group B single-task ceilings

Task	C5d acc	Group B single-task best	Gap
MNIST	0.972	~0.987 ([128] MLP), ~0.997 ([128,64])	1.5pp (need depth)
Fashion	0.870	~0.88	~1pp (near ceiling)
KMNIST	0.922	~0.957 (M=64 multilayer)	3.5pp (depth or geom)
EMNIST	0.798	~0.826 (M=0)	2.8pp (more patches)

The under-capacities are well-targeted by Group B’s task-conditional findings: MNIST and KMNIST want depth, EMNIST wants more raw patches. The patches-only architecture has a structural ceiling that depth would help break.

Connection prune signal is dead at default settings

Across all C5* runs, avg_conn is essentially flat: 5005/9933/19790/39502 for seed 64/128/256/512. remove_connection_prob=0.05 per-genome-per-gen produces <1% disable rate over the run. Lottery-ticket evolution can’t function here. Fixed in C6 with a new per_conn_remove_prob (per-connection, per-generation independent draws).

C6: Per-connection pruning from seed 256

Date: 2026-05-13 Setup: same as C4 with N_SEED_PATCHES=256 and PER_CONN_REMOVE_PROB=0.005 (per-conn, per-gen independent draws). Seed 50657.

Result (top-5 test):

id	fitness	test	M	F	K	E	patches	conn
611	0.8579	0.8516	0.9636	0.8570	0.8912	0.7681	257	18,996
647	0.8578	0.8521	0.9621	0.8579	0.8893	0.7708	257	18,939
587	0.8542	0.8537	0.9614	0.8582	0.8916	0.7739	256	19,221
565	0.8534	0.8535	0.9627	0.8559	0.8925	0.7734	257	19,030
727	0.8531	0.8509	0.9633	0.8571	0.8887	0.7677	257	19,081

Top-5 mean 85.24%, conn 19,053 (vs C5c’s 85.72% / 19,790).

Trade: −0.5pp accuracy for −4% connections over 50 generations. Pruning is mechanistically functional but the gain is small.

Why so little pruning even at 0.5% per-conn-per-gen?

Expected disables per individual per gen: 20K × 0.005 = 100. Observed avg_conn drop: ~25/gen. The other ~75/gen are being resurrected.

Mechanism: NEAT crossover treats enabled as part of the matching ConnectionGene and inherits the value 50/50 from either parent. If parent A has the connection disabled (after pruning) and parent B has it enabled, the child gets one or the other with equal probability. So at every breeding step, ~half of new prunes are undone.

For real lottery-ticket compression you’d want sticky-disabled crossover: a connection is enabled in the child iff it’s enabled in both parents (or in the fitter parent only, etc). That would let pruning accumulate generation over generation. With the current crossover, the effective per-gen prune rate is roughly per_conn_remove_prob × (1 - crossover_resurrection), and the resurrection fraction is large.

Takeaway

The seed-size lever (C5b/C5c/C5d) absolutely dominates the prune lever (C6) for this configuration. At seed 256 with prune, you get to ~85.2%; at seed 256 with no prune, you get to ~85.7%; at seed 512 with no prune, you get to 87.1%. Bigger initial population beats trying to compress. Sticky-disabled crossover might change this verdict; deferred.

C7: Macro `add_patch_burst` from seed 64

Date: 2026-05-13 Setup: same as C4 with ADD_PATCH_BURST_PROB=0.20, ADD_PATCH_BURST_COUNT=8. New mutation add_patch_burst inserts 8 patches at once (all head_weight=0) when the per-generation gate fires. Tests whether the per-+1 fitness-noise floor can be sidestepped with macro architectural jumps.

Result (top-5 test):

id	fitness	test	M	F	K	E	patches	conn
750	0.8244	0.7574	0.8977	0.8109	0.7906	0.6368	65	5,006
767	0.8233	0.7570	0.8972	0.8094	0.7869	0.6387	65	5,008
732	0.8186	0.7576	0.8968	0.8095	0.7906	0.6385	65	5,007
709	0.8170	0.7585	0.8978	0.8085	0.7909	0.6405	65	5,007
774	0.8165	0.7539	0.8978	0.8083	0.7832	0.6327	64	5,005

Top-5 mean 75.69% — essentially indistinguishable from C3/C4/C5a (~75-76%).

Crucially, top individuals all stayed at 64-65 patches while avg_patches rose to 77.2 by gen 49. The bursts are firing (avg climbs from 64 → 77) but the resulting macro-mutants don’t reach the top of the population — selection prefers their 64-patch peers.

Why macro adds also fail

Even with head_weight=0 making the insertion behavior-preserving, the 8 new patches are cold — random indices, random internal weights. They contribute nothing useful initially. Their host individual now has 72 patches consuming compute capacity but only 64 trained. In the next 10K-step training window the new patches start to train, but they can’t catch the maturity of the original 64. Fitness of the 72-patch host is at or slightly below the 64-patch peers, so selection prefers the original.

Pattern across all “evolve patch count” attempts

Experiment	Per-gen patch delta	Top patches	Top test
C3 (add 0.05, head N(0,.1))	+1 conditional	64	0.759
C4 (add 0.05, head 0)	+1 conditional	64-66	0.749
C5a (add 0.20, head 0)	+1 conditional × 4 rate	65-67	0.752
C7 (burst 8, head 0)	+8 conditional	64-65	0.757
C5b (seed 128, no add)	0	128	0.824
C5c (seed 256, no add)	0	256	0.857
C5d (seed 512, no add)	0	512	0.871

Patch count is not evolvable on this 50-gen × 10K-step budget. The structural problem isn’t insertion mechanics — it’s that fresh patches need training time, and selection happens before they get it. Patches that don’t yet contribute drag their host below mature-only peers.

What would unblock patch-count evolution?

Three plausible mechanisms, none implemented:

Pre-trained patch insertions. Add patches whose indices and weights are sampled from a successful template (e.g., a random translation of an existing patch in the same genome). The new patch contributes immediately because its features are aligned to known-useful ones.
Speciation that protects newcomers. Ecological niching where individuals with similar topology compete only within their niche. A 72-patch individual competes with other 72-patch individuals, not against 64s. This is exactly what NEAT speciation does in classic implementations — and it’s also Group B’s “ecological niche” hypothesis. Worth testing.
Longer generation windows. If evolve_interval = 100K instead of 10K, new patches have 10× more training time before they’re judged. May or may not help — depends on whether 100K steps is enough for an 8-patch cohort to catch the 64-patch crowd.

For the typed-species hypothesis, this means the right way to evolve patch count is by ecological speciation, not direct fitness-driven mutation. C8 should test this.

C8: Ecological speciation across 4 datasets

Date: 2026-05-13 Binary: src/bin/group_c_niches.rs Hypothesis: with 5 independent niches (4 pure-task + 1 mixed), each trained on its own data distribution and seeded identically (128 patches, half spatial / half random-index), each niche should evolve toward a different patch geometry. In particular, Group B’s KMNIST inversion (spatial locality flipped relative to MNIST) should appear as a difference in evolved patch distributions.

Setup: 5 niches, 300K steps each, pop 50, seed 128 patches, mutate_patch_indices_prob=0.30 per generation. Patch geometry stats computed at log intervals — (row_std, col_std) of per-pixel positions across the whole population, and edge_frac = fraction of patches with at least one pixel within 5 of the image border.

Per-niche accuracy (best individual, own task; 0% on others means zero cross-task transfer since outputs for unseen classes were never trained):

Niche	Own-task test	C5b joint comparison	Gain from specialization
mnist	96.8%	95.0%	+1.8pp
fashion	86.9%	82.8%	+4.1pp
kmnist	90.2%	86.4%	+3.8pp
emnist	78.3%	73.5%	+4.8pp
mixed	78.8%	(—)	(mixed niche, 300K vs C5b’s 500K)

Pure-task niches beat joint training on their own task by 2-5pp. EMNIST gains the most (47 classes, the most under-capacitied task). Cross-task accuracy is identically 0 — consistent with main-stream Experiment 16’s zero-transfer finding.

The geometry result — Group B confirmed by evolution

Niche	row_std	col_std	edge_frac	Group B prediction
mnist	6.53	7.03	0.700	spatial +0.6-0.9 *** → keep spatial ✓
fashion	8.12	8.11	1.000	~null → mixed/distributed ✓ (distributed)
kmnist	8.08	8.14	1.000	spatial −1.21 *** → inverted, distributed ✓
emnist	7.17	6.78	0.857	spatial +1.03 *** → keep spatial ✓ (partial)
mixed	7.92	8.10	1.000	(averaging) — pulled to distributed by 3/4

Initial conditions: each population started with 50% PatchInit::Spatial and 50% PatchInit::Random. Expected initial (row_std, col_std) ≈ (7.45, 7.45); expected initial edge_frac ≈ 0.83.

MNIST drifted down in edge_frac (0.83 → 0.70) and down in row_std (7.45 → 6.53) — selection preserved spatial 5×5 patches over random-index. The KMNIST niche drifted up in edge_frac (0.83 → 1.00) — selection purged spatial patches in favor of random-index. Fashion did the same. EMNIST stayed nearer the spatial side (edge_frac 0.857) consistent with Group B finding spatial locality is positive for EMNIST too.

This is the typed-species hypothesis confirmed in dynamics: niching reproduces Group B’s per-task locality findings without being told what they are. KMNIST is the most frequent outlier in Group B’s transferability tally; KMNIST is also the niche that most aggressively rejects spatial patches in C8.

EMNIST’s anisotropy (row_std=7.17 > col_std=6.78) is curious — could reflect that EMNIST characters have a vertical-stroke bias (printed letters lean vertical) so column-position discriminative information is more spatially concentrated than row. Group B B33 noted EMNIST follows the rectangular-wide preference (+0.98 ***) — both findings consistent with a “vertical-stroke” reading of EMNIST.

What C8 says about the integration story

Group B’s strongest mechanistic claim — that the right patch geometry is task-conditional and KMNIST inverts — was the explicit motivation for Group C. C8 demonstrates that:

The integration is doing real architectural work. Niching produces visibly different patch populations, not just different connection weights.
The discovery mechanism is fitness-driven selection over many generations, not direct mutation of geometry parameters. mutate_patch_indices is the engine; selection is what makes different geometries dominant in different niches.
Manual experimentation in Group B is replaceable by speciation in Group C. What took 35 Group B experiments to map (per-task locality directions) emerges as a population-level property in 30 minutes of niche training.

This satisfies the Group C charter and is the natural stopping point for this phase.

Open questions for later

Anisotropy as a per-niche fingerprint (row_std vs col_std differing). Worth running multi-seed niches and seeing if anisotropy is reproducible.
Patch count in niches. Re-test add_patch + add_patch_burst inside niches (where competition is restricted to similar topologies). Group B’s per-task data suggests EMNIST wants more patches than MNIST — would niching let evolution discover that?
Cross-niche transfer. Take the MNIST niche’s best individual, transplant its patch geometry into a fresh genome, and train on Fashion. Does the geometry transfer cleanly, or does it need to re-evolve?
What does each patch look like? Per-patch visualization (which 25 pixels does it weight?) could show whether MNIST’s niche has stroke-detector-like patches, KMNIST’s niche has more arbitrary feature detectors, etc. Group B’s mechanistic work (B31) related class-discriminability to evolved geometry — here we could do that empirically by mapping patches to their dominant input weights.

D1: Per-patch introspection (Group C / Phase D)

Date: 2026-05-13 Binary: src/bin/group_c_introspect.rs (niches with patch-viz dump at end) Hypothesis: each niche’s evolved patches should look different. MNIST should converge on stroke/edge-like patches concentrated near the digit; KMNIST should have spatially scattered, near-random patches. The C8 geometry signature (edge_frac) should manifest visually.

Setup: same 5 niches as C8 (mnist, fashion, kmnist, emnist, mixed), 128 patches seeded 50/50 spatial+random, 300K steps per niche. At the end of each niche, dump three PGM files for the top individual:

*_patches.pgm: mosaic where each cell is a 28×28 weight map of one patch (signed grayscale, in-patch pixels brightness ∝ weight, out-of-patch pixels dark)
*_coverage.pgm: 28×28 heatmap of how often each input pixel is referenced across the individual’s patches
*_popcoverage.pgm: same heatmap aggregated over the full population

Coverage stats (from group_c_analyze_pgm):

Niche	Centroid (r, c)	Spread (r, c)	Center %	Edge %	Top-5%-pixel mass
mnist	(13.97, 13.89)	(6.79, 7.21)	37.1%	62.9%	11.2%
fashion	(13.44, 13.47)	(8.05, 8.16)	23.6%	76.4%	10.5%
kmnist	(13.66, 13.71)	(8.11, 8.16)	23.8%	76.2%	10.5%
emnist	(11.96, 12.30)	(7.19, 6.48)	38.0%	62.0%	11.7%
mixed	(13.30, 13.67)	(6.88, 6.75)	37.1%	62.9%	11.9%

A uniform random distribution would put 25% of mass in the 14×14 center region. MNIST (37%) and EMNIST (38%) concentrate well above uniform — selection preserved spatial 5×5 patches whose pixels cluster in the central region. Fashion (24%) and KMNIST (24%) are at or below uniform — selection drove the population to random-index patches that spread evenly across the image.

This sharpens the C8 result: it’s not just “edge_frac ≈ 1.0 vs 0.7” but a concrete map of where the patches concentrate.

EMNIST anisotropy:

Centroid (11.96, 12.30) — offset from image center (13.5, 13.5) toward top-left
col_std = 6.48 < row_std = 7.19 — significantly tighter horizontally than vertically

EMNIST’s printed letters/digits have central vertical strokes; the discriminative pixels live in a horizontally-tight, slightly-above-center band. Patches concentrate there. This is consistent with Group B B33 (EMNIST follows the rectangular wide-preference: 3×9 wide patches beat 9×3 tall).

Per-task accuracy (rerun confirms C8 — D1 wasn’t intended as a fresh accuracy run but matches within noise):

Niche	D1 own-task test	C8 own-task test
mnist	97.0%	96.8%
fashion	86.5%	86.9%
kmnist	90.2%	90.2%
emnist	77.6%	78.3%
mixed	81.4%	78.8%

What D1 adds beyond C8

C8 showed each niche evolves to a different aggregate geometry (edge_frac). D1 measures what region of the input space each niche’s patches concentrate on, and visualizes the patches themselves. The two findings agree and the EMNIST anisotropy is a new, sharper signal: the niche evolved patches biased toward a specific band of the image consistent with the data’s class-discriminative geometry.

PGM files are in notes/group_c/runs/d1/*.pgm. Mosaic files are ~110KB each (16-col × 8-row grid of 28×28 weight maps with 1-pixel borders); coverage and population-coverage are 784-byte 28×28 heatmaps.

D-prep: bugfix pass (silent-no-op connections, sticky-disabled crossover, dead-patch compilation, cycle-breaking sanitize)

Four invariant fixes landed before D2 to keep long-running experiments safe:

add_connection excludes patch nodes as targets. Patches’ fan-in is via PatchTopo.indices; a ConnectionGene targeting a patch is silently ignored by the forward pass but inflates connection_count. Earlier Group C runs accumulated a small number of these inert connections.
Sticky-disabled crossover (NEAT-classic 0.75 rule, exposed as EvolutionConfig.disable_inheritance_prob, default 0.0). When a matching connection gene has one parent disabled and the other enabled, the child inherits disabled with this probability. Lets pruning accumulate across generations instead of getting undone by the lesser parent’s enabled = true half the time. Default 0.0 preserves prior behavior.
Dead-patch compilation skip. Patches with zero enabled outgoing connections are no longer added to PatchTopo; the topo position still computes via the empty conn_ranges loop (yielding 0). Saves a tiny amount of compute and keeps PatchTopo reflecting “live” patches.
Genome::sanitize() drops connections referencing absent nodes and breaks cycles in the enabled subgraph by iteratively disabling one inter-cycle edge per pass. Called at the end of mutate() as a defensive guard. Without it, D2 panics during phenotype compilation: Kahn’s topo sort leaves cycle-trapped nodes out of topo_order, then a connection referencing one of them indexes into the incomplete node_index and triggers a no entry found for key panic at phenotype.rs:209.

The cycle introduction mechanism (still incompletely characterized) involves crossover combining matching connection genes from both parents whose enabled/disabled patterns are individually acyclic but together close a cycle. sanitize() is the principled fix; identifying the exact cycle-creating gene combination is open work.

D2: Patch-count evolution inside niches

Date: 2026-05-13 Binary: src/bin/group_c_niche_growth.rs Hypothesis: when competitors share topology and task (inside an ecological niche), the +1-patch marginal fitness signal might escape the joint-task fitness noise floor that blocked C3/C4/C5a/C7. Predict EMNIST grows patches the most (most under-capacitied at 128); MNIST grows least (saturated).

Setup: same 5 niches as C8/D1 with add_patch_prob=0.10, add_patch_burst_prob=0.05 (burst count = 4), seed 128 patches. 300K steps per niche, otherwise default Group C config.

Result (top-3 by fitness per niche):

Niche	Top fit	Top test	M	F	K	E	patches	Cycles broken
mnist	0.9697	0.968 own	0.968	—	—	—	129	0
fashion	0.8837	0.869 own	—	0.869	—	—	129	0
kmnist	0.9225	0.904 own	—	—	0.904	—	128	1
emnist	0.7966	0.775 own	—	—	—	0.775	128	0
mixed	0.8127	0.813 joint	0.940	0.829	0.841	0.722	129	0

avg_patches across the run: 128.0 → 128.3-129.0 in all niches (slight upward drift over 30 generations).

EMNIST’s top-3 had patch counts 128, 132, 128 — i.e., a rank-2 individual reached 132 patches and didn’t get culled. In C7 (joint task, same add_patch_burst config), top individuals all stayed at 64 with macro-mutants culled out. In-niche competition relaxes that culling enough to keep a 132-patch macro-mutant alive in the top tier — but not enough to make it the best individual.

Cycle-breaker firings: 1 single cycle broken across the entire 5-niche, 1.5M-step run, all in KMNIST. The cycle bug is rare but catastrophic (panics phenotype compilation when it hits, as observed in D2 v1-v3). sanitize() makes it harmless.

Per-task accuracy vs C8/D1 baseline (own task, no add_patch):

Niche	C8/D1 baseline	D2 (with add_patch)	Δ
mnist	96.8%	96.8%	≈0
fashion	86.9%	86.9%	≈0
kmnist	90.2%	90.4%	+0.2pp
emnist	78.3%	77.5%	−0.8pp

Within noise across all niches. The patch-add mutations are firing (avg_patches drifts up by ~0.5-1.0 over the run) but the new patches don’t move the test accuracy needle — they enter, slightly drag fitness during training, and either get culled or persist as low-contribution patches.

Interpretation: niching doesn’t unblock patch-count evolution either

The C4-C7 finding generalizes: in-niche competition is not sufficient. The blocker isn’t really about “fitness noise floor at the joint task” — it’s specifically about training time. New patches need many thousands of steps to train up to usefulness, and selection happens before that, even when the comparison pool has similar topology.

What in-niche competition does relax: the 132-patch EMNIST individual reached rank 2 (vs being culled in C7’s joint task). Macro-mutants aren’t immediately killed in niches, but they also don’t reach the top.

The path to actually-evolvable patch count likely needs one of:

Longer evolve intervals (50K+ steps between selections) — give new patches time to mature.
Pre-trained patch insertions (e.g., translate-and-copy an existing successful patch).
Network-level training-step counter tied to “patch maturity” used as a tiebreak in selection.

None of these are implemented. Recording D2 as a clean negative on the niching-unblocks-count hypothesis.

What D2 does establish

The sanitize() defense (in particular the cycle-breaker) is load-bearing for long-running patch-add experiments. Without it, the run panics; with it, the run completes cleanly with a single cycle broken.
Initial seed continues to dominate patch count for the practical purpose of reaching test accuracy. In-niche evolution affects geometry (D1/C8) and stabilizes index choice, but not count.

D3: Depth + niching

Date: 2026-05-13 Binary: src/bin/group_c_depth.rs Hypothesis: insert a 32-node ReLU hidden layer between patches and outputs, run the C8 niches. Group B B25 found KMNIST gains +2.78pp from depth (with proper LR schedule); B34 found EMNIST loses even with proper schedule. Test whether the integrated, niched system reproduces these per-task depth findings.

Setup: Genome::new_with_patches(.., hidden_size = 32, ..) extension adds a 32-node ReLU hidden layer between patches and outputs. Patches → hidden (fully connected, He init), hidden → outputs (fully connected), bias → hidden and outputs. 128 patches seed (same as D1/C8), 300K steps per niche, same mutation config as D1 (add_patch_prob=0.0, patch-index evolution only).

Result (top individual per niche, test accuracy):

Niche	D1 baseline	D3 (depth=32)	Δ	Group B prediction
mnist	96.8%	96.78%	≈0	null on saturated MNIST (B32) ✓
fashion	86.9%	86.71%	≈0	(untested in Group B at proper schedule)
kmnist	90.2%	93.49%	+3.3pp	B25: +2.78pp ✓
emnist	78.3%	75.56%	−2.7pp	B34: −1.11pp ✓ (sign matches, magnitude larger)
mixed	81.4%	78.57%	−2.8pp	averaging

This is the cleanest Group B replication so far. Three out of four per-task signs match Group B’s depth findings exactly (KMNIST positive, EMNIST negative, MNIST null). Fashion is novel data — flat at the saturated ceiling. KMNIST’s +3.3pp is within 0.5pp of Group B’s +2.78pp; EMNIST’s −2.7pp is in the same sign as B34’s −1.11pp but larger in magnitude.

Connection efficiency: depth shrinks the network

D1 (no depth) at 128 patches: 9,933 connections. D3 (depth=32) at 128 patches: 6,669 connections (~33% fewer).

Architecture math:

D1: 128 patches × 77 outputs + 77 bias = 9,856 + 77 = 9,933 patch→output connections.
D3: 128 patches × 32 hidden + 32 hidden × 77 outputs + 32 bias→hidden + 77 bias→output = 4,096 + 2,464 + 32 + 77 = 6,669.

So depth not only helps KMNIST but also shrinks the network. KMNIST gets +3.3pp accuracy and -33% connections — a Pareto win. (EMNIST gets −2.7pp accuracy and -33% connections — a Pareto loss.)

The mixed niche illustrates the ecological argument

Method	Mixed test
D1 no depth	81.4%
D3 depth=32	78.6%

Adding depth uniformly to the mixed niche hurts by 2.8pp. The intuition: a single 32-node hidden layer is one fixed architectural decision. It helps KMNIST and hurts EMNIST. On the mixed task with both, the net is negative.

This is the ecological-speciation argument in concrete form: per-task depth selection is one of the things ecological niches can do but a single network can’t. D3’s mixed niche underperforms D1’s mixed niche; D3’s KMNIST niche beats D1’s. Niching captures task-conditional architectural value that homogeneous training can’t.

Summary of Phase D

Experiment	Question	Answer
D1	What do the per-niche evolved patches look like?	Per-niche spatial concentration map quantifies where each niche’s patches live; EMNIST shows an off-center anisotropy band consistent with vertical-stroke discrimination
D2	Does niching unblock patch-count evolution?	No. Top individuals stay near the seed (128); macro-mutants survive in niches but don’t reach the top
D3	Does niching reproduce Group B’s per-task depth findings?	Yes, cleanly. KMNIST +3.3pp, EMNIST −2.7pp, MNIST null, mixed −2.8pp

Plus four bugfixes (add_connection patch exclusion, sticky-disabled crossover, dead-patch compilation skip, cycle-breaking sanitize).

Group B’s two strongest cross-task findings (locality direction from C8/D1; depth direction from D3) are now both reproduced by the typed-species NEAT integration as emergent niche-level behaviors. The two structural blockers identified are (a) patch-count evolution remains hard at this gen budget regardless of niching or mutation flavor (C3-C7, D2), and (b) rare cycle bugs in NEAT crossover need a sanitize defense.

C1: 4-way joint MLP baselines

Per-dataset difficulty ordering (consistent across single-hidden arms)

[128, 64] collapse: initialization is the bottleneck

Floor that Group C must beat

C2: Integrated patch-matcher verifier (MNIST)

C3: First patch-evolved population (4-way joint)

Key dynamics observation: add_patch additions don’t survive selection

What C3 tells us about the typed-species hypothesis

Next: C4 candidates

C4: Behavior-preserving add_patch (head_weight = 0)

Why so little growth?

Next: C5a / C5b

C5a: Higher add_patch_prob

C5b: Larger seed (128 patches)

Vs all prior runs

What C5a + C5b together imply

Next: C5c

C5c: Larger seed (256 patches)

Capacity scaling (C3/C4 + C5a/C5b/C5c)

Connection pruning is dead

Next

C5d: Larger seed (512 patches)

Capacity scaling — log-linear with halving gains

Per-dataset comparison to Group B single-task ceilings

Connection prune signal is dead at default settings

C6: Per-connection pruning from seed 256

Why so little pruning even at 0.5% per-conn-per-gen?

Takeaway

C7: Macro add_patch_burst from seed 64

Why macro adds also fail

Pattern across all “evolve patch count” attempts

What would unblock patch-count evolution?

C8: Ecological speciation across 4 datasets

The geometry result — Group B confirmed by evolution

What C8 says about the integration story

Open questions for later

D1: Per-patch introspection (Group C / Phase D)

What D1 adds beyond C8

D-prep: bugfix pass (silent-no-op connections, sticky-disabled crossover, dead-patch compilation, cycle-breaking sanitize)

D2: Patch-count evolution inside niches

Interpretation: niching doesn’t unblock patch-count evolution either

What D2 does establish

D3: Depth + niching

Connection efficiency: depth shrinks the network

The mixed niche illustrates the ecological argument

Summary of Phase D

`[128, 64]` collapse: initialization is the bottleneck

C7: Macro `add_patch_burst` from seed 64