Raw structured experiment records for the typed-species (Group B) stream. Reproduced exactly as produced.

Group B Experiments

Structured experiment records. Each entry: hypothesis, setup, result, takeaway.

B1: Prototype matcher (raw dot product)

Date: 2026-05-07 Binary: src/bin/proto.rs Hypothesis: A single dot product against the per-class mean training image carries enough signal to classify MNIST above chance.

Setup:

50K MNIST training images → 10 prototypes (per-class pixel mean, no normalization)
10K held-out test images (last 10K of train-images-idx3-ubyte)
Classification: argmax over 10 raw dot products

Result: 64.99% accuracy (6499/10000)

Per-class accuracy: 0:97.3, 1:48.1, 2:70.9, 3:73.0, 4:53.6, 5:0.0, 6:80.1, 7:64.8, 8:92.2, 9:65.7

Failure mode: raw dot product is dominated by prototype pixel mass. The “8” and “0” prototypes have the most active pixels and score high against everything. Class 5 hits 0% because the “0” prototype is denser than the “5” prototype in the regions where digit 5 has ink — the 5 image scores higher against “0” than “5”.
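The density bias can be reproduced on a toy case. The sketch below is only an illustration, not the code in src/bin/proto.rs: the 4-pixel templates and the names `dot`, `classify_raw`, and `toy_prototypes` are made up for this example.

```rust
/// Dot product of two equal-length pixel vectors.
fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// Index of the prototype with the highest raw (unnormalized) dot product.
fn classify_raw(prototypes: &[Vec<f32>], image: &[f32]) -> usize {
    let mut best = 0;
    let mut best_score = f32::NEG_INFINITY;
    for (class, proto) in prototypes.iter().enumerate() {
        let score = dot(proto, image);
        if score > best_score {
            best_score = score;
            best = class;
        }
    }
    best
}

/// Toy templates (hypothetical values, not MNIST means):
/// class 0 is dense, class 1 is sparse.
fn toy_prototypes() -> Vec<Vec<f32>> {
    vec![
        vec![0.9, 0.9, 0.9, 0.9], // dense template
        vec![0.8, 0.0, 0.8, 0.0], // sparse template
    ]
}
```

An image shaped exactly like the sparse template, `[1.0, 0.0, 1.0, 0.0]`, scores 1.8 against the dense template and only 1.6 against its own, so `classify_raw` picks the dense class. This is the same mechanism that sends every 5 to the "0" prototype.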

Takeaway: there is real signal in dot-product matching, but the unnormalized version is structurally biased toward dense templates. Any honest patch-matcher experiment has to beat the normalized version of this baseline.

Next: B2 — same matcher with cosine similarity (or zero-mean prototypes) to set a fair calibration floor.

B2: Normalized prototype matcher

Date: 2026-05-07 Binary: src/bin/proto_norm.rs Hypothesis: cosine similarity removes the magnitude bias that crushed B1, exposing the actual shape-matching signal.

Setup: same 50K/10K split as B1. Three variants in one binary:

raw dot product (B1 sanity check)
cosine similarity: <p,x> / (||p|| ||x||)
centered prototypes + cosine: subtract grand-mean prototype before normalizing
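As a rough sketch of the second and third variants (function names are hypothetical, and the zero-norm guard is an assumption rather than something the record specifies):

```rust
/// Cosine similarity <p, x> / (||p|| ||x||).
/// Returns 0.0 when either vector is all zeros (an assumed convention).
fn cosine(p: &[f32], x: &[f32]) -> f32 {
    let dot: f32 = p.iter().zip(x).map(|(a, b)| a * b).sum();
    let norm_p = p.iter().map(|a| a * a).sum::<f32>().sqrt();
    let norm_x = x.iter().map(|b| b * b).sum::<f32>().sqrt();
    if norm_p == 0.0 || norm_x == 0.0 {
        0.0
    } else {
        dot / (norm_p * norm_x)
    }
}

/// Subtract the grand-mean prototype (per-pixel mean over classes) from each
/// prototype. Expects at least one prototype; all must have the same length.
fn center_prototypes(prototypes: &[Vec<f32>]) -> Vec<Vec<f32>> {
    let n = prototypes.len() as f32;
    let mut mean = vec![0.0f32; prototypes[0].len()];
    for p in prototypes {
        for (m, v) in mean.iter_mut().zip(p) {
            *m += v / n;
        }
    }
    prototypes
        .iter()
        .map(|p| p.iter().zip(&mean).map(|(v, m)| v - m).collect::<Vec<f32>>())
        .collect()
}
```

Note that `center_prototypes` produces negative entries, while an uncentered image passed to `cosine` never has any.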

Results:

Variant	Accuracy
Raw dot product	64.99%
Cosine similarity	83.53%
Centered + cosine	58.89%

Per-class accuracy (cosine): 0:89.2 1:94.5 2:81.6 3:83.1 4:81.5 5:69.1 6:91.5 7:84.5 8:78.0 9:80.4

Failure modes (cosine): confusion is now local and shape-driven — 5↔3 (125 errors), 4↔9 (125), 8↔3 (76). The class-5-always-predicts-0 catastrophe is gone.

Why centered+cosine got worse: centering the prototype without also centering the test image creates a mismatched comparison, because centered prototypes take negative values while raw image pixels are never below zero. Centering the images as well should fix it; not pursued because the plain cosine variant already gave us the clean answer.

Takeaway: 83.53% is the honest calibration floor. Any patch-matcher result that doesn’t clear this is meaningless. Useful reference points still missing: trained linear classifier (~92% expected), small MLP (~98%), single conv layer (>98%).

B3: Prototype features → trained linear discriminator

Date: 2026-05-07 Binary: src/bin/proto_clf.rs Hypothesis: feeding the 10 cosine-similarity scores (B2) into a trained linear softmax classifier instead of taking argmax extracts more signal and clears 83.5% comfortably.

Setup: 784 → 10 frozen cosine-prototype nodes → 10×10 linear classifier + bias, softmax + CE, online SGD (lr=0.5, 5 epochs, seed=0xB3).
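A minimal sketch of the trainable part of this setup, assuming the standard softmax cross-entropy gradient (p_k minus the one-hot target). The `Linear` struct and its method names are hypothetical, not taken from src/bin/proto_clf.rs.

```rust
/// Numerically stable softmax over 10 logits.
fn softmax(logits: &[f32; 10]) -> [f32; 10] {
    let max = logits.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let mut out = [0.0f32; 10];
    let mut sum = 0.0f32;
    for (o, &z) in out.iter_mut().zip(logits.iter()) {
        *o = (z - max).exp();
        sum += *o;
    }
    for o in out.iter_mut() {
        *o /= sum;
    }
    out
}

/// 10x10 linear classifier + bias; `w[k][j]` maps feature j to class k.
struct Linear {
    w: [[f32; 10]; 10],
    b: [f32; 10],
}

impl Linear {
    fn forward(&self, x: &[f32; 10]) -> [f32; 10] {
        let mut logits = self.b;
        for k in 0..10 {
            for j in 0..10 {
                logits[k] += self.w[k][j] * x[j];
            }
        }
        logits
    }

    /// One online SGD step on cross-entropy; returns the loss before the update.
    fn sgd_step(&mut self, x: &[f32; 10], label: usize, lr: f32) -> f32 {
        let p = softmax(&self.forward(x));
        let loss = -p[label].max(1e-12).ln();
        for k in 0..10 {
            let grad = p[k] - if k == label { 1.0 } else { 0.0 };
            for j in 0..10 {
                self.w[k][j] -= lr * grad * x[j];
            }
            self.b[k] -= lr * grad;
        }
        loss
    }
}
```

Here `x` stands for the 10 frozen cosine scores of one image. With zero-initialized weights the first loss is ln(10), since the classifier starts out uniform over the classes.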

Result: 84.85% (+1.32pp over B2 argmax). Train accuracy peaked at 80.9%; test peaked at 85.87% in epoch 3 then mildly drifted.

Per-class accuracy: 0:92.2 1:97.6 2:73.5 3:78.5 4:92.4 5:68.5 6:94.5 7:94.1 8:76.7 9:77.7

Learned weights: strongly diagonal (+39 to +57 on diag, mostly small negatives off-diag). The classifier essentially rediscovered argmax with tiny corrections.

Takeaway: whole-image cosine similarity against 10 class means is information-bottlenecked at ~85%. A linear discriminator cannot recover information lost in the 784→10 projection. Breaking past this requires either richer features (more prototypes per class, local patches) or an entirely different feature primitive. The next interesting move is local patches — the original Group B hypothesis — since the failure modes (5↔3, 4↔9, 8↔3) are precisely where local feature detection should help.

Next: B4 candidate — same architecture but with 16-32 randomly-placed 5×5 patch matchers as features instead of 10 whole-image prototypes. If that clears 85%, the patch hypothesis has signal.

B4: Random local patches (frozen) → trained discriminator

Date: 2026-05-07 Binary: src/bin/proto_patches.rs Hypothesis: locality + nonlinearity are sufficient inductive bias to clear the 85% whole-image cosine ceiling, even with random unlearned patches.

Setup: 32 patches × 5×5, He-init random weights, bias=0, uniform random positions. ReLU on patch outputs. Patches frozen. 32→10 linear softmax classifier, online SGD, lr=0.1, 10 epochs, seed=0xB4.
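For reference, a single frozen patch in this setup can be sketched as below. The `Patch` struct and the row-major 784-float image layout are assumptions made for illustration.

```rust
const SIDE: usize = 28; // image is SIDE x SIDE, stored row-major
const K: usize = 5; // patch is K x K

/// A patch matcher pinned to one image position.
struct Patch {
    top: usize,
    left: usize,
    weights: [f32; K * K],
    bias: f32,
}

impl Patch {
    /// ReLU(bias + sum of weight * pixel over the K x K window).
    fn response(&self, image: &[f32]) -> f32 {
        assert_eq!(image.len(), SIDE * SIDE);
        assert!(self.top + K <= SIDE && self.left + K <= SIDE);
        let mut acc = self.bias;
        for r in 0..K {
            for c in 0..K {
                let pixel = image[(self.top + r) * SIDE + self.left + c];
                acc += self.weights[r * K + c] * pixel;
            }
        }
        acc.max(0.0)
    }
}
```

A patch whose window covers only background pixels returns exactly `bias.max(0.0)`, which is 0 at bias=0. This is how a patch can fail to fire on an image.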

Result: 67.56% (best across epochs: 68.60%) — worse than B2 (84%) and B3 (85%).

Per-class accuracy: 0:78.8 1:91.6 2:62.3 3:58.0 4:46.7 5:64.7 6:65.8 7:82.8 8:50.4 9:71.6

Key diagnostics:

Only 10.9 of 32 patches fire per image on average — most positions land on background.
Total patch params: 832 (vs. B3’s 7840), most of which never engage.
Random weights have no semantic content; downstream classifier sees noise.

Takeaway: locality + nonlinearity is not sufficient inductive bias on its own at this feature count. The content of the filters matters. This is in line with LeNet-style networks relying on learned convolutional filters rather than random ones.

Negative result framing: B4 cleanly rules out “any local features will do.” The next experiment must test whether learned patches break the ceiling.

Next: B5 — same architecture as B4 but with patch weights trainable via backprop. Same patch count and size to isolate the effect of learning from the effect of architecture.

(Diverted: did a cheaper intermediate check first — meaningful but frozen patches.)

B5: Prototype-slice patches (frozen) → trained discriminator

Date: 2026-05-07 Binary: src/bin/proto_slices.rs Hypothesis: B4 failed on filter content, not on locality. Patches that are literal 5×5 slices of class-mean prototypes (frozen) feeding a trained linear discriminator should clear the 85% bottleneck.

Setup: For each class c and slice i (0..N), random (top, left) in [0..23]², copy prototype[c][top..top+5, left..left+5] as patch weights. Cosine similarity at that position. Linear classifier on top of 10N features, online SGD lr=0.1, 10 epochs, seed=0xB5. Swept N ∈ {1, 2, 4, 8, 12, 16, 24, 32, 48, 64}.
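The slicing step can be sketched as follows. Function names are hypothetical, images and prototypes are assumed to be row-major 784-float slices, and the zero-norm guard is an assumed convention.

```rust
const SIDE: usize = 28;
const K: usize = 5;

/// Copy the K x K window at (top, left) out of a SIDE x SIDE row-major array,
/// i.e. prototype[top..top+5, left..left+5] flattened to 25 values.
fn slice_patch(source: &[f32], top: usize, left: usize) -> [f32; K * K] {
    assert_eq!(source.len(), SIDE * SIDE);
    assert!(top + K <= SIDE && left + K <= SIDE); // top, left in 0..=23
    let mut patch = [0.0f32; K * K];
    for r in 0..K {
        let start = (top + r) * SIDE + left;
        patch[r * K..(r + 1) * K].copy_from_slice(&source[start..start + K]);
    }
    patch
}

/// Cosine similarity between a patch and the image window at the same position.
fn window_cosine(patch: &[f32; K * K], image: &[f32], top: usize, left: usize) -> f32 {
    let window = slice_patch(image, top, left);
    let dot: f32 = patch.iter().zip(window.iter()).map(|(p, x)| p * x).sum();
    let norm_p = patch.iter().map(|p| p * p).sum::<f32>().sqrt();
    let norm_x = window.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_p == 0.0 || norm_x == 0.0 {
        0.0
    } else {
        dot / (norm_p * norm_x)
    }
}
```

Each of the 10N features is then one `window_cosine` value, computed with the patch sliced from its class prototype at the position it was drawn from.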

Results (best test accuracy across epochs):

N/class	Features	Best %
1	10	55.27
2	20	75.58
4	40	85.18
8	80	89.77
12	120	90.43
16	160	91.24
24	240	91.11
32	320	92.22
48	480	92.76
64	640	93.67

Key observations:

Crosses B3 (84.85%) at N≈4, decisively beats it from N=8 onward.
Compared to B4 (random local, 32 features, 67.56%): B5 at matched feature count is ~80% — meaningful content adds ~12pp at fixed scale.
No magic number. Smooth log-shaped curve, diminishing returns past N=12-16.
Best Group B result so far (93.67%), still with frozen feature layer.

Confound: at N=32, B5 has 8,000 patch weights vs. B4’s 800 (832 counting biases). Some of the ~25pp lift is raw capacity, not just content.

Takeaway: locality + meaningful content > global + meaningful, and both crush locality + random. Confirms patch-matcher hypothesis but doesn’t fully isolate the inductive-bias contribution from the parameter-count contribution.

Next: B6 — 320 random patches matched to B5’s N=32 feature count, to isolate “more capacity” from “meaningful content”.

B6: Random patches, sweep over count

Date: 2026-05-07 Binary: src/bin/proto_patches_sweep.rs Hypothesis: B5’s lift over B4 came from raw feature count, not from meaningful filter content. If random patches at matched count perform similarly to prototype-slice patches, B5’s “meaningful content matters” reading was overstated.

Setup: He-init random 5×5 patches, ReLU, frozen. Linear classifier on top trained with online SGD, lr=0.1, 10 epochs, seed=0xB6. Swept patch count ∈ {32, 80, 160, 320, 640}.
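A sketch of what He initialization of the frozen patches could look like, assuming the Gaussian form with variance 2/fan_in. The record does not say which RNG or which He variant the binary uses; the SplitMix64 generator and Box-Muller transform here are stand-ins chosen so the example needs no external crates.

```rust
/// SplitMix64: small deterministic generator (stand-in, not the binary's RNG).
struct SplitMix64(u64);

impl SplitMix64 {
    fn next_u64(&mut self) -> u64 {
        self.0 = self.0.wrapping_add(0x9E3779B97F4A7C15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58476D1CE4E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D049BB133111EB);
        z ^ (z >> 31)
    }

    /// Uniform in [0, 1).
    fn next_f64(&mut self) -> f64 {
        (self.next_u64() >> 11) as f64 / (1u64 << 53) as f64
    }

    /// Standard normal sample via the Box-Muller transform.
    fn next_gaussian(&mut self) -> f64 {
        let u1 = 1.0 - self.next_f64(); // in (0, 1], keeps ln() finite
        let u2 = self.next_f64();
        (-2.0 * u1.ln()).sqrt() * (2.0 * std::f64::consts::PI * u2).cos()
    }
}

/// `count` weights drawn from N(0, 2 / fan_in); fan_in is 25 for a 5x5 patch.
fn he_init(rng: &mut SplitMix64, fan_in: usize, count: usize) -> Vec<f32> {
    let std_dev = (2.0 / fan_in as f64).sqrt();
    (0..count)
        .map(|_| (rng.next_gaussian() * std_dev) as f32)
        .collect()
}
```

For 5×5 patches the target standard deviation is sqrt(2/25) ≈ 0.283, independent of how many patches are drawn.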

Results:

Patches	Params	Best %
32	800	66.26
80	2000	82.07
160	4000	88.85
320	8000	92.65
640	16000	94.01

Comparison to B5 (prototype-slice) at matched feature counts:

80: B6 82.07% vs B5 89.77% (B5 wins by 7.70pp)
160: B6 88.85% vs B5 91.24% (B5 wins by 2.39pp)
320: B6 92.65% vs B5 92.22% (B6 ahead by 0.43pp)
640: B6 94.01% vs B5 93.67% (B6 ahead by 0.34pp)

Takeaway: B5’s apparent “meaningful content matters” effect was real only at small feature counts. At ≥320 features, random patches catch up; at 640 they slightly exceed prototype-slice patches. The linear classifier finds good combinations from noisy features once capacity is high enough.

Reframing of B5: meaningful local features have a sample-efficiency advantage that vanishes with sufficient feature count, not a fundamental quality advantage.

B7: Trained patches (random init, backprop)

Date: 2026-05-07 Binary: src/bin/proto_patches_trained.rs Hypothesis: SGD-trained patches from random init beat both random-frozen and prototype-slice-frozen at every feature count, especially at small N. Validates the original Group B hypothesis (patch matchers as a learnable typed species).

Setup: Same architecture as B6 (random fixed positions, ReLU patch outputs, linear classifier), but patch weights AND biases train via backprop alongside the classifier. He init for both layers, online SGD lr=0.05 (lower than B6 because two layers train), 10 epochs, seed=0xB7. Swept patch count ∈ {32, 80, 160, 320, 640}.
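What "patch weights and biases train via backprop alongside the classifier" involves can be sketched as one online SGD step. This is an illustration under assumptions (struct layout, names, and the plain softmax cross-entropy gradient are not taken from src/bin/proto_patches_trained.rs).

```rust
const SIDE: usize = 28;
const K: usize = 5;
const CLASSES: usize = 10;

struct Patch {
    top: usize,
    left: usize,
    w: [f32; K * K],
    b: f32,
}

/// Patch layer plus linear softmax head; `head_w[i][k]` links patch i to class k.
struct Net {
    patches: Vec<Patch>,
    head_w: Vec<[f32; CLASSES]>,
    head_b: [f32; CLASSES],
}

impl Net {
    /// One online SGD step; returns the cross-entropy loss before the update.
    fn step(&mut self, image: &[f32], label: usize, lr: f32) -> f32 {
        // Forward: patch pre-activations, ReLU, linear head, softmax.
        let mut pre = vec![0.0f32; self.patches.len()];
        let mut logits = self.head_b;
        for (i, p) in self.patches.iter().enumerate() {
            let mut z = p.b;
            for r in 0..K {
                for c in 0..K {
                    z += p.w[r * K + c] * image[(p.top + r) * SIDE + p.left + c];
                }
            }
            pre[i] = z;
            for k in 0..CLASSES {
                logits[k] += z.max(0.0) * self.head_w[i][k];
            }
        }
        let max = logits.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
        let mut probs = [0.0f32; CLASSES];
        let mut sum = 0.0;
        for k in 0..CLASSES {
            probs[k] = (logits[k] - max).exp();
            sum += probs[k];
        }
        for k in 0..CLASSES {
            probs[k] /= sum;
        }
        let loss = -probs[label].max(1e-12).ln();

        // Backward: d(loss)/d(logit_k) = p_k - [k == label].
        let mut dlogits = probs;
        dlogits[label] -= 1.0;
        for (i, p) in self.patches.iter_mut().enumerate() {
            let act = pre[i].max(0.0);
            let mut dact = 0.0;
            for k in 0..CLASSES {
                dact += dlogits[k] * self.head_w[i][k]; // read before updating
                self.head_w[i][k] -= lr * dlogits[k] * act;
            }
            if pre[i] > 0.0 {
                // ReLU passes gradient only where the patch fired.
                for r in 0..K {
                    for c in 0..K {
                        p.w[r * K + c] -= lr * dact * image[(p.top + r) * SIDE + p.left + c];
                    }
                }
                p.b -= lr * dact;
            }
        }
        for k in 0..CLASSES {
            self.head_b[k] -= lr * dlogits[k];
        }
        loss
    }
}
```

Freezing the patch layer (the B6 condition) corresponds to skipping the `if pre[i] > 0.0` block, so the only difference between the two experiments in this sketch is whether that gradient is applied.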

Results:

Patches	Total params	Best %
32	1162	85.64
80	2890	93.46
160	5770	96.04
320	11530	97.13
640	23050	97.27
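The "Total params" column is consistent with one weight per patch pixel plus one bias per patch, followed by an N×10 classifier with 10 biases. The helper below is a hypothetical reconstruction of that arithmetic, not code from the binary.

```rust
const CLASSES: usize = 10;

/// Trainable parameters for `n_patches` patches of size height x width
/// feeding a linear classifier over CLASSES outputs.
fn total_params(n_patches: usize, height: usize, width: usize) -> usize {
    let patch_params = n_patches * (height * width + 1); // weights + bias
    let classifier_params = n_patches * CLASSES + CLASSES; // weights + biases
    patch_params + classifier_params
}
```

For example, `total_params(32, 5, 5)` gives 1162 and `total_params(640, 5, 5)` gives 23050, matching the first and last rows of the table.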

Gap over random (B6) at matched count:

32: +19.4pp
80: +11.4pp
160: +7.2pp
320: +4.5pp
640: +3.3pp

Takeaway: trained patches dominate everywhere. The gap shrinks at scale (random-with-many-patches eventually approaches usable) but doesn’t close in this range. 97.27% with 640 5×5 trained patches is close to the main NEAT system’s [128] dense-hidden result (98.7%), achieved with locality as inductive bias rather than dense connectivity.

Status: original Group B hypothesis (patch matchers as a learnable typed species) supported. Ready for integration into the main NEAT stream as a typed-node mutation, if/when that’s the priority.

B8: Patch size sweep (3×3 through 7×7)

Date: 2026-05-07 Binary: src/bin/proto_patches_size.rs Hypothesis: 5×5 was an arbitrary default carried over from B4/B7. Different patch sizes have different parameter efficiencies — smaller patches give more spatial coverage per parameter, larger patches give bigger receptive fields per detector. The cross-product of size and count reveals which trades are worth making.

Setup: same architecture and training as B7 (trained patches, random He init, ReLU, linear discriminator, online SGD, lr=0.05, 10 epochs, seed=0xB8). Crossed sweep: patch size ∈ {3, 4, 5, 6, 7} × patch count ∈ {32, 80, 160, 320, 640}. 25 configurations total.

Results — best test accuracy by (count, size):

N	3×3	4×4	5×5	6×6	7×7
32	79.95	81.32	86.93	89.15	91.52
80	90.49	93.37	92.33	94.32	96.33
160	92.28	94.65	95.44	96.14	96.82
320	95.85	96.69	97.12	97.08	97.53
640	96.54	96.96	97.70	97.71	98.04

Parameter-efficient frontier (best config at each budget):

≈Params	Config	Accuracy
1.6K	3×3 × 80	90.49
3K	3×3 × 160 / 5×5 × 80	92.3
6K	3×3 × 320	95.85
12K	5×5 × 320	97.12
20K	7×7 × 320	97.53
38K	7×7 × 640	98.04

Key findings:

At low parameter budgets, smaller patches at higher count win — coverage beats receptive field when params are scarce.
At ~12K+ params, 5×5/6×6/7×7 all reach 97%, with 7×7 leading on absolute accuracy but at higher param cost.
3×3 has a hard ceiling around 96.5% — receptive field too small to capture enough digit structure regardless of count.
All sizes saturate at 96.5%-98% by N=640. Architectural ceiling not removed by more patch parameters of any size.

Takeaway: patch size is a meaningful axis (not a hyperparameter to fix). For typed-NEAT integration, the genome should support mutations adding patches of varied sizes and let evolution pick the mix. Default single-size choice if forced: 5×5 — never far from optimal across the budget range, with 4×4 close behind. Multi-scale > single-scale.

Status: enriches the B7 result with parameter-efficiency data. Confirms 5×5 was a reasonable default, identifies the size-vs-count tradeoff structure for downstream design decisions.

B9: Rectangular patches (single-seed sweep)

Date: 2026-05-07 Binary: src/bin/proto_patches_rect.rs Hypothesis: rectangular patches at matched parameter counts may differ from square ones; horizontal vs. vertical orientation may also matter.

Setup: same training as B7/B8. 11 shapes covering matched-area tiers (16/21, 24/27, 35/36) with horizontal/vertical mirror pairs. Two patch counts (N=160, N=320). Single seed (0xB9).

Headline single-seed results (N=320):

5×7 reached 97.66% at 14730 params — the best result of the sweep
3×9 vs 9×3 gap: +0.87pp (wide wins)
5×7 vs 7×5 gap: +0.38pp (wide wins)
4×6 vs 6×4 gap: +0.01pp, 3×7 vs 7×3 gap: +0.06pp (no preference)

Status: suggestive but unreliable. With per-config std ~0.15pp, sub-0.3pp gaps are noise. Triggered B9-stats.

B9-stats: Rectangular patches with paired multi-seed stats

Date: 2026-05-07 Binary: src/bin/proto_patches_rect_stats.rs Hypothesis: confirm or refute the B9 single-seed gaps with multi-seed paired comparisons. Stand up reusable stats helpers (SampleStats, PairedComparison).

Setup: 8 shapes × 5 seeds (0xB9–0xBD) at N=320. Paired design — same seeds across all shapes so within-seed noise cancels in differences. All statistics constant-time given running sums.

Per-shape mean ± std (5 seeds): 5×5 97.06±0.14, 6×6 97.34±0.16, 4×6 97.04±0.13, 6×4 96.86±0.13, 3×9 97.10±0.19, 9×3 96.60±0.16, 5×7 97.38±0.19, 7×5 97.35±0.11.

Paired wide-minus-tall (Δ pp, t, d_z, sig):

Pair	Δ	t	d_z	Sig
4×6 / 6×4	+0.180	1.63	0.73	ns
3×9 / 9×3	+0.506	4.22	1.89	***
5×7 / 7×5	+0.038	0.37	0.17	ns

Key corrections to B9:

The B9 “5×7 N=320 = 97.66%” headline was the lucky-seed max; mean is 97.38%, statistically tied with 6×6. Single-seed best-of-sweep is unreliable.
The B9 5×7 > 7×5 finding was noise (multi-seed Δ=+0.04, ns).
The B9 3×9 ≫ 9×3 finding survived rigorously: very-large effect (d_z=1.89), highly significant (***).

Methodology lesson: any quantitative claim about a difference under ~0.3pp needs multi-seed paired stats. Stats infrastructure now reusable.
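A sketch of how the two helpers might be built on running sums. The type names `SampleStats` and `PairedComparison` come from this record; every field, every method, and the choice of the n−1 sample standard deviation are assumptions.

```rust
/// Running sums for one sample; every statistic is O(1) once values are pushed.
#[derive(Default)]
struct SampleStats {
    n: usize,
    sum: f64,
    sum_sq: f64,
}

impl SampleStats {
    fn push(&mut self, x: f64) {
        self.n += 1;
        self.sum += x;
        self.sum_sq += x * x;
    }

    fn mean(&self) -> f64 {
        self.sum / self.n as f64
    }

    /// Sample standard deviation (n - 1 denominator); needs n >= 2.
    fn std(&self) -> f64 {
        let n = self.n as f64;
        ((self.sum_sq - self.sum * self.sum / n) / (n - 1.0)).max(0.0).sqrt()
    }
}

/// Paired comparison: statistics of the per-seed differences a - b.
struct PairedComparison {
    diffs: SampleStats,
}

impl PairedComparison {
    fn new() -> Self {
        PairedComparison { diffs: SampleStats::default() }
    }

    /// Record the results of both configurations under the same seed.
    fn push(&mut self, a: f64, b: f64) {
        self.diffs.push(a - b);
    }

    /// Mean difference (the Δ column).
    fn delta(&self) -> f64 {
        self.diffs.mean()
    }

    /// Effect size d_z = mean(diff) / std(diff).
    fn d_z(&self) -> f64 {
        self.diffs.mean() / self.diffs.std()
    }

    /// Paired t statistic = d_z * sqrt(n).
    fn t(&self) -> f64 {
        self.d_z() * (self.diffs.n as f64).sqrt()
    }
}
```

With 5 seeds, t = d_z × √5, which is the relation between the t and d_z columns in the tables above (for example 1.89 × 2.236 ≈ 4.22). The sum-of-squares form is simple but can lose precision when the spread is tiny relative to the mean; Welford's update is a common alternative.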

Status: established that the wide preference is real at extreme aspect ratios on MNIST. Triggered B10 (Fashion replication) to test the digit-stroke mechanistic hypothesis.

B10: Rectangular patches on Fashion-MNIST

Date: 2026-05-07 Binary: src/bin/proto_patches_rect_stats_fashion.rs Hypothesis: if 3×9 ≫ 9×3 on MNIST is digit-stroke-specific, Fashion should show a different pattern. If it persists, the effect is general to 28×28 grayscale image classification.

Setup: identical to B9-stats — 8 shapes × 5 seeds (0xB9–0xBD) at N=320, same training regime. Only the data path changed.

Per-shape mean ± std (Fashion, 5 seeds, N=320): 5×5 86.45±0.33, 6×6 86.47±0.28, 4×6 86.53±0.43, 6×4 86.23±0.22, 3×9 86.37±0.17, 9×3 85.44±0.59, 5×7 86.37±0.30, 7×5 86.13±0.34.

Paired wide-minus-tall (Fashion):

Pair	Δ	t	d_z	Sig
4×6 / 6×4	+0.300	1.18	0.53	ns
3×9 / 9×3	+0.930	3.32	1.48	***
5×7 / 7×5	+0.236	1.08	0.48	ns

Cross-task comparison:

Pair	MNIST Δ	Fashion Δ
4×6 vs 6×4	+0.18	+0.30
3×9 vs 9×3	+0.51	+0.93
5×7 vs 7×5	+0.04	+0.24

Findings:

The 1:3 wide-preference effect replicates and strengthens on Fashion (+0.93pp vs. +0.51pp).
All three pairs trend wide > tall on both datasets; moderate-aspect pairs are non-significant on both.
9×3 is genuinely unstable on Fashion — std=0.59, the highest of any shape.
Fashion is harder overall (~86% vs ~97%) and noisier (std ~2× larger).

Refuted hypothesis: digit-stroke geometry as the mechanism (B9 single-seed framing).

Refined hypothesis: kernels should be perpendicular to the dominant feature orientation. Both MNIST and Fashion have predominantly vertical structural features; tall patches lie along these features and waste capacity, while wide patches cut across them and capture transitions. Classical filter-design wisdom rediscovered via SGD.

Status: hypothesis is now directly testable — rotating the input 90° should flip the preference (B11 candidate), and a more granular aspect-ratio sweep should show the effect grows monotonically with ratio extremity (B12 candidate).

B11: Rotated MNIST

Date: 2026-05-07 Binary: src/bin/proto_patches_rect_rotated.rs Hypothesis: if the wide preference is “kernels perpendicular to the dominant feature orientation,” rotating MNIST 90° CW should flip the dominant feature orientation and the preference should reverse — tall > wide on rotated data.

Setup: identical to B9-stats (8 shapes × 5 seeds at N=320) except every image is rotated 90° CW (new[c][27-r] = old[r][c]) at load time.
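The index mapping new[c][27-r] = old[r][c] can be written out as below, assuming a row-major 784-float image; `rotate_cw` is a hypothetical name.

```rust
const SIDE: usize = 28;

/// Rotate a SIDE x SIDE row-major image 90 degrees clockwise:
/// new[c][SIDE - 1 - r] = old[r][c].
fn rotate_cw(old: &[f32]) -> Vec<f32> {
    assert_eq!(old.len(), SIDE * SIDE);
    let mut new = vec![0.0f32; SIDE * SIDE];
    for r in 0..SIDE {
        for c in 0..SIDE {
            new[c * SIDE + (SIDE - 1 - r)] = old[r * SIDE + c];
        }
    }
    new
}
```

Under this mapping the top-left pixel moves to the top-right corner, and a horizontal stroke on row 5 becomes a vertical stroke in column 22, which is the orientation swap the hypothesis relies on.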

Per-shape mean ± std: 5×5 97.04±0.15, 6×6 97.45±0.14, 4×6 97.02±0.15, 6×4 96.99±0.22, 3×9 96.67±0.21, 9×3 97.07±0.23, 5×7 97.34±0.06, 7×5 97.44±0.13.

Paired wide-minus-tall (rotated):

Pair	Δ	t	d_z	Sig	vs. upright
4×6 / 6×4	+0.022	0.25	0.11	ns	was +0.18 ns
3×9 / 9×3	−0.402	−2.27	−1.02	*	was +0.51 ***, sign flipped
5×7 / 7×5	−0.106	−1.77	−0.79	ns	was +0.04 ns, direction flipped

Finding: the 3×9 vs 9×3 effect cleanly reversed sign on rotated MNIST (Δ went from +0.51pp *** to −0.40pp *). The pattern moves with the data — confirming the mechanistic claim that this is feature-orientation-driven, not architectural bias or placement geometry.

Magnitude is smaller after rotation (0.40 vs 0.51): plausibly because rotated digits are slightly off-distribution (training-data conventions are axis-specific), or because vertical-feature dominance has a partly geometric component that rotation doesn’t fully invert. The sign — what the hypothesis predicts — is unambiguous.

Status: mechanistic hypothesis from B10 confirmed. The wide-vs-tall asymmetry is genuinely orientation-driven.

B12: Extreme aspect ratios

Date: 2026-05-07 Binary: src/bin/proto_patches_rect_aspect.rs Hypothesis: the wide preference grows monotonically with aspect-ratio extremity — gap should increase 1:3 → 1:9 → 1:15 → 1:21.

Setup: 9 shapes × 5 seeds at N=320 on MNIST. Mirror pairs at 1:3 (3×9/9×3, reference), 1:9 (1×9/9×1), 1:15, 1:21. Plus 5×5 baseline.

Per-shape mean ± std: 5×5 97.06±0.14, 3×9 97.10±0.19, 9×3 96.60±0.16, 1×9 94.38±0.16, 9×1 94.36±0.16, 1×15 95.21±0.23, 15×1 94.70±0.06, 1×21 95.31±0.09, 21×1 94.62±0.25.

Paired wide-minus-tall:

Pair	Aspect	Δ	t	d_z	Sig
3×9 / 9×3	1:3	+0.506	4.22	1.89	***
1×9 / 9×1	1:9	+0.012	0.11	0.05	ns
1×15 / 15×1	1:15	+0.518	4.17	1.86	***
1×21 / 21×1	1:21	+0.690	4.62	2.07	***

Finding: monotonicity prediction refuted. There is a null at 1:9 sandwiched between significant effects at 1:3 and 1:15+. The pattern is large → null → large → larger.

The 1×9 vs 9×1 pair is the only one that is both a single pixel thick (no perpendicular extent) AND short (9 px). All other pairs have either ≥3 px perpendicular thickness (3×9, 9×3) or ≥15 px length (1×15, 1×21). Either property recovers the effect; having neither kills it.

Refined mechanistic story (post-B12): the wide preference at extreme aspect ratios is at least two distinct phenomena:

For thick rectangular patches (≥3 px perpendicular extent): B10’s perpendicular-to-feature argument.
For 1-pixel-thick long strips: cross-section sampling — horizontal strips capture intensity profile across digit width (discriminative), vertical strips capture intensity profile across digit height (less discriminative because all digits are ~similar height).

Both produce wide > tall on MNIST, but for different reasons.

Other observations:

All single-pixel-thick patches cap at ~94-95%, regardless of length. Bandwidth-limited.
Thickening the strip from one row to three (1×9 → 3×9) jumps accuracy from 94.4% to 97.1%. 1D → 2D internal structure is a phase transition for usefulness.
9×3 in B12 reproduces B9-stats’s 9×3 (96.60% mean, std 0.16) exactly — reproducibility check passed.

Status: closes the rectangular-patches arc. Real mechanistic finding (B11 confirmed) with a richer-than-expected structure (B12 surprise: not single-mechanism). Worth banking; future work could disentangle the two mechanisms more cleanly but marginal value vs other Group B questions is small.

B13 / A3: Random-index “patches” — does spatial contiguity matter?

Date: 2026-05-08 Binary: src/bin/proto_patches_random_idx.rs Hypothesis: at 320 trained patches, spatial contiguity is irrelevant — “patches” are just sparse linear features and SGD finds good combinations regardless of input pixel layout.

Setup: head-to-head spatial 5×5 vs 25-random-pixel-index “patches” at N=320, identical training, 5 seeds paired.
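The two conditions differ only in which 25 pixel indices each "patch" reads. A sketch, with hypothetical function names and a SplitMix64 generator standing in for whatever RNG the binary uses:

```rust
const SIDE: usize = 28;

/// Flat row-major indices of a contiguous k x k window at (top, left).
fn spatial_indices(top: usize, left: usize, k: usize) -> Vec<usize> {
    assert!(top + k <= SIDE && left + k <= SIDE);
    let mut idx = Vec::with_capacity(k * k);
    for r in 0..k {
        for c in 0..k {
            idx.push((top + r) * SIDE + left + c);
        }
    }
    idx
}

/// SplitMix64 step (stand-in generator so the sketch needs no crates).
fn splitmix64(state: &mut u64) -> u64 {
    *state = state.wrapping_add(0x9E3779B97F4A7C15);
    let mut z = *state;
    z = (z ^ (z >> 30)).wrapping_mul(0xBF58476D1CE4E5B9);
    z = (z ^ (z >> 27)).wrapping_mul(0x94D049BB133111EB);
    z ^ (z >> 31)
}

/// `count` distinct pixel indices from anywhere in the image,
/// drawn with a partial Fisher-Yates shuffle.
fn random_indices(seed: u64, count: usize) -> Vec<usize> {
    assert!(count <= SIDE * SIDE);
    let mut pool: Vec<usize> = (0..SIDE * SIDE).collect();
    let mut state = seed;
    for i in 0..count {
        let remaining = (pool.len() - i) as u64;
        let j = i + (splitmix64(&mut state) % remaining) as usize; // modulo bias ignored
        pool.swap(i, j);
    }
    pool.truncate(count);
    pool
}

/// Both variants compute the same thing: a weighted sum over the chosen pixels.
fn gather_dot(image: &[f32], indices: &[usize], weights: &[f32]) -> f32 {
    indices.iter().zip(weights).map(|(&i, w)| image[i] * w).sum()
}
```

Because `gather_dot` is identical for both index sets, parameter count and training are matched; only the spatial arrangement of the inputs changes.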

Result: hypothesis refuted.

Config	Mean ± std
Spatial 5×5	96.996 ± 0.246
Indexed 25	96.386 ± 0.147

Δ = +0.610pp, t = +7.91, d_z = +3.54 (***) — spatial wins decisively.

Takeaway: the locality inductive bias matters even at 320 trained patches. Triggered B16 (cross-task) and B17 (across sizes).

B14 / A1: Multi-scale patches

Date: 2026-05-08 Binary: src/bin/proto_patches_multiscale.rs Hypothesis: mixing patch sizes (1/3 each at 3×3, 5×5, 7×7) beats single-scale 5×5 at matched parameter count.

Setup: single-scale 5×5 vs mixed thirds at two patch counts (N=240, N=480). 5 seeds paired.

Results:

N	Single 5×5	Mixed 3/5/7	Δ (mixed − single)	t	d_z	Sig
240	96.40 ± 0.16	96.93 ± 0.22	+0.530	+4.40	+1.97	***
480	97.58 ± 0.14	97.46 ± 0.15	−0.114	−1.36	−0.61	ns

Takeaway: multi-scale wins decisively at low N (+0.53pp ***), but the advantage vanishes at high N. Multi-scale is a low-capacity phenomenon — receptive-field diversity helps when no single size has enough patches to fully exploit it. Rhymes with B5/B6 pattern (smarter feature design helps when capacity is scarce; doesn’t matter when abundant).

B15 / A2: Multi-layer patches (hidden ReLU layer)

Date: 2026-05-08 Binary: src/bin/proto_patches_multilayer.rs Hypothesis: adding a hidden ReLU layer between patches and linear classifier breaks past the ~97% ceiling — the linear head was the bottleneck.

Setup: 5×5 × 320 patches → M ReLU hidden → 10 linear softmax. M ∈ {0, 32, 64, 128}. Same training as B7/B8. 5 seeds.

Results:

M	Mean ± std	Total params	Δ vs M=0
0	97.00 ± 0.25	11,530	—
32	95.30 ± 0.45	18,922	−1.69pp
64	95.29 ± 0.54	29,514	−1.71pp
128	95.44 ± 0.16	50,698	−1.55pp

Takeaway: depth hurts uniformly at fixed training budget. The hidden layer’s extra parameters can’t be properly trained at 10 epochs / fixed LR. Opposite of main NEAT stream’s depth result ([128, 64] beat [128]) because main stream has 1.8M steps with LR decay vs our 500K with fixed LR. Conditional negative — depth in this minimal framework needs retuned hyperparameters before being declared dead.

B16: A3 on Fashion-MNIST

Date: 2026-05-08 Binary: src/bin/proto_patches_random_idx_fashion.rs Hypothesis: B13’s spatial-contiguity advantage replicates cross-task on Fashion-MNIST.

Setup: identical to B13, only data path changed.

Result:

Config	Mean ± std
Spatial 5×5	86.36 ± 0.33
Indexed 25	86.48 ± 0.19

Δ = −0.124pp, t = −0.71, d_z = −0.32 (ns) — sign even slightly negative.

Takeaway: the locality advantage is MNIST-specific. Does not replicate on Fashion. Inverse of the rectangular-patch finding (which replicated and strengthened on Fashion). Plausible mechanism: MNIST has very high local pixel correlation in stroke regions; Fashion has more textural variation, so adjacent pixels carry less correlated information. Random-index “patches” become effectively global pixel fingerprints, which is competitive with local-feature detection on Fashion.

B17: A3 across patch sizes (MNIST)

Date: 2026-05-08 Binary: src/bin/proto_patches_random_idx_sizes.rs Hypothesis: B13’s contiguity advantage isn’t 5×5-specific — it should exist at 3×3 and 7×7 too.

Setup: spatial vs indexed at sizes ∈ {3, 5, 7}. 320 patches each, 5 seeds paired.

Results:

Size	Spatial	Indexed	Δ pp	t	d_z	Sig
3×3	95.50 ± 0.28	94.91 ± 0.25	+0.596	+4.26	+1.91	***
5×5	97.21 ± 0.14	96.30 ± 0.12	+0.912	+18.30	+8.18	***
7×7	97.62 ± 0.19	96.92 ± 0.14	+0.694	+7.30	+3.26	***

Takeaway: contiguity advantage is robust at every patch size on MNIST, peaking at 5×5 (d_z=+8.18 — extraordinarily consistent). Magnitudes 0.6-0.9pp.

Methodological footnote: B13 reported Δ=+0.61pp at 5×5; B17 with a different seed-offset scheme (different positions / weight inits, same base seeds) gave +0.91pp at the same configuration. Even paired multi-seed Δ has ~0.2-0.3pp uncertainty in its precise magnitude — sign and significance are robust, but exact values need more samples to nail down.

Status: closes the A3 thread. Locality is a real, significant, robust advantage on MNIST across patch sizes — and is absent on Fashion. The task-specific transferability profile is itself the most informative finding.

B18: Task difficulty calibration

Date: 2026-05-08 Binary: src/bin/calibrate_tasks.rs Hypothesis: identify a harder workhorse task to replace MNIST, which is saturating around 97-98% and eating differential signal in architectural comparisons.

Setup: trained 5×5 patches at N ∈ {80, 320}, single seed, on 9 task variants spanning single datasets (MNIST, Fashion, KMNIST, EMNIST balanced), pairs (M+F, M+K, K+F), triple (M+F+K), and the full quad. Required new src/data/mixed.rs loader.
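One way a mixed-task loader can keep label spaces disjoint is to offset each dataset's labels by the class counts of the datasets before it. The sketch below is a guess at that bookkeeping; it is not the contents of src/data/mixed.rs, and all names are hypothetical.

```rust
/// One source dataset in a mixed task.
struct Source {
    name: &'static str,
    classes: usize,
}

/// First global label of each source, assuming disjoint label spaces.
fn label_offsets(sources: &[Source]) -> Vec<usize> {
    let mut offsets = Vec::with_capacity(sources.len());
    let mut next = 0;
    for s in sources {
        offsets.push(next);
        next += s.classes;
    }
    offsets
}

/// Number of output classes of the mixed task.
fn total_classes(sources: &[Source]) -> usize {
    sources.iter().map(|s| s.classes).sum()
}

/// Map (source index, label within that source) to a label in the mixed task.
fn global_label(sources: &[Source], source_idx: usize, local_label: usize) -> usize {
    assert!(local_label < sources[source_idx].classes);
    label_offsets(sources)[source_idx] + local_label
}
```

Under this scheme an MNIST+KMNIST mix has 20 output classes, and KMNIST class 3 becomes global label 13.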

Headline results: KMNIST is the cleanest single-task workhorse (8.3pp spread N=80→320, ~90% ceiling). MNIST+KMNIST hits 92.62% at N=320 — exactly the user’s target zone. EMNIST balanced (47 classes) gives the largest single-task spread (10.1pp) but introduces class-count confounds.

Status: established the new task battery for B19+.

B19: Locality on KMNIST

Date: 2026-05-08 Binary: src/bin/locality_kmnist.rs Hypothesis: KMNIST has stroke-like local structure similar to MNIST, so spatial 5×5 should still beat 25-random-pixel-index.

Setup: identical to B13 except task = KMNIST. 320 patches × 5 seeds.

Result: hypothesis refuted in dramatic fashion.

Config	Mean ± std
Spatial 5×5	90.322 ± 0.215
Indexed 25	91.534 ± 0.353

Δ = −1.212pp, t = −7.98, d_z = −3.54, *** — sign flipped from MNIST. Indexed beats spatial by even more than spatial beat indexed on MNIST.

Takeaway: spatial locality is harmful on cursive Japanese characters. The MNIST locality advantage doesn’t generalize even to other 28×28 grayscale stroke data; the working explanation is the cursive vs hand-printed difference.

B20: Locality on EMNIST balanced

Date: 2026-05-08 Binary: src/bin/locality_emnist.rs Hypothesis: EMNIST is hand-printed letters+digits, so it should follow the MNIST pattern (spatial wins).

Setup: identical to B13 except task = EMNIST balanced (47 classes, ~94K train images). 320 patches × 5 seeds.

Result:

Config	Mean ± std
Spatial 5×5	77.212 ± 0.366
Indexed 25	76.185 ± 0.089

Δ = +1.027pp, t = +5.53, d_z = +2.47, *** — same direction as MNIST.

Takeaway: spatial wins on hand-printed character data regardless of class count (10 → 47). The MNIST and EMNIST patterns are consistent. Together with B19’s flip on KMNIST, this suggests a mechanistic reading: hand-printed characters have local pixel correlations that 5×5 receptive fields exploit; cursive characters apparently don’t.

B21: Multi-scale on KMNIST

Date: 2026-05-08 Binary: src/bin/multiscale_kmnist.rs Hypothesis: B14’s “mixing 3/5/7 wins at low N” effect on MNIST should replicate on KMNIST, where there’s more headroom.

Setup: single-scale 5×5 vs mixed thirds at three patch counts. 5 paired seeds.

Results:

N	Single 5×5	Mixed 3/5/7	Δ	t	Sig
120	85.75 ± 0.31	86.06 ± 0.31	+0.31	+1.35	ns
240	89.71 ± 0.19	89.46 ± 0.17	−0.25	−1.78	ns
480	91.84 ± 0.25	92.12 ± 0.38	+0.28	+1.56	ns

All three patch counts non-significant. On MNIST B14 had Δ=+0.53pp *** at N=240.

Takeaway: the multi-scale-wins-at-low-N pattern was MNIST-specific. Receptive-field diversity didn’t help on KMNIST at any tested capacity. Mildly disappointing for the typed-species “evolve a mix” hypothesis.

B22: Locality on MNIST+KMNIST mix

Date: 2026-05-08 Binary: src/bin/locality_mnist_kmnist.rs Hypothesis: combining tasks where locality has opposite-sign effects (MNIST +, KMNIST −) — does one dominate, do they average to null, or does interaction emerge?

Setup: identical to B13 except task = MNIST + KMNIST (20 classes, 100K train). 5 paired seeds at N=320.

Result:

Config	Mean ± std
Spatial 5×5	92.971 ± 0.497
Indexed 25	92.879 ± 0.101

Δ = +0.092pp, t = +0.42, d_z = +0.19, ns — essentially zero.

Takeaway: opposite-direction effects from each constituent task cancel almost perfectly when mixed. The patch architecture sees both data distributions and finds intermediate behavior. No emergent property from mixing — it’s just an average. Useful confirmation that the task-specificity is genuinely about the data structure, not architecture-task interaction.

B23: Locality across patch sizes on KMNIST

Date: 2026-05-08 Binary: src/bin/locality_kmnist_sizes.rs Hypothesis: B19’s KMNIST flip is size-robust (not specific to 5×5).

Setup: spatial vs indexed at sizes ∈ {3, 5, 7}, 320 patches × 5 paired seeds.

Results:

Size	Spatial	Indexed	Δ pp	t	d_z	Sig
3×3	86.90 ± 0.25	88.25 ± 0.28	−1.35	−6.82	−3.05	***
5×5	90.33 ± 0.14	91.50 ± 0.16	−1.16	−13.43	−6.00	***
7×7	91.73 ± 0.33	93.12 ± 0.21	−1.38	−6.95	−3.11	***

Takeaway: KMNIST locality flip is robust at every patch size, with comparable magnitude (1.16-1.38pp) and all ***. MNIST B17 and KMNIST B23 are mirror images: same setup, opposite-signed effects of similar magnitude. The locality direction tracks data structure (hand-printed vs cursive characters), not architectural choice.

B24: Multilayer on KMNIST

Date: 2026-05-08 Binary: src/bin/multilayer_kmnist.rs Hypothesis: B15’s multilayer hurt on MNIST might have been driven by MNIST saturation (linear classifier already nearly optimal). KMNIST has 7pp more headroom — the hidden layer should help here if A2’s negative was capacity-driven.

Setup: 5×5 × 320 patches → M ReLU hidden → softmax. M ∈ {0, 64, 128}. 5 seeds.

Results:

M	Mean ± std	Δ vs M=0
0	90.32 ± 0.22	—
64	88.47 ± 1.35	−1.85pp
128	89.17 ± 0.44	−1.15pp

Takeaway: hidden layer hurts on KMNIST too, with similar magnitude as MNIST. B15’s negative result is task-general, not MNIST-specific. The under-training hypothesis is supported — at fixed training budget (10 epochs, fixed LR), additional parameters can’t be properly trained. For depth to help, the training schedule needs to grow with the architecture (more epochs, LR decay).

Status: confirms B15’s reading. Conditional negative becomes robust negative.

Synthesis: post B18-B24 transferability picture

Finding	MNIST	Holds on harder tasks?
Patch matchers as a primitive	works	Yes (B18 calibration)
Multi-scale at low N (B14)	+0.53pp ***	NO — all ns on KMNIST
Spatial locality (B13/B17)	+0.6-0.9pp *** at every size	NO — flips sign on KMNIST, null on Fashion, replicates on EMNIST
Multilayer hurt at fixed budget (B15)	−1.5-1.7pp	YES — replicates on KMNIST
Rectangular wide preference (B9-stats/B10/B11)	+0.51pp ***	YES (Fashion replicates, rotation flips)

Three of the five rows hold on harder tasks: the patch-matcher primitive itself plus two of the four architectural findings. The locality finding’s sign flip is the most striking single result of Group B to date: it transforms what looked like a general patch-architecture property into a data-distribution-dependent one.

B25: Multilayer on KMNIST with proper training schedule

Date: 2026-05-08 Binary: src/bin/multilayer_kmnist_schedule.rs Hypothesis: B15 / B24 found multilayer hurts at fixed 10-epoch / fixed-LR budget; mechanism was under-training. With 20 epochs + linear LR decay 0.05→0.005, does depth pay off?

Setup: identical to B24 except 20 epochs and decaying LR. M ∈ {0, 64}, 5 paired seeds.
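A linear decay from 0.05 to 0.005 can be sketched as below. Whether the binary decays per step or per epoch is not stated in this record; the per-step form and the function name are assumptions.

```rust
/// Learning rate at `step` (0-based) out of `total_steps`, interpolated
/// linearly from `lr_start` at the first step to `lr_end` at the last.
fn linear_lr(step: usize, total_steps: usize, lr_start: f32, lr_end: f32) -> f32 {
    if total_steps <= 1 {
        return lr_start;
    }
    let last = total_steps - 1;
    let t = step.min(last) as f32 / last as f32; // progress in [0, 1]
    lr_start + (lr_end - lr_start) * t
}
```

Halfway through training this gives roughly 0.0275, so the second half of the run uses rates below anything the fixed-LR runs in B15 and B24 ever saw.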

Result: hypothesis confirmed dramatically.

Config	Mean ± std
M=0 (linear)	92.96 ± 0.16
M=64 (multilayer)	95.74 ± 0.17

Δ = +2.78pp, t = +24.50, d_z = +10.96, *** — multilayer helps enormously with proper schedule.

Comparison to B24 (10 epochs fixed LR): M=0 was 90.32%, M=64 was 88.47% (Δ = −1.85pp). Both improve with schedule, but M=64 improves by +7.3pp vs M=0’s +2.6pp.

Major correction: B15’s “depth hurts” reading was a training-budget artifact. With proper schedule, depth is the biggest single architectural win on KMNIST.

B27: Pixel-correlation probe (null result)

Date: 2026-05-08 Binary: src/bin/pixel_correlations.rs Hypothesis: MNIST/EMNIST have higher local pixel correlation than KMNIST, explaining the locality direction.

Result: refuted. MNIST and KMNIST have nearly identical adjacent-pixel correlations (r=0.808 vs 0.789). Fashion has the highest of any dataset (r=0.846 H, 0.898 V) yet locality is null there.
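One plausible way to compute an adjacent-pixel correlation of this kind is a Pearson correlation between each pixel and its immediate neighbour, pooled over all positions and images, with horizontal (H) and vertical (V) neighbours handled separately. The sketch below assumes that definition; pixel_correlations.rs may pool or normalize differently, and the function name and row-major flat image layout are assumptions.

```rust
/// Pearson correlation between each pixel and its right-hand neighbour
/// (`vertical == false`) or the pixel directly below it (`vertical == true`),
/// pooled over every position in every image.
/// Images are flat row-major buffers of length `width * height`.
/// Returns None on a size mismatch, when there are no pairs, or when
/// either side of the pairing has zero variance.
fn adjacent_pixel_corr(
    images: &[Vec<f32>],
    width: usize,
    height: usize,
    vertical: bool,
) -> Option<f64> {
    let (dr, dc) = if vertical { (1usize, 0usize) } else { (0usize, 1usize) };
    let (mut n, mut sx, mut sy) = (0.0f64, 0.0f64, 0.0f64);
    let (mut sxx, mut syy, mut sxy) = (0.0f64, 0.0f64, 0.0f64);
    for img in images {
        if img.len() != width * height {
            return None;
        }
        for r in 0..height.saturating_sub(dr) {
            for c in 0..width.saturating_sub(dc) {
                let x = img[r * width + c] as f64;
                let y = img[(r + dr) * width + c + dc] as f64;
                n += 1.0;
                sx += x;
                sy += y;
                sxx += x * x;
                syy += y * y;
                sxy += x * y;
            }
        }
    }
    if n == 0.0 {
        return None;
    }
    let (mx, my) = (sx / n, sy / n);
    let cov = sxy / n - mx * my;
    let (vx, vy) = (sxx / n - mx * mx, syy / n - my * my);
    if vx <= 0.0 || vy <= 0.0 {
        return None;
    }
    Some(cov / (vx * vy).sqrt())
}
```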

Takeaway: simple pairwise pixel adjacency doesn’t predict locality direction. The mechanism is more subtle — see B31.

B28: Scaling sweep on KMNIST and EMNIST

Date: 2026-05-08 Binary: src/bin/scaling_kmnist_emnist.rs Hypothesis: establish the canonical N-sweep curve on the new workhorses (B7-equivalent).

Result: filled in N ∈ {32, 160, 640} alongside B18’s {80, 320}.

N	KMNIST	EMNIST
32	72.14%	52.40%
80	81.88%	67.56%
160	87.59%	74.12%
320	90.19%	77.69%
640	92.46%	79.67%

KMNIST: 20pp spread (72→92%). EMNIST: 27pp spread (52→80%) but lower ceiling.

B29: Rectangular patches on KMNIST

Date: 2026-05-08 Binary: src/bin/rect_kmnist.rs Hypothesis: B9-stats / B10 / B11’s wide-preference at 1:3 aspect was task-general. Does it hold on KMNIST?

Result: NO — sign flipped.

Pair	Δ pp	t	Sig
4×6 / 6×4	−0.27	−1.32	ns
3×9 / 9×3	−0.40	−2.07	*
5×7 / 7×5	+0.18	+0.70	ns

On KMNIST, tall narrow patches beat wide flat ones at extreme aspect (−0.40pp *). Opposite of MNIST/Fashion/rotated-MNIST.

Takeaway: rectangular wide-preference is mostly task-general but flips on cursive Japanese characters — consistent with KMNIST having a different dominant feature orientation than MNIST/Fashion.

B30: Multi-scale on EMNIST

Date: 2026-05-08 Binary: src/bin/multiscale_emnist.rs Hypothesis: KMNIST’s null result (B21) was specific to KMNIST. EMNIST is hand-printed letters+digits like MNIST, so multi-scale should replicate.

Result: NO — all ns on EMNIST too.

N	Single 5×5	Mixed 3/5/7	Δ	Sig
120	70.73	70.15	−0.58	ns
240	75.62	75.75	+0.13	ns
480	78.40	78.62	+0.22	ns

Takeaway: B14’s multi-scale advantage has so far appeared only on MNIST. It replicates on neither KMNIST nor EMNIST, and Fashion is untested.

B31: Per-pixel class-discriminability and spatial structure

Date: 2026-05-08 Binary: src/bin/pixel_discriminability.rs Hypothesis: spatial autocorrelation of class-discriminability at the patch scale predicts locality direction.

Setup: for each pixel position, compute F-like ratio of between-class variance to within-class variance. Compute spatial autocorrelation of this discriminability map at distances 1, 2, and 5.
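The two steps in this setup can be sketched as follows. Several details are assumptions, since the log only says "F-like": the ratio here is the raw between-class over within-class sum of squares (no degrees-of-freedom scaling), pixels with zero within-class variance are scored 0, and the autocorrelation at distance d is a Pearson correlation pooled over horizontal and vertical pairs exactly d apart. Function names are hypothetical, and labels are assumed to be in 0..n_classes.

```rust
/// Per-pixel ratio of between-class to within-class sum of squares.
/// Assumption: no degrees-of-freedom scaling; zero-variance pixels score 0.
fn discriminability_map(
    images: &[Vec<f32>],
    labels: &[usize],
    n_classes: usize,
    n_pixels: usize,
) -> Vec<f64> {
    let mut count = vec![0.0f64; n_classes];
    let mut sum = vec![vec![0.0f64; n_pixels]; n_classes];
    let mut sum_sq = vec![vec![0.0f64; n_pixels]; n_classes];
    for (img, &lab) in images.iter().zip(labels) {
        count[lab] += 1.0;
        for p in 0..n_pixels {
            let v = img[p] as f64;
            sum[lab][p] += v;
            sum_sq[lab][p] += v * v;
        }
    }
    let total: f64 = count.iter().sum();
    (0..n_pixels)
        .map(|p| {
            let grand = (0..n_classes).map(|k| sum[k][p]).sum::<f64>() / total;
            let (mut between, mut within) = (0.0f64, 0.0f64);
            for k in 0..n_classes {
                if count[k] == 0.0 {
                    continue;
                }
                let mean_k = sum[k][p] / count[k];
                between += count[k] * (mean_k - grand).powi(2);
                within += sum_sq[k][p] - count[k] * mean_k * mean_k;
            }
            if within > 1e-12 { between / within } else { 0.0 }
        })
        .collect()
}

/// Pearson correlation over (x, y) pairs; None if undefined.
fn pearson(pairs: &[(f64, f64)]) -> Option<f64> {
    if pairs.len() < 2 {
        return None;
    }
    let n = pairs.len() as f64;
    let mx = pairs.iter().map(|p| p.0).sum::<f64>() / n;
    let my = pairs.iter().map(|p| p.1).sum::<f64>() / n;
    let sxy: f64 = pairs.iter().map(|p| (p.0 - mx) * (p.1 - my)).sum();
    let sxx: f64 = pairs.iter().map(|p| (p.0 - mx).powi(2)).sum();
    let syy: f64 = pairs.iter().map(|p| (p.1 - my).powi(2)).sum();
    if sxx <= 0.0 || syy <= 0.0 {
        return None;
    }
    Some(sxy / (sxx * syy).sqrt())
}

/// Autocorrelation of a row-major `map` at lag `d`, pooling horizontal
/// and vertical pairs (an assumed definition).
fn spatial_autocorr(map: &[f64], width: usize, height: usize, d: usize) -> Option<f64> {
    let mut pairs = Vec::new();
    for r in 0..height {
        for c in 0..width {
            let here = map[r * width + c];
            if c + d < width {
                pairs.push((here, map[r * width + c + d]));
            }
            if r + d < height {
                pairs.push((here, map[(r + d) * width + c]));
            }
        }
    }
    pearson(&pairs)
}
```

Under these assumptions, a 28×28 dataset would use `n_pixels = 784` and call `spatial_autocorr(&map, 28, 28, d)` for d in {1, 2, 5}.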

Result: hypothesis confirmed cleanly.

Dataset	autoc d=1	autoc d=2	autoc d=5	Locality
MNIST	0.903	0.728	+0.320	spatial +0.6-0.9 ***
Fashion	0.852	0.652	+0.125	null
KMNIST	0.869	0.601	−0.067	spatial −1.21 ***
EMNIST	0.940	0.780	+0.371	spatial +1.03 ***

The spatial autocorrelation at d=5 ranks the four datasets in exactly the same order as the locality effect.
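The ranking claim can be checked mechanically from the table. The d=5 values below are copied from it; the locality column needs two labeled assumptions to become numeric: MNIST’s "+0.6-0.9" range is represented by its midpoint 0.75, and Fashion’s "null" is treated as 0.0. Neither choice affects the ordering.

```rust
/// Spatial autocorrelation at d=5, from the B31 table.
const AUTOC_D5: [(&str, f64); 4] = [
    ("MNIST", 0.320),
    ("Fashion", 0.125),
    ("KMNIST", -0.067),
    ("EMNIST", 0.371),
];

/// Locality effect in pp. Assumptions: MNIST uses the midpoint of the
/// reported +0.6-0.9 range, and Fashion's null result is encoded as 0.0.
const LOCALITY_PP: [(&str, f64); 4] = [
    ("MNIST", 0.75),
    ("Fashion", 0.0),
    ("KMNIST", -1.21),
    ("EMNIST", 1.03),
];

/// Dataset names ordered from the highest value to the lowest.
fn ranking<'a>(rows: &[(&'a str, f64)]) -> Vec<&'a str> {
    let mut sorted = rows.to_vec();
    sorted.sort_by(|a, b| b.1.total_cmp(&a.1));
    sorted.into_iter().map(|(name, _)| name).collect()
}

/// True when both measures order the datasets identically.
fn same_ranking(a: &[(&str, f64)], b: &[(&str, f64)]) -> bool {
    ranking(a) == ranking(b)
}
```

Both columns give the order EMNIST, MNIST, Fashion, KMNIST.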

Takeaway: locality direction is a measurable data property — discoverable without training. KMNIST’s class-discriminative information is not spatially clustered at the patch scale, so spatial 5×5 patches can’t reliably catch concentrated info; random-index patches do better. The cleanest mechanistic finding Group B has produced.

B32: Multilayer on MNIST with proper schedule

Date: 2026-05-08 Binary: src/bin/multilayer_mnist_schedule.rs Hypothesis: B25’s depth+schedule reversal extends to MNIST.

Result: null.

Config	Mean ± std
M=0	97.89 ± 0.17
M=64	98.00 ± 0.15

Δ = +0.11pp, t = +0.89, ns. With proper schedule, depth is null on MNIST — the linear baseline was already near saturation around 98% for this patch capacity. Schedule fixes the under-training, but there’s no additional gain to extract.

B33: Rectangular patches on EMNIST balanced

Date: 2026-05-08 Binary: src/bin/rect_emnist.rs Hypothesis: EMNIST is hand-printed letters+digits, so it should follow MNIST’s wide-preference.

Result: strongly confirms MNIST pattern.

Pair	Δ pp	t	Sig
4×6 / 6×4	+0.03	+0.37	ns
3×9 / 9×3	+0.98	+4.45	***
5×7 / 7×5	+0.87	+2.06	*

4-task picture for rectangular wide-preference: MNIST +0.51 ***, Fashion +0.93 ***, EMNIST +0.98 ***, KMNIST −0.40 *. Rectangular wide-preference holds on 3 of 4 tasks; KMNIST is the only outlier.

B34: Multilayer on EMNIST with proper schedule

Date: 2026-05-08 Binary: src/bin/multilayer_emnist_schedule.rs Hypothesis: depth helps more when there’s more headroom. EMNIST has 22pp of headroom vs KMNIST’s 10pp — should help more.

Result: hypothesis refuted — depth hurts on EMNIST.

Config	Mean ± std
M=0	82.15 ± 0.26
M=64	81.04 ± 0.29

Δ = −1.11pp, t = −6.96, d_z = −3.11, *** — multilayer hurts.

Reframing: simple “depth scales with headroom” is wrong. EMNIST has 47 visually distinct classes; the linear classifier on raw patch features is approximately optimal at this capacity, and adding non-linearity doesn’t add value. KMNIST’s 10 cursive classes share visual components and benefit from compositional features.

B35: Wider multilayer on EMNIST

Date: 2026-05-08 Binary: src/bin/multilayer_emnist_wide.rs Hypothesis: B34’s hurt came from a bottleneck, since M=64 is only slightly wider than the 47-class output. Wider hidden layers should remove that bottleneck.

Setup: same schedule, M ∈ {0, 128, 256}. 5 seeds.

Result: bottleneck hypothesis disproved.

M	Mean ± std	Δ vs M=0
0	82.15 ± 0.26	—
128	81.54 ± 0.12	−0.62 ***
256	81.68 ± 0.12	−0.47

Multilayer still hurts at M=128 and M=256, both well above 47 classes. Depth’s harmfulness on EMNIST isn’t a capacity issue.

Takeaway: depth is task-specific in a way that doesn’t track simple variables (headroom, class count, hidden:output ratio). Some tasks benefit from compositional features (KMNIST), others don’t (MNIST, EMNIST), and the prediction requires understanding the task’s discriminative-feature structure — not just its difficulty.

Synthesis: post B25-B35 transferability picture

Finding	MNIST	KMNIST	EMNIST	Fashion	General?
Patch matchers as primitive	✓	✓	✓	✓	all 4
Rectangular wide-pref	✓	✗ flips	✓	✓	3 of 4
Spatial locality	✓	✗ flips	✓	~ null	2 of 4 (B31 predicts)
Multi-scale	✓	~ ns	~ ns	(untested)	MNIST only
Depth+schedule helps	~ ns	✓ +2.78***	✗ hurts	(untested)	KMNIST only

Of 5 architectural findings tested across multiple datasets, only the bare patch primitive is fully task-general. Every other finding is conditional on the dataset. KMNIST is the most frequent outlier (flips locality, flips rectangular preference, only place depth+schedule clearly helps). For typed-species NEAT integration, the genome should evolve patch geometry, placement strategy, depth, and training schedule per-task rather than locking in MNIST-derived defaults.
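As a rough sketch of what per-task evolution implies for the genome, the four evolvable properties named above could be carried as plain fields. Every type and field name here is hypothetical, and nothing below reflects an existing NEAT integration. The example values reuse the B25 schedule (M=64, 20 epochs, 0.05→0.005); the 5×5 patch shape and the placement are placeholders only.

```rust
/// How a patch picks its pixels (hypothetical names).
#[derive(Debug, Clone, Copy, PartialEq)]
enum Placement {
    Spatial,     // contiguous window in the image
    RandomIndex, // arbitrary pixel indices, no spatial constraint
}

/// Per-task evolvable settings (a sketch, not an existing type).
#[derive(Debug, Clone, PartialEq)]
struct PatchGenome {
    patch_rows: usize,
    patch_cols: usize,
    placement: Placement,
    hidden_units: usize, // 0 means a linear readout (M=0)
    epochs: usize,
    lr_start: f32,
    lr_end: f32,
}

impl PatchGenome {
    /// Basic sanity check against a square image of side `image_side`.
    fn is_valid(&self, image_side: usize) -> bool {
        self.patch_rows >= 1
            && self.patch_cols >= 1
            && self.patch_rows <= image_side
            && self.patch_cols <= image_side
            && self.epochs >= 1
            && self.lr_end > 0.0
            && self.lr_end <= self.lr_start
    }

    fn has_hidden_layer(&self) -> bool {
        self.hidden_units > 0
    }
}

/// Illustrative genome: B25's depth and schedule, placeholder geometry.
fn example_genome() -> PatchGenome {
    PatchGenome {
        patch_rows: 5,
        patch_cols: 5,
        placement: Placement::RandomIndex,
        hidden_units: 64,
        epochs: 20,
        lr_start: 0.05,
        lr_end: 0.005,
    }
}
```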