Raw structured experiment records for the typed-species (Group B) stream. Reproduced exactly as produced.

Group B Experiments

Structured experiment records. Each entry: hypothesis, setup, result, takeaway.


B1: Prototype matcher (raw dot product)

Date: 2026-05-07 Binary: src/bin/proto.rs Hypothesis: A single dot product against the per-class mean training image carries enough signal to classify MNIST above chance.

Setup:

Result: 64.99% accuracy (6499/10000)

Per-class accuracy: 0:97.3, 1:48.1, 2:70.9, 3:73.0, 4:53.6, 5:0.0, 6:80.1, 7:64.8, 8:92.2, 9:65.7

Failure mode: raw dot product is dominated by prototype pixel mass. The “8” and “0” prototypes have the most active pixels and score high against everything. Class 5 hits 0% because the “0” prototype is denser than the “5” prototype in the regions where digit 5 has ink — the 5 image scores higher against “0” than “5”.

Takeaway: there is real signal in dot-product matching, but the unnormalized version is structurally biased toward dense templates. Any honest patch-matcher experiment has to beat the normalized version of this baseline.
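
The density bias described above can be reproduced on a toy example. The sketch below is illustrative only: the 4-pixel vectors and the names `dot`, `cosine`, `classify`, and `toy_prototypes` are hypothetical and are not taken from `proto.rs`. It shows a sparse image that matches a sparse prototype in shape, yet scores higher against a denser prototype under the raw dot product, while cosine similarity picks the matching shape.

```rust
/// Raw dot product between two equal-length vectors.
fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// Cosine similarity; returns 0.0 if either vector has zero norm.
fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let denom = dot(a, a).sqrt() * dot(b, b).sqrt();
    if denom == 0.0 { 0.0 } else { dot(a, b) / denom }
}

/// Index of the prototype with the highest score under `score`.
fn classify(image: &[f32], prototypes: &[Vec<f32>], score: fn(&[f32], &[f32]) -> f32) -> usize {
    let mut best = 0;
    let mut best_score = f32::NEG_INFINITY;
    for (i, p) in prototypes.iter().enumerate() {
        let s = score(image, p);
        if s > best_score {
            best_score = s;
            best = i;
        }
    }
    best
}

/// Toy 4-pixel prototypes (hypothetical data): index 0 is sparse, index 1 is dense.
fn toy_prototypes() -> Vec<Vec<f32>> {
    vec![vec![0.5, 0.5, 0.0, 0.0], vec![0.6, 0.6, 0.9, 0.9]]
}
```

With the image `[1, 1, 0, 0]`, the dot product scores 1.0 against the sparse prototype and 1.2 against the dense one, so the dense template wins; cosine similarity normalizes the prototype mass away and the sparse template wins.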

Next: B2 — same matcher with cosine similarity (or zero-mean prototypes) to set a fair calibration floor.


B2: Normalized prototype matcher

Date: 2026-05-07 Binary: src/bin/proto_norm.rs Hypothesis: cosine similarity removes the magnitude bias that crushed B1, exposing the actual shape-matching signal.

Setup: same 50K/10K split as B1. Three variants in one binary:

Results:

| Variant | Accuracy |
|---|---|
| Raw dot product | 64.99% |
| Cosine similarity | 83.53% |
| Centered + cosine | 58.89% |

Per-class accuracy (cosine): 0:89.2 1:94.5 2:81.6 3:83.1 4:81.5 5:69.1 6:91.5 7:84.5 8:78.0 9:80.4

Failure modes (cosine): confusion is now local and shape-driven — 5↔3 (125 errors), 4↔9 (125), 8↔3 (76). The class-5-always-predicts-0 catastrophe is gone.

Why centered+cosine got worse: centering the prototype without also centering the test image creates a mismatched comparison — centered prototypes have negative values where raw images can only have zero. Image centering would fix it; not pursued because B2a already gave us the clean answer.

Takeaway: 83.53% is the honest calibration floor. Any patch-matcher result that doesn’t clear this is meaningless. Useful reference points still missing: trained linear classifier (~92% expected), small MLP (~98%), single conv layer (>98%).


B3: Prototype features → trained linear discriminator

Date: 2026-05-07 Binary: src/bin/proto_clf.rs Hypothesis: feeding the 10 cosine-similarity scores (B2) into a trained linear softmax classifier instead of taking argmax extracts more signal and clears 83.5% comfortably.

Setup: 784 → 10 frozen cosine-prototype nodes → 10×10 linear classifier + bias, softmax + CE, online SGD (lr=0.5, 5 epochs, seed=0xB3).
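
A minimal sketch of the trainable part of this setup: a 10×10 linear layer with bias, softmax plus cross-entropy, updated one example at a time. The struct `Linear`, its methods, and the zero initialization are assumptions made for illustration; the log does not say how `proto_clf.rs` initializes or stores the weights.

```rust
/// Numerically stable softmax over a small score vector.
fn softmax(z: &[f64]) -> Vec<f64> {
    let m = z.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
    let exps: Vec<f64> = z.iter().map(|v| (v - m).exp()).collect();
    let s: f64 = exps.iter().sum();
    exps.iter().map(|e| e / s).collect()
}

/// 10x10 linear classifier with bias on top of 10 frozen similarity scores.
struct Linear {
    w: [[f64; 10]; 10], // w[class][feature]
    b: [f64; 10],
}

impl Linear {
    /// Zero initialization is an assumption of this sketch.
    fn new() -> Self {
        Linear { w: [[0.0; 10]; 10], b: [0.0; 10] }
    }

    fn logits(&self, x: &[f64; 10]) -> [f64; 10] {
        let mut z = self.b;
        for c in 0..10 {
            for f in 0..10 {
                z[c] += self.w[c][f] * x[f];
            }
        }
        z
    }

    /// One online SGD step on softmax + cross-entropy.
    /// Returns the loss measured before the update.
    fn sgd_step(&mut self, x: &[f64; 10], label: usize, lr: f64) -> f64 {
        let p = softmax(&self.logits(x));
        for c in 0..10 {
            // Gradient of the loss w.r.t. logit c is p[c] - 1{c == label}.
            let g = p[c] - if c == label { 1.0 } else { 0.0 };
            for f in 0..10 {
                self.w[c][f] -= lr * g * x[f];
            }
            self.b[c] -= lr * g;
        }
        -p[label].ln()
    }
}
```

In use, `x` would hold the 10 cosine scores of one image against the frozen class prototypes, and `sgd_step` would be called once per training image per epoch with `lr = 0.5`.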

Result: 84.85% (+1.32pp over B2 argmax). Train accuracy peaked at 80.9%; test peaked at 85.87% in epoch 3 then mildly drifted.

Per-class accuracy: 0:92.2 1:97.6 2:73.5 3:78.5 4:92.4 5:68.5 6:94.5 7:94.1 8:76.7 9:77.7

Learned weights: strongly diagonal (+39 to +57 on diag, mostly small negatives off-diag). The classifier essentially rediscovered argmax with tiny corrections.

Takeaway: whole-image cosine similarity against 10 class means is information-bottlenecked at ~85%. A linear discriminator cannot recover information lost in the 784→10 projection. Breaking past this requires either richer features (more prototypes per class, local patches) or an entirely different feature primitive. The next interesting move is local patches — the original Group B hypothesis — since the failure modes (5↔3, 4↔9, 8↔3) are precisely where local feature detection should help.

Next: B4 candidate — same architecture but with 16-32 randomly-placed 5×5 patch matchers as features instead of 10 whole-image prototypes. If that clears 85%, the patch hypothesis has signal.


B4: Random local patches (frozen) → trained discriminator

Date: 2026-05-07 Binary: src/bin/proto_patches.rs Hypothesis: locality + nonlinearity are sufficient inductive bias to clear the 85% whole-image cosine ceiling, even with random unlearned patches.

Setup: 32 patches × 5×5, He-init random weights, bias=0, uniform random positions. ReLU on patch outputs. Patches frozen. 32→10 linear softmax classifier, online SGD, lr=0.1, 10 epochs, seed=0xB4.

Result: 67.56% (best across epochs: 68.60%) — worse than B2 (84%) and B3 (85%).

Per-class accuracy: 0:78.8 1:91.6 2:62.3 3:58.0 4:46.7 5:64.7 6:65.8 7:82.8 8:50.4 9:71.6

Key diagnostics:

Takeaway: locality + nonlinearity is not sufficient inductive bias on its own at this patch count (32). The content of the filters matters. This is consistent with LeNet-1 relying on learned convolutional filters rather than random ones.

Negative result framing: B4 cleanly rules out “any local features will do.” The next experiment must test whether learned patches break the ceiling.

Next: B5 — same architecture as B4 but with patch weights trainable via backprop. Same patch count and size to isolate the effect of learning from the effect of architecture.

(Diverted: did a cheaper intermediate check first — meaningful but frozen patches.)


B5: Prototype-slice patches (frozen) → trained discriminator

Date: 2026-05-07 Binary: src/bin/proto_slices.rs Hypothesis: B4 failed on filter content, not on locality. Patches that are literal 5×5 slices of class-mean prototypes (frozen) feeding a trained linear discriminator should clear the 85% bottleneck.

Setup: For each class c and slice i (0..N), random (top, left) in [0..23]², copy prototype[c][top..top+5, left..left+5] as patch weights. Cosine similarity at that position. Linear classifier on top of 10N features, online SGD lr=0.1, 10 epochs, seed=0xB5. Swept N ∈ {1, 2, 4, 8, 12, 16, 24, 32, 48, 64}.
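
The slice-and-match step in this setup can be sketched as follows. This is an illustration under assumptions, not the code in `proto_slices.rs`: images are taken to be row-major 28×28 `f32` buffers, and the names `slice_patch` and `patch_feature` are hypothetical.

```rust
const W: usize = 28; // image width and height
const P: usize = 5; // patch side

/// Copy the P x P window at (top, left) out of a row-major 28x28 buffer.
/// Used both to cut a frozen patch from a class prototype and to read the image window.
fn slice_patch(img: &[f32], top: usize, left: usize) -> [f32; P * P] {
    assert!(img.len() == W * W && top + P <= W && left + P <= W);
    let mut out = [0.0f32; P * P];
    for r in 0..P {
        for c in 0..P {
            out[r * P + c] = img[(top + r) * W + (left + c)];
        }
    }
    out
}

/// Cosine similarity between a frozen patch and the image window at the same position.
/// Returns 0.0 when either side is all zeros (a choice made for this sketch).
fn patch_feature(img: &[f32], patch: &[f32; P * P], top: usize, left: usize) -> f32 {
    let win = slice_patch(img, top, left);
    let dot: f32 = win.iter().zip(patch.iter()).map(|(a, b)| a * b).sum();
    let na = win.iter().map(|a| a * a).sum::<f32>().sqrt();
    let nb = patch.iter().map(|b| b * b).sum::<f32>().sqrt();
    if na == 0.0 || nb == 0.0 { 0.0 } else { dot / (na * nb) }
}
```

The largest valid offset is 23 in each direction (28 − 5), which matches the `[0..23]²` range in the setup. Each of the 10N features is one `patch_feature` call with its own prototype slice and position.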

Results (best test accuracy across epochs):

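| N/class | Features | Best % |
|---|---|---|
| 1 | 10 | 55.27 |
| 2 | 20 | 75.58 |
| 4 | 40 | 85.18 |
| 8 | 80 | 89.77 |
| 12 | 120 | 90.43 |
| 16 | 160 | 91.24 |
| 24 | 240 | 91.11 |
| 32 | 320 | 92.22 |
| 48 | 480 | 92.76 |
| 64 | 640 | 93.67 |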

Key observations:

Confound: at N=32, B5 has 8,000 patch params vs. B4’s 800. Some of the ~25pp lift is raw capacity, not just content.

Takeaway: locality + meaningful content > global + meaningful, and both crush locality + random. Confirms patch-matcher hypothesis but doesn’t fully isolate the inductive-bias contribution from the parameter-count contribution.

Next: B6 — 320 random patches matched to B5’s N=32 feature count, to isolate “more capacity” from “meaningful content”.


B6: Random patches, sweep over count

Date: 2026-05-07 Binary: src/bin/proto_patches_sweep.rs Hypothesis: B5’s lift over B4 came from raw feature count, not from meaningful filter content. If random patches at matched count perform similarly to prototype-slice patches, B5’s “meaningful content matters” reading was overstated.

Setup: He-init random 5×5 patches, ReLU, frozen. Linear classifier on top trained with online SGD, lr=0.1, 10 epochs, seed=0xB6. Swept patch count ∈ {32, 80, 160, 320, 640}.

Results:

| Patches | Params | Best % |
|---|---|---|
| 32 | 800 | 66.26 |
| 80 | 2000 | 82.07 |
| 160 | 4000 | 88.85 |
| 320 | 8000 | 92.65 |
| 640 | 16000 | 94.01 |

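Comparison to B5 (prototype-slice) at matched feature counts, using the best-accuracy figures from the B5 and B6 results:

| Features | B5 prototype-slice | B6 random | B6 − B5 |
|---|---|---|---|
| 80 | 89.77 | 82.07 | −7.70 |
| 160 | 91.24 | 88.85 | −2.39 |
| 320 | 92.22 | 92.65 | +0.43 |
| 640 | 93.67 | 94.01 | +0.34 |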

Takeaway: B5’s apparent “meaningful content matters” effect was real only at small feature counts. From 320 features up, random patches match or slightly exceed prototype-slice patches (92.65 vs 92.22 at 320; 94.01 vs 93.67 at 640). The linear classifier finds good combinations from noisy features once capacity is high enough.

Reframing of B5: meaningful local features have a sample-efficiency advantage that vanishes with sufficient feature count, not a fundamental quality advantage.


B7: Trained patches (random init, backprop)

Date: 2026-05-07 Binary: src/bin/proto_patches_trained.rs Hypothesis: SGD-trained patches from random init beat both random-frozen and prototype-slice-frozen at every feature count, especially at small N. Validates the original Group B hypothesis (patch matchers as a learnable typed species).

Setup: Same architecture as B6 (random fixed positions, ReLU patch outputs, linear classifier), but patch weights AND biases train via backprop alongside the classifier. He init for both layers, online SGD lr=0.05 (lower than B6 because two layers train), 10 epochs, seed=0xB7. Swept patch count ∈ {32, 80, 160, 320, 640}.

Results:

| Patches | Total params | Best % |
|---|---|---|
| 32 | 1162 | 85.64 |
| 80 | 2890 | 93.46 |
| 160 | 5770 | 96.04 |
| 320 | 11530 | 97.13 |
| 640 | 23050 | 97.27 |

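Gap over random (B6) at matched count, using the best-accuracy figures from the B6 and B7 results:

| Patches | B7 trained | B6 random | Gap |
|---|---|---|---|
| 32 | 85.64 | 66.26 | +19.38 |
| 80 | 93.46 | 82.07 | +11.39 |
| 160 | 96.04 | 88.85 | +7.19 |
| 320 | 97.13 | 92.65 | +4.48 |
| 640 | 97.27 | 94.01 | +3.26 |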

Takeaway: trained patches dominate everywhere. The gap shrinks at scale (random-with-many-patches eventually approaches usable) but doesn’t close in this range. 97.27% with 640 5×5 trained patches is close to the main NEAT system’s [128] dense-hidden result (98.7%), achieved with locality as inductive bias rather than dense connectivity.
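
The "Total params" figures in the B7 results are consistent with a simple count: patch weights, one bias per patch, and a biased linear classifier over 10 classes. The function below is a sketch of that arithmetic; the name `total_params` and the breakdown are inferred from the numbers rather than taken from the source code.

```rust
/// Trainable parameter count for `n` patches of `h` x `w` pixels, each with a bias,
/// feeding a linear classifier with `classes` outputs and one bias per output.
fn total_params(n: usize, h: usize, w: usize, classes: usize) -> usize {
    let patch_weights = n * h * w;
    let patch_biases = n;
    let classifier = n * classes + classes;
    patch_weights + patch_biases + classifier
}
```

For example, `total_params(32, 5, 5, 10)` gives 800 + 32 + 330 = 1162 and `total_params(640, 5, 5, 10)` gives 23050, matching the first and last rows of the table.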

Status: original Group B hypothesis (patch matchers as a learnable typed species) supported. Ready for integration into the main NEAT stream as a typed-node mutation, if/when that’s the priority.


B8: Patch size sweep (3×3 through 7×7)

Date: 2026-05-07 Binary: src/bin/proto_patches_size.rs Hypothesis: 5×5 was an arbitrary default carried over from B4/B7. Different patch sizes have different parameter efficiencies — smaller patches give more spatial coverage per parameter, larger patches give bigger receptive fields per detector. The cross-product of size and count reveals which trades are worth making.

Setup: same architecture and training as B7 (trained patches, random He init, ReLU, linear discriminator, online SGD, lr=0.05, 10 epochs, seed=0xB8). Crossed sweep: patch size ∈ {3, 4, 5, 6, 7} × patch count ∈ {32, 80, 160, 320, 640}. 25 configurations total.

Results — best test accuracy by (count, size):

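| N | 3×3 | 4×4 | 5×5 | 6×6 | 7×7 |
|---|---|---|---|---|---|
| 32 | 79.95 | 81.32 | 86.93 | 89.15 | 91.52 |
| 80 | 90.49 | 93.37 | 92.33 | 94.32 | 96.33 |
| 160 | 92.28 | 94.65 | 95.44 | 96.14 | 96.82 |
| 320 | 95.85 | 96.69 | 97.12 | 97.08 | 97.53 |
| 640 | 96.54 | 96.96 | 97.70 | 97.71 | 98.04 |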

Parameter-efficient frontier (best config at each budget):

| ≈Params | Config | Accuracy |
|---|---|---|
| 1.6K | 3×3 × 80 | 90.49 |
| 3K | 3×3 × 160 / 5×5 × 80 | 92.3 |
| 6K | 3×3 × 320 | 95.85 |
| 12K | 5×5 × 320 | 97.12 |
| 20K | 7×7 × 320 | 97.53 |
| 38K | 7×7 × 640 | 98.04 |

Key findings:

  1. At low parameter budgets, smaller patches at higher count win — coverage beats receptive field when params are scarce.
  2. At ~12K+ params, 5×5/6×6/7×7 all reach 97%, with 7×7 leading on absolute accuracy but at higher param cost.
  3. 3×3 has a hard ceiling around 96.5% — receptive field too small to capture enough digit structure regardless of count.
  4. All sizes saturate at 96.5%-98% by N=640. Architectural ceiling not removed by more patch parameters of any size.

Takeaway: patch size is a meaningful axis (not a hyperparameter to fix). For typed-NEAT integration, the genome should support mutations adding patches of varied sizes and let evolution pick the mix. Default single-size choice if forced: 5×5 — never far from optimal across the budget range, with 4×4 close behind. Multi-scale > single-scale.

Status: enriches the B7 result with parameter-efficiency data. Confirms 5×5 was a reasonable default, identifies the size-vs-count tradeoff structure for downstream design decisions.


B9: Rectangular patches (single-seed sweep)

Date: 2026-05-07 Binary: src/bin/proto_patches_rect.rs Hypothesis: rectangular patches at matched parameter counts may differ from square ones; horizontal vs. vertical orientation may also matter.

Setup: same training as B7/B8. 11 shapes covering matched-area tiers (16/21, 24/27, 35/36) with horizontal/vertical mirror pairs. Two patch counts (N=160, N=320). Single seed (0xB9).

Headline single-seed results (N=320):

Status: suggestive but unreliable. With per-config std ~0.15pp, sub-0.3pp gaps are noise. Triggered B9-stats.


B9-stats: Rectangular patches with paired multi-seed stats

Date: 2026-05-07 Binary: src/bin/proto_patches_rect_stats.rs Hypothesis: confirm or refute the B9 single-seed gaps with multi-seed paired comparisons. Stand up reusable stats helpers (SampleStats, PairedComparison).

Setup: 8 shapes × 5 seeds (0xB9–0xBD) at N=320. Paired design — same seeds across all shapes so within-seed noise cancels in differences. All statistics constant-time given running sums.

Per-shape mean ± std (5 seeds): 5×5 97.06±0.14, 6×6 97.34±0.16, 4×6 97.04±0.13, 6×4 96.86±0.13, 3×9 97.10±0.19, 9×3 96.60±0.16, 5×7 97.38±0.19, 7×5 97.35±0.11.

Paired wide-minus-tall (Δ pp, t, d_z, sig):

| Pair | Δ | t | d_z | Sig |
|---|---|---|---|---|
| 4×6 / 6×4 | +0.180 | 1.63 | 0.73 | ns |
| 3×9 / 9×3 | +0.506 | 4.22 | 1.89 | *** |
| 5×7 / 7×5 | +0.038 | 0.37 | 0.17 | ns |

Key corrections to B9:

  1. The B9 “5×7 N=320 = 97.66%” headline was the lucky-seed max; mean is 97.38%, statistically tied with 6×6. Single-seed best-of-sweep is unreliable.
  2. The B9 5×7 > 7×5 finding was noise (multi-seed Δ=+0.04, ns).
  3. The B9 3×9 ≫ 9×3 finding survived rigorously: very-large effect (d_z=1.89), highly significant (***).

Methodology lesson: any quantitative claim about a difference under ~0.3pp needs multi-seed paired stats. Stats infrastructure now reusable.
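
A minimal sketch of what running-sum helpers of this kind might look like. Only the names `SampleStats` and `PairedComparison` come from the log; the fields and methods below are assumptions for illustration. The paired statistics follow the usual definitions: d_z is the mean per-seed difference divided by the sample standard deviation of the differences, and t = d_z · √n, which matches the tables here (for example 1.89 × √5 ≈ 4.23).

```rust
/// Running-sum accumulator: O(1) per push, O(1) to read mean and sample std.
#[derive(Default, Clone, Copy)]
struct SampleStats {
    n: f64,
    sum: f64,
    sum_sq: f64,
}

impl SampleStats {
    fn push(&mut self, x: f64) {
        self.n += 1.0;
        self.sum += x;
        self.sum_sq += x * x;
    }

    fn mean(&self) -> f64 {
        self.sum / self.n
    }

    /// Sample standard deviation (n - 1 denominator), clamped at zero
    /// to absorb rounding error in the running sums.
    fn std(&self) -> f64 {
        let var = (self.sum_sq - self.sum * self.sum / self.n) / (self.n - 1.0);
        var.max(0.0).sqrt()
    }
}

/// Paired comparison over per-seed differences a_i - b_i.
#[derive(Default)]
struct PairedComparison {
    diffs: SampleStats,
}

impl PairedComparison {
    fn push(&mut self, a: f64, b: f64) {
        self.diffs.push(a - b);
    }

    /// Mean paired difference.
    fn delta(&self) -> f64 {
        self.diffs.mean()
    }

    /// Standardized paired effect size.
    fn d_z(&self) -> f64 {
        self.diffs.mean() / self.diffs.std()
    }

    /// Paired t statistic with n - 1 degrees of freedom.
    fn t(&self) -> f64 {
        self.d_z() * self.diffs.n.sqrt()
    }
}
```

The sketch deliberately stops at t and d_z. With five seeds the statistic has 4 degrees of freedom, for which the two-sided 5% critical value is about 2.776; the log does not state which thresholds its star levels use, so no significance labels are assigned here.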

Status: established that the wide preference is real at extreme aspect ratios on MNIST. Triggered B10 (Fashion replication) to test the digit-stroke mechanistic hypothesis.


B10: Rectangular patches on Fashion-MNIST

Date: 2026-05-07 Binary: src/bin/proto_patches_rect_stats_fashion.rs Hypothesis: if 3×9 ≫ 9×3 on MNIST is digit-stroke-specific, Fashion should show a different pattern. If it persists, the effect is general to 28×28 grayscale image classification.

Setup: identical to B9-stats — 8 shapes × 5 seeds (0xB9–0xBD) at N=320, same training regime. Only the data path changed.

Per-shape mean ± std (Fashion, 5 seeds, N=320): 5×5 86.45±0.33, 6×6 86.47±0.28, 4×6 86.53±0.43, 6×4 86.23±0.22, 3×9 86.37±0.17, 9×3 85.44±0.59, 5×7 86.37±0.30, 7×5 86.13±0.34.

Paired wide-minus-tall (Fashion):

| Pair | Δ | t | d_z | Sig |
|---|---|---|---|---|
| 4×6 / 6×4 | +0.300 | 1.18 | 0.53 | ns |
| 3×9 / 9×3 | +0.930 | 3.32 | 1.48 | *** |
| 5×7 / 7×5 | +0.236 | 1.08 | 0.48 | ns |

Cross-task comparison:

| Pair | MNIST Δ | Fashion Δ |
|---|---|---|
| 4×6 vs 6×4 | +0.18 | +0.30 |
| 3×9 vs 9×3 | +0.51 | +0.93 |
| 5×7 vs 7×5 | +0.04 | +0.24 |

Findings:

  1. The 1:3 wide-preference effect replicates and strengthens on Fashion (+0.93pp vs. +0.51pp).
  2. All three pairs trend wide > tall on both datasets; moderate-aspect pairs are non-significant on both.
  3. 9×3 is genuinely unstable on Fashion — std=0.59, the highest of any shape.
  4. Fashion is harder overall (~86% vs ~97%) and noisier (std ~2× larger).

Refuted hypothesis: digit-stroke geometry as the mechanism (B9 single-seed framing).

Refined hypothesis: kernels should be perpendicular to the dominant feature orientation. Both MNIST and Fashion have predominantly vertical structural features; tall patches lie along these features and waste capacity, while wide patches cut across them and capture transitions. Classical filter-design wisdom rediscovered via SGD.

Status: hypothesis is now directly testable — rotating the input 90° should flip the preference (B11 candidate), and a more granular aspect-ratio sweep should show the effect grows monotonically with ratio extremity (B12 candidate).


B11: Rotated MNIST

Date: 2026-05-07 Binary: src/bin/proto_patches_rect_rotated.rs Hypothesis: if the wide preference is “kernels perpendicular to the dominant feature orientation,” rotating MNIST 90° CW should flip the dominant feature orientation and the preference should reverse — tall > wide on rotated data.

Setup: identical to B9-stats (8 shapes × 5 seeds at N=320) except every image is rotated 90° CW (new[c][27-r] = old[r][c]) at load time.
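
The index mapping in this setup can be written out directly. The sketch below assumes row-major 28×28 `u8` images; the function name `rotate_cw` is hypothetical and the actual loader in `proto_patches_rect_rotated.rs` may store pixels differently.

```rust
const N: usize = 28;

/// Rotate a row-major N x N image 90 degrees clockwise: new[c][N-1-r] = old[r][c].
fn rotate_cw(old: &[u8]) -> Vec<u8> {
    assert_eq!(old.len(), N * N);
    let mut new = vec![0u8; N * N];
    for r in 0..N {
        for c in 0..N {
            new[c * N + (N - 1 - r)] = old[r * N + c];
        }
    }
    new
}
```

Under this mapping the top-left pixel moves to the top-right corner and the top-right pixel moves to the bottom-right corner, which is what a clockwise quarter turn should do. Applying the function four times returns the original image.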

Per-shape mean ± std: 5×5 97.04±0.15, 6×6 97.45±0.14, 4×6 97.02±0.15, 6×4 96.99±0.22, 3×9 96.67±0.21, 9×3 97.07±0.23, 5×7 97.34±0.06, 7×5 97.44±0.13.

Paired wide-minus-tall (rotated):

| Pair | Δ | t | d_z | Sig | vs. upright |
|---|---|---|---|---|---|
| 4×6 / 6×4 | +0.022 | 0.25 | 0.11 | ns | was +0.18 ns |
| 3×9 / 9×3 | −0.402 | −2.27 | −1.02 | * | was +0.51 ***, sign flipped |
| 5×7 / 7×5 | −0.106 | −1.77 | −0.79 | ns | was +0.04 ns, direction flipped |

Finding: the 3×9 vs 9×3 effect cleanly reversed sign on rotated MNIST (Δ went from +0.51pp *** to −0.40pp *). The pattern moves with the data — confirming the mechanistic claim that this is feature-orientation-driven, not architectural bias or placement geometry.

Magnitude is smaller after rotation (0.40 vs 0.51): plausibly because rotated digits are slightly off-distribution (training-data conventions are axis-specific), or because vertical-feature dominance has a partly geometric component that rotation doesn’t fully invert. The sign — what the hypothesis predicts — is unambiguous.

Status: mechanistic hypothesis from B10 confirmed. The wide-vs-tall asymmetry is genuinely orientation-driven.


B12: Extreme aspect ratios

Date: 2026-05-07 Binary: src/bin/proto_patches_rect_aspect.rs Hypothesis: the wide preference grows monotonically with aspect-ratio extremity — gap should increase 1:3 → 1:9 → 1:15 → 1:21.

Setup: 9 shapes × 5 seeds at N=320 on MNIST. Mirror pairs at 1:3 (3×9/9×3, reference), 1:9 (1×9/9×1), 1:15, 1:21. Plus 5×5 baseline.

Per-shape mean ± std: 5×5 97.06±0.14, 3×9 97.10±0.19, 9×3 96.60±0.16, 1×9 94.38±0.16, 9×1 94.36±0.16, 1×15 95.21±0.23, 15×1 94.70±0.06, 1×21 95.31±0.09, 21×1 94.62±0.25.

Paired wide-minus-tall:

| Pair | Aspect | Δ | t | d_z | Sig |
|---|---|---|---|---|---|
| 3×9 / 9×3 | 1:3 | +0.506 | 4.22 | 1.89 | *** |
| 1×9 / 9×1 | 1:9 | +0.012 | 0.11 | 0.05 | ns |
| 1×15 / 15×1 | 1:15 | +0.518 | 4.17 | 1.86 | *** |
| 1×21 / 21×1 | 1:21 | +0.690 | 4.62 | 2.07 | *** |

Finding: monotonicity prediction refuted. There is a null at 1:9 sandwiched between significant effects at 1:3 and 1:15+. The pattern is large → null → large → larger.

The 1×9 vs 9×1 pair is the only one with both minimal (1 px) perpendicular extent AND short length (9 px). All other pairs have either ≥3 px perpendicular thickness (3×9, 9×3) or ≥15 px length (1×15, 1×21). Either property recovers the effect; having neither kills it.

Refined mechanistic story (post-B12): the wide preference at extreme aspect ratios is at least two distinct phenomena:

  1. For thick rectangular patches (≥3 px perpendicular extent): B10’s perpendicular-to-feature argument.
  2. For 1-pixel-thick long strips: cross-section sampling — horizontal strips capture intensity profile across digit width (discriminative), vertical strips capture intensity profile across digit height (less discriminative because all digits are ~similar height).

Both produce wide > tall on MNIST, but for different reasons.

Other observations:

Status: closes the rectangular-patches arc. Real mechanistic finding (B11 confirmed) with a richer-than-expected structure (B12 surprise: not single-mechanism). Worth banking; future work could disentangle the two mechanisms more cleanly but marginal value vs other Group B questions is small.


B13 / A3: Random-index “patches” — does spatial contiguity matter?

Date: 2026-05-08 Binary: src/bin/proto_patches_random_idx.rs Hypothesis: at 320 trained patches, spatial contiguity is irrelevant — “patches” are just sparse linear features and SGD finds good combinations regardless of input pixel layout.

Setup: head-to-head spatial 5×5 vs 25-random-pixel-index “patches” at N=320, identical training, 5 seeds paired.
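
The two conditions differ only in which 25 input pixels each "patch" reads. A sketch of that difference, under assumptions: the random indices are taken to be distinct within a patch (the log does not say), the `XorShift` generator is a stand-in so the example needs no external crates, and all names are hypothetical.

```rust
use std::collections::HashSet;

const W: usize = 28;

/// Tiny xorshift64 generator (a stand-in, not the log's RNG). Seed must be nonzero.
struct XorShift(u64);

impl XorShift {
    fn next(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }

    fn below(&mut self, n: usize) -> usize {
        (self.next() % n as u64) as usize
    }
}

/// Flat pixel indices of a contiguous `size` x `size` window at (top, left).
fn spatial_indices(top: usize, left: usize, size: usize) -> Vec<usize> {
    let mut idx = Vec::with_capacity(size * size);
    for r in 0..size {
        for c in 0..size {
            idx.push((top + r) * W + left + c);
        }
    }
    idx
}

/// `k` distinct pixel indices drawn from the whole image (distinctness is assumed).
fn random_indices(k: usize, rng: &mut XorShift) -> Vec<usize> {
    let mut seen = HashSet::new();
    let mut idx = Vec::with_capacity(k);
    while idx.len() < k {
        let i = rng.below(W * W);
        if seen.insert(i) {
            idx.push(i);
        }
    }
    idx
}

/// One patch output: weighted sum over its index list plus bias, then ReLU.
fn patch_output(img: &[f32], idx: &[usize], weights: &[f32], bias: f32) -> f32 {
    let s: f32 = idx.iter().zip(weights).map(|(&i, w)| img[i] * w).sum();
    (s + bias).max(0.0)
}
```

Because `patch_output` is identical for both index lists, parameter count and training are the same in both conditions; only the spatial arrangement of the inputs changes.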

Result: hypothesis refuted.

| Config | Mean ± std |
|---|---|
| Spatial 5×5 | 96.996 ± 0.246 |
| Indexed 25 | 96.386 ± 0.147 |

Δ = +0.610pp, t = +7.91, d_z = +3.54 (***) — spatial wins decisively.

Takeaway: the locality inductive bias matters even at 320 trained patches. Triggered B16 (cross-task) and B17 (across sizes).


B14 / A1: Multi-scale patches

Date: 2026-05-08 Binary: src/bin/proto_patches_multiscale.rs Hypothesis: mixing patch sizes (1/3 each at 3×3, 5×5, 7×7) beats single-scale 5×5 at matched parameter count.

Setup: single-scale 5×5 vs mixed thirds at two patch counts (N=240, N=480). 5 seeds paired.

Results:

| N | Single 5×5 | Mixed 3/5/7 | Δ (mixed − single) | t | d_z | Sig |
|---|---|---|---|---|---|---|
| 240 | 96.40 ± 0.16 | 96.93 ± 0.22 | +0.530 | +4.40 | +1.97 | *** |
| 480 | 97.58 ± 0.14 | 97.46 ± 0.15 | −0.114 | −1.36 | −0.61 | ns |

Takeaway: multi-scale wins decisively at low N (+0.53pp ***), but the advantage vanishes at high N. Multi-scale is a low-capacity phenomenon — receptive-field diversity helps when no single size has enough patches to fully exploit it. Rhymes with B5/B6 pattern (smarter feature design helps when capacity is scarce; doesn’t matter when abundant).
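
One detail worth checking when reading this comparison is how closely the two arms are matched. The hypothesis speaks of matched parameter count, while the setup compares at equal patch count N. The sketch below counts patch weights under the assumption that "mixed thirds" means N/3 patches at each of 3×3, 5×5 and 7×7; the function name `patch_weights` is hypothetical.

```rust
/// Total patch weights (biases and classifier excluded) for a mix of
/// (count, side) groups of square patches.
fn patch_weights(groups: &[(usize, usize)]) -> usize {
    groups.iter().map(|&(count, side)| count * side * side).sum()
}
```

Under that assumption, at N=240 the single-scale arm has `patch_weights(&[(240, 5)])` = 6000 weights and the mixed arm has `patch_weights(&[(80, 3), (80, 5), (80, 7)])` = 6640, about 11% more. Bias and classifier counts are identical in both arms, so the arms are matched in patch count and only approximately in parameters.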


B15 / A2: Multi-layer patches (hidden ReLU layer)

Date: 2026-05-08 Binary: src/bin/proto_patches_multilayer.rs Hypothesis: adding a hidden ReLU layer between patches and linear classifier breaks past the ~97% ceiling — the linear head was the bottleneck.

Setup: 5×5 × 320 patches → M ReLU hidden → 10 linear softmax. M ∈ {0, 32, 64, 128}. Same training as B7/B8. 5 seeds.

Results:

| M | Mean ± std | Total params | Δ vs M=0 |
|---|---|---|---|
| 0 | 97.00 ± 0.25 | 11,530 | |
| 32 | 95.30 ± 0.45 | 18,922 | −1.69pp |
| 64 | 95.29 ± 0.54 | 29,514 | −1.71pp |
| 128 | 95.44 ± 0.16 | 50,698 | −1.55pp |

Takeaway: depth hurts uniformly at fixed training budget. The hidden layer’s extra parameters can’t be properly trained at 10 epochs / fixed LR. Opposite of main NEAT stream’s depth result ([128, 64] beat [128]) because main stream has 1.8M steps with LR decay vs our 500K with fixed LR. Conditional negative — depth in this minimal framework needs retuned hyperparameters before being declared dead.


B16: A3 on Fashion-MNIST

Date: 2026-05-08 Binary: src/bin/proto_patches_random_idx_fashion.rs Hypothesis: B13’s spatial-contiguity advantage replicates cross-task on Fashion-MNIST.

Setup: identical to B13, only data path changed.

Result:

| Config | Mean ± std |
|---|---|
| Spatial 5×5 | 86.36 ± 0.33 |
| Indexed 25 | 86.48 ± 0.19 |

Δ = −0.124pp, t = −0.71, d_z = −0.32 (ns) — sign even slightly negative.

Takeaway: the locality advantage is MNIST-specific. Does not replicate on Fashion. Inverse of the rectangular-patch finding (which replicated and strengthened on Fashion). Plausible mechanism: MNIST has very high local pixel correlation in stroke regions; Fashion has more textural variation, so adjacent pixels carry less correlated information. Random-index “patches” become effectively global pixel fingerprints, which is competitive with local-feature detection on Fashion.


B17: A3 across patch sizes (MNIST)

Date: 2026-05-08 Binary: src/bin/proto_patches_random_idx_sizes.rs Hypothesis: B13’s contiguity advantage isn’t 5×5-specific — it should exist at 3×3 and 7×7 too.

Setup: spatial vs indexed at sizes ∈ {3, 5, 7}. 320 patches each, 5 seeds paired.

Results:

| Size | Spatial | Indexed | Δ pp | t | d_z | Sig |
|---|---|---|---|---|---|---|
| 3×3 | 95.50 ± 0.28 | 94.91 ± 0.25 | +0.596 | +4.26 | +1.91 | *** |
| 5×5 | 97.21 ± 0.14 | 96.30 ± 0.12 | +0.912 | +18.30 | +8.18 | *** |
| 7×7 | 97.62 ± 0.19 | 96.92 ± 0.14 | +0.694 | +7.30 | +3.26 | *** |

Takeaway: contiguity advantage is robust at every patch size on MNIST, peaking at 5×5 (d_z=+8.18 — extraordinarily consistent). Magnitudes 0.6-0.9pp.

Methodological footnote: B13 reported Δ=+0.61pp at 5×5; B17 with a different seed-offset scheme (different positions / weight inits, same base seeds) gave +0.91pp at the same configuration. Even paired multi-seed Δ has ~0.2-0.3pp uncertainty in its precise magnitude — sign and significance are robust, but exact values need more samples to nail down.

Status: closes the A3 thread. Locality is a real, significant, robust advantage on MNIST across patch sizes — and is absent on Fashion. The task-specific transferability profile is itself the most informative finding.


B18: Task difficulty calibration

Date: 2026-05-08 Binary: src/bin/calibrate_tasks.rs Hypothesis: identify a harder workhorse task to replace MNIST, which is saturating around 97-98% and eating differential signal in architectural comparisons.

Setup: trained 5×5 patches at N ∈ {80, 320}, single seed, on 9 task variants spanning single datasets (MNIST, Fashion, KMNIST, EMNIST balanced), pairs (M+F, M+K, K+F), triple (M+F+K), and the full quad. Required new src/data/mixed.rs loader.

Headline results: KMNIST is the cleanest single-task workhorse (8.3pp spread N=80→320, ~90% ceiling). MNIST+KMNIST hits 92.62% at N=320 — exactly the user’s target zone. EMNIST balanced (47 classes) gives the largest single-task spread (10.1pp) but introduces class-count confounds.

Status: established the new task battery for B19+.


B19: Locality on KMNIST

Date: 2026-05-08 Binary: src/bin/locality_kmnist.rs Hypothesis: KMNIST has stroke-like local structure similar to MNIST, so spatial 5×5 should still beat 25-random-pixel-index.

Setup: identical to B13 except task = KMNIST. 320 patches × 5 seeds.

Result: hypothesis refuted in dramatic fashion.

| Config | Mean ± std |
|---|---|
| Spatial 5×5 | 90.322 ± 0.215 |
| Indexed 25 | 91.534 ± 0.353 |

Δ = −1.212pp, t = −7.98, d_z = −3.54, *** — sign flipped from MNIST. Indexed beats spatial by even more than spatial beat indexed on MNIST.

Takeaway: spatial locality is harmful on cursive Japanese characters. The MNIST locality advantage doesn’t generalize even to other 28×28 grayscale stroke data; cursive versus hand-printed script appears to make the difference.


B20: Locality on EMNIST balanced

Date: 2026-05-08 Binary: src/bin/locality_emnist.rs Hypothesis: EMNIST is hand-printed letters+digits, so it should follow the MNIST pattern (spatial wins).

Setup: identical to B13 except task = EMNIST balanced (47 classes, ~94K train images). 320 patches × 5 seeds.

Result:

| Config | Mean ± std |
|---|---|
| Spatial 5×5 | 77.212 ± 0.366 |
| Indexed 25 | 76.185 ± 0.089 |

Δ = +1.027pp, t = +5.53, d_z = +2.47, *** — same direction as MNIST.

Takeaway: spatial wins on hand-printed character data regardless of class count (10 → 47). The MNIST and EMNIST patterns are consistent. Together with B19’s flip on KMNIST, this gives a mechanistic reading: hand-printed characters have local pixel correlations that 5×5 receptive fields exploit; cursive characters apparently don’t.


B21: Multi-scale on KMNIST

Date: 2026-05-08 Binary: src/bin/multiscale_kmnist.rs Hypothesis: B14’s “mixing 3/5/7 wins at low N” effect on MNIST should replicate on KMNIST, where there’s more headroom.

Setup: single-scale 5×5 vs mixed thirds at three patch counts. 5 paired seeds.

Results:

| N | Single 5×5 | Mixed 3/5/7 | Δ | t | Sig |
|---|---|---|---|---|---|
| 120 | 85.75 ± 0.31 | 86.06 ± 0.31 | +0.31 | +1.35 | ns |
| 240 | 89.71 ± 0.19 | 89.46 ± 0.17 | −0.25 | −1.78 | ns |
| 480 | 91.84 ± 0.25 | 92.12 ± 0.38 | +0.28 | +1.56 | ns |

All three patch counts non-significant. On MNIST B14 had Δ=+0.53pp *** at N=240.

Takeaway: the multi-scale-wins-at-low-N pattern was MNIST-specific. Receptive-field diversity didn’t help on KMNIST at any tested capacity. Mildly disappointing for the typed-species “evolve a mix” hypothesis.


B22: Locality on MNIST+KMNIST mix

Date: 2026-05-08 Binary: src/bin/locality_mnist_kmnist.rs Hypothesis: combining tasks where locality has opposite-sign effects (MNIST +, KMNIST −) — does one dominate, do they average to null, or does interaction emerge?

Setup: identical to B13 except task = MNIST + KMNIST (20 classes, 100K train). 5 paired seeds at N=320.

Result:

| Config | Mean ± std |
|---|---|
| Spatial 5×5 | 92.971 ± 0.497 |
| Indexed 25 | 92.879 ± 0.101 |

Δ = +0.092pp, t = +0.42, d_z = +0.19, ns — essentially zero.

Takeaway: opposite-direction effects from each constituent task cancel almost perfectly when mixed. The patch architecture sees both data distributions and finds intermediate behavior. No emergent property from mixing — it’s just an average. Useful confirmation that the task-specificity is genuinely about the data structure, not architecture-task interaction.


B23: Locality across patch sizes on KMNIST

Date: 2026-05-08 Binary: src/bin/locality_kmnist_sizes.rs Hypothesis: B19’s KMNIST flip is size-robust (not specific to 5×5).

Setup: spatial vs indexed at sizes ∈ {3, 5, 7}, 320 patches × 5 paired seeds.

Results:

| Size | Spatial | Indexed | Δ pp | t | d_z | Sig |
|---|---|---|---|---|---|---|
| 3×3 | 86.90 ± 0.25 | 88.25 ± 0.28 | −1.35 | −6.82 | −3.05 | *** |
| 5×5 | 90.33 ± 0.14 | 91.50 ± 0.16 | −1.16 | −13.43 | −6.00 | *** |
| 7×7 | 91.73 ± 0.33 | 93.12 ± 0.21 | −1.38 | −6.95 | −3.11 | *** |

Takeaway: KMNIST locality flip is robust at every patch size, with comparable magnitude (1.16-1.38pp) and all ***. MNIST B17 and KMNIST B23 are mirror images: same setup, opposite-signed effects of similar magnitude. The locality direction tracks data structure (hand-printed vs cursive characters), not architectural choice.


B24: Multilayer on KMNIST

Date: 2026-05-08 Binary: src/bin/multilayer_kmnist.rs Hypothesis: B15’s multilayer hurt on MNIST might have been driven by MNIST saturation (linear classifier already nearly optimal). KMNIST has 7pp more headroom — the hidden layer should help here if A2’s negative was capacity-driven.

Setup: 5×5 × 320 patches → M ReLU hidden → softmax. M ∈ {0, 64, 128}. 5 seeds.

Results:

| M | Mean ± std | Δ vs M=0 |
|---|---|---|
| 0 | 90.32 ± 0.22 | |
| 64 | 88.47 ± 1.35 | −1.85pp |
| 128 | 89.17 ± 0.44 | −1.15pp |

Takeaway: hidden layer hurts on KMNIST too, with similar magnitude as MNIST. B15’s negative result is task-general, not MNIST-specific. The under-training hypothesis is supported — at fixed training budget (10 epochs, fixed LR), additional parameters can’t be properly trained. For depth to help, the training schedule needs to grow with the architecture (more epochs, LR decay).

Status: confirms B15’s reading. Conditional negative becomes robust negative.


Synthesis: post B18-B24 transferability picture

| Finding | MNIST | Holds on harder tasks? |
|---|---|---|
| Patch matchers as a primitive | works | Yes (B18 calibration) |
| Multi-scale at low N (B14) | +0.53pp *** | NO: all ns on KMNIST |
| Spatial locality (B13/B17) | +0.6-0.9pp *** at every size | NO: flips sign on KMNIST, null on Fashion, replicates on EMNIST |
| Multilayer hurt at fixed budget (B15) | −1.5-1.7pp | YES: replicates on KMNIST |
| Rectangular wide preference (B9-stats/B10/B11) | +0.51pp *** | YES (Fashion replicates, rotation flips) |

Two task-general findings out of five tested. The locality finding’s sign flip is the most striking single result of Group B to date — it transforms what looked like a general patch-architecture property into a data-distribution-dependent one.


B25: Multilayer on KMNIST with proper training schedule

Date: 2026-05-08 Binary: src/bin/multilayer_kmnist_schedule.rs Hypothesis: B15 / B24 found multilayer hurts at fixed 10-epoch / fixed-LR budget; mechanism was under-training. With 20 epochs + linear LR decay 0.05→0.005, does depth pay off?

Setup: identical to B24 except 20 epochs and decaying LR. M ∈ {0, 64}, 5 paired seeds.
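
A linear decay of this kind can be sketched in a few lines. Whether the log decays the rate per SGD step or per epoch is not stated, so the function below is agnostic about the unit: `step` and `total` can count either. The name `linear_lr` is hypothetical.

```rust
/// Linearly interpolate the learning rate from `start` (at step 0)
/// to `end` (at the last step, `total - 1`). Steps past the end clamp to `end`.
fn linear_lr(step: usize, total: usize, start: f64, end: f64) -> f64 {
    if total <= 1 {
        return start;
    }
    let t = step.min(total - 1) as f64 / (total - 1) as f64;
    start + (end - start) * t
}
```

With per-epoch decay this would be called as `linear_lr(epoch, 20, 0.05, 0.005)`; with per-step decay, `total` would be the number of SGD updates across all 20 epochs.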

Result: hypothesis confirmed dramatically.

| Config | Mean ± std |
|---|---|
| M=0 (linear) | 92.96 ± 0.16 |
| M=64 (multilayer) | 95.74 ± 0.17 |

Δ = +2.78pp, t = +24.50, d_z = +10.96, *** — multilayer helps enormously with proper schedule.

Comparison to B24 (10 epochs fixed LR): M=0 was 90.32%, M=64 was 88.47% (Δ = −1.85pp). Both improve with schedule, but M=64 improves by +7.3pp vs M=0’s +2.6pp.

Major correction: B15’s “depth hurts” reading was a training-budget artifact. With proper schedule, depth is the biggest single architectural win on KMNIST.


B27: Pixel-correlation probe (null result)

Date: 2026-05-08 Binary: src/bin/pixel_correlations.rs Hypothesis: MNIST/EMNIST have higher local pixel correlation than KMNIST, explaining the locality direction.

Result: refuted. MNIST and KMNIST have nearly identical adjacent-pixel correlations (r=0.808 vs 0.789). Fashion has the highest of any dataset (r=0.846 H, 0.898 V) yet locality is null there.

Takeaway: simple pairwise pixel adjacency doesn’t predict locality direction. The mechanism is more subtle — see B31.


B28: Scaling sweep on KMNIST and EMNIST

Date: 2026-05-08 Binary: src/bin/scaling_kmnist_emnist.rs Hypothesis: establish the canonical N-sweep curve on the new workhorses (B7-equivalent).

Result: filled in N ∈ {32, 160, 640} alongside B18’s {80, 320}.

| N | KMNIST | EMNIST |
|---|---|---|
| 32 | 72.14% | 52.40% |
| 80 | 81.88% | 67.56% |
| 160 | 87.59% | 74.12% |
| 320 | 90.19% | 77.69% |
| 640 | 92.46% | 79.67% |

KMNIST: 20pp spread (72→92%). EMNIST: 27pp spread (52→80%) but lower ceiling.


B29: Rectangular patches on KMNIST

Date: 2026-05-08 Binary: src/bin/rect_kmnist.rs Hypothesis: B9-stats / B10 / B11’s wide-preference at 1:3 aspect was task-general. Does it hold on KMNIST?

Result: NO — sign flipped.

| Pair | Δ pp | t | Sig |
|---|---|---|---|
| 4×6 / 6×4 | −0.27 | −1.32 | ns |
| 3×9 / 9×3 | −0.40 | −2.07 | * |
| 5×7 / 7×5 | +0.18 | +0.70 | ns |

On KMNIST, tall narrow patches beat wide flat ones at extreme aspect (−0.40pp *). Opposite of upright MNIST and Fashion; same sign and size as rotated MNIST (B11: −0.40pp *).

Takeaway: rectangular wide-preference is mostly task-general but flips on cursive Japanese characters — consistent with KMNIST having a different dominant feature orientation than MNIST/Fashion.


B30: Multi-scale on EMNIST

Date: 2026-05-08 Binary: src/bin/multiscale_emnist.rs Hypothesis: KMNIST’s null result (B21) was specific to KMNIST. EMNIST is hand-printed (non-cursive) letters+digits like MNIST, so multi-scale should replicate.

Result: NO — all ns on EMNIST too.

N Single 5×5 Mixed 3/5/7 Δ Sig
120 70.73 70.15 −0.58 ns
240 75.62 75.75 +0.13 ns
480 78.40 78.62 +0.22 ns

Takeaway: B14’s multi-scale advantage has so far appeared only on MNIST. It does not replicate on KMNIST (B21) or EMNIST, and Fashion is untested.


B31: Per-pixel class-discriminability and spatial structure

Date: 2026-05-08 Binary: src/bin/pixel_discriminability.rs Hypothesis: spatial autocorrelation of class-discriminability at the patch scale predicts locality direction.

Setup: for each pixel position, compute an F-like ratio of between-class variance to within-class variance. Then compute the spatial autocorrelation of this discriminability map at pixel distances 1, 2, and 5.
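
The setup above can be sketched roughly as follows. Two choices are assumptions rather than something this log specifies: the F-like ratio omits degrees-of-freedom scaling, and the autocorrelation at distance d is a Pearson correlation over pixel pairs separated by d horizontally or vertically. Both function names are hypothetical.

```rust
/// Per-pixel F-like ratio: between-class variance of the class means divided
/// by the pooled within-class variance. `labels[i]` is the class of `images[i]`.
fn discriminability_map(
    images: &[Vec<f32>],
    labels: &[usize],
    n_classes: usize,
    n_pixels: usize,
) -> Vec<f64> {
    let mut count = vec![0.0f64; n_classes];
    let mut sum = vec![vec![0.0f64; n_pixels]; n_classes];
    let mut sum_sq = vec![vec![0.0f64; n_pixels]; n_classes];
    for (img, &c) in images.iter().zip(labels) {
        count[c] += 1.0;
        for (p, &v) in img.iter().enumerate() {
            let v = v as f64;
            sum[c][p] += v;
            sum_sq[c][p] += v * v;
        }
    }
    let total: f64 = count.iter().sum();
    (0..n_pixels)
        .map(|p| {
            let grand = (0..n_classes).map(|c| sum[c][p]).sum::<f64>() / total;
            let (mut between, mut within) = (0.0f64, 0.0f64);
            for c in 0..n_classes {
                if count[c] == 0.0 {
                    continue;
                }
                let mean = sum[c][p] / count[c];
                between += count[c] * (mean - grand).powi(2);
                within += sum_sq[c][p] - count[c] * mean * mean;
            }
            // Small epsilon keeps always-blank border pixels finite.
            (between / total) / (within / total + 1e-8)
        })
        .collect()
}

/// Pearson correlation of the map with itself shifted by `d` pixels,
/// pooling horizontal and vertical shifts. Returns NaN if no pairs exist.
fn spatial_autocorrelation(map: &[f64], width: usize, height: usize, d: usize) -> f64 {
    let mut pairs: Vec<(f64, f64)> = Vec::new();
    for r in 0..height {
        for c in 0..width {
            let here = map[r * width + c];
            if c + d < width {
                pairs.push((here, map[r * width + c + d]));
            }
            if r + d < height {
                pairs.push((here, map[(r + d) * width + c]));
            }
        }
    }
    let n = pairs.len() as f64;
    let mx = pairs.iter().map(|p| p.0).sum::<f64>() / n;
    let my = pairs.iter().map(|p| p.1).sum::<f64>() / n;
    let cov: f64 = pairs.iter().map(|p| (p.0 - mx) * (p.1 - my)).sum();
    let vx: f64 = pairs.iter().map(|p| (p.0 - mx).powi(2)).sum();
    let vy: f64 = pairs.iter().map(|p| (p.1 - my).powi(2)).sum();
    cov / (vx * vy).sqrt()
}
```

For 28×28 inputs one would call `discriminability_map(&images, &labels, 10, 784)` and then `spatial_autocorrelation(&map, 28, 28, d)` for d in 1, 2, 5. No training is involved, which is why the takeaway can describe the property as discoverable from the data alone.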

Result: hypothesis confirmed cleanly.

Dataset autoc d=1 autoc d=2 autoc d=5 Locality
MNIST 0.903 0.728 +0.320 spatial +0.6-0.9 ***
Fashion 0.852 0.652 +0.125 null
KMNIST 0.869 0.601 −0.067 spatial −1.21 ***
EMNIST 0.940 0.780 +0.371 spatial +1.03 ***

The spatial autocorrelation at d=5 ranks the four datasets in exactly the same order as the locality effect: EMNIST > MNIST > Fashion > KMNIST.

Takeaway: locality direction is a measurable data property — discoverable without training. KMNIST’s class-discriminative information is not spatially clustered at the patch scale, so spatial 5×5 patches can’t reliably catch concentrated info; random-index patches do better. The cleanest mechanistic finding Group B has produced.


B32: Multilayer on MNIST with proper schedule

Date: 2026-05-08 Binary: src/bin/multilayer_mnist_schedule.rs Hypothesis: B25’s depth+schedule reversal extends to MNIST.

Result: null.

Config Mean ± std
M=0 97.89 ± 0.17
M=64 98.00 ± 0.15

Δ = +0.11pp, t = +0.89, ns. With proper schedule, depth is null on MNIST — the linear baseline was already near saturation around 98% for this patch capacity. Schedule fixes the under-training, but there’s no additional gain to extract.


B33: Rectangular patches on EMNIST balanced

Date: 2026-05-08 Binary: src/bin/rect_emnist.rs Hypothesis: EMNIST is hand-printed (non-cursive) letters+digits, so it should follow MNIST’s wide-preference.

Result: strongly confirms MNIST pattern.

Pair Δ pp t Sig
4×6 / 6×4 +0.03 +0.37 ns
3×9 / 9×3 +0.98 +4.45 ***
5×7 / 7×5 +0.87 +2.06 *

4-task picture for rectangular wide-preference: MNIST +0.51 ***, Fashion +0.93 ***, EMNIST +0.98 ***, KMNIST −0.40 *. Rectangular wide-preference holds on 3 of 4 tasks; KMNIST is the only outlier.


B34: Multilayer on EMNIST with proper schedule

Date: 2026-05-08 Binary: src/bin/multilayer_emnist_schedule.rs Hypothesis: depth helps more when there’s more headroom. EMNIST has 22pp of headroom vs KMNIST’s 10pp — should help more.

Result: hypothesis refuted — depth hurts on EMNIST.

Config Mean ± std
M=0 82.15 ± 0.26
M=64 81.04 ± 0.29

Δ = −1.11pp, t = −6.96, d_z = −3.11, *** — multilayer hurts.

Reframing: simple “depth scales with headroom” is wrong. EMNIST has 47 visually distinct classes; the linear classifier on raw patch features is approximately optimal at this capacity, and adding non-linearity doesn’t add value. KMNIST’s 10 cursive classes share visual components and benefit from compositional features.


B35: Wider multilayer on EMNIST

Date: 2026-05-08 Binary: src/bin/multilayer_emnist_wide.rs Hypothesis: B34’s hurt came from a bottleneck, since M=64 is only slightly wider than the 47-class output layer. Wider hidden layers should remove that bottleneck.

Setup: same schedule, M ∈ {0, 128, 256}. 5 seeds.

Result: bottleneck hypothesis disproved.

M Mean ± std Δ vs M=0
0 82.15 ± 0.26
128 81.54 ± 0.12 −0.62 ***
256 81.68 ± 0.12 −0.47

Multilayer still hurts at M=128 and M=256, both well above 47 classes. Depth’s harmfulness on EMNIST isn’t a capacity issue.

Takeaway: depth is task-specific in a way that doesn’t track simple variables (headroom, class count, hidden:output ratio). Some tasks benefit from compositional features (KMNIST), others don’t (MNIST, EMNIST), and the prediction requires understanding the task’s discriminative-feature structure — not just its difficulty.


Synthesis: post B25-B35 transferability picture

Finding MNIST KMNIST EMNIST Fashion General?
Patch matchers as primitive ✓ ✓ ✓ ✓ all 4
Rectangular wide-pref ✓ ✗ flips ✓ ✓ 3 of 4
Spatial locality ✓ ✗ flips ✓ ~ null 2 of 4 (B31 predicts)
Multi-scale ✓ ~ ns ~ ns (untested) MNIST only
Depth+schedule helps ~ ns ✓ +2.78*** ✗ hurts (untested) KMNIST only

Of 5 architectural findings tested across multiple datasets, only the bare patch primitive is fully task-general. Every other finding is conditional. KMNIST is the most frequent outlier (it flips locality, flips rectangular preference, and is the only dataset where depth+schedule clearly helps). For typed-species NEAT integration, the genome should evolve patch geometry, placement strategy, depth, and training schedule per-task rather than locking in MNIST-derived defaults.
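
As a purely illustrative sketch of what "evolve per-task" could mean as a data structure, the four axes named above might be carried as genome fields like this. Every type, field, and variant name here is hypothetical and not taken from the Group B code. `RandomIndex` presumably corresponds to the random-index patches mentioned in B31, and `Scheduled` is only a stand-in because the schedule's details are not given in this section.

```rust
/// Hypothetical placement strategies for a patch.
#[derive(Debug, Clone, PartialEq)]
enum Placement {
    /// Contiguous patch at a location in the image.
    Spatial,
    /// Patch assembled from randomly chosen pixel indices.
    RandomIndex,
}

/// Hypothetical training-schedule gene.
#[derive(Debug, Clone, PartialEq)]
enum Schedule {
    /// Fixed learning rate for a set number of epochs.
    FixedLr { epochs: u32 },
    /// Placeholder for a learning-rate schedule; details are assumed.
    Scheduled { epochs: u32 },
}

/// Hypothetical per-task genome covering the four axes from the synthesis.
#[derive(Debug, Clone, PartialEq)]
struct PatchGenome {
    patch_rows: usize,
    patch_cols: usize,
    placement: Placement,
    /// Hidden width M; 0 means a linear readout with no hidden layer.
    hidden_width: usize,
    schedule: Schedule,
}

impl PatchGenome {
    /// True if the patch fits inside a square image with the given side.
    fn fits(&self, side: usize) -> bool {
        self.patch_rows >= 1
            && self.patch_cols >= 1
            && self.patch_rows <= side
            && self.patch_cols <= side
    }

    /// True if the genome encodes the linear (M=0) configuration.
    fn is_linear(&self) -> bool {
        self.hidden_width == 0
    }
}
```

Keeping each axis as its own field lets a mutation operator change one of them independently, which fits the observation that the four datasets disagree on different axes.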