Raw structured experiment records for the typed-species (Group B) stream. Reproduced exactly as produced.
Group B Experiments
Structured experiment records. Each entry: hypothesis, setup, result, takeaway.
B1: Prototype matcher (raw dot product)
Date: 2026-05-07
Binary: src/bin/proto.rs
Hypothesis: A single dot product against the per-class mean training image carries enough signal to classify MNIST above chance.
Setup:
- 50K MNIST training images → 10 prototypes (per-class pixel mean, no normalization)
- 10K held-out test images (last 10K of train-images-idx3-ubyte)
- Classification: argmax over 10 raw dot products
Result: 64.99% accuracy (6499/10000)
Per-class accuracy: 0:97.3, 1:48.1, 2:70.9, 3:73.0, 4:53.6, 5:0.0, 6:80.1, 7:64.8, 8:92.2, 9:65.7
Failure mode: raw dot product is dominated by prototype pixel mass. The “8” and “0” prototypes have the most active pixels and score high against everything. Class 5 hits 0% because the “0” prototype is denser than the “5” prototype in the regions where digit 5 has ink — the 5 image scores higher against “0” than “5”.
Takeaway: there is real signal in dot-product matching, but the unnormalized version is structurally biased toward dense templates. Any honest patch-matcher experiment has to beat the normalized version of this baseline.
Next: B2 — same matcher with cosine similarity (or zero-mean prototypes) to set a fair calibration floor.
B2: Normalized prototype matcher
Date: 2026-05-07
Binary: src/bin/proto_norm.rs
Hypothesis: cosine similarity removes the magnitude bias that crushed B1, exposing the actual shape-matching signal.
Setup: same 50K/10K split as B1. Three variants in one binary:
- raw dot product (B1 sanity check)
- cosine similarity: <p,x> / (||p|| ||x||)
- centered prototypes + cosine: subtract grand-mean prototype before normalizing
Results:
| Variant | Accuracy |
|---|---|
| Raw dot product | 64.99% |
| Cosine similarity | 83.53% |
| Centered + cosine | 58.89% |
Per-class accuracy (cosine): 0:89.2 1:94.5 2:81.6 3:83.1 4:81.5 5:69.1 6:91.5 7:84.5 8:78.0 9:80.4
Failure modes (cosine): confusion is now local and shape-driven — 5↔3 (125 errors), 4↔9 (125), 8↔3 (76). The class-5-always-predicts-0 catastrophe is gone.
Why centered+cosine got worse: centering the prototype without also centering the test image creates a mismatched comparison — centered prototypes have negative values where raw images can only have zero. Image centering would fix it; not pursued because B2a already gave us the clean answer.
Takeaway: 83.53% is the honest calibration floor. Any patch-matcher result that doesn’t clear this is meaningless. Useful reference points still missing: trained linear classifier (~92% expected), small MLP (~98%), single conv layer (>98%).
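The raw and cosine scoring rules compared in B1 and B2 can be sketched as below. This is a minimal illustration, not the code in src/bin/proto_norm.rs: the function names (`dot`, `cosine`, `classify`) and the zero-norm guard are assumptions, and the four-pixel vectors in the usage note are toy values chosen only to show the density bias.

```rust
/// Dot product of two equal-length pixel vectors.
fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// Cosine similarity <p, x> / (||p|| ||x||).
/// Assumption: returns 0.0 when either vector is all zeros.
fn cosine(p: &[f32], x: &[f32]) -> f32 {
    let denom = dot(p, p).sqrt() * dot(x, x).sqrt();
    if denom == 0.0 {
        0.0
    } else {
        dot(p, x) / denom
    }
}

/// Index of the prototype with the highest score (first one wins on ties).
fn classify(prototypes: &[Vec<f32>], x: &[f32], score: fn(&[f32], &[f32]) -> f32) -> usize {
    let mut best = 0;
    for (i, p) in prototypes.iter().enumerate() {
        if score(p, x) > score(&prototypes[best], x) {
            best = i;
        }
    }
    best
}
```

With a dense prototype `[0.9, 0.9, 0.9, 0.9]`, a sparse prototype `[0.0, 0.8, 0.8, 0.0]`, and an image `[0.0, 1.0, 1.0, 0.0]`, `classify(.., dot)` picks the dense prototype while `classify(.., cosine)` picks the sparse one whose shape actually matches.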
B3: Prototype features → trained linear discriminator
Date: 2026-05-07
Binary: src/bin/proto_clf.rs
Hypothesis: feeding the 10 cosine-similarity scores (B2) into a trained linear softmax classifier instead of taking argmax extracts more signal and clears 83.5% comfortably.
Setup: 784 → 10 frozen cosine-prototype nodes → 10×10 linear classifier + bias, softmax + CE, online SGD (lr=0.5, 5 epochs, seed=0xB3).
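One online SGD step of a linear softmax classifier with cross-entropy loss might look like the sketch below. The `Linear` type, its row-major weight layout, and the zero initialization are assumptions for illustration; the record does not state how the B3 classifier was initialized.

```rust
/// Linear softmax classifier: `w` is row-major [classes x features], one bias per class.
struct Linear {
    w: Vec<f32>,
    b: Vec<f32>,
    features: usize,
}

impl Linear {
    /// Assumption: zero-initialized weights and biases.
    fn new(features: usize, classes: usize) -> Self {
        Linear { w: vec![0.0; features * classes], b: vec![0.0; classes], features }
    }

    /// Class probabilities via a numerically stable softmax.
    fn probs(&self, x: &[f32]) -> Vec<f32> {
        let logits: Vec<f32> = self
            .b
            .iter()
            .enumerate()
            .map(|(k, bk)| {
                let row = &self.w[k * self.features..(k + 1) * self.features];
                bk + row.iter().zip(x).map(|(w, xi)| w * xi).sum::<f32>()
            })
            .collect();
        let max = logits.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
        let exps: Vec<f32> = logits.iter().map(|l| (l - max).exp()).collect();
        let z: f32 = exps.iter().sum();
        exps.iter().map(|e| e / z).collect()
    }

    /// One online SGD step on cross-entropy; returns the loss measured before the update.
    fn sgd_step(&mut self, x: &[f32], label: usize, lr: f32) -> f32 {
        let p = self.probs(x);
        for k in 0..p.len() {
            // Gradient of CE with respect to logit k is p_k - 1[k == label].
            let g = p[k] - if k == label { 1.0 } else { 0.0 };
            for (j, xi) in x.iter().enumerate() {
                self.w[k * self.features + j] -= lr * g * xi;
            }
            self.b[k] -= lr * g;
        }
        -p[label].ln()
    }
}
```

For the B3 shape this would be `Linear::new(10, 10)`, fed the 10 cosine scores per image with `lr = 0.5`.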
Result: 84.85% (+1.32pp over B2 argmax). Train accuracy peaked at 80.9%; test peaked at 85.87% in epoch 3 then mildly drifted.
Per-class accuracy: 0:92.2 1:97.6 2:73.5 3:78.5 4:92.4 5:68.5 6:94.5 7:94.1 8:76.7 9:77.7
Learned weights: strongly diagonal (+39 to +57 on diag, mostly small negatives off-diag). The classifier essentially rediscovered argmax with tiny corrections.
Takeaway: whole-image cosine similarity against 10 class means is information-bottlenecked at ~85%. A linear discriminator cannot recover information lost in the 784→10 projection. Breaking past this requires either richer features (more prototypes per class, local patches) or an entirely different feature primitive. The next interesting move is local patches — the original Group B hypothesis — since the failure modes (5↔3, 4↔9, 8↔3) are precisely where local feature detection should help.
Next: B4 candidate — same architecture but with 16-32 randomly-placed 5×5 patch matchers as features instead of 10 whole-image prototypes. If that clears 85%, the patch hypothesis has signal.
B4: Random local patches (frozen) → trained discriminator
Date: 2026-05-07
Binary: src/bin/proto_patches.rs
Hypothesis: locality + nonlinearity are sufficient inductive bias to clear the 85% whole-image cosine ceiling, even with random unlearned patches.
Setup: 32 patches × 5×5, He-init random weights, bias=0, uniform random positions. ReLU on patch outputs. Patches frozen. 32→10 linear softmax classifier, online SGD, lr=0.1, 10 epochs, seed=0xB4.
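A single frozen patch matcher with a ReLU output can be sketched as follows. The `Patch` struct and `active_count` helper are hypothetical names, and weight initialization is left to the caller; only the response computation described in the setup is shown.

```rust
const SIDE: usize = 28; // 28x28 images, stored row-major

/// A k x k patch matcher anchored at (top, left); weights are row-major.
struct Patch {
    top: usize,
    left: usize,
    k: usize,
    weights: Vec<f32>,
    bias: f32,
}

impl Patch {
    /// ReLU(bias + sum of weight * pixel over the k x k window).
    fn response(&self, image: &[f32]) -> f32 {
        let mut acc = self.bias;
        for r in 0..self.k {
            for c in 0..self.k {
                let pixel = image[(self.top + r) * SIDE + (self.left + c)];
                acc += self.weights[r * self.k + c] * pixel;
            }
        }
        acc.max(0.0)
    }
}

/// Number of patches with a nonzero response on one image.
fn active_count(patches: &[Patch], image: &[f32]) -> usize {
    patches.iter().filter(|p| p.response(image) > 0.0).count()
}
```

With bias 0, a patch that lands entirely on background pixels always outputs 0, so a count like `active_count` is one way to measure how many patches engage per image.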
Result: 67.56% (best across epochs: 68.60%) — worse than B2 (84%) and B3 (85%).
Per-class accuracy: 0:78.8 1:91.6 2:62.3 3:58.0 4:46.7 5:64.7 6:65.8 7:82.8 8:50.4 9:71.6
Key diagnostics:
- Only 10.9 of 32 patches fire per image on average — most positions land on background.
- Total patch params: 832 (vs. B3’s 7840), most of which never engage.
- Random weights have no semantic content; downstream classifier sees noise.
Takeaway: locality + nonlinearity is not sufficient inductive bias on its own. The content of the filters matters. This is consistent with LeNet-style networks relying on learned convolutional filters rather than random ones.
Negative result framing: B4 cleanly rules out “any local features will do.” The next experiment must test whether learned patches break the ceiling.
Next: B5 — same architecture as B4 but with patch weights trainable via backprop. Same patch count and size to isolate the effect of learning from the effect of architecture.
(Diverted: did a cheaper intermediate check first — meaningful but frozen patches.)
B5: Prototype-slice patches (frozen) → trained discriminator
Date: 2026-05-07
Binary: src/bin/proto_slices.rs
Hypothesis: B4 failed on filter content, not on locality. Patches that are literal 5×5 slices of class-mean prototypes (frozen) feeding a trained linear discriminator should clear the 85% bottleneck.
Setup: For each class c and slice i (0..N), random (top, left) in [0..23]², copy prototype[c][top..top+5, left..left+5] as patch weights. Cosine similarity at that position. Linear classifier on top of 10N features, online SGD lr=0.1, 10 epochs, seed=0xB5. Swept N ∈ {1, 2, 4, 8, 12, 16, 24, 32, 48, 64}.
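The slice-and-match step can be sketched as below: copy a 5×5 window out of a class prototype, then score an image by cosine similarity against the window at the same position. The helper names (`slice_patch`, `norm`, `slice_feature`) and the zero-norm guard are assumptions, not taken from src/bin/proto_slices.rs.

```rust
const SIDE: usize = 28; // 28x28 images, stored row-major
const K: usize = 5; // patch side length

/// Copy source[top..top+5, left..left+5] into a 25-element row-major patch.
/// Valid positions are top, left in 0..=23 so the window stays inside the image.
fn slice_patch(source: &[f32], top: usize, left: usize) -> Vec<f32> {
    let mut patch = Vec::with_capacity(K * K);
    for r in 0..K {
        let start = (top + r) * SIDE + left;
        patch.extend_from_slice(&source[start..start + K]);
    }
    patch
}

/// Euclidean norm of a vector.
fn norm(v: &[f32]) -> f32 {
    v.iter().map(|a| a * a).sum::<f32>().sqrt()
}

/// Cosine similarity between a prototype slice and the image window at the same position.
/// Assumption: returns 0.0 when either side is all zeros (for example a background window).
fn slice_feature(patch: &[f32], image: &[f32], top: usize, left: usize) -> f32 {
    let window = slice_patch(image, top, left);
    let dot: f32 = patch.iter().zip(&window).map(|(p, x)| p * x).sum();
    let denom = norm(patch) * norm(&window);
    if denom == 0.0 {
        0.0
    } else {
        dot / denom
    }
}
```

With N slices per class, calling `slice_feature` once per slice would yield the 10N features that feed the linear classifier.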
Results (best test accuracy across epochs):
| N/class | Features | Best % |
|---|---|---|
| 1 | 10 | 55.27 |
| 2 | 20 | 75.58 |
| 4 | 40 | 85.18 |
| 8 | 80 | 89.77 |
| 12 | 120 | 90.43 |
| 16 | 160 | 91.24 |
| 24 | 240 | 91.11 |
| 32 | 320 | 92.22 |
| 48 | 480 | 92.76 |
| 64 | 640 | 93.67 |
Key observations:
- Crosses B3 (84.85%) at N≈4, decisively beats it from N=8 onward.
- Compared to B4 (random local, 32 features, 67.56%): B5 at matched feature count is ~80% — meaningful content adds ~12pp at fixed scale.
- No magic number. Smooth log-shaped curve, diminishing returns past N=12-16.
- Best Group B result so far (93.67%), still with frozen feature layer.
Confound: at N=32, B5 has 8,000 patch params vs. B4’s 800. Some of the ~25pp lift is raw capacity, not just content.
Takeaway: locality + meaningful content > global + meaningful, and both crush locality + random. Confirms patch-matcher hypothesis but doesn’t fully isolate the inductive-bias contribution from the parameter-count contribution.
Next: B6 — 320 random patches matched to B5’s N=32 feature count, to isolate “more capacity” from “meaningful content”.
B6: Random patches, sweep over count
Date: 2026-05-07
Binary: src/bin/proto_patches_sweep.rs
Hypothesis: B5’s lift over B4 came from raw feature count, not from meaningful filter content. If random patches at matched count perform similarly to prototype-slice patches, B5’s “meaningful content matters” reading was overstated.
Setup: He-init random 5×5 patches, ReLU, frozen. Linear classifier on top trained with online SGD, lr=0.1, 10 epochs, seed=0xB6. Swept patch count ∈ {32, 80, 160, 320, 640}.
Results:
| Patches | Params | Best % |
|---|---|---|
| 32 | 800 | 66.26 |
| 80 | 2000 | 82.07 |
| 160 | 4000 | 88.85 |
| 320 | 8000 | 92.65 |
| 640 | 16000 | 94.01 |
Comparison to B5 (prototype-slice) at matched feature counts:
- 80: B6 82.07% vs B5 89.77% (B5 wins by 7.70pp)
- 160: B6 88.85% vs B5 91.24% (B5 wins by 2.39pp)
- 320: B6 92.65% vs B5 92.22% (B6 ahead by 0.43pp)
- 640: B6 94.01% vs B5 93.67% (B6 ahead by 0.34pp)
Takeaway: B5’s apparent “meaningful content matters” effect was real only at small feature counts. At ≥320 features, random patches catch up; at 640 they slightly exceed prototype-slice patches. The linear classifier finds good combinations from noisy features once capacity is high enough.
Reframing of B5: meaningful local features have a sample-efficiency advantage that vanishes with sufficient feature count, not a fundamental quality advantage.
B7: Trained patches (random init, backprop)
Date: 2026-05-07
Binary: src/bin/proto_patches_trained.rs
Hypothesis: SGD-trained patches from random init beat both random-frozen and prototype-slice-frozen at every feature count, especially at small N. Validates the original Group B hypothesis (patch matchers as a learnable typed species).
Setup: Same architecture as B6 (random fixed positions, ReLU patch outputs, linear classifier), but patch weights AND biases train via backprop alongside the classifier. He init for both layers, online SGD lr=0.05 (lower than B6 because two layers train), 10 epochs, seed=0xB7. Swept patch count ∈ {32, 80, 160, 320, 640}.
Results:
| Patches | Total params | Best % |
|---|---|---|
| 32 | 1162 | 85.64 |
| 80 | 2890 | 93.46 |
| 160 | 5770 | 96.04 |
| 320 | 11530 | 97.13 |
| 640 | 23050 | 97.27 |
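The Total params column is consistent with a simple count: each patch contributes its weights plus one bias, and the 10-class linear classifier contributes one weight per (patch, class) pair plus one bias per class. A small sketch, with `total_params` as a hypothetical helper name:

```rust
/// Trainable parameters for `n` patches of size `h x w` feeding a linear classifier.
/// Each patch has h*w weights plus one bias; the classifier has n*classes weights
/// plus one bias per class.
fn total_params(n: usize, h: usize, w: usize, classes: usize) -> usize {
    let patch_params = n * (h * w + 1);
    let classifier_params = n * classes + classes;
    patch_params + classifier_params
}
```

For example, `total_params(32, 5, 5, 10)` gives 32 × 26 + 330 = 1162, matching the first row of the table.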
Gap over random (B6) at matched count:
- 32: +19.4pp
- 80: +11.4pp
- 160: +7.2pp
- 320: +4.5pp
- 640: +3.3pp
Takeaway: trained patches dominate everywhere. The gap shrinks at scale (random-with-many-patches eventually approaches usable) but doesn’t close in this range. 97.27% with 640 5×5 trained patches is close to the main NEAT system’s [128] dense-hidden result (98.7%), achieved with locality as inductive bias rather than dense connectivity.
Status: original Group B hypothesis (patch matchers as a learnable typed species) supported. Ready for integration into the main NEAT stream as a typed-node mutation, if/when that’s the priority.
B8: Patch size sweep (3×3 through 7×7)
Date: 2026-05-07
Binary: src/bin/proto_patches_size.rs
Hypothesis: 5×5 was an arbitrary default carried over from B4/B7. Different patch sizes have different parameter efficiencies — smaller patches give more spatial coverage per parameter, larger patches give bigger receptive fields per detector. The cross-product of size and count reveals which trades are worth making.
Setup: same architecture and training as B7 (trained patches, random He init, ReLU, linear discriminator, online SGD, lr=0.05, 10 epochs, seed=0xB8). Crossed sweep: patch size ∈ {3, 4, 5, 6, 7} × patch count ∈ {32, 80, 160, 320, 640}. 25 configurations total.
Results — best test accuracy by (count, size):
| N | 3×3 | 4×4 | 5×5 | 6×6 | 7×7 |
|---|---|---|---|---|---|
| 32 | 79.95 | 81.32 | 86.93 | 89.15 | 91.52 |
| 80 | 90.49 | 93.37 | 92.33 | 94.32 | 96.33 |
| 160 | 92.28 | 94.65 | 95.44 | 96.14 | 96.82 |
| 320 | 95.85 | 96.69 | 97.12 | 97.08 | 97.53 |
| 640 | 96.54 | 96.96 | 97.70 | 97.71 | 98.04 |
Parameter-efficient frontier (best config at each budget):
| ≈Params | Config | Accuracy |
|---|---|---|
| 1.6K | 3×3 × 80 | 90.49 |
| 3K | 3×3 × 160 / 5×5 × 80 | 92.3 |
| 6K | 3×3 × 320 | 95.85 |
| 12K | 5×5 × 320 | 97.12 |
| 20K | 7×7 × 320 | 97.53 |
| 38K | 7×7 × 640 | 98.04 |
Key findings:
- At low parameter budgets, smaller patches at higher count win — coverage beats receptive field when params are scarce.
- At ~12K+ params, 5×5/6×6/7×7 all reach 97%, with 7×7 leading on absolute accuracy but at higher param cost.
- 3×3 has a hard ceiling around 96.5% — receptive field too small to capture enough digit structure regardless of count.
- All sizes saturate at 96.5%-98% by N=640. Architectural ceiling not removed by more patch parameters of any size.
Takeaway: patch size is a meaningful axis (not a hyperparameter to fix). For typed-NEAT integration, the genome should support mutations adding patches of varied sizes and let evolution pick the mix. Default single-size choice if forced: 5×5 — never far from optimal across the budget range, with 4×4 close behind. Multi-scale > single-scale.
Status: enriches the B7 result with parameter-efficiency data. Confirms 5×5 was a reasonable default, identifies the size-vs-count tradeoff structure for downstream design decisions.
B9: Rectangular patches (single-seed sweep)
Date: 2026-05-07
Binary: src/bin/proto_patches_rect.rs
Hypothesis: rectangular patches at matched parameter counts may differ from square ones; horizontal vs. vertical orientation may also matter.
Setup: same training as B7/B8. 11 shapes covering matched-area tiers (16/21, 24/27, 35/36) with horizontal/vertical mirror pairs. Two patch counts (N=160, N=320). Single seed (0xB9).
Headline single-seed results (N=320):
- 5×7 reached 97.66% at 14730 params — the best result of the sweep
- 3×9 vs 9×3 gap: +0.87pp (wide wins)
- 5×7 vs 7×5 gap: +0.38pp (wide wins)
- 4×6 vs 6×4 gap: +0.01pp, 3×7 vs 7×3 gap: +0.06pp (no preference)
Status: suggestive but unreliable. With per-config std ~0.15pp, sub-0.3pp gaps are noise. Triggered B9-stats.
B9-stats: Rectangular patches with paired multi-seed stats
Date: 2026-05-07
Binary: src/bin/proto_patches_rect_stats.rs
Hypothesis: confirm or refute the B9 single-seed gaps with multi-seed paired comparisons. Stand up reusable stats helpers (SampleStats, PairedComparison).
Setup: 8 shapes × 5 seeds (0xB9–0xBD) at N=320. Paired design — same seeds across all shapes so within-seed noise cancels in differences. All statistics constant-time given running sums.
Per-shape mean ± std (5 seeds): 5×5 97.06±0.14, 6×6 97.34±0.16, 4×6 97.04±0.13, 6×4 96.86±0.13, 3×9 97.10±0.19, 9×3 96.60±0.16, 5×7 97.38±0.19, 7×5 97.35±0.11.
Paired wide-minus-tall (Δ pp, t, d_z, sig):
| Pair | Δ | t | d_z | Sig |
|---|---|---|---|---|
| 4×6 / 6×4 | +0.180 | 1.63 | 0.73 | ns |
| 3×9 / 9×3 | +0.506 | 4.22 | 1.89 | *** |
| 5×7 / 7×5 | +0.038 | 0.37 | 0.17 | ns |
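The paired statistics in the table can be computed from three running sums over the per-seed differences. The sketch below is an illustration under that assumption, not the project's `PairedComparison` helper: the type name `PairedDiffs` and its methods are hypothetical. Note that t = d_z · √n, which is why the t and d_z columns above differ by a factor of about √5 ≈ 2.24 for 5 seeds.

```rust
/// Running sums over paired differences; each statistic is O(1) once the sums exist.
#[derive(Default)]
struct PairedDiffs {
    n: f64,
    sum: f64,
    sum_sq: f64,
}

impl PairedDiffs {
    /// Record one seed's (wide, tall) accuracy pair.
    fn push(&mut self, wide: f64, tall: f64) {
        let d = wide - tall;
        self.n += 1.0;
        self.sum += d;
        self.sum_sq += d * d;
    }

    /// Mean paired difference.
    fn mean(&self) -> f64 {
        self.sum / self.n
    }

    /// Sample standard deviation of the differences (n - 1 denominator).
    fn std(&self) -> f64 {
        let var = (self.sum_sq - self.sum * self.sum / self.n) / (self.n - 1.0);
        var.max(0.0).sqrt()
    }

    /// Cohen's d_z: mean difference in units of the difference std.
    fn d_z(&self) -> f64 {
        self.mean() / self.std()
    }

    /// Paired t statistic with n - 1 degrees of freedom.
    fn t(&self) -> f64 {
        self.d_z() * self.n.sqrt()
    }
}
```

Pushing one (wide, tall) accuracy pair per seed and then reading `mean()`, `t()`, and `d_z()` would fill one row of the table; mapping t to a significance level needs a t-distribution with n − 1 degrees of freedom, which is not shown here.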
Key corrections to B9:
- The B9 “5×7 N=320 = 97.66%” headline was the lucky-seed max; mean is 97.38%, statistically tied with 6×6. Single-seed best-of-sweep is unreliable.
- The B9 5×7 > 7×5 finding was noise (multi-seed Δ=+0.04, ns).
- The B9 3×9 ≫ 9×3 finding survived rigorously: very-large effect (d_z=1.89), highly significant (***).
Methodology lesson: any quantitative claim about a difference under ~0.3pp needs multi-seed paired stats. Stats infrastructure now reusable.
Status: established that the wide preference is real at extreme aspect ratios on MNIST. Triggered B10 (Fashion replication) to test the digit-stroke mechanistic hypothesis.
B10: Rectangular patches on Fashion-MNIST
Date: 2026-05-07
Binary: src/bin/proto_patches_rect_stats_fashion.rs
Hypothesis: if 3×9 ≫ 9×3 on MNIST is digit-stroke-specific, Fashion should show a different pattern. If it persists, the effect is general to 28×28 grayscale image classification.
Setup: identical to B9-stats — 8 shapes × 5 seeds (0xB9–0xBD) at N=320, same training regime. Only the data path changed.
Per-shape mean ± std (Fashion, 5 seeds, N=320): 5×5 86.45±0.33, 6×6 86.47±0.28, 4×6 86.53±0.43, 6×4 86.23±0.22, 3×9 86.37±0.17, 9×3 85.44±0.59, 5×7 86.37±0.30, 7×5 86.13±0.34.
Paired wide-minus-tall (Fashion):
| Pair | Δ | t | d_z | Sig |
|---|---|---|---|---|
| 4×6 / 6×4 | +0.300 | 1.18 | 0.53 | ns |
| 3×9 / 9×3 | +0.930 | 3.32 | 1.48 | *** |
| 5×7 / 7×5 | +0.236 | 1.08 | 0.48 | ns |
Cross-task comparison:
| Pair | MNIST Δ | Fashion Δ |
|---|---|---|
| 4×6 vs 6×4 | +0.18 | +0.30 |
| 3×9 vs 9×3 | +0.51 | +0.93 |
| 5×7 vs 7×5 | +0.04 | +0.24 |
Findings:
- The 1:3 wide-preference effect replicates and strengthens on Fashion (+0.93pp vs. +0.51pp).
- All three pairs trend wide > tall on both datasets; moderate-aspect pairs are non-significant on both.
- 9×3 is genuinely unstable on Fashion — std=0.59, the highest of any shape.
- Fashion is harder overall (~86% vs ~97%) and noisier (std ~2× larger).
Refuted hypothesis: digit-stroke geometry as the mechanism (B9 single-seed framing).
Refined hypothesis: kernels should be perpendicular to the dominant feature orientation. Both MNIST and Fashion have predominantly vertical structural features; tall patches lie along these features and waste capacity, while wide patches cut across them and capture transitions. Classical filter-design wisdom rediscovered via SGD.
Status: hypothesis is now directly testable — rotating the input 90° should flip the preference (B11 candidate), and a more granular aspect-ratio sweep should show the effect grows monotonically with ratio extremity (B12 candidate).
B11: Rotated MNIST
Date: 2026-05-07
Binary: src/bin/proto_patches_rect_rotated.rs
Hypothesis: if the wide preference is “kernels perpendicular to the dominant feature orientation,” rotating MNIST 90° CW should flip the dominant feature orientation and the preference should reverse — tall > wide on rotated data.
Setup: identical to B9-stats (8 shapes × 5 seeds at N=320) except every image is rotated 90° CW (new[c][27-r] = old[r][c]) at load time.
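The index mapping new[c][27-r] = old[r][c] can be written out for a row-major buffer as follows. The function name `rotate_cw` and the flat `Vec<f32>` image layout are assumptions for illustration.

```rust
const SIDE: usize = 28; // 28x28 images, stored row-major

/// Rotate a row-major 28x28 image 90 degrees clockwise: new[c][27 - r] = old[r][c].
fn rotate_cw(old: &[f32]) -> Vec<f32> {
    let mut new = vec![0.0; SIDE * SIDE];
    for r in 0..SIDE {
        for c in 0..SIDE {
            new[c * SIDE + (SIDE - 1 - r)] = old[r * SIDE + c];
        }
    }
    new
}
```

Under this mapping the top-left pixel moves to the top-right corner and a horizontal stroke becomes a vertical one, which is what should swap the roles of wide and tall patches.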
Per-shape mean ± std: 5×5 97.04±0.15, 6×6 97.45±0.14, 4×6 97.02±0.15, 6×4 96.99±0.22, 3×9 96.67±0.21, 9×3 97.07±0.23, 5×7 97.34±0.06, 7×5 97.44±0.13.
Paired wide-minus-tall (rotated):
| Pair | Δ | t | d_z | Sig | vs. upright |
|---|---|---|---|---|---|
| 4×6 / 6×4 | +0.022 | 0.25 | 0.11 | ns | was +0.18 ns |
| 3×9 / 9×3 | −0.402 | −2.27 | −1.02 | * | was +0.51 ***, sign flipped |
| 5×7 / 7×5 | −0.106 | −1.77 | −0.79 | ns | was +0.04 ns, direction flipped |
Finding: the 3×9 vs 9×3 effect cleanly reversed sign on rotated MNIST (Δ went from +0.51pp *** to −0.40pp *). The pattern moves with the data — confirming the mechanistic claim that this is feature-orientation-driven, not architectural bias or placement geometry.
Magnitude is smaller after rotation (0.40 vs 0.51): plausibly because rotated digits are slightly off-distribution (training-data conventions are axis-specific), or because vertical-feature dominance has a partly geometric component that rotation doesn’t fully invert. The sign — what the hypothesis predicts — is unambiguous.
Status: mechanistic hypothesis from B10 confirmed. The wide-vs-tall asymmetry is genuinely orientation-driven.
B12: Extreme aspect ratios
Date: 2026-05-07
Binary: src/bin/proto_patches_rect_aspect.rs
Hypothesis: the wide preference grows monotonically with aspect-ratio extremity — gap should increase 1:3 → 1:9 → 1:15 → 1:21.
Setup: 9 shapes × 5 seeds at N=320 on MNIST. Mirror pairs at 1:3 (3×9/9×3, reference), 1:9 (1×9/9×1), 1:15, 1:21. Plus 5×5 baseline.
Per-shape mean ± std: 5×5 97.06±0.14, 3×9 97.10±0.19, 9×3 96.60±0.16, 1×9 94.38±0.16, 9×1 94.36±0.16, 1×15 95.21±0.23, 15×1 94.70±0.06, 1×21 95.31±0.09, 21×1 94.62±0.25.
Paired wide-minus-tall:
| Pair | Aspect | Δ | t | d_z | Sig |
|---|---|---|---|---|---|
| 3×9 / 9×3 | 1:3 | +0.506 | 4.22 | 1.89 | *** |
| 1×9 / 9×1 | 1:9 | +0.012 | 0.11 | 0.05 | ns |
| 1×15 / 15×1 | 1:15 | +0.518 | 4.17 | 1.86 | *** |
| 1×21 / 21×1 | 1:21 | +0.690 | 4.62 | 2.07 | *** |
Finding: monotonicity prediction refuted. There is a null at 1:9 sandwiched between significant effects at 1:3 and 1:15+. The pattern is large → null → large → larger.
The 1×9 vs 9×1 pair is the only one that combines single-pixel thickness AND short length (9 px). All other pairs have either ≥3 px perpendicular thickness (3×9, 9×3) or ≥15 px length (1×15, 1×21). Either property recovers the effect; having neither kills it.
Refined mechanistic story (post-B12): the wide preference at extreme aspect ratios is at least two distinct phenomena:
- For thick rectangular patches (≥3 px perpendicular extent): B10’s perpendicular-to-feature argument.
- For 1-pixel-thick long strips: cross-section sampling — horizontal strips capture intensity profile across digit width (discriminative), vertical strips capture intensity profile across digit height (less discriminative because all digits are ~similar height).
Both produce wide > tall on MNIST, but for different reasons.
Other observations:
- All single-pixel-thick patches cap at ~94-95%, regardless of length. Bandwidth-limited.
- Thickening the strip from one row to three (1×9 → 3×9) jumps accuracy from 94.4% to 97.1%. 1D → 2D internal structure is a phase transition for usefulness.
- 9×3 in B12 reproduces B9-stats’s 9×3 (96.60% mean, std 0.16) exactly — reproducibility check passed.
Status: closes the rectangular-patches arc. Real mechanistic finding (B11 confirmed) with a richer-than-expected structure (B12 surprise: not single-mechanism). Worth banking; future work could disentangle the two mechanisms more cleanly but marginal value vs other Group B questions is small.
B13 / A3: Random-index “patches” — does spatial contiguity matter?
Date: 2026-05-08
Binary: src/bin/proto_patches_random_idx.rs
Hypothesis: at 320 trained patches, spatial contiguity is irrelevant — “patches” are just sparse linear features and SGD finds good combinations regardless of input pixel layout.
Setup: head-to-head spatial 5×5 vs 25-random-pixel-index “patches” at N=320, identical training, 5 seeds paired.
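The two variants differ only in which 25 pixel indices a patch reads. A minimal sketch under stated assumptions: `XorShift` is a stand-in generator (the RNG used by the real binary is not given in this record), and `spatial_indices`, `random_indices`, and `response` are hypothetical names.

```rust
const SIDE: usize = 28; // 28x28 images, stored row-major

/// The 25 pixel indices covered by a contiguous 5x5 window at (top, left).
fn spatial_indices(top: usize, left: usize) -> Vec<usize> {
    (0..5)
        .flat_map(|r| (0..5).map(move |c| (top + r) * SIDE + left + c))
        .collect()
}

/// Tiny xorshift64 generator; a stand-in only. The seed must be nonzero.
struct XorShift(u64);

impl XorShift {
    fn next_u64(&mut self) -> u64 {
        self.0 ^= self.0 << 13;
        self.0 ^= self.0 >> 7;
        self.0 ^= self.0 << 17;
        self.0
    }
}

/// 25 distinct pixel indices drawn from the whole image (partial Fisher-Yates shuffle).
fn random_indices(rng: &mut XorShift) -> Vec<usize> {
    let mut all: Vec<usize> = (0..SIDE * SIDE).collect();
    for i in 0..25 {
        let j = i + (rng.next_u64() as usize) % (all.len() - i);
        all.swap(i, j);
    }
    all.truncate(25);
    all
}

/// Shared forward pass: ReLU(bias + sum of weight * pixel over the index set).
fn response(indices: &[usize], weights: &[f32], bias: f32, image: &[f32]) -> f32 {
    let acc: f32 = indices.iter().zip(weights).map(|(&i, w)| w * image[i]).sum();
    (acc + bias).max(0.0)
}
```

Because `response` is identical for both index sets, the parameter count and training procedure stay matched and only spatial contiguity changes.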
Result: hypothesis refuted.
| Config | Mean ± std |
|---|---|
| Spatial 5×5 | 96.996 ± 0.246 |
| Indexed 25 | 96.386 ± 0.147 |
Δ = +0.610pp, t = +7.91, d_z = +3.54 (***) — spatial wins decisively.
Takeaway: the locality inductive bias matters even at 320 trained patches. Triggered B16 (cross-task) and B17 (across sizes).
B14 / A1: Multi-scale patches
Date: 2026-05-08
Binary: src/bin/proto_patches_multiscale.rs
Hypothesis: mixing patch sizes (1/3 each at 3×3, 5×5, 7×7) beats single-scale 5×5 at matched parameter count.
Setup: single-scale 5×5 vs mixed thirds at two patch counts (N=240, N=480). 5 seeds paired.
Results:
| N | Single 5×5 | Mixed 3/5/7 | Δ (mixed − single) | t | d_z | Sig |
|---|---|---|---|---|---|---|
| 240 | 96.40 ± 0.16 | 96.93 ± 0.22 | +0.530 | +4.40 | +1.97 | *** |
| 480 | 97.58 ± 0.14 | 97.46 ± 0.15 | −0.114 | −1.36 | −0.61 | ns |
Takeaway: multi-scale wins decisively at low N (+0.53pp ***), but the advantage vanishes at high N. Multi-scale is a low-capacity phenomenon — receptive-field diversity helps when no single size has enough patches to fully exploit it. Rhymes with B5/B6 pattern (smarter feature design helps when capacity is scarce; doesn’t matter when abundant).
B15 / A2: Multi-layer patches (hidden ReLU layer)
Date: 2026-05-08
Binary: src/bin/proto_patches_multilayer.rs
Hypothesis: adding a hidden ReLU layer between patches and linear classifier breaks past the ~97% ceiling — the linear head was the bottleneck.
Setup: 5×5 × 320 patches → M ReLU hidden → 10 linear softmax. M ∈ {0, 32, 64, 128}. Same training as B7/B8. 5 seeds.
Results:
| M | Mean ± std | Total params | Δ vs M=0 |
|---|---|---|---|
| 0 | 97.00 ± 0.25 | 11,530 | — |
| 32 | 95.30 ± 0.45 | 18,922 | −1.69pp |
| 64 | 95.29 ± 0.54 | 29,514 | −1.71pp |
| 128 | 95.44 ± 0.16 | 50,698 | −1.55pp |
Takeaway: depth hurts uniformly at fixed training budget. The hidden layer’s extra parameters can’t be properly trained at 10 epochs / fixed LR. Opposite of main NEAT stream’s depth result ([128, 64] beat [128]) because main stream has 1.8M steps with LR decay vs our 500K with fixed LR. Conditional negative — depth in this minimal framework needs retuned hyperparameters before being declared dead.
B16: A3 on Fashion-MNIST
Date: 2026-05-08
Binary: src/bin/proto_patches_random_idx_fashion.rs
Hypothesis: B13’s spatial-contiguity advantage replicates cross-task on Fashion-MNIST.
Setup: identical to B13, only data path changed.
Result:
| Config | Mean ± std |
|---|---|
| Spatial 5×5 | 86.36 ± 0.33 |
| Indexed 25 | 86.48 ± 0.19 |
Δ = −0.124pp, t = −0.71, d_z = −0.32 (ns) — sign even slightly negative.
Takeaway: the locality advantage is MNIST-specific. Does not replicate on Fashion. Inverse of the rectangular-patch finding (which replicated and strengthened on Fashion). Plausible mechanism: MNIST has very high local pixel correlation in stroke regions; Fashion has more textural variation, so adjacent pixels carry less correlated information. Random-index “patches” become effectively global pixel fingerprints, which are competitive with local-feature detection on Fashion.
B17: A3 across patch sizes (MNIST)
Date: 2026-05-08
Binary: src/bin/proto_patches_random_idx_sizes.rs
Hypothesis: B13’s contiguity advantage isn’t 5×5-specific — it should exist at 3×3 and 7×7 too.
Setup: spatial vs indexed at sizes ∈ {3, 5, 7}. 320 patches each, 5 seeds paired.
Results:
| Size | Spatial | Indexed | Δ pp | t | d_z | Sig |
|---|---|---|---|---|---|---|
| 3×3 | 95.50 ± 0.28 | 94.91 ± 0.25 | +0.596 | +4.26 | +1.91 | *** |
| 5×5 | 97.21 ± 0.14 | 96.30 ± 0.12 | +0.912 | +18.30 | +8.18 | *** |
| 7×7 | 97.62 ± 0.19 | 96.92 ± 0.14 | +0.694 | +7.30 | +3.26 | *** |
Takeaway: contiguity advantage is robust at every patch size on MNIST, peaking at 5×5 (d_z=+8.18 — extraordinarily consistent). Magnitudes 0.6-0.9pp.
Methodological footnote: B13 reported Δ=+0.61pp at 5×5; B17 with a different seed-offset scheme (different positions / weight inits, same base seeds) gave +0.91pp at the same configuration. Even paired multi-seed Δ has ~0.2-0.3pp uncertainty in its precise magnitude — sign and significance are robust, but exact values need more samples to nail down.
Status: closes the A3 thread. Locality is a real, significant, robust advantage on MNIST across patch sizes — and is absent on Fashion. The task-specific transferability profile is itself the most informative finding.
B18: Task difficulty calibration
Date: 2026-05-08
Binary: src/bin/calibrate_tasks.rs
Hypothesis: identify a harder workhorse task to replace MNIST, which is saturating around 97-98% and eating differential signal in architectural comparisons.
Setup: trained 5×5 patches at N ∈ {80, 320}, single seed, on 9 task variants spanning single datasets (MNIST, Fashion, KMNIST, EMNIST balanced), pairs (M+F, M+K, K+F), triple (M+F+K), and the full quad. Required new src/data/mixed.rs loader.
Headline results: KMNIST is the cleanest single-task workhorse (8.3pp spread N=80→320, ~90% ceiling). MNIST+KMNIST hits 92.62% at N=320, squarely in the target zone. EMNIST balanced (47 classes) gives the largest single-task spread (10.1pp) but introduces class-count confounds.
Status: established the new task battery for B19+.
B19: Locality on KMNIST
Date: 2026-05-08
Binary: src/bin/locality_kmnist.rs
Hypothesis: KMNIST has stroke-like local structure similar to MNIST, so spatial 5×5 should still beat 25-random-pixel-index.
Setup: identical to B13 except task = KMNIST. 320 patches × 5 seeds.
Result: hypothesis refuted in dramatic fashion.
| Config | Mean ± std |
|---|---|
| Spatial 5×5 | 90.322 ± 0.215 |
| Indexed 25 | 91.534 ± 0.353 |
Δ = −1.212pp, t = −7.98, d_z = −3.54, *** — sign flipped from MNIST. Indexed beats spatial by even more than spatial beat indexed on MNIST.
Takeaway: spatial locality is harmful on cursive Japanese characters. The MNIST locality advantage doesn’t generalize even to other 28×28 grayscale stroke data — cursive vs printed makes the difference.
B20: Locality on EMNIST balanced
Date: 2026-05-08
Binary: src/bin/locality_emnist.rs
Hypothesis: EMNIST is printed letters+digits — should follow the MNIST pattern (spatial wins).
Setup: identical to B13 except task = EMNIST balanced (47 classes, ~94K train images). 320 patches × 5 seeds.
Result:
| Config | Mean ± std |
|---|---|
| Spatial 5×5 | 77.212 ± 0.366 |
| Indexed 25 | 76.185 ± 0.089 |
Δ = +1.027pp, t = +5.53, d_z = +2.47, *** — same direction as MNIST.
Takeaway: spatial wins on printed-character data regardless of class count (10 → 47). The MNIST and EMNIST patterns are consistent. Together with B19’s flip on KMNIST, this gives a mechanistic reading: printed-stroke characters have local pixel correlations that 5×5 receptive fields exploit; cursive characters apparently don’t.
B21: Multi-scale on KMNIST
Date: 2026-05-08
Binary: src/bin/multiscale_kmnist.rs
Hypothesis: B14’s “mixing 3/5/7 wins at low N” effect on MNIST should replicate on KMNIST, where there’s more headroom.
Setup: single-scale 5×5 vs mixed thirds at three patch counts. 5 paired seeds.
Results:
| N | Single 5×5 | Mixed 3/5/7 | Δ | t | Sig |
|---|---|---|---|---|---|
| 120 | 85.75 ± 0.31 | 86.06 ± 0.31 | +0.31 | +1.35 | ns |
| 240 | 89.71 ± 0.19 | 89.46 ± 0.17 | −0.25 | −1.78 | ns |
| 480 | 91.84 ± 0.25 | 92.12 ± 0.38 | +0.28 | +1.56 | ns |
All three patch counts non-significant. On MNIST B14 had Δ=+0.53pp *** at N=240.
Takeaway: the multi-scale-wins-at-low-N pattern was MNIST-specific. Receptive-field diversity didn’t help on KMNIST at any tested capacity. Mildly disappointing for the typed-species “evolve a mix” hypothesis.
B22: Locality on MNIST+KMNIST mix
Date: 2026-05-08
Binary: src/bin/locality_mnist_kmnist.rs
Hypothesis: combining tasks where locality has opposite-sign effects (MNIST +, KMNIST −) — does one dominate, do they average to null, or does interaction emerge?
Setup: identical to B13 except task = MNIST + KMNIST (20 classes, 100K train). 5 paired seeds at N=320.
Result:
| Config | Mean ± std |
|---|---|
| Spatial 5×5 | 92.971 ± 0.497 |
| Indexed 25 | 92.879 ± 0.101 |
Δ = +0.092pp, t = +0.42, d_z = +0.19, ns — essentially zero.
Takeaway: opposite-direction effects from each constituent task cancel almost perfectly when mixed. The patch architecture sees both data distributions and finds intermediate behavior. No emergent property from mixing — it’s just an average. Useful confirmation that the task-specificity is genuinely about the data structure, not architecture-task interaction.
B23: Locality across patch sizes on KMNIST
Date: 2026-05-08
Binary: src/bin/locality_kmnist_sizes.rs
Hypothesis: B19’s KMNIST flip is size-robust (not specific to 5×5).
Setup: spatial vs indexed at sizes ∈ {3, 5, 7}, 320 patches × 5 paired seeds.
Results:
| Size | Spatial | Indexed | Δ pp | t | d_z | Sig |
|---|---|---|---|---|---|---|
| 3×3 | 86.90 ± 0.25 | 88.25 ± 0.28 | −1.35 | −6.82 | −3.05 | *** |
| 5×5 | 90.33 ± 0.14 | 91.50 ± 0.16 | −1.16 | −13.43 | −6.00 | *** |
| 7×7 | 91.73 ± 0.33 | 93.12 ± 0.21 | −1.38 | −6.95 | −3.11 | *** |
Takeaway: KMNIST locality flip is robust at every patch size, with comparable magnitude (1.16-1.38pp) and all ***. MNIST B17 and KMNIST B23 are mirror images: same setup, opposite-signed effects of similar magnitude. The locality direction tracks data structure (printed vs cursive characters), not architectural choice.
B24: Multilayer on KMNIST
Date: 2026-05-08
Binary: src/bin/multilayer_kmnist.rs
Hypothesis: B15’s multilayer hurt on MNIST might have been driven by MNIST saturation (linear classifier already nearly optimal). KMNIST has 7pp more headroom — the hidden layer should help here if A2’s negative was capacity-driven.
Setup: 5×5 × 320 patches → M ReLU hidden → softmax. M ∈ {0, 64, 128}. 5 seeds.
Results:
| M | Mean ± std | Δ vs M=0 |
|---|---|---|
| 0 | 90.32 ± 0.22 | — |
| 64 | 88.47 ± 1.35 | −1.85pp |
| 128 | 89.17 ± 0.44 | −1.15pp |
Takeaway: hidden layer hurts on KMNIST too, with magnitude similar to MNIST. B15’s negative result is task-general, not MNIST-specific. The under-training hypothesis is supported: at fixed training budget (10 epochs, fixed LR), additional parameters can’t be properly trained. For depth to help, the training schedule needs to grow with the architecture (more epochs, LR decay).
Status: confirms B15’s reading. Conditional negative becomes robust negative.
Synthesis: post B18-B24 transferability picture
| Finding | MNIST | Holds on harder tasks? |
|---|---|---|
| Patch matchers as a primitive | works | Yes (B18 calibration) |
| Multi-scale at low N (B14) | +0.53pp *** | NO — all ns on KMNIST |
| Spatial locality (B13/B17) | +0.6-0.9pp *** at every size | NO — flips sign on KMNIST, null on Fashion, replicates on EMNIST |
| Multilayer hurt at fixed budget (B15) | −1.5-1.7pp | YES — replicates on KMNIST |
| Rectangular wide preference (B9-stats/B10/B11) | +0.51pp *** | YES (Fashion replicates, rotation flips) |
Three of the five rows hold on harder tasks (two if the patch-primitive calibration is not counted as a finding). The locality finding’s sign flip is the most striking single result of Group B to date: it transforms what looked like a general patch-architecture property into a data-distribution-dependent one.
B25: Multilayer on KMNIST with proper training schedule
Date: 2026-05-08
Binary: src/bin/multilayer_kmnist_schedule.rs
Hypothesis: B15 / B24 found multilayer hurts at fixed 10-epoch / fixed-LR budget; mechanism was under-training. With 20 epochs + linear LR decay 0.05→0.005, does depth pay off?
Setup: identical to B24 except 20 epochs and decaying LR. M ∈ {0, 64}, 5 paired seeds.
Result: hypothesis confirmed dramatically.
| Config | Mean ± std |
|---|---|
| M=0 (linear) | 92.96 ± 0.16 |
| M=64 (multilayer) | 95.74 ± 0.17 |
Δ = +2.78pp, t = +24.50, d_z = +10.96, *** — multilayer helps enormously with proper schedule.
Comparison to B24 (10 epochs fixed LR): M=0 was 90.32%, M=64 was 88.47% (Δ = −1.85pp). Both improve with schedule, but M=64 improves by +7.3pp vs M=0’s +2.6pp.
Major correction: B15’s “depth hurts” reading was a training-budget artifact. With proper schedule, depth is the biggest single architectural win on KMNIST.
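The schedule change is small in code terms. A minimal sketch of a linear decay from 0.05 to 0.005 over 20 epochs, assuming the rate is updated once per epoch and reaches the end value on the final epoch (the log states only the endpoints, so the per-epoch granularity is an assumption). `linear_lr` is a hypothetical helper.

```rust
/// Linearly interpolated learning rate for `epoch` in 0..total_epochs.
/// Assumption: epoch 0 uses `start` and the last epoch uses `end`.
fn linear_lr(epoch: usize, total_epochs: usize, start: f64, end: f64) -> f64 {
    if total_epochs <= 1 {
        return start; // nothing to interpolate over
    }
    let last = (total_epochs - 1) as f64;
    // Clamp so out-of-range epochs stay at the end value.
    let frac = epoch.min(total_epochs - 1) as f64 / last;
    start + (end - start) * frac
}
```

Under these assumptions, `linear_lr(e, 20, 0.05, 0.005)` starts at 0.05 for `e = 0` and falls by the same amount each epoch until it reaches 0.005 at `e = 19`.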
B27: Pixel-correlation probe (null result)
Date: 2026-05-08
Binary: src/bin/pixel_correlations.rs
Hypothesis: MNIST/EMNIST have higher local pixel correlation than KMNIST, explaining the locality direction.
Result: refuted. MNIST and KMNIST have nearly identical adjacent-pixel correlations (r=0.808 vs 0.789). Fashion has the highest of any dataset (r=0.846 H, 0.898 V) yet locality is null there.
Takeaway: simple pairwise pixel adjacency doesn’t predict locality direction. The mechanism is more subtle — see B31.
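One way to compute an adjacent-pixel correlation of the kind reported above is a Pearson correlation between each pixel and its neighbor, pooled over all images and positions. The sketch below assumes row-major flattened images and that pooling scheme; the exact estimator in `pixel_correlations.rs` is not described in this log. `neighbor_correlation` is a hypothetical helper; `(dx, dy) = (1, 0)` gives the horizontal case and `(0, 1)` the vertical one.

```rust
/// Pearson correlation between each pixel and its neighbor at offset
/// (dx, dy), pooled over all images and all positions where both exist.
/// Returns None on a size mismatch or when either side has zero variance.
fn neighbor_correlation(
    images: &[Vec<f64>],
    width: usize,
    height: usize,
    dx: usize,
    dy: usize,
) -> Option<f64> {
    let (mut n, mut sx, mut sy) = (0.0_f64, 0.0_f64, 0.0_f64);
    let (mut sxx, mut syy, mut sxy) = (0.0_f64, 0.0_f64, 0.0_f64);
    for img in images {
        if img.len() != width * height {
            return None;
        }
        for y in 0..height.saturating_sub(dy) {
            for x in 0..width.saturating_sub(dx) {
                let a = img[y * width + x];
                let b = img[(y + dy) * width + (x + dx)];
                n += 1.0;
                sx += a;
                sy += b;
                sxx += a * a;
                syy += b * b;
                sxy += a * b;
            }
        }
    }
    if n == 0.0 {
        return None;
    }
    // Sums of squares and cross-products about the means.
    let cov = sxy - sx * sy / n;
    let var_a = sxx - sx * sx / n;
    let var_b = syy - sy * sy / n;
    if var_a <= 0.0 || var_b <= 0.0 {
        return None;
    }
    Some(cov / (var_a * var_b).sqrt())
}
```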
B28: Scaling sweep on KMNIST and EMNIST
Date: 2026-05-08
Binary: src/bin/scaling_kmnist_emnist.rs
Hypothesis: establish the canonical N-sweep curve on the new workhorses (B7-equivalent).
Result: filled in N ∈ {32, 160, 640} alongside B18’s {80, 320}.
| N | KMNIST | EMNIST |
|---|---|---|
| 32 | 72.14% | 52.40% |
| 80 | 81.88% | 67.56% |
| 160 | 87.59% | 74.12% |
| 320 | 90.19% | 77.69% |
| 640 | 92.46% | 79.67% |
KMNIST: 20pp spread (72→92%). EMNIST: 27pp spread (52→80%) but lower ceiling.
B29: Rectangular patches on KMNIST
Date: 2026-05-08
Binary: src/bin/rect_kmnist.rs
Hypothesis: B9-stats / B10 / B11’s wide-preference at 1:3 aspect was task-general. Does it hold on KMNIST?
Result: NO — sign flipped.
| Pair | Δ pp | t | Sig |
|---|---|---|---|
| 4×6 / 6×4 | −0.27 | −1.32 | ns |
| 3×9 / 9×3 | −0.40 | −2.07 | * |
| 5×7 / 7×5 | +0.18 | +0.70 | ns |
On KMNIST, tall narrow patches beat wide flat ones at extreme aspect (−0.40pp *). Opposite of MNIST/Fashion/rotated-MNIST.
Takeaway: rectangular wide-preference is mostly task-general but flips on cursive Japanese characters — consistent with KMNIST having a different dominant feature orientation than MNIST/Fashion.
B30: Multi-scale on EMNIST
Date: 2026-05-08
Binary: src/bin/multiscale_emnist.rs
Hypothesis: KMNIST’s null result (B21) was specific to KMNIST. EMNIST is printed letters+digits like MNIST, so multi-scale should replicate.
Result: NO — all ns on EMNIST too.
| N | Single 5×5 | Mixed 3/5/7 | Δ | Sig |
|---|---|---|---|---|
| 120 | 70.73 | 70.15 | −0.58 | ns |
| 240 | 75.62 | 75.75 | +0.13 | ns |
| 480 | 78.40 | 78.62 | +0.22 | ns |
Takeaway: B14’s multi-scale advantage on MNIST is genuinely MNIST-only. Doesn’t replicate on either KMNIST or EMNIST.
B31: Per-pixel class-discriminability and spatial structure
Date: 2026-05-08
Binary: src/bin/pixel_discriminability.rs
Hypothesis: spatial autocorrelation of class-discriminability at the patch scale predicts locality direction.
Setup: for each pixel position, compute F-like ratio of between-class variance to within-class variance. Compute spatial autocorrelation of this discriminability map at distances 1, 2, and 5.
Result: hypothesis confirmed cleanly.
| Dataset | autoc d=1 | autoc d=2 | autoc d=5 | Locality |
|---|---|---|---|---|
| MNIST | 0.903 | 0.728 | +0.320 | spatial +0.6-0.9 *** |
| Fashion | 0.852 | 0.652 | +0.125 | null |
| KMNIST | 0.869 | 0.601 | −0.067 | spatial −1.21 *** |
| EMNIST | 0.940 | 0.780 | +0.371 | spatial +1.03 *** |
The spatial autocorrelation at d=5 ranks the four datasets in exactly the same order as the locality effect.
Takeaway: locality direction is a measurable data property — discoverable without training. KMNIST’s class-discriminative information is not spatially clustered at the patch scale, so spatial 5×5 patches can’t reliably catch concentrated info; random-index patches do better. The cleanest mechanistic finding Group B has produced.
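A sketch of the two measurements described in the setup, assuming images are flattened row-major `Vec<f64>` and labels are class indices. The exact estimator used by `pixel_discriminability.rs` is not given in this log, so the pooled within-class variance, the epsilon guard, and the pooling of horizontal and vertical shifts are all assumptions. Both function names are hypothetical.

```rust
/// Per-pixel F-like ratio: between-class variance of the class means
/// divided by the pooled within-class variance.
fn discriminability_map(
    images: &[Vec<f64>],
    labels: &[usize],
    n_classes: usize,
    n_pixels: usize,
) -> Vec<f64> {
    let mut count = vec![0.0_f64; n_classes];
    let mut sum = vec![vec![0.0_f64; n_pixels]; n_classes];
    let mut sum_sq = vec![vec![0.0_f64; n_pixels]; n_classes];
    for (img, &c) in images.iter().zip(labels) {
        count[c] += 1.0;
        for p in 0..n_pixels {
            sum[c][p] += img[p];
            sum_sq[c][p] += img[p] * img[p];
        }
    }
    let total: f64 = count.iter().sum();
    (0..n_pixels)
        .map(|p| {
            let grand = (0..n_classes).map(|c| sum[c][p]).sum::<f64>() / total;
            let (mut between, mut within) = (0.0_f64, 0.0_f64);
            for c in 0..n_classes {
                if count[c] == 0.0 {
                    continue; // skip classes with no samples
                }
                let mean = sum[c][p] / count[c];
                between += count[c] * (mean - grand).powi(2);
                within += sum_sq[c][p] - count[c] * mean * mean;
            }
            // Small epsilon (an assumption) avoids division by zero on
            // pixels that are constant within every class.
            (between / total) / (within / total + 1e-9)
        })
        .collect()
}

/// Pearson correlation between the map and itself shifted by `d` pixels,
/// pooled over horizontal and vertical shifts.
fn spatial_autocorrelation(map: &[f64], width: usize, height: usize, d: usize) -> Option<f64> {
    let mut pairs: Vec<(f64, f64)> = Vec::new();
    for y in 0..height {
        for x in 0..width {
            let v = map[y * width + x];
            if x + d < width {
                pairs.push((v, map[y * width + x + d]));
            }
            if y + d < height {
                pairs.push((v, map[(y + d) * width + x]));
            }
        }
    }
    if pairs.is_empty() {
        return None;
    }
    let n = pairs.len() as f64;
    let ma = pairs.iter().map(|p| p.0).sum::<f64>() / n;
    let mb = pairs.iter().map(|p| p.1).sum::<f64>() / n;
    let cov: f64 = pairs.iter().map(|p| (p.0 - ma) * (p.1 - mb)).sum();
    let va: f64 = pairs.iter().map(|p| (p.0 - ma).powi(2)).sum();
    let vb: f64 = pairs.iter().map(|p| (p.1 - mb).powi(2)).sum();
    if va == 0.0 || vb == 0.0 {
        return None;
    }
    Some(cov / (va * vb).sqrt())
}
```

Neither function involves training, which is the point of the takeaway: under these assumptions the d=5 value can be computed from the labeled dataset alone before committing to a patch placement strategy.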
B32: Multilayer on MNIST with proper schedule
Date: 2026-05-08
Binary: src/bin/multilayer_mnist_schedule.rs
Hypothesis: B25’s depth+schedule reversal extends to MNIST.
Result: null.
| Config | Mean ± std |
|---|---|
| M=0 | 97.89 ± 0.17 |
| M=64 | 98.00 ± 0.15 |
Δ = +0.11pp, t = +0.89, ns. With proper schedule, depth is null on MNIST — the linear baseline was already near saturation around 98% for this patch capacity. Schedule fixes the under-training, but there’s no additional gain to extract.
B33: Rectangular patches on EMNIST balanced
Date: 2026-05-08
Binary: src/bin/rect_emnist.rs
Hypothesis: EMNIST is printed letters+digits — should follow MNIST’s wide-preference.
Result: strongly confirms MNIST pattern.
| Pair | Δ pp | t | Sig |
|---|---|---|---|
| 4×6 / 6×4 | +0.03 | +0.37 | ns |
| 3×9 / 9×3 | +0.98 | +4.45 | *** |
| 5×7 / 7×5 | +0.87 | +2.06 | * |
4-task picture for rectangular wide-preference: MNIST +0.51 ***, Fashion +0.93 ***, EMNIST +0.98 ***, KMNIST −0.40 *. Rectangular wide-preference holds on 3 of 4 tasks; KMNIST is the only outlier.
B34: Multilayer on EMNIST with proper schedule
Date: 2026-05-08
Binary: src/bin/multilayer_emnist_schedule.rs
Hypothesis: depth helps more when there’s more headroom. EMNIST has 22pp of headroom vs KMNIST’s 10pp — should help more.
Result: hypothesis refuted — depth hurts on EMNIST.
| Config | Mean ± std |
|---|---|
| M=0 | 82.15 ± 0.26 |
| M=64 | 81.04 ± 0.29 |
Δ = −1.11pp, t = −6.96, d_z = −3.11, *** — multilayer hurts.
Reframing: simple “depth scales with headroom” is wrong. EMNIST has 47 visually distinct classes; the linear classifier on raw patch features is approximately optimal at this capacity, and adding non-linearity doesn’t add value. KMNIST’s 10 cursive classes share visual components and benefit from compositional features.
B35: Wider multilayer on EMNIST
Date: 2026-05-08
Binary: src/bin/multilayer_emnist_wide.rs
Hypothesis: B34’s hurt came from M=64 being too narrow relative to the 47-class output, creating a bottleneck. Wider hidden layers should remove that bottleneck.
Setup: same schedule, M ∈ {0, 128, 256}. 5 seeds.
Result: bottleneck hypothesis disproved.
| M | Mean ± std | Δ vs M=0 |
|---|---|---|
| 0 | 82.15 ± 0.26 | — |
| 128 | 81.54 ± 0.12 | −0.62 *** |
| 256 | 81.68 ± 0.12 | −0.47 |
Multilayer still hurts at M=128 and M=256, both well above 47 classes. Depth’s harmfulness on EMNIST isn’t a capacity issue.
Takeaway: depth is task-specific in a way that doesn’t track simple variables (headroom, class count, hidden:output ratio). Some tasks benefit from compositional features (KMNIST), others don’t (MNIST, EMNIST), and the prediction requires understanding the task’s discriminative-feature structure — not just its difficulty.
Synthesis: post B25-B35 transferability picture
| Finding | MNIST | KMNIST | EMNIST | Fashion | General? |
|---|---|---|---|---|---|
| Patch matchers as primitive | ✓ | ✓ | ✓ | ✓ | all 4 |
| Rectangular wide-pref | ✓ | ✗ flips | ✓ | ✓ | 3 of 4 |
| Spatial locality | ✓ | ✗ flips | ✓ | ~ null | 2 of 4 (B31 predicts) |
| Multi-scale | ✓ | ~ ns | ~ ns | (untested) | MNIST only |
| Depth+schedule helps | ~ ns | ✓ +2.78*** | ✗ hurts | (untested) | KMNIST only |
Of 5 architectural findings tested across multiple datasets, only the bare patch primitive is fully task-general. Every other finding is conditional on the dataset. KMNIST is the most frequent outlier (it flips locality, flips rectangular preference, and is the only place depth+schedule clearly helps). For typed-species NEAT integration, the genome should evolve patch geometry, placement strategy, depth, and training schedule per-task rather than locking in MNIST-derived defaults.
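As a purely illustrative sketch of what "evolve per-task" could mean, the four axes named above map onto a small genome record. Every name and field here is hypothetical; the log does not define a genome format, and the validity rules are assumptions chosen only to keep the example self-checking.

```rust
/// Where a patch reads its pixels from (hypothetical encoding).
#[derive(Debug, Clone, Copy, PartialEq)]
enum Placement {
    Spatial, // contiguous window in the image
    Indexed, // random pixel indices
}

/// Hypothetical per-task genome covering patch geometry, placement,
/// depth, and training schedule.
#[derive(Debug, Clone, PartialEq)]
struct PatchGenome {
    patch_height: usize,
    patch_width: usize,
    placement: Placement,
    hidden_units: usize, // 0 means a linear classifier (M = 0)
    epochs: usize,
    lr_start: f64,
    lr_end: f64,
}

impl PatchGenome {
    /// Reject genomes that cannot describe a runnable configuration
    /// on a square image with side `image_side`.
    fn is_valid(&self, image_side: usize) -> bool {
        self.patch_height >= 1
            && self.patch_width >= 1
            && self.patch_height <= image_side
            && self.patch_width <= image_side
            && self.epochs >= 1
            && self.lr_start > 0.0
            && self.lr_end > 0.0
            && self.lr_end <= self.lr_start
    }
}
```

Keeping the schedule fields in the genome alongside `hidden_units` reflects the B24/B25 pair, where the same M=64 layer hurt or helped depending on the training schedule.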