Raw, unedited Group B journal — typed-species research stream, separate from the main NEAT work. Reproduced exactly as produced.
Group B Journal
Chronological observations on the typed-species research stream.
Origin
Started 2026-05-07. The main NEAT stream had reached a plateau on [128, 64] at 99.73% MNIST with no obvious next architectural lever. Conversation about biological realism — real neurons aren’t identical, ReLU/Sigmoid/Tanh are convenient fictions — led to the “typed neuronal species” idea: nodes with non-scalar I/O, e.g. a “patch matcher” that takes a vector of pixels and a vector of weights and emits a scalar dot product.
The minimal version of the patch matcher is a learnable convolutional filter as a single node. Letting evolution discover where to place them, what size, etc., is the long-term target. But before building any of that machinery, prove the inductive bias actually pays.
Decision: spin up Group B as a separate research stream with its own binaries and journal. Trainers live in src/bin/ and don’t share the genome/phenotype machinery with main. If Group B produces a strong result, the relevant primitive gets lifted into the main system later.
Experiment B1: Prototype matcher (calibration baseline)
Hypothesis: a single dot product against the per-class mean training image carries enough signal to classify MNIST meaningfully. This is the dumbest possible “neuron” — no learning beyond computing the mean — and sets the floor for any patch-matcher work.
Setup: 50K MNIST training images → 10 prototypes (per-class pixel mean). 10K held-out test images. For each test image, compute dot product against all 10 prototypes, predict argmax.
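A minimal sketch of the B1 scoring rule, assuming images are flattened into plain Python lists. The function names and the 4-pixel toy data are made up for illustration; the actual trainer is not shown in this journal.

```python
def class_mean_prototypes(images, labels, num_classes):
    """Average the flattened images of each class into one prototype vector."""
    dim = len(images[0])
    sums = [[0.0] * dim for _ in range(num_classes)]
    counts = [0] * num_classes
    for img, label in zip(images, labels):
        counts[label] += 1
        for i, px in enumerate(img):
            sums[label][i] += px
    return [[s / max(counts[c], 1) for s in sums[c]] for c in range(num_classes)]


def predict_raw_dot(image, prototypes):
    """Return the class whose prototype has the largest raw dot product."""
    scores = [sum(p * x for p, x in zip(proto, image)) for proto in prototypes]
    return max(range(len(scores)), key=scores.__getitem__)


# Hypothetical 4-pixel stand-in for MNIST with two classes.
train_images = [[1, 1, 0, 0], [1, 0.8, 0, 0], [0, 0, 1, 1], [0, 0, 0.8, 1]]
train_labels = [0, 0, 1, 1]
protos = class_mean_prototypes(train_images, train_labels, num_classes=2)
prediction = predict_raw_dot([0.9, 1, 0, 0.1], protos)
```

No learning happens beyond the averaging step, which is what makes this a floor rather than a model.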
Result: 64.99% accuracy
There’s signal — well above 10% chance — but the per-class breakdown is the interesting part:
| Class | Accuracy | Note |
|---|---|---|
| 0 | 97.3% | Round, distinctive, large self/nearest margin (+22.5) |
| 1 | 48.1% | Loses to the “8” prototype in 539 of 1064 cases |
| 2 | 70.9% | |
| 3 | 73.0% | |
| 4 | 53.6% | |
| 5 | 0.0% | Predicted as “0” in 465 of 915 cases; never gets a single 5 right |
| 6 | 80.1% | |
| 7 | 64.8% | |
| 8 | 92.2% | |
| 9 | 65.7% | |
The pathology: raw dot product is biased toward dense prototypes
Look at the mean score table — the “8” prototype scores ~45 against every class, not because everything looks like an 8 but because the “8” prototype has the most active pixels. Dot product with any digit lights up many of those pixels. Same for “0”, which is also dense.
The “5” prototype is sparser (less total ink). When a test 5 is presented, dot product with “0” beats dot product with “5” because:
- score(5_image, 0_proto) ≈ count of bright pixels that overlap with the round 0 stencil
- score(5_image, 5_proto) ≈ count of bright pixels that overlap with the sparser 5 stencil
The 0 stencil covers much of the 5 stencil: anywhere both have ink, the 0 wins on raw mass.
This is exactly the magnitude-bias problem dot-product classifiers always have. Cosine similarity (normalize by ‖proto‖ × ‖image‖) eliminates it. Centered prototypes (subtract the global mean image first) also help by removing the “ink everywhere” bias.
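The magnitude bias is easy to reproduce on a toy case. The sketch below uses two hypothetical 6-pixel prototypes (one dense, one sparse) and an image whose ink sits where the sparse prototype's ink is; none of these numbers come from the MNIST run.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    norm = math.sqrt(dot(a, a)) * math.sqrt(dot(b, b))
    return dot(a, b) / norm if norm > 0 else 0.0


# Hypothetical prototypes: "dense" has ink everywhere, "sparse" in two pixels only.
prototypes = {
    "dense": [0.8, 0.8, 0.8, 0.8, 0.8, 0.8],
    "sparse": [0.9, 0.9, 0.0, 0.0, 0.0, 0.0],
}
# The image matches the sparse prototype, with a little spill into a third pixel.
image = [1.0, 1.0, 0.3, 0.0, 0.0, 0.0]

raw_winner = max(prototypes, key=lambda name: dot(image, prototypes[name]))
cosine_winner = max(prototypes, key=lambda name: cosine(image, prototypes[name]))
```

On this toy input the raw dot product picks the dense prototype (1.84 vs 1.80) while cosine picks the sparse one, which is the same failure shape as the 5-vs-0 confusion above.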
What B1 establishes
- Raw dot product carries signal — 65% on 10-way classification is real.
- The signal is heavily distorted by prototype magnitude. Some classes hit 0% not because the dot product is uninformative but because the unnormalized version is dominated by ink mass.
- Calibration target for patch matchers: whatever Group B builds has to outperform the cosine-normalized prototype matcher on the same train/test split, otherwise we’re not exceeding the floor of “dumb template matching with magnitude awareness.”
Next step: B2 should measure the cosine-normalized version to set a fair floor, then move toward localized patches.
Experiment B2: Normalized prototype matcher
Hypothesis: cosine similarity (dot product divided by ‖proto‖ × ‖image‖) removes the magnitude confound from B1 and gives us an honest calibration floor.
Setup: same 50K/10K split, same per-class mean prototypes. Two new scoring variants compared head-to-head, with B1’s raw dot product as the third point of comparison:
- B2a: cosine similarity
- B2b: centered prototypes (subtract grand mean) + cosine
Result: B2a hits 83.53%
| Variant | Accuracy | vs. B1 |
|---|---|---|
| B1 raw dot product | 64.99% | — |
| B2a cosine | 83.53% | +18.5pp |
| B2b centered + cosine | 58.89% | -6.1pp |
Per-class accuracy for B2a: 0:89.2, 1:94.5, 2:81.6, 3:83.1, 4:81.5, 5:69.1, 6:91.5, 7:84.5, 8:78.0, 9:80.4. The 0% catastrophe on class 5 is resolved — it’s now the worst class but still functional. No class below 69%. Confusion is now mostly local: 5↔3 (125 errors), 4↔9 (125), 8↔3 (76). Real shape similarity, not magnitude bias.
Why B2b actually got worse
Centering the prototype without also centering the image creates a mismatched comparison. Centered prototypes have negative values in regions where the average digit has ink but this class doesn’t (e.g., the center hole of a 0). Raw test images are non-negative everywhere — they have no way to “score” against those negative regions. The dot product loses signal where it should gain it. The principled fix would be to subtract the grand mean from the image too; not pursuing because B2a already gave us the clean answer.
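For reference, the fix that was not pursued would look roughly like this. It is a sketch only: the helper names are invented, and taking the mean image as the average of the prototypes is an assumption (the grand mean over training images would work the same way).

```python
import math


def subtract(a, b):
    return [x - y for x, y in zip(a, b)]


def cosine(a, b):
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (na * nb)


def centered_cosine(image, proto, mean_image):
    """Cosine similarity after removing the same mean image from both sides."""
    return cosine(subtract(image, mean_image), subtract(proto, mean_image))


# Hypothetical 4-pixel prototypes; the image is an exact copy of prototype 0.
protos = [[1.0, 1.0, 0.0, 0.0], [1.0, 0.0, 1.0, 0.0]]
mean_image = [sum(col) / len(protos) for col in zip(*protos)]
image = [1.0, 1.0, 0.0, 0.0]

mismatched = cosine(image, subtract(protos[0], mean_image))  # B2b style: prototype only
matched = centered_cosine(image, protos[0], mean_image)
```

Even with an image identical to its prototype, the prototype-only centering scores 0.5 here, while centering both sides scores 1.0.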
Calibration floor established
83.53% with cosine-normalized prototype matching is the floor any Group B work has to clear to mean anything. For reference points we don’t have yet but might want:
- Trained linear classifier (10 dense outputs, SGD, no hidden layer) — probably ~92%
- Single hidden layer MLP with O(100) units — ~98%
- Single conv layer + small dense head — >98%
Next decision: do we set those reference floors first (cheap, ~couple hundred lines) or jump straight to patch-matcher experiments? Setting the linear-classifier floor would tell us whether our patch matchers are beating “any trained linear thing” or just “untrained means” — a stronger claim.
Experiment B3: Prototype features → trained discriminator
Hypothesis: take the 10 cosine-similarity scores from B2 as a 10-dim feature vector, train a linear softmax classifier on top of them with SGD. If the prototypes are useful features, downstream learning should comfortably exceed the 83.5% argmax floor.
Setup: 784 input → 10 frozen cosine-prototype nodes → 10×10 linear classifier + bias, softmax, cross-entropy loss, online SGD (lr=0.5, 5 epochs, shuffled).
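The discriminator stage is plain softmax regression trained one sample at a time. A sketch under stated assumptions: zero-initialised weights, a hypothetical 3-class toy feature set in place of the 10 cosine scores, and invented function names.

```python
import math
import random


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sgd_step(W, b, features, label, lr):
    """One online SGD step of softmax regression; returns the pre-update loss."""
    logits = [sum(w * f for w, f in zip(row, features)) + bc for row, bc in zip(W, b)]
    probs = softmax(logits)
    for c, row in enumerate(W):
        grad = probs[c] - (1.0 if c == label else 0.0)  # d(loss)/d(logit_c)
        for j, f in enumerate(features):
            row[j] -= lr * grad * f
        b[c] -= lr * grad
    return -math.log(max(probs[label], 1e-12))


def train(samples, num_classes, dim, lr=0.5, epochs=5, seed=0):
    rng = random.Random(seed)
    W = [[0.0] * dim for _ in range(num_classes)]
    b = [0.0] * num_classes
    order = list(samples)
    history = []  # mean cross-entropy per epoch
    for _ in range(epochs):
        rng.shuffle(order)
        history.append(sum(sgd_step(W, b, f, y, lr) for f, y in order) / len(order))
    return W, b, history


# Hypothetical toy features: feature c is highest for class c, like the cosine scores.
toy = [([0.9, 0.2, 0.1], 0), ([0.2, 0.8, 0.3], 1), ([0.1, 0.3, 0.9], 2)] * 20
W, b, history = train(toy, num_classes=3, dim=3)
```

When each feature already points at its own class, the learned weight matrix ends up diagonal-dominant, which is the same picture the B3 weights show.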
Result: 84.85%
Only +1.32pp over B2’s argmax. Barely a lift.
Per-class: 0:92.2 1:97.6 2:73.5 3:78.5 4:92.4 5:68.5 6:94.5 7:94.1 8:76.7 9:77.7. Class 5 still the worst, no qualitative shift in failure structure.
What the learned weights reveal
The W matrix is strongly diagonal — feature c mostly predicts class c. Diagonal entries are +39 to +57, off-diagonals mostly small and negative. The classifier basically rediscovered argmax, with minor corrections like row 5 (+51 on f5, −20 on f7 — “subtract 7-similarity when scoring 5”). Those corrections earn the 1.3pp; nothing else.
The bottleneck is dimensionality, not training
Cosine similarity already produces calibrated, comparable-across-classes scores. Once you’ve reduced 784 pixels to 10 numbers, a linear classifier on those 10 numbers can’t recover information that isn’t there. The 10 features answer “how much does this look like class c?” — and given those 10 answers, argmax is nearly optimal.
To break ~85% with a frozen first layer, we need either:
- More features — multiple prototypes per class (k-means on each class), or random projections, or eigen-prototypes
- Local features — many small templates each looking at part of the image (this is the patch-matcher hypothesis)
- Non-linear discriminator — small MLP on top of the same 10 features. Unlikely to help much, since the features are already well-separated; would test whether interaction terms exist.
The patch-matcher path is more interesting: it’s the original Group B hypothesis, and the failure structure of B2/B3 (5↔3, 4↔9, 8↔3) is exactly the kind of confusion that local feature detectors should help with — those pairs differ in where the strokes are, not in whether they’re digit-shaped on average.
Net of B1-B3
| Experiment | Setup | Accuracy |
|---|---|---|
| B1 | Raw dot product, argmax | 64.99% |
| B2a | Cosine similarity, argmax | 83.53% |
| B2b | Centered prototypes + cosine | 58.89% |
| B3 | Cosine features → linear softmax | 84.85% |
Three calibration points, one clear takeaway: whole-image dot products against class means cap around 85%. To climb higher we need different features, not better discrimination over the same 10.
Experiment B4: Random local patches (frozen) → trained discriminator
Hypothesis: locality + nonlinearity are the inductive bias that breaks the 85% bottleneck. 32 random 5×5 patches at random positions, ReLU-activated and frozen, fed into a trained linear classifier should clear B3’s 84.85%. If it doesn’t, the content of the filters matters and learning the patches is necessary.
Setup: 32 patches, 5×5 each, He-init random weights, bias=0, placed at uniformly random (top, left) positions in [0..23]×[0..23]. ReLU on patch outputs. Patches frozen. 32→10 linear softmax classifier on top, online SGD, lr=0.1, 10 epochs.
Result: 67.56% (best across epochs: 68.60%)
Worse than whole-image cosine. The local features hypothesis loses badly when the features are random.
Per-class accuracy: 0:78.8 1:91.6 2:62.3 3:58.0 4:46.7 5:64.7 6:65.8 7:82.8 8:50.4 9:71.6. Class 4 is the new worst at 46.7% — it gets confused with 9 (218 out of 983) and 6 (93). The failure structure is different from B2/B3: more spread-out errors, no single dominant confusion.
Why random patches lose
Three diagnostics tell the story:
- Only 10.9/32 patches fire per image on average. Two-thirds of feature capacity sits silent because their random positions land on the dark MNIST background. Wasted parameters.
- Total patch params = 832 vs. B3’s 7840 in the prototype layer. An order of magnitude less raw capacity, most of which doesn’t engage.
- Random weights have no semantic content. A random 5×5 vector detects “this specific random linear combination of pixels exceeds zero” — uncorrelated with anything visually meaningful. The downstream classifier sees 32 noisy, unstructured features.
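The firing-count diagnostic can be sketched as below. Assumptions: He init is taken as Gaussian with std sqrt(2 / fan_in), the seed is arbitrary, and the test image is a synthetic bright square rather than an MNIST digit, so the firing count it produces is not comparable to the 10.9 figure.

```python
import math
import random

IMG = 28  # MNIST side length


def make_random_patches(count, size, rng):
    """Random weights, zero bias, uniform random (top, left) position."""
    std = math.sqrt(2.0 / (size * size))
    patches = []
    for _ in range(count):
        top = rng.randint(0, IMG - size)   # inclusive, so 0..23 for size 5
        left = rng.randint(0, IMG - size)
        weights = [rng.gauss(0.0, std) for _ in range(size * size)]
        patches.append((top, left, weights))
    return patches


def patch_features(image, patches, size):
    """ReLU(dot(weights, image region)) for each frozen patch."""
    feats = []
    for top, left, weights in patches:
        region = [image[top + r][left + c] for r in range(size) for c in range(size)]
        pre = sum(w * x for w, x in zip(weights, region))
        feats.append(max(0.0, pre))
    return feats


rng = random.Random(0xB4)  # arbitrary seed for this sketch
patches = make_random_patches(32, 5, rng)
# Synthetic image: dark background with a bright 10x10 square in the middle.
image = [[1.0 if 9 <= r < 19 and 9 <= c < 19 else 0.0 for c in range(IMG)]
         for r in range(IMG)]
feats = patch_features(image, patches, 5)
firing = sum(1 for f in feats if f > 0)
param_count = 32 * (5 * 5 + 1)  # 25 weights plus one bias per patch
```

Any patch that lands entirely on background has a pre-activation of exactly zero and cannot fire, which is the mechanism behind the silent two-thirds.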
What B4 actually establishes
Locality + nonlinearity is not enough on its own. The content of the filters matters. Random local features at this count and size cap around 68% — well below whole-image cosine (84%).
This matches the historical finding that motivated LeNet: hand-designed Gabor-like filters worked, random conv weights didn’t, and learned conv weights matched or beat hand-designed ones. Group B is rediscovering that on a small scale.
Implications for the next step
The original Group B hypothesis is “patch matchers as a typed neuronal species.” B4 is a hard “no” if the patches are random. The right test is whether trained patches — backprop through the patch weights too, not just the discriminator — clear the 85% ceiling. That’s B5.
Two design choices to settle for B5:
- Same architecture as B4 but with patch weights trainable. Cleanest comparison — only the gradient flow changes.
- Patch count and size: keep 32 and 5×5 for now, since changing those simultaneously with adding learning would conflate two effects. Vary count/size in a follow-up after we know learning helps.
(Plan diverted: instead of jumping straight to trained patches, did a cheaper intermediate check first — what if the patches are meaningful but still frozen? See B5 below.)
Experiment B5: Prototype-slice patches (frozen) → trained discriminator
Hypothesis: B4 failed because random patches are content-free, not because locality is wrong. If we make the patches literal slices of the class-mean prototypes, frozen, and train only the discriminator on top — that should clear the 85% bottleneck. Sweeping the number of slices per class will show the scaling curve.
Setup: for each class c and slice index i (0..N), pick a random (top, left) in [0..23]×[0..23], copy that 5×5 region of prototype[c] as the patch weights. Cosine similarity at that position when applied to a test image. Frozen patches → 10×(10N) linear softmax classifier, online SGD lr=0.1, 10 epochs, seed=0xB5. Sweep N ∈ {1, 2, 4, 8, 12, 16, 24, 32, 48, 64}.
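A sketch of the slice construction, under assumptions: the prototypes here are synthetic all-positive patterns rather than MNIST class means, a blank region is given a score of zero, and all names are invented.

```python
import math
import random

IMG, SIZE = 28, 5


def region(image, top, left):
    return [image[top + r][left + c] for r in range(SIZE) for c in range(SIZE)]


def cosine(a, b):
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0.0 or nb == 0.0:
        return 0.0  # blank region: score defined as zero (assumption)
    return sum(x * y for x, y in zip(a, b)) / (na * nb)


def make_slice_patches(prototypes, n_per_class, rng):
    """Copy n_per_class random 5x5 regions out of each class prototype."""
    patches = []
    for proto in prototypes:
        for _ in range(n_per_class):
            top = rng.randint(0, IMG - SIZE)
            left = rng.randint(0, IMG - SIZE)
            patches.append((top, left, region(proto, top, left)))
    return patches


def slice_features(image, patches):
    """Cosine between each frozen slice and the image region at the same position."""
    return [cosine(region(image, top, left), w) for top, left, w in patches]


def toy_prototype(k):
    # Synthetic pattern, strictly positive so no region is blank.
    return [[0.1 + ((r + (k + 1) * c) % 7) / 7.0 for c in range(IMG)] for r in range(IMG)]


prototypes = [toy_prototype(k) for k in range(3)]
patches = make_slice_patches(prototypes, n_per_class=4, rng=random.Random(0xB5))
features = slice_features(prototypes[0], patches)
```

Feeding a prototype back in gives a score of 1.0 on every slice cut from that same prototype, a quick sanity check that positions and weights line up.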
Result: near-monotonic scaling (one small dip at N=24), peaking at 93.67% with N=64
| N/class | Features | Best test acc |
|---|---|---|
| 1 | 10 | 55.27% |
| 2 | 20 | 75.58% |
| 4 | 40 | 85.18% |
| 8 | 80 | 89.77% |
| 12 | 120 | 90.43% |
| 16 | 160 | 91.24% |
| 24 | 240 | 91.11% |
| 32 | 320 | 92.22% |
| 48 | 480 | 92.76% |
| 64 | 640 | 93.67% |
What this tells us
- Local + meaningful beats global + meaningful. At N=4 (40 features), B5 already matches B3’s 85% with whole-image prototypes. Beyond that B5 keeps climbing while B3 hit a wall.
- Local + meaningful destroys local + random. B4 (32 random patches, 32 features) was 68%. B5 at N≈3 (30 features) would be around 80%, interpolating between N=2 (75.58%) and N=4 (85.18%). The filter content is what was missing.
- No magic number. The curve is smooth log-shaped. Most of the lift comes by N=12-16, then small gains continue. Phi/num_pixels intuition didn’t materialize — it’s just diminishing returns.
- 93.67% is the best Group B result so far — and the only learning happening is in the linear discriminator. The patches are dumb fixed copies.
The capacity confound
At N=32 we have 320 patches × 25 weights = 8,000 patch params. B4 had 32 random patches × 25 = 800 — roughly 10× less. Some of B5’s win is just having more parameters. To disentangle the inductive-bias effect from the raw-capacity effect:
- B6 candidate: 320 random patches, 5×5, frozen, trained discriminator. Same feature count as B5 N=32. If B6 hits ~85-90%, most of B5’s lift was capacity. If B6 stays in the 70s, the meaningful content is doing the real work.
That’s the cleanest single follow-up before we go to trained patches. Costs nothing to run (~30 seconds).
Experiments B6 and B7: random sweep + trained patches
Ran both in parallel: B6 sweeps random-patch count to control for capacity, B7 makes the patches trainable via backprop (the original Group B hypothesis).
B6: random patches at varied count
| Patches | Params | Avg firing | Best test |
|---|---|---|---|
| 32 | 800 | 9.2 | 66.26% |
| 80 | 2000 | 22.6 | 82.07% |
| 160 | 4000 | 51.2 | 88.85% |
| 320 | 8000 | 98.5 | 92.65% |
| 640 | 16000 | 189.7 | 94.01% |
B7: trained patches (random init, backprop)
| Patches | Total params | Best test |
|---|---|---|
| 32 | 1162 | 85.64% |
| 80 | 2890 | 93.46% |
| 160 | 5770 | 96.04% |
| 320 | 11530 | 97.13% |
| 640 | 23050 | 97.27% |
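What changes from B6 to B7 is that the gradient is allowed to flow through the ReLU into the patch weights. A minimal single-example sketch, assuming a zero-initialised discriminator, Gaussian patch init, and invented names; the real trainer may differ in all of these.

```python
import math
import random


def forward(x_regions, Wp, bp, V, c):
    """x_regions[j] is the flattened image region under patch j."""
    pre = [sum(w * x for w, x in zip(Wp[j], x_regions[j])) + bp[j]
           for j in range(len(Wp))]
    h = [max(0.0, p) for p in pre]
    logits = [sum(v * hj for v, hj in zip(row, h)) + ck for row, ck in zip(V, c)]
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return pre, h, [e / total for e in exps]


def train_step(x_regions, label, Wp, bp, V, c, lr):
    """One SGD step through the discriminator and the patch weights."""
    pre, h, probs = forward(x_regions, Wp, bp, V, c)
    dz = [p - (1.0 if k == label else 0.0) for k, p in enumerate(probs)]
    # Gradient reaching each patch output, computed before V is modified.
    dh = [sum(V[k][j] * dz[k] for k in range(len(V))) for j in range(len(h))]
    for k, row in enumerate(V):
        for j in range(len(h)):
            row[j] -= lr * dz[k] * h[j]
        c[k] -= lr * dz[k]
    for j in range(len(Wp)):
        if pre[j] > 0:  # ReLU passes gradient only where the patch fired
            for i, x in enumerate(x_regions[j]):
                Wp[j][i] -= lr * dh[j] * x
            bp[j] -= lr * dh[j]
    return -math.log(max(probs[label], 1e-12))


rng = random.Random(0xB7)  # arbitrary seed for this sketch
n_patches, region_len, n_classes = 4, 9, 3
std = math.sqrt(2.0 / region_len)
Wp = [[rng.gauss(0.0, std) for _ in range(region_len)] for _ in range(n_patches)]
bp = [0.0] * n_patches
V = [[0.0] * n_patches for _ in range(n_classes)]
c = [0.0] * n_classes
x_regions = [[rng.random() for _ in range(region_len)] for _ in range(n_patches)]
losses = [train_step(x_regions, 1, Wp, bp, V, c, lr=0.05) for _ in range(20)]
```

Freezing the patches amounts to skipping the second loop in `train_step`; everything else is shared with the frozen-patch experiments.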
The combined picture, side-by-side
| Features | B6 random | B5 proto-slice | B7 trained |
|---|---|---|---|
| 80 | 82.07% | 89.77% | 93.46% |
| 160 | 88.85% | 91.24% | 96.04% |
| 320 | 92.65% | 92.22% | 97.13% |
| 640 | 94.01% | 93.67% | 97.27% |
B5’s “meaningful content matters” claim was partially an illusion
At 80 features, prototype-slice patches beat random patches by about 8pp (89.77% vs 82.07%); at 160 the lead shrinks to about 2pp. At ≥320 features, random patches catch up, and at both 320 and 640 they slightly beat prototype-slice patches.
What B5 was actually measuring at large N was something close to “any sufficiently dense local feature set works.” The semantic content of the prototype slices gave a sample-efficiency boost that vanished once feature count was high enough that the linear classifier could find good combinations from noise.
This wouldn’t have shown up without B6. The right reading of B5 alone would have been “meaningful local features are great” — false at scale. The right reading of B5+B6 together is “meaningful local features have a head start that disappears with enough capacity.”
B7 validates the original Group B hypothesis
Trained patches dominate at every feature count, but the gap is biggest at small N:
- 32 patches: trained = 85.6% vs random = 66.3% (+19.4pp)
- 80 patches: trained = 93.5% vs random = 82.1% (+11.4pp)
- 320 patches: trained = 97.1% vs random = 92.7% (+4.5pp)
- 640 patches: trained = 97.3% vs random = 94.0% (+3.3pp)
SGD finds filters that are dramatically more sample-efficient per parameter than random or prototype-slice. The asymptotic gap is smaller (because random + huge gets pretty good) but never disappears in this range. 97.27% with 640 5×5 trained patches is close to the main NEAT system’s [128] dense-hidden result (98.7%) — but with locality as the inductive bias rather than dense connectivity.
Net of B1–B7
| Experiment | Setup | Best % |
|---|---|---|
| B1 | Raw dot product, argmax | 64.99% |
| B2 | Cosine, argmax | 83.53% |
| B3 | 10 cosine features → linear classifier | 84.85% |
| B4 | 32 random patches, frozen | 67.56% (final), 68.60% (best epoch) |
| B5 | Prototype-slice patches, frozen, sweep N | 93.67% (N=64) |
| B6 | Random patches, frozen, sweep count | 94.01% (640 patches) |
| B7 | Trained patches, sweep count | 97.27% (640 patches) |
What this means for the main stream
The original idea — typed neuronal species, with patch-matchers as the first species — is well-supported. The architectural ingredients for a strong MNIST classifier without a dense hidden layer are:
- Many local feature detectors (640 ~ same order as the [128] hidden layer’s input fan-in).
- Trained, not random. Sample efficiency depends on it.
- A learnable downstream discriminator that combines the local features.
For the main NEAT stream, this maps to: a “patch matcher” mutation that adds a typed node with bundled inputs (a 5×5 region of input layer) and learnable weights, plus existing connection mutations to wire its scalar output into hidden/output layers. NEAT then evolves which patches to keep and where to place them, rather than evolving from scratch.
A natural next experiment would be a hand-built bake-off in the main system: take a known-good NEAT architecture, replace some hidden nodes with patch matchers, and see whether the patch-matcher version beats the original at matched parameter count. But that’s the integration step, and we said we’d save it for after Group B produced a clear signal. It has now.
Experiment B8: Patch size vs. parameter efficiency
Hypothesis: 5×5 was an arbitrary choice carried over from B4. At matched parameter budgets, smaller patches (more of them, less receptive field each) might be more efficient than larger patches (fewer of them, bigger receptive field each), or vice versa.
Setup: same architecture and training as B7 (trained patches, random He init, ReLU, linear discriminator, online SGD, lr=0.05, 10 epochs, seed=0xB8). Sweep patch size ∈ {3, 4, 5, 6, 7} crossed with patch count ∈ {32, 80, 160, 320, 640}. 25 configurations.
Result table
| Patches | 3×3 | 4×4 | 5×5 | 6×6 | 7×7 |
|---|---|---|---|---|---|
| 32 | 79.95% | 81.32% | 86.93% | 89.15% | 91.52% |
| 80 | 90.49% | 93.37% | 92.33% | 94.32% | 96.33% |
| 160 | 92.28% | 94.65% | 95.44% | 96.14% | 96.82% |
| 320 | 95.85% | 96.69% | 97.12% | 97.08% | 97.53% |
| 640 | 96.54% | 96.96% | 97.70% | 97.71% | 98.04% |
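At a fixed patch count, larger patches almost always win: 7×7 is best in every row, and the only inversions are 4×4 over 5×5 at N=80 and 5×5 over 6×6 at N=320. Receptive field matters when count is scarce. But that’s the wrong axis for the parameter-efficiency question. The right axis is total parameters: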
Parameter-efficient frontier
Sorted by parameter budget, the winning configuration at each scale:
| ≈Budget | Best config (params) | Accuracy |
|---|---|---|
| ~1.5K | 3×3 × 80 (1610) | 90.49% |
| ~3K | 5×5 × 80 (2890) / 3×3 × 160 (3210) | 92.3% |
| ~6K | 3×3 × 320 (6410) | 95.85% |
| ~12K | 5×5 × 320 (11530) | 97.12% |
| ~20K | 7×7 × 320 (19210) | 97.53% |
| ~30K+ | 7×7 × 640 (38410) | 98.04% |
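The parameter totals quoted above are all consistent with counting patch weights, one bias per patch, and a 10-way linear head with biases. That accounting is inferred from the quoted numbers, not taken from the trainer source. A sketch that recomputes the totals and extracts the non-dominated configurations from the B8 result table:

```python
def param_count(height, width, n_patches, n_classes=10):
    """Patch weights + patch biases + head weights + head biases (inferred)."""
    return n_patches * (height * width + 1) + n_patches * n_classes + n_classes


# Accuracy (%) from the B8 result table, keyed by (patch side, patch count).
B8 = {
    (3, 32): 79.95, (4, 32): 81.32, (5, 32): 86.93, (6, 32): 89.15, (7, 32): 91.52,
    (3, 80): 90.49, (4, 80): 93.37, (5, 80): 92.33, (6, 80): 94.32, (7, 80): 96.33,
    (3, 160): 92.28, (4, 160): 94.65, (5, 160): 95.44, (6, 160): 96.14, (7, 160): 96.82,
    (3, 320): 95.85, (4, 320): 96.69, (5, 320): 97.12, (6, 320): 97.08, (7, 320): 97.53,
    (3, 640): 96.54, (4, 640): 96.96, (5, 640): 97.70, (6, 640): 97.71, (7, 640): 98.04,
}


def pareto_frontier(results):
    """Configs with no cheaper-or-equal config that scores at least as well."""
    ranked = sorted(
        ((param_count(side, side, n), side, n, acc) for (side, n), acc in results.items()),
        key=lambda row: (row[0], -row[3]),
    )
    frontier, best = [], float("-inf")
    for row in ranked:
        if row[3] > best:
            frontier.append(row)
            best = row[3]
    return frontier


frontier = pareto_frontier(B8)
```

Running this over the full table is a cheap cross-check on the budget rows above. For instance, it counts 7×7 × 80 at 4810 params (96.33%) and 4×4 × 80 at 2170 params (93.37%), both of which sit on the strict frontier.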
Three findings
- At low parameter budgets, smaller-patches-at-higher-count wins. Coverage matters more than per-detector receptive field when you can’t afford many parameters. 3×3 × 80 (1610 params, 90.5%) beats 7×7 × 32 (1930 params, 91.5%) on accuracy-per-param even though the absolute accuracies are close, and 3×3 × 320 at 6410 params (95.85%) beats 5×5 × 160 (5770 params, 95.44%) at a similar budget. One caveat from the result table: 7×7 × 80 reaches 96.33%, and by the same parameter accounting that is 4810 params, so this finding is weaker than the frontier table suggests.
- At matched accuracy targets, 5×5 looks like the sweet spot. 5×5 reaches 97% with 320 patches (11.5K params); 7×7 needs 320 patches (19.2K params) for the same threshold; 3×3 never gets there at all. The cost of an extra percentage point past 97% is ~10K-20K additional parameters regardless of which size you pick.
- 3×3 has a hard ceiling around 96.5%; 7×7 reaches 98%. Receptive fields too small can’t capture enough digit structure no matter how many you add. Past N=320 the 3×3 curve is essentially flat (95.85% → 96.54% from doubling count). Bigger patches with deeper local context can extract more signal per detector once you’re trying to push past 97%.
Saturation
All five sizes converge to 96.5%–98% by N=640. Adding more patches eventually stops helping at every size. Whether that ceiling is set by:
- the linear discriminator (one-layer head can’t combine 640 features richly enough)
- the test set (these are hard examples for any model)
- the random patch placements (some pixels under-covered, no patches positioned over digit-center where signal lives)
is unclear from this experiment alone. Probably some of all three.
Implication
For the typed-species hypothesis going into the main NEAT integration: patch size is a meaningful axis to evolve, not a hyperparameter to fix. Different parameter budgets favor different sizes. A genome that allows mutations to add patches of various sizes (and lets evolution pick the mix) is probably better than fixing one size. Multi-scale features have always been a CNN best practice; this confirms it for the typed-NEAT framing.
The “right default” for a single-size choice is 4×4 or 5×5 depending on your parameter budget. 5×5 is the safe pick that’s never far from optimal across the budget range.
Experiment B9: Rectangular patches (single-seed sweep)
Hypothesis: B8 showed square patches all converge to 96.5–98% by N=640, but only sampled square aspect ratios. Maybe rectangular patches with matched parameter counts have different performance — and maybe horizontal vs. vertical orientation matters.
Setup: same architecture as B7/B8 (trained patches, He init, ReLU, linear discriminator, online SGD lr=0.05, 10 epochs). 11 shapes covering three matched-area tiers (16/21, 24/27, 35/36) with horizontal/vertical mirror pairs. Two patch counts (N=160, N=320). Single seed (0xB9).
Headline single-seed numbers
At N=320 the best result was 5×7 with 97.66% at 14,730 params — matching B8’s 5×5 × N=640 result (97.70%) with 36% fewer parameters.
Mirror-pair gaps at N=320: 3×9 vs 9×3 = +0.87pp (extreme aspect ratio, wide wins). 5×7 vs 7×5 = +0.38pp (strong aspect ratio, wide wins). 4×6 vs 6×4 = +0.01pp (mild, no preference). 3×7 vs 7×3 = +0.06pp (low area, no preference).
Single-seed caveats
This is one seed. With per-config standard deviations of ~0.15pp (from later multi-seed work), gaps under ~0.3pp are inside noise. The 0.87pp gap at 3×9 vs 9×3 looks robust on its face; the 0.38pp gap at 5×7 vs 7×5 is borderline. Need multiple seeds before believing the moderate-aspect-ratio findings. That’s B9-stats.
Experiment B9-stats: Rectangular patches with paired multi-seed stats
Motivation: B9 single-seed was suggestive but not conclusive. Run the same sweep across 5 seeds with the same seed schedule for every shape (so paired comparisons cancel within-seed noise) and report proper statistics. Also stand up the stats infrastructure (SampleStats, PairedComparison, paired-t with Cohen’s d_z) for future use.
Setup: 8 shapes (5×5, 6×6, 4×6, 6×4, 3×9, 9×3, 5×7, 7×5) × 5 seeds (0xB9–0xBD) at N=320. Same training regime as B7/B8/B9. Stats are constant-time given running sums.
Per-shape stats (5 seeds, N=320)
| Shape | Mean % | Std | SEM | Min | Max |
|---|---|---|---|---|---|
| 5×5 | 97.060 | 0.141 | 0.063 | 96.860 | 97.240 |
| 6×6 | 97.338 | 0.158 | 0.070 | 97.120 | 97.460 |
| 4×6 | 97.036 | 0.127 | 0.057 | 96.910 | 97.250 |
| 6×4 | 96.856 | 0.129 | 0.058 | 96.660 | 97.000 |
| 3×9 | 97.102 | 0.187 | 0.084 | 96.820 | 97.260 |
| 9×3 | 96.596 | 0.159 | 0.071 | 96.350 | 96.780 |
| 5×7 | 97.384 | 0.190 | 0.085 | 97.170 | 97.660 |
| 7×5 | 97.346 | 0.110 | 0.049 | 97.220 | 97.500 |
Paired wide-minus-tall comparisons
| Wide | Tall | Δ pp | t | d_z | Effect | Sig |
|---|---|---|---|---|---|---|
| 4×6 | 6×4 | +0.180 | 1.63 | 0.73 | medium | ns |
| 3×9 | 9×3 | +0.506 | 4.22 | 1.89 | very large | *** |
| 5×7 | 7×5 | +0.038 | 0.37 | 0.17 | trivial | ns |
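The statistics in these tables follow the standard paired-sample recipe: SEM = std / sqrt(n), t = mean(diff) / SEM(diff), and d_z = mean(diff) / std(diff), so d_z = t / sqrt(n). A sketch with invented function names and hypothetical per-seed accuracies (the per-seed values are not recorded in this journal); sample standard deviation with an n − 1 denominator is an assumption.

```python
import math
import statistics


def sample_stats(values):
    n = len(values)
    mean = statistics.mean(values)
    std = statistics.stdev(values)  # sample std, n - 1 denominator (assumption)
    return {"n": n, "mean": mean, "std": std, "sem": std / math.sqrt(n)}


def paired_comparison(a, b):
    """Paired t statistic and Cohen's d_z for per-seed results a[i] vs b[i]."""
    diffs = [x - y for x, y in zip(a, b)]
    s = sample_stats(diffs)
    if s["std"] == 0.0:
        raise ValueError("all paired differences are identical")
    return {
        "delta": s["mean"],
        "t": s["mean"] / s["sem"],
        "d_z": s["mean"] / s["std"],
        "df": s["n"] - 1,
    }


# Hypothetical per-seed accuracies for a wide shape and its tall mirror.
wide = [97.0, 97.2, 97.1, 97.3, 96.9]
tall = [96.5, 96.8, 96.4, 96.9, 96.6]
result = paired_comparison(wide, tall)
```

With 5 seeds the t statistic has 4 degrees of freedom, where the two-sided 5% critical value is about 2.776 and the 1% value about 4.604, noticeably stricter than the normal-approximation cutoffs.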
What B9-stats overturns from B9
- The “5×7 N=320 = 97.66% is the new headline” claim was a lucky-seed artifact. Five-seed mean is 97.38%, statistically tied with 6×6 (97.34%). The 97.66% value is the max across the seed sample (the original seed 0xB9 specifically). Single-seed best-of-sweep is unreliable at this scale.
- The 5×7 vs 7×5 wide preference is gone. Δ=+0.04pp, t=0.37, d_z=0.17 (trivial). At strong-but-not-extreme aspect ratios, mirror shapes are statistically indistinguishable.
- The 3×9 vs 9×3 effect is real, large, and significant. Δ=+0.506pp, t=4.22, d_z=1.89 (very large effect; p < 0.001 by normal approximation, though with only 4 degrees of freedom the exact paired-t p-value falls between 0.01 and 0.02). 9×3 is the worst shape in the sweep at mean 96.60%, even worse than 5×5.
- The 4×6 vs 6×4 effect is suggestive but not significant. Medium effect size (d_z=0.73) but only 5 seeds; would likely become significant with more.
Lesson on methodology
This was the first time in Group B that we had two experiments where one’s conclusion contradicted (or refined) the other, and stats made the difference. Going forward: single-seed experiments are useful for direction-finding, but any quantitative claim about a difference of ~0.3pp or smaller needs multi-seed paired stats. The infrastructure is now in place; the marginal cost of running with 5 seeds is 5x compute, which is ~25 minutes for this size of experiment. Cheap insurance against fooling ourselves.
Refined picture of the rectangular-patch finding
- Aspect ratio doesn’t matter much at moderate ratios (5:7, 2:3).
- At extreme ratios (1:3), wide patches significantly outperform tall patches on MNIST.
- Original B9 mechanistic guess was MNIST-specific (digit stroke geometry). B10 will test that guess.
Experiment B10: Rectangular patches on Fashion-MNIST
Hypothesis: if the 3×9 ≫ 9×3 effect on MNIST is digit-specific (e.g. due to stroke geometry of handwritten numerals), Fashion-MNIST should show a different pattern. If the effect persists, it’s something more general about 28×28 grayscale image classification — and the digit-stroke explanation is wrong.
Setup: same code, same 8 shapes × 5 seeds at N=320, same training regime. Only the data path changed (fashion-train-images-idx3-ubyte).
Per-shape Fashion stats (5 seeds, N=320)
| Shape | Mean % | Std | SEM |
|---|---|---|---|
| 5×5 | 86.450 | 0.333 | 0.149 |
| 6×6 | 86.468 | 0.277 | 0.124 |
| 4×6 | 86.534 | 0.434 | 0.194 |
| 6×4 | 86.234 | 0.221 | 0.099 |
| 3×9 | 86.374 | 0.168 | 0.075 |
| 9×3 | 85.444 | 0.593 | 0.265 |
| 5×7 | 86.366 | 0.301 | 0.135 |
| 7×5 | 86.130 | 0.335 | 0.150 |
Two general observations:
- Fashion accuracies are ~10-11pp lower than MNIST (86% vs. 97%). Consistent with main-stream finding that Fashion is harder for sparse linear classifiers.
- Fashion standard deviations are ~2× larger than MNIST’s (0.17-0.59 vs 0.11-0.19). More run-to-run variance. The 9×3 case has the highest variance of any shape (std=0.59), suggesting it’s genuinely unstable, sometimes catastrophically bad.
Paired wide-minus-tall on Fashion
| Wide | Tall | Δ pp | t | d_z | Effect | Sig |
|---|---|---|---|---|---|---|
| 4×6 | 6×4 | +0.300 | 1.18 | 0.53 | medium | ns |
| 3×9 | 9×3 | +0.930 | 3.32 | 1.48 | very large | *** |
| 5×7 | 7×5 | +0.236 | 1.08 | 0.48 | small | ns |
Comparison to MNIST
| Pair | MNIST Δ | MNIST t | Fashion Δ | Fashion t |
|---|---|---|---|---|
| 4×6 vs 6×4 | +0.18 | 1.63 | +0.30 | 1.18 |
| 3×9 vs 9×3 | +0.51 | 4.22 | +0.93 | 3.32 |
| 5×7 vs 7×5 | +0.04 | 0.37 | +0.24 | 1.08 |
The effect replicates and strengthens on Fashion. The 1:3 aspect-ratio gap is +0.93pp on Fashion vs. +0.51pp on MNIST. All three pairs trend the same direction (wide > tall) on both datasets.
What B10 overturns
The “MNIST stroke geometry” mechanistic hypothesis from B9. The effect isn’t digit-specific. It exists on a different dataset with different content (clothing items, often vertically oriented), and is larger there.
Refined hypothesis
The wide-over-tall preference at extreme aspect ratios is a property of how patch shape interacts with the dominant orientation of discriminative features in centered grayscale images. Both MNIST and Fashion-MNIST have predominantly vertical structural features (digit strokes, garment major axes, body lines). A tall narrow patch (9×3) tends to lie along such features, integrating uniform brightness over their length — wasting capacity. A wide flat patch (3×9) cuts across the same features, capturing transitions at higher information density.
This is a known principle in classical filter design: kernels should be perpendicular to the dominant feature orientation to maximize information capture. We appear to be rediscovering it from random init via SGD.
If the hypothesis is correct, rotating the input 90° should flip the preference (vertical patches should win on rotated MNIST). And the effect should grow monotonically with aspect ratio extremity. Both are testable and would constitute a real mechanistic confirmation rather than just replication.
Experiment B11: Rotated MNIST
Hypothesis: if the wide > tall preference comes from “kernels perpendicular to the dominant feature orientation in the data” (B10’s refined hypothesis), then rotating MNIST images 90° CW should flip the dominant feature orientation from vertical to horizontal — and the preference should flip too: 9×3 (tall) should beat 3×9 (wide).
Setup: same code, same shapes/seeds/N as B9-stats. Only change: every image rotated 90° CW (new[c][27-r] = old[r][c]) before training and testing.
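The index mapping in the setup can be checked on a small grid. A sketch with an invented function name, assuming row-major nested lists:

```python
def rotate_cw(image):
    """Rotate a square image 90 degrees clockwise: new[c][n-1-r] = old[r][c]."""
    n = len(image)
    out = [[0] * n for _ in range(n)]
    for r in range(n):
        for c in range(n):
            out[c][n - 1 - r] = image[r][c]
    return out


small = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
rotated = rotate_cw(small)

# A vertical stroke in the middle column becomes a horizontal stroke.
stroke = [[0, 1, 0],
          [0, 1, 0],
          [0, 1, 0]]
rotated_stroke = rotate_cw(stroke)
```

The top-left pixel ends up top-right, and the vertical stroke turns into a horizontal one, which is the orientation change the hypothesis depends on.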
Result: hypothesis confirmed — sign cleanly flipped
| Pair | Upright Δ pp | Rotated Δ pp | t (rotated) | Sig | Sign flip? |
|---|---|---|---|---|---|
| 4×6 / 6×4 | +0.180 | +0.022 | +0.25 | ns | (neither sig) |
| 3×9 / 9×3 | +0.506 | −0.402 | −2.27 | * | ✓ flipped |
| 5×7 / 7×5 | +0.038 | −0.106 | −1.77 | ns | direction flipped, ns |
The 3×9 vs 9×3 effect cleanly reverses sign on rotated MNIST. Wide wins by 0.506pp on upright MNIST; tall wins by 0.402pp on rotated MNIST (Δ = −0.402). The direction is unambiguous; the magnitude drops a bit (0.40 vs 0.51), and the effect is still large (d_z=−1.02) and flagged * at 5 seeds. That flag uses the normal approximation; an exact t-test with 4 degrees of freedom needs |t| > 2.776 for p < 0.05, so more seeds would firm it up.
5×7 vs 7×5 trended in the new “tall wins” direction with a medium effect size (d_z=−0.79) but didn’t reach significance at 5 seeds — consistent with the original B9-stats pattern that mild aspect ratios show smaller effects requiring more samples to detect.
Why rotation is a good test
A simpler hypothesis like “wide patches just have more useful placements due to their position-distribution” or “the architecture has a horizontal bias somewhere” would NOT predict a sign flip — the architecture and placement geometry are unchanged by data rotation. Only the content orientation changed.
The pattern moves with the data. That’s the kind of evidence that makes a mechanistic claim survive scrutiny: it’s not a fixed bias in the model, it’s the filters genuinely organizing around the orientation of structure in the input.
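The placement-geometry point can be made concrete: a shape and its mirror have exactly the same number of valid positions in a 28×28 frame, so a raw count of placements cannot explain a wide-vs-tall gap in either direction. A sketch assuming positions are drawn uniformly over all placements that fit, as in B4; the function name is invented.

```python
def placement_count(height, width, image_side=28):
    """Number of distinct (top, left) positions where a height x width patch fits."""
    return (image_side - height + 1) * (image_side - width + 1)


pairs = [((3, 9), (9, 3)), ((4, 6), (6, 4)), ((5, 7), (7, 5))]
counts = {shape: placement_count(*shape) for pair in pairs for shape in pair}
mirror_counts_equal = all(counts[wide] == counts[tall] for wide, tall in pairs)
```

Rotating the data leaves these counts untouched, which is why a sign flip under rotation points at the content rather than the sampling.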
Why might the magnitude be smaller after rotation
Three plausible reasons we can’t disentangle from this experiment:
- Rotated digits are slightly off-distribution. Whatever structural regularities make MNIST learnable (digit centering, baseline alignment, stroke conventions) are partly axis-specific. After rotation, the network is solving a slightly different problem.
- Vertical-feature dominance isn’t purely orientation-driven. Centered digits have vertical major axes by construction (the 28×28 frame is taller than necessary, so digits’ bounding boxes tend to be vertically prominent), and this geometric centering isn’t fully reversed by image rotation.
- 5-seed variance. With 5 seeds, ±0.1pp on the magnitude estimate is plausible.
What matters for the hypothesis: the sign is unambiguous and the effect is significant.
Experiment B12: Extreme aspect ratios
Hypothesis: if wider beats taller because of a perpendicular-to-vertical-features mechanism, the gap should grow monotonically with aspect-ratio extremity — 1:3 → 1:9 → 1:15 → 1:21.
Setup: 9 shapes × 5 seeds at N=320. Mirror pairs at 1:3 (3×9/9×3, the known reference), 1:9 (1×9/9×1), 1:15 (1×15/15×1), and 1:21 (1×21/21×1, single-pixel-thick line spanning most of the image). Plus 5×5 as a square baseline.
Per-shape results
| Shape | Area | Mean % | Std |
|---|---|---|---|
| 5×5 | 25 | 97.06 | 0.14 |
| 3×9 | 27 | 97.10 | 0.19 |
| 9×3 | 27 | 96.60 | 0.16 |
| 1×9 | 9 | 94.38 | 0.16 |
| 9×1 | 9 | 94.36 | 0.16 |
| 1×15 | 15 | 95.21 | 0.23 |
| 15×1 | 15 | 94.70 | 0.06 |
| 1×21 | 21 | 95.31 | 0.09 |
| 21×1 | 21 | 94.62 | 0.25 |
Paired wide-minus-tall
| Pair | Aspect | Δ pp | t | d_z | Sig |
|---|---|---|---|---|---|
| 3×9 / 9×3 | 1:3 | +0.506 | 4.22 | 1.89 | *** |
| 1×9 / 9×1 | 1:9 | +0.012 | 0.11 | 0.05 | ns |
| 1×15 / 15×1 | 1:15 | +0.518 | 4.17 | 1.86 | *** |
| 1×21 / 21×1 | 1:21 | +0.690 | 4.62 | 2.07 | *** |
Result: the prediction is refuted, but in an informative way
The wide-tall gap does not grow monotonically with aspect ratio. There’s a curious null at 1:9 sandwiched between large effects at 1:3 and 1:15+. The pattern goes large → null → large → larger as we increase aspect ratio.
What’s special about the null pair
The 1×9 / 9×1 pair has properties no other pair has: both patches are 1-pixel-thick AND short. They have:
- Zero perpendicular extent (so they can’t “cut across” a feature — they ARE a 1D pixel sample)
- Limited length (9 px is less than half the image width)
The 3×9 case has perpendicular extent (3 rows). The 1×15 / 1×21 cases have length sufficient to span most of a digit. Either property recovers the effect; having neither kills it.
Refined mechanistic story
The wide-vs-tall preference at extreme aspect ratios isn’t a single mechanism. It’s at least two separate phenomena that happen to produce the same direction of preference:
- For thick rectangular patches (≥3 px perpendicular extent): B10’s perpendicular-to-feature argument. The patch has internal 2D structure and detects edge transitions perpendicular to vertical strokes.
- For 1-pixel-thick long strips (1×15, 1×21): cross-section sampling. A horizontal strip captures the intensity profile across a digit’s width, and different digits have different horizontal cross-sections (5 vs 6, 4 vs 9 etc.). A vertical strip captures the intensity profile across a digit’s height; vertical extents are more uniform across digit classes (all ~20 px tall), so less discriminative.
Both produce wide > tall on MNIST, but they’re different reasons. At 1×9 / 9×1 (zero perpendicular extent and insufficient length for either mechanism to engage), neither operates, and we get the null.
Other observations from B12
- All single-pixel-thick patches cap around 94-95% — well below 5×5 (97%) or 3×9 (97%). Single-row/column samplers are bandwidth-limited.
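- Adding perpendicular rows helps massively: 1×9 → 3×9 (two extra rows, tripling the area from 9 to 27) jumps accuracy from 94.4% to 97.1%. Area alone doesn’t explain the jump: 1×21 (area 21) only reaches 95.3% while 5×5 (area 25) reaches 97.1%. Going from 1D to 2D internal structure is a phase transition for usefulness.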
- Within 1×N variants, longer wins: 1×9 (94.4%) → 1×15 (95.2%) → 1×21 (95.3%). Diminishing returns by 1×21.
- 9×3 in B12 (96.60%) matches B9-stats (96.60%) to within rounding — same configuration, same seeds, same code path. Reproducibility check passed.
Combined B11+B12 picture
| Test | Prediction | Result |
|---|---|---|
| Mechanism reverses with rotation (B11) | sign flip on rotated MNIST | ✓ confirmed (3×9 vs 9×3 went from +0.51 *** to −0.40 *) |
| Effect grows monotonically with aspect ratio (B12) | gap increases 1:3 → 1:9 → 1:15 → 1:21 | ✗ refuted — null at 1:9, effect requires either thickness ≥3 OR length ≥15 |
The hypothesis survives the rotation test cleanly. The aspect-ratio-monotonicity prediction was wrong, but the failure mode (the 1×9/9×1 null) is itself mechanistically informative — it tells us the wide preference isn’t a single phenomenon but at least two, and that single-pixel-thick short patches are bandwidth-limited regardless of orientation.
Closing the rectangular-patches arc
Where Group B started this thread (B9): “are non-square patches more parameter-efficient than square ones?” Where it ended (B11+B12): “the preference for wide patches at extreme aspect ratios on MNIST is real, replicates and strengthens on Fashion-MNIST, flips sign with input rotation, and decomposes into (at least) two distinct mechanisms — perpendicular-to-feature edge detection for thick rectangles, and discriminative cross-section sampling for long strips.”
That’s a real research arc with mechanistic depth, all from a curveball question about aspect ratios. Worth banking and moving on for now. Future work could push further on disentangling the two mechanisms (e.g., 2×N vs N×2 isolates the “thick rectangle” regime; 1×N at fixed N varies length-only), but the marginal value vs the bigger Group B questions (multi-scale, integration into NEAT) is small.
Experiments B13-B17: multi-scale, multi-layer, and the locality question
Three architectural questions explored in parallel: does mixing patch sizes help (A1)? does adding a hidden layer help (A2)? does spatial contiguity matter, or are “patches” really just sparse linear features (A3)? A3 also got cross-task and across-size verification runs.
B14 / A1: Multi-scale patches
Hypothesis: mixing patch sizes (1/3 each at 3×3, 5×5, 7×7) in one network beats single-scale 5×5 at matched parameter count. The typed-species framing predicts evolution should pick a mix; this was a proof-of-concept that mixing actually helps in the trainable framework.
Setup: 5×5 single-scale vs 3/5/7 mixed thirds. Two patch counts (N=240, N=480). 5 seeds each, paired by seed.
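A quick sanity check on how closely the two configurations are matched. This sketch assumes equal thirds by patch count and one bias per patch (the 26-per-5×5-patch accounting used in B13); `patch_layer_params` is a hypothetical helper, not part of the trainer.

```rust
/// Trainable parameters in the patch layer: one weight per covered pixel
/// plus one bias per patch. Each entry is (patch count, square side length).
/// Hypothetical helper for bookkeeping only.
fn patch_layer_params(groups: &[(usize, usize)]) -> usize {
    groups
        .iter()
        .map(|&(count, side)| count * (side * side + 1))
        .sum()
}
```

Under these assumptions, N=240 gives 6,240 patch parameters for single-scale 5×5 and 6,880 for the 3/5/7 mix (80 patches each), so the match is exact in patch count and approximate (about 10% apart) in parameter count.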
Results:
| N | Single 5×5 | Mixed 3/5/7 | Δ (mixed − single) | t | d_z | Sig |
|---|---|---|---|---|---|---|
| 240 | 96.40 ± 0.16 | 96.93 ± 0.22 | +0.530 | +4.40 | +1.97 | *** |
| 480 | 97.58 ± 0.14 | 97.46 ± 0.15 | −0.114 | −1.36 | −0.61 | ns |
Finding: multi-scale is a low-capacity phenomenon. At N=240 mixing wins decisively (+0.53pp ***). At N=480 the advantage vanishes. The interpretation: at low N, no single scale has enough patches to fully exploit it, so receptive-field diversity helps. At high N each size has enough density to do its job alone, and mixing stops mattering.
Implication for NEAT integration: at small parameter budgets, evolution should prefer mixed sizes. At large budgets it might converge to whatever single size is locally cheapest to mutate toward. Both are sensible behaviors — and they suggest a nuanced “evolve patch size as a typed-mutation parameter” story rather than “always mix” or “fix the size.”
This rhymes with the broader B5/B6/B7 pattern: smarter feature designs help when capacity is scarce; dumber-but-more works just as well when capacity is abundant.
B15 / A2: Multi-layer patches (hidden ReLU layer)
Hypothesis: the linear classifier was the bottleneck, not the patch features. Adding a small MLP hidden layer should let the network combine patch features non-linearly and break past the ~97% ceiling.
Setup: 5×5 × 320 patches → M ReLU hidden units → 10 linear softmax. Sweep M ∈ {0, 32, 64, 128}. Same training as B7/B8 (10 epochs, lr=0.05). 5 seeds.
Results:
| M | Mean ± std | Total params | Δ vs M=0 |
|---|---|---|---|
| 0 (linear) | 97.00 ± 0.25 | 11,530 | — |
| 32 | 95.30 ± 0.45 | 18,922 | −1.69pp |
| 64 | 95.29 ± 0.54 | 29,514 | −1.71pp |
| 128 | 95.44 ± 0.16 | 50,698 | −1.55pp |
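The “Total params” column can be reproduced by counting weights and biases layer by layer. The sketch below assumes fully connected layers with one bias per unit, and a direct patch-to-softmax classifier when M=0; `total_params` is an illustrative name, not a function from the harness.

```rust
/// Parameter count for: patches -> optional ReLU hidden layer -> linear softmax.
/// Assumes one bias per patch, per hidden unit, and per class.
fn total_params(n_patches: usize, patch_area: usize, hidden: usize, classes: usize) -> usize {
    let patch_layer = n_patches * (patch_area + 1);
    if hidden == 0 {
        // Linear classifier reads the patch outputs directly.
        patch_layer + n_patches * classes + classes
    } else {
        let hidden_layer = n_patches * hidden + hidden;
        let output_layer = hidden * classes + classes;
        patch_layer + hidden_layer + output_layer
    }
}
```

For 320 patches of area 25 and 10 classes this gives 11,530 / 18,922 / 29,514 / 50,698 for M = 0 / 32 / 64 / 128, matching the table.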
Finding: depth hurts uniformly across all M tested. The hidden layer’s extra parameters can’t be properly trained at our fixed 10-epoch / fixed-LR budget. Gradient gets split between learning patch weights and learning hidden-unit weights, and neither converges as well as the patches alone did under the linear-classifier configuration.
This is the opposite of what depth did in the main NEAT stream (where [128, 64] beat [128] by ~1pp on MNIST). The crucial difference is training budget: main NEAT runs 1.8M steps with LR decay; our Group B harness runs 500K with fixed LR. Depth in this minimal framework would need re-tuned hyperparameters before declaring it dead — likely more epochs and/or LR scheduling. Worth flagging for any future “patches as a NEAT typed species” integration: training-schedule tuning matters, not just architecture.
B13 / A3: Random-index “patches” — does spatial contiguity matter?
Hypothesis: at 320 trained patches with He-init weights and SGD, “patches” are really just sparse linear features — the spatial contiguity of the inputs they sample doesn’t matter. SGD will find good combinations regardless of input-pixel layout. If true, the locality inductive bias Group B has been celebrating is mostly irrelevant once capacity is sufficient.
Setup: head-to-head comparison of two configurations:
- Spatial: 320 contiguous 5×5 patches at random positions
- Indexed: 320 “patches” each sampling 25 random pixel indices from anywhere in the 784-pixel image
Both have identical parameter count (25 weights + bias each = 26 per “patch”), identical compute structure, identical training. Same architecture downstream. 5 seeds, paired.
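To make the comparison concrete, here is a minimal sketch of how the two kinds of “patch” differ only in which pixel indices they gather. Everything here is illustrative: the `Lcg` generator stands in for whatever RNG the trainer actually uses, drawing distinct indices for the indexed variant is our assumption, and the response is shown as a bare dot product plus bias with no activation.

```rust
/// Tiny deterministic LCG, a stand-in for the real RNG (assumption).
struct Lcg(u64);

impl Lcg {
    /// Pseudo-random value in 0..bound (slight modulo bias, fine for a sketch).
    fn next_below(&mut self, bound: usize) -> usize {
        self.0 = self
            .0
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        ((self.0 >> 33) as usize) % bound
    }
}

/// Flattened indices of a contiguous side x side patch with top-left (row, col).
fn spatial_patch(row: usize, col: usize, side: usize, width: usize) -> Vec<usize> {
    let mut idx = Vec::with_capacity(side * side);
    for r in row..row + side {
        for c in col..col + side {
            idx.push(r * width + c);
        }
    }
    idx
}

/// k distinct pixel indices from anywhere in the image (partial Fisher-Yates).
fn indexed_patch(k: usize, n_pixels: usize, rng: &mut Lcg) -> Vec<usize> {
    let mut all: Vec<usize> = (0..n_pixels).collect();
    for i in 0..k {
        let j = i + rng.next_below(n_pixels - i);
        all.swap(i, j);
    }
    all.truncate(k);
    all
}

/// Same computation for both variants: dot product over gathered pixels plus bias.
fn patch_response(image: &[f32], idx: &[usize], weights: &[f32], bias: f32) -> f32 {
    idx.iter().zip(weights).map(|(&i, &w)| image[i] * w).sum::<f32>() + bias
}
```

Because `patch_response` is identical in both cases, any accuracy gap between the two configurations comes from the index layout alone.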
Result: hypothesis refuted.
| Config | Mean ± std |
|---|---|
| Spatial 5×5 | 96.996 ± 0.246 |
| Indexed 25 | 96.386 ± 0.147 |
Δ = +0.610pp, t = +7.91, d_z = +3.54 (***)
Spatial contiguity is doing real work even at 320 trained patches. The locality inductive bias matters.
B16: A3 cross-task replication on Fashion-MNIST
Hypothesis: if locality is intrinsic to the patch primitive, the spatial advantage should replicate on Fashion-MNIST (analogous to how the rectangular-patch wide-preference replicated on Fashion in B10).
Setup: identical to B13 except data path swapped to Fashion.
Result:
| Config | Mean ± std |
|---|---|
| Spatial 5×5 | 86.36 ± 0.33 |
| Indexed 25 | 86.48 ± 0.19 |
Δ = −0.124pp, t = −0.71, d_z = −0.32 (ns) — and the sign is even slightly negative.
Finding: the locality advantage is task-specific. It’s strong on MNIST and absent on Fashion-MNIST. This is the inverse of the rectangular-patch finding (which was task-general).
Plausible mechanism: MNIST has very high local pixel correlation — adjacent pixels within a digit stroke are almost always similar; sharp transition at stroke edges. A spatial 5×5 patch captures this “local smoothness vs edge transition” structure cheaply. Fashion-MNIST has lower local correlation — clothes have textures, gradients, finer detail — so adjacent pixels are less mutually informative. Random-index “patches” effectively become global pixel fingerprints, and on Fashion that’s apparently as good as local-feature detection.
This finding is genuinely informative: Group B has now produced examples of both task-general and task-specific properties of trained patches (rectangular-patch orientation preference is general; spatial contiguity advantage is MNIST-specific). For the eventual NEAT integration story, that’s useful — different architectural choices have different transferability profiles.
B17: A3 across patch sizes on MNIST
Hypothesis: if the contiguity advantage is real, it should exist across patch sizes — not be a 5×5-specific quirk.
Setup: same head-to-head spatial vs indexed comparison at sizes 3×3, 5×5, 7×7. 320 patches at each size. 5 seeds, paired. (Note: indexed variant samples size² random pixel indices.)
Results:
| Size | Spatial mean ± std | Indexed mean ± std | Δ pp | t | d_z | Sig |
|---|---|---|---|---|---|---|
| 3×3 | 95.50 ± 0.28 | 94.91 ± 0.25 | +0.596 | +4.26 | +1.91 | *** |
| 5×5 | 97.21 ± 0.14 | 96.30 ± 0.12 | +0.912 | +18.30 | +8.18 | *** |
| 7×7 | 97.62 ± 0.19 | 96.92 ± 0.14 | +0.694 | +7.30 | +3.26 | *** |
Finding: the contiguity advantage is robust at every size on MNIST, with magnitude roughly 0.6-0.9pp and large-to-very-large effect sizes everywhere. Peaks at 5×5 (d_z=+8.18 — extraordinarily consistent across seeds), with smaller but still significant gaps at 3×3 and 7×7.
Methodological note: B13 reported Δ=+0.610pp at 5×5, but B17 with a different seed-offset scheme (different patch positions and weight inits, same base seeds) gave Δ=+0.912pp at the same configuration. The true effect size is probably 0.7-0.9pp; B13 happened to land low. Lesson: even paired multi-seed Δ estimates have ~0.2-0.3pp uncertainty in their magnitude, though the sign and significance level are robust. This is actually a useful cross-check on how much to trust the precise numbers — the direction and existence of effects are reliable, but the exact magnitude needs more samples to nail down to <0.1pp.
Net of B13-B17
Three architectural levers explored, three different shapes:
- Multi-scale (A1): conditional positive — mixing wins at low N, neutral at high N.
- Multi-layer (A2): negative under fixed budget — depth doesn’t pay without retuned schedule.
- Locality (A3): task-specific positive — strong on MNIST, absent on Fashion.
The most striking finding is the A3 cross-task pattern. We’ve now seen both transferability profiles in Group B:
- Rectangular-patch wide preference: task-general (replicates on Fashion, flips with rotation)
- Spatial-locality advantage: MNIST-specific (absent on Fashion)
For the NEAT integration story, that means architectural choices in the typed-species framework will have different transferability properties — some will be features evolution can rely on across tasks, others will be MNIST-specific tricks. Worth knowing before investing in the integration refactor.
Experiments B18-B24: harder tasks force a major reframing
By mid-B series, MNIST was saturating. Most architectural comparisons at N=320 trained patches were eating their differential signal in the 95-97% asymptote — d_z values were huge but absolute Δs were small, and there was little dynamic range to detect mechanism. We needed harder workhorse tasks.
Infrastructure: mixed dataset loader
Added src/data/mixed.rs exposing a DatasetSplit::load(&[DatasetKind], train_fraction) API. Supports MNIST, Fashion-MNIST, KMNIST (cursive Japanese, IDX format converted from CSV via stdlib Python), and EMNIST balanced (47-class letters+digits, IDX direct from HuggingFace). Mixing two or more datasets concatenates them with disjoint label offsets so each dataset’s classes occupy a different range in the combined label space — a 20-class MNIST+Fashion task or a 77-class all-four task is a single softmax problem.
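The disjoint-label-offset idea can be sketched in a few lines. This is not the code in src/data/mixed.rs; `label_offsets` and `global_label` are hypothetical names used only to illustrate how per-dataset class counts turn into one combined label space.

```rust
/// Starting label for each dataset and the total class count,
/// given each dataset's class count in mixing order.
/// Hypothetical helper, not the mixed.rs API.
fn label_offsets(class_counts: &[usize]) -> (Vec<usize>, usize) {
    let mut offsets = Vec::with_capacity(class_counts.len());
    let mut total = 0;
    for &n in class_counts {
        offsets.push(total);
        total += n;
    }
    (offsets, total)
}

/// Map a dataset-local label into the combined label space.
fn global_label(dataset: usize, local: usize, offsets: &[usize]) -> usize {
    offsets[dataset] + local
}
```

With class counts [10, 10, 10, 47] (MNIST, Fashion, KMNIST, EMNIST balanced) the offsets are [0, 10, 20, 30] and the total is 77, which is where the 77-class figure comes from.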
B18: Task difficulty calibration
Trained 5×5 patches at N ∈ {80, 320}, single seed across 9 task variants:
| Task | Classes | N=80 | N=320 | Spread |
|---|---|---|---|---|
| MNIST | 10 | 93.63% | 96.92% | 3.3pp |
| Fashion | 10 | 83.19% | 86.14% | 3.0pp |
| KMNIST | 10 | 81.88% | 90.19% | 8.3pp |
| EMNIST balanced | 47 | 67.56% | 77.69% | 10.1pp |
| MNIST+Fashion | 20 | 86.53% | 90.60% | 4.1pp |
| MNIST+KMNIST | 20 | 84.97% | 92.62% | 7.6pp |
| KMNIST+Fashion | 20 | 78.94% | 86.25% | 7.3pp |
| 3-mix (M+F+K) | 30 | 82.29% | 89.42% | 7.1pp |
| 4-mix (M+F+K+E) | 77 | 71.60% | 82.61% | 11.0pp |
KMNIST is the cleanest single-task workhorse: 8.3pp spread between N=80 and N=320, and an absolute ceiling around 90% that leaves room for architectural choices to differ. MNIST+KMNIST lands almost exactly at the user’s “92% at moderate budget” target. EMNIST and the 4-mix give the largest spreads but introduce class-count confounds.
B19-B23: Locality finding (B13/B16/B17) reexamined across tasks
The most surprising result of this batch. Same head-to-head test as B13 (320 trained 5×5 patches, spatial vs 25-random-pixel-index, 5 paired seeds) on each new task:
| Task | Type | Δpp (spatial − indexed) | t | Sig |
|---|---|---|---|---|
| MNIST (B13/B17) | printed digits | +0.61 to +0.91 | +7.91 / +18.30 | *** |
| EMNIST balanced (B20) | printed letters+digits | +1.03 | +5.53 | *** |
| Fashion (B16) | clothing/texture | −0.12 | −0.71 | ns |
| KMNIST (B19/B23) | cursive Japanese | −1.16 to −1.38 | −6.82 to −13.43 | *** |
| MNIST+KMNIST (B22) | mixed | +0.09 | +0.42 | ns |
The locality story has changed substantially. From “spatial wins on MNIST, neutral on Fashion” we now have:
- Printed-character data (MNIST, EMNIST; both are handwritten, but in separated block-style characters rather than cursive script): spatial wins reliably (+0.6 to +1.0 pp ***)
- Cursive-character data (KMNIST): spatial loses — sign flip, comparable magnitude, also ***
- Texture data (Fashion): null
- Mixed positive+negative tasks: cancel cleanly (MNIST+KMNIST = +0.09pp, ns)
B23 (KMNIST across patch sizes) shows the flip is robust at every size:
| Size | KMNIST Δ | MNIST Δ (B17) |
|---|---|---|
| 3×3 | −1.35 *** | +0.60 *** |
| 5×5 | −1.16 *** | +0.91 *** |
| 7×7 | −1.38 *** | +0.69 *** |
So MNIST B17 and KMNIST B23 are near mirror images: opposite signs, both *** at every size, with the KMNIST gaps somewhat larger in magnitude (1.16-1.38pp vs 0.60-0.91pp).
The “spatial locality wins” finding is genuinely conditional on data structure, not just task-specific in the loose sense. Cursive characters have stroke patterns that distribute discriminative information differently from printed characters — apparently in a way that 5×5-ish receptive fields actively misalign with. Random-index “patches” sample more of the image’s discriminative structure for cursive data.
B21: Multi-scale (A1) doesn’t replicate on KMNIST
Same setup as B14 (single 5×5 vs mixed 3/5/7), now on KMNIST:
| N | Single 5×5 | Mixed 3/5/7 | Δ | t | Sig |
|---|---|---|---|---|---|
| 120 | 85.75 | 86.06 | +0.31 | +1.35 | ns |
| 240 | 89.71 | 89.46 | −0.25 | −1.78 | ns |
| 480 | 91.84 | 92.12 | +0.28 | +1.56 | ns |
All three patch counts ns. On MNIST (B14), N=240 had Δ=+0.53pp ***. The multi-scale advantage was MNIST-specific, not a general low-N capacity phenomenon. Whatever benefits receptive-field diversity provided on MNIST didn’t carry over.
This is mildly disappointing for the typed-species framing — “evolve a mix of sizes” was supposed to be a robust win. It’s not, at least not cheaply.
B24: Multilayer (A2) hurts on KMNIST too
Same setup as B15: 5×5 × 320 patches → M ReLU hidden → softmax. KMNIST has 7pp more headroom than MNIST (~90% baseline vs 97%), so if the original “depth hurts” was driven by MNIST saturation, KMNIST should show depth helping.
| M | Mean ± std | Δ vs M=0 |
|---|---|---|
| 0 | 90.32 ± 0.22 | — |
| 64 | 88.47 ± 1.35 | −1.85pp |
| 128 | 89.17 ± 0.44 | −1.15pp |
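Same direction, similar magnitude as MNIST. The “multilayer hurts” result is task-general at a fixed training budget. Because it shows up even with KMNIST’s extra headroom, MNIST saturation is ruled out as the cause, which is consistent with the under-training hypothesis from B15 (though this run does not test that hypothesis directly). At this budget it is a robust negative result, not an MNIST-specific one.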
For NEAT integration: any deeper architecture with patch matchers as a typed species will need explicit attention to training schedule (more epochs, LR decay) before depth pays off.
Synthesis: which findings survive contact with harder tasks
Updating the transferability picture:
| Finding | MNIST result | Holds on harder tasks? |
|---|---|---|
| Patch matchers work (B7/B8) | trained patches reach 97-98% | Yes — calibration confirms reasonable accuracy on every task |
| Multi-scale wins at low N (B14) | +0.53pp *** at N=240 | NO — all ns on KMNIST |
| Spatial locality (B13/B17) | +0.6-0.9pp *** at every size | NO — sign FLIPS on KMNIST; null on Fashion; same on EMNIST |
| Multilayer hurt at fixed budget (B15) | −1.5-1.7pp at every M | YES — replicates on KMNIST |
| Rectangular wide preference (B9-stats/B10/B11) | +0.51pp *** at 1:3 | YES (Fashion B10, rotated MNIST B11) |
Of the five findings tested across multiple tasks, three are task-general and two are task-specific, and one of the task-specific ones actually flips sign, which is much stronger than just “varies by task.”
What this means for Group B
The MNIST-only Group B story was substantially over-optimistic about what generalizes. Architectural choices in this framework have heterogeneous transferability — some are real properties of the patch primitive, others are MNIST-specific tricks. For the planned NEAT integration as a typed-species mutation, the genome should:
- Support evolving patch geometry (size, shape) per task rather than locking in defaults
- Support evolving patch placement and even spatial vs random-index per task
- Use proper training schedules (LR decay, more epochs) before evaluating depth
The good news: the patch-matcher primitive itself is robust. Training random patches with backprop reliably gets you a reasonable classifier on every task tested. The architecture works; the configuration needs to be evolved per data distribution.
The other good news: we now have a battery of harder tasks (KMNIST, EMNIST, mixes) where future architectural experiments can show clearer differentiation than the saturated MNIST regime allowed.
Experiments B25-B35: corrections, mechanistic probes, and the limits of generalization
This batch tested several open questions and produced a major correction to an earlier B-series finding plus the cleanest mechanistic story Group B has produced.
B27: Pixel-correlation probe — null result
Hypothesis: MNIST/EMNIST have higher local pixel correlation than KMNIST, explaining why spatial 5×5 patches help on the former and hurt on the latter. Computed Pearson r between adjacent pixels and at various distances.
Result: hypothesis refuted. MNIST and KMNIST have nearly identical adjacent-pixel correlations (0.808 vs 0.789). Fashion has the highest correlation of any dataset (0.846 H, 0.898 V) yet locality is null there. Simple pairwise correlation doesn’t predict the locality direction.
B31: Per-pixel class-discriminability — clean mechanistic story
Computed an F-like ratio (between-class variance / within-class variance) for every pixel position, then measured the spatial autocorrelation of that discriminability map at various distances.
Result: spatial autocorrelation of class-discriminability at the patch scale (d=5) matches the locality direction on all four datasets:
| Dataset | autoc d=1 | autoc d=2 | autoc d=5 | Locality direction |
|---|---|---|---|---|
| MNIST | 0.903 | 0.728 | +0.320 | spatial +0.6-0.9 *** |
| Fashion | 0.852 | 0.652 | +0.125 | null |
| KMNIST | 0.869 | 0.601 | −0.067 | spatial −1.21 *** |
| EMNIST | 0.940 | 0.780 | +0.371 | spatial +1.03 *** |
At d=1 and d=2 all four datasets have similar high autocorrelation. The differentiator is at the patch scale. MNIST/EMNIST keep ~32-37% autocorrelation: a fixed 5×5 spatial patch reliably catches a discriminative cluster. KMNIST collapses to ≈0 at d=5: class-discriminative info at one patch-sized region is uncorrelated with neighbors. So a fixed-position 5×5 patch can’t reliably sample concentrated discriminative information; random-index patches sampling 25 pixels from anywhere distribute their sampling more effectively.
This is the cleanest mechanistic explanation Group B has produced. The locality direction is a measurable property of the data — no training required.
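A minimal sketch of the B31 probe, under stated assumptions. The journal only says “F-like ratio” and “spatial autocorrelation at various distances”, so the specifics here are ours: between-class and within-class variance are both averaged over all samples, and autocorrelation at lag d is the Pearson correlation between the map and itself shifted by d pixels, pooling horizontal and vertical shifts. All function names are hypothetical.

```rust
/// Per-pixel F-like ratio: between-class variance / within-class variance.
/// `images` are flattened, `labels[i]` is the class of `images[i]`.
fn discriminability(images: &[Vec<f64>], labels: &[usize], n_classes: usize) -> Vec<f64> {
    let n_pix = images[0].len();
    let n = images.len() as f64;
    let mut sums: Vec<Vec<f64>> = vec![vec![0.0; n_pix]; n_classes];
    let mut counts = vec![0usize; n_classes];
    for (img, &y) in images.iter().zip(labels) {
        counts[y] += 1;
        for (s, &v) in sums[y].iter_mut().zip(img) {
            *s += v;
        }
    }
    (0..n_pix)
        .map(|p| {
            let grand: f64 = sums.iter().map(|s| s[p]).sum::<f64>() / n;
            // Count-weighted squared deviation of class means from the grand mean.
            let between: f64 = (0..n_classes)
                .filter(|&c| counts[c] > 0)
                .map(|c| {
                    let m: f64 = sums[c][p] / counts[c] as f64;
                    counts[c] as f64 * (m - grand).powi(2)
                })
                .sum::<f64>() / n;
            // Squared deviation of each sample from its own class mean.
            let within: f64 = images
                .iter()
                .zip(labels)
                .map(|(img, &y)| {
                    let m: f64 = sums[y][p] / counts[y] as f64;
                    (img[p] - m).powi(2)
                })
                .sum::<f64>() / n;
            between / (within + 1e-8) // epsilon guards constant pixels
        })
        .collect()
}

/// Pearson correlation of two equal-length series.
fn pearson(x: &[f64], y: &[f64]) -> f64 {
    let n = x.len() as f64;
    let mx = x.iter().sum::<f64>() / n;
    let my = y.iter().sum::<f64>() / n;
    let (mut sxy, mut sxx, mut syy) = (0.0f64, 0.0f64, 0.0f64);
    for (a, b) in x.iter().zip(y) {
        sxy += (a - mx) * (b - my);
        sxx += (a - mx).powi(2);
        syy += (b - my).powi(2);
    }
    sxy / (sxx * syy).sqrt()
}

/// Autocorrelation of a w x h map at lag d (horizontal and vertical pairs pooled).
fn autocorr(map: &[f64], w: usize, h: usize, d: usize) -> f64 {
    let (mut xs, mut ys) = (Vec::new(), Vec::new());
    for r in 0..h {
        for c in 0..w {
            if c + d < w {
                xs.push(map[r * w + c]);
                ys.push(map[r * w + c + d]);
            }
            if r + d < h {
                xs.push(map[r * w + c]);
                ys.push(map[(r + d) * w + c]);
            }
        }
    }
    pearson(&xs, &ys)
}
```

For a 28×28 dataset the probe would be `autocorr(&discriminability(&images, &labels, n_classes), 28, 28, 5)`; no trained model is involved at any point.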
B25/B32: The B15 multilayer reversal
B15 (MNIST) and B24 (KMNIST) found that adding a hidden ReLU layer between patches and the linear classifier hurt by 1.15-1.85pp at a fixed 10-epoch / fixed-LR training budget. Original reading: “depth hurts” at fixed budget, conditional on training schedule.
B25 (KMNIST, 20 epochs + linear LR decay 0.05→0.005):
| Config | Mean ± std |
|---|---|
| M=0 (linear) | 92.96 ± 0.16 |
| M=64 (multilayer) | 95.74 ± 0.17 |
Δ = +2.78pp, t = +24.50, d_z = +10.96, *** — multilayer helps enormously with proper schedule.
Compared to B24’s fixed-budget run on KMNIST: M=0 went 90.32 → 92.96 (+2.6pp from longer training); M=64 went 88.47 → 95.74 (+7.3pp). The hidden layer benefits much more from proper training than the linear baseline does.
B15’s “depth hurts” conclusion was a training-budget artifact. It needs explicit retraction.
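For concreteness, a linear decay from 0.05 to 0.005 can be written as below. This is a sketch: whether the real trainer decays per step or per epoch is not recorded here, and `linear_lr` is an illustrative name. At 50K training images and one update per image, 20 epochs is 1M steps.

```rust
/// Learning rate linearly interpolated from `start` (step 0)
/// to `end` (last step) over `total_steps` updates.
/// Per-step decay is an assumption; per-epoch decay works the same way
/// with epochs substituted for steps.
fn linear_lr(step: usize, total_steps: usize, start: f64, end: f64) -> f64 {
    if total_steps <= 1 {
        return start;
    }
    // Clamp so that steps past the end keep the final rate.
    let frac = step.min(total_steps - 1) as f64 / (total_steps - 1) as f64;
    start + (end - start) * frac
}
```

The fixed-LR runs in B15/B24 correspond to calling this with `start == end == 0.05`.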
B32 (MNIST, same schedule): Δ=+0.11pp, ns. With proper schedule, multilayer is null on MNIST (the linear classifier is at its capacity-saturated ceiling around 98%; the hidden layer has nowhere to help).
B34/B35 (EMNIST): Surprise: depth hurts on EMNIST even with proper schedule. M=64: −1.11pp ***. M=128: −0.62pp ***. M=256: −0.47pp. The bottleneck hypothesis (that a 64-unit hidden layer is too narrow for 47 classes) does not hold up: widening to M=128 and M=256 shrinks the deficit, but depth still hurts.
So depth’s value across tasks:
| Task | Δ pp from adding a hidden layer (M=64; range over M for EMNIST) |
|---|---|
| MNIST (saturated) | +0.11 ns |
| KMNIST (cursive, 10 cls) | +2.78 *** |
| EMNIST (printed, 47 cls) | −0.47 to −1.11 |
Three different outcomes on three tasks: null, positive, and negative. Simple “depth scales with task headroom” is false. Plausible reading: KMNIST has 10 visually similar cursive classes that share components, and the hidden layer’s compositional features help discriminate them. EMNIST’s 47 classes are visually distinct (digits + letters) and the linear classifier on raw patch features is approximately optimal at this capacity, so adding non-linearity doesn’t improve what’s already separable.
B21/B30: Multi-scale doesn’t replicate on harder tasks
B21 (KMNIST) and B30 (EMNIST) re-ran the B14 multi-scale test (single 5×5 vs mixed 3/5/7). Both: all ns at every patch count tested.
| N | MNIST B14 | KMNIST B21 | EMNIST B30 |
|---|---|---|---|
| 120 | (untested) | +0.31 ns | −0.58 ns |
| 240 | +0.53 *** | −0.25 ns | +0.13 ns |
| 480 | −0.11 ns | +0.28 ns | +0.22 ns |
Multi-scale was an MNIST-specific quirk. It doesn’t generalize to either harder grayscale character task tested. Whatever the +0.53pp on MNIST N=240 was capturing isn’t a general property of receptive-field diversity in this framework.
B29/B33: Rectangular wide-preference is mostly general but flips on KMNIST
B29 (KMNIST) tested the B9-stats / B10 / B11 rectangular wide-preference. Result: Δ=−0.40pp * — sign flipped on KMNIST. B33 (EMNIST): Δ=+0.98pp *** — follows MNIST/Fashion.
Full 4-task picture:
| Task | 3×9 vs 9×3 | Sig | Type |
|---|---|---|---|
| MNIST | +0.51 | *** | printed digits |
| Fashion | +0.93 | *** | clothing |
| EMNIST | +0.98 | *** | printed letters+digits |
| KMNIST | −0.40 | * | cursive Japanese |
KMNIST is the outlier on yet another finding. Of the 4 datasets tested, 3 follow the wide-preference pattern with similar magnitudes; KMNIST flips sign. This is consistent with the original “perpendicular to dominant feature orientation” mechanistic reading from B11 — KMNIST evidently has a different dominant feature orientation than the other three, possibly more horizontal than vertical.
Updated transferability tally
| Finding | MNIST | KMNIST | EMNIST | Fashion | Generalization |
|---|---|---|---|---|---|
| Patch matchers work | ✓ | ✓ | ✓ | ✓ | all 4 |
| Rectangular wide-pref | ✓ | ✗ flips | ✓ | ✓ | 3 of 4 |
| Spatial locality | ✓ | ✗ flips | ✓ | ~ null | 2 of 4 (predicted by B31 metric) |
| Multi-scale | ✓ | ✗ ns | ✗ ns | (untested) | MNIST only |
| Depth+schedule helps | ~ ns | ✓ | ✗ hurts | (untested) | KMNIST only |
The patch-matcher primitive is universally functional. Every architectural detail is task-conditional. KMNIST is the most frequent outlier — it inverts locality, inverts rectangular preference, and is the only task where multilayer clearly helps with proper schedule. Cursive script structure appears to occupy a genuinely different region of the data-property space than printed-character or texture data.
What this means for typed-species NEAT integration
The case for evolving architecture per-task is now stronger. Group B started with the hypothesis “patch matchers as a typed species” and generated a battery of MNIST-derived defaults (5×5 patches, single scale, no hidden layer, spatial locality). The harder-task experiments revealed that most of those defaults are MNIST-specific. A genome that locks them in would generalize poorly.
The genome should evolve:
- Patch geometry (size, aspect ratio) — task-specific
- Patch placement strategy (spatial vs distributed indices) — task-specific
- Network depth and training schedule — task-conditional and entangled
Patch count is the exception: it is the only parameter where “more is monotonically better” reliably holds.
This is more nuanced than “lift patches into the genome” but ultimately better-grounded. Future integration work should design mutations for each of these axes.