Group F — Journal

2026-05-18 — opening

Group F follows directly from the E7-E9 CL battery. The CL question was answered cleanly via replay; the residual research question that emerged was foundational rather than incremental: does the system’s “online learning” framing actually contribute anything measurable? Across Groups A-E we always trained via online per-example SGD, but never tested the alternative. If standard mini-batch SGD on the same architecture matches online, then the “online” framing is unnecessary and the system’s narrative weakens to “NEAT topology + ordinary backprop.”

Today I added gradient accumulation to the network’s backward pass — a backward_accumulate that fills per-connection / per-patch-entry / per-patch-bias accumulators without modifying weights, paired with apply_accumulated(lr, batch_size) that does the single deferred update. Two unit tests lock in the equivalence:

online_offline_equivalent_at_bs1: at batch_size=1, the batched path produces identical weights to the online path. Verified across 5 examples on a small 8-node MLP, agreement to 1e-6.
batched_average_matches_single_step_when_examples_identical: 4 copies of the same example, accumulated and then applied with apply_accumulated(lr, 4) (an effective per-gradient rate of lr/4), equal one single online step at lr. Verified to 1e-5.

Both tests pass. Lib test count is now 32 (was 30). Clippy and fmt clean.
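
For reference, a minimal sketch of the accumulate-then-apply pattern, assuming a toy linear model with squared-error loss rather than the real network. TinyLinear, its fields, and the train_step signature are hypothetical stand-ins; only the names backward_accumulate and apply_accumulated(lr, batch_size) and the (lr / B) * sum(grads) update come from this entry.

```rust
/// Toy stand-in for the real network: y = w . x, loss = 0.5 * (y - target)^2.
/// Hypothetical type; only the two batched method names mirror the journal.
#[derive(Clone, Debug)]
pub struct TinyLinear {
    pub weights: Vec<f64>,
    grad_acc: Vec<f64>,
}

impl TinyLinear {
    pub fn new(weights: Vec<f64>) -> Self {
        let n = weights.len();
        Self { weights, grad_acc: vec![0.0; n] }
    }

    /// Per-weight gradient of the squared-error loss for one example.
    fn gradient(&self, x: &[f64], target: f64) -> Vec<f64> {
        let y: f64 = self.weights.iter().zip(x).map(|(w, xi)| w * xi).sum();
        let err = y - target;
        x.iter().map(|xi| err * xi).collect()
    }

    /// Online path: compute the gradient and update the weights immediately.
    pub fn train_step(&mut self, x: &[f64], target: f64, lr: f64) {
        let g = self.gradient(x, target);
        for (w, gi) in self.weights.iter_mut().zip(&g) {
            *w -= lr * gi;
        }
    }

    /// Batched path, part 1: add this example's gradient to the accumulators.
    /// Weights are left untouched.
    pub fn backward_accumulate(&mut self, x: &[f64], target: f64) {
        let g = self.gradient(x, target);
        for (a, gi) in self.grad_acc.iter_mut().zip(&g) {
            *a += gi;
        }
    }

    /// Batched path, part 2: one deferred update, w -= (lr / B) * sum(grads),
    /// followed by clearing the accumulators.
    pub fn apply_accumulated(&mut self, lr: f64, batch_size: usize) {
        let scale = lr / batch_size as f64;
        for (w, a) in self.weights.iter_mut().zip(self.grad_acc.iter_mut()) {
            *w -= scale * *a;
            *a = 0.0;
        }
    }
}
```

In this layout, one backward_accumulate followed by apply_accumulated(lr, 1) reproduces train_step, which is the property the first unit test checks.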

2026-05-18 — F1 results: equal-per-step LR — online dominates per-example

First experiment under naive LR setup: LR_PER_EXAMPLE = 0.01, applied via apply_accumulated(0.01, B) which performs weight -= (0.01/B) * sum(grads) = 0.01 * mean(grads) per update. This holds per-update weight-change magnitude constant across batch sizes.

3 seeds × 4 batch sizes × 500K examples on a fixed 784→128→10 sparse MLP.

batch	final test	std	gap (train-test)	examples to 95%
1	95.89%	0.17%	+1.59pp	250K
16	92.06%	0.33%	−0.03pp	never
64	87.78%	0.15%	−0.51pp	never
256	80.04%	0.57%	−1.92pp	never

At 50K examples, B=1 is already at 92.6% test accuracy while B=256 is at 41.1%. At 500K, B=1 has converged to 95.9% with slight overfitting (test < train by 1.6pp); B=256 is still climbing slowly and has only just reached 80%.

Why this is the expected (textbook) outcome

Under equal per-step LR, online does 16× more weight updates per epoch than B=16 (and 256× more than B=256). The “effective learning pressure” per epoch is proportional to (number of updates × per-update magnitude). With per-update magnitude held constant, total pressure scales as 1/B. So online dominates convergence per example.

This isn’t really an answer to the research question. It just shows that “with constant per-step LR, update frequency matters”, which is uncontroversial. The fair comparison requires scaling LR with batch size — the classic “linear scaling rule” — so that B sequential online updates and one B-batch update produce roughly the same total weight change.
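
The pressure argument can be written as a small calculation. This is only a first-order proxy (it ignores how gradients change as the weights move), and both function names are hypothetical.

```rust
/// First-order proxy from the text: (number of updates) x (per-update step size).
/// `lr_per_update` is the factor multiplying the mean gradient of each batch.
pub fn learning_pressure(examples: u64, batch_size: u64, lr_per_update: f64) -> f64 {
    let updates = examples as f64 / batch_size as f64;
    updates * lr_per_update
}

/// Linear scaling rule: grow the per-update LR in proportion to the batch size.
pub fn linear_scaled_lr(lr_online: f64, batch_size: u64) -> f64 {
    lr_online * batch_size as f64
}
```

For 500K examples and lr = 0.01, equal per-step LR gives a pressure of 5000 at B=1 and about 19.5 at B=256, while learning_pressure(500_000, b, linear_scaled_lr(0.01, b)) comes back to 5000 for every batch size.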

F2 setup

F2 will run linear-LR scaling: lr_batch = lr_online × B so apply_accumulated effectively applies lr_online × sum(grads) instead of lr_online × mean(grads). Predicted outcomes:

B=16, lr=0.16: should approximately match B=1 lr=0.01. Linear scaling works well for small B.
B=64, lr=0.64: probably starts to drift. Large per-step weight changes may overshoot good minima.
B=256, lr=2.56: very likely to diverge or destabilize. Beyond the typical “linear scaling rule breaks” regime (~B=512 in dense-net literature; smaller here).

If F2 shows B=16 lr=0.16 matches B=1 lr=0.01 at final accuracy, the conclusion is: online updates aren’t load-bearing for the weight learning itself — what matters is total weight-change magnitude per epoch, not update granularity. If B=16 still trails B=1 even with linear scaling, the conclusion is: online has an intrinsic advantage from using progressively-updated gradient estimates (each step’s gradient is from a freshly-updated weight state).
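
A toy illustration of the two update styles being compared, assuming a 1-D quadratic loss 0.5 * (w - x)^2 instead of the real network. The function names are hypothetical. The only difference between the two passes is whether each gradient is evaluated at the freshly updated w or at the shared starting w.

```rust
/// Gradient of 0.5 * (w - x)^2 with respect to w.
fn grad(w: f64, x: f64) -> f64 {
    w - x
}

/// B sequential online steps: each gradient is taken at the freshly updated w.
pub fn online_pass(mut w: f64, xs: &[f64], lr: f64) -> f64 {
    for &x in xs {
        w -= lr * grad(w, x);
    }
    w
}

/// One batched step with linear LR scaling: every gradient is taken at the
/// same starting w, then applied as (lr * B) * mean(grads) = lr * sum(grads).
pub fn batched_pass(w: f64, xs: &[f64], lr: f64) -> f64 {
    let sum: f64 = xs.iter().map(|&x| grad(w, x)).sum();
    w - lr * sum
}
```

Starting from w = 0 with sixteen targets at 1.0 and lr = 0.01, the batched pass lands at 0.16 and the online pass slightly short of that. On this toy problem the gap between them shrinks roughly quadratically as lr decreases, which is why linear scaling is expected to hold for small lr × B and drift for large.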

Side observations from F1

Train-test gap inversion across batch sizes: B=1 shows +1.6pp overfitting (train > test); B≥16 all show a negative gap (train < test). At B=1, the network has done 500K noisy per-example updates and has trained long enough for train accuracy to pull ahead of test. At larger B, the gradient noise is averaged out per update, but the network hasn’t trained enough overall to overfit, so the gap stays at or below zero. Small-batch SGD is known to carry implicit regularization from gradient noise, though the difference in total training progress is plausibly the larger effect here. Either way, the smooth trend in the gap across B is a useful sanity check that the gradient accumulation is working correctly.
Seed-to-seed std is small and consistent: 0.15-0.57% across batch sizes. The experiment is well-behaved: the gap between adjacent conditions (3.8-7.7pp) is more than ten times the within-condition std.

Run wall-time was ~6 minutes (i9-9900K, no rayon parallelism since population size = 1).

2026-05-18 — F2 results: linear-scaled batch SGD matches online up to B=64

Re-ran F1’s conditions with linear LR scaling: lr_batch = lr_online × B. 4 (batch, lr) cells × 3 seeds.

batch	lr	final test	std	gap
1	0.01	95.84%	0.23%	+1.63pp
16	0.16	95.96%	0.29%	+1.40pp
64	0.64	95.73%	0.19%	+1.26pp
256	2.56	94.81%	0.38%	+0.49pp

The load-bearing result: with proper linear LR scaling, online and batched SGD converge to essentially the same place. B=16 with lr=0.16 actually edges B=1 by a hair (within noise). B=64 lr=0.64 trails by 0.11pp — within noise. Even B=256 lr=2.56 (where linear scaling typically starts to break) reaches 94.81% — only 1pp behind online and well above any divergence threshold.

Convergence dynamics are also tight:

examples seen (test acc)	B=1 lr=0.01	B=16 lr=0.16	B=64 lr=0.64	B=256 lr=2.56
50K	92.6%	92.5%	92.3%	89.6%
100K	93.6%	94.0%	93.8%	92.2%
500K	95.8%	96.0%	95.7%	94.8%

By 100K examples, B=1, B=16, B=64 are within 0.4pp of each other. By 500K they’re within 0.3pp. Online’s convergence advantage is purely a per-step-LR confound; with proper scaling it vanishes.

What this means for the project’s positioning

Strong negative result for the “online learning is foundational” claim. On a fixed architecture, online per-example SGD is indistinguishable, within seed noise over 3 seeds, from standard mini-batch SGD with linear LR scaling, up to at least batch size 64 (and only about 1pp behind at batch size 256). The Synth-distinctive part of “online per-example SGD + evolutionary topology change” is the topology change, not the per-example SGD.

This doesn’t kill the broader research program — NEAT-style structural evolution under SGD-trained weights is still a coherent and valuable framing. But the “online learning” lens needs to either go or be rescoped. Three possible rescopings:

“Online” means “no separate train/test phase” — the system processes one example at a time and immediately uses it. This is true whether updates are per-example or per-batch as long as no held-out data is being used during training. This framing is honest but doesn’t differentiate from “standard SGD on a streaming dataset.”
“Online” matters during evolutionary topology change — fixed-architecture comparison (F1+F2) is the wrong test; the real claim should be tested under evolution. F3 will check this.
Drop the “online learning” framing entirely — describe the system as “NEAT-style topology evolution with concurrent SGD-trained weights.” More accurate but less distinctive-sounding.

My current read: F2’s result is too clean to be a fluke. Online ≈ batch on fixed architectures. The question is whether evolution changes that — and that’s what F3 should test.

F3 setup (next)

Open question: does the online-vs-batch equivalence survive the introduction of evolutionary topology change? The mechanism could plausibly fail at the interface: when a structural mutation introduces a new patch or connection, online SGD adjusts the new weights after every example, while batched SGD applies one update per batch, averaging noisy gradients across examples that may have very different “fresh-mutant”-vs-host gradient signals. Online’s faster feedback may matter more here than on a fixed architecture.

F3 will run the standard niche/evolution loop (Group B’s [128] MLP setup with patches and warm-start enabled) with two conditions:

Online (current) — train_step path
Batched B=64 with lr=0.64 — using backward_accumulate + apply_accumulated within train_batch

Same population size, same training budget, same evolution interval. Compare best/avg fitness trajectory, patch count, final test accuracy. If they’re indistinguishable, the online claim is dead. If batched lags by >1-2pp or shows different evolutionary dynamics, online’s advantage is in the structural-mutation regime, not the fixed-architecture regime.

2026-05-18 — F3: online ≈ batched under evolution. Online claim is fully dead.

Set up F3 to test whether the F2 fixed-architecture equivalence (online ≈ batched with linear LR scaling) survives evolutionary topology change. The mechanistic case for “evolution needs online” was: warm-patch and other structural mutations introduce fresh weights with no training history; online updates them on every example; batched averages their gradients with mature host weights, potentially suppressing the mutant signal during the survival-critical few-steps-after-insertion window.

Setup

Single MNIST niche, pop=50, 600K examples per condition. Seeded 128-patch initial topology with warm-patch insertion enabled (warm_patch_insertion=true, add_patch_prob=0.10, burst_count=4). Mutation config matched to Group E E2’s settings. Two conditions:

Online: existing Niche.train_batch path, lr=0.01, dispatch batch=100 (online updates per example)
Batched B=64 lr=0.64: new Niche.train_batch_accumulated path using linear LR scaling from F2

2 seeds per condition.

First-run bug + fix

First run had a subtle confound: the evolution trigger was step % EVOLVE_INTERVAL == 0, and batched mode advances step by 64 per inner iteration. With EVOLVE_INTERVAL = 10000, step % 10000 only landed on zero every LCM(64, 10000) = 40,000 steps → batched ran 14 generations vs online’s 59 in the same 600K-example budget. The numbers still came out roughly equivalent (online 96.37%, batched 96.79%) even with batched at a ~4× evolution-count disadvantage, but that’s a confound. Switched to threshold-based step >= next_evolve scheduling and re-ran.
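
A small sketch of the two trigger styles, to make the stride interaction concrete. The function names and the exact loop shape (where step is checked relative to the increment) are assumptions; the journal only records the trigger expressions step % EVOLVE_INTERVAL == 0 and step >= next_evolve.

```rust
/// Modulo trigger: evolve whenever the step counter is an exact multiple of
/// `interval`. With a stride that does not divide `interval`, most multiples
/// are stepped over. Loop shape is assumed: check first, then advance.
pub fn generations_modulo(budget: u64, stride: u64, interval: u64) -> u64 {
    let mut step = 0;
    let mut generations = 0;
    while step < budget {
        if step > 0 && step % interval == 0 {
            generations += 1;
        }
        step += stride;
    }
    generations
}

/// Threshold trigger: evolve as soon as the step counter reaches or passes the
/// next scheduled point, then schedule the following one. Insensitive to stride
/// as long as stride <= interval.
pub fn generations_threshold(budget: u64, stride: u64, interval: u64) -> u64 {
    let mut step = 0;
    let mut next_evolve = interval;
    let mut generations = 0;
    while step < budget {
        step += stride;
        if step >= next_evolve {
            generations += 1;
            next_evolve += interval;
        }
    }
    generations
}
```

Under this assumed loop shape, a 600K budget with interval 10000 gives 59 generations at stride 1 and 14 at stride 64 for the modulo trigger, and 60 at either stride for the threshold trigger, in line with the counts reported in this entry.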

Result (fair evolution schedule)

mode	seed	final_test	best_patches	best_fitness
online	0xf301	96.40%	130	0.9733
online	0xf302	96.34%	130	0.9777
batched B=64	0xf6e9	96.69%	138	0.9649
batched B=64	0xf6ea	96.52%	132	0.9765

Per-mode means:

Online: 96.37% test acc, 130.0 patches, 0.9755 best fitness
Batched: 96.61% test acc, 135.0 patches, 0.9707 best fitness

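Same 60 evolution cycles each. Final test accuracies are within 0.24pp of each other. That is slightly larger than the between-seed spread within each mode (0.06pp online, 0.17pp batched), but with only 2 seeds per condition it is too small to call a real difference, and it favors batched in any case. Batched actually grew more patches than online (135 vs 130), contradicting my pre-experiment prediction that batching would suppress warm-mutant survival.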
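
The comparison of the between-mode gap against the between-seed spread can be reproduced from the table with two small helpers. The function names are hypothetical, and with only two seeds per mode the range (max minus min) is the only spread statistic worth computing.

```rust
/// Mean of a non-empty slice.
pub fn mean(xs: &[f64]) -> f64 {
    xs.iter().sum::<f64>() / xs.len() as f64
}

/// Max minus min; with two seeds this is the entire between-seed spread.
pub fn range(xs: &[f64]) -> f64 {
    let max = xs.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
    let min = xs.iter().cloned().fold(f64::INFINITY, f64::min);
    max - min
}
```

Feeding in the final_test column ([96.40, 96.34] for online, [96.69, 96.52] for batched) gives the between-mode difference in means alongside each mode's own spread, which is the comparison the noise argument rests on.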

Why batched might grow MORE patches than online

Possible mechanism (speculative; not directly verified): the lr=0.64 single-update-per-64-examples gives the network “settling time” between weight changes. A freshly-inserted warm patch sees 64 examples worth of gradients accumulated before any of its weights change; that’s a more representative gradient estimate than a single online update. Online’s per-example updates can momentarily move the patch’s weights in noisy directions before the host’s trajectory stabilizes them. Batched gives a smoother trajectory through the post-insertion regime.

Alternatively it could just be 2-seed noise: the gap in mean patch count is 5, which is smaller than the 6-patch spread between the two batched seeds (138 vs 132). F4 with more seeds would settle this.

What the F1+F2+F3 battery establishes

The “online per-example SGD” framing is not load-bearing for this system:

F1 (naive comparison): under equal per-step LR, online dominates per-example convergence (95.9% vs 80.0% at B=256). Expected confound.
F2 (LR-scaled): under linear LR scaling, online and batched converge to equivalent final accuracies (~95.8% for B=1, 16, 64; 94.8% for B=256). Online has no advantage on fixed architecture.
F3 (under evolution): with the full evolutionary loop including warm-patch insertion, online and batched give equivalent final test accuracy (96.4% vs 96.6%). Online has no advantage even in the structural-mutation regime.

The project’s “online learning” positioning is empirically unsupported. The distinctive mechanism is the NEAT-style topology evolution; the per-example update style is incidental and can be replaced by standard mini-batch SGD with linear LR scaling without measurable loss.

This finding changes the project’s framing, but in practical terms it is good news: batched SGD parallelizes far better than online and is the dominant paradigm in modern ML, so Synth can switch to it without sacrificing capability. The “online” framing was hindering rather than helping.

What this means for prior results

A reasonable next test (F4) would be re-running a representative Group B/C/E experiment with batched SGD to confirm prior accuracy numbers hold. If they do, the entire research history is preserved — just with a corrected mechanism description. If anything diverges materially, that would be unexpected and worth investigating.

But the F1+F2+F3 evidence is already substantial. Time to surface this finding to the human and discuss what it means for the project’s research narrative.