Group F — Journal
2026-05-18 — opening
Group F follows directly from the E7-E9 CL battery. The CL question was answered cleanly via replay; the residual research question that emerged was foundational rather than incremental: does the system’s “online learning” framing actually contribute anything measurable? Across Groups A-E we always trained via online per-example SGD, but never tested the alternative. If standard mini-batch SGD on the same architecture matches online, then the “online” framing is unnecessary and the system’s narrative weakens to “NEAT topology + ordinary backprop.”
Today I added gradient accumulation to the network’s backward pass — a backward_accumulate that fills per-connection / per-patch-entry / per-patch-bias accumulators without modifying weights, paired with apply_accumulated(lr, batch_size) that does the single deferred update. Two unit tests lock in the equivalence:
- online_offline_equivalent_at_bs1: at batch_size=1, the batched path produces identical weights to the online path. Verified across 5 examples on a small 8-node MLP, agreement to 1e-6.
- batched_average_matches_single_step_when_examples_identical: 4 copies of the same example accumulated then applied at lr/4 equals one single online step. Verified to 1e-5.
Both tests pass. Lib test count is now 32 (was 30). Clippy and fmt clean.
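For reference, a minimal sketch of the accumulate-then-apply pattern on a toy model. This is not the project's network code: the Linear struct, its fields, and the squared-error loss are assumptions made for illustration; only the method names backward_accumulate, apply_accumulated and train_step mirror the ones mentioned above.

```rust
/// Toy stand-in for the network: a single linear unit y = w·x + b trained
/// with squared error. Struct and field names are hypothetical.
#[derive(Clone, Debug)]
pub struct Linear {
    pub w: Vec<f64>,
    pub b: f64,
    acc_w: Vec<f64>, // accumulated dL/dw, one slot per weight
    acc_b: f64,      // accumulated dL/db
}

impl Linear {
    pub fn new(w: Vec<f64>, b: f64) -> Self {
        let n = w.len();
        Linear { w, b, acc_w: vec![0.0; n], acc_b: 0.0 }
    }

    fn predict(&self, x: &[f64]) -> f64 {
        self.w.iter().zip(x).map(|(w, x)| w * x).sum::<f64>() + self.b
    }

    /// Online path: compute the gradient and update the weights immediately.
    pub fn train_step(&mut self, x: &[f64], y: f64, lr: f64) {
        let err = self.predict(x) - y; // derivative of 0.5*err^2 w.r.t. the prediction
        for (w, xi) in self.w.iter_mut().zip(x) {
            *w -= lr * err * xi;
        }
        self.b -= lr * err;
    }

    /// Batched path, part 1: add this example's gradient to the accumulators
    /// without modifying any weight.
    pub fn backward_accumulate(&mut self, x: &[f64], y: f64) {
        let err = self.predict(x) - y;
        for (a, xi) in self.acc_w.iter_mut().zip(x) {
            *a += err * xi;
        }
        self.acc_b += err;
    }

    /// Batched path, part 2: one deferred update, weight -= (lr / B) * sum(grads),
    /// followed by a reset of the accumulators.
    pub fn apply_accumulated(&mut self, lr: f64, batch_size: usize) {
        let scale = lr / batch_size as f64;
        for (w, a) in self.w.iter_mut().zip(self.acc_w.iter_mut()) {
            *w -= scale * *a;
            *a = 0.0;
        }
        self.b -= scale * self.acc_b;
        self.acc_b = 0.0;
    }
}
```

With batch_size=1 the two paths differ only in floating-point operation order, which is why the equivalence tests compare to a tolerance rather than bit-for-bit.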
2026-05-18 — F1 results: equal-per-step LR — online dominates per-example
First experiment under naive LR setup: LR_PER_EXAMPLE = 0.01, applied via apply_accumulated(0.01, B) which performs weight -= (0.01/B) * sum(grads) = 0.01 * mean(grads) per update. This holds per-update weight-change magnitude constant across batch sizes.
3 seeds × 4 batch sizes × 500K examples on a fixed 784→128→10 sparse MLP.
| batch | final test | std | gap (train-test) | examples to 95% |
|---|---|---|---|---|
| 1 | 95.89% | 0.17% | +1.59pp | 250K |
| 16 | 92.06% | 0.33% | −0.03pp | never |
| 64 | 87.78% | 0.15% | −0.51pp | never |
| 256 | 80.04% | 0.57% | −1.92pp | never |
At 50K examples, B=1 is already at 92.6% test accuracy while B=256 is at 41.1%. At 500K, B=1 has converged to 95.9% with slight overfitting (test < train by 1.6pp); B=256 is still climbing slowly and has only reached about 80%.
Why this is the expected (textbook) outcome
Under equal per-step LR, online does 16× more weight updates per epoch than B=16 (and 256× more than B=256). The “effective learning pressure” per epoch is proportional to (number of updates × per-update magnitude). With per-update magnitude held constant, total pressure scales as 1/B. So online dominates convergence per example.
This isn’t really an answer to the research question. It just shows that “with constant per-step LR, update frequency matters”, which is uncontroversial. The fair comparison requires scaling LR with batch size — the classic “linear scaling rule” — so that B sequential online updates and one B-batch update produce roughly the same total weight change.
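The "updates × per-update magnitude" argument can be written out as a small arithmetic sketch. The function names are hypothetical, and treating the mean gradient as unit magnitude is a deliberate simplification, so this is a proxy for the reasoning rather than a model of the actual training dynamics.

```rust
/// Number of weight updates performed when `examples` examples are consumed
/// in batches of `batch` (a trailing partial batch is ignored).
pub fn num_updates(examples: u64, batch: u64) -> u64 {
    examples / batch
}

/// Crude "learning pressure" proxy: updates × per-update step size,
/// assuming the mean gradient has unit magnitude (an idealization).
pub fn pressure(examples: u64, batch: u64, lr_per_update: f64) -> f64 {
    num_updates(examples, batch) as f64 * lr_per_update
}

/// Linear scaling rule: lr_batch = lr_online × B.
pub fn linear_scaled_lr(lr_online: f64, batch: u64) -> f64 {
    lr_online * batch as f64
}
```

Under this proxy, pressure(500_000, 16, 0.01) is 1/16 of pressure(500_000, 1, 0.01), and substituting linear_scaled_lr(0.01, 16) for the step size restores the B=1 value.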
F2 setup
F2 will run linear-LR scaling: lr_batch = lr_online × B so apply_accumulated effectively applies lr_online × sum(grads) instead of lr_online × mean(grads). Predicted outcomes:
- B=16, lr=0.16: should approximately match B=1 lr=0.01. Linear scaling works well for small B.
- B=64, lr=0.64: probably starts to drift. Large per-step weight changes may overshoot good minima.
- B=256, lr=2.56: very likely to diverge or destabilize. Beyond the typical “linear scaling rule breaks” regime (~B=512 in dense-net literature; smaller here).
If F2 shows B=16 lr=0.16 matches B=1 lr=0.01 at final accuracy, the conclusion is: online updates aren’t load-bearing for the weight learning itself — what matters is total weight-change magnitude per epoch, not update granularity. If B=16 still trails B=1 even with linear scaling, the conclusion is: online has an intrinsic advantage from using progressively-updated gradient estimates (each step’s gradient is from a freshly-updated weight state).
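To make the two hypotheses concrete, here is a hedged sketch comparing B sequential online steps against one linearly scaled batch step on a one-parameter least-squares model. The model and the function names are assumptions for illustration only. The two results agree to first order in the learning rate; the residual comes entirely from the online path evaluating each gradient at a freshly updated weight, which is the "intrinsic advantage" candidate described above.

```rust
/// Gradient of 0.5 * (w*x - y)^2 with respect to w.
fn grad(w: f64, x: f64, y: f64) -> f64 {
    (w * x - y) * x
}

/// B sequential online steps at `lr_online`; each gradient sees the weight
/// left behind by the previous step.
pub fn online_steps(mut w: f64, batch: &[(f64, f64)], lr_online: f64) -> f64 {
    for &(x, y) in batch {
        w -= lr_online * grad(w, x, y);
    }
    w
}

/// One batched step at lr_batch = lr_online × B applied to the mean gradient,
/// i.e. lr_online × sum(grads), with every gradient evaluated at the starting w.
pub fn batched_step(w: f64, batch: &[(f64, f64)], lr_online: f64) -> f64 {
    let b = batch.len() as f64;
    let mean_grad = batch.iter().map(|&(x, y)| grad(w, x, y)).sum::<f64>() / b;
    w - (lr_online * b) * mean_grad
}
```

On a small batch, halving lr_online shrinks the gap between the two results by roughly a factor of four, consistent with a second-order discrepancy. Whether that discrepancy matters in practice is what F2 measures.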
Side observations from F1
- Train-test gap inversion across batch sizes: B=1 shows +1.6pp overfitting (train > test); B≥16 all show negative gap (train < test). At B=1, the network has effectively done 500K updates with high LR noise, and the noise is acting as a regularizer that hurts training accuracy more than test. At larger B, the gradient noise is averaged out per update so the network is “trying harder” on the train set, but it hasn’t trained enough overall to overfit. This is a known effect (small-batch SGD has implicit regularization) and a useful sanity check that the gradient accumulation is working correctly.
- Seed variance is small and consistent: std of 0.15-0.57% across batch sizes. The experiment is well-behaved: the smallest gap between adjacent conditions (3.8pp, B=1 vs B=16) is more than six times the largest within-condition std.
Run wall-time was ~6 minutes (i9-9900K, no rayon parallelism since population size = 1).
2026-05-18 — F2 results: linear-scaled batch SGD matches online up to B=64
Re-ran F1’s conditions with linear LR scaling: lr_batch = lr_online × B. 4 (batch, lr) cells × 3 seeds.
| batch | lr | final test | std | gap |
|---|---|---|---|---|
| 1 | 0.01 | 95.84% | 0.23% | +1.63pp |
| 16 | 0.16 | 95.96% | 0.29% | +1.40pp |
| 64 | 0.64 | 95.73% | 0.19% | +1.26pp |
| 256 | 2.56 | 94.81% | 0.38% | +0.49pp |
The load-bearing result: with proper linear LR scaling, online and batched SGD converge to essentially the same place. B=16 with lr=0.16 actually edges B=1 by a hair (within noise). B=64 lr=0.64 trails by 0.11pp — within noise. Even B=256 lr=2.56 (where linear scaling typically starts to break) reaches 94.81% — only 1pp behind online and well above any divergence threshold.
Convergence dynamics are also tight:
| ex_seen | B=1 lr=0.01 | B=16 lr=0.16 | B=64 lr=0.64 | B=256 lr=2.56 |
|---|---|---|---|---|
| 50K | 0.926 | 0.925 | 0.923 | 0.896 |
| 100K | 0.936 | 0.940 | 0.938 | 0.922 |
| 500K | 0.958 | 0.960 | 0.957 | 0.948 |
By 100K examples, B=1, B=16, B=64 are within 0.4pp of each other. By 500K they’re within 0.3pp. Online’s convergence advantage is purely a per-step-LR confound; with proper scaling it vanishes.
What this means for the project’s positioning
Strong negative result for the “online learning is foundational” claim. On a fixed architecture, online per-example SGD is statistically indistinguishable from standard mini-batch SGD with linear LR scaling, up to at least batch size 64 (and not far behind at batch size 256). The Synth-distinctive part of “online per-example SGD + evolutionary topology change” is the topology change, not the per-example SGD.
This doesn’t kill the broader research program — NEAT-style structural evolution under SGD-trained weights is still a coherent and valuable framing. But the “online learning” lens needs to either go or be rescoped. Three possible rescopings:
- “Online” means “no separate train/test phase” — the system processes one example at a time and immediately uses it. This is true whether updates are per-example or per-batch as long as no held-out data is being used during training. This framing is honest but doesn’t differentiate from “standard SGD on a streaming dataset.”
- “Online” matters during evolutionary topology change — fixed-architecture comparison (F1+F2) is the wrong test; the real claim should be tested under evolution. F3 will check this.
- Drop the “online learning” framing entirely — describe the system as “NEAT-style topology evolution with concurrent SGD-trained weights.” More accurate but less distinctive-sounding.
My current read: F2’s result is too clean to be a fluke. Online ≈ batch on fixed architectures. The question is whether evolution changes that — and that’s what F3 should test.
F3 setup (next)
Open question: does the online-vs-batch equivalence survive the introduction of evolutionary topology change? The mechanism could plausibly fail at the interface: when a structural mutation introduces a new patch or connection, online SGD adjusts the new weights after every example, whereas batched SGD averages noisy gradients across examples that may have very different “fresh-mutant”-vs-host gradient signals. Online’s faster feedback may matter more here than on a fixed architecture.
F3 will run the standard niche/evolution loop (Group B’s [128] MLP setup with patches and warm-start enabled) with two conditions:
- Online (current) — train_step path
- Batched B=64 with lr=0.64 — using backward_accumulate + apply_accumulated within train_batch
Same population size, same training budget, same evolution interval. Compare best/avg fitness trajectory, patch count, final test accuracy. If they’re indistinguishable, the online claim is dead. If batched lags by >1-2pp or shows different evolutionary dynamics, online’s advantage is in the structural-mutation regime, not the fixed-architecture regime.
2026-05-18 — F3: online ≈ batched under evolution. Online claim is fully dead.
Set up F3 to test whether the F2 fixed-architecture equivalence (online ≈ batched with linear LR scaling) survives evolutionary topology change. The mechanistic case for “evolution needs online” was: warm-patch and other structural mutations introduce fresh weights with no training history; online updates them on every example; batched averages their gradients with mature host weights, potentially suppressing the mutant signal during the survival-critical few-steps-after-insertion window.
Setup
Single MNIST niche, pop=50, 600K examples per condition. Seeded 128-patch initial topology with warm-patch insertion enabled (warm_patch_insertion=true, add_patch_prob=0.10, burst_count=4). Mutation config matched to Group E E2’s settings. Two conditions:
- Online: existing Niche.train_batch path, lr=0.01, dispatch batch=100 (online updates per example)
- Batched B=64 lr=0.64: new Niche.train_batch_accumulated path using linear LR scaling from F2
2 seeds per condition.
First-run bug + fix
First run had a subtle confound: the evolution trigger was step % EVOLVE_INTERVAL == 0, and batched mode advances step by 64 per inner iteration. step values mod 10000 only landed on zero at multiples of LCM(64, 10000) = 40,000 → batched ran 14 generations vs online’s 59 in the same 600K-example budget. The numbers still came out roughly equivalent (online 96.37%, batched 96.79%) even with batched at a ~4× evolution-count disadvantage, but that’s a confound. Switched to threshold-based step >= next_evolve scheduling and re-ran.
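The scheduling bug is easy to reproduce in isolation. The sketch below counts evolution cycles under both triggers; the function names and loop layout are assumptions, not the actual Niche code, and the exact counts can differ by one from the 59 and 14 reported above depending on whether the boundary steps are counted.

```rust
/// Count evolution cycles when the trigger is `step % interval == 0` and
/// `step` advances by `stride` examples per inner iteration.
pub fn generations_modulo(total: u64, stride: u64, interval: u64) -> u64 {
    let (mut step, mut gens) = (0u64, 0u64);
    while step < total {
        step += stride; // stands in for training on `stride` examples
        if step % interval == 0 {
            gens += 1;
        }
    }
    gens
}

/// Same loop with a threshold trigger: evolve whenever `step` has reached or
/// passed `next_evolve`, then move the threshold forward by one interval.
/// Assumes stride <= interval, so at most one threshold is crossed per iteration.
pub fn generations_threshold(total: u64, stride: u64, interval: u64) -> u64 {
    let (mut step, mut gens, mut next_evolve) = (0u64, 0u64, interval);
    while step < total {
        step += stride;
        if step >= next_evolve {
            gens += 1;
            next_evolve += interval;
        }
    }
    gens
}
```

With a 600K budget and a 10,000-example interval, the modulo trigger fires 60 times at stride 1 but only 15 times at stride 64, while the threshold trigger fires 60 times for both strides.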
Result (fair evolution schedule)
| mode | seed | final_test | best_patches | best_fitness |
|---|---|---|---|---|
| online | 0xf301 | 96.40% | 130 | 0.9733 |
| online | 0xf302 | 96.34% | 130 | 0.9777 |
| batched B=64 | 0xf6e9 | 96.69% | 138 | 0.9649 |
| batched B=64 | 0xf6ea | 96.52% | 132 | 0.9765 |
Per-mode means:
- Online: 96.37% test acc, 130.0 patches, 0.9755 best fitness
- Batched: 96.61% test acc, 135.0 patches, 0.9707 best fitness
Same 60 evolution cycles each. Final test accuracies within 0.24pp — within the 2-seed noise level (between-seed variance ~0.1-0.2%). Batched actually grew more patches than online (135 vs 130), contradicting my pre-experiment prediction that batching would suppress warm-mutant survival.
Why batched might grow MORE patches than online
Possible mechanism (speculative; not directly verified): the lr=0.64 single-update-per-64-examples gives the network “settling time” between weight changes. A freshly-inserted warm patch sees 64 examples’ worth of gradients accumulated before any of its weights change; that’s a more representative gradient estimate than a single online update. Online’s per-example updates can momentarily move the patch’s weights in noisy directions before the host’s trajectory stabilizes them. Batched gives a smoother trajectory through the post-insertion regime.
Alternatively it could just be 2-seed noise — the patch-count gap is 5 across 2 seeds, well within reasonable seed variance. F4 with more seeds would settle this.
What the F1+F2+F3 battery establishes
The “online per-example SGD” framing is not load-bearing for this system:
- F1 (naive comparison): under equal per-step LR, online dominates per-example convergence (95.9% vs 80.0% at B=256). Expected confound.
- F2 (LR-scaled): under linear LR scaling, online and batched converge to equivalent final accuracies (~95.8% for B=1, 16, 64; 94.8% for B=256). Online has no advantage on fixed architecture.
- F3 (under evolution): with the full evolutionary loop including warm-patch insertion, online and batched give equivalent final test accuracy (96.4% vs 96.6%). Online has no advantage even in the structural-mutation regime.
The project’s “online learning” positioning is empirically unsupported. The distinctive mechanism is the NEAT-style topology evolution; the per-example update style is incidental and can be replaced by standard mini-batch SGD with linear LR scaling without measurable loss.
This is a significant finding that affects the project’s framing. The good news is it’s a positive practical finding: batched SGD parallelizes far better than online, and is the universal paradigm in modern ML — Synth can switch to it without sacrificing capability. The “online” framing was hindering rather than helping.
What this means for prior results
A reasonable next test (F4) would be re-running a representative Group B/C/E experiment with batched SGD to confirm prior accuracy numbers hold. If they do, the entire research history is preserved — just with a corrected mechanism description. If anything diverges materially, that would be unexpected and worth investigating.
But the F1+F2+F3 evidence is already substantial. Time to surface this finding to the human and discuss what it means for the project’s research narrative.