Group F: Online vs Offline SGD

The fifth research stream, opened 2026-05-18 after Group E’s CL battery closed cleanly. Group F tests a foundational positioning claim that Groups A–E held fixed: does Synth’s distinguishing feature — online per-example SGD + evolutionary topology change — actually contribute anything measurable over standard mini-batch SGD on the same architecture?

The hypothesis under test

Synth has been positioned as an “online learning” system from the start: per-example forward + backward + immediate weight update, no separate train/test split, no batching. Implicit claim: this matters. Group F tests whether it does, on a fixed architecture (F1+F2) and under evolution (F3).

Headline result

The “online per-example SGD” framing is empirically unsupported. Once the learning rate is scaled linearly with batch size, batched SGD matches online SGD within seed noise up to B=64, both on a fixed architecture (F2) and under full evolution (F3). The online advantage seen with a naive, unscaled LR (F1) is a learning-rate confound. The project’s distinguishing mechanism is the NEAT-style topology evolution; the per-example update style is incidental.

F1: naive comparison — expected confound

Equal-per-step LR ablation. With the LR held constant across batch sizes, online SGD makes B times as many updates per epoch as batch size B, so (with batch-averaged gradients, as the F2 scaling rule implies) the total weight change per epoch shrinks roughly as 1/B. Naturally online wins:

| Batch size | Final test accuracy | Examples to reach 95% |
|---|---|---|
| 1 | 95.89% | 250K |
| 16 | 92.06% | never (in 500K) |
| 64 | 87.78% | never |
| 256 | 80.04% | never |

This comparison is not informative about the research question: it confounds update frequency with effective learning rate.
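The confound can be made concrete with a little arithmetic. The sketch below is illustrative only: the function names and the gradient-norm bound `g_max` are assumptions introduced here, not part of Synth, and it assumes gradients are averaged (not summed) over each batch.

```rust
/// Number of weight updates in one pass over `n_examples` at batch size `b`.
/// A final partial batch still counts as one update.
fn updates_per_epoch(n_examples: usize, b: usize) -> usize {
    assert!(b > 0, "batch size must be positive");
    (n_examples + b - 1) / b
}

/// Upper bound on how far the weights can move in one epoch when each update
/// is `lr * mean_gradient` and the mean gradient has norm at most `g_max`.
/// Assumption: batch-averaged gradients and the same `lr` at every batch size.
fn step_budget_per_epoch(n_examples: usize, b: usize, lr: f64, g_max: f64) -> f64 {
    updates_per_epoch(n_examples, b) as f64 * lr * g_max
}
```

For a hypothetical 60,000 examples at lr = 0.01, this gives 60,000 updates at B=1 against 3,750 at B=16, so the fixed-LR step budget at B=16 is 16 times smaller. That is the gap F1 measures, rather than anything specific to online updates.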

F2: linear LR scaling — online’s advantage disappears

Setting lr_batch = lr_online × B keeps the weight change per example seen roughly constant across batch sizes:

| Batch size | LR | Final test accuracy | Std |
|---|---|---|---|
| 1 | 0.01 | 95.84% | 0.23% |
| 16 | 0.16 | 95.96% | 0.29% |
| 64 | 0.64 | 95.73% | 0.19% |
| 256 | 2.56 | 94.81% | 0.38% |

B=16 actually edges B=1 by 0.12pp (within noise). B=64 trails by 0.11pp (within noise). Even B=256 lr=2.56 — where linear scaling typically breaks — reaches 94.8%, only 1pp behind online. Convergence curves overlap throughout the trajectory:

| Examples seen | B=1 | B=16 | B=64 | B=256 |
|---|---|---|---|---|
| 50K | 0.926 | 0.925 | 0.923 | 0.896 |
| 100K | 0.936 | 0.940 | 0.938 | 0.922 |
| 500K | 0.958 | 0.960 | 0.957 | 0.948 |

Online ≈ batched up to B=64 on fixed architecture.
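Why linear scaling lines the two modes up can be seen on a toy problem. The following is a minimal sketch, not Synth's training code: the quadratic per-example loss and all function names are assumptions chosen for illustration. One batched step with lr × B on the averaged gradient equals the sum of the per-example steps taken at frozen weights. Online SGD differs only because the weights move between examples, and for a small LR that difference is second-order.

```rust
/// Linear LR scaling rule used in F2: lr_batch = lr_online * B.
fn scaled_lr(lr_online: f64, batch_size: usize) -> f64 {
    lr_online * batch_size as f64
}

/// Gradient of the toy per-example loss 0.5 * (w - x)^2 with respect to w.
fn grad(w: f64, x: f64) -> f64 {
    w - x
}

/// One batched step: average the gradients at the current weight, then apply
/// a single update with the linearly scaled learning rate.
fn batched_step(w: f64, batch: &[f64], lr_online: f64) -> f64 {
    assert!(!batch.is_empty(), "batch must not be empty");
    let mean_grad = batch.iter().map(|&x| grad(w, x)).sum::<f64>() / batch.len() as f64;
    w - scaled_lr(lr_online, batch.len()) * mean_grad
}

/// The same examples applied online: the weight moves after every example,
/// so later gradients are evaluated at already-updated weights.
fn online_steps(w: f64, batch: &[f64], lr_online: f64) -> f64 {
    batch.iter().fold(w, |w, &x| w - lr_online * grad(w, x))
}
```

Starting from w = 0 with the batch [1, 2, 3, 4] and lr_online = 0.01, the batched step lands at 0.1 and the four online steps land at about 0.099, a difference of roughly 0.001. The gap grows with lr × B, which is consistent with B=256 at lr=2.56 being the one setting in F2 that falls behind.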

F3: under evolution — same equivalence holds

Single MNIST niche, pop=50, 600K examples, warm-patch insertion enabled. Online (lr=0.01) vs batched B=64 (lr=0.64), 60 evolution cycles each, 2 seeds:

| Mode | Mean test accuracy | Patches in best individual | Best fitness |
|---|---|---|---|
| Online | 96.37% | 130.0 | 0.9755 |
| Batched B=64 | 96.61% | 135.0 | 0.9707 |

Batched edges online by 0.24pp on test accuracy and by 5 patches in the best individual, while online has the slightly higher best fitness (0.9755 vs 0.9707). With only 2 seeds and a within-condition spread of roughly 0.1-0.2pp on test accuracy, these gaps are best read as seed noise. Convergence trajectories track each other within 0.01-0.015 throughout the run.

The pre-experiment prediction was that batching would suppress warm-mutant survival because fresh patches’ gradients get averaged with mature host gradients. That is not what happened: batching grew slightly more patches than online. Mechanism speculation: with a single lr=0.64 update per 64 examples, the weights stay fixed while each batch is processed, so freshly inserted warm patches see a more representative gradient estimate before any weights move. This could equally be 2-seed noise; F4 with more seeds would settle it.

What this means for the project’s positioning

The “online per-example SGD” framing is a description error and an unnecessary self-imposed constraint. The system’s actual mechanism is:

NEAT-style topology evolution under SGD-trained weights, where the weight training is per-example online updates (an arbitrary implementation choice) rather than a fundamental property of the algorithm.

Switching to batched SGD with linear LR scaling is expected to preserve the prior Group A–E findings. So far this is verified within seed noise only on the F3 architecture; F4 would extend the verification across more configurations. The practical upside: batched SGD is the dominant paradigm in modern ML, parallelizes better, and has well-developed tooling (Adam, LR scheduling, mixed precision) that Synth has not exploited.

This is a positive practical finding. The research narrative becomes more honest and more aligned with the broader field without sacrificing capability. The “online” framing was never load-bearing for the system’s actual contributions — those live in the patch-matcher genome representation, the warm-start mechanism, and the ecological speciation framework.

Still open

Compute and methodology

F1+F2+F3 together: ~12 minutes wall time on a 16-thread i9-9900K. Total experiment compute is dwarfed by the implementation work — the load-bearing engineering was the gradient accumulator in network/phenotype.rs and network/backward.rs, plus the train_batch_accumulated paths in population/population.rs and population/niche.rs. Two unit tests lock in B=1 online ≡ batched equivalence at the network level (32 lib tests now pass, up from 30).
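For readers unfamiliar with the pattern, a gradient accumulator and the B=1 equivalence check can be sketched as follows. This is a hypothetical, simplified version over flat weight vectors: `GradAccumulator`, `online_update`, and their signatures are invented for illustration and do not reflect the actual code in network/phenotype.rs or network/backward.rs.

```rust
/// Hypothetical gradient accumulator: sums per-example gradients and applies
/// them as one averaged update. Names are illustrative, not Synth's API.
struct GradAccumulator {
    sum: Vec<f64>,
    count: usize,
}

impl GradAccumulator {
    fn new(n_weights: usize) -> Self {
        Self { sum: vec![0.0; n_weights], count: 0 }
    }

    /// Add one example's gradient without touching the weights.
    fn accumulate(&mut self, grad: &[f64]) {
        assert_eq!(grad.len(), self.sum.len(), "gradient length mismatch");
        for (s, g) in self.sum.iter_mut().zip(grad) {
            *s += g;
        }
        self.count += 1;
    }

    /// Apply the averaged gradient with learning rate `lr`, then reset.
    /// Does nothing if no gradients have been accumulated.
    fn apply(&mut self, weights: &mut [f64], lr: f64) {
        assert_eq!(weights.len(), self.sum.len(), "weight length mismatch");
        if self.count == 0 {
            return;
        }
        let scale = lr / self.count as f64;
        for (w, s) in weights.iter_mut().zip(self.sum.iter_mut()) {
            *w -= scale * *s;
            *s = 0.0;
        }
        self.count = 0;
    }
}

/// Plain online update, used as the reference for the B=1 equivalence check.
fn online_update(weights: &mut [f64], grad: &[f64], lr: f64) {
    for (w, g) in weights.iter_mut().zip(grad) {
        *w -= lr * g;
    }
}
```

With a single accumulated example the averaged update reduces to `lr * grad`, so accumulating once and applying should give the same weights as `online_update`. That is the kind of property a B=1 equivalence test can pin down.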