Group F: Online vs Offline SGD
The sixth research stream, opened 2026-05-18 after Group E’s CL battery closed cleanly. Group F tests a foundational positioning claim that Groups A–E held fixed: does the online per-example SGD half of Synth’s distinguishing feature (online per-example SGD + evolutionary topology change) actually contribute anything measurable over standard mini-batch SGD on the same architecture?
The hypothesis under test
Synth has been positioned as an “online learning” system from the start: per-example forward + backward + immediate weight update, no separate train/test split, no batching. Implicit claim: this matters. Group F tests whether it does, on a fixed architecture (F1+F2) and under evolution (F3).
The pages
- Group F Journal — chronological narrative.
- Group F Experiments — structured records (F1, F2, F3).
Headline result
The “online per-example SGD” framing is empirically unsupported. Across three experiments (fixed architecture with naive LR, fixed architecture with linear LR scaling, full evolution with linear LR scaling), batched SGD with linear LR scaling matches online SGD within seed noise for batch sizes up to 64; only B=256 trails, by about 1pp on the fixed architecture. The project’s distinguishing mechanism is the NEAT-style topology evolution; the per-example update style is incidental.
F1: naive comparison — expected confound
Equal-per-step LR ablation. With LR held constant across batch sizes, online does B× more updates per epoch than a run at batch size B, so a batched run’s total weight-change pressure scales as 1/B relative to online. Naturally online wins:
| batch | final test | examples to 95% |
|---|---|---|
| 1 | 95.89% | 250K |
| 16 | 92.06% | never (in 500K) |
| 64 | 87.78% | never |
| 256 | 80.04% | never |
This is not informative about the research question: it confounds update frequency with effective learning rate.
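The size of the confound can be made concrete with a little arithmetic. The sketch below is illustrative only: it assumes the batched gradient is mean-reduced, and `updates` and `step_budget` are hypothetical helper names, not Synth functions.

```rust
/// Number of weight updates performed after `examples` training examples
/// at batch size `batch` (one update per full batch; a trailing partial
/// batch is ignored).
pub fn updates(examples: u64, batch: u64) -> u64 {
    examples / batch
}

/// Cumulative step budget: learning rate times number of updates.
/// Assumption: the batch gradient is mean-reduced, so each update has
/// roughly per-example magnitude and this product is a crude proxy for
/// how far the weights can travel. Gradient noise is ignored.
pub fn step_budget(lr: f64, examples: u64, batch: u64) -> f64 {
    lr * updates(examples, batch) as f64
}
```

Taking lr=0.01 as an assumed value (borrowed from the F2 online baseline, since F1’s LR is not stated here) over the 500K-example budget, `step_budget(0.01, 500_000, 1)` is about 5000 while `step_budget(0.01, 500_000, 256)` is about 19.53. The B=256 run gets roughly 1/256 of the cumulative step size, which is the 1/B effect described above.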
F2: linear LR scaling — online’s advantage disappears
Setting lr_batch = lr_online × B matches the per-example weight-change magnitude across batch sizes:
| batch | lr | final test | std |
|---|---|---|---|
| 1 | 0.01 | 95.84% | 0.23% |
| 16 | 0.16 | 95.96% | 0.29% |
| 64 | 0.64 | 95.73% | 0.19% |
| 256 | 2.56 | 94.81% | 0.38% |
B=16 actually edges B=1 by 0.12pp (within noise). B=64 trails by 0.11pp (within noise). Even B=256 at lr=2.56, where linear scaling typically breaks, reaches 94.8%, only 1pp behind online. Convergence curves for B ≤ 64 overlap throughout the trajectory, while B=256 starts about 3pp behind at 50K and narrows the gap to about 1pp by 500K:
| ex_seen | B=1 | B=16 | B=64 | B=256 |
|---|---|---|---|---|
| 50K | 0.926 | 0.925 | 0.923 | 0.896 |
| 100K | 0.936 | 0.940 | 0.938 | 0.922 |
| 500K | 0.958 | 0.960 | 0.957 | 0.948 |
Online ≈ batched up to B=64 on fixed architecture.
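Why linear scaling should produce this equivalence can be seen on a toy problem. The following is a minimal sketch, assuming a one-parameter linear model with squared-error loss as a stand-in for the real network; `scaled_lr`, `grad`, `batched_step`, and `online_pass` are hypothetical names, not Synth’s API.

```rust
/// Linear LR scaling rule used in F2: lr_batch = lr_online * B.
pub fn scaled_lr(lr_online: f64, batch: usize) -> f64 {
    lr_online * batch as f64
}

/// Gradient of 0.5 * (w*x - y)^2 with respect to w for a one-parameter
/// linear model (illustrative stand-in, not Synth's network).
pub fn grad(w: f64, x: f64, y: f64) -> f64 {
    (w * x - y) * x
}

/// One batched step over a non-empty batch: average the per-example
/// gradients at the current weight, then step with the scaled LR.
pub fn batched_step(w: f64, data: &[(f64, f64)], lr_online: f64) -> f64 {
    let mean_grad =
        data.iter().map(|&(x, y)| grad(w, x, y)).sum::<f64>() / data.len() as f64;
    w - scaled_lr(lr_online, data.len()) * mean_grad
}

/// Online pass over the same examples: one update per example, so each
/// later gradient is evaluated at a weight already moved by earlier ones.
pub fn online_pass(mut w: f64, data: &[(f64, f64)], lr_online: f64) -> f64 {
    for &(x, y) in data {
        w -= lr_online * grad(w, x, y);
    }
    w
}
```

With a batch of one the two functions perform the same update. For larger batches they differ only because the online pass re-evaluates each gradient at a slightly moved weight, a second-order effect that stays small while lr_online × B is small. That is consistent with the table above, where the gap only opens up at B=256.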
F3: under evolution — same equivalence holds
Single MNIST niche, pop=50, 600K examples, warm-patch insertion enabled. Online (lr=0.01) vs batched B=64 (lr=0.64), 60 evolution cycles each, 2 seeds:
| mode | mean test acc | best patches | best fitness |
|---|---|---|---|
| Online | 96.37% | 130.0 | 0.9755 |
| Batched B=64 | 96.61% | 135.0 | 0.9707 |
Batched edges online by 0.24pp on test accuracy and by 5 patches on best-individual patch count, while online leads on best fitness by 0.0048. With only 2 seeds per condition, all of these differences are plausibly seed noise (within-condition variation is ~0.1-0.2pp on test accuracy). Convergence trajectories track each other within 0.01-0.015 throughout the run.
The pre-experiment prediction was that batching would suppress warm-mutant survival because fresh patches’ gradients get averaged with mature host gradients. That is not what happened: batching actually grew slightly more patches than online. Mechanism speculation: at lr=0.64 with a single update per 64 examples, the network gets settling time between weight changes, so freshly inserted warm patches see a more representative gradient estimate before any weights move. But this could be 2-seed noise; a rerun with more seeds would settle it.
What this means for the project’s positioning
The “online per-example SGD” framing is a descriptive error and an unnecessary self-imposed constraint. The system’s actual mechanism is:
NEAT-style topology evolution under SGD-trained weights, where the weight training is per-example online updates (an arbitrary implementation choice) rather than a fundamental property of the algorithm.
Switching to batched SGD with linear LR scaling is expected to preserve the prior Group A–E findings: the equivalence is verified within seed noise on the F3 architecture, and F4 would extend the verification across more configurations. The practical upside: batched SGD is the standard modern ML paradigm, parallelizes better, and has well-developed tooling (Adam, LR scheduling, mixed precision) that Synth has not exploited.
This is a positive practical finding. The research narrative becomes more honest and more aligned with the broader field without sacrificing capability. The “online” framing was never load-bearing for the system’s actual contributions — those live in the patch-matcher genome representation, the warm-start mechanism, and the ecological speciation framework.
Still open
- F4: replicate a representative Group B/C/E experiment with batched SGD. Confirm that the F2+F3 equivalence extends to a previously-published result (e.g. Group B Experiment 21 [128, 64] → 99.73%, or Group C C5d 4-way joint task at 87%).
- F5: Adam / RMSProp / momentum variants. Now that batched SGD (up to B=64) is established as equivalent to online, the standard modern optimizers should also be tested. Adam in particular may interact differently with structural mutations than vanilla SGD.
Compute and methodology
F1+F2+F3 together: ~12 minutes wall time on a 16-thread i9-9900K. Total experiment compute is dwarfed by the implementation work — the load-bearing engineering was the gradient accumulator in network/phenotype.rs and network/backward.rs, plus the train_batch_accumulated paths in population/population.rs and population/niche.rs. Two unit tests lock in B=1 online ≡ batched equivalence at the network level (32 lib tests now pass, up from 30).
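The accumulator idea and the B=1 equivalence check can be sketched in a few lines. This is a hypothetical, simplified illustration over a flat weight vector, assuming mean reduction over the batch; `GradAccumulator`, `accumulate`, `apply`, and `online_step` are invented names and do not reflect the actual code in network/phenotype.rs or network/backward.rs.

```rust
/// Hypothetical accumulator: sums per-example gradients until a batch
/// is complete, then applies their mean in a single update.
pub struct GradAccumulator {
    sum: Vec<f64>,
    count: usize,
}

impl GradAccumulator {
    pub fn new(n_weights: usize) -> Self {
        Self { sum: vec![0.0; n_weights], count: 0 }
    }

    /// Add one example's gradient to the running sum.
    pub fn accumulate(&mut self, grad: &[f64]) {
        assert_eq!(grad.len(), self.sum.len());
        for (s, g) in self.sum.iter_mut().zip(grad) {
            *s += *g;
        }
        self.count += 1;
    }

    /// Apply the mean accumulated gradient with learning rate `lr`,
    /// then reset the accumulator. Does nothing if no gradients were added.
    pub fn apply(&mut self, weights: &mut [f64], lr: f64) {
        if self.count == 0 {
            return;
        }
        let n = self.count as f64;
        for (w, s) in weights.iter_mut().zip(self.sum.iter_mut()) {
            *w -= lr * (*s / n);
            *s = 0.0;
        }
        self.count = 0;
    }
}

/// Plain online update: one example's gradient, applied immediately.
pub fn online_step(weights: &mut [f64], grad: &[f64], lr: f64) {
    for (w, g) in weights.iter_mut().zip(grad) {
        *w -= lr * *g;
    }
}
```

In this sketch, accumulating a single gradient and calling `apply` yields the same weights as `online_step`, which is the kind of invariant a B=1 online ≡ batched test can pin down. Under mean reduction the caller passes lr_online × B as `lr` to follow the linear scaling rule.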