Group F — Experiments

Structured experiment records. See journal.md for narrative.


F1: online vs batched SGD, equal per-step LR (naive baseline)

Date: 2026-05-18
Binary: cargo run --release --bin group_f_online_offline
Output: notes/group_f/f1_output.txt

Setup

Result

batch  final_test_mean  final_test_std  train_test_gap  ex_to_95%
1      95.89%           0.17%           +1.59pp         250K
16     92.06%           0.33%           −0.03pp         never
64     87.78%           0.15%           −0.51pp         never
256    80.04%           0.57%           −1.92pp         never

Test accuracy curves (mean across 3 seeds) at checkpoints:

ex_seen  B=1    B=16   B=64   B=256
0        0.094  0.101  0.100  0.116
50K      0.926  0.826  0.682  0.411
100K     0.937  0.867  0.773  0.568
250K     0.954  0.902  0.843  0.723
500K     0.959  0.921  0.878  0.800

Analysis

  1. This is the expected textbook outcome under equal per-step LR. Online SGD performs B× as many weight updates per epoch as batched SGD with batch size B. With the per-update magnitude held constant, total weight change per epoch therefore scales as 1/B, so online converges roughly B× faster per example. Not informative about the research question.

  2. The fair comparison requires linear LR scaling. F2 runs lr_batch = lr_online × B so per-step weight change equals the sum (not mean) of gradients × LR — matching the total weight-change magnitude of B sequential online updates.

  3. Side observation: the train-test gap inverts across batch sizes. B=1 shows +1.6pp overfitting; B≥16 show train ≤ test. The reading at the time was small-batch SGD’s implicit regularization versus large-batch SGD’s lower gradient noise, a pattern expected from the literature and a useful sanity check that the gradient accumulation is correctly implemented. F2 (analysis item 4) revises this reading: the inversion is an undertraining artifact.

  4. Seed-to-seed standard deviation is small (0.15-0.57pp), and between-condition differences exceed it by roughly 10-100×. The experiment is well-behaved.
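
The arithmetic behind items 1 and 2 can be sketched in Rust. This is a minimal illustration under stated assumptions, not the project's code: the function names are hypothetical, the model is a single scalar weight, and per-example gradients are frozen at the starting weight. In real online SGD each gradient is recomputed after the previous update, so the match shown here holds only to first order.

```rust
/// Weight after ONE batched step that applies the mean gradient of the batch.
fn batched_step(w: f64, grads: &[f64], lr: f64) -> f64 {
    let mean = grads.iter().sum::<f64>() / grads.len() as f64;
    w - lr * mean
}

/// Weight after one online step per example.
/// Assumption: gradients are frozen at the starting weight instead of
/// being recomputed after every update.
fn online_steps(w: f64, grads: &[f64], lr: f64) -> f64 {
    grads.iter().fold(w, |acc, g| acc - lr * g)
}
```

For a batch of B gradients at lr = 0.01, `batched_step(w, &grads, 0.01)` moves the weight by 1/B of what `online_steps(w, &grads, 0.01)` does, which is the F1 confound. Calling `batched_step(w, &grads, 0.01 * B as f64)` instead matches the online result up to rounding, which is the scaling F2 uses.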

Conclusion

F1 isn’t a meaningful answer to the research question — it confounds update frequency with effective learning rate. It does confirm the gradient accumulation implementation works correctly (B=1 batched ≈ B=1 online via unit tests; B>1 shows the expected textbook scaling). F2 is the substantive experiment.


F2: linear-LR-scaled batched SGD

Date: 2026-05-18
Binary: cargo run --release --bin group_f_lr_scaled
Output: notes/group_f/f2_output.txt

Setup

Same architecture, dataset, budget, seed family as F1. Linear LR scaling: lr_batch = lr_online * batch_size. Conditions (batch, lr) × 3 seeds: (1, 0.01), (16, 0.16), (64, 0.64), (256, 2.56).
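
The scaling rule above can be written as a small helper. This is a sketch, not the `group_f_lr_scaled` source: the constant and function names are hypothetical, and the base rate 0.01 is taken from the B=1 row of the results table.

```rust
/// Base online learning rate (assumed from the B=1 row of the F2 results).
const LR_ONLINE: f64 = 0.01;

/// Returns (batch_size, learning_rate) pairs under linear LR scaling,
/// i.e. lr_batch = lr_online * batch_size.
fn lr_scaled_conditions(batch_sizes: &[usize]) -> Vec<(usize, f64)> {
    batch_sizes
        .iter()
        .map(|&b| (b, LR_ONLINE * b as f64))
        .collect()
}
```

`lr_scaled_conditions(&[1, 16, 64, 256])` yields the learning rates 0.01, 0.16, 0.64 and 2.56 that appear in the results table.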

Result

batch  lr    final_test_mean  final_test_std  train_test_gap  diverged
1      0.01  95.84%           0.23%           +1.63pp         0/3
16     0.16  95.96%           0.29%           +1.40pp         0/3
64     0.64  95.73%           0.19%           +1.26pp         0/3
256    2.56  94.81%           0.38%           +0.49pp         0/3

Test accuracy curves (mean across 3 seeds):

ex_seen  B=1(lr=0.01)  B=16(lr=0.16)  B=64(lr=0.64)  B=256(lr=2.56)
50K      0.926         0.925          0.923          0.896
100K     0.936         0.940          0.938          0.922
250K     0.953         0.952          0.951          0.936
500K     0.958         0.960          0.957          0.948

Analysis

  1. Online and batched SGD are equivalent within seed noise under linear LR scaling, up to B=64. B=16 lr=0.16 reaches 95.96% final accuracy vs 95.84% for B=1; B=64 lr=0.64 reaches 95.73%, 0.11pp behind online. Both differences are smaller than the 0.19-0.29pp seed standard deviations. Across the 500K-example trajectory, the B≤64 curves stay within 0.4pp of each other at every checkpoint.

  2. B=256 lr=2.56 starts to lag but does not diverge. Final accuracy is 94.81%, ~1pp behind B=1. Linear scaling starts to break in the B=128-512 range on dense MLPs (known from the literature); F2 surfaces this knee around B=256 on the sparse [128] architecture.

  3. The “online learning” claim is dead on fixed architectures. F1 + F2 together establish: under fixed topology and proper LR scaling, per-example online SGD is indistinguishable from mini-batch SGD with B up to at least 64. F1’s “online wins” was a same-per-step-LR confound; F2’s correctly-scaled comparison shows no advantage.

  4. Train-test gap is positive across the entire F2 sweep, opposite to F1. F1’s gap inversion was an undertraining artifact, not small-batch SGD’s implicit regularization.

  5. Mild implicit regularization at large batches. B=256 has the smallest gap (+0.49pp) — possibly because lr=2.56 introduces enough per-step noise to function as a regularizer. Could be noise from 3 seeds.
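
The "within noise" judgments in items 1 and 2 can be made concrete by expressing each difference in units of its standard error. The helper below is a hedged sketch: the function name is hypothetical, and it assumes final_test_std is a sample standard deviation across seeds, which the notes do not state.

```rust
/// Absolute difference between two condition means, divided by the
/// standard error of that difference (unequal-variance form).
/// Assumption: `std_a` and `std_b` are sample standard deviations over
/// `n_seeds` independent runs each.
fn diff_in_standard_errors(
    mean_a: f64,
    std_a: f64,
    mean_b: f64,
    std_b: f64,
    n_seeds: usize,
) -> f64 {
    let n = n_seeds as f64;
    let standard_error = (std_a * std_a / n + std_b * std_b / n).sqrt();
    (mean_a - mean_b).abs() / standard_error
}
```

Applied to the F2 table with 3 seeds, B=16 vs B=1 (95.96 and 0.29 vs 95.84 and 0.23) comes out below one standard error, while B=256 vs B=1 (94.81 and 0.38) comes out at roughly four. With only 3 seeds these ratios are rough guides, not formal tests.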

Conclusion

The “online per-example SGD” framing provides no measurable accuracy advantage over standard mini-batch SGD with linear LR scaling on a fixed architecture. Synth’s distinguishing mechanism is the NEAT-style topology evolution, not the per-example updates. F3 tests whether this conclusion survives under evolution.


F3: online vs batched SGD under evolution

Date: 2026-05-18
Binary: cargo run --release --bin group_f_evo_online_vs_batch
Output: notes/group_f/f3_v2_output.txt (with fair-evolution-schedule fix)

Setup

Single MNIST niche, pop=50, 600K examples per condition. Seeded 128-patch initial topology with warm-patch insertion enabled (Group E E2 mutation config). Two conditions × 2 seeds: online and batched B=64 (lr=0.64).

Fair evolution scheduling: both conditions run 60 generations (one evolve per 10K examples). The initial run used step % EVOLVE_INTERVAL == 0, which fires only on exact multiples. Batched mode advances step by 64 per inner iteration, so it hit the evolution trigger only at multiples of LCM(64, 10000) = 40,000 examples, giving batched 14 generations vs online’s 60. Fixed by switching to threshold-based scheduling.
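
The difference between the two triggers can be reproduced with a pair of counting loops. This is an illustrative sketch with hypothetical function names, not the project's training loop; it only counts how often evolution would fire.

```rust
/// Counts evolution triggers when the check is `step % interval == 0`,
/// evaluated after each advance of `stride` examples.
fn generations_modulo(total: u64, stride: u64, interval: u64) -> u64 {
    let mut step = 0;
    let mut generations = 0;
    while step < total {
        step += stride;
        if step % interval == 0 {
            generations += 1;
        }
    }
    generations
}

/// Counts evolution triggers with a threshold check: fire whenever `step`
/// has reached or passed the next scheduled boundary, then move the
/// boundary forward by one interval.
fn generations_threshold(total: u64, stride: u64, interval: u64) -> u64 {
    let mut step = 0;
    let mut next_evolve = interval;
    let mut generations = 0;
    while step < total {
        step += stride;
        while step >= next_evolve {
            generations += 1;
            next_evolve += interval;
        }
    }
    generations
}
```

With total = 600_000 and interval = 10_000, the modulo version gives 60 generations at stride 1 but fires only on multiples of 40,000 at stride 64, while the threshold version gives 60 at both strides. The exact modulo count depends on where the check sits in the loop: this sketch yields 15 at stride 64, so it is not expected to reproduce the 14 reported above exactly.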

Result

mode          mean_test_acc  mean_best_patches  mean_best_fitness
Online        96.37%         130.0              0.9755
Batched B=64  96.61%         135.0              0.9707

Best-fitness trajectory (mean across 2 seeds):

step  online_best  batched_best  online_patches  batched_patches
100K  0.941        0.955         128.7           129.0
200K  0.959        0.958         128.8           130.8
300K  0.967        0.965         130.2           132.0
400K  0.969        0.970         131.9           134.2
500K  0.966        0.969         131.1           135.3
600K  0.976        0.971         130.7           135.8

Analysis

  1. Online and batched SGD produce final accuracy that is indistinguishable at this sample size under evolution. The test accuracy gap is 0.24pp (batched ahead), marginally above the ~0.1-0.2pp noise floor and not separable from it with only 2 seeds. The mechanistic prediction that “warm-mutant survival needs per-example updates” is not supported by the data.

  2. Batched actually grew more patches than online: 135 vs 130. The opposite of the prediction. Possible mechanism: lr=0.64 single-update-per-64-examples gives the network more settling time between weight changes; freshly-inserted warm patches see a more representative gradient estimate before any weights move. But this could just be 2-seed variance (5-patch difference is within plausible noise).

  3. Convergence trajectories are nearly identical. Best-fitness curves stay within 0.015 of each other across the 600K-example budget: the largest gap is 0.014 at 100K, and from 200K onward the gap is 0.005 or less. The two conditions are operating in the same effective regime.

  4. The “online learning is foundational” claim is fully dead. F1 + F2 showed equivalence on fixed architecture. F3 shows the same equivalence holds under evolution — including with structural mutations active. The project’s distinctive mechanism is the NEAT-style topology evolution, not the per-example update style. Batched SGD with linear LR scaling is a drop-in replacement.

Practical implications

This is a positive practical finding. Batched SGD:

The “online” framing was both a description error and an unnecessary self-imposed constraint. Switching to batched SGD opens the door to using standard ML infrastructure where useful without sacrificing any of the system’s actual capabilities.

What this means for the prior research (Groups A-E)

Every prior result was obtained under online SGD. If batched SGD is equivalent on fixed architecture (F1+F2) and under evolution (F3), then all prior Group A-E findings should replicate under batched SGD with linear LR scaling. F4 (optional, conditional on human direction) would verify this on a representative experiment.


F4: Adam vs SGD on fixed architecture

Date: 2026-05-18
Binary: cargo run --release --bin group_f_adam
Output: notes/group_g/f4_output.txt

Setup

Same fixed [128]-MLP architecture as F1/F2. 500K examples, batch size 64. 4 conditions × 2 seeds: SGD lr=0.64, and Adam at lr=0.001, 0.003, and 0.01.

Result

optimizer  lr     final_test_mean  final_test_std  train_test_gap
SGD        0.64   96.18%           0.04%           +1.04pp
Adam       0.001  94.69%           0.17%           +1.14pp
Adam       0.003  95.86%           0.06%           +1.85pp
Adam       0.01   96.17%           0.13%           +1.68pp

Adam at the standard lr=0.001 underperforms SGD by 1.5pp. Adam at lr=0.01 effectively ties with SGD (96.17% vs 96.18%).

Analysis

Adam converges marginally faster in the early phase (at 50K examples, Adam lr=0.01 is at 92.88% vs SGD at 92.34%), but final accuracies converge. SGD continues improving past 300K examples while Adam plateaus earlier.
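
For reference, the two update rules being compared can be sketched for a single scalar weight. This is the standard bias-corrected Adam formulation, not the code in `group_f_adam`; the beta and epsilon defaults below are the commonly used values and are an assumption, since the notes give only the learning rates. The sketch may help explain why Adam's useful learning rates here are far smaller than SGD's 0.64: Adam divides by a running gradient magnitude, so its early steps have size close to lr regardless of gradient scale.

```rust
/// Single-parameter Adam state (standard formulation with bias correction).
struct Adam {
    lr: f64,
    beta1: f64,
    beta2: f64,
    eps: f64,
    m: f64, // running mean of gradients
    v: f64, // running mean of squared gradients
    t: i32, // number of steps taken
}

impl Adam {
    /// Assumed defaults: beta1 = 0.9, beta2 = 0.999, eps = 1e-8.
    fn new(lr: f64) -> Self {
        Adam { lr, beta1: 0.9, beta2: 0.999, eps: 1e-8, m: 0.0, v: 0.0, t: 0 }
    }

    /// Applies one update and returns the new weight.
    fn step(&mut self, w: f64, grad: f64) -> f64 {
        self.t += 1;
        self.m = self.beta1 * self.m + (1.0 - self.beta1) * grad;
        self.v = self.beta2 * self.v + (1.0 - self.beta2) * grad * grad;
        let m_hat = self.m / (1.0 - self.beta1.powi(self.t));
        let v_hat = self.v / (1.0 - self.beta2.powi(self.t));
        w - self.lr * m_hat / (v_hat.sqrt() + self.eps)
    }
}

/// Plain SGD step for comparison: the step size scales with the gradient.
fn sgd_step(w: f64, grad: f64, lr: f64) -> f64 {
    w - lr * grad
}
```

On the first step, `Adam::new(0.01).step(0.0, g)` moves the weight by about 0.01 whether g is 0.001 or 10.0, whereas `sgd_step` moves it in proportion to g.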

Conclusion

Optimizer choice makes no meaningful difference on this system within the conditions tested. The F1-F4 sequence ablated the optimizer axis along two directions: online vs batched (F1-F3) and SGD vs Adam (F4). Neither changes final accuracy once learning rates are scaled or tuned appropriately. NEAT-style topology evolution + standard SGD with reasonable hyperparameters is the operating point. Adam, the only adaptive optimizer tested, offers no improvement over SGD here.

This is a positive finding from an engineering simplicity standpoint — Synth doesn’t need fancy optimizers.