What We Found
22 experiments on a NEAT-style neuroevolution system, from first working prototype to 99.7% MNIST accuracy.
The System
Synth evolves small neural network topologies using NEAT-style genetic algorithms while training connection weights via per-example stochastic gradient descent. A population of networks all train on the same data stream. Periodically, the worst performers are culled and replaced with mutated offspring of the survivors.
The key design choice: evolution handles topology (what connects to what), SGD handles weights (how strong each connection is). This division of labor exploits what each method is good at. Gradient descent is excellent at continuous optimization but can’t add or remove connections. Evolutionary search is excellent at discrete combinatorial problems but inefficient at tuning thousands of continuous parameters.
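The cull-and-replace step described above can be sketched as follows. This is a minimal illustration, not Synth's actual code: the `Individual` struct, the `mutate` rule (add one random connection), the `cull_fraction` parameter, and the small xorshift generator are all assumptions made for the example. Weights are deliberately untouched here, since in this design they belong to SGD.

```rust
/// Hypothetical stand-in for a genome: only the fields this sketch needs.
#[derive(Clone, Debug)]
struct Individual {
    connections: Vec<(usize, usize)>, // topology: (source node, target node)
    fitness: f64,                     // e.g. recent training accuracy
}

/// Tiny xorshift generator so the sketch needs no external crates.
/// The seed must be non-zero.
struct XorShift(u64);

impl XorShift {
    fn below(&mut self, n: usize) -> usize {
        self.0 ^= self.0 << 13;
        self.0 ^= self.0 >> 7;
        self.0 ^= self.0 << 17;
        (self.0 % n as u64) as usize
    }
}

/// Structural mutation (assumed form): add one random connection.
/// The child starts with a copy of its parent's fitness.
fn mutate(parent: &Individual, n_nodes: usize, rng: &mut XorShift) -> Individual {
    let mut child = parent.clone();
    child.connections.push((rng.below(n_nodes), rng.below(n_nodes)));
    child
}

/// Replace the worst `cull_fraction` of the population with mutated
/// offspring of randomly chosen survivors.
fn cull_and_replace(
    pop: &mut Vec<Individual>,
    cull_fraction: f64,
    n_nodes: usize,
    rng: &mut XorShift,
) {
    // Sort best-first by fitness.
    pop.sort_by(|a, b| b.fitness.total_cmp(&a.fitness));
    let n_cull = (pop.len() as f64 * cull_fraction).floor() as usize;
    let n_survivors = pop.len() - n_cull;
    if n_survivors == 0 {
        return; // nothing left to breed from
    }
    pop.truncate(n_survivors);
    for _ in 0..n_cull {
        let parent = rng.below(n_survivors);
        let child = mutate(&pop[parent], n_nodes, rng);
        pop.push(child);
    }
}
```

Between calls to `cull_and_replace`, every individual would keep training on the shared data stream, so selection acts on topologies whose weights have had time to adapt.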
For the multi-task experiments (1-13, 19-20), the population splits into ecological niches with different data mix ratios of MNIST and Fashion-MNIST. Each niche evolves independently under its own data distribution — a form of speciation driven by environment rather than genome similarity.
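A niche's data mix can be pictured as a biased coin flip per training example. The sketch below assumes a hypothetical `pick_source` helper that takes the niche's MNIST fraction and a uniform draw `u` in [0, 1); how Synth actually interleaves the two datasets is not stated in the text.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Source {
    Mnist,
    Fashion,
}

/// `mnist_fraction` is the niche's mix ratio:
/// 1.0 = pure MNIST niche, 0.0 = pure Fashion-MNIST niche.
/// `u` is a uniform random draw in [0, 1).
fn pick_source(mnist_fraction: f64, u: f64) -> Source {
    if u < mnist_fraction {
        Source::Mnist
    } else {
        Source::Fashion
    }
}

/// Count how many of `n` evenly spaced draws land on MNIST.
fn mnist_count(mnist_fraction: f64, n: u32) -> u32 {
    (0..n)
        .filter(|&i| pick_source(mnist_fraction, i as f64 / n as f64) == Source::Mnist)
        .count() as u32
}
```

For a 20/80 niche, `mnist_count(0.2, 1000)` comes out at 200, so roughly one example in five is an MNIST digit.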
Written in Rust. ~2,000 lines. No tensor libraries. CPU only.
The Experiment Arc
Phase 1: Building and Calibrating (Experiments 1-4)
The first experiments established the baseline system. Key early lessons:
- SGD and evolution conflict on weights. Weight mutation is almost always harmful because SGD has already optimized the weights. Reducing `per_weight_perturb_prob` from 1.0 to 0.10 was critical: it turned weight mutation from a catastrophic reset into a mild regularizer. (Experiment 1)
- Multi-task learning works out of the box. The 50/50 niche (equal MNIST and Fashion-MNIST) achieved ~66% on both tasks simultaneously with only 50 individuals. No special multi-task architecture needed: the ecological niche pressure naturally drives multi-task capability. (Experiment 4)
- Zero cross-task transfer between pure niches. The 100% MNIST niche scores ~0% on Fashion-MNIST and vice versa. The networks specialize completely to their niche’s data distribution. (Experiment 4)
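To make the `per_weight_perturb_prob` lesson concrete, here is a minimal sketch of per-weight perturbation. Only the parameter name comes from the text; the uniform noise, the `scale` argument, and the xorshift generator are assumptions for illustration.

```rust
/// Tiny xorshift generator so the sketch needs no external crates.
/// The seed must be non-zero.
struct XorShift(u64);

impl XorShift {
    /// Uniform draw in [0, 1).
    fn next_f64(&mut self) -> f64 {
        self.0 ^= self.0 << 13;
        self.0 ^= self.0 >> 7;
        self.0 ^= self.0 << 17;
        (self.0 >> 11) as f64 / (1u64 << 53) as f64
    }
}

/// Perturb each weight independently with probability
/// `per_weight_perturb_prob`; returns how many weights changed.
/// Noise is uniform in [-scale, scale) (an assumed noise model).
fn perturb_weights(
    weights: &mut [f64],
    per_weight_perturb_prob: f64,
    scale: f64,
    rng: &mut XorShift,
) -> usize {
    let mut changed = 0;
    for w in weights.iter_mut() {
        if rng.next_f64() < per_weight_perturb_prob {
            *w += scale * (2.0 * rng.next_f64() - 1.0);
            changed += 1;
        }
    }
    changed
}
```

At a probability of 1.0 every SGD-trained weight is disturbed in each offspring; at 0.10 about nine in ten survive untouched, which fits the shift from "catastrophic reset" to "mild regularizer".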
Phase 2: Scaling and Tuning (Experiments 5-8)
Population size and learning rate schedule tuning. This phase produced the strongest baseline.
- Population size is the strongest lever. 50→100 individuals gave +4-6pp accuracy across all niches. More individuals = better sampling of the fitness landscape = better selection = better offspring. (Experiment 5)
- Learning rate decay has a Goldilocks zone. Decay from 0.01→0.001 in the niche phase produced the most structural divergence between niches (17-connection spread) while maintaining accuracy. Decaying to 0.0001 froze the fitness landscape: too little gradient signal for either weights or topology to improve. (Experiments 6-8)
- Warm-up quality dominates final accuracy. Across experiments 5-8, the single best predictor of post-split accuracy was how well the warm-up phase trained the initial population. Constant lr=0.01 during warm-up + decay during niche phase became the default. (Experiment 8)
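The default schedule (constant rate during warm-up, then decay during the niche phase) might look like the sketch below. The text gives only the endpoints 0.01 and 0.001, so the geometric shape of the decay and the function signature are assumptions.

```rust
/// Constant `lr_start` during warm-up, then geometric decay towards
/// `lr_end` over the niche phase. The decay shape is an assumption;
/// the source only states the start and end values.
fn learning_rate(
    step: u64,
    warmup_steps: u64,
    total_steps: u64,
    lr_start: f64,
    lr_end: f64,
) -> f64 {
    if step < warmup_steps || total_steps <= warmup_steps {
        return lr_start;
    }
    // Progress through the niche phase, clamped to [0, 1].
    let t = ((step - warmup_steps) as f64 / (total_steps - warmup_steps) as f64).min(1.0);
    lr_start * (lr_end / lr_start).powf(t)
}
```

With illustrative step counts of 600,000 warm-up steps out of 1,800,000, the rate stays at 0.01 through warm-up, passes about 0.0032 halfway through the niche phase, and ends at 0.001. Pushing `lr_end` down to 0.0001 would reproduce the "frozen landscape" setting from Experiment 11.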
Phase 3: The Plateau (Experiments 9-12)
Four consecutive negative results. Every variation from the Experiment 8 baseline made things worse.
| Experiment | What We Tried | Result | Why It Failed |
|---|---|---|---|
| 9 | Cross-niche ring migration | -0.6 to -1.6pp (4/5 niches) | Disrupts ecological isolation. Foreign genomes are maladapted to the receiving niche’s data distribution. |
| 10 | Aggressive structural mutation (2x rates) | -0.2 to -0.9pp (mixed niches) | More mutations add bulk complexity uniformly. Ecological divergence comes from differential selection, not more mutation. |
| 11 | Aggressive LR decay (0.01→0.0001) | -1.2 to -2.3pp | LR drops below effectiveness threshold. Networks freeze in late training — no signal for selection to differentiate niches. |
| 12 | Extended warm-up (600K→1M max steps) | -0.2 to -1.5pp | Over-converged warm-up population loses plasticity. A population still improving at the split point adapts better to new data. |
These experiments were individually disappointing but collectively illuminating. They revealed that the system was at a local optimum for hyperparameter tuning within its current architecture. The sparse linear classifier (direct input→output connections with a few isolated hidden nodes) had a hard ceiling around 78% MNIST.
The meta-lesson: when every variation makes things worse, the problem isn’t the hyperparameters. It’s the architecture.
Phase 4: Breaking Through (Experiments 13-14)
- Population 200 gave +1-2pp (Experiment 13). Diminishing returns compared to the 50→100 jump, but the only consistently positive lever across all experiments. Interestingly, larger populations produced smaller networks: more competition means leaner survivors.
- Seeded hidden layer: 78% → 96% (Experiment 14). Instead of evolving topology from sparse input→output connections, seed every network with a 784→32→10 architecture (32 hidden nodes with ReLU, each connected to ~10% of inputs). One change. +18 percentage points. More than all 12 previous tuning experiments combined.
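A seeded 784→32→10 topology can be sketched as a connection list. The node numbering, the independent coin flip per input connection, and the fully connected hidden→output layer are assumptions; the text only says each hidden node connects to roughly 10% of inputs.

```rust
/// Tiny xorshift generator so the sketch needs no external crates.
/// The seed must be non-zero.
struct XorShift(u64);

impl XorShift {
    /// Uniform draw in [0, 1).
    fn next_f64(&mut self) -> f64 {
        self.0 ^= self.0 << 13;
        self.0 ^= self.0 >> 7;
        self.0 ^= self.0 << 17;
        (self.0 >> 11) as f64 / (1u64 << 53) as f64
    }
}

/// Build the initial (source, target) connection list.
/// Assumed node ids: inputs 0..n_in, hidden n_in..n_in+n_hidden,
/// outputs after that.
fn seed_connections(
    n_in: usize,
    n_hidden: usize,
    n_out: usize,
    input_fraction: f64,
    rng: &mut XorShift,
) -> Vec<(usize, usize)> {
    let mut conns = Vec::new();
    for h in 0..n_hidden {
        let hidden_id = n_in + h;
        // Each input connects to this hidden node with probability `input_fraction`.
        for i in 0..n_in {
            if rng.next_f64() < input_fraction {
                conns.push((i, hidden_id));
            }
        }
        // Assumption: every hidden node feeds every output.
        for o in 0..n_out {
            conns.push((hidden_id, n_in + n_hidden + o));
        }
    }
    conns
}
```

Under these assumptions `seed_connections(784, 32, 10, 0.10, &mut rng)` yields on the order of 2,800 connections, the same order of magnitude as the 2,943 reported for the best Experiment 14 network.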
Phase 5: Scaling the Hidden Layer (Experiments 15-19)
With the architectural breakthrough in hand, we explored how far it could go.
- Width scaling has no diminishing returns yet. 32→64→128 hidden nodes gave 95.87%→97.23%→98.70%. Each doubling adds ~1.4pp. NEAT consistently discovers sparse subnetworks at ~11% of dense parameters regardless of width. (Experiments 15, 17)
- More narrow feature detectors beat fewer wide ones. At the same ~5,500 connection budget, 64 nodes with 10% input fraction (97.23%) beats 32 nodes with 20% fraction (97.06%). Specialized local detectors compose better than diffuse global ones. (Experiments 15 vs 16)
- Depth is more parameter-efficient than width. A two-layer 784→64→32→10 (7,518 connections, 98.29%) beats the single-layer 784→64→10 (5,673 connections, 97.23%) and approaches the wider 784→128→10 (11,498 connections, 98.70%) at lower cost. The second hidden layer learns compositional features. (Experiment 18)
- The hidden layer transforms multi-task learning. Re-enabling Fashion-MNIST ecological speciation with the seeded hidden layer: the 50/50 niche jumped from 74%+73% to 94%+84% on MNIST+Fashion simultaneously. The hidden layer enables shared feature representations that both tasks exploit. Cross-task transfer became positive: the 20/80 niche (80% Fashion) still achieves 92% MNIST. (Experiment 19)
Phase 6: Depth, Multi-Task Scaling, and Sparsity (Experiments 20-22)
Pushing the architecture further — deeper networks, wider multi-task, and sparse inter-layer connections.
- Multi-task scales with width. 128 hidden nodes in multi-task mode: 50/50 niche hits 97.5% MNIST + 87.5% Fashion (92.5% total). Up from 94%+84% with 32 nodes. Width scaling transfers cleanly from single-task to multi-task. (Experiment 20)
- Depth breaks the 99% ceiling. Two-layer [128, 64] achieves 99.73% MNIST, only 160 errors on 60,000 images. The single-layer [128] plateaued at 98.70%. The second layer provides compositional features (combinations of first-layer detectors) that single layers can’t express. (Experiment 21)
- Inter-layer sparsity is nearly free. Halving the connections between hidden layers (50% instead of 100%) costs only 0.05pp (99.68% vs 99.73%) while saving about 20% of total parameters. The dense inter-layer connections were massively over-parameterized. (Experiment 22)
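The sparsity trade-off in Experiment 22 is simple arithmetic on the reported connection counts and accuracies. The helper names below are illustrative; the numbers come from the text.

```rust
/// Fraction of connections saved by the sparser network.
fn relative_saving(full_connections: u32, sparse_connections: u32) -> f64 {
    (full_connections as f64 - sparse_connections as f64) / full_connections as f64
}

/// Accuracy given up, in percentage points.
fn accuracy_cost_pp(full_accuracy: f64, sparse_accuracy: f64) -> f64 {
    full_accuracy - sparse_accuracy
}
```

`relative_saving(18_924, 15_201)` is about 0.197, which is the "20% of total parameters", and `accuracy_cost_pp(99.73, 99.68)` is about 0.05. For scale, a fully connected 128→64 inter-layer block holds 128 × 64 = 8,192 connections, so the 3,723 connections removed are close to half of it.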
The Big Insights
1. Architecture >> Hyperparameters
One structural change (the seeded hidden layer) outweighed 12 experiments of hyperparameter tuning. The sparse linear classifier couldn’t exceed ~78% MNIST regardless of population size, mutation rates, or learning schedule. The hidden layer provides nonlinear feature extraction, the thing that makes neural networks neural networks rather than fancy logistic regression.
This echoes a broader lesson in machine learning: model capacity and architecture choices dominate training procedure choices. The best optimizer can’t compensate for an inadequate model.
2. NEAT as Sparse Subnetwork Discovery
With the seeded hidden layer, NEAT’s role shifted from “evolve a network from scratch” to “discover the optimal sparse subnetwork.” Across all widths tested, the system converges to roughly 11% of the equivalent dense network’s parameters:
| Hidden Nodes | Dense Weights | Synth Connections | Compression | Accuracy |
|---|---|---|---|---|
| 32 | 25,760 | 2,943 | 11.4% | 95.87% |
| 64 | 50,880 | 5,673 | 11.1% | 97.23% |
| 128 | 101,760 | 11,498 | 11.3% | 98.70% |
This is essentially the lottery ticket hypothesis implemented via evolution rather than magnitude-based pruning. The evolutionary system finds a “winning ticket” (a sparse subnetwork that performs nearly as well as the dense original) through selection pressure rather than post-hoc pruning. The ~11% ratio holds across all three widths, which suggests it is a stable property of this system rather than a coincidence. It also sits close to the ~10% input fraction each network is seeded with, so part of it reflects the seed rather than evolution alone.
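The compression column is just the ratio of Synth connections to dense weights. A short check using the table's own numbers (the function and constant names are illustrative):

```rust
/// (hidden nodes, dense weights, Synth connections), copied from the table.
const ROWS: [(u32, u32, u32); 3] = [
    (32, 25_760, 2_943),
    (64, 50_880, 5_673),
    (128, 101_760, 11_498),
];

/// Synth connections as a percentage of the dense network's weights.
fn compression_percent(connections: u32, dense_weights: u32) -> f64 {
    100.0 * connections as f64 / dense_weights as f64
}

/// Compression percentage for every row, in table order.
fn all_compression_percents() -> Vec<f64> {
    ROWS.iter()
        .map(|&(_, dense, conns)| compression_percent(conns, dense))
        .collect()
}
```

The three ratios come out at roughly 11.4%, 11.1%, and 11.3%, matching the table.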
3. Negative Results Have Structure
The four negative experiments (9-12) weren’t random failures. They followed a pattern:
- Migration hurts because ecological speciation requires isolation. The system works precisely because each niche evolves independently.
- More mutation doesn’t mean more diversity because diversity comes from differential selection, not more variation. Adding noise uniformly washes out the ecological signal.
- Stronger decay freezes the landscape because there’s a minimum gradient signal needed for selection to differentiate variants.
- Over-training the warm-up kills plasticity because a fully-converged population is too specialized to adapt to new data distributions.
Each negative result constrained the design space and clarified why the positive results (population scaling, moderate LR decay) worked.
4. Depth > Width > Density
At a given parameter budget, depth (more layers) is more efficient than width (more nodes in one layer), which is more efficient than density (more connections per node):
| Architecture | Connections | MNIST | Extra conn per +1pp |
|---|---|---|---|
| [64] single layer | 5,673 | 97.23% | 2,007 |
| [32] @20% inputs | 5,409 | 97.06% | 2,283 |
| [64, 32] two layers | 7,518 | 98.29% | 1,890 |
| [128] single layer | 11,498 | 98.70% | 3,023 |
| [128, 64] @50% inter | 15,201 | 99.68% | 3,239 |
| [128, 64] full inter | 18,924 | 99.73% | 4,142 |
The second hidden layer adds compositional features — combinations of the first layer’s edge and stroke detectors — that a wider single layer can’t efficiently express. At 50% inter-layer sparsity, the [128, 64] network gets 99.68% with 20% fewer parameters than the fully-connected version — the lottery ticket pattern extends to inter-layer connections too.
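The table does not say what the "Extra conn per +1pp" column is measured against. One reading, which is an assumption, is the marginal cost relative to the [32] network from Experiment 14 (2,943 connections, 95.87%). That reading reproduces the [64], [64, 32], and [128] rows to within rounding. The other rows differ somewhat under it, so the original calculation may have used unrounded accuracies or a different baseline.

```rust
/// Assumed baseline: the seeded [32] network from Experiment 14.
const BASE_CONNECTIONS: u32 = 2_943;
const BASE_ACCURACY: f64 = 95.87;

/// Extra connections spent per percentage point gained over a baseline.
fn connections_per_pp(connections: u32, accuracy: f64, base_connections: u32, base_accuracy: f64) -> f64 {
    (connections as f64 - base_connections as f64) / (accuracy - base_accuracy)
}

/// Same calculation against the assumed [32] baseline.
fn cost_vs_seed(connections: u32, accuracy: f64) -> f64 {
    connections_per_pp(connections, accuracy, BASE_CONNECTIONS, BASE_ACCURACY)
}
```

For example, `cost_vs_seed(7_518, 98.29)` is about 1,890 for the two-layer [64, 32] network, against about 3,023 for `cost_vs_seed(11_498, 98.70)` with the wider single layer, which is the depth-over-width argument in numeric form.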
5. The Hidden Layer Transforms Multi-Task Learning
The seeded hidden layer had an even larger impact on multi-task learning than on single-task:
| Metric | Sparse linear (Exp 13) | Hidden [32] (Exp 19) | Hidden [128] (Exp 20) |
|---|---|---|---|
| 50/50 MNIST | 74.34% | 94.14% | 97.54% |
| 50/50 Fashion | 72.93% | 83.80% | 87.45% |
| 50/50 Total | 73.6% | 89.0% | 92.5% |
With sparse linear classifiers, multi-task learning was limited by representational poverty — each input→output connection is specific to one class. The hidden layer provides a shared feature space that both MNIST and Fashion-MNIST can exploit. Width scaling transfers directly: 4x wider hidden layer gives +3.5pp on both tasks. Cross-task transfer is strongly positive: the 20/80 niche (80% Fashion) still achieves 96% MNIST because warm-up features generalize.
6. Population Size: The Only Consistently Positive Lever
Across all 22 experiments, increasing population size was the only intervention that always helped:
| Change | Pop 50→100 | Pop 100→200 |
|---|---|---|
| MNIST (100/0 niche) | +5.7pp | +2.8pp |
| Multi-task (50/50) | +4.2pp | +1.1pp |
Diminishing returns, but always positive. Larger populations improve evolutionary search quality without changing the learning dynamics — the only intervention that doesn’t disrupt the SGD/evolution balance.
The Numbers
Best Single-Task Result (Experiment 21)
Best individual on full 60,000-image MNIST evaluation:
| Metric | Value |
|---|---|
| MNIST accuracy | 99.73% (59,840 / 60,000) |
| Connections | 18,924 |
| Architecture | 784→128→64→10 (sparse input, full inter-layer) |
| Parameters as % of dense equivalent | 17.3% (13.9% with sparse inter-layer) |
| Training steps | 1,800,000 (online, one example at a time) |
| Population size | 200 |
| Generations | 179 |
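The "% of dense equivalent" row can be checked by counting the weights of a fully connected 784→128→64→10 network. The sketch counts weights only and leaves out biases; that convention is an assumption, though it does reproduce the 17.3% and 13.9% figures.

```rust
/// Weights in a fully connected feed-forward network with the given
/// layer sizes (biases excluded, an assumed convention).
fn dense_weights(layers: &[usize]) -> usize {
    layers.windows(2).map(|pair| pair[0] * pair[1]).sum()
}

/// Evolved connections as a percentage of the dense equivalent.
fn percent_of_dense(connections: usize, layers: &[usize]) -> f64 {
    100.0 * connections as f64 / dense_weights(layers) as f64
}
```

`dense_weights(&[784, 128, 64, 10])` is 109,184, so the 18,924-connection network is about 17.3% of dense and the 15,201-connection variant from Experiment 22 about 13.9%.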
Best Multi-Task Result (Experiment 20)
Best individuals from each niche on full 60K×2 datasets:
| Niche | MNIST | Fashion | Total | Connections |
|---|---|---|---|---|
| 100/0 | 98.94% | — | — | 12,844 |
| 50/50 | 97.54% | 87.45% | 92.50% | 12,849 |
| 20/80 | 96.34% | 88.32% | 92.33% | 12,854 |
| 0/100 | — | 88.77% | — | 12,699 |
Accuracy Progression Across All 22 Experiments
| Experiment | Key Change | Best MNIST | Connections |
|---|---|---|---|
| 1-3 | Initial system (50 pop, 10 outputs) | ~72% | ~400 |
| 4 | 20 outputs, 4 niches | ~70% | ~785 |
| 5 | Population 100 | ~75% | ~866 |
| 8 | Decoupled LR decay | ~75% | ~858 |
| 9-12 | Four negative results (migration, mutation, decay, warmup) | ~74-76% | ~850-870 |
| 13 | Population 200 | ~78% | ~838 |
| 14 | Seeded hidden layer [32] | 95.87% | 2,943 |
| 15 | Wider hidden layer [64] | 97.23% | 5,673 |
| 17 | Wider hidden layer [128] | 98.70% | 11,498 |
| 18 | Two hidden layers [64, 32] | 98.29% | 7,518 |
| 19 | Multi-task [32] | 94.14%+83.80% | 3,230 |
| 20 | Multi-task [128] | 97.54%+87.45% | 12,849 |
| 21 | Deep [128, 64] | 99.73% | 18,924 |
| 22 | Deep [128, 64] @50% inter-layer | 99.68% | 15,201 |
The raw data: Experiment Notes and Research Journal.