What We Found

22 experiments on a NEAT-style neuroevolution system, from first working prototype to 99.7% MNIST accuracy.


The System

Synth evolves small neural network topologies using NEAT-style genetic algorithms while training connection weights via per-example stochastic gradient descent. A population of networks all train on the same data stream. Periodically, the worst performers are culled and replaced with mutated offspring of the survivors.
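The cull-and-replace step can be sketched as follows. This is a minimal illustration, not Synth's actual code: the `Individual` struct, the `mutate` stub, and the `cull_fraction` parameter are hypothetical names, and the real structural mutations are richer than "add one connection".

```rust
/// Hypothetical stand-in for one network in the population.
#[derive(Clone, Debug, PartialEq)]
struct Individual {
    /// Number of connections; stands in for the full genome here.
    connections: usize,
    /// Running accuracy on the shared data stream (higher is better).
    fitness: f64,
}

/// Assumed mutation: the offspring gains one connection and starts
/// with no fitness history.
fn mutate(parent: &Individual) -> Individual {
    Individual { connections: parent.connections + 1, fitness: 0.0 }
}

/// Cull the worst `cull_fraction` of the population and refill it with
/// mutated offspring of the survivors, cycling through the survivors in
/// fitness order. The population size is preserved.
fn cull_and_replace(population: &mut Vec<Individual>, cull_fraction: f64) {
    let size = population.len();
    if size == 0 {
        return;
    }
    // Sort best-first; total_cmp gives a total order on f64.
    population.sort_by(|a, b| b.fitness.total_cmp(&a.fitness));
    let culled = ((size as f64) * cull_fraction.clamp(0.0, 1.0)).floor() as usize;
    // Always keep at least one survivor to breed from.
    let survivors = (size - culled.min(size)).max(1);
    population.truncate(survivors);
    for i in 0..(size - survivors) {
        let child = mutate(&population[i % survivors]);
        population.push(child);
    }
}
```

Calling `cull_and_replace(&mut population, 0.5)` on four individuals keeps the two fittest and appends one mutated child of each.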

The key design choice: evolution handles topology (what connects to what), SGD handles weights (how strong each connection is). This division of labor exploits what each method is good at. Gradient descent is excellent at continuous optimization but can’t add or remove connections. Evolutionary search is excellent at discrete combinatorial problems but inefficient at tuning thousands of continuous parameters.
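One way to picture this division of labor is a genome whose structure and weights are modified by two separate operations. The sketch below is illustrative only: `Genome`, `Connection`, `add_connection`, and `sgd_step` are assumed names, and the gradients are passed in rather than computed.

```rust
/// One weighted edge in a sparse network (hypothetical layout).
#[derive(Clone, Debug)]
struct Connection {
    from: usize,
    to: usize,
    weight: f64,
}

/// In this sketch a genome is just its list of connections.
#[derive(Clone, Debug, Default)]
struct Genome {
    connections: Vec<Connection>,
}

impl Genome {
    /// Evolution's job: change topology. Adds an edge if it is absent
    /// and reports whether the structure changed.
    fn add_connection(&mut self, from: usize, to: usize, weight: f64) -> bool {
        if self.connections.iter().any(|c| c.from == from && c.to == to) {
            return false;
        }
        self.connections.push(Connection { from, to, weight });
        true
    }

    /// SGD's job: change weights. `grads[i]` is the loss gradient with
    /// respect to the weight of `connections[i]`; topology is untouched.
    fn sgd_step(&mut self, grads: &[f64], learning_rate: f64) {
        for (conn, grad) in self.connections.iter_mut().zip(grads) {
            conn.weight -= learning_rate * grad;
        }
    }
}
```

Neither operation can do the other's work: `sgd_step` never adds or removes an edge, and `add_connection` never tunes an existing weight.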

For the multi-task experiments (1-13, 19-20), the population splits into ecological niches with different data mix ratios of MNIST and Fashion-MNIST. Each niche evolves independently under its own data distribution — a form of speciation driven by environment rather than genome similarity.
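A niche's data mix can be modeled as a single probability of drawing the next example from MNIST. The following is a sketch under that assumption; `Niche`, `Source`, and `next_source` are hypothetical names, and the uniform random draw is supplied by the caller so the example does not depend on any RNG crate.

```rust
/// Which dataset a training example is drawn from.
#[derive(Debug, PartialEq, Clone, Copy)]
enum Source {
    Mnist,
    FashionMnist,
}

/// A niche defined by its data mix (hypothetical representation).
struct Niche {
    /// Fraction of examples drawn from MNIST, in [0, 1].
    /// A "20/80" niche would use 0.2.
    mnist_fraction: f64,
}

impl Niche {
    /// Pick the source of the next training example.
    /// `u` is a uniform random draw in [0, 1) supplied by the caller.
    fn next_source(&self, u: f64) -> Source {
        if u < self.mnist_fraction {
            Source::Mnist
        } else {
            Source::FashionMnist
        }
    }
}
```

With this representation the 100/0 and 0/100 niches are just the endpoints `mnist_fraction = 1.0` and `mnist_fraction = 0.0`.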

Written in Rust. ~2,000 lines. No tensor libraries. CPU only.

The Experiment Arc

Phase 1: Building and Calibrating (Experiments 1-4)

The first experiments established the baseline system. Key early lessons:

Phase 2: Scaling and Tuning (Experiments 5-8)

Population size and learning rate schedule tuning. This phase produced the strongest baseline.

Phase 3: The Plateau (Experiments 9-12)

Four consecutive negative results. Every variation from the Experiment 8 baseline made things worse.

| Experiment | What We Tried | Result | Why It Failed |
|---|---|---|---|
| 9 | Cross-niche ring migration | -0.6 to -1.6pp (4/5 niches) | Disrupts ecological isolation. Foreign genomes are maladapted to the receiving niche’s data distribution. |
| 10 | Aggressive structural mutation (2x rates) | -0.2 to -0.9pp (mixed niches) | More mutations add bulk complexity uniformly. Ecological divergence comes from differential selection, not more mutation. |
| 11 | Aggressive LR decay (0.01→0.0001) | -1.2 to -2.3pp | LR drops below effectiveness threshold. Networks freeze in late training, leaving no signal for selection to differentiate niches. |
| 12 | Extended warm-up (600K→1M max steps) | -0.2 to -1.5pp | Over-converged warm-up population loses plasticity. A population still improving at the split point adapts better to new data. |
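For readers unfamiliar with ring migration (Experiment 9), the idea can be sketched as each niche sending a copy of its best genome to the next niche in a fixed cycle. The details here are assumptions for illustration: the text does not say how many genomes migrated or how they were chosen, and the `(genome_id, fitness)` pairs are a hypothetical stand-in for real genomes.

```rust
/// Copy each niche's best genome into the next niche in the ring:
/// niche i sends to niche (i + 1) mod n.
/// Niches are lists of (genome_id, fitness) pairs in this sketch.
fn migrate_best(niches: &mut [Vec<(u32, f64)>]) {
    let n = niches.len();
    // Collect emigrants first so every niche sends its pre-migration
    // best, not a genome it has just received.
    let emigrants: Vec<Option<(u32, f64)>> = niches
        .iter()
        .map(|niche| niche.iter().copied().max_by(|a, b| a.1.total_cmp(&b.1)))
        .collect();
    for (i, emigrant) in emigrants.into_iter().enumerate() {
        if let Some(genome) = emigrant {
            niches[(i + 1) % n].push(genome);
        }
    }
}
```

The arriving genome was selected under the sender's data mix, which is the maladaptation the table describes.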

These experiments were individually disappointing but collectively illuminating. They revealed that the system was at a local optimum for hyperparameter tuning within its current architecture. The sparse linear classifier (direct input→output connections with a few isolated hidden nodes) had a hard ceiling around 78% MNIST.

The meta-lesson: when every variation makes things worse, the problem isn’t the hyperparameters. It’s the architecture.

Phase 4: Breaking Through (Experiments 13-14)

Phase 5: Scaling the Hidden Layer (Experiments 15-19)

With the architectural breakthrough in hand, we explored how far it could go.

Phase 6: Depth, Multi-Task Scaling, and Sparsity (Experiments 20-22)

Pushing the architecture further — deeper networks, wider multi-task, and sparse inter-layer connections.

The Big Insights

1. Architecture >> Hyperparameters

One structural change (the seeded hidden layer) outweighed 12 experiments of hyperparameter tuning. The sparse linear classifier couldn’t exceed ~78% MNIST regardless of population size, mutation rates, or learning schedule. The hidden layer provides nonlinear feature extraction, the thing that makes neural networks neural networks rather than fancy logistic regression.

This echoes a broader lesson in machine learning: model capacity and architecture choices dominate training procedure choices. The best optimizer can’t compensate for an inadequate model.

2. NEAT as Sparse Subnetwork Discovery

With the seeded hidden layer, NEAT’s role shifted from “evolve a network from scratch” to “discover the optimal sparse subnetwork.” Across all widths tested, the system converges to roughly 11% of the equivalent dense network’s parameters:

| Hidden Nodes | Dense Weights | Synth Connections | Compression | Accuracy |
|---|---|---|---|---|
| 32 | 25,760 | 2,943 | 11.4% | 95.87% |
| 64 | 50,880 | 5,673 | 11.1% | 97.23% |
| 128 | 101,760 | 11,498 | 11.3% | 98.70% |

This is essentially the lottery ticket hypothesis implemented via evolution rather than magnitude-based pruning. The evolutionary system finds a “winning ticket” (a sparse subnetwork that performs nearly as well as the dense original) through selection pressure rather than post-hoc pruning. The ~11% ratio is stable across all three widths tested, which suggests it is a property of this system rather than a coincidence.
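The compression column is simply the evolved connection count divided by the dense weight count. A short check using the three reported rows (the function names and the `ROWS` constant are illustrative, not part of Synth):

```rust
/// Sparse connection count as a percentage of the dense weight count.
fn compression_pct(sparse_connections: u32, dense_weights: u32) -> f64 {
    100.0 * f64::from(sparse_connections) / f64::from(dense_weights)
}

/// Round to one decimal place, matching how the percentages are reported.
fn round1(x: f64) -> f64 {
    (x * 10.0).round() / 10.0
}

/// (hidden nodes, dense weights, Synth connections) as reported.
const ROWS: [(u32, u32, u32); 3] = [
    (32, 25_760, 2_943),
    (64, 50_880, 5_673),
    (128, 101_760, 11_498),
];

/// Compression percentage for each reported row, in order.
fn table_compressions() -> Vec<f64> {
    ROWS.iter()
        .map(|&(_, dense, sparse)| round1(compression_pct(sparse, dense)))
        .collect()
}
```

`table_compressions()` gives 11.4, 11.1, and 11.3 for the three widths, so the ratio moves by only a few tenths of a point while the width quadruples.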

3. Negative Results Have Structure

The four negative experiments (9-12) weren’t random failures. They followed a pattern:

Each negative result constrained the design space and clarified why the positive results (population scaling, moderate LR decay) worked.

4. Depth > Width > Density

At a given parameter budget, depth (more layers) is more efficient than width (more nodes in one layer), which is more efficient than density (more connections per node):

| Architecture | Connections | MNIST | Extra connections per +1pp |
|---|---|---|---|
| [64] single layer | 5,673 | 97.23% | 2,007 |
| [32] @20% inputs | 5,409 | 97.06% | 2,283 |
| [64, 32] two layers | 7,518 | 98.29% | 1,890 |
| [128] single layer | 11,498 | 98.70% | 3,023 |
| [128, 64] @50% inter | 15,201 | 99.68% | 3,239 |
| [128, 64] full inter | 18,924 | 99.73% | 4,142 |

The second hidden layer adds compositional features — combinations of the first layer’s edge and stroke detectors — that a wider single layer can’t efficiently express. At 50% inter-layer sparsity, the [128, 64] network gets 99.68% with 20% fewer parameters than the fully-connected version — the lottery ticket pattern extends to inter-layer connections too.

5. The Hidden Layer Transforms Multi-Task Learning

The seeded hidden layer had an even larger impact on multi-task learning than on single-task:

| Metric | Sparse linear (Exp 13) | Hidden [32] (Exp 19) | Hidden [128] (Exp 20) |
|---|---|---|---|
| 50/50 MNIST | 74.34% | 94.14% | 97.54% |
| 50/50 Fashion | 72.93% | 83.80% | 87.45% |
| 50/50 Total | 73.6% | 89.0% | 92.5% |

With sparse linear classifiers, multi-task learning was limited by representational poverty: each input→output connection is specific to one class. The hidden layer provides a shared feature space that both MNIST and Fashion-MNIST can exploit. Width scaling transfers directly: a 4x wider hidden layer gives roughly +3.5pp on both tasks (+3.40pp MNIST, +3.65pp Fashion in the 50/50 niche). Cross-task transfer is strongly positive: the 20/80 niche (80% Fashion) still achieves 96% MNIST because warm-up features generalize.

6. Population Size: The Only Consistently Positive Lever

Across all 22 experiments, increasing population size was the only intervention that always helped:

| Change | Pop 50→100 | Pop 100→200 |
|---|---|---|
| MNIST (100/0 niche) | +5.7pp | +2.8pp |
| Multi-task (50/50) | +4.2pp | +1.1pp |

Diminishing returns, but always positive. Larger populations improve evolutionary search quality without changing the learning dynamics — the only intervention that doesn’t disrupt the SGD/evolution balance.

The Numbers

Best Single-Task Result (Experiment 21)

Best individual on full 60,000-image MNIST evaluation:

| Metric | Value |
|---|---|
| MNIST accuracy | 99.73% (59,840 / 60,000) |
| Connections | 18,924 |
| Architecture | 784→128→64→10 (sparse input, full inter-layer) |
| Parameters as % of dense equivalent | 17.3% (13.9% with sparse inter-layer) |
| Training steps | 1,800,000 (online, one example at a time) |
| Population size | 200 |
| Generations | 179 |
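The "% of dense equivalent" figures can be reproduced by counting the parameters of a fully connected 784→128→64→10 network. The text does not say whether its dense baseline includes biases, so the sketch below takes that as a flag. Counting one bias per non-input node is an assumption; both choices round to the same reported percentages. The function names are illustrative.

```rust
/// Parameter count of a fully connected feed-forward network with the
/// given layer sizes. With `with_bias`, one bias per non-input node is
/// added (an assumption about how the dense baseline is counted).
fn dense_params(layers: &[usize], with_bias: bool) -> usize {
    layers
        .windows(2)
        .map(|pair| {
            let (fan_in, fan_out) = (pair[0], pair[1]);
            fan_in * fan_out + if with_bias { fan_out } else { 0 }
        })
        .sum()
}

/// Evolved connection count as a percentage of the dense count.
fn pct_of_dense(connections: usize, dense: usize) -> f64 {
    100.0 * connections as f64 / dense as f64
}
```

With biases the dense network has 109,386 parameters, so `pct_of_dense(18_924, 109_386)` is about 17.3 and `pct_of_dense(15_201, 109_386)` is about 13.9, matching the reported figures.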

Best Multi-Task Result (Experiment 20)

Best individuals from each niche on full 60K×2 datasets:

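| Niche (MNIST/Fashion) | MNIST | Fashion | Total | Connections |
|---|---|---|---|---|
| 100/0 | 98.94% | - | - | 12,844 |
| 50/50 | 97.54% | 87.45% | 92.50% | 12,849 |
| 20/80 | 96.34% | 88.32% | 92.33% | 12,854 |
| 0/100 | - | 88.77% | - | 12,699 |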

Accuracy Progression Across All 22 Experiments

| Experiment | Key Change | Best MNIST | Connections |
|---|---|---|---|
| 1-3 | Initial system (50 pop, 10 outputs) | ~72% | ~400 |
| 4 | 20 outputs, 4 niches | ~70% | ~785 |
| 5 | Population 100 | ~75% | ~866 |
| 8 | Decoupled LR decay | ~75% | ~858 |
| 9-12 | Four negative results (migration, mutation, decay, warmup) | ~74-76% | ~850-870 |
| 13 | Population 200 | ~78% | ~838 |
| 14 | Seeded hidden layer [32] | 95.87% | 2,943 |
| 15 | Wider hidden layer [64] | 97.23% | 5,673 |
| 17 | Wider hidden layer [128] | 98.70% | 11,498 |
| 18 | Two hidden layers [64, 32] | 98.29% | 7,518 |
| 19 | Multi-task [32] | 94.14%+83.80% | 3,230 |
| 20 | Multi-task [128] | 97.54%+87.45% | 12,849 |
| 21 | Deep [128, 64] | 99.73% | 18,924 |
| 22 | Deep [128, 64] @50% inter-layer | 99.68% | 15,201 |

The raw data: Experiment Notes and Research Journal.