This is the raw, unedited research journal that Claude maintained across sessions. It is reproduced here exactly as produced.
Research Journal
Chronological observations, decisions, and discoveries.
2024-03-08: Initial System Build (commit 13d79b0)
Phase 1: Static sparse network on MNIST
- Built genome/network/data pipeline from scratch. 784 inputs + bias → 10 outputs.
- 5% initial_connection_fraction gives ~400 connections per network.
- Reached ~72% accuracy on MNIST — reasonable for a sparse linear classifier (full dense linear would get ~92%).
- Forward pass: topological sort → weighted sums → softmax. Backward: cross-entropy → reverse topo backprop → SGD.
- Half-cosine weighted fitness over rolling window of 1000 samples.
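The half-cosine weighting above could look roughly like the sketch below. The exact weight formula is not recorded in this journal, so the form used here (weight 1.0 for the newest sample, falling along half a cosine period toward 0.0 for the oldest) is an assumption, and both function names are hypothetical.

```rust
use std::f64::consts::PI;

/// Half-cosine weight for a sample `age` steps old (0 = newest) in a window
/// of `window` samples. Assumed form: falls from 1.0 toward 0.0 with age.
fn half_cosine_weight(age: usize, window: usize) -> f64 {
    0.5 * (1.0 + (PI * age as f64 / window as f64).cos())
}

/// Weighted accuracy over the most recent outcomes. `outcomes[0]` is the
/// newest sample; `true` means the prediction was correct.
fn weighted_fitness(outcomes: &[bool], window: usize) -> f64 {
    let mut num = 0.0;
    let mut den = 0.0;
    for (age, &correct) in outcomes.iter().take(window).enumerate() {
        let w = half_cosine_weight(age, window);
        den += w;
        if correct {
            num += w;
        }
    }
    // An empty window has no evidence either way; report 0.0.
    if den > 0.0 { num / den } else { 0.0 }
}
```

Under this assumed weighting, a recent correct answer counts for more than an old wrong one, so fitness tracks current ability rather than the long-run average.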
Phase 2-4: Evolution + Population
- First evolutionary run showed three major problems:
- Survivor fitness wipe: evolve() was rebuilding all individuals from genomes, resetting fitness trackers even for survivors. Fixed by moving survivors intact.
- Over-aggressive weight mutation: mutate_weights() was perturbing ALL ~400 connections when triggered. At stddev=0.1 this was an L2 perturbation of ~2.0 per mutation event, enough to destroy SGD-learned weights. Fixed by adding per_weight_perturb_prob = 0.10 so only ~40 weights get perturbed.
- No hidden nodes: population converging to uniform topology. Fixed by increasing structural mutation rates (add_node 3%→10%, add_connection 5%→15%) and evolve_interval (5000→10000).
- After fixes: hidden nodes appearing (796-798 nodes), peak fitness ~0.77, healthy population dynamics.
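The ~2.0 figure for the weight-mutation problem follows from the size of independent Gaussian noise: perturbing n weights with standard deviation s gives a root-mean-square L2 norm of s * sqrt(n). A minimal sketch of that arithmetic, assuming each weight is perturbed independently with the given probability (the function name is hypothetical):

```rust
/// Root-mean-square L2 norm of one mutation event that adds N(0, stddev^2)
/// noise to each of `n_weights` weights independently with probability
/// `perturb_prob`: stddev * sqrt(n_weights * perturb_prob).
fn rms_perturbation(n_weights: usize, perturb_prob: f64, stddev: f64) -> f64 {
    stddev * (n_weights as f64 * perturb_prob).sqrt()
}
```

With the numbers from this entry, rms_perturbation(400, 1.0, 0.1) gives 2.0, and rms_perturbation(400, 0.10, 0.1) gives about 0.63, so the fix cuts the typical disturbance to roughly a third.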
Key insight: SGD and evolution conflict on weights
Weight mutation is almost always harmful in this system because SGD has already optimized the weights. The per_weight_perturb_prob fix was critical — it means weight mutation now acts as a mild regularizer (perturbing ~10% of weights) rather than a catastrophic reset. The real value of mutation is in structural changes (new connections, new nodes, activation function changes).
2024-03-08: Ecological Speciation (this session)
Expanding to 20 outputs
- Changed from 10 to 20 output nodes to support both MNIST (labels 0-9) and Fashion-MNIST (labels 10-19).
- Initial connection count jumped from ~400 to ~785 (2x outputs = 2x connections from each input/bias to outputs).
- Softmax over 20 outputs during MNIST-only warm-up: the 10 Fashion outputs are essentially random noise. Softmax naturally suppresses them as the MNIST outputs learn to produce higher logits.
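To make the softmax behavior concrete, here is a small sketch showing how extra uninformative outputs take probability mass away from the target class. It is an illustration only: target_prob is a hypothetical helper, and treating the unused logits as exactly 0.0 is a simplifying assumption.

```rust
/// Numerically stable softmax: subtract the max logit before exponentiating.
fn softmax(logits: &[f64]) -> Vec<f64> {
    let max = logits.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
    let exps: Vec<f64> = logits.iter().map(|&z| (z - max).exp()).collect();
    let sum: f64 = exps.iter().sum();
    exps.iter().map(|&e| e / sum).collect()
}

/// Probability assigned to class 0 when its logit is `z` and the other
/// `n_outputs - 1` logits are all 0.0. Requires `n_outputs >= 1`.
fn target_prob(z: f64, n_outputs: usize) -> f64 {
    let mut logits = vec![0.0; n_outputs];
    logits[0] = z;
    softmax(&logits)[0]
}
```

For the same target logit, target_prob(2.0, 20) is smaller than target_prob(2.0, 10), so a 20-output network has to push the unused logits down before it can match a 10-output network's confidence.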
DataStream refactoring
- Replaced shuffle-all-and-iterate approach with ratio-based weighted random sampling.
- Datasets shared via Rc<Vec<Dataset>> so 4 niches don’t quadruple the ~94MB memory footprint.
- Label offset mechanism: Fashion-MNIST labels (0-9 in raw data) get +10 offset to map to output nodes 10-19.
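The ratio-based sampling and the label offset can be sketched as follows. This is not the project's DataStream code: the dataset ids (0 = MNIST, 1 = Fashion-MNIST), the function names, and the choice to pass the uniform draw in from the caller's RNG are all assumptions made to keep the example self-contained.

```rust
/// Classes per dataset; dataset ids are assumed to be 0 = MNIST, 1 = Fashion.
const CLASSES_PER_DATASET: usize = 10;

/// Map a raw label (0-9) to its output node by offsetting per dataset.
fn output_node(dataset_id: usize, raw_label: usize) -> usize {
    dataset_id * CLASSES_PER_DATASET + raw_label
}

/// Pick a dataset index from `ratios` (non-negative, not all zero) given a
/// uniform draw `u` in [0, 1) supplied by the caller's RNG.
fn pick_dataset(ratios: &[f64], u: f64) -> usize {
    let total: f64 = ratios.iter().sum();
    let mut acc = 0.0;
    for (i, &r) in ratios.iter().enumerate() {
        acc += r / total;
        if u < acc {
            return i;
        }
    }
    ratios.len() - 1 // guard against floating-point round-off
}
```

For an 80/20 niche, pick_dataset(&[0.8, 0.2], u) returns 0 for draws below 0.8 and 1 otherwise, and a Fashion label 3 lands on output node 13.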
Warm-up phase observations
- Stabilization detection: best fitness delta < 0.01 for 3 consecutive evolve intervals.
- Warm-up stabilized at step 110,000 (generation 10), fitness ~0.71.
- This is lower than the previous 10-output run (~0.77). Expected: with 20 outputs the softmax denominator is larger, diluting probability mass. The network must learn to suppress 10 irrelevant outputs.
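The stabilization rule (best fitness delta < 0.01 for 3 consecutive evolve intervals) can be sketched as a small state machine. The struct and method names below are hypothetical, and using the absolute change between consecutive intervals is an assumption about how "delta" is measured.

```rust
/// Reports stabilization once the fitness change stays below `threshold`
/// for `patience` evolve intervals in a row.
struct StabilizationDetector {
    threshold: f64,
    patience: usize,
    last: Option<f64>,
    streak: usize,
}

impl StabilizationDetector {
    fn new(threshold: f64, patience: usize) -> Self {
        Self { threshold, patience, last: None, streak: 0 }
    }

    /// Call once per evolve interval; returns true when stabilized.
    fn update(&mut self, fitness: f64) -> bool {
        if let Some(prev) = self.last {
            if (fitness - prev).abs() < self.threshold {
                self.streak += 1;
            } else {
                self.streak = 0; // any large move restarts the count
            }
        }
        self.last = Some(fitness);
        self.streak >= self.patience
    }
}
```

Constructed as StabilizationDetector::new(0.01, 3), it needs four readings at minimum (three small deltas) before it can fire.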
Early niche behavior (first ~87K steps after split)
- All 4 niches started from the same MNIST-trained population (fitness trackers reset).
- Early fitness across niches (sampled at various points):
- 100/0 (pure MNIST): 0.66-0.73
- 80/20: 0.63-0.69
- 50/50: 0.64-0.72
- 20/80: 0.67-0.73
- The 20/80 niche (mostly Fashion) performs comparably to 100/0 (pure MNIST). Two possible explanations:
- Fashion-MNIST may be slightly easier for these sparse linear-ish networks
- The MNIST-pretrained weights provide a useful initialization even for Fashion tasks (both are 28x28 grayscale images with similar low-level features)
- Connection counts are ~832-836 across all niches, up from warm-up’s ~785. Structural mutations adding connections steadily.
- Node counts barely changed (805-806). Node addition mutations are rarer (10% prob) and many may get culled.
- No clear divergence yet — all niches performing similarly. Need longer runs to see if ecological pressure creates measurable specialization.
Full run results (commit 21c4be1)
The full run completed: 110K warmup + 600K niche = 710K total steps, ~12 min in release.
The big surprise: U-shaped fitness across niches. The extreme niches (100/0 pure MNIST and 20/80 mostly Fashion) significantly outperform the mixed niches (80/20 and 50/50). The 20/80 niche is actually the strongest late-game performer (mean 0.7153 in last 100 samples), surpassing even pure MNIST (0.7100).
Answering the open questions:
- Topology: No meaningful divergence. All niches ended at ~842-845 connections, 807-810 nodes. Structural evolution is too slow relative to SGD weight adaptation. The genomes are essentially converging to similar topologies across all niches.
- 50/50 difficulty: 80/20 is actually harder than 50/50 (mean 0.6818 vs 0.6876). The small 20% Fashion contamination in 80/20 may be just enough to confuse MNIST-trained weights without being enough to actually learn from. The 50/50 niche at least gets enough Fashion examples to start adapting.
- Migration: Not yet tested. Given the lack of structural divergence, migration might just homogenize further. Need topological specialization first.
- Per-dataset accuracy: YES — this is the critical missing metric. The composite fitness metric hides what’s actually happening. We can’t tell if the 20/80 niche is learning Fashion or just exploiting its inherent simplicity.
Structural observations:
- Connections grew from ~785 (warm-up start) → ~833 (niche split) → ~844 (end). About 1 new connection per generation.
- Nodes grew from 805 → 807-810. Barely any new hidden nodes across 59 niche generations.
- The energy penalty (connections * 1e-6) is negligible at this scale (~0.0008). It’s not providing meaningful pressure toward smaller networks.
New insight: Fashion-MNIST might be inherently easier for sparse linear classifiers. The 20/80 niche’s strong performance is unexpected if we assume Fashion-MNIST is harder (it’s generally considered harder for CNNs). But for sparse linear classifiers connecting raw pixels to output classes, Fashion-MNIST’s categories (T-shirt, trouser, pullover, etc.) might have more distinctive pixel patterns than handwritten digits. The visual structure of a trouser (two legs) vs a bag (rectangular blob) may be more linearly separable than a 3 vs an 8.
Suggested improvements (priority order)
- Per-dataset accuracy tracking: Most important. Track MNIST and Fashion accuracy separately for each individual, both on training data and on held-out test sets. This reveals whether multi-task learning is actually happening.
- Cross-evaluation: After training, evaluate each niche’s best individual on pure MNIST and pure Fashion test sets. This directly measures transfer and specialization.
- Increase structural mutation pressure: The lack of topological divergence suggests structural mutations aren’t frequent enough or impactful enough. Options:
- Increase add_node_prob further (10% → 20%)
- Decrease energy penalty threshold or increase the coefficient
- Add a structural diversity bonus to fitness
- Add a 0/100 niche: Pure Fashion-MNIST control to establish Fashion baseline.
- Fitness function redesign: Consider task-weighted fitness where accuracy on different datasets is tracked separately and combined with niche-specific weights.
- Larger population per niche: 50 individuals per niche may not be enough for meaningful evolutionary exploration. Try 100+, or reduce the number of niches to concentrate population.
- Learning rate schedule: Currently fixed at 0.01. A warm-up → decay schedule might help convergence.
- Track the innovation tracker: How many innovations are being generated per generation? If niches share the innovation tracker but never produce the same structural mutations, the shared tracker is wasted. Could track innovation reuse as a metric of evolutionary convergence.
2026-03-08: Per-Dataset Accuracy Tracking (Experiment 3)
The big answer: Multi-task learning IS happening
Added DatasetCounter ring buffers to FitnessTracker — each training example records which dataset it came from, so we can track MNIST and Fashion accuracy independently per individual.
Key finding: All mixed niches genuinely learn both tasks. The 50/50 niche achieves ~70% on MNIST AND ~71% on Fashion simultaneously — within 1 percentage point of the pure-MNIST niche’s 70.6%. A single evolved sparse network can classify both handwritten digits and fashion items.
Detailed observations
The cost of multi-tasking is surprisingly low. The 80/20 niche only loses 2.3 percentage points of MNIST accuracy (68.3% vs 70.6%) while gaining 66.7% Fashion accuracy. That’s an extraordinary trade-off — you get a new task nearly for free.
Fashion-MNIST is definitively easier for these networks. The 20/80 niche’s Fashion accuracy (72.0%) exceeds the 100/0 niche’s MNIST accuracy (70.6%). For sparse linear classifiers operating on raw pixels, Fashion categories (trousers, bags, sneakers) apparently have more distinctive pixel patterns than handwritten digits. This resolves the open question from Experiment 2.
MNIST saturates faster than Fashion. In the 50/50 niche, MNIST accuracy barely improves over training (+0.8 percentage points early→late), while Fashion improves steadily (+4.5 percentage points). The networks reach near-maximum MNIST performance early and then mostly improve on Fashion. This makes sense — the MNIST-pretrained warm-up weights already encode digit features; Fashion must be learned from scratch.
Per-dataset accuracies are nearly independent. Pearson r between MNIST and Fashion accuracy within mixed niches is only 0.08-0.20. The two tasks don’t interfere with each other. This is a positive finding about the network’s capacity — 20 outputs with ~844 connections is enough to hold both task mappings without catastrophic interference.
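For reference, the Pearson r quoted above is the standard sample correlation. A minimal sketch, assuming the inputs are per-individual MNIST and Fashion accuracies collected within one niche (the function name is hypothetical):

```rust
/// Pearson correlation of two equal-length samples.
/// Returns None if lengths differ, n < 2, or either sample has zero variance.
fn pearson_r(xs: &[f64], ys: &[f64]) -> Option<f64> {
    if xs.len() != ys.len() || xs.len() < 2 {
        return None;
    }
    let n = xs.len() as f64;
    let mean_x = xs.iter().sum::<f64>() / n;
    let mean_y = ys.iter().sum::<f64>() / n;
    let (mut sxy, mut sxx, mut syy) = (0.0, 0.0, 0.0);
    for (&x, &y) in xs.iter().zip(ys) {
        let (dx, dy) = (x - mean_x, y - mean_y);
        sxy += dx * dy;
        sxx += dx * dx;
        syy += dy * dy;
    }
    if sxx == 0.0 || syy == 0.0 {
        return None;
    }
    Some(sxy / (sxx * syy).sqrt())
}
```

Values near 0, like the 0.08-0.20 reported here, mean that knowing an individual's MNIST accuracy says little about its Fashion accuracy.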
Implementation notes
- DatasetCounter uses same ring-buffer design as the main FitnessTracker
- Each counter has its own seen bitmap: when a dataset isn’t sampled, its counter’s current slot is marked unseen and the slot advances. This means the effective window for per-dataset accuracy is smaller than fitness_window in proportion to the dataset’s ratio. For 80/20 niche, Fashion counter sees only ~200 of the 1000-slot window filled.
- The Example struct already had dataset_id from the DataStream refactor; it just needed to be threaded through train_on_example → fitness.record.
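A sketch of the ring buffer with a seen bitmap described in these notes. The type name matches the journal, but the fields, method names, and the use of Option<bool> to signal "this dataset was not sampled" are assumptions, not the project's actual code.

```rust
/// Ring buffer of per-example outcomes for one dataset.
struct DatasetCounter {
    correct: Vec<bool>,
    seen: Vec<bool>,
    cursor: usize,
}

impl DatasetCounter {
    fn new(window: usize) -> Self {
        assert!(window > 0, "window must be non-zero");
        Self { correct: vec![false; window], seen: vec![false; window], cursor: 0 }
    }

    /// Record one training step. `outcome` is Some(correct) if this dataset
    /// supplied the example, None if another dataset was sampled instead.
    /// The slot advances either way, so unseen steps still age out old data.
    fn record(&mut self, outcome: Option<bool>) {
        self.seen[self.cursor] = outcome.is_some();
        self.correct[self.cursor] = outcome.unwrap_or(false);
        self.cursor = (self.cursor + 1) % self.seen.len();
    }

    /// Accuracy over the slots that were actually seen, if any.
    fn accuracy(&self) -> Option<f64> {
        let (mut seen, mut hits) = (0usize, 0usize);
        for (&s, &c) in self.seen.iter().zip(&self.correct) {
            if s {
                seen += 1;
                if c {
                    hits += 1;
                }
            }
        }
        if seen == 0 { None } else { Some(hits as f64 / seen as f64) }
    }
}
```

Because the cursor advances on every step, a dataset sampled 20% of the time fills only about a fifth of its slots, which is the ~200-of-1000 effect noted above.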
Next steps
- Add cross-evaluation: test each niche’s best on pure MNIST and pure Fashion datasets
- Add 0/100 (pure Fashion) niche as control
- Increase structural mutation rates to try driving topological divergence
2026-03-08: Cross-Evaluation + 0/100 Niche + Structural Mutation (Experiment 4)
Cross-evaluation reveals zero transfer between tasks
The most important result yet: pure-task niches show zero cross-task transfer. The 100/0 niche’s best individual gets 47/60000 correct on Fashion-MNIST (0.08%). The 0/100 niche gets 492/60000 on MNIST (0.82%). Both are far below the 5% chance level for a 20-output softmax, which suggests each network has learned to suppress the output nodes it never trained on rather than guessing among them.
This proves multi-task capability must be actively trained. The MNIST-pretrained warm-up weights provide NO useful features for Fashion classification — despite both being 28x28 grayscale images. The pixel patterns that distinguish digits (curves, loops, line segments) are fundamentally different from the patterns that distinguish fashion items (silhouettes, textures, shapes).
Fashion confirmed easier: 75.6% vs 70.5%
The 0/100 niche achieves 75.6% Fashion accuracy on the full 60K dataset, versus 70.5% MNIST for the 100/0 niche. Fashion-MNIST is definitively easier for sparse linear classifiers. This may seem counterintuitive (Fashion is harder for CNNs), but makes sense for raw-pixel linear classifiers — fashion categories have distinctive overall shapes (trouser=two legs, bag=rectangle, ankle boot=L-shape) that are more linearly separable in pixel space than handwritten digits.
Multi-task networks are the most capable
If we measure total capability (correct predictions across both tasks on 120K examples):
- 50/50 niche: 82,854 correct (69.0%)
- 20/80 niche: 82,886 correct (69.1%)
- 80/20 niche: 79,842 correct (66.5%)
- 100/0 niche: 42,342 correct (35.3%)
- 0/100 niche: 45,846 correct (38.2%)
The mixed niches are objectively more capable — a single 845-connection network handling both tasks almost as well as specialized networks handle one task each. The “cost” of multi-tasking: 50/50 loses ~2.3pp on MNIST and ~5.7pp on Fashion vs pure niches.
Structural mutation: early signs of divergence
With add_node_prob doubled (10%→20%):
- Node counts now span 808-811 (vs 807-809 in Experiment 3)
- Connection counts span 808-848 — a 40-connection spread (vs 841-845 previously)
- Fashion-heavy niches (20/80, 0/100) accumulate more connections (847-848) than MNIST-heavy ones (808-809)
- This could mean Fashion classification benefits from more network complexity, or it could be noise
The 80/20 problem
The 80/20 niche consistently underperforms. Across all four experiments, it’s always the weakest. The mechanism seems to be: 20% Fashion examples are enough to inject gradient noise into MNIST-trained weights but not enough to build robust Fashion representations. The cross-eval confirms this: 63.9% Fashion (worst of all mixed niches) and 69.2% MNIST (second-worst after 20/80’s MNIST). It’s not just that 80/20 is bad at Fashion — it’s slightly worse at MNIST too compared to pure 100/0 (69.2% vs 70.5%).
Next experiments to try
- Population size 100 per niche — more evolutionary diversity
- Longer runs (1.2M niche steps) — see if structural divergence continues
- Fitness function redesign — reward multi-task breadth
- Migration between niches — share successful genomes across ecological boundaries
2026-03-08: Large Population Experiment (Experiment 5)
Population size is the key variable
Doubled population (50→100) and niche steps (600K→1.2M). 179 generations total, ~70 min runtime.
The 80/20 problem is fixed. With population 100, the 80/20 niche’s best individual achieves 75.3% MNIST accuracy on the full 60K dataset — MATCHING the pure-MNIST niche (also 75.3%). It also achieves 66.9% Fashion. The “worst of both worlds” effect from small populations was a population diversity problem, not a fundamental issue with the 80/20 distribution.
New records across the board. All task accuracies improved 1-6pp over Experiment 4. The biggest gain was 80/20’s MNIST (+6.1pp), the smallest was 0/100’s Fashion (+1.1pp, was already high).
50/50 niche = best generalist. 72.8% MNIST + 72.2% Fashion = 72.5% combined accuracy on 120K examples. A single 867-connection network performing almost as well as two separate specialists. The cost of multi-tasking: only 2.5pp below the pure-MNIST specialist.
Structural convergence. All niches ended at ~866-868 connections and 816-817 nodes despite starting the run with visible structural divergence (80/20 was 10 connections lighter mid-run). The conclusion: at this network scale (~860 connections), there’s an “optimal size” that all niches converge to regardless of task distribution. Structural evolution explores but finds the same optimum.
Interesting asymmetry in transfer. The 0/100 (pure Fashion) niche shows 7.6% MNIST accuracy — above the 5% chance level, suggesting slight positive transfer from Fashion features to digit recognition. But 100/0 (pure MNIST) shows 0.00% Fashion accuracy (1/60000) — zero transfer in the other direction. Fashion images apparently contain features mildly useful for digit classification (maybe edge patterns?), but digit features are useless for fashion classification.
The stabilization problem
The warm-up phase ran the full 600K max_steps without triggering stabilization (delta < 0.01 for 3 consecutive intervals). With 100 individuals, the best fitness oscillates too much. The detector needs to be adapted for larger populations — either by tracking average fitness instead of best, or by relaxing the patience parameter.
What I learned about this system
- Population diversity > structural complexity. The jump from 50→100 individuals improved accuracy more than doubling the structural mutation rate. The evolutionary search through weight combinations (via crossover of SGD-trained parents) is the primary source of improvement, not topology exploration.
- Multi-task learning scales with evolution time. 179 generations consistently outperforms 72 generations across all metrics. The networks haven’t fully converged even at 1.8M steps, so there’s probably more to gain with longer runs.
- The system design is validated. A population of 100 small networks (~866 connections), evolving topology via NEAT and training weights via online SGD, can simultaneously classify handwritten digits at 73% and fashion items at 72% from raw 784-dimensional pixel vectors, with no feature engineering, no convolutional structure, no batch training.
Next experiments
- Try learning rate decay — may help with convergence
- Fix the stabilization detector for large populations
- Consider migration between niches now that structural convergence is confirmed
- Population 200 to test diminishing returns
2026-03-08: Learning Rate Decay — A Cautionary Result (Experiment 6)
TL;DR: Learning rate decay hurt because the warm-up got shortened
Added linear LR decay (0.01→0.001) and fixed warm-up stabilization to use avg fitness instead of best fitness. The stabilization worked TOO well: it triggered at 80K steps (vs 600K in Exp 5), giving each individual only 800 examples of MNIST pre-training. All cross-eval accuracies dropped 1-3pp.
The lesson: stabilization detection needs a minimum floor
The avg-fitness stabilization detector is correct in principle (less noisy than best-fitness), but it’s too sensitive. Average fitness stabilizes before the population has learned enough. Need a min_warmup_steps parameter so that stabilization checking doesn’t begin until a minimum training period has elapsed.
LR decay may still be worth testing
The 0.01→0.001 decay did produce slightly lower late-phase fitness variance (need to confirm this quantitatively). The 80/20 niche’s Fashion accuracy was actually +0.53pp vs Experiment 5 — the only metric that improved. If the warm-up can be extended back to ~200K+ steps, the decay might help in the niche phase.
An important meta-lesson about confounded experiments
This experiment changed TWO things simultaneously (LR decay + stabilization method), making it impossible to isolate causation. The performance drop is clearly from shorter warm-up (the stabilization change), not from LR decay (the intended change). Should have changed one thing at a time — but sometimes you learn more from mistakes than from successes. Adding min_warmup_steps and rerunning.
Next: add min_warmup_steps and rerun with LR decay
2026-03-08: LR Decay Retest with Warm-Up Floor (Experiment 7)
Structural divergence: the breakthrough
With min_warmup_steps=200K, the LR decay experiment produced the most structural divergence ever seen across niches:
- 80/20 niche: 878 avg connections (highest ever)
- 0/100 niche: 852 avg connections (lightest)
- Spread: 26 connections (vs ~4-5 in previous experiments)
The 80/20 niche (which historically struggled most) accumulates the most network complexity. This makes intuitive sense — handling the difficult 80% MNIST / 20% Fashion confusion pattern requires more representational capacity than pure single-task training.
Fashion-heavy niches are leaner: the pure Fashion niche (0/100) and Fashion-dominant niche (20/80) are the lightest, consistent with Fashion being an “easier” task for these networks.
Why LR decay might help structural evolution
The hypothesis: when LR is high (0.01), SGD can quickly adapt weights to compensate for any structural change, making structural mutations nearly neutral. When LR is low (0.003-0.005 in late training), SGD adapts weights more slowly, so structural mutations (adding connections/nodes) have more persistent impact on fitness. This makes topology matter more, enabling niches to diverge structurally.
If this hypothesis is correct, we should see MORE structural divergence with MORE aggressive LR decay (e.g., 0.01→0.0001). But we need adequate warm-up first.
Accuracy: still slightly behind Experiment 5
All cross-eval metrics are 0.5-2pp below Experiment 5 (constant lr=0.01). The primary suspect is still warm-up duration: 270K steps (Exp 7) vs 600K (Exp 5), giving 2700 vs 6000 examples per individual. A clean comparison would use the same warm-up duration for both.
Design consideration: decouple warm-up LR from niche LR
The current LR decay runs across the entire training duration. A better design might be:
- Warm-up: constant lr=0.01 (maximizes MNIST learning speed)
- Niche phase: decay from 0.01 to 0.001 (enables structural divergence)
This would give each niche the benefit of a fully-trained warm-up while still allowing late-phase structural differentiation.
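The decoupled schedule proposed above can be written as one function of the global step. This is a sketch under the assumption that the decay is linear over the niche phase (the journal describes the decay as linear); the function name and parameter names are hypothetical.

```rust
/// Learning rate at global `step`: constant `lr_start` during warm-up, then
/// linear decay to `lr_end` over `niche_steps`, held at `lr_end` afterwards.
fn learning_rate(
    step: u64,
    warmup_steps: u64,
    niche_steps: u64,
    lr_start: f64,
    lr_end: f64,
) -> f64 {
    if step <= warmup_steps || niche_steps == 0 {
        return lr_start;
    }
    let progress = ((step - warmup_steps) as f64 / niche_steps as f64).min(1.0);
    lr_start + (lr_end - lr_start) * progress
}
```

With a 600K warm-up and 1.2M niche steps, the rate stays at 0.01 through step 600K, reaches about 0.0055 at the midpoint of the niche phase, and ends at 0.001.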
Next steps
- Decouple warm-up LR from decay schedule
- Increase min_warmup_steps to match Experiment 5’s effective warm-up
- Test more aggressive LR decay (0.01→0.0001) to amplify structural divergence
2026-03-08: Decoupled LR Schedule (Experiment 8)
Result: matches baseline accuracy, moderate structural divergence
The decoupled LR schedule (constant lr=0.01 for warm-up, decay 0.01→0.001 in niche phase only) successfully recovers the accuracy that Experiments 6-7 lost due to premature/short warm-up.
Cross-eval results nearly identical to Experiment 5 (constant lr=0.01):
- 50/50 niche: 72.4% MNIST + 72.6% Fashion = 72.5% total (same as Exp 5)
- 80/20 niche: 74.7% MNIST + 67.1% Fashion = 70.9% total (vs 71.1% in Exp 5)
- 20/80 niche: 69.2% MNIST + 75.2% Fashion = 72.2% total (vs 72.1% in Exp 5)
Structural divergence: 20/80 at 875 avg connections, 100/0 at 858 — a 17-connection spread. Less than Exp 7’s 26-connection spread but more than Exp 5’s negligible spread.
Key learning: warm-up quality dominates accuracy
Across Experiments 5-8, the single most important variable for final cross-eval accuracy is warm-up duration/quality. Experiments 6 and 7 had shorter warm-ups (80K and 270K) and lost 1-3pp. Experiments 5 and 8 had 600K warm-ups and matched each other. The niche-phase LR decay is a secondary factor — it modestly helps structural divergence but doesn’t meaningfully change accuracy when warm-up is adequate.
Design resolution
Adopting decoupled LR (constant warm-up + niche-phase decay) as the default going forward. It gives us:
- Same accuracy as constant-lr (Exp 5)
- Moderate structural divergence (better than Exp 5, less than Exp 7)
- Clean separation of concerns: warm-up phase optimizes for learning quality, niche phase optimizes for fine-tuning + structural differentiation
What’s next: the system is converging
After 8 experiments, the basic system design feels stable. The main hyperparameters are tuned (pop 100, lr 0.01→0.001, 1.8M total steps, 5 niches). The remaining design space to explore:
- Structural diversity — how to drive more topological specialization between niches
- Inter-niche dynamics — migration, competition, or resource sharing
- Network expressiveness — activation functions beyond identity (tanh, ReLU) for hidden nodes
- Fitness landscape — should multi-task breadth be explicitly rewarded?
Each of these is a qualitative shift, not a hyperparameter tweak. Time to pick the highest-leverage one.
Performance Optimization: 70min → 3min (22.5x speedup)
Phase 1: Rayon parallelism (Experiment 9, commit 8d656a1)
- Rc→Arc for thread safety, zero-copy input sampling, par_iter_mut on individuals and niches
- Result: 70min → 15min (~4.7x)
Phase 2: Profiling-guided hot-path optimization
Profiled with perf record (~783K samples). Top hot spots:
- forward.rs inner loop (weighted sum): 20.08%
- forward.rs input copy + NodeKind match: 13.83%
- backward.rs inner loop (gradient prop + weight update): 19.26%
- backward.rs reverse topo + match: 7.22%
- Bounds checking (index.rs): ~5.9%
- Rayon/crossbeam overhead: ~8.7%
Three optimizations applied:
- Eliminate NodeKind match: Store input_bias_count in Network. Forward/backward loops skip input/bias nodes by range instead of per-node match. Saves ~13% (forward match + backward match).
- Unsafe get_unchecked: All indices in connection loops come from compiled topology and are provably valid. Eliminates bounds checks in the hottest inner loops. Saves ~6%.
- Batched training: Pre-sample 100 examples, then train all individuals in one rayon dispatch per batch. Each rayon task does 100 forward+backward passes (500K FLOPs) instead of 1 (5K FLOPs). Reduces rayon dispatch overhead from ~9% to negligible.
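The first two optimizations can be illustrated together. The sketch below is not the project's forward.rs: the Conn and Network layouts, the offsets array, and the identity activation are assumptions. The point it shows is that indices are validated once at construction, which is what makes the unchecked accesses in the inner loop sound, and that input/bias nodes are skipped by range rather than by a per-node match.

```rust
/// One incoming connection in the compiled topology.
struct Conn {
    src: usize,
    weight: f32,
}

/// Nodes 0..input_bias_count are inputs/bias. Every node `n` owns the
/// incoming connections `conns[offsets[n]..offsets[n + 1]]`, and nodes are
/// assumed to be stored in topological order.
struct Network {
    node_count: usize,
    input_bias_count: usize,
    offsets: Vec<usize>,
    conns: Vec<Conn>,
}

impl Network {
    /// Validates every index once, so `forward` can skip bounds checks.
    fn new(
        node_count: usize,
        input_bias_count: usize,
        offsets: Vec<usize>,
        conns: Vec<Conn>,
    ) -> Option<Self> {
        let ok = input_bias_count <= node_count
            && offsets.len() == node_count + 1
            && offsets.windows(2).all(|w| w[0] <= w[1])
            && offsets.last() == Some(&conns.len())
            && conns.iter().all(|c| c.src < node_count);
        if ok { Some(Self { node_count, input_bias_count, offsets, conns }) } else { None }
    }

    /// Weighted sums with identity activation. Input/bias values must
    /// already be written into `values[..input_bias_count]`.
    fn forward(&self, values: &mut [f32]) {
        assert_eq!(values.len(), self.node_count);
        // Range-based skip: no per-node kind check inside the loop.
        for n in self.input_bias_count..self.node_count {
            let mut sum = 0.0f32;
            for k in self.offsets[n]..self.offsets[n + 1] {
                // SAFETY: `new` checked k < conns.len() and src < node_count,
                // and values.len() == node_count was asserted above.
                unsafe {
                    let c = self.conns.get_unchecked(k);
                    sum += c.weight * *values.get_unchecked(c.src);
                }
            }
            values[n] = sum;
        }
    }
}
```

Keeping the fields private to the module that validates them is what stops later code from invalidating the invariants the unsafe block relies on.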
Result: 15min → 3min 7sec (4.8x). Total: 70min → 3min (22.5x).
Cross-evaluation results are consistent with previous experiments — no accuracy regression from optimizations.
Experiment 9: Ring migration — a clear negative result
Implemented ring migration (best individual from niche[i] → niche[(i+1) % 5] every 100K steps). The hypothesis was that sharing successful genomes between niches could enable knowledge transfer.
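The ring topology amounts to a one-slot rotation of each niche's best genome. A minimal sketch with hypothetical function names, generic over whatever the genome type is:

```rust
/// Index of the niche that receives migrants from niche `i` in a ring of
/// `n` niches. Requires `n > 0`.
fn ring_target(i: usize, n: usize) -> usize {
    (i + 1) % n
}

/// Given each niche's best genome, return what each niche receives:
/// slot `ring_target(i, n)` gets a clone of `best[i]`.
fn ring_migrants<G: Clone>(best: &[G]) -> Vec<G> {
    let n = best.len();
    // The sender for `dst` is the niche one step behind it on the ring.
    (0..n).map(|dst| best[(dst + n - 1) % n].clone()).collect()
}
```

If the niches are ordered 100/0, 80/20, 50/50, 20/80, 0/100, the 0/100 niche receives from 20/80 and the 100/0 niche receives from 0/100, the largest distribution mismatch on the ring.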
Result: Migration hurt 4/5 niches (-0.6pp to -1.6pp). Only the 0/100 niche benefited (+1.1pp), because its migrants come from 20/80 which trains on mostly Fashion data — the distributions overlap.
Why it fails: Ring migration forces distribution-mismatched individuals into foreign niches. A Fashion specialist in a MNIST niche either dies (wasting evolutionary bandwidth) or dilutes the gene pool with maladapted features. The 50/50 balanced niche suffered most (-1.6pp), perhaps because it’s equidistant from both extremes and gets the most disruptive migrants.
Structural homogenization: Migration also collapsed the structural divergence from 17-connection spread (Exp 8) to 11-connection spread. It actively counteracts topological specialization.
Takeaway: Ecological speciation works BECAUSE niches are isolated. Simple migration breaks the isolation without compensating benefits. If migration is worth revisiting, it should be similarity-aware (adjacent niches only) and breed-based (crossover with a native) rather than replacement-based.
Reverting migration_interval to 0 (disabled) for future experiments.
Experiment 10: Aggressive Structural Mutation — diminishing returns
Raised both structural mutation rates (add_node: 0.20→0.30, a 1.5x increase; add_connection: 0.15→0.30, a doubling). The prediction was that more structural mutations would drive more topological divergence between niches.
Result: More bulk complexity (+30 connections across all niches), but LESS inter-niche divergence (10-connection spread vs 17 in Experiment 8). The pure-task niches improved slightly (+1.4pp average) but mixed niches degraded slightly (-0.5pp average).
Why more mutation doesn’t mean more divergence
The key insight: structural divergence between niches comes from differential selection on structural variants, not from more structural variants. With 30% add_node/add_conn, every niche generates enormous structural novelty — but selection applies the same culling pressure everywhere, washing out most of it identically. The ecological differentiation in Experiments 7-8 came from LR decay making structural changes matter more to fitness, not from generating more of them.
Analogy: increasing mutation rate is like adding more paint cans. Structural divergence requires different niches to paint different pictures — that comes from different selection pressures, not more paint.
The 20/80 niche lost its structural uniqueness
In Experiment 8, 20/80 was the clear structural outlier at 875 avg connections (+17 over 100/0). With aggressive mutation, it’s actually the LEAST complex at 887 (10 below 100/0’s 897). High mutation rate overwhelms the ecological signal that previously drove 20/80 to accumulate more capacity. The biological analogy: if every organism mutates aggressively, the subtle fitness differences between niches get drowned in noise.
Reverting to default rates
Structural mutation at 0.20/0.15 is near-optimal. The path to more divergence is through the LR schedule (stronger decay) or selection pressure (fitness landscape changes), not through mutation rate.
Experiment 11: Aggressive LR Decay — the Goldilocks lesson
The hypothesis was wrong
Predicted that stronger LR decay (0.01→0.0001 vs 0.01→0.001) would amplify the structural divergence seen in Experiments 7-8 by making structural mutations even more influential relative to weight learning. Instead, accuracy dropped 1-2pp across all mixed niches and structural divergence actually decreased (14-conn spread vs 17 in Exp 8).
Why too-aggressive LR decay hurts everything
With LR decaying to 0.0001, networks in the last ~300K niche steps are effectively frozen. SGD updates move weights by < 0.0001 per example — negligible for any single training sample. This creates a “frozen landscape”:
- Weight mutations can’t improve fitness (SGD too slow to train new weights)
- Structural mutations can’t improve fitness (added connections can’t learn useful weights)
- Selection has no signal (everything is near-equal fitness)
- Niches can’t differentiate (no differential selection pressure)
The Goldilocks zone for LR decay
Three data points now:
- Constant lr=0.01 (Exp 5): Good accuracy, no structural divergence
- Decay to 0.001 (Exp 8): Good accuracy, moderate structural divergence (17-conn spread)
- Decay to 0.0001 (Exp 11): Bad accuracy (-2pp), less structural divergence (14-conn spread)
The sweet spot is around 0.001. The LR needs to be low enough that structural changes matter but high enough that networks can still learn. The window is roughly 0.001-0.005 for the final LR.
Recurring pattern: pure 0/100 niche benefits from degraded learning
For the third consecutive experiment (9, 10, 11), the pure Fashion niche is the only one that improves. Pattern: any change that slows/disrupts late-phase learning helps 0/100 but hurts mixed niches. This may indicate that Fashion-MNIST features are “fragile” — once learned, they’re easily overwritten by continued training on the same distribution. Lower late LR protects them. Mixed niches can’t benefit because they need ongoing adaptation to handle both tasks.
Next direction: try something qualitatively different
Experiments 9-11 all explored “more of the same” — more migration, more mutation, more decay — and all produced diminishing or negative returns. The system may be near a local optimum for the current architecture (sparse linear classifiers with NEAT topology evolution + SGD weight training).
Qualitatively different directions to consider:
- Longer warm-up — push from 600K to 1M steps. This is the most reliable positive intervention (Exp 5 vs 6-7 showed warm-up quality dominates).
- Population 200 — the last time we increased pop (50→100 in Exp 5) it gave +4-6pp. Diminishing returns are likely but worth measuring.
- Fitness landscape redesign — the current fitness function doesn’t explicitly reward multi-task breadth. A 50/50 niche individual that gets 72% on both tasks has the same fitness as one that gets 80% on one and 64% on the other. An entropy bonus or breadth reward could drive more balanced multi-task learning.
- Architecture change — move beyond per-connection evolution to something that can express more complex functions (attention, gating, modularity).
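To illustrate the fitness-redesign idea in the list above: an averaged accuracy cannot tell a balanced individual from a lopsided one, while a breadth-sensitive combination can. The harmonic mean used below is only one hypothetical choice of breadth reward, not something this journal has tested, and both function names are made up for the example.

```rust
/// Arithmetic mean of two per-task accuracies: blind to imbalance.
fn mean_fitness(mnist: f64, fashion: f64) -> f64 {
    (mnist + fashion) / 2.0
}

/// Harmonic mean of two per-task accuracies: equals the arithmetic mean
/// when the tasks are balanced and drops as they diverge.
fn harmonic_fitness(mnist: f64, fashion: f64) -> f64 {
    if mnist + fashion == 0.0 {
        0.0
    } else {
        2.0 * mnist * fashion / (mnist + fashion)
    }
}
```

For the example in the text, both 72%/72% and 80%/64% average to 0.72, but the harmonic mean scores them at 0.72 and about 0.711 respectively, so selection would favor the balanced individual.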
Experiment 12: Extended Warm-Up — the plasticity trap
Counter-intuitive result: more warm-up hurts
Extended max_steps from 600K to 1M. Warm-up stabilized at 670K (avg fitness 0.7439, 66 generations). All mixed niches lost 1-2pp accuracy despite starting from a higher warm-up fitness. Only 0/100 Fashion improved (+1.2pp).
The plasticity hypothesis
A population that’s “fully trained” on MNIST-only is over-specialized. When split into niches with Fashion data, the population’s topology and weight patterns are optimized for MNIST digit features. The Fashion-containing niches (80/20, 50/50, 20/80) need to partially unlearn MNIST specialization and develop Fashion features — harder when the starting point is more deeply entrenched.
In Experiment 8, the population was still improving at 600K steps. This “still learning” state means the population retains plasticity — it hasn’t settled into a deep MNIST-specific optimum. The niche split catches the population in a more adaptable state.
Structural convergence: warm-up homogenizes topology
The extra 7 evolution cycles in warm-up pushed the population further toward MNIST-optimal topology. 20/80 lost its structural uniqueness (862 connections vs 875 in Exp 8). With less starting topology diversity, niche-phase LR decay can’t recreate the divergence.
The stabilization detector is a false positive indicator
The detector triggers when avg fitness delta < 0.01 for 3 intervals. But “stabilized” doesn’t mean “optimized for niche split.” It means “the population has stopped improving on MNIST-only.” This is exactly the wrong moment to split — it means the population has maximally specialized on the warm-up task and minimally retains plasticity for the new data.
The BETTER warm-up exit is the one in Exp 8: hitting max_steps while still improving. The population is “good enough” at MNIST but not over-committed.
Implications for warm-up design
The ideal warm-up duration is NOT “until convergence.” It’s “long enough to learn useful features, short enough to retain adaptability.” For this system, that’s roughly 400K-600K steps. The stabilization detector should perhaps be replaced with a fixed warm-up duration.
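For reference, the two warm-up exit rules discussed here can be sketched side by side. The detector follows the rule as stated (average-fitness delta below 0.01 for 3 intervals); using the absolute delta is an assumption, and the type and function names are hypothetical rather than taken from the codebase.

```rust
/// Flags "stabilized" once the average-fitness change stays below
/// `threshold` for `patience` consecutive evaluation intervals.
struct StabilizationDetector {
    threshold: f64,
    patience: usize,
    last_avg: Option<f64>,
    quiet_intervals: usize,
}

impl StabilizationDetector {
    fn new(threshold: f64, patience: usize) -> Self {
        Self { threshold, patience, last_avg: None, quiet_intervals: 0 }
    }

    /// Feed one interval's average fitness; returns true when stabilized.
    fn observe(&mut self, avg_fitness: f64) -> bool {
        if let Some(prev) = self.last_avg {
            // Assumption: the absolute change is what gets compared.
            if (avg_fitness - prev).abs() < self.threshold {
                self.quiet_intervals += 1;
            } else {
                self.quiet_intervals = 0;
            }
        }
        self.last_avg = Some(avg_fitness);
        self.quiet_intervals >= self.patience
    }
}

/// Fixed-duration alternative: leave warm-up at a step budget,
/// regardless of whether fitness is still improving.
fn fixed_warmup_done(step: u64, warmup_steps: u64) -> bool {
    step >= warmup_steps
}
```

The detector answers "has MNIST-only learning stopped?", which is the moment this experiment suggests is too late. The fixed rule ignores the fitness curve entirely, so the population can still be improving when the niches split.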
The experiments 9-12 pattern
Four consecutive experiments (migration, aggressive mutation, aggressive LR decay, extended warm-up) have all been net-negative vs Experiment 8. The system is at a local optimum for hyperparameter tuning. The next breakthrough requires a qualitative change, not a quantitative one.
Top candidates:
- Fitness function redesign — reward multi-task breadth explicitly
- Population 200 — last known positive intervention type (pop 50→100 gave +4-6pp)
Experiment 13: Population 200 — the first positive result since Experiment 8
Finally, an improvement
After four consecutive negative experiments (9-12), population 200 broke through:
- 100/0: 77.78% MNIST (+2.77pp) — new best
- 50/50: 74.34% MNIST (+1.97pp) + 72.93% Fashion (+0.29pp) = 73.6% average (new best multi-task)
- 0/100: 75.45% Fashion (+0.70pp) + 6.48% accidental MNIST transfer (vs 2.68%) — highest cross-task leak ever
Diminishing returns are real but manageable
Pop 50→100 gave +4-6pp (Experiment 5). Pop 100→200 gives +0.5-2.2pp (this experiment). Returns fell by well over half (midpoints of roughly 5pp vs 1.4pp). Prediction: pop 200→400 would give +0.2-1pp, probably still worth it if compute allows.
Smaller networks with bigger populations
Counter-intuitive: pop 200 networks average 824-842 connections, while pop 100 networks average 858-875. More individuals = stronger selection pressure = leaner survivors. The energy penalty (connections × 1e-6) becomes a stronger differentiator when competition is fiercer. This means more population → more efficient networks, not just better networks.
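To get a feel for the size of the energy term, here is a small sketch. It assumes the penalty is simply subtracted from the accuracy-based fitness; the journal only gives the magnitude (connections × 1e-6), so the subtraction form and the function name are assumptions.

```rust
/// Accuracy-based fitness minus a per-connection energy cost.
/// Assumption: the penalty is subtracted; only its size is documented.
fn penalized_fitness(raw_fitness: f64, connections: usize) -> f64 {
    const ENERGY_PER_CONNECTION: f64 = 1e-6;
    raw_fitness - connections as f64 * ENERGY_PER_CONNECTION
}
```

At equal raw fitness, an 824-connection network beats an 875-connection one by only 5.1e-5, so under this assumption the penalty acts as a tie-breaker. That fits the observation: it matters only when many individuals have nearly identical accuracy, which is more likely in a larger population.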
Population size: the ONLY consistently positive lever
Across all experiments:
- Pop 50→100: +4-6pp (Exp 5 vs 4)
- Pop 100→200: +0.5-2.2pp (Exp 13 vs 8)
- Migration: negative (Exp 9)
- Aggressive mutation: negative (Exp 10)
- Aggressive LR decay: negative (Exp 11)
- Extended warm-up: negative (Exp 12)
Population size is uniquely positive because it improves evolutionary search quality without changing the learning dynamics. More individuals = better sampling of the fitness landscape = better selection = better offspring. Every other change we tried altered the learning dynamics (LR, mutation, warm-up) and disrupted the balance.
Adopting pop 200 as new default
The ~2x compute cost (5.5min vs 3min) is acceptable for +1-2pp accuracy. Updating the default.
Experiment 14: Seeded Hidden Layer — 78% → 96% in one change
The breakthrough: architecture, not hyperparameters
After 13 experiments of tuning hyperparameters within a sparse linear classifier architecture (best: 77.78% MNIST), one architectural change — seeding networks with a 784→32→10 hidden layer — jumped to 95.87% MNIST with ~2,943 connections.
This is +18pp in a single experiment. For comparison, all 12 previous hyperparameter experiments combined produced roughly +8pp improvement (from ~70% to ~78%).
Why this works
The previous architecture was fundamentally a sparse logistic regression: ~830 random input→output connections, no nonlinearity on the path from input to output. Hidden nodes existed but were isolated (one input, one output from the connection split) and almost never enriched.
The seeded hidden layer provides:
- Nonlinear features: 32 ReLU hidden nodes each seeing ~78 of 784 inputs. These can compute edges, corners, stroke detectors — the building blocks of digit recognition.
- A trainable starting point: SGD can immediately train useful features, unlike the old architecture where hidden nodes had to be grown one at a time by evolution.
- A pruning/refinement problem for evolution: Instead of “build a network from scratch,” evolution now solves “which of these ~2,800 connections are useful?” — a much easier problem.
Training dynamics are completely different
- 89% in 20K steps (gen 1!) vs 65% at 20K before
- 91% in 50K steps vs ~70% at 50K before
- 95% by 500K steps — the old system never reached this
- Convergence is essentially complete by step 1M; the remaining 800K steps provide <1pp marginal gain
Connection count equilibrium
Started at ~2,760, stabilized at ~2,940-2,945. The remove_connection mutation (5%) and add_connection/add_node mutations roughly balance. Only +5 hidden nodes added over 179 generations — the 32 pre-seeded ones are sufficient.
Parameter efficiency
2,943 connections at 95.87% vs:
- Dense linear (7,840 weights): ~92% — we beat it with 37% of parameters
- Dense 784→32→10 (25,408 weights): ~96-97%; we match it with 11.6% of parameters
- Our previous sparse linear (830 weights): 78% — same system, just better starting topology
Implications
- Architecture >> hyperparameters: One structural change outweighed 12 experiments of tuning. This is the lesson of deep learning applied to neuroevolution: the topology matters more than the training schedule.
- NEAT as subnetwork discovery: With a seeded hidden layer, NEAT’s role shifts from “evolve a network from scratch” to “discover the optimal sparse subnetwork within a given architecture.” This closely resembles the lottery ticket hypothesis, with evolutionary selection as a natural mechanism for finding winning tickets.
- Multi-task potential: The obvious next step is re-enabling 20 outputs and 5 niches with the seeded hidden layer. If the sparse linear system achieved 72-74% multi-task accuracy, a hidden-layer system might reach 90%+.
- Deeper networks: If 784→32→10 gets 96%, what about 784→64→32→10? The system already handles arbitrary DAG topologies; deeper initialization is a config change, not an architecture change.
Experiments 15-16: Width vs Density scaling
Width wins over density at the same parameter budget
Two experiments at ~5,500 connections:
- Exp 15: 64 nodes × 78 inputs each → 97.23%
- Exp 16: 32 nodes × 157 inputs each → 97.06%
Many narrow feature detectors beat fewer wide ones, though only by 0.17pp in this single comparison. This makes intuitive sense: each hidden node is more useful when it specializes in detecting a specific pattern in a local input region, and wider receptive fields dilute the signal.
The ~97% single-layer ceiling
Both configurations converge to ~97%. A fully-connected 784→64→10 gets ~97.5-98%. We’re at ~97.2% with 11% of those parameters. The gap is closing but the last 0.5-1pp will be hard to get without depth.
Scaling trajectory
| Hidden | Connections | MNIST | Δ from prev |
|---|---|---|---|
| 32 | 2,943 | 95.87% | (baseline) |
| 64 | 5,673 | 97.23% | +1.36pp |
| 128? | ~11,000? | 97.5-98%? | +0.3-0.8pp? |
Prediction: 128 nodes will give ~97.5-97.8%, diminishing returns. Getting past 98% probably requires depth.
Experiment 17: 128 hidden nodes — prediction was wrong (in a good way)
The scaling is better than expected
I predicted 97.5-97.8% for 128 nodes. Got 98.70%. The +1.47pp from 64→128 is actually slightly LARGER than the +1.36pp from 32→64. Width scaling is not diminishing yet — it may even be accelerating.
The width scaling law
| Hidden | Connections | MNIST | Δ from prev |
|---|---|---|---|
| 32 | 2,943 | 95.87% | — |
| 64 | 5,673 | 97.23% | +1.36pp |
| 128 | 11,498 | 98.70% | +1.47pp |
| 256? | ~23,000? | 99.0-99.2%? | +0.3-0.5pp? (ceiling) |
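The ~11% ratio to the dense equivalent is remarkably consistent across widths. Part of this is by construction: each seeded hidden node sees ~78 of 784 inputs (~10%), so the networks start near that density. What evolution adds is that the count barely moves from there, which suggests the remaining ~89% of dense connections are not needed at these accuracies.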
99.6% rolling accuracy was observed
At step 1.77M, the rolling window (last 1000 examples) hit 99.6%. The full-dataset eval is lower (98.7%), so that window was probably an easy stretch: 99.6% of 1,000 samples is only 4 errors, well within sampling noise. Still, it shows the architecture can classify long runs of MNIST digits almost perfectly; the remaining errors are likely concentrated in the harder, more ambiguous examples.
Updated prediction
256 hidden nodes should push close to 99%. Getting past 99% may require depth (two hidden layers) or tricks (data augmentation-like effects from the evolutionary process). A single hidden layer MLP on MNIST typically maxes out around 99.0-99.2%.
Experiment 18: Depth is more efficient than width
Generalized multi-layer support
Extended Genome::new_seeded() to accept &[u32] layer specs instead of a single hidden_count. Now supports arbitrary architectures like [64, 32] or [128, 64, 32]. Connectivity: sparse input→first layer, full between adjacent layers, full last→output.
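The connectivity rule above fixes how many connections a fresh seed has. A small sketch of that count, assuming each first-layer node samples a fixed number of inputs (78 is taken from the earlier [32] and [64] runs) and ignoring bias connections; `seeded_connection_count` is a hypothetical helper, not part of `Genome`.

```rust
/// Connections in a freshly seeded genome under the stated rule:
/// sparse input→first layer, full between adjacent hidden layers,
/// full last hidden layer→output. Bias connections are ignored.
fn seeded_connection_count(layers: &[u32], input_fan_in: u32, outputs: u32) -> u32 {
    match layers.split_first() {
        // No hidden layers: nothing is seeded in this sketch.
        None => 0,
        Some((&first, rest)) => {
            let mut total = first * input_fan_in; // sparse input→first layer
            let mut prev = first;
            for &width in rest {
                total += prev * width; // full between adjacent layers
                prev = width;
            }
            total + prev * outputs // full last layer→output
        }
    }
}
```

With a fan-in of 78 and 10 outputs, `[64, 32]` seeds 7,360 connections and `[128, 64]` seeds 18,816, so under these assumptions most of the final connection count is set by the seed rather than grown by evolution.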
The depth efficiency result
784→64→32→10 gets 98.29% with 7,518 connections. This is the most parameter-efficient configuration yet:
| Config | Connections | MNIST | Extra conn per +1pp (vs [32]) |
|---|---|---|---|
| [64] | 5,673 | 97.23% | 2,007 |
| [64, 32] | 7,518 | 98.29% | 1,890 |
| [128] | 11,498 | 98.70% | 3,023 |
Depth beats width on a per-parameter basis. The second layer provides compositional features — combinations of the first layer’s edge/stroke detectors — that a wider single layer can’t efficiently express.
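The efficiency column in the table can be reproduced with a one-line calculation. The sketch below assumes the [32] run (2,943 connections, 95.87%) is the baseline, which is the reading that matches all three rows; the function name is hypothetical.

```rust
/// Extra connections spent per percentage point gained over a baseline run.
fn extra_conn_per_pp(conn: u32, acc_pct: f64, base_conn: u32, base_acc_pct: f64) -> f64 {
    (f64::from(conn) - f64::from(base_conn)) / (acc_pct - base_acc_pct)
}
```

Against that baseline, [64] costs about 2,007 connections per point, [64, 32] about 1,890, and [128] about 3,023.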
Implications for architecture search
The system now supports arbitrary layer specs, which opens the design space the user suggested. Architectures like [700, 100, 100, 300] are now expressible. The question is whether evolution can navigate this space effectively, or whether the initial topology matters more than what evolution does with it.
Two directions to explore:
- Deeper standard architectures: [128, 64], [128, 64, 32] — push the depth scaling
- Re-enable Fashion-MNIST ecological pressure: With the seeded hidden layer, multi-task learning should be dramatically better than the old 72-74%
Experiment 19: Multi-task with hidden layer — the system works
The numbers are dramatic
- 50/50 niche accuracy with sparse linear classifiers (Exp 13): 74.34% MNIST + 72.93% Fashion
- 50/50 niche accuracy with seeded hidden layer (Exp 19): 94.14% MNIST + 83.80% Fashion
That’s +20pp MNIST and +11pp Fashion. The hidden layer is even more impactful for multi-task learning than single-task. Why? Because sparse linear features can’t learn task-agnostic representations. A direct input→output connection is specific to one output class. A hidden node with ReLU can detect an edge or texture useful for BOTH MNIST digits and Fashion-MNIST categories.
Cross-task transfer is now positive
In the old system, pure niches had zero cross-task transfer. Now the 20/80 niche gets 92.35% MNIST despite only seeing 20% MNIST data. The warm-up phase (MNIST-only) trains general features that persist through niche-phase specialization. The hidden layer’s features are partly task-agnostic.
Fashion-MNIST benefits less than MNIST
Both tasks gained from the hidden layer, but MNIST gained more in absolute terms and as a share of the remaining gap:
- MNIST: 74% → 94% = +20pp (77% of the gap to 100% closed)
- Fashion: 73% → 84% = +11pp (41% of the gap to 100% closed)
Fashion still trails MNIST by ~10pp. Fashion-MNIST categories (T-shirt vs coat, sandal vs sneaker) have more inter-class similarity than MNIST digits, requiring finer feature discrimination. A wider hidden layer might help Fashion more than MNIST.
The system is now genuinely useful
A network with 3,230 connections that gets 94% MNIST + 84% Fashion simultaneously is a real multi-task learning result. The ecological speciation creates natural specialization — pure-task niches achieve near-peak single-task accuracy (95.5% MNIST, 86.3% Fashion) while mixed niches learn to balance both tasks.
Next: try wider hidden layers (64 or 128 nodes) in multi-task mode. If single-task scaled from 96%→99%, multi-task might scale from 88%→95%+.
Experiment 20: Multi-task [128] — the scaling transfers
50/50 niche: 97.54% MNIST + 87.45% Fashion = 92.5% average. Up from 89.0% with [32]. Width scaling transfers cleanly from single-task to multi-task: the 4x width increase from [32] to [128] gives ~3.5pp on both tasks.
The 100/0 niche hits 98.94% MNIST — almost identical to the pure-MNIST Experiment 17 (98.70%). Multi-task overhead with 20 outputs is negligible. The hidden layer features are largely task-agnostic.
Fashion-MNIST still trails MNIST by ~10pp across all niches. This gap is structural — Fashion categories have more inter-class similarity than MNIST digits. Depth might help more than further width for closing this gap.
Experiment 21: [128, 64] depth — 99.73% and the single-layer ceiling
Depth breaks the 99% barrier
Single-layer [128] plateaued at 98.70%. Two-layer [128, 64] hits 99.73%. The second layer enables compositional features (combinations of first-layer edge/stroke detectors) that single layers can’t express. That is roughly 160 errors on 60,000 images.
The population converged completely
Top 3 individuals are identical: 99.73%, 18,924 connections, 989 nodes. The evolutionary search settled on one subnetwork within [128, 64] and the population converged to it. With identical leaders, selection has essentially no remaining signal.
The architecture scaling story is now clear
| Architecture | Connections | MNIST | Key insight |
|---|---|---|---|
| Sparse linear | 830 | 78% | No hidden features → hard ceiling |
| [32] | 2,943 | 96% | Hidden layer breaks the ceiling |
| [128] | 11,498 | 98.7% | Width helps but single-layer caps at ~99% |
| [128, 64] | 18,924 | 99.7% | Depth breaks the next ceiling |
Each architectural innovation (hidden layer, more width, depth) breaks through a ceiling that hyperparameter tuning can’t penetrate. The lesson applies at every scale: when you’re stuck, change the architecture.
Experiment 22: Sparse inter-layer connections — free parameter savings
Halved the 128→64 inter-layer connections (50% instead of 100%). Result: 99.68% vs 99.73% — only 0.05pp cost for 20% fewer total connections (15.2K vs 18.9K). The dense inter-layer was massively over-parameterized.
This confirms the lottery ticket pattern extends to inter-layer connections, not just input→hidden. The optimal inter-layer density is probably 25-50%, not 100%. Compression ratio improved from 17.3% to 13.9% of the dense equivalent.
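The compression ratios quoted here follow from the dense weight count of 784→128→64→10. A short sketch of that arithmetic, counting weights only (no biases); both function names are hypothetical.

```rust
/// Weight count of a fully connected MLP with the given layer sizes
/// (inputs first, outputs last; biases excluded).
fn dense_weights(sizes: &[u64]) -> u64 {
    sizes.windows(2).map(|w| w[0] * w[1]).sum()
}

/// Sparse connection count as a percentage of the dense equivalent.
fn compression_pct(connections: u64, sizes: &[u64]) -> f64 {
    100.0 * connections as f64 / dense_weights(sizes) as f64
}
```

The dense equivalent has 109,184 weights, so 18,924 connections is about 17.3% of it and ~15.2K connections is about 13.9%.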
Performance, Phase 3: HotConnection — packing the hot path (commit ca3ecdd)
The deeper architectures from Experiments 18-22 made runtime a problem again. [128, 64] at 1,000 steps/sec meant ~30 minutes for a 1.8M-step run. Profiled and rewrote the inner loop’s data layout.
What changed
The compiled phenotype previously stored CompiledConnection { from_idx, to_idx, weight_idx, genome_conn_index } (4 usizes = 32 bytes), with weights in a separate Vec<f32> indexed by weight_idx. That meant every connection in the forward inner loop did two loads: one for the connection record, then a dependent one for the weight. On a 15K-connection network, that’s 30K loads per forward pass.
HotConnection { from_idx: u32, weight: f32 } (8 bytes) puts the weight inline. Cold-path data (genome_conn_index, used only during weight write-back) moved to a parallel Vec. The hot loop is now one sequential read of the connection array — from_idx still drives a random load into post_activations, but the weight rides along on the same cache line.
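A sketch of the two layouts and the resulting hot loop, for the record. The struct fields come from the description above; everything else is an assumption made for illustration: connections grouped by target node, a `node_ranges` table giving each node's slice of the connection array, and ReLU applied to every computed node.

```rust
/// Old layout as described: four usize fields, weight stored elsewhere.
#[allow(dead_code)]
struct CompiledConnection {
    from_idx: usize,
    to_idx: usize,
    weight_idx: usize,
    genome_conn_index: usize,
}

/// Packed hot-path record: source index and weight share one 8-byte slot.
#[repr(C)]
#[derive(Clone, Copy)]
struct HotConnection {
    from_idx: u32,
    weight: f32,
}

/// (old record size, new record size) in bytes on the current target.
fn record_sizes() -> (usize, usize) {
    (
        std::mem::size_of::<CompiledConnection>(),
        std::mem::size_of::<HotConnection>(),
    )
}

/// Forward pass over non-input nodes in topological order.
/// Assumption: `node_ranges[k]` is the half-open range of `conns`
/// feeding node `first_node + k`.
fn forward(
    conns: &[HotConnection],
    node_ranges: &[(usize, usize)],
    first_node: usize,
    post_activations: &mut [f32],
) {
    for (k, &(start, end)) in node_ranges.iter().enumerate() {
        let mut sum = 0.0f32;
        for c in &conns[start..end] {
            // `c` is read sequentially; the source activation is the
            // one remaining random load per connection.
            sum += c.weight * post_activations[c.from_idx as usize];
        }
        post_activations[first_node + k] = sum.max(0.0); // ReLU, for illustration
    }
}
```

On a 64-bit target `record_sizes()` reports 32 and 8 bytes, which is the 4x reduction in the connection array that the rewrite relies on.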
Result
- Before: 1,000 steps/sec (~30 min for 1.8M steps)
- After: 1,667 steps/sec (~18 min for 1.8M steps)
- Speedup: 1.67×
Also added adaptive batch sizing (500 examples per rayon dispatch for large networks, 100 for small) — negligible additional impact, but cheap to keep.
Why this was the right shape
The original 22.5× speedup (Phase 1 + 2) was for the [32] architecture, where networks had ~3K connections and the working set fit comfortably anywhere. With [128, 64] at 15K connections, working-set size starts to matter: 15K × 32 bytes = 480 KB per individual (spills L2), but 15K × 8 bytes = 120 KB (fits L2 cleanly). Cache-line packing wasn’t a meaningful win at the [32] scale, but at [128, 64] it’s the difference between L2-resident and L3-resident.
Profile after the change shows the hot loop is dominated by random post_activations[from_idx] loads — exactly what you’d expect from a sparse network. No remaining low-hanging fruit in scalar-land.
Performance, Phase 4: The SoA dead end (commit 1790a7c)
With scalar AoS at the limit, the obvious next idea: vectorize across individuals. Build a superset topology shared by all 200 individuals, lay weights out SoA (16-wide groups), let LLVM autovectorize the inner loop.
It worked, in the narrow sense: vmulps/vaddps confirmed in perf annotate. It also ran 40× slower than the AoS baseline.
Memory bandwidth is the bottleneck for sparse networks, not compute. Each SoA connection access loads 128 bytes (64 of weights + 64 of source activations) vs 12 bytes for AoS. The 2× SIMD throughput cannot overcome a 10.7× bandwidth penalty. SoA wins for dense matmul where the access pattern is regular and compute-bound; for sparse networks with random from_idx, AoS + thread parallelism is structurally superior.
Reverted to the AoS baseline. Full investigation in performance.md — it’s a useful negative result and the kind of thing worth not relearning.
The general lesson
Phases 1-3 were each “the profile said X is hot, restructure X.” Phase 4 was “the profile said random loads are hot, restructure access patterns to be sequential,” which sounds correct but trades one bottleneck for a worse one. The diagnosis was right; the prescription was wrong because it ignored which level of the memory hierarchy the working set was sitting in.
For a network of this shape and size on this CPU, 1,667 steps/sec is the practical optimum without leaving Rust + scalar + rayon. Further gains would require either GPU offload (overhead probably swamps the win at this network size) or a fundamentally different algorithm.