Synth: An LLM Ran My Research

Synth is a neuroevolution system written in Rust. It evolves neural network topologies using NEAT-style genetic algorithms while training weights via SGD. Across six research streams (A, B, C, E, F, G), it has grown from 78% MNIST accuracy on a sparse linear classifier to 99.7% on an evolved sparse subnetwork, and mapped per-task architectural conditionality across four MNIST-like datasets. It lifted Group B’s manual findings back into the NEAT genome, where ecological speciation rediscovers them on its own. It ablated its own “online learning” framing and found that the framing doesn’t matter. Most recently, it confirmed empirically that different data mixes produce genuine task-driven speciation (199× variance ratio over a same-data control) and built a working “neural ecosystem” in which naive ensemble averaging across specialists beats any single network by 2.7pp.

The unusual part: Claude Code designed and ran all of these experiments autonomously. It wrote the code, chose what to test, ran the experiments, analysed the results, decided what to try next, and wrote up the findings. The human’s role was to review plans, give go/no-go decisions, and occasionally redirect.

The even more unusual part: the only thing that made this work was a lab notebook.

Headline numbers

Metric	Sparse linear (Exp 13)	Deep network (Exp 21)	Multi-task (Exp 20)	Group E EMNIST (E4)	Group C joint 4-way (C5d)
Accuracy	77.78% MNIST	99.73% MNIST	97.54% / 87.45%	80.09% EMNIST	87.1% overall (77 classes)
Connections	837	18,924	12,849	12,088	39,502
Architecture	input→output	784→128→64→10	784→128→20 (5 niches)	156 patches → 77	512 patches → 77

The deep network (Exp 21) makes 160 errors on 60,000 MNIST images at ~17% of dense parameters. The four-way joint task in Group C reaches 87% across MNIST+Fashion+KMNIST+EMNIST at 11× fewer parameters than a dense [128] MLP. Group E’s warm-start unblocks patch-count evolution and pushes EMNIST, the system’s most under-capacitated niche, to 80% with ~156 patches.
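
The Exp 21 figures can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function names are hypothetical (they are not taken from the Synth codebase), and it assumes that "dense parameters" means the weight count of a fully connected 784→128→64→10 network with biases ignored.

```rust
/// Number of weights in a fully connected MLP with the given layer widths.
/// Assumption of this sketch: biases are not counted.
fn dense_connections(layers: &[usize]) -> usize {
    // Each adjacent pair of layers contributes (inputs * outputs) weights.
    layers.windows(2).map(|w| w[0] * w[1]).sum()
}

/// Fraction of the dense connection count that a sparse network retains.
fn density(sparse_connections: usize, layers: &[usize]) -> f64 {
    sparse_connections as f64 / dense_connections(layers) as f64
}

/// Accuracy in percent implied by an error count over a dataset.
fn accuracy_pct(errors: usize, total: usize) -> f64 {
    100.0 * (1.0 - errors as f64 / total as f64)
}
```

Under these assumptions, `dense_connections(&[784, 128, 64, 10])` is 109,184, so the 18,924 evolved connections are about 17.3% of the dense count, and `accuracy_pct(160, 60_000)` is about 99.73%, matching the table.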

The six research streams

Method — How an LLM-driven lab notebook keeps the research coherent across sessions, picks up where it left off, and produces a publishable record. The low-tech solution that actually works.
Stream A: Architecture and Scaling — 22 experiments on MNIST/Fashion-MNIST. Sparse linear → hidden layer → width scaling → depth → multi-task. Reaches 99.7% MNIST and 97.5%+87.5% multi-task with the [128, 64] architecture. (raw experiments · raw journal · performance)
Stream B: Typed Neuronal Species — 35 experiments on the patch-matcher primitive across MNIST/Fashion/KMNIST/EMNIST. Establishes that only the patch primitive transfers cleanly across all four datasets — every other architectural axis (size, aspect ratio, locality, depth, training schedule) is task-conditional. KMNIST inverts on multiple axes. Mechanistic prediction of per-task locality direction from class-discriminability. (journal · experiments)
Stream C: NEAT Integration — Lifts Group B’s patch primitive into the genome as a typed species. 11× compression vs dense MLPs on the 4-way joint task. Ecological speciation across 5 niches independently rediscovers Group B’s per-task locality findings without being told what each task is. Phase D extends with per-patch introspection, in-niche count mutation (clean negative), and depth (KMNIST +3.3pp, EMNIST −2.7pp — Group B replication exact). (journal · experiments)
Stream E: Cold-Mutation Rescue — Names and resolves Group C’s blocker. Warm-start (Net2Net-style) patch insertion makes structural mutations survive selection; with enough training, count growth translates to accuracy on niches with headroom (EMNIST +1.6pp, mixed +1.7pp). Split-ratio ablation: canonical 0.5/0.5 wins. Depth + warm-start are substitutive, not additive. Continual-learning battery (E6-E9) closes via replay — replay-100 at N=3 closes 64% of E6’s 43pp catastrophic forgetting; scaling buffer to 3000-per-task at N=8 brings forgetting to 6pp. Population diversity is a small (~2pp) modulator, not a CL mechanism. (journal · experiments)
Stream F: Online vs Offline SGD — Tests the project’s foundational positioning claim. Three experiments: equal-per-step LR (online wins on update-frequency confound), linear LR scaling on fixed architecture (online ≈ batched up to B=64 within seed noise), under-evolution comparison (online ≈ batched at 96.4% vs 96.6%, batched edges on patch growth). The “online per-example SGD” framing is empirically unsupported. The distinctive mechanism is the NEAT-style topology evolution; the update style is incidental. Batched SGD with linear scaling is a drop-in replacement. (journal · experiments)
Stream G: Speciation and the Neural Ecosystem — Tests the project’s other foundational claim: that different data mixes produce genuine task-driven speciation. G1 null test: variance ratio 199× on MNIST accuracy / 94× on Fashion vs same-data control — speciation is task-driven, not drift. G3: ecosystem of 5 specialists beats any single network via naive softmax averaging (+2.66pp; oracle ceiling +7.79pp). G4-G5: under temporal regime shift introducing KMNIST after pre-training on MNIST+Fashion, the ecosystem adapts in two complementary ways: existing species generalize via dead-time training (G4), OR new species spawn automatically when ensemble fails sustainedly (G4b spawned 1 species, G5 spawned 2 in response to graded pressure). The “lottery ticket” for novel tasks isn’t selected from a fixed ensemble — it’s evolved into existence by ecological pressure. (journal · experiments)
Source Code — The full Rust codebase. ~3,000 lines, dependencies limited to rand, byteorder, and rayon.
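
The "naive softmax averaging" mentioned under Stream G (G3) can be sketched roughly as follows. This is a minimal illustration, not the project's implementation: it assumes each specialist emits raw scores (logits) over the same set of classes and that all specialists are weighted equally. All function names are hypothetical.

```rust
/// Numerically stable softmax over one network's raw class scores.
fn softmax(logits: &[f64]) -> Vec<f64> {
    // Subtract the maximum before exponentiating to avoid overflow.
    let max = logits.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
    let exps: Vec<f64> = logits.iter().map(|&x| (x - max).exp()).collect();
    let sum: f64 = exps.iter().sum();
    exps.iter().map(|e| e / sum).collect()
}

/// Average the softmax distributions of several specialists with uniform weights.
/// Assumes a non-empty slice whose entries all have the same number of classes.
fn ensemble_probs(specialist_logits: &[Vec<f64>]) -> Vec<f64> {
    let n_classes = specialist_logits[0].len();
    let mut mean = vec![0.0; n_classes];
    for logits in specialist_logits {
        for (m, p) in mean.iter_mut().zip(softmax(logits)) {
            *m += p;
        }
    }
    let n = specialist_logits.len() as f64;
    mean.iter_mut().for_each(|m| *m /= n);
    mean
}

/// Index of the most probable class (first index wins on ties).
fn argmax(probs: &[f64]) -> usize {
    let mut best = 0;
    for (i, &p) in probs.iter().enumerate() {
        if p > probs[best] {
            best = i;
        }
    }
    best
}
```

Calling `argmax(&ensemble_probs(&logits))` gives the ensemble's prediction. Because probabilities rather than votes are averaged, a specialist that is confident on an input can outweigh one that is unsure, which is one plausible reading of why plain averaging can beat any single network.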

Why this matters

Andrej Karpathy’s AutoResearch vision imagines LLMs running scientific research autonomously. The hard problem isn’t getting an LLM to write code or run experiments — it’s keeping it coherent across sessions, preventing it from repeating failed approaches, and building on prior results instead of starting fresh each time.

Synth demonstrates that the solution is embarrassingly simple: give the LLM a lab notebook. A journal.md for chronological observations and an experiments.md for structured records, one set per research stream. The LLM reads these at the start of each session, picks up where it left off, and adds its new findings. No vector databases, no RAG pipelines, no special memory systems. Just markdown files in a notes/ directory.

The notebook works because it does exactly what a lab notebook does for a human researcher: it externalises working memory into a persistent, structured, reviewable format. The LLM doesn’t need to “remember” what it tried — it reads its own notes.
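
As a rough sketch of the notebook mechanics described above, the following uses only the Rust standard library. The directory layout (notes/<stream>/journal.md and notes/<stream>/experiments.md), the entry heading format, and all function names are assumptions made for illustration. The source only states that there is one journal.md and one experiments.md per stream under notes/.

```rust
use std::fs::{self, OpenOptions};
use std::io::{self, Write};
use std::path::{Path, PathBuf};

/// Hypothetical layout: notes/<stream>/journal.md and notes/<stream>/experiments.md.
fn notebook_paths(notes_dir: &Path, stream: &str) -> (PathBuf, PathBuf) {
    let dir = notes_dir.join(stream);
    (dir.join("journal.md"), dir.join("experiments.md"))
}

/// Load both files as session-start context; a missing file reads as empty.
fn load_context(notes_dir: &Path, stream: &str) -> io::Result<String> {
    let (journal, experiments) = notebook_paths(notes_dir, stream);
    let mut context = String::new();
    for path in &[journal, experiments] {
        match fs::read_to_string(path) {
            Ok(text) => {
                context.push_str(&text);
                context.push('\n');
            }
            // A stream with no notes yet is not an error.
            Err(e) if e.kind() == io::ErrorKind::NotFound => {}
            Err(e) => return Err(e),
        }
    }
    Ok(context)
}

/// Append one dated observation to the stream's journal, creating it if needed.
fn append_journal(notes_dir: &Path, stream: &str, date: &str, note: &str) -> io::Result<()> {
    let (journal, _) = notebook_paths(notes_dir, stream);
    if let Some(parent) = journal.parent() {
        fs::create_dir_all(parent)?;
    }
    // Append-only writes keep the journal chronological and never rewrite history.
    let mut file = OpenOptions::new().create(true).append(true).open(journal)?;
    writeln!(file, "## {date}\n\n{note}\n")
}
```

A session would call `load_context` once at the start, hand the returned text to the model, and call `append_journal` for each new finding. The sketch opens the journal in append mode so that earlier entries, including negative results, stay in the record.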

The Group B → C → E arc is the strongest evidence of the method working as intended. Group B manually mapped per-task locality over 35 experiments. Group C lifted the primitive into the NEAT genome, and its niche dynamics rediscovered the same per-task findings without supervision. Group E supplied the mechanism that resolved the structural blocker Group C surfaced. Each stream builds on the previous one, each phase’s negatives become the next phase’s research questions, and the LLM keeps the thread for weeks of work at a time.

This project is a collaboration between a human (project setup, direction, review) and Claude Code (experiment design, implementation, execution, analysis, and writing — including this site). The commit history shows the full arc.