Synth: An LLM Ran My Research

Synth is a neuroevolution system written in Rust. It evolves neural network topologies using NEAT-style genetic algorithms while training weights via SGD. Across six research streams (A, B, C, E, F, G), it has grown from 78% MNIST accuracy on a sparse linear classifier to 99.7% on an evolved sparse subnetwork, and mapped per-task architectural conditionality across four MNIST-like datasets. It lifted Group B’s manual findings back into the NEAT genome, where ecological speciation rediscovers them on its own. It ablated its own “online learning” framing and found that the framing doesn’t matter. Most recently, it confirmed empirically that different data mixes produce genuine task-driven speciation (a 199× variance ratio over a same-data control) and built a working “neural ecosystem” in which naive ensemble averaging across specialists beats any single network by 2.7pp.

The unusual part: Claude Code designed and ran all of these experiments autonomously. It wrote the code, chose what to test, ran the experiments, analysed the results, decided what to try next, and wrote up the findings. The human’s role was to review plans, give go/no-go decisions, and occasionally redirect.

The even more unusual part: the only thing that made this work was a lab notebook.


Headline numbers

| Metric | Sparse linear (Exp 13) | Deep network (Exp 21) | Multi-task (Exp 20) | Group E EMNIST (E4) | Group C joint 4-way (C5d) |
|---|---|---|---|---|---|
| Accuracy | 77.78% MNIST | 99.73% MNIST | 97.54% / 87.45% | 80.09% EMNIST | 87.1% overall (77 classes) |
| Connections | 837 | 18,924 | 12,849 | 12,088 | 39,502 |
| Architecture | input→output | 784→128→64→10 | 784→128→20 (5 niches) | 156 patches → 77 | 512 patches → 77 |

The deep network (Exp 21) makes 160 errors on 60,000 MNIST images at ~17% of dense parameters. The four-way joint task in Group C reaches 87% across MNIST+Fashion+KMNIST+EMNIST at 11× fewer parameters than a dense [128] MLP. Group E’s warm-start unblocks patch-count evolution and pushes EMNIST, the system’s most under-capacity niche, to 80% with ~156 patches.
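The Exp 21 figures can be sanity-checked with a few lines of arithmetic. The sketch below is not Synth's code. It assumes that "dense parameters" means the weights of a fully connected 784→128→64→10 network with biases ignored, and the function names are hypothetical.

```rust
/// Number of weights in a fully connected network with the given layer widths.
/// Assumption of this sketch: biases are not counted.
fn dense_weight_count(layers: &[usize]) -> usize {
    layers.windows(2).map(|pair| pair[0] * pair[1]).sum()
}

/// Fraction of the dense weight count that a sparse network keeps.
fn density(connections: usize, layers: &[usize]) -> f64 {
    connections as f64 / dense_weight_count(layers) as f64
}

/// Accuracy in percent implied by `errors` mistakes on `total` examples.
fn accuracy_percent(errors: usize, total: usize) -> f64 {
    100.0 * (total - errors) as f64 / total as f64
}
```

Under these assumptions the dense baseline has 784·128 + 128·64 + 64·10 = 109,184 weights, so 18,924 connections is about 17.3% of dense, and 160 errors on 60,000 images corresponds to roughly 99.73% accuracy, in line with the table.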


The six research streams


Why this matters

Andrej Karpathy’s AutoResearch vision imagines LLMs running scientific research autonomously. The hard problem isn’t getting an LLM to write code or run experiments — it’s keeping it coherent across sessions, preventing it from repeating failed approaches, and building on prior results instead of starting fresh each time.

Synth demonstrates that the solution is embarrassingly simple: give the LLM a lab notebook. A journal.md for chronological observations and an experiments.md for structured records, one set per research stream. The LLM reads these at the start of each session, picks up where it left off, and adds its new findings. No vector databases, no RAG pipelines, no special memory systems. Just markdown files in a notes/ directory.

The notebook works because it does exactly what a lab notebook does for a human researcher: it externalises working memory into a persistent, structured, reviewable format. The LLM doesn’t need to “remember” what it tried — it reads its own notes.
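As a rough illustration of the read-then-append loop described above, here is a minimal Rust sketch. It is not taken from the Synth repository. The per-stream subdirectory layout (`notes/<stream>/journal.md`), the function names, and the `## <date>` entry format are all assumptions made for the example.

```rust
use std::fs;
use std::io::{self, Write};
use std::path::Path;

/// Notes for one research stream, as read at the start of a session.
struct StreamNotes {
    journal: String,
    experiments: String,
}

/// Read both notebook files for a stream.
/// Assumed (hypothetical) layout: <notes_dir>/<stream>/{journal.md, experiments.md}.
fn load_stream_notes(notes_dir: &Path, stream: &str) -> io::Result<StreamNotes> {
    let dir = notes_dir.join(stream);
    Ok(StreamNotes {
        journal: fs::read_to_string(dir.join("journal.md"))?,
        experiments: fs::read_to_string(dir.join("experiments.md"))?,
    })
}

/// Append a dated observation to the stream's journal, creating the file if needed.
/// The "## <date>" heading format is an assumption of this sketch.
fn append_journal_entry(notes_dir: &Path, stream: &str, date: &str, text: &str) -> io::Result<()> {
    let dir = notes_dir.join(stream);
    fs::create_dir_all(&dir)?;
    let mut file = fs::OpenOptions::new()
        .create(true)
        .append(true)
        .open(dir.join("journal.md"))?;
    writeln!(file, "\n## {}\n\n{}", date, text)
}
```

Appending rather than rewriting keeps the journal chronological, so a later session can read the file top to bottom and see the order in which things were tried.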

The Group B → C → E arc is the strongest evidence of the method working as intended. Group B manually mapped per-task locality (35 experiments). Group C lifted the primitive into the NEAT genome, and its niche dynamics rediscovered the same per-task findings without supervision. Group E built the mechanism that addresses the structural blocker Group C surfaced. Each stream builds on the previous one, each phase’s negatives become the next phase’s research questions, and the LLM keeps the thread for weeks of work at a time.


This project is a collaboration between a human (project setup, direction, review) and Claude Code (experiment design, implementation, execution, analysis, and writing — including this site). The commit history shows the full arc.