The Notebook Method

How giving an LLM a lab notebook makes autonomous research actually work.

The Problem

LLM-driven research has an obvious appeal: LLMs can write code, analyze data, and reason about experimental design. The hard part isn’t any of those individual capabilities. The hard part is coherence across time.

A typical LLM research session looks like this:

1. The LLM writes some code and runs an experiment
2. It analyzes the results and suggests next steps
3. The conversation ends (context limit, user takes a break, etc.)
4. A new session starts
5. The LLM has no memory of what was tried, what worked, or what failed
6. It proposes the same experiment again, or an experiment that was already proven unhelpful

This isn’t a hallucination problem or a capability problem. It’s a memory problem. Human researchers have solved it the same way for centuries: they keep a lab notebook.

The Solution

Synth uses three files to maintain research coherence across sessions:

`CLAUDE.md` — Project Invariants

Things that don’t change between experiments: build instructions, architectural constraints, and the Rust edition 2024 / rand 0.9 API quirks that would otherwise be rediscovered every session. Claude Code reads this file automatically at the start of every conversation.

## Key Design Invariants
- Genome is source of truth for topology.
- write_weights_to_genome() must be called before crossover.
- Output nodes use Identity activation — softmax applied in forward pass.
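
To make the last invariant concrete, here is a minimal sketch of what "Identity outputs, softmax in the forward pass" could look like: the output nodes emit raw values, and one softmax is applied over the whole output vector. The function name and signature are hypothetical, not Synth's actual code.

```rust
/// Numerically stable softmax over raw output-node values.
/// Hypothetical helper: the name and signature are assumptions, not Synth's API.
fn softmax(logits: &[f32]) -> Vec<f32> {
    if logits.is_empty() {
        return Vec::new();
    }
    // Subtract the maximum before exponentiating to avoid overflow.
    let max = logits.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = logits.iter().map(|&x| (x - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|&e| e / sum).collect()
}
```

An invariant like this is cheap to write down once, and easy to get wrong if each session has to infer it from the code.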

`notes/journal.md` — Chronological Observations

A running log of what happened, in order. Design decisions, debugging stories, unexpected observations, meta-reflections on the research process. Entries are dated and narrative. The journal is where the LLM’s reasoning lives — not just what was tried, but why, and what it implies for future work.

Example entry:

## Experiment 10: Aggressive Structural Mutation — diminishing returns

### Why more mutation doesn't mean more divergence

The key insight: structural divergence between niches comes from
**differential selection** on structural variants, not from **more
structural variants**. With 30% add_node/add_conn, every niche
generates enormous structural novelty — but selection applies the
same culling pressure everywhere.

Analogy: increasing mutation rate is like adding more paint cans.
Structural divergence requires different niches to *paint different
pictures* — that comes from different selection pressures, not more paint.

`notes/experiments.md` — Structured Records

Each experiment gets a standardized write-up: date, goal, changes from baseline, parameters, results tables, comparison deltas, analysis, key insight, conclusion, and next experiments to consider. This structure is critical — it makes it trivial for the LLM (or a human) to scan the full experiment history and understand what’s been tried.
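
As an illustration of that standardized write-up, the sketch below models one entry as a Rust struct and renders it to markdown. The type, its field names, and the exact heading layout are assumptions made for this example. In the project itself the entries were written directly by the LLM, not generated by code, and the results tables and comparison deltas are omitted here for brevity.

```rust
use std::fmt::Write;

/// One structured entry for notes/experiments.md.
/// Hypothetical type: fields mirror the write-up sections listed above.
struct ExperimentRecord {
    number: u32,
    title: String,
    date: String,
    goal: String,
    changes: Vec<String>,
    parameters: Vec<(String, String)>,
    analysis: String,
    key_insight: String,
    conclusion: String,
    next_experiments: Vec<String>,
}

impl ExperimentRecord {
    /// Render the record as a markdown section (assumed layout).
    fn to_markdown(&self) -> String {
        let mut out = String::new();
        // Writing into a String cannot fail, so the results are ignored.
        let _ = writeln!(out, "## Experiment {}: {}\n", self.number, self.title);
        let _ = writeln!(out, "**Date:** {}", self.date);
        let _ = writeln!(out, "**Goal:** {}\n", self.goal);

        let _ = writeln!(out, "### Changes from baseline");
        for change in &self.changes {
            let _ = writeln!(out, "- {change}");
        }

        let _ = writeln!(out, "\n### Parameters\n");
        let _ = writeln!(out, "| Parameter | Value |");
        let _ = writeln!(out, "|---|---|");
        for (name, value) in &self.parameters {
            let _ = writeln!(out, "| {name} | {value} |");
        }

        let _ = writeln!(out, "\n### Analysis\n{}", self.analysis);
        let _ = writeln!(out, "\n### Key insight\n{}", self.key_insight);
        let _ = writeln!(out, "\n### Conclusion\n{}", self.conclusion);

        let _ = writeln!(out, "\n### Next experiments to consider");
        for idea in &self.next_experiments {
            let _ = writeln!(out, "- {idea}");
        }
        out
    }
}
```

The point of fixing the section order is that every entry can be skimmed the same way, whether the reader is a person or a model with a limited context budget.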

The “next experiments to consider” section at the end of each experiment is especially important. It’s the LLM’s own recommendation for what to try next, written while the context of the current experiment is fresh. When a new session starts, the LLM reads the most recent experiment’s recommendations and picks up the thread.
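
Picking up the thread can be pictured as reading the last "next experiments" list in the file. The sketch below assumes entries are appended chronologically, that the section heading reads "Next experiments to consider", and that recommendations are "- " bullets. All three are assumptions about the file layout, and the function name is hypothetical.

```rust
/// Return the bullet items under the last "Next experiments to consider"
/// heading in the notes text. Heading text and "- " bullet style are
/// assumptions about the file layout, not a documented format.
fn latest_next_steps(notes: &str) -> Vec<String> {
    let mut latest: Vec<String> = Vec::new();
    let mut in_section = false;
    for line in notes.lines() {
        let trimmed = line.trim();
        if trimmed.starts_with('#') {
            // Any heading ends the current section; a matching one starts a new list.
            in_section = trimmed
                .trim_start_matches('#')
                .trim()
                .eq_ignore_ascii_case("Next experiments to consider");
            if in_section {
                latest.clear();
            }
        } else if in_section {
            if let Some(item) = trimmed.strip_prefix("- ") {
                latest.push(item.to_string());
            }
        }
    }
    latest
}
```

In practice the LLM simply reads the file, so no parser is needed. The sketch only shows that the information a new session needs sits in one predictable place.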

How It Worked in Practice

Here’s the actual workflow across sessions:

1. Session starts. Claude reads CLAUDE.md (automatic) and the notes/ files (via memory instructions).
2. Claude reviews where things stand. It reads the last few experiment entries, sees what was tried, what worked, and what the recommended next steps were.
3. Claude proposes an experiment. Based on the notebook, it identifies the highest-leverage thing to try next. It writes a plan (which hyperparameters to change, what the hypothesis is, what to compare against).
4. Human reviews the plan. Usually approves it. Occasionally redirects (“try X instead” or “what about Y?”).
5. Claude implements the change. Modifies config values, sometimes adds new code (e.g., the migration system, the seeded hidden layer).
6. Claude runs the experiment. `cargo run --release`, waits ~3-5 minutes for results.
7. Claude analyzes the results. Compares cross-evaluation numbers against the baseline, computes deltas, examines structural metrics, identifies patterns.
8. Claude writes up the findings. Adds a full experiment entry to experiments.md, adds a journal entry with deeper analysis and meta-observations.
9. Claude commits and pushes. Reverts any temporary config changes, commits the notes.
10. Repeat from step 3, or end the session.
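
The analysis step (comparing against the baseline and computing deltas) reduces to a small amount of arithmetic. The sketch below builds a markdown delta table from two sets of metric values. The function name, the column layout, and the idea of passing metrics as name/value pairs are assumptions for illustration, not the project's actual tooling.

```rust
/// Build a markdown table of per-metric deltas (experiment minus baseline),
/// in percentage points. The table layout is an assumed format.
fn delta_table(baseline: &[(&str, f64)], experiment: &[(&str, f64)]) -> String {
    let mut out =
        String::from("| Metric | Baseline | Experiment | Delta (pp) |\n|---|---|---|---|\n");
    for (name, base) in baseline {
        // Only report metrics present in both runs.
        if let Some((_, exp)) = experiment.iter().find(|(n, _)| n == name) {
            out.push_str(&format!(
                "| {} | {:.1} | {:.1} | {:+.1} |\n",
                name,
                base,
                exp,
                exp - base
            ));
        }
    }
    out
}
```

Writing the signed delta next to the raw numbers saves the next session from redoing the subtraction and makes regressions visible at a glance.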

In a single session, Claude typically ran 2-4 experiments. Across sessions, the notebook kept the work continuous: Claude never repeated a failed experiment or forgot a prior finding.

What the Human Did vs. What the LLM Did

Task	Who
Initial project setup and architecture	Human + LLM
Writing CLAUDE.md instructions	Human
Creating the notebook structure	Human (told LLM to keep notes)
Designing individual experiments	LLM
Choosing what to test next	LLM (with occasional human input)
Writing experiment code	LLM
Running experiments	LLM
Analyzing results	LLM
Writing experiment notes	LLM
Writing journal entries	LLM
Reviewing plans (go/no-go)	Human
Asking “what if we tried X?”	Human (occasionally)
Performance optimization (profiling + hot-path rewrites)	LLM
Deciding to add a seeded hidden layer	Human asked “what if we pushed accuracy higher?”, LLM designed the approach

The human’s role was genuinely supervisory. The LLM did the research.

Why This Works (And Why It’s Surprising)

The notebook method works for the same reason lab notebooks work for human researchers. It externalizes working memory into a persistent, structured format. The LLM doesn’t need to “remember” anything — it reads its own notes.

What’s surprising is how little infrastructure this requires. No vector database. No retrieval-augmented generation. No embedding-based memory system. No special “research agent” framework. Just two markdown files in a notes/ directory, and a line in CLAUDE.md that says “always record observations in notes/journal.md and experiments.md.”

The structure matters, though. Unstructured notes would work poorly — the LLM needs to quickly scan what’s been tried and what the results were. The standardized experiment format (goal, parameters, results table, delta comparison, analysis, key insight, conclusion, next steps) makes this trivial. Any LLM can read a markdown table of accuracy deltas and understand what happened.
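
One way to see why structure helps: with consistent headings, the whole experiment history collapses into a short index. The sketch below assumes each entry starts with a heading shaped like "## Experiment N: title", as in the example entry shown earlier. Whether experiments.md uses exactly that shape is an assumption, and the function name is hypothetical.

```rust
/// Collect "## Experiment N: title" headings as (number, title) pairs.
/// The heading shape is an assumption about the notes format.
fn experiment_index(notes: &str) -> Vec<(u32, String)> {
    notes
        .lines()
        .filter_map(|line| {
            let rest = line.trim().strip_prefix("## Experiment ")?;
            let (num, title) = rest.split_once(':')?;
            let num: u32 = num.trim().parse().ok()?;
            Some((num, title.trim().to_string()))
        })
        .collect()
}
```

The same scan would not be possible over free-form notes, which is the practical difference between a structured record and a diary.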

Limitations

This approach has clear boundaries:

Scale: 14 experiments fit comfortably in context. At 100+ experiments, the notes files would need summarization or indexing.
Complexity: This project was single-system, single-metric research. Multi-system or multi-objective research might need more structure.
Verification: The human needs to verify that experiments are actually testing what they claim to test. The LLM can make subtle design errors that produce plausible-looking but misleading results.
Novelty: The LLM is good at systematic exploration of a known design space. It’s less likely to make the kind of creative leaps that come from cross-domain intuition or serendipity.
Reproducibility: Fixed seeds help, but evolutionary search is stochastic, so small config changes can cascade. The LLM’s analysis treats differences of 1-2 percentage points (pp) as meaningful, even though differences that small may be within run-to-run noise.
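
The reproducibility concern can be checked with a few repeated baseline runs. The sketch below estimates run-to-run spread and flags whether a new result falls outside it. The threshold rule is a rough heuristic chosen for illustration, not a formal significance test, and the function names are hypothetical.

```rust
/// Mean and sample standard deviation of a set of run results.
/// Returns None when fewer than two runs are available.
fn mean_and_std(values: &[f64]) -> Option<(f64, f64)> {
    if values.len() < 2 {
        return None;
    }
    let n = values.len() as f64;
    let mean = values.iter().sum::<f64>() / n;
    let var = values.iter().map(|v| (v - mean).powi(2)).sum::<f64>() / (n - 1.0);
    Some((mean, var.sqrt()))
}

/// Rough heuristic (an assumption, not a formal test): treat a difference as
/// likely real only if it exceeds `k` baseline standard deviations.
fn exceeds_noise(baseline_runs: &[f64], candidate_mean: f64, k: f64) -> Option<bool> {
    let (mean, std) = mean_and_std(baseline_runs)?;
    Some((candidate_mean - mean).abs() > k * std)
}
```

Recording the baseline spread in the notebook once would let later sessions judge a 1-2pp difference against it instead of taking the number at face value.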

Despite these limitations, the notebook method turned Claude Code into a functional research assistant that maintained coherence across weeks of experiments. For the class of problems that fit (systematic exploration, clear metrics, iterative refinement), it works remarkably well.

Next: What We Found — the actual research results from 14 experiments.