Group G: Ecological Speciation and the Neural Ecosystem

The sixth research stream, opened 2026-05-18. Direct response to a question that hovered across all five prior streams: does evolutionary pressure from differing data distributions actually create speciation per se, or are observed inter-niche differences just isolated-population drift? And if speciation is real — can we route across the resulting “neural ecosystem” to outperform any single network?

Headline results

Yes, mix-pressure creates speciation per se. The variance ratio between varied-mix and isolated-drift conditions on MNIST accuracy is 199×; on Fashion accuracy, 94×. The five varied-mix niches are genuine ecological species — networkae mnistia (100/0) and networkae fashionmnistia (0/100) at the extremes, with smoothly interpolated intermediate forms.

The static ecosystem works — collectively the specialists beat any individual. Naive softmax averaging across 5 specialists gives 88.42% joint accuracy vs the strongest single specialist’s 85.76% (+2.66pp) and vs an oracle upper bound of 93.55% (+7.79pp headroom).

Under temporal regime shift, the ecosystem adapts in two distinct ways. G4 shows that existing species can adapt via dead-time training on a failure buffer (they become generalists). G4b shows that with frozen species, a new species emerges in response to the novel task (a true speciation event): the spawn fires automatically when rolling accuracy collapses. G5 adds knowledge-aware self-abstention and demonstrates multi-speciation under graded environmental pressure: two new species emerged at the two regime shifts (KMNIST introduction, then KMNIST-dominant phase).

G1: speciation null test

The clean control experiment that prior streams had only approached indirectly. Two conditions, same 30-individual 64-patch seed population, same evolution config, same training budget:

After 150K steps per niche, evaluate each best individual on held-out MNIST and Fashion test sets.

Varied condition — extreme specialization

| niche | MNIST | Fashion | edge_frac |
|---|---|---|---|
| 100/0 | 93.07% | 0.00% | 0.707 |
| 75/25 | 92.26% | 80.33% | 0.704 |
| 50/50 | 91.37% | 82.09% | 0.665 |
| 25/75 | 90.01% | 82.94% | 0.701 |
| 0/100 | 0.00% | 83.85% | 1.000 |

The pure-task niches physically cannot classify the other task — they’ve never received gradient signal on the unseen output classes. Mixed niches sit on the trade-off curve.

Uniform control — near-zero divergence

All 5 niches landed at 91.4-91.9% MNIST and 81.0-82.0% Fashion. Connection count varies in a ±20-range. They are functionally identical. Isolated training on identical data does not produce meaningful divergence.

Variance ratios

| metric | σ_varied / σ_uniform |
|---|---|
| MNIST accuracy | 199× |
| Fashion accuracy | 94× |
| edge_frac | 2.66× |
| avg connections | 1.69× |
| avg patches | 1.27× |
| row_std | 1.52× |

Functional speciation is the dominant signal: orders of magnitude beyond what isolated-population drift produces. Architectural divergence (connections, patches, geometry) is real but modest — consistent with prior findings that selection drives divergence, not mutation.
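The ratio in the table can be computed from per-niche metrics in a few lines. This is a minimal sketch, assuming σ is the population standard deviation across the five niche-best individuals (the notes do not say whether it is population or sample). The varied MNIST accuracies are taken from the G1 table; the uniform values are hypothetical placeholders inside the reported 91.4-91.9% band, so the printed number is illustrative and is not the reported 199×. The function name `variance_ratio` is also an assumption.

```python
from statistics import pstdev

def variance_ratio(varied, uniform):
    """Ratio of population standard deviations, sigma_varied / sigma_uniform."""
    return pstdev(varied) / pstdev(uniform)

# MNIST accuracy (%) of the five varied-mix niches, from the G1 table.
varied_mnist = [93.07, 92.26, 91.37, 90.01, 0.00]

# HYPOTHETICAL uniform-control values: the notes report only the
# 91.4-91.9% range, not the five per-niche numbers.
uniform_mnist = [91.4, 91.5, 91.6, 91.8, 91.9]

ratio = variance_ratio(varied_mnist, uniform_mnist)
print(f"sigma_varied / sigma_uniform = {ratio:.0f}x (illustrative)")
```

Almost all of σ_varied comes from the single 0.00% entry of the 0/100 niche, which is why the functional ratios dwarf the architectural ones.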

G3: the neural ecosystem

Take the 5 G1A specialists. Build a joint test set (10K MNIST + 10K Fashion = 20K examples with labels in [0..20)). Evaluate routing strategies.

Single-specialist baselines

| specialist | joint | MNIST | Fashion |
|---|---|---|---|
| 100/0 | 46.74% | 93.49% | 0.00% |
| 75/25 | 85.76% | 92.94% | 78.59% |
| 50/50 | 85.76% | 89.98% | 81.55% |
| 25/75 | 85.73% | 88.68% | 82.79% |
| 0/100 | 41.74% | 0.00% | 83.48% |

Best single specialist on the joint task: 85.76% (mixed niches).
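The joint column follows from the per-task columns once the test-set sizes are fixed. A small sketch, assuming joint accuracy is simply the fraction correct over the combined 20K examples (the helper name `joint_accuracy` is hypothetical):

```python
def joint_accuracy(acc_mnist, acc_fashion, n_mnist=10_000, n_fashion=10_000):
    """Example-weighted accuracy (%) over the combined MNIST + Fashion test set."""
    correct = acc_mnist / 100 * n_mnist + acc_fashion / 100 * n_fashion
    return 100 * correct / (n_mnist + n_fashion)

# Per-task accuracies (%) from the single-specialist table.
specialists = {
    "100/0": (93.49, 0.00),
    "75/25": (92.94, 78.59),
    "50/50": (89.98, 81.55),
    "25/75": (88.68, 82.79),
    "0/100": (0.00, 83.48),
}

for name, (mnist, fashion) in specialists.items():
    print(f"{name}: joint {joint_accuracy(mnist, fashion):.3f}%")
```

With equal 10K halves, a pure-task specialist is capped at half of its home-task accuracy, which is why 100/0 and 0/100 sit in the 40s.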

Routing strategies

| strategy | joint | MNIST | Fashion |
|---|---|---|---|
| oracle (upper bound) | 93.55% | 96.85% | 90.24% |
| confidence (max softmax) | 76.41% | 90.21% | 62.60% |
| entropy (min entropy) | 76.68% | 90.41% | 62.96% |
| naive ensemble (avg softmax) | 88.42% | 93.48% | 83.36% |
| masked ensemble (class-aware) | 88.42% | 93.48% | 83.36% |

Three findings

  1. Naive ensemble: +2.66pp over the best single specialist. Collective beats individual. The neural-ecosystem framing pays off in practice.

  2. Oracle ceiling: +7.79pp above the best single specialist. A 5pp gap remains between naive ensemble and oracle — better routing is the unsolved problem.

  3. Confidence routing falls 9.35pp below the best single specialist (76.41% vs 85.76%). The pure-task 100/0 specialist was picked for 2973 of the 10000 Fashion images, confidently misclassifying them as digits. Pure-task specialists are overconfident on out-of-distribution inputs because they’ve never been told to abstain.

Why naive ensemble works despite confidence failing

Averaging softmaxes is robust against minority overconfidence. For a Fashion coat image shown to the ecosystem:

Averaged: the wrong-class “digit 1” gets (0.92+0+0+0+0)/5 = 0.184. The right-class “Coat” gets contributions from all five and ends around 0.27. Argmax picks Coat.

Pick-the-most-confident routing picks 100/0 and reports digit 1. Averaging implicitly weighs consensus, not the loudest single vote.
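The contrast between the two routing rules can be replayed on a toy example. This is a sketch under stated assumptions: the per-specialist softmax vectors are hypothetical, chosen only so that the averages match the 0.184 and 0.27 quoted above; the joint label layout (Fashion classes offset by 10, Coat at Fashion label 4) and the function names are also assumptions.

```python
NUM_CLASSES = 20   # joint label space [0..20)
DIGIT_1 = 1        # MNIST class "1"
# ASSUMPTION: Fashion labels are offset by 10 in the joint space.
PULLOVER, DRESS, COAT, SHIRT = 12, 13, 14, 16

def dist(sparse):
    """Expand {class: prob} into a dense length-20 probability vector."""
    vec = [0.0] * NUM_CLASSES
    for cls, prob in sparse.items():
        vec[cls] = prob
    return vec

# HYPOTHETICAL softmax outputs for one Fashion coat image.
outputs = {
    "100/0": dist({DIGIT_1: 0.92, 7: 0.07, COAT: 0.01}),
    "75/25": dist({COAT: 0.30, PULLOVER: 0.25, SHIRT: 0.25, DRESS: 0.20}),
    "50/50": dist({COAT: 0.34, PULLOVER: 0.26, SHIRT: 0.20, DRESS: 0.20}),
    "25/75": dist({COAT: 0.35, PULLOVER: 0.25, SHIRT: 0.25, DRESS: 0.15}),
    "0/100": dist({COAT: 0.35, PULLOVER: 0.30, SHIRT: 0.20, DRESS: 0.15}),
}

def argmax(vec):
    return max(range(len(vec)), key=vec.__getitem__)

def route_by_confidence(outputs):
    """Pick the specialist with the highest softmax peak; return its prediction."""
    loudest = max(outputs.values(), key=max)
    return argmax(loudest)

def naive_ensemble(outputs):
    """Average the softmax vectors element-wise; return (prediction, average)."""
    n = len(outputs)
    avg = [sum(col) / n for col in zip(*outputs.values())]
    return argmax(avg), avg

pred, avg = naive_ensemble(outputs)
print("confidence routing ->", route_by_confidence(outputs))
print("naive ensemble     ->", pred, f"(digit 1: {avg[DIGIT_1]:.3f}, coat: {avg[COAT]:.3f})")
```

The loud vote is divided by five, while the correct class accumulates moderate mass from several specialists, so the average lands on Coat even though no single specialist is very sure.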

G4: ecological routing with dead-time adaptation

User-redirected from G1/G3 to a richer “ecosystem under regime shift” question: what happens when novel data arrives that no specialist was trained for? The mechanism design: per-population liveness state with exponential-backoff death on failure, dead-time training on a failure buffer, and automatic spawning of a new species when the ensemble fails for a sustained window.

G4 pre-trains two populations on MNIST and Fashion, then runs a 3-phase online stream:

Result: no new species spawned, but both pre-trained populations adapted via failure-buffer training during their dead-time intervals. By Phase C end:

| species | MNIST acc | Fashion acc | KMNIST acc |
|---|---|---|---|
| mnist (pretrained) | 83.4% | 71.8% | 71.7% |
| fashion (pretrained) | 79.2% | 72.6% | 72.5% |

Both became generalists. The anteater learned to eat capuchin food. Rolling ensemble accuracy 78-82%. This is adaptation, not speciation — the framework provides online continual learning via implicit replay (failure buffer + dead-time training), but doesn’t demonstrate new-species emergence because at least one species was usually correct.

G4b v2: frozen specialists force speciation

To isolate the speciation question, G4b freezes the pre-trained populations — no online weight updates allowed. The only way to handle KMNIST is for a new species to emerge from the ecosystem’s failure buffer.

Spawn fired at step 20,103 — only 103 steps into Phase B (KMNIST introduction). The rolling-100-example ensemble accuracy dropped below 55% for 30 consecutive steps as both frozen specialists failed every KMNIST example. species2 (parent=mnist) was spawned with mnist’s genome population, inheriting the failure buffer, trained silently for 2000 steps, then joined ensemble voting.
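The trigger described here (rolling 100-example accuracy below 55% for 30 consecutive steps) can be sketched as a small state machine. This is an illustration and not the framework's code: the class name `SpawnTrigger`, the requirement that the window be full before counting, and the re-arm behaviour after firing are all assumptions.

```python
from collections import deque

class SpawnTrigger:
    """Fires when rolling accuracy stays below a threshold for `patience` steps."""

    def __init__(self, window=100, threshold=0.55, patience=30):
        self.results = deque(maxlen=window)  # 1 = ensemble correct, 0 = wrong
        self.threshold = threshold
        self.patience = patience
        self.streak = 0                      # consecutive below-threshold steps

    def update(self, ensemble_correct):
        """Record one example; return True when a spawn should fire."""
        self.results.append(1 if ensemble_correct else 0)
        window_full = len(self.results) == self.results.maxlen
        rolling_acc = sum(self.results) / len(self.results)
        if window_full and rolling_acc < self.threshold:
            self.streak += 1
        else:
            self.streak = 0
        if self.streak >= self.patience:
            self.streak = 0                  # ASSUMPTION: re-arm after firing
            return True
        return False

trigger = SpawnTrigger()
for _ in range(100):            # warm-up: the ensemble is always right
    trigger.update(True)

fired_at = None
for step in range(1, 201):      # novel task arrives: the ensemble is always wrong
    if trigger.update(False):
        fired_at = step
        break
print("spawn fired", fired_at, "steps after the shift")
```

In this idealized stream (perfect accuracy, then total failure) the trigger fires 75 steps after the shift: 46 failures to pull the window under 55%, then 29 more to complete the streak. The reported 103 steps need not match, since the exact Phase B stream composition and bookkeeping are not given here.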

End of Phase C:

| species | MNIST | Fashion | KMNIST | conn |
|---|---|---|---|---|
| mnist (frozen) | 92.4% | 0.0% | 0.0% | 1999 |
| fashion (frozen) | 0.0% | 83.6% | 0.0% | 1953 |
| species2 (new) | 65.0% | 57.2% | 77.2% | 2599 |

But ensemble Phase C KMNIST accuracy was only 66% — 11pp below species2’s individual KMNIST competence. The G3 confidence-wrong-vote problem in temporal form: frozen specialists’ overconfident wrong votes on MNIST/Fashion classes dilute species2’s correct KMNIST predictions in the averaging.

G5 v2: knowledge-aware self-abstention + multi-speciation

G5 adds per-species “class diet” tracking. Each frozen species suppresses its softmax outputs on classes outside its training diet (×0.1, then renormalize). New species use raw softmax (their diet is still being built).
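The suppression step itself is short. A minimal sketch of the stated rule (scale out-of-diet classes by 0.1, then renormalize), with hypothetical function names and a made-up 3-class probability vector; how the adjusted vectors are combined is assumed to be the same softmax averaging used in G3.

```python
def suppress_out_of_diet(probs, diet, factor=0.1):
    """Scale probabilities of classes outside `diet` by `factor`, then renormalize."""
    scaled = [p if cls in diet else p * factor for cls, p in enumerate(probs)]
    total = sum(scaled)
    return [p / total for p in scaled]

def diet_aware_ensemble(outputs, diets):
    """Average per-species vectors; species with no diet entry vote with raw softmax."""
    adjusted = [
        suppress_out_of_diet(probs, diets[name]) if name in diets else probs
        for name, probs in outputs.items()
    ]
    n = len(adjusted)
    return [sum(col) / n for col in zip(*adjusted)]

# HYPOTHETICAL 3-class example: a frozen species whose diet is classes {0, 1}.
raw = [0.5, 0.1, 0.4]
print(suppress_out_of_diet(raw, diet={0, 1}))
```

Class 2 drops from 0.4 to 0.0625 after renormalization, so the species' vote on a class it never trained on is reduced by much more than the nominal factor once the in-diet mass is rescaled.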

Two new species spawned across the run in response to graded environmental pressure:

| condition | Phase A rolling | Phase C KMNIST | n_species spawned |
|---|---|---|---|
| G4 (adaptive) | 80% | 72% | 0 (both pops became generalists) |
| G4b v2 (frozen + spawn) | 70% | 66% | 1 |
| G5 v2 (frozen + diet + spawn) | 88% | 79.5% | 2 |

Final per-species accuracies (run completed at step 200K):

| species | parent | M acc | F acc | K acc |
|---|---|---|---|---|
| mnist (frozen) | | 91.7% | 0% | 0% |
| fashion (frozen) | | 0% | 83.7% | 0% |
| species2 | mnist | 62.4% | 57.2% | 72.8% |
| species3 | mnist | 62.7% | 56.2% | 70.9% |

Phase C ensemble: 88% rolling, M=89%, F=82%, K=79.5% — a +13.5pp improvement on KMNIST over G4b v2’s 66% plateau. The ecosystem is now compositional: multiple species can co-emerge in response to distinct stress events, sharing ancestral lineage (both spawned from MNIST parent) but training under different selection pressures and converging to similar KMNIST specializations through parallel evolution.

G4c: single-niche replay baseline (the fair comparison)

Does ecosystem partitioning actually buy anything over a monolithic network with the same failure-buffer replay? G4c runs a single niche of 60 individuals (matching G5 v2’s 2×30 specialists in total compute), pre-trained on 50/50 MNIST+Fashion, with the same 1000-example failure-buffer replay on the same 3-phase stream.

| condition | Phase C rolling | Phase C K |
|---|---|---|
| G4 (2 species, adapt) | 79% | 72% |
| G4b v2 (frozen + spawn) | 74% | 66% |
| G4c (single niche + replay) | 75% | 75% |
| G5 v2 (frozen + diet + multi-spawn) | 88% | 79.5% |

Single-niche replay matches the simpler ecosystem variants (G4 and G4b) at the same compute. Only G5 v2’s full mechanism (frozen specialists + diet-aware suppression + multi-spawn) beats the baseline by a meaningful margin (+13pp rolling, +4.5pp K). The ecosystem framework earns its keep with the full design; simpler partitionings are roughly equivalent to a single niche with replay.

G6: hybrid adapt + speciate (counter-intuitive result)

What if pre-trained species are allowed to adapt AND new species can spawn? The hybrid should get the best of both worlds. Instead, it underperforms G5 v2:

| condition | Phase C rolling | Phase C K | mnist specialty | fashion specialty |
|---|---|---|---|---|
| G5 v2 (frozen + spawn) | 88% | 79.5% | 91.7% (kept) | 83.7% (kept) |
| G6 (adapt + spawn) | 82% | 80% | 78% (lost) | 68% (lost) |

Letting pre-trained species adapt erodes their specialization. The mnist species’ MNIST accuracy dropped from 92% to 78% as it absorbed Fashion and KMNIST training. Net effect: −6pp rolling for a +0.5pp KMNIST gain. Specialization is precious; preserve it. Only one new species spawned in G6 (vs G5 v2’s two) because adapting species don’t fail sustainedly.

G7: cross-niche transfer is NEGATIVE

Can an evolved MNIST specialist’s architecture transfer as a useful prior to KMNIST? Group B established the two tasks have opposite locality preferences (MNIST → spatial, KMNIST → distributed). G7 tests transfer directly:

| step | warm-start (clone MNIST specialist) | fresh-init | delta |
|---|---|---|---|
| 10K | 0.720 | 0.739 | −0.020 |
| 50K | 0.801 | 0.817 | −0.016 |
| 100K | 0.822 | 0.833 | −0.011 |

Warm-start trails fresh-init by 1-2pp throughout. The MNIST specialist’s spatial-bias patches are actively wrong for KMNIST, and evolution has to fight uphill to undo the inductive bias. Architectural specialization is task-conditional, and a wrong specialization actively interferes with learning a new task. This is why G5 v2’s “spawn fresh new species” approach works better than G6’s “let existing species adapt” — the parent’s inductive bias is net-harmful for a sufficiently different new task.

G8: longer sequences with EMNIST — one species per novel task

Does the ecosystem keep speciating as more novel tasks arrive? Extended G5 v2’s 3-phase stream to 5 phases, adding EMNIST (filtered to labels 0-9) after KMNIST. Same mechanics.

Two spawn events fired, one per novel-task introduction:

Final per-species lifetime accuracies:

| species | parent | M | F | K | E |
|---|---|---|---|---|---|
| mnist (frozen) | | 91.7% | 0% | 0% | 0% |
| fashion (frozen) | | 0% | 83.9% | 0% | 0% |
| species2 (KMNIST intro) | mnist | 64% | 57% | 74.9% | 85% |
| species3 (EMNIST intro) | mnist | 63% | 57% | 61% | 85.5% |

Phase E ensemble: 87% rolling, M=85% F=83% K=73% E=92%.

Each new species specialized in the task that was novel when it spawned. species2 (Phase B) is a KMNIST specialist; species3 (Phase D) is an EMNIST specialist. The mechanism is self-regulating: spawn events fire only at novel-task introductions, so the ecosystem doesn’t accumulate species without bound.

This is the most direct experimental confirmation of the biological pattern: pre-existing species preserve their specializations forever, novel tasks trigger fresh speciation events, new species specialize in the task that triggered their emergence.

What Group G establishes (full battery)

After G1, G3, G4, G4b, G5, G4c, G6, G7, G8 — the user’s “neural ecosystem” hypothesis is now thoroughly characterized:

  1. G1: speciation is real (199× variance ratio vs same-data control).
  2. G3: static ecosystem beats single networks via naive averaging (+2.66pp; oracle ceiling +7.79pp).
  3. G4: existing species can adapt to novel tasks via dead-time training (no new species needed if existing ones can generalize).
  4. G4b: frozen species + spawn mechanism produces speciation; but ensemble averaging dilutes new species’ votes.
  5. G5 v2: diet-aware self-abstention + multi-spawn beats all variants and the single-niche baseline. Best design.
  6. G4c: single niche + replay matches G4/G4b but G5 v2 wins by +13pp — the framework earns its keep with the full mechanism.
  7. G6: adapt + speciate is worse than frozen + speciate by 6pp — preserving specialists matters more than letting them generalize.
  8. G7: cross-task warm-start transfers negatively (−1 to −2pp): evolved geometry is task-specific.
  9. G8: one new species per novel task introduction; ecosystem grows compositionally and self-regulates.

The design principle: preserve specialists, spawn-on-demand for novel tasks, knowledge-aware suppression on out-of-diet votes. This is the working “neural ecosystem” recipe.

Still open

  1. Multi-source warm-start: clone from multiple specialists for a more general prior (G7 follow-up).
  2. Lateral gene transfer: migration/crossover across species. Currently species don’t interact.
  3. Routing-time efficiency: each example currently requires N forward passes (one per alive species). A learned router could pick a subset.

G9 trichotomy: rederiving classical ecology from reward-and-training rules

After the G4–G8 sequence established speciation-under-regime-shift, the user asked a deeper question: what if the environment isn’t a phased sequence of novel tasks but a stationary heterogeneous mix — say 60% MNIST, 25% Fashion, 10% KMNIST, 5% EMNIST — and species have to survive on energy gained from solving puzzles against attempt costs and metabolic costs? Does carrying capacity emerge? Do niches partition by frequency? Do generalists invade or specialists dominate?

G9 implements ecological economics on top of the speciation framework:

The three variants differ in one rule each:

| variant | training rule | reward rule |
|---|---|---|
| G9 baseline | full failure-buffer | split-the-kill (correct attempters share reward) |
| G9b | niche-bound (target task only) | split-the-kill |
| G9d | full failure-buffer | winner-take-all (most-confident correct gets all) |
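The two reward rules in the table differ only in how one example's reward is divided among the species that attempted it. A sketch with hypothetical attempt records and field names (`species`, `correct`, `p_true`), assuming "most confident" means the highest softmax probability on the true class:

```python
def split_the_kill(attempts, reward):
    """Every correct attempter gets an equal share of the reward."""
    winners = [a["species"] for a in attempts if a["correct"]]
    if not winners:
        return {}
    return {name: reward / len(winners) for name in winners}

def winner_take_all(attempts, reward):
    """The correct attempter with the highest probability on the true class gets it all."""
    correct = [a for a in attempts if a["correct"]]
    if not correct:
        return {}
    best = max(correct, key=lambda a: a["p_true"])
    return {best["species"]: reward}

# HYPOTHETICAL attempt records for a single MNIST example.
attempts = [
    {"species": "mnist",    "correct": True,  "p_true": 0.81},
    {"species": "species2", "correct": True,  "p_true": 0.97},
    {"species": "fashion",  "correct": False, "p_true": 0.02},
]

print("split-the-kill :", split_the_kill(attempts, reward=1.0))
print("winner-take-all:", winner_take_all(attempts, reward=1.0))
```

Under split-the-kill both correct species are paid; under winner-take-all the correct but less peaked species earns nothing on this example, which is the asymmetry G9d turns on.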

Headline results

| variant | dynamic | survivors | extinctions | M | F | K | E | ensemble |
|---|---|---|---|---|---|---|---|---|
| G9 baseline | generalist invasion | 3 | MNIST | 80% | 84% | 65% | 75% | 82% |
| G9b | carrying capacity | 4 (all specialists) | 0 | 92% | 84% | 71% | 86% | 87% |
| G9d | runaway confidence | 3 | MNIST + species3 | 87% | 81% | 65% | 72% | 84% |

G9b: Galapagos finch isolation

Niche-bound training prevents diet expansion. Each spawned species trains exclusively on its target niche’s failures, so its diet never broadens, and it never attempts examples outside its niche. Result: four specialists, one per niche, zero competition, zero extinctions. Per-niche accuracy matches or exceeds every other variant (Fashion ties the G9 baseline at 84%; the other three niches are strictly higher). This is the textbook carrying-capacity result. Each specialist’s reward stream sustains it without overlap; the Lotka-Volterra math works out:

| species | per-task attempts | per-task acc | final energy |
|---|---|---|---|
| MNIST (frozen) | M:240K only | 92% | +143K |
| Fashion (frozen) | F:100K only | 84% | +179K |
| species2 (K specialist) | K:39K only | 71% | +151K |
| species3 (E specialist) | E:19K only | 86% | +221K |

G9d: Fisher’s runaway display selection

The user predicted this one before the run: “I’ll be interested to see if you rederive the biological basis for arrogance.” Under winner-take-all, reward goes only to the species with the highest softmax peak on the truth class. Calibrated uncertainty loses to loud overconfidence.

The MNIST specialist had 91.6% accuracy, higher than any survivor, but went extinct at step 154,119 anyway. Two surviving generalists with concentrated full-buffer training had sharper softmax peaks (from cross-entropy gradient descent saturating on a small training set), and under WTA, peakedness beat accuracy. species2: 77% accurate, +258K energy. MNIST: 92% accurate, dead. The fast extinction of species3 at step 10,850 confirms the founder-advantage corollary: the first species to develop loud signaling locks the niche; later species starve before they evolve competitive peaks.

This is the dynamic that produces peacock tails, mating displays, and status hierarchies in biology — sexually selected display traits that win competitions regardless of underlying fitness. In our system the “display” is softmax peak height; the “mate choice” is the WTA reward gate; the runaway is gradient descent + selection compounding the peakedness across generations. Honest accuracy loses to confident display.

What G9 establishes

Three biologically-distinct evolutionary regimes from the same neuroevolution substrate, distinguished only by reward-and-training rules:

We’ve now rederived three classical ecological mechanisms (allopatric speciation in G5/G8, niche partitioning in G9b, sexual selection in G9d) from a single neuroevolution framework. Each requires only a different rule for who eats what and how they train.

The “neural ecosystem” framing started as expressive language for a system that happens to use neural networks. By the end of the battery, it had earned a different status: it made specific experimental predictions that the data confirmed, and the predictions matched the metaphor’s mechanism rather than its surface imagery.

G6 (hybrid adapt + speciate) — biology predicted it would fail. Anteaters don’t gradually become omnivores when ants get scarce. Letting the pre-trained MNIST species “learn fruit” via the shared failure buffer should erode its anteater-ness without producing a meaningfully better generalist. Result: G6 dropped MNIST accuracy from 91.7% (G5 v2 frozen) to 78.2% (G6 adapted), with no ensemble gain. Specialization is precious; mixing pressures within a lineage degrades it.

G7 (cross-task transfer) — biology predicted it would be negative. A desert-adapted lizard in a swamp has the wrong adaptations. Group B established MNIST and KMNIST have opposite locality preferences, so the MNIST specialist’s spatial-bias patches should interfere with KMNIST learning. Result: warm-start trailed fresh-init by 1-2pp throughout 100K steps. Evolution had to undo the wrong inductive bias before finding the right one.

G8 (one species per novel task) — biology predicted the pattern exactly. This is allopatric speciation in textbook form: new environment, sustained selection pressure, reproductive isolation, distinct specialization. The G8 result: two spawn events, exactly one per novel-task introduction, each new species specialized in the task that triggered it. The system doesn’t accumulate species without bound — the spawn trigger fires only at regime shifts.

Parallel evolution in G5 v2. species2 and species3 spawned from the same parent and converged to similar KMNIST specializations. That resembles the anteater/pangolin pattern (similar selection pressure, convergent specialization), with the caveat that anteaters and pangolins come from different lineages whereas these two species share a parent.

The mechanism is small

The working ecosystem recipe is three rules:

  1. Preserve specialists. Don’t let pre-trained species adapt to new tasks; they get worse at their original task without becoming better at new ones.
  2. Suppress out-of-diet votes. Each species only votes within the classes it trained on. Knowledge-aware self-abstention prevents overconfident wrong votes from diluting correct ones.
  3. Spawn on demand. When the ensemble’s rolling accuracy collapses for a sustained window, clone a parent and train it on the accumulated failure buffer.

All the interesting behavior — one species per novel task, parallel evolution, self-regulating species count, the +13pp improvement over single-niche replay — emerges from those three rules interacting with environmental pressure. Biology isn’t complicated, it’s compositional — simple rules running for a long time on lots of substrate. The framing isn’t decorative; it’s a working theory.

(Full discussion in notes/group_g/biology_notes.md.)

G9 hexology: six evolutionary regimes from one framework

Extending the G9/G9b/G9d trichotomy with three one-variable variants asks targeted questions:

Results:

| variant | one-variable change | regime | survivors | extinctions |
|---|---|---|---|---|
| G9 | baseline | generalist invasion | 3 | MNIST |
| G9b | niche-bound training | carrying capacity | 4 | 0 |
| G9d | WTA reward | Fisher’s runaway | 3 | MNIST + species3 |
| G9bd | niche-bound + WTA | niche-binding dominates | 4 | 0 |
| G9e | WTA + per-attempt calibration penalty | mass extinction monoculture | 1 | 8 |
| G9f | niche-bound + environment flip | frequency-invariant sustainability | 4 | 0 |

G9bd: niche-binding dominates over WTA

Essentially identical to G9b’s result: 4 alive specialists, zero extinctions, M=92% / F=84% / K=70% / E=86% (G9b reached K=71%). The WTA reward distribution is silenced because niche-binding pre-empts the layer where WTA would act: multiple species never attempt the same example, so the “who wins the reward” question never arises. The mechanisms compose by one dominating the other, not by both contributing.

G9e: per-attempt calibration penalty produces monoculture

8 of 9 species went extinct, with the only survivor (species4) at +331K energy. The MNIST specialist died at step 1,854 with 92.7% accuracy. Even Fashion, with its narrow diet and pre-trained calibrated peaks, was driven extinct at step 389,858 by slow erosion of its income.

Why the prediction failed: cost is symmetric across attempters (winners and losers both pay), income is winner-take-all (only winner gets reward). Net for a peaky-correct species: +reward − cost > 0. Net for a peaky-wrong species: 0 − cost < 0. The penalty burden is uniform; the income asymmetry compounds dominance instead of bounding it.

Why real peacock tails do bound the runaway: real tails are costly to maintain (per-step survival cost), not costly to display (per-attempt cost). The tail imposes constant metabolic burden whether or not it’s currently displaying. Our calibration penalty modeled the wrong cost type. G9g follow-up: per-step metabolic cost proportional to softmax peakedness. Predict: produces the equilibrium I originally expected here.

G9f: rarity-weighted rewards are frequency-invariant by construction

Environment shifts 60/25/10/5 → 5/10/25/60 at step 200K. Prediction: MNIST starves when its abundant niche becomes rare. Reality: MNIST has the highest final energy of any species (+200K). Zero extinctions.

The math, doing it carefully: for a specialist with accuracy A on a niche of frequency f, income per step is f × A × (reward_per_solve) = f × A × (K/f) = A × K. The frequency cancels. Income is independent of environment composition.
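The cancellation can be checked numerically. A small sketch of the formula above, where the reward constant `K = 1.0` is a hypothetical value (the notes do not give it) and the accuracy is the 92% MNIST figure from the G9b table:

```python
import math

def income_per_step(freq, accuracy, k):
    """Expected income per step for a niche-bound specialist under rarity-weighted reward."""
    reward_per_solve = k / freq   # rarer niche -> larger reward per solve
    return freq * accuracy * reward_per_solve

K = 1.0    # HYPOTHETICAL reward constant
A = 0.92   # MNIST specialist accuracy from the G9b table

before = income_per_step(0.60, A, K)   # MNIST at 60% of the stream
after = income_per_step(0.05, A, K)    # MNIST at 5% after the flip

print(f"income before flip: {before:.4f}")
print(f"income after flip:  {after:.4f}")
```

Both values equal A × K up to floating-point rounding. The expression is undefined at `freq = 0`, which is the one case where the cancellation argument stops applying.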

Real biology: obligate specialists like pandas thrive on bamboo whether bamboo is rare or abundant — what matters is whether bamboo exists, not how much. The framework reproduces this exactly because rarity-weighting encodes “rare = valuable” by construction. Specialists are frequency-invariant.

To force extinction via environment in this framework, we’d need a niche to go to zero frequency (complete food loss), not just become rare. G9i territory.

Prediction scorecard

| variant | prediction | result | score |
|---|---|---|---|
| G9 | carrying capacity | generalist invasion | wrong |
| G9b | clean carrying capacity | matched | right |
| G9d | arrogance runaway | matched (user predicted) | right |
| G9bd | “cleanest yet” | niche-binding dominates, WTA silent | partial |
| G9e | bounded equilibrium | mass extinction monoculture | wrong |
| G9f | MNIST extinction | frequency-invariant, no extinctions | wrong |

Two right, one partial, three wrong. The wrong predictions are more informative than the right ones: each surfaced a non-obvious dynamic of the framework that careful prior math would have revealed.

Six classical biology mechanisms from one framework

| variant | classical analog |
|---|---|
| G9 (laissez-faire) | invasive species, r-strategist generalists |
| G9b (niche-bound) | allopatric speciation, Galapagos finches |
| G9d (WTA) | Fisher’s runaway, peacock sexual selection |
| G9bd (niche-bound + WTA) | reproductive isolation pre-empting mate choice |
| G9e (WTA + per-attempt calibration) | competitive exclusion principle (Gause), monoculture |
| G9f (niche-bound + env shift) | obligate specialist robustness (panda/bamboo) |

Six textbook ecological mechanisms recovered from one ~1100-line Rust framework by varying which rule applies to whom. The metaphor isn’t decorative — it’s a working theory. Each rule change produces the biological outcome the metaphor implies, and when the metaphor implies something subtle (G9f’s frequency invariance, G9e’s monoculture collapse), the framework actually produces it.

Compute and methodology

Full Group G battery: ~2.5 hours wall time on a 16-thread i9-9900K. G1 + G3 + G4 + G4b v2 + G5 v2 + G4c + G6 + G7 + F4 (Group F follow-up) + G8 covered the speciation question end-to-end. The load-bearing work was experimental design — particularly the same-data control (G1), the spawn-trigger criterion (G4b/G5), and the fair single-niche baseline (G4c).