This is the raw performance optimization journal Claude maintained while investigating throughput on the [128, 64] architecture. Reproduced exactly as produced.

Performance Journal

Systematic measurements for optimization work. All benchmarks on [128, 64] @50% interlayer, pop 200, pure MNIST.

Baseline: AoS per-individual rayon (pre-batched, commit ca3ecdd)

Architecture: Each of 200 individuals processes examples independently via rayon. Per-individual HotConnection { from_idx: u32, weight: f32 } (8 bytes). Scalar inner loop, 16-thread parallelism.
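As a minimal sketch of the layout described above: the struct fields come from the journal, while the helper name weighted_input and the sample data are hypothetical, added only for illustration. It shows why each iteration is a data-dependent load followed by a multiply-accumulate.

```rust
use std::mem::size_of;

/// One incoming connection, as described in the journal (8 bytes total).
#[derive(Clone, Copy)]
struct HotConnection {
    from_idx: u32, // index of the source node in `post_activations`
    weight: f32,
}

/// Hypothetical helper: weighted sum over one node's incoming connections.
/// Each iteration gathers `post_activations[from_idx]`, so the load address
/// depends on the data and the loop stays scalar.
fn weighted_input(conns: &[HotConnection], post_activations: &[f32]) -> f32 {
    let mut acc = 0.0f32;
    for c in conns {
        acc += c.weight * post_activations[c.from_idx as usize];
    }
    acc
}

fn main() {
    // Illustrative data only, not taken from the benchmark.
    let post_activations = [1.0f32, 2.0, 3.0, 4.0];
    let conns = [
        HotConnection { from_idx: 3, weight: 0.5 },
        HotConnection { from_idx: 0, weight: 2.0 },
    ];
    println!("bytes per connection: {}", size_of::<HotConnection>());
    println!("weighted input: {}", weighted_input(&conns, &post_activations));
}
```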

Measurement: timeout 30 cargo run --release, count steps at first log line.

50,000 steps in 30 seconds = 1,667 steps/sec
Estimated full run (1.8M steps): ~18 minutes
Profile: forward 35.6%, backward 63.3%, rayon <0.5%
Estimated throughput: ~20 GFLOP/s (4% of i9-9900K peak 500 GFLOP/s)
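The baseline figures above can be re-derived with a few lines of arithmetic. The function names here are hypothetical; the inputs are the journal's own numbers.

```rust
/// Steps per second from a step count and a wall-clock window in seconds.
fn steps_per_sec(steps: u64, seconds: f64) -> f64 {
    steps as f64 / seconds
}

/// Minutes needed to finish `total_steps` at a given rate (steps/sec).
fn eta_minutes(total_steps: u64, rate: f64) -> f64 {
    total_steps as f64 / rate / 60.0
}

/// Achieved throughput as a percentage of the quoted peak.
fn percent_of_peak(achieved_gflops: f64, peak_gflops: f64) -> f64 {
    100.0 * achieved_gflops / peak_gflops
}

fn main() {
    let rate = steps_per_sec(50_000, 30.0);
    println!("{:.0} steps/sec", rate);
    println!("{:.1} min for 1.8M steps", eta_minutes(1_800_000, rate));
    println!("{:.0}% of peak", percent_of_peak(20.0, 500.0));
}
```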

Why 4% of peak: Inner loop does random-access post_activations[from_idx] per connection. Not vectorizable because from_idx varies. Each connection is a dependent load → multiply → accumulate. Scalar throughput limited by load latency, not compute.

Attempt 1: Monolithic SoA (all 200 individuals, single-threaded)

Idea: Build superset topology, SoA layout with 256-wide individual dimension, autovectorized inner loops.

Result: Confirmed autovectorization (vmulps/vaddps in perf annotate). But 20x SLOWER than baseline — lost 16-thread parallelism. 1 thread × 8-wide SIMD < 16 threads × scalar.

Lesson: Cannot trade thread parallelism for SIMD without also maintaining thread count.

Attempt 2: Group-parallel SoA (13 groups of 16, rayon across groups)

Idea: Split 200 individuals into groups of 16. Each group is a small SoA (16-wide, 2 AVX2 iterations). Rayon dispatches across groups. One rayon dispatch per batch (not per example).
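A rough sketch of the grouping idea, assuming a flat Vec<f32> per field with 16 contiguous lanes per connection or node. The GroupSoA type and its methods are hypothetical, and the rayon dispatch across groups is left out so the example needs only the standard library.

```rust
const POPULATION: usize = 200;
const GROUP_SIZE: usize = 16; // individuals per group, from the journal
const AVX2_F32_LANES: usize = 8; // f32 values per 256-bit register

/// Hypothetical SoA storage for one group: every connection (or node)
/// owns GROUP_SIZE contiguous f32 slots, one per individual in the group.
struct GroupSoA {
    weights: Vec<f32>,          // n_connections * GROUP_SIZE
    post_activations: Vec<f32>, // n_nodes * GROUP_SIZE
}

impl GroupSoA {
    fn new(n_connections: usize, n_nodes: usize) -> Self {
        GroupSoA {
            weights: vec![0.0; n_connections * GROUP_SIZE],
            post_activations: vec![0.0; n_nodes * GROUP_SIZE],
        }
    }

    /// The GROUP_SIZE weights of connection `conn`, one per individual.
    fn conn_weights(&self, conn: usize) -> &[f32] {
        &self.weights[conn * GROUP_SIZE..(conn + 1) * GROUP_SIZE]
    }
}

/// Number of groups needed to cover the population (rounds up).
fn group_count(population: usize) -> usize {
    (population + GROUP_SIZE - 1) / GROUP_SIZE
}

fn main() {
    let groups = group_count(POPULATION);
    let padding_lanes = groups * GROUP_SIZE - POPULATION;
    let g = GroupSoA::new(15_000, 987);
    println!("groups: {groups}, unused lanes in last group: {padding_lanes}");
    println!("AVX2 iterations per connection: {}", GROUP_SIZE / AVX2_F32_LANES);
    println!(
        "weights: {} f32, activations: {} f32, lanes per connection: {}",
        g.weights.len(),
        g.post_activations.len(),
        g.conn_weights(0).len()
    );
}
```

With 200 individuals and groups of 16, the last group is only half full, so some lanes carry padding rather than real individuals.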

Status: Built and compiles. Not yet properly measured. Initial tests ran only 30s and showed no Step output. A possible explanation: with batch_size=500, the first log (step 5000) requires 10 batches of 500 examples each, and the batched path processes examples sequentially within each group (500 forward+backward passes per group per batch).

Expected throughput: 13 groups on ~8 cores (some cores handle 2 groups). Each group does 15K × 16 multiply-adds per forward per example. 500 examples per batch = 120M ops per group per batch. At ~10 GFLOP/s per core, ~12ms per group per batch. 10 batches to reach first log at step 5000 = ~120ms. Should show output within seconds.

Actual measurement (120 seconds):

10,000 steps in 120 seconds = 83 steps/sec
This is 20x slower than the baseline (1,667 steps/sec)
Crashed at first evolve() due to missing node in superset (fixed)

Why 20x slower: The batched approach processes examples sequentially within each group: 500 forward+backward passes per group per batch. The old approach also processes examples sequentially within each individual, but it spreads 200 individual tasks across 16 threads, which keeps every thread busy. The batched approach has only 13 group tasks for the same 16 threads.

But 13 groups ≈ 200 individuals in terms of parallelism. The problem must be elsewhere:

Each group does 15K × 16 multiply-adds per forward = 240K ops. The old code does 15K × 1 = 15K ops per individual forward. So each group does 16x more work (the SIMD inner loop is 16-wide).
But SIMD should make that 16x work take ~2x time (16 elements / 8 AVX2 lanes = 2 iterations). So expected: at worst ~2x slower per group than per individual. Since the 13 groups cover the same 200 individuals, the overall slowdown should be no more than ~2x.
Actual: 20x slower. Something is at least 10x worse than expected.

Hypothesis: The superset topology overhead. With ~98% shared connections + ~2% unique, the superset has ~2% more connections than any individual. But the bigger issue may be cache behavior: each group’s weight array is 15K × 16 × 4 = 960 KB. That doesn’t fit in L2 (256 KB). The old per-individual approach had 15K × 8 = 120 KB (HotConnection array), which fit comfortably in L2.

Hypothesis 2: The activation array random access. In the old code, post_activations is 987 × 4 = 4 KB (fits in L1). In the batched code, post_activations per group is 987 × 16 × 4 = 63 KB. Doesn’t fit in L1 (32 KB), spills to L2. Each random access by from_idx now loads 64 bytes (one cache line = 16 floats = exactly one group’s worth) from L2 instead of L1. L2 latency is roughly 12 cycles vs roughly 4-5 cycles for L1 on this microarchitecture.
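The working-set sizes behind both hypotheses can be checked mechanically. This sketch uses the journal's figures (15K connections, 987 nodes, 32 KB L1, 256 KB L2); the constant names and the fits helper are hypothetical.

```rust
const CONNECTIONS: usize = 15_000; // "15K" in the journal
const NODES: usize = 987;
const GROUP_SIZE: usize = 16;
const F32_BYTES: usize = 4;
const HOT_CONNECTION_BYTES: usize = 8;

const L1D_BYTES: usize = 32 * 1024;
const L2_BYTES: usize = 256 * 1024;

// AoS: one individual's data.
const AOS_CONN_BYTES: usize = CONNECTIONS * HOT_CONNECTION_BYTES;
const AOS_ACT_BYTES: usize = NODES * F32_BYTES;
// SoA: one group's data (16 lanes per connection or node).
const SOA_WEIGHT_BYTES: usize = CONNECTIONS * GROUP_SIZE * F32_BYTES;
const SOA_ACT_BYTES: usize = NODES * GROUP_SIZE * F32_BYTES;

/// True if a working set of `bytes` fits in a cache of `cache_bytes`.
fn fits(bytes: usize, cache_bytes: usize) -> bool {
    bytes <= cache_bytes
}

fn main() {
    println!("AoS connections: {} B, fits L2: {}", AOS_CONN_BYTES, fits(AOS_CONN_BYTES, L2_BYTES));
    println!("AoS activations: {} B, fits L1: {}", AOS_ACT_BYTES, fits(AOS_ACT_BYTES, L1D_BYTES));
    println!("SoA weights:     {} B, fits L2: {}", SOA_WEIGHT_BYTES, fits(SOA_WEIGHT_BYTES, L2_BYTES));
    println!("SoA activations: {} B, fits L1: {}", SOA_ACT_BYTES, fits(SOA_ACT_BYTES, L1D_BYTES));
}
```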

Need to profile to confirm.

Profile result: ALL scalar vmulss — LLVM did NOT autovectorize the inner loops. The for k in 0..GROUP_SIZE loop compiles to scalar despite contiguous access. 7.7% of time in bitmask cleanup (zeroing weights).

Root cause: Aliasing. The inner loop reads group.post_activations[src_offset+k] and group.weights[w_offset+k] while writing to group.pre_activations[dest_offset+k]. All three are Vec<f32> fields on the same struct. LLVM cannot prove they don’t overlap (they don’t, but the compiler doesn’t know that from the raw pointer arithmetic). Without proving non-aliasing, LLVM falls back to scalar.

Fix: Pass separate slices (&mut [f32] and &[f32]) to the inner loop function. With distinct slice references, LLVM can apply noalias and vectorize.

Attempt 2a: Aliasing fix (separate slice arguments)

Change: Extracted fma_slice() and backward_conn_update() helper functions that take separate &mut [f32]/&[f32] arguments. LLVM can now prove non-aliasing.
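A minimal sketch of what such a helper might look like. Only the name fma_slice comes from the journal; the signature, the body, and the buffers in main are assumptions. Despite the name, this version does a separate multiply and add rather than a fused operation.

```rust
const GROUP_SIZE: usize = 16;

/// Sketch of a helper in the spirit of `fma_slice()`:
/// dest[k] += weights[k] * src[k] for every lane k.
/// `dest` is an exclusive borrow and the inputs are shared borrows, so the
/// borrow rules guarantee the destination does not overlap either input.
fn fma_slice(dest: &mut [f32], weights: &[f32], src: &[f32]) {
    // zip stops at the shortest slice and needs no per-element indexing.
    for ((d, w), s) in dest.iter_mut().zip(weights).zip(src) {
        *d += *w * *s;
    }
}

fn main() {
    // Separate buffers stand in for the struct fields named in the journal.
    let mut pre_activations = vec![0.0f32; 2 * GROUP_SIZE];
    let weights = vec![0.5f32; GROUP_SIZE];
    let post_activations = vec![2.0f32; GROUP_SIZE];

    // Accumulate into the second node's 16 lanes only.
    let dest_offset = GROUP_SIZE;
    fma_slice(
        &mut pre_activations[dest_offset..dest_offset + GROUP_SIZE],
        &weights,
        &post_activations,
    );
    println!("{:?}", &pre_activations[dest_offset..]);
}
```

Iterating with zip instead of indexing also avoids per-element bounds checks. Whether that matters for this workload is not something the journal measured.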

Result: vmulps/vaddps confirmed in perf annotate, but throughput is still only 42 steps/sec, worse than the scalar version's 83 steps/sec. Bounds checks from slice indexing are a possible contributor (not confirmed).

Profile breakdown:

40.6% vmovups (vectorized loads from memory)
29.1% vmulps (vectorized multiplies)
5.8% bitmask cleanup

Root cause: memory bandwidth wall. Each connection access in SoA loads:

64 bytes of weights (16 × f32)
64 bytes of source activations (16 × f32)
= 128 bytes per connection

vs the old AoS approach:

8 bytes (HotConnection: from_idx + weight)
~4 bytes (random activation load, usually in L1 cache)
= 12 bytes per connection

That’s 10.7× more memory traffic for the same compute. The SIMD gain (2× throughput from vmulps vs vmulss) can’t overcome the 10.7× bandwidth penalty.
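The per-connection byte counts above reduce to a short calculation. The constant names are hypothetical; the values are the journal's.

```rust
const GROUP_SIZE: usize = 16;
const F32_BYTES: usize = 4;

// SoA: one access touches all 16 lanes of weights and of source activations.
const SOA_WEIGHT_BYTES: usize = GROUP_SIZE * F32_BYTES;
const SOA_ACTIVATION_BYTES: usize = GROUP_SIZE * F32_BYTES;
const SOA_BYTES_PER_CONN: usize = SOA_WEIGHT_BYTES + SOA_ACTIVATION_BYTES;

// AoS: one HotConnection plus one gathered f32 activation.
const AOS_HOT_CONNECTION_BYTES: usize = 8;
const AOS_ACTIVATION_BYTES: usize = 4;
const AOS_BYTES_PER_CONN: usize = AOS_HOT_CONNECTION_BYTES + AOS_ACTIVATION_BYTES;

/// Ratio of bytes touched per connection access, SoA over AoS.
fn traffic_ratio() -> f64 {
    SOA_BYTES_PER_CONN as f64 / AOS_BYTES_PER_CONN as f64
}

fn main() {
    println!("SoA: {} bytes per connection access", SOA_BYTES_PER_CONN);
    println!("AoS: {} bytes per connection access", AOS_BYTES_PER_CONN);
    println!("ratio: {:.1}x", traffic_ratio());
}
```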

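Per forward pass: 15K connections × 128 bytes = 1.9 MB. Per batch of 500: 960 MB per group. At L3 bandwidth ~40 GB/s: ~24 ms per batch per group, so 10 batches to step 5000 = ~240 ms per group for the forward pass alone. Even multiplied across all 13 groups (~3 seconds), this is well short of the ~119 seconds that 42 steps/sec implies for 5,000 steps, so this estimate does not by itself account for the observed slowdown.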

Fundamental lesson: The SoA batching approach only wins when compute is the bottleneck. For sparse networks with random-access patterns, memory bandwidth is the bottleneck, and SoA makes it worse by widening every access. The AoS per-individual approach is bandwidth-efficient because each individual’s working set (120 KB for HotConnection array + 4 KB activations) fits in L2.

Conclusion: Revert to AoS baseline

The batched SoA approach is architecturally wrong for this problem. The sparse network structure means:

Memory access patterns are irregular (random from_idx per connection)
Per-connection data (8 bytes in HotConnection) is much smaller than a cache line
The working set per individual (124 KB) fits in L2

The AoS approach naturally exploits this: each rayon thread keeps one individual’s data hot in L2, processes all examples for that individual, then moves on. The 16-thread parallelism compensates for the scalar inner loop.

The SoA approach would win for dense networks (matrix multiplication) where the access pattern is regular and compute-bound. For sparse networks, AoS + thread parallelism is superior.

Action: Revert to AoS baseline (commit ca3ecdd). The 1,667 steps/sec (18 min for [128,64]) is the best throughput measured so far for this architecture on CPU.

Measurement Protocol

For each configuration:

Run timeout 120 cargo run --release > /tmp/synth_perf_X.txt 2>&1
Count Step lines: grep -c "^Step" /tmp/synth_perf_X.txt
Get last step: grep "^Step" /tmp/synth_perf_X.txt | tail -1
Compute steps/sec = last_step / 120
Profile if needed: perf record -F 999 -- timeout 30 cargo run --release
Check vectorization: perf annotate --stdio <function> | grep vmulps

Always use 120-second runs minimum. Don’t kill early and assume failure.
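The last two counting steps of the protocol could be scripted roughly as follows. The log format is an assumption: each progress line is taken to start with "Step" followed by the step number (for example "Step 5000 ..."). The function names are hypothetical.

```rust
/// Returns the step number from the last line that starts with "Step".
/// Assumes the number is the second whitespace-separated token, possibly
/// followed by punctuation such as ':' or ','.
fn last_step(log: &str) -> Option<u64> {
    log.lines()
        .filter(|line| line.starts_with("Step"))
        .filter_map(|line| {
            line.split_whitespace()
                .nth(1)?
                .trim_end_matches(|c: char| !c.is_ascii_digit())
                .parse::<u64>()
                .ok()
        })
        .last()
}

/// Throughput over a fixed measurement window.
fn steps_per_sec(last_step: u64, window_secs: f64) -> f64 {
    last_step as f64 / window_secs
}

fn main() {
    // Made-up sample log, shaped like the assumed format.
    let log = "Compiling...\nStep 5000: ok\nStep 10000: ok\n";
    match last_step(log) {
        Some(step) => println!("{:.0} steps/sec", steps_per_sec(step, 120.0)),
        None => println!("no Step lines yet; keep the run going"),
    }
}
```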