This is the raw performance optimization journal Claude maintained while investigating throughput on the [128, 64] architecture. Reproduced exactly as produced.
Performance Journal
Systematic measurements for optimization work. All benchmarks on [128, 64] @50% interlayer, pop 200, pure MNIST.
Baseline: AoS per-individual rayon (pre-batched, commit ca3ecdd)
Architecture: Each of 200 individuals processes examples independently via rayon. Per-individual HotConnection { from_idx: u32, weight: f32 } (8 bytes). Scalar inner loop, 16-thread parallelism.
Measurement: timeout 30 cargo run --release, count steps at first log line.
- 50,000 steps in 30 seconds = 1,667 steps/sec
- Estimated full run (1.8M steps): ~18 minutes
- Profile: forward 35.6%, backward 63.3%, rayon <0.5%
- Estimated throughput: ~20 GFLOP/s (4% of i9-9900K peak 500 GFLOP/s)
Why 4% of peak: Inner loop does random-access post_activations[from_idx] per connection. Not vectorizable because from_idx varies. Each connection is a dependent load → multiply → accumulate. Scalar throughput limited by load latency, not compute.
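The scalar inner loop described above can be sketched as follows. This is a minimal illustration, not the project's actual code: only HotConnection, from_idx, weight, and post_activations come from the journal, while the function name accumulate_node and the #[repr(C)] attribute are assumptions made for the example.

```rust
/// One incoming connection, 8 bytes as stated in the journal.
/// `#[repr(C)]` is an assumption added here to pin the field layout.
#[derive(Clone, Copy, Debug)]
#[repr(C)]
pub struct HotConnection {
    pub from_idx: u32,
    pub weight: f32,
}

/// Hypothetical helper: weighted sum of one node's inputs.
/// Each iteration gathers `post_activations[from_idx]`, so the address of
/// the load depends on data and the loop is a load -> multiply -> accumulate
/// chain that the compiler cannot turn into contiguous vector loads.
pub fn accumulate_node(conns: &[HotConnection], post_activations: &[f32]) -> f32 {
    let mut sum = 0.0f32;
    for c in conns {
        sum += post_activations[c.from_idx as usize] * c.weight;
    }
    sum
}
```

The per-connection cost here is one 8-byte sequential read plus one 4-byte gather, which is the 12-byte figure used later in the journal.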
Attempt 1: Monolithic SoA (all 200 individuals, single-threaded)
Idea: Build superset topology, SoA layout with 256-wide individual dimension, autovectorized inner loops.
Result: Confirmed autovectorization (vmulps/vaddps in perf annotate). But 20x SLOWER than baseline — lost 16-thread parallelism. 1 thread × 8-wide SIMD < 16 threads × scalar.
Lesson: SIMD has to be added on top of thread parallelism, not in place of it. One thread with 8-wide vectors cannot make up for losing 16 threads.
Attempt 2: Group-parallel SoA (13 groups of 16, rayon across groups)
Idea: Split 200 individuals into groups of 16. Each group is a small SoA (16-wide, 2 AVX2 iterations). Rayon dispatches across groups. One rayon dispatch per batch (not per example).
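A rough sketch of the group dispatch idea, under stated assumptions: the journal uses rayon, but to stay dependency-free this example swaps in std::thread::scope with one scoped thread per group. The names group_count and for_each_group_parallel are hypothetical, and a flat per-individual f32 buffer stands in for the real group state.

```rust
use std::thread;

/// Individuals per group, as in the journal (16-wide, two AVX2 iterations).
pub const GROUP_SIZE: usize = 16;

/// Number of groups needed to cover `individuals` (200 -> 13).
pub fn group_count(individuals: usize) -> usize {
    (individuals + GROUP_SIZE - 1) / GROUP_SIZE
}

/// Hypothetical dispatcher: run `f(group_index, group_lanes)` once per group,
/// each on its own scoped thread. This stands in for a rayon parallel
/// iterator over groups; it is not the journal's implementation.
/// Note: the last chunk is shorter when the count is not a multiple of 16
/// (8 lanes for 200 individuals); a real SoA layout would likely pad it.
pub fn for_each_group_parallel<F>(lanes: &mut [f32], f: F)
where
    F: Fn(usize, &mut [f32]) + Sync,
{
    let f = &f;
    thread::scope(|s| {
        for (g, chunk) in lanes.chunks_mut(GROUP_SIZE).enumerate() {
            // Each thread owns a disjoint chunk, so no locking is needed.
            s.spawn(move || f(g, chunk));
        }
    });
}
```

Calling the dispatcher once per batch, with the closure looping over all examples in the batch, corresponds to the "one dispatch per batch" design in the text.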
Status: Built and compiles. Not yet properly measured. Initial tests ran only 30s and showed no Step output. This may be because batch_size=500 means the first log (step 5000) requires 10 batches of 500 examples each, and the batched path processes examples sequentially within each group (500 forward+backward passes per group per batch).
Expected throughput: 13 groups on ~8 cores (some cores handle 2 groups). Each group does 15K × 16 multiply-adds per forward per example. 500 examples per batch = 120M ops per group per batch. At ~10 GFLOP/s per core, ~12ms per group per batch. 10 batches to reach first log at step 5000 = ~120ms. Should show output within seconds.
Actual measurement (120 seconds):
- 10,000 steps in 120 seconds = 83 steps/sec
- This is 20x slower than the baseline (1,667 steps/sec)
- Crashed at first evolve() due to missing node in superset (fixed)
Why 20x slower: The batched approach processes examples sequentially within each group: 500 forward+backward passes per group per batch. The old approach also processes examples sequentially within each individual, but it spreads 200 individuals across 16 threads. The batched approach spreads only 13 groups across the same 16 threads.
But 13 groups ≈ 200 individuals in terms of parallelism. The problem must be elsewhere:
- Each group does 15K × 16 multiply-adds per forward = 240K ops. The old code does 15K × 1 = 15K ops per individual forward. So each group does 16x more work (the SIMD inner loop is 16-wide).
- But SIMD should make that 16x work take ~2x time (16 elements / 8 AVX2 lanes = 2 iterations). So expected: 2x slower per group, but same number of groups as individuals → 2x slower overall.
- Actual: 20x slower. Something is 10x worse than expected.
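The expected-versus-actual reasoning in the list above can be written out as a small calculation. This is only a sketch of the arithmetic; the function names are hypothetical and the inputs are the figures quoted in the journal.

```rust
/// Vector iterations needed to cover one group (16 lanes / 8 AVX2 lanes = 2).
pub fn simd_iterations(group_width: usize, simd_lanes: usize) -> usize {
    (group_width + simd_lanes - 1) / simd_lanes
}

/// Slowdown factor of `measured` relative to `baseline`, both in steps/sec.
pub fn slowdown(baseline_steps_per_sec: f64, measured_steps_per_sec: f64) -> f64 {
    baseline_steps_per_sec / measured_steps_per_sec
}

/// How much worse the measurement is than the simple SIMD model predicts.
pub fn unexplained_factor(actual_slowdown: f64, expected_slowdown: f64) -> f64 {
    actual_slowdown / expected_slowdown
}
```

With 1,667 and 83 steps/sec the slowdown is about 20x, and dividing by the 2x predicted from simd_iterations(16, 8) leaves roughly a 10x gap to explain.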
Hypothesis: The superset topology overhead. With ~98% shared connections + ~2% unique, the superset has ~2% more connections than any individual. But the bigger issue may be cache behavior: each group’s weight array is 15K × 16 × 4 = 960 KB. That doesn’t fit in L2 (256 KB). The old per-individual approach had 15K × 8 = 120 KB (HotConnection array), which fit comfortably in L2.
Hypothesis 2: The activation array random access. In the old code, post_activations is 987 × 4 = 4 KB (fits in L1). In the batched code, post_activations per group is 987 × 16 × 4 = 63 KB. Doesn’t fit in L1 (32 KB), spills to L2. Each random access by from_idx now loads 64 bytes (one cache line = 16 floats = exactly one group’s worth) from L2 instead of L1. On this Skylake-family core, L2 load latency is roughly 12 cycles versus about 4 to 5 cycles for L1.
Need to profile to confirm.
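The working-set figures behind both hypotheses can be checked with a few constants. This is a sketch under assumptions: the connection count is the journal's approximate 15K, the cache sizes are the per-core 32 KB L1D and 256 KB L2 quoted in the text (taken here as KiB), and all names are made up for the example.

```rust
pub const L1D_BYTES: usize = 32 * 1024; // per-core L1 data cache
pub const L2_BYTES: usize = 256 * 1024; // per-core L2 cache
pub const CONNECTIONS: usize = 15_000; // approximate, "15K" in the journal
pub const NODES: usize = 987;
pub const GROUP_SIZE: usize = 16;
pub const F32_BYTES: usize = 4;
pub const HOT_CONNECTION_BYTES: usize = 8;

/// AoS: one HotConnection per connection.
pub fn aos_connection_bytes() -> usize {
    CONNECTIONS * HOT_CONNECTION_BYTES
}

/// SoA: one f32 weight per connection per individual in the group.
pub fn soa_weight_bytes() -> usize {
    CONNECTIONS * GROUP_SIZE * F32_BYTES
}

/// AoS: one f32 activation per node.
pub fn aos_activation_bytes() -> usize {
    NODES * F32_BYTES
}

/// SoA: one f32 activation per node per individual in the group.
pub fn soa_activation_bytes() -> usize {
    NODES * GROUP_SIZE * F32_BYTES
}

/// True if a working set of `bytes` fits in a cache of `cache_bytes`.
pub fn fits(bytes: usize, cache_bytes: usize) -> bool {
    bytes <= cache_bytes
}
```

Under these numbers the AoS arrays fit in L2 and L1 respectively, while the SoA weight array exceeds L2 and the SoA activation array exceeds L1. Whether that is the actual cause still needs the profile.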
Profile result: ALL scalar vmulss — LLVM did NOT autovectorize the inner loops. The for k in 0..GROUP_SIZE loop compiles to scalar despite contiguous access. 7.7% of time in bitmask cleanup (zeroing weights).
Root cause: Aliasing. The inner loop reads group.post_activations[src_offset+k] and group.weights[w_offset+k] while writing to group.pre_activations[dest_offset+k]. All three are Vec<f32> fields on the same struct. LLVM cannot prove they don’t overlap (they don’t, but the compiler doesn’t know that from the raw pointer arithmetic). Without proving non-aliasing, LLVM falls back to scalar.
Fix: Pass separate slices (&mut [f32] and &[f32]) to the inner loop function. With distinct slice references, LLVM can apply noalias and vectorize.
Attempt 2a: Aliasing fix (separate slice arguments)
Change: Extracted fma_slice() and backward_conn_update() helper functions that take separate &mut [f32]/&[f32] arguments. LLVM can now prove non-aliasing.
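A minimal sketch of what such a helper might look like. The journal names fma_slice() but does not show its signature, so the signature below is an assumption, as are the Group field layout and the apply_connection method. The zip-based loop is one way to avoid per-element bounds checks; whether LLVM actually vectorizes it should still be confirmed with perf annotate.

```rust
pub const GROUP_SIZE: usize = 16;

/// Assumed signature: dest[k] += src[k] * weights[k] for each lane k.
/// `dest` is a distinct `&mut` slice, so Rust's borrow rules guarantee it
/// does not overlap `src` or `weights`.
pub fn fma_slice(dest: &mut [f32], src: &[f32], weights: &[f32]) {
    // Zipped iterators stop at the shortest slice and need no index checks.
    for ((d, s), w) in dest.iter_mut().zip(src).zip(weights) {
        *d += *s * *w;
    }
}

/// Hypothetical SoA group state: GROUP_SIZE contiguous lanes per node
/// (activations) and per connection (weights).
pub struct Group {
    pub pre_activations: Vec<f32>,
    pub post_activations: Vec<f32>,
    pub weights: Vec<f32>,
}

impl Group {
    /// Apply one connection for all individuals in the group.
    pub fn apply_connection(&mut self, dest_node: usize, src_node: usize, conn: usize) {
        let d = dest_node * GROUP_SIZE;
        let s = src_node * GROUP_SIZE;
        let w = conn * GROUP_SIZE;
        // Borrowing three different fields yields three provably disjoint slices.
        fma_slice(
            &mut self.pre_activations[d..d + GROUP_SIZE],
            &self.post_activations[s..s + GROUP_SIZE],
            &self.weights[w..w + GROUP_SIZE],
        );
    }
}
```

Slicing to exactly GROUP_SIZE elements before the call keeps the three range checks outside the lane loop.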
Result: vmulps/vaddps confirmed in perf annotate. BUT still 42 steps/sec (even worse than the scalar 83 steps/sec — likely bounds checking overhead from slice indexing).
Profile breakdown:
- 40.6% vmovups (vectorized loads from memory)
- 29.1% vmulps (vectorized multiplies)
- 5.8% bitmask cleanup
Root cause: memory bandwidth wall. Each connection access in SoA loads:
- 64 bytes of weights (16 × f32)
- 64 bytes of source activations (16 × f32)
- = 128 bytes per connection
vs the old AoS approach:
- 8 bytes (HotConnection: from_idx + weight)
- ~4 bytes (random activation load, usually in L1 cache)
- = 12 bytes per connection
That’s 10.7× more memory traffic for the same compute. The SIMD gain (2× throughput from vmulps vs vmulss) can’t overcome the 10.7× bandwidth penalty.
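Per forward pass: 15K connections × 128 bytes = 1.9 MB per group. Per batch of 500 examples: 960 MB. At L3 bandwidth ~40 GB/s that is ~24 ms per group per batch, or ~240 ms for the 10 batches up to step 5000 (forward traffic only). The observed 42 steps/sec works out to ~12 seconds per 500-example batch, so this estimate on its own does not account for the measured time. The load-dominated profile is consistent with a memory-bound loop, but the arithmetic does not confirm it.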
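The traffic figures can be worked through in code to keep the units straight. This is a sketch of the arithmetic only: the byte counts, connection count, batch size, and ~40 GB/s bandwidth are the journal's figures, the function names are hypothetical, and the result of transfer_seconds is in seconds.

```rust
/// Bytes touched per connection in the SoA layout (64 weights + 64 activations).
pub const SOA_BYTES_PER_CONN: f64 = 128.0;
/// Bytes touched per connection in the AoS layout (8 HotConnection + ~4 activation).
pub const AOS_BYTES_PER_CONN: f64 = 12.0;

/// How many times more memory traffic SoA generates per connection.
pub fn traffic_ratio() -> f64 {
    SOA_BYTES_PER_CONN / AOS_BYTES_PER_CONN
}

/// Forward-pass bytes moved by one group for one batch.
pub fn batch_bytes(connections: f64, examples_per_batch: f64) -> f64 {
    connections * SOA_BYTES_PER_CONN * examples_per_batch
}

/// Time in seconds to move `bytes` at `bandwidth_bytes_per_sec`.
pub fn transfer_seconds(bytes: f64, bandwidth_bytes_per_sec: f64) -> f64 {
    bytes / bandwidth_bytes_per_sec
}
```

For 15,000 connections and 500 examples, batch_bytes gives 960 MB, and transfer_seconds at 40e9 bytes/sec gives about 0.024 seconds per group per batch.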
Fundamental lesson: The SoA batching approach only wins when compute is the bottleneck. For sparse networks with random-access patterns, memory bandwidth is the bottleneck, and SoA makes it worse by widening every access. The AoS per-individual approach is bandwidth-efficient because each individual’s working set (120 KB for HotConnection array + 4 KB activations) fits in L2.
Conclusion: Revert to AoS baseline
The batched SoA approach is architecturally wrong for this problem. The sparse network structure means:
- Memory access patterns are irregular (random from_idx per connection)
- Per-connection data (8 bytes in HotConnection) is much smaller than a cache line
- The working set per individual (124 KB) fits in L2
The AoS approach naturally exploits this: each rayon thread keeps one individual’s data hot in L2, processes all examples for that individual, then moves on. The 16-thread parallelism compensates for the scalar inner loop.
The SoA approach would win for dense networks (matrix multiplication) where the access pattern is regular and compute-bound. For sparse networks, AoS + thread parallelism is superior.
Action: Revert to AoS baseline (commit ca3ecdd). At 1,667 steps/sec (18 min for [128,64]), it is the fastest of the configurations measured here on CPU.
Measurement Protocol
For each configuration:
- Run: timeout 120 cargo run --release 2>&1 > /tmp/synth_perf_X.txt
- Count Step lines: grep -c "^Step" /tmp/synth_perf_X.txt
- Get last step: grep "^Step" /tmp/synth_perf_X.txt | tail -1
- Compute steps/sec = last_step / 120
- Profile if needed: perf record -F 999 -- timeout 30 cargo run --release
- Check vectorization: perf annotate --stdio <function> | grep vmulps
Always use 120-second runs minimum. Don’t kill early and assume failure.
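The "get last step" and "compute steps/sec" parts of the protocol could be scripted along these lines. This is a sketch under a loud assumption: the journal does not show the log format, so the code assumes only that progress lines start with "Step" and that the first run of digits on the line is the step number. Both function names are hypothetical.

```rust
/// Step number from the last line that starts with "Step", if any.
/// ASSUMPTION: the first run of ASCII digits on such a line is the step count.
pub fn last_step(log: &str) -> Option<u64> {
    log.lines()
        .filter(|line| line.starts_with("Step"))
        .filter_map(|line| {
            let digits: String = line
                .chars()
                .skip_while(|c| !c.is_ascii_digit())
                .take_while(|c| c.is_ascii_digit())
                .collect();
            digits.parse::<u64>().ok()
        })
        .last()
}

/// Throughput over a fixed wall-clock window, e.g. the 120-second run.
pub fn steps_per_sec(last_step: u64, wall_secs: u64) -> f64 {
    last_step as f64 / wall_secs as f64
}
```

A None result after a full 120-second run means no Step line was logged at all, which is different from a slow configuration and worth checking before drawing conclusions.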