This is the raw performance optimization journal Claude maintained while investigating throughput on the [128, 64] architecture. Reproduced exactly as produced.

Performance Journal

Systematic measurements for optimization work. All benchmarks on [128, 64] @50% interlayer, pop 200, pure MNIST.


Baseline: AoS per-individual rayon (pre-batched, commit ca3ecdd)

Architecture: Each of 200 individuals processes examples independently via rayon. Per-individual HotConnection { from_idx: u32, weight: f32 } (8 bytes). Scalar inner loop, 16-thread parallelism.

Measurement: timeout 30 cargo run --release, count steps at first log line.

Why 4% of peak: Inner loop does random-access post_activations[from_idx] per connection. Not vectorizable because from_idx varies. Each connection is a dependent load → multiply → accumulate. Scalar throughput limited by load latency, not compute.
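
A minimal sketch of the baseline inner loop described above, to make the access pattern concrete. The HotConnection layout is taken from the journal. The accumulate function, its signature, and the per-neuron grouping of connections are assumptions made for illustration, not the project's actual code.

```rust
/// Per-connection record as described in the journal: 8 bytes total.
#[derive(Clone, Copy)]
#[repr(C)]
struct HotConnection {
    from_idx: u32,
    weight: f32,
}

/// Hypothetical helper: sum the weighted inputs of one destination neuron.
/// `post_activations[from_idx]` is a gather: the index differs per connection,
/// so the loads cannot be merged into one contiguous SIMD load, and every
/// iteration feeds the same accumulator.
fn accumulate(conns: &[HotConnection], post_activations: &[f32]) -> f32 {
    let mut sum = 0.0f32;
    for c in conns {
        sum += post_activations[c.from_idx as usize] * c.weight;
    }
    sum
}

fn main() {
    let post_activations = [1.0f32, 2.0, 3.0, 4.0];
    let conns = [
        HotConnection { from_idx: 3, weight: 0.5 },
        HotConnection { from_idx: 0, weight: 2.0 },
    ];
    println!("sum = {}", accumulate(&conns, &post_activations));
    println!("record size = {} bytes", std::mem::size_of::<HotConnection>());
}
```

Each iteration touches 8 bytes of connection data plus one 4-byte activation, which is why the per-individual working set stays small.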


Attempt 1: Monolithic SoA (all 200 individuals, single-threaded)

Idea: Build superset topology, SoA layout with 256-wide individual dimension, autovectorized inner loops.

Result: Confirmed autovectorization (vmulps/vaddps in perf annotate). But 20x SLOWER than baseline — lost 16-thread parallelism. 1 thread × 8-wide SIMD < 16 threads × scalar.

Lesson: SIMD has to be added on top of thread parallelism, not swapped for it. One 8-wide SIMD thread cannot match 16 scalar threads.


Attempt 2: Group-parallel SoA (13 groups of 16, rayon across groups)

Idea: Split 200 individuals into groups of 16. Each group is a small SoA (16-wide, 2 AVX2 iterations). Rayon dispatches across groups. One rayon dispatch per batch (not per example).
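
The grouping can be sketched as follows. GROUP_SIZE and the 200/16 split come from the journal. The Group struct, its fields, and build_groups are hypothetical names, and the rayon dispatch is left out so the sketch needs only the standard library.

```rust
const GROUP_SIZE: usize = 16; // individuals per group, from the journal

/// Hypothetical SoA group: for connection `c`, the weights of all 16
/// individuals sit contiguously at `weights[c * GROUP_SIZE..][..GROUP_SIZE]`.
struct Group {
    weights: Vec<f32>,
    members: usize, // how many of the 16 lanes hold real individuals
}

/// Split `population` individuals into groups of GROUP_SIZE; the last group
/// may be only partly filled.
fn build_groups(population: usize, connections: usize) -> Vec<Group> {
    let n_groups = (population + GROUP_SIZE - 1) / GROUP_SIZE; // ceiling division
    (0..n_groups)
        .map(|g| Group {
            weights: vec![0.0; connections * GROUP_SIZE],
            members: (population - g * GROUP_SIZE).min(GROUP_SIZE),
        })
        .collect()
}

fn main() {
    let groups = build_groups(200, 15_000);
    let last = groups.last().unwrap();
    println!("{} groups, last group has {} members", groups.len(), last.members);
    println!("weights per group: {} bytes", last.weights.len() * 4);
}
```

With 200 individuals the last group is only half full, so its remaining lanes are padding.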

Status: Built and compiles. Not yet properly measured. Initial tests ran only 30s and showed no Step output. This may be because batch_size=500 means the first log line (step 5000) requires 10 batches of 500 examples each, and the batched path processes examples sequentially within each group (500 forward+backward passes per group per batch).

Expected throughput: 13 groups on ~8 cores (some cores handle 2 groups). Each group does 15K × 16 multiply-adds per forward per example. 500 examples per batch = 120M ops per group per batch. At ~10 GFLOP/s per core, ~12ms per group per batch. 10 batches to reach first log at step 5000 = ~120ms. Should show output within seconds.
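
The estimate above is simple enough to check mechanically. This is a sketch of the same arithmetic. est_seconds is a hypothetical helper, and the 10 GFLOP/s figure is the journal's own rough assumption, not a measurement.

```rust
/// Estimated seconds for one group to run `examples` forward passes,
/// counting one operation per connection per lane.
fn est_seconds(connections: f64, lanes: f64, examples: f64, ops_per_sec: f64) -> f64 {
    connections * lanes * examples / ops_per_sec
}

fn main() {
    // 15K connections, 16 lanes, 500 examples per batch, ~10 GFLOP/s per core.
    let per_batch = est_seconds(15_000.0, 16.0, 500.0, 10e9);
    println!("per batch: {:.1} ms", per_batch * 1e3);
    println!("10 batches (step 5000): {:.0} ms", per_batch * 10.0 * 1e3);
}
```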

Actual measurement (120 seconds):

Why 20x slower: The batched approach processes examples sequentially within each group: 500 forward+backward passes per group per batch. The old approach also processes examples sequentially within each individual, but it has 200 independent work items to spread across 16 threads. The batched approach has only 13 work items (groups) for the same 16 threads.

But 13 groups ≈ 200 individuals in terms of parallelism. The problem must be elsewhere:

Hypothesis: The superset topology overhead. With ~98% shared connections + ~2% unique, the superset has ~2% more connections than any individual. But the bigger issue may be cache behavior: each group’s weight array is 15K × 16 × 4 = 960 KB. That doesn’t fit in L2 (256 KB). The old per-individual approach had 15K × 8 = 120 KB (HotConnection array), which fit comfortably in L2.

Hypothesis 2: The activation array random access. In the old code, post_activations is 987 × 4 = 4 KB (fits in L1). In the batched code, post_activations per group is 987 × 16 × 4 = 63 KB. Doesn’t fit in L1 (32 KB), spills to L2. Each random access by from_idx now loads 64 bytes (one cache line = 16 floats = exactly one group’s worth) from L2 instead of L1. L2 latency is typically around 12 cycles vs around 4 cycles for L1 on typical x86 cores.
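
The footprints quoted in both hypotheses can be reproduced with a few lines of arithmetic. This is a sketch: the connection count (15K), neuron count (987), and cache sizes are the journal's figures, while aos_bytes and soa_bytes are hypothetical helper names.

```rust
const F32_BYTES: usize = std::mem::size_of::<f32>();

/// Bytes for the per-individual AoS layout: (connection array, activations).
fn aos_bytes(connections: usize, neurons: usize) -> (usize, usize) {
    (connections * 8, neurons * F32_BYTES) // 8 = size of HotConnection
}

/// Bytes for one SoA group: (weights, activations), each widened by `lanes`.
fn soa_bytes(connections: usize, neurons: usize, lanes: usize) -> (usize, usize) {
    (connections * lanes * F32_BYTES, neurons * lanes * F32_BYTES)
}

fn main() {
    const L1: usize = 32 * 1024; // cache sizes as quoted in the journal
    const L2: usize = 256 * 1024;
    let (aos_conn, aos_act) = aos_bytes(15_000, 987);
    let (soa_w, soa_act) = soa_bytes(15_000, 987, 16);
    println!("AoS conns {} B, fits L2: {}", aos_conn, aos_conn <= L2);
    println!("AoS acts  {} B, fits L1: {}", aos_act, aos_act <= L1);
    println!("SoA weights {} B, fits L2: {}", soa_w, soa_w <= L2);
    println!("SoA acts    {} B, fits L1: {}", soa_act, soa_act <= L1);
}
```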

Need to profile to confirm.

Profile result: ALL scalar vmulss — LLVM did NOT autovectorize the inner loops. The for k in 0..GROUP_SIZE loop compiles to scalar despite contiguous access. 7.7% of time in bitmask cleanup (zeroing weights).

Root cause: Aliasing. The inner loop reads group.post_activations[src_offset+k] and group.weights[w_offset+k] while writing to group.pre_activations[dest_offset+k]. All three are Vec<f32> fields on the same struct. LLVM cannot prove they don’t overlap (they don’t, but the compiler doesn’t know that from the raw pointer arithmetic). Without proving non-aliasing, LLVM falls back to scalar.

Fix: Pass separate slices (&mut [f32] and &[f32]) to the inner loop function. With distinct slice references, LLVM can apply noalias and vectorize.


Attempt 2a: Aliasing fix (separate slice arguments)

Change: Extracted fma_slice() and backward_conn_update() helper functions that take separate &mut [f32]/&[f32] arguments. LLVM can now prove non-aliasing.
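
A sketch of what such a helper might look like. The name fma_slice comes from the journal, but the signature, the body, and the Group struct here are assumptions. The body is a plain multiply followed by an add, and whether the compiler fuses the two is up to the backend.

```rust
const GROUP_SIZE: usize = 16;

/// Hypothetical group layout: three separate Vec<f32> fields on one struct.
struct Group {
    pre_activations: Vec<f32>,
    post_activations: Vec<f32>,
    weights: Vec<f32>,
}

/// dest[k] += src[k] * w[k] over the common length of the three slices.
/// `dest` is an exclusive `&mut` borrow and the inputs are shared borrows,
/// so the compiler is told at the function boundary that `dest` does not
/// overlap `src` or `w`. Iterating with zip also avoids per-element bounds checks.
fn fma_slice(dest: &mut [f32], src: &[f32], w: &[f32]) {
    for ((d, s), w) in dest.iter_mut().zip(src).zip(w) {
        *d += s * w;
    }
}

fn main() {
    let mut g = Group {
        pre_activations: vec![0.0; 2 * GROUP_SIZE],
        post_activations: vec![1.5; 2 * GROUP_SIZE],
        weights: vec![2.0; GROUP_SIZE],
    };
    let (src_offset, dest_offset, w_offset) = (0, GROUP_SIZE, 0);
    // Borrowing three distinct fields at once is accepted by the borrow checker.
    fma_slice(
        &mut g.pre_activations[dest_offset..dest_offset + GROUP_SIZE],
        &g.post_activations[src_offset..src_offset + GROUP_SIZE],
        &g.weights[w_offset..w_offset + GROUP_SIZE],
    );
    println!("{:?}", &g.pre_activations[dest_offset..]);
}
```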

Result: vmulps/vaddps confirmed in perf annotate. BUT still 42 steps/sec (even worse than the scalar 83 steps/sec — likely bounds checking overhead from slice indexing).

Profile breakdown:

Root cause: memory bandwidth wall. Each connection access in SoA loads:

vs the old AoS approach:

That’s 10.7× more memory traffic for the same compute. The SIMD gain (2× throughput from vmulps vs vmulss) can’t overcome the 10.7× bandwidth penalty.
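
One way to arrive at the 10.7× figure, as a sketch. The 128 bytes per SoA connection is the journal's number. Splitting it into 16 weights plus 16 activations, and taking the AoS cost as one 8-byte HotConnection plus one 4-byte activation, are assumptions chosen because they reproduce the stated ratio.

```rust
const LANES: usize = 16;
const F32_BYTES: usize = 4;

/// Assumed SoA cost per connection: 16 weights plus 16 source activations.
fn soa_bytes_per_conn() -> usize {
    LANES * F32_BYTES + LANES * F32_BYTES
}

/// Assumed AoS cost per connection: one HotConnection plus one activation.
fn aos_bytes_per_conn() -> usize {
    8 + F32_BYTES
}

fn main() {
    let ratio = soa_bytes_per_conn() as f64 / aos_bytes_per_conn() as f64;
    println!("SoA {} B vs AoS {} B per connection: {:.1}x",
             soa_bytes_per_conn(), aos_bytes_per_conn(), ratio);
    // Traffic for one group: one forward pass, then a batch of 500 examples.
    let per_forward = 15_000 * soa_bytes_per_conn();
    println!("per forward: {} B, per batch: {} B", per_forward, per_forward * 500);
}
```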

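Per forward pass: 15K connections × 128 bytes = 1.9 MB. Per batch of 500: 960 MB. At L3 bandwidth ~40 GB/s that is about 24 ms per batch for one group, or roughly 240 ms for the 10 batches to step 5000. At 42 steps/sec, step 5000 takes about 120 seconds, so this single-group, forward-only bandwidth figure does not by itself account for the observation. It leaves out the backward pass, all 13 groups sharing L3, and per-access miss latency.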

Fundamental lesson: The SoA batching approach only wins when compute is the bottleneck. For sparse networks with random-access patterns, memory bandwidth is the bottleneck, and SoA makes it worse by widening every access. The AoS per-individual approach is bandwidth-efficient because each individual’s working set (120 KB for HotConnection array + 4 KB activations) fits in L2.


Conclusion: Revert to AoS baseline

The batched SoA approach is architecturally wrong for this problem. The sparse network structure means:

  1. Memory access patterns are irregular (random from_idx per connection)
  2. Per-connection data (8 bytes in HotConnection) is much smaller than a cache line
  3. The working set per individual (124 KB) fits in L2

The AoS approach naturally exploits this: each rayon thread keeps one individual’s data hot in L2, processes all examples for that individual, then moves on. The 16-thread parallelism compensates for the scalar inner loop.

The SoA approach would win for dense networks (matrix multiplication) where the access pattern is regular and compute-bound. For sparse networks, AoS + thread parallelism is superior.

Action: Revert to AoS baseline (commit ca3ecdd). The 1,667 steps/sec (18 min for [128,64]) is the practical optimum for this architecture on CPU.


Measurement Protocol

For each configuration:

  1. Run timeout 120 cargo run --release > /tmp/synth_perf_X.txt 2>&1
  2. Count Step lines: grep -c "^Step" /tmp/synth_perf_X.txt
  3. Get last step: grep "^Step" /tmp/synth_perf_X.txt | tail -1
  4. Compute steps/sec = last_step / 120
  5. Profile if needed: perf record -F 999 -- timeout 30 cargo run --release
  6. Check vectorization: perf annotate --stdio <function> | grep vmulps

Always use 120-second runs minimum. Don’t kill early and assume failure.
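
Steps 3 and 4 of the protocol could also be scripted. This is a sketch only: the journal does not show the log format, so it assumes each progress line starts with "Step" followed by the step number (for example "Step 5000 ..."). last_step and steps_per_sec are hypothetical helpers.

```rust
/// Step number from the last line beginning with "Step", if any.
/// Assumes the first run of digits on that line is the step count.
fn last_step(log: &str) -> Option<u64> {
    log.lines()
        .filter(|l| l.starts_with("Step"))
        .last()
        .and_then(|l| {
            l.split(|c: char| !c.is_ascii_digit())
                .find(|tok| !tok.is_empty())
                .and_then(|tok| tok.parse().ok())
        })
}

/// steps/sec = last_step / run length in seconds.
fn steps_per_sec(log: &str, run_seconds: f64) -> Option<f64> {
    last_step(log).map(|s| s as f64 / run_seconds)
}

fn main() {
    // Made-up log text for illustration, not real output.
    let log = "Compiling...\nStep 5000 ...\nStep 10000 ...\n";
    match steps_per_sec(log, 120.0) {
        Some(rate) => println!("{:.1} steps/sec", rate),
        None => println!("no Step lines found; do not assume failure, check the run length"),
    }
}
```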