L1 cache size vs performance

How much does L1 cache size matter for Ventium’s performance, and where does it stop mattering? This page measures it directly in the cycle-accurate model — the same cycle_mode the M4/M5 gates use, where the dual-issue U/V pipeline and the L1 miss-timing state machine are live — and shows that the answer has a sharp, well-understood structure: a knee governed by the ratio of the working set to the cache capacity, whose shape is the direct fingerprint of the cache’s two-way associativity.

../_images/l1-cache-performance.png

(A) CPI of the mb_dmiss cache-stress kernel as the L1 size is swept 2–128 KB (fixed 32 KB working set). (B) The same knee resolved at fine resolution by instead sweeping the working set against fixed 16 KB and 32 KB caches — revealing that the transition is a linear ramp from 1.0× to 1.5× of capacity, and that it scales exactly with cache size.

What is measured, and on what

The L1 timing model (a read-allocate, two-way-LRU hit/miss state machine with an 8-clock miss penalty, matching the p5trace.so oracle’s l1_access()) is only active in cycle_mode — i.e. under the testbench’s --cycle flag, which also enables dual-issue. In plain functional mode the caches still serve data but their timing is not charged, so cache size has no effect on the clock count. Every number here therefore comes from a --cycle run, and the metric is CPI = core clocks ÷ retired instructions, read from the testbench’s end-of-run retired N instructions in M clocks line.

The workload is ``mb_dmiss``, the directed D-cache-stress kernel from the M5 suite. It strides a load by exactly one 32-byte line through a buffer far larger than the cache, so — once the working set no longer fits — every load touches a freshly-evicted line and misses, adding the full 8-clock penalty. Because cache size changes only timing, never the retired-instruction stream, a fixed instruction budget makes the CPI of different geometries directly comparable.

Note

Why not Dhrystone? Two reasons. First, Dhrystone’s working set is a few KB — it fits in even the smallest L1 here, so it would sit on the fast plateau across the whole sweep and show no knee at all; mb_dmiss is purpose-built to expose the cache-size dependence. Second, and more fundamentally, Dhrystone cannot currently run in cycle_mode: the dual-issue path mishandles the set_thread_area / %gs TLS setup and diverges at the first ret after libc init. (It runs bit-exactly in functional mode — that path is verified against QEMU — but functional mode does not model cache timing.) cycle_mode is validated against the p5trace.so oracle only on the flat, syscall-free mb_* micro-kernels, which is exactly the class mb_dmiss belongs to.

Panel A — the cache-size knee

Sweeping the unified L1 size (I$ = D$) from 2 KB to 128 KB against mb_dmiss’s 32 KB working set:

L1 size

sets (2-way)

CPI

rel. perf

D-cache behaviour

2 KB

32

2.50

1.00×

100 % miss

4 KB

64

2.50

1.00×

100 % miss

8 KB

128

2.50

1.00×

100 % miss (production geometry)

16 KB

256

2.50

1.00×

100 % miss

32 KB

512

0.53

4.68×

working set fits

64 KB

1024

0.53

4.68×

fits

128 KB

2048

0.53

4.68×

fits

The knee lands exactly at 32 KB — where the 1024-line working set first fits the 512-set × 2-way cache. Below it, every load misses and the 8-clock penalty dominates (CPI 2.50). At and above it, misses vanish and the dual-issue pipeline pairs the loop body two instructions per clock, giving a sub-1.0 CPI of 0.53 — a 4.7× speedup. Past the knee, more cache buys nothing: the working set already fits.

The 8 KB point (the production geometry) reproduces the M5 golden CPI of 2.50, which is a useful cross-check that the parametrically-sized configurations match the shipping cache exactly.

Panel B — resolving the knee (and the 2-way fingerprint)

The cache size is quantized: with a 32-byte line the set count and the associativity are each a power of two (the index is a clean bit-slice, addr[5 +: $clog2(SETS)]), so a D-cache size is always sets × ways × 32 B with both factors powers of two — there is no arbitrary 20 KB or 24 KB point. Associativity itself is now a compile-time knob (VEN_L1_WAYS — see Parametric L1 Caches), which adds geometry points (e.g. 128 sets at 2/4/8-way = 8/16/32 KB), but the knee is most cleanly resolved by its dual: since it is governed by working set ÷ capacity, hold the cache fixed and sweep the working set at arbitrary resolution (just recompiled mb_dmiss variants — no RTL rebuild).

Doing so at a fixed 16 KB cache exposes the internal structure the coarse sweep hid:

Working set

ratio

CPI

what is happening

≤ 16 KB

≤ 1.0×

0.54

fits in two ways → ~0 misses

17 KB

1.06×

0.88

first sets see a 3rd competing line → thrash begins

18 KB

1.13×

1.19

20 KB

1.25×

1.72

22 KB

1.38×

2.14

24 KB

1.5×

2.50

every set sees ≥ 3 lines → 100 % miss (saturated)

The transition is a clean linear ramp from 1.0× to 1.5× capacity, and that slope is the two-way-associativity fingerprint. A sequential line-stride starts thrashing the instant a set sees a third competing line (just past 1.0×), and thrashes completely once every set does — which for this access pattern happens at exactly 1.5× capacity, not 2×. A direct-mapped cache would knee almost instantly; a fully-associative cache would stay flat until exactly 1.0× and then cliff. The 1.0×→1.5× ramp is specifically what two-way gives.

The knee also scales exactly with cache size: the 32 KB curve is the 16 KB curve stretched 2× (flat to 32 KB, ramp 32→48 KB, flat after), and the two curves coincide at equal ratios — e.g. a 1.125× working set gives CPI 1.19 on both — confirming that performance in this regime depends only on working-set/capacity, not on absolute size.

Methodology and reproduction

Cache geometry is selected by the IC_SETS / DC_SETS localparams in rtl/core/core.sv (default 128 sets = 8 KB; +VEN_CACHE_HALF = 64 sets = 4 KB), which parameterise the icache and dcache_timing modules (logic [TAGW-1:0] tag [SETS][2] etc., so the arrays scale automatically). For the sweep each size was built into its own obj_dir with an extra +define overriding the set count, then mb_dmiss (and a many-sweep variant, so the one mandatory cold sweep amortises to a clean steady state) was run under --cycle. The host-compiler optimisation level does not affect the cycle counts — only Verilator’s simulation speed — so it is irrelevant to the result.

This is a performance characterisation, not a verification gate: all geometries remain functionally bit-exact (a smaller cache only does more fills of the same bytes). The shipping 8 KB / 2-way / 32 B geometry is the one validated cycle-accurately against the p5trace.so oracle; see the M5 cache bands and The r4 SRT divider and the FDIV bug for the broader cycle-model methodology, and References for the timing-model sources.