================================================== L1 cache size vs performance ================================================== How much does L1 cache size matter for Ventium's performance, and *where* does it stop mattering? This page measures it directly in the **cycle-accurate model** — the same ``cycle_mode`` the M4/M5 gates use, where the dual-issue U/V pipeline and the L1 miss-timing state machine are live — and shows that the answer has a sharp, well-understood structure: a **knee** governed by the ratio of the working set to the cache capacity, whose shape is the direct fingerprint of the cache's **two-way associativity**. .. figure:: img/l1-cache-performance.png :width: 100% :align: center **(A)** CPI of the ``mb_dmiss`` cache-stress kernel as the L1 size is swept 2–128 KB (fixed 32 KB working set). **(B)** The same knee resolved at fine resolution by instead sweeping the *working set* against fixed 16 KB and 32 KB caches — revealing that the transition is a linear ramp from 1.0× to 1.5× of capacity, and that it scales exactly with cache size. What is measured, and on what ============================= The L1 **timing** model (a read-allocate, two-way-LRU hit/miss state machine with an 8-clock miss penalty, matching the ``p5trace.so`` oracle's ``l1_access()``) is only active in ``cycle_mode`` — i.e. under the testbench's ``--cycle`` flag, which also enables dual-issue. In plain functional mode the caches still serve data but their *timing* is not charged, so cache size has no effect on the clock count. Every number here therefore comes from a ``--cycle`` run, and the metric is **CPI = core clocks ÷ retired instructions**, read from the testbench's end-of-run ``retired N instructions in M clocks`` line. The workload is **``mb_dmiss``**, the directed D-cache-stress kernel from the M5 suite. It strides a load by exactly one 32-byte line through a buffer far larger than the cache, so — once the working set no longer fits — *every* load touches a freshly-evicted line and misses, adding the full 8-clock penalty. Because cache size changes only *timing*, never the retired-instruction stream, a fixed instruction budget makes the CPI of different geometries directly comparable. .. note:: **Why not Dhrystone?** Two reasons. First, Dhrystone's working set is a few KB — it fits in even the smallest L1 here, so it would sit on the fast plateau across the whole sweep and show no knee at all; ``mb_dmiss`` is purpose-built to *expose* the cache-size dependence. Second, and more fundamentally, Dhrystone cannot currently run in ``cycle_mode``: the dual-issue path mishandles the ``set_thread_area`` / ``%gs`` TLS setup and diverges at the first ``ret`` after libc init. (It runs bit-exactly in functional mode — that path is verified against QEMU — but functional mode does not model cache timing.) ``cycle_mode`` is validated against the ``p5trace.so`` oracle only on the flat, syscall-free ``mb_*`` micro-kernels, which is exactly the class ``mb_dmiss`` belongs to. Panel A — the cache-size knee ============================= Sweeping the unified L1 size (I$ = D$) from 2 KB to 128 KB against ``mb_dmiss``'s 32 KB working set: .. list-table:: :header-rows: 1 :widths: 18 14 16 18 34 * - L1 size - sets (2-way) - CPI - rel. perf - D-cache behaviour * - 2 KB - 32 - 2.50 - 1.00× - 100 % miss * - 4 KB - 64 - 2.50 - 1.00× - 100 % miss * - 8 KB - 128 - 2.50 - 1.00× - 100 % miss (production geometry) * - 16 KB - 256 - 2.50 - 1.00× - 100 % miss * - **32 KB** - **512** - **0.53** - **4.68×** - **working set fits** * - 64 KB - 1024 - 0.53 - 4.68× - fits * - 128 KB - 2048 - 0.53 - 4.68× - fits The knee lands **exactly at 32 KB** — where the 1024-line working set first fits the 512-set × 2-way cache. Below it, every load misses and the 8-clock penalty dominates (CPI 2.50). At and above it, misses vanish and the dual-issue pipeline pairs the loop body two instructions per clock, giving a **sub-1.0 CPI of 0.53 — a 4.7× speedup**. Past the knee, more cache buys nothing: the working set already fits. The 8 KB point (the production geometry) reproduces the M5 golden CPI of 2.50, which is a useful cross-check that the parametrically-sized configurations match the shipping cache exactly. Panel B — resolving the knee (and the 2-way fingerprint) ======================================================== The cache size is **quantized**: with a 32-byte line the set count and the associativity are each a power of two (the index is a clean bit-slice, ``addr[5 +: $clog2(SETS)]``), so a D-cache size is always ``sets × ways × 32 B`` with both factors powers of two — there is no *arbitrary* 20 KB or 24 KB point. Associativity itself **is** now a compile-time knob (``VEN_L1_WAYS`` — see :doc:`l1-parametric`), which adds geometry points (e.g. 128 sets at 2/4/8-way = 8/16/32 KB), but the knee is most cleanly resolved by its **dual**: since it is governed by **working set ÷ capacity**, hold the cache fixed and sweep the *working set* at arbitrary resolution (just recompiled ``mb_dmiss`` variants — no RTL rebuild). Doing so at a fixed 16 KB cache exposes the internal structure the coarse sweep hid: .. list-table:: :header-rows: 1 :widths: 22 16 14 48 * - Working set - ratio - CPI - what is happening * - ≤ 16 KB - ≤ 1.0× - 0.54 - fits in two ways → ~0 misses * - 17 KB - 1.06× - 0.88 - first sets see a 3rd competing line → thrash begins * - 18 KB - 1.13× - 1.19 - * - 20 KB - 1.25× - 1.72 - * - 22 KB - 1.38× - 2.14 - * - **24 KB** - **1.5×** - **2.50** - every set sees ≥ 3 lines → 100 % miss (saturated) The transition is a **clean linear ramp from 1.0× to 1.5× capacity**, and that slope *is* the two-way-associativity fingerprint. A sequential line-stride starts thrashing the instant a set sees a third competing line (just past 1.0×), and thrashes *completely* once every set does — which for this access pattern happens at exactly **1.5×** capacity, not 2×. A direct-mapped cache would knee almost instantly; a fully-associative cache would stay flat until exactly 1.0× and then cliff. The 1.0×→1.5× ramp is specifically what two-way gives. The knee also **scales exactly with cache size**: the 32 KB curve is the 16 KB curve stretched 2× (flat to 32 KB, ramp 32→48 KB, flat after), and the two curves coincide at equal *ratios* — e.g. a 1.125× working set gives CPI 1.19 on both — confirming that performance in this regime depends only on working-set/capacity, not on absolute size. Methodology and reproduction ============================ Cache geometry is selected by the ``IC_SETS`` / ``DC_SETS`` localparams in ``rtl/core/core.sv`` (default 128 sets = 8 KB; ``+VEN_CACHE_HALF`` = 64 sets = 4 KB), which parameterise the ``icache`` and ``dcache_timing`` modules (``logic [TAGW-1:0] tag [SETS][2]`` etc., so the arrays scale automatically). For the sweep each size was built into its own ``obj_dir`` with an extra ``+define`` overriding the set count, then ``mb_dmiss`` (and a many-sweep variant, so the one mandatory cold sweep amortises to a clean steady state) was run under ``--cycle``. The host-compiler optimisation level does not affect the cycle counts — only Verilator's simulation speed — so it is irrelevant to the result. This is a *performance* characterisation, not a verification gate: all geometries remain functionally bit-exact (a smaller cache only does more fills of the same bytes). The shipping 8 KB / 2-way / 32 B geometry is the one validated cycle-accurately against the ``p5trace.so`` oracle; see the M5 cache bands and :doc:`srt-divider` for the broader cycle-model methodology, and :doc:`../reference-library` for the timing-model sources.