=================================================
Parametric L1 Caches
=================================================

Ventium's L1 instruction and data caches are **8 KB, 2-way set-associative,
32-byte line, 128 sets, LRU** — the Pentium silicon geometry, and the one the
cycle model is validated against. That geometry is now **parametric**: the set
count *and* the associativity of each L1 can be set at compile time, so the core
can be re-built as a 4 KB direct-mapped-ish, a 16 KB 4-way, a 32 KB 8-way, or any
power-of-two combination — for FPGA area/Fmax studies and cache design-space
exploration. The default build is **bit- and cycle-identical** to the original.

Compile-time knobs
==================

Geometry is selected with ``+define`` flags, so no source edit is needed. The
size of each L1 is ``sets × ways × 32 B``.

.. list-table::
   :header-rows: 1
   :widths: 26 40 14

   * - Define
     - Effect
     - Default
   * - ``VEN_L1_SETS=N``
     - set count for **both** I$ and D$
     - 128
   * - ``VEN_L1_WAYS=N``
     - associativity for **both** I$ and D$
     - 2
   * - ``VEN_IC_SETS=N`` / ``VEN_IC_WAYS=N``
     - I-cache only (overrides the ``VEN_L1_*`` value)
     - —
   * - ``VEN_DC_SETS=N`` / ``VEN_DC_WAYS=N``
     - D-cache only (overrides the ``VEN_L1_*`` value)
     - —
   * - ``VEN_CACHE_HALF``
     - legacy shortcut: 64 sets for both L1s (4 KB each at the default 2 ways)
     - —

``N`` must be a power of two. The examples below build into a private
``obj_dir`` so they never clobber the canonical one::

   # 16 KB 4-way (both L1s)
   make -C verif/tb rtl OBJDIR=obj_dir_4way VL_EXTRA_DEFINES="+define+VEN_L1_WAYS=4"

   # 8-way I$, 256-set 2-way D$
   make -C verif/tb rtl OBJDIR=obj_dir_mix \
        VL_EXTRA_DEFINES="+define+VEN_IC_WAYS=8 +define+VEN_DC_SETS=256"

The set-index width (``$clog2(SETS)``), tag width (``32-5-idx``) and way-index
width (``$clog2(WAYS)``) all derive automatically; the tag/valid/data arrays and
the line store scale with the parameters.
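
The derivation above can be checked with a few lines of arithmetic. The following
is a minimal Python sketch, not the RTL: ``L1Geometry`` and ``clog2`` are
hypothetical helper names, and the 32-bit address width is an assumption read off
the ``32-5-idx`` tag formula.

.. code-block:: python

   from dataclasses import dataclass

   LINE_BYTES = 32  # line size stated in this document
   ADDR_BITS = 32   # assumed from the ``32-5-idx`` tag-width formula


   def clog2(n: int) -> int:
       """Ceiling log2 for n >= 1, mirroring SystemVerilog $clog2."""
       return (n - 1).bit_length()


   @dataclass(frozen=True)
   class L1Geometry:
       """Hypothetical helper: derived sizes and field widths for one L1."""
       sets: int
       ways: int

       def __post_init__(self):
           # Both knobs must be powers of two, as the text requires.
           for name, value in (("sets", self.sets), ("ways", self.ways)):
               if value < 1 or value & (value - 1):
                   raise ValueError(f"{name} must be a power of two, got {value}")

       @property
       def size_bytes(self) -> int:
           return self.sets * self.ways * LINE_BYTES

       @property
       def index_bits(self) -> int:
           return clog2(self.sets)

       @property
       def tag_bits(self) -> int:
           return ADDR_BITS - clog2(LINE_BYTES) - self.index_bits

       @property
       def way_bits(self) -> int:
           return clog2(self.ways)


   default = L1Geometry(sets=128, ways=2)
   four_way = L1Geometry(sets=128, ways=4)

Under these assumptions the default geometry comes out at 8192 bytes with a
7-bit index and a 20-bit tag, and the 4-way build at 16384 bytes with a 2-bit
way index.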

Generalising the LRU without disturbing the default
===================================================

The original 2-way replacement carried **one bit per set** — ``lru`` = the
most-recently-used way, with ``victim = ~lru``. That does not generalise: an
N-way victim needs the full recency order, not just the MRU way.

The replacement is a **per-way age-counter true-LRU**. Each way of a set holds a
rank ``age[set][way]`` ∈ ``0 … WAYS-1`` (0 = MRU, WAYS-1 = LRU); the ranks of a
set are always a permutation of ``0 … WAYS-1``. The **victim** is the way whose
age is ``WAYS-1``. On any access (hit or fill) to way *k*, every way more recent
than *k* (age < *k*'s age) ages by one and *k* becomes MRU (age 0). The icache
exposes the chosen victim as ``ic_victim_o`` so the spine's fill-way selection
stays uniform (it simply reads ``ic_victim_o`` instead of the old ``~ic_lru_o``).
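
The update rule can be modelled in software to see how the ranks move. This is an
illustrative Python sketch of one set's replacement state, not the RTL itself;
the class name ``AgeLRUSet`` is hypothetical, and the reset order (way *i* starts
at age *i*) is an assumption extrapolated from the 2-way reset described in this
document.

.. code-block:: python

   class AgeLRUSet:
       """Replacement state of one set: age 0 = MRU, age WAYS-1 = LRU."""

       def __init__(self, ways: int):
           # Assumed reset state: way i starts at age i.
           self.age = list(range(ways))

       def victim(self) -> int:
           """Return the way whose age is WAYS-1 (the LRU way)."""
           return self.age.index(len(self.age) - 1)

       def touch(self, k: int) -> None:
           """Record a hit or fill on way k."""
           old = self.age[k]
           for way, age in enumerate(self.age):
               if age < old:            # strictly more recent than k
                   self.age[way] = age + 1
           self.age[k] = 0              # k becomes MRU


   s = AgeLRUSet(4)
   first_victim = s.victim()   # way 3 under the assumed reset order
   s.touch(3)                  # fill the victim; ages become [1, 2, 3, 0]
   second_victim = s.victim()  # way 2 is now the oldest

Because only the ways younger than *k* are aged, the ranks remain a permutation
of ``0 … WAYS-1`` after every update, so exactly one way is always the victim.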

This encoding **reduces exactly to the old behaviour at WAYS=2**: reset ages
``{0,1}`` (way 0 MRU, way 1 LRU) make the first victim way 1, identical to
``~lru`` with ``lru`` reset to 0; a hit/fill on way *k* moves *k* to MRU and the
other way to LRU, exactly as ``lru <= k`` did. So the default build's hit test,
victim sequence, and recency updates are unchanged, which the cycle gates
below confirm.
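
The WAYS=2 reduction can also be checked exhaustively in a software model. The
sketch below is a Python illustration under the reset values stated above, not
part of the verification flow; the function names are hypothetical. It compares
the victim chosen before each access by the one-bit scheme and by the age
counters, over every access sequence up to a given length.

.. code-block:: python

   from itertools import product


   def one_bit_victims(accesses):
       """Original scheme: ``lru`` holds the MRU way, victim is its complement."""
       lru = 0                      # reset value stated in the text
       victims = []
       for k in accesses:
           victims.append(1 - lru)  # one-bit ~lru
           lru = k                  # lru <= k
       return victims


   def age_victims(accesses, ways=2):
       """Age-counter scheme, reset to ages {0, 1, ..., ways-1}."""
       age = list(range(ways))
       victims = []
       for k in accesses:
           victims.append(age.index(ways - 1))
           old = age[k]
           age = [a + 1 if a < old else a for a in age]
           age[k] = 0
       return victims


   def equivalent_up_to(length):
       """True if both schemes agree on every 2-way access sequence."""
       return all(
           one_bit_victims(seq) == age_victims(seq)
           for n in range(length + 1)
           for seq in product((0, 1), repeat=n)
       )


   ok = equivalent_up_to(10)

A sequence of length *n* records the victim seen before each of its *n*
accesses, so checking all lengths up to the bound also covers the state left
behind by every shorter sequence.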

Default is preserved; non-default is functionally correct
==========================================================

**The default 8 KB / 2-way / 128-set build is bit- and cycle-identical** to the
pre-parameterisation core:

* ``make verify``: 77 / 77 programs functionally diff-clean vs QEMU.
* ``make m5``: every cycle band green, with the cache kernels unchanged:
  ``mb_dmiss`` CPI **2.504** (abs-cyc **+0.10 %** vs the oracle) and ``mb_imiss``
  CPI **6.002** (**+0.03 %**). Deltas of 0.10 % or less are the evidence that the
  age-LRU's victim sequence matches the original 2-way LRU clock-for-clock.
* the standalone ``l1d`` RTL gate passes.

**Non-default geometries are functionally correct** but are *not* matched by the
fixed 2-way / 128-set ``p5trace.so`` cycle oracle, so ``make verify`` / M4 / M5
certify only the default geometry. A 16 KB 4-way build was checked directly:

* it builds clean;
* a CoreMark free-run on the 4-way cache produces CRC output **byte-identical to
  qemu-native** (the cache is architecturally transparent, so any valid geometry
  yields identical results);
* associativity does what it should: a 12 KB working set striding one load per
  32-byte line over 128 sets puts **3 lines in every set**, which **thrashes the
  2-way cache (CPI 2.50, every access a miss)** but **fits the 4-way
  cache (CPI 0.54)**, a **4.65× speedup**. At a fixed 128 sets the two extra ways
  also double the capacity from 8 KB to 16 KB, so the gain reflects the added
  ways and the capacity they bring together:

.. list-table::
   :header-rows: 1
   :widths: 28 18 18 36

   * - Geometry (12 KB working set)
     - lines/set
     - CPI
     - outcome
   * - 8 KB, 2-way, 128 sets
     - 3 > 2 ways
     - 2.50
     - conflict thrash (100 % miss)
   * - 16 KB, 4-way, 128 sets
     - 3 ≤ 4 ways
     - 0.54
     - fits (≈ 0 miss) — 4.65× faster
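
The miss behaviour in the table can be reproduced with a small set-associative
LRU model. This Python sketch counts misses only and says nothing about CPI; the
function ``miss_rate`` is hypothetical, and it assumes empty ways are filled
before anything is evicted and that the working set is swept repeatedly in
address order.

.. code-block:: python

   LINE_BYTES = 32  # line size stated in this document


   def miss_rate(sets, ways, working_set_bytes, passes):
       """Miss rate of repeated linear sweeps, one load per cache line."""
       # Each set is a list of resident tags, most recently used first.
       cache = [[] for _ in range(sets)]
       misses = accesses = 0
       for _ in range(passes):
           for addr in range(0, working_set_bytes, LINE_BYTES):
               line = addr // LINE_BYTES
               idx, tag = line % sets, line // sets
               resident = cache[idx]
               if tag in resident:
                   resident.remove(tag)      # hit: re-inserted as MRU below
               else:
                   misses += 1
                   if len(resident) == ways:
                       resident.pop()        # evict the LRU tag
               resident.insert(0, tag)
               accesses += 1
       return misses / accesses


   two_way = miss_rate(sets=128, ways=2, working_set_bytes=12 * 1024, passes=10)
   four_way = miss_rate(sets=128, ways=4, working_set_bytes=12 * 1024, passes=10)

In this model the 2-way geometry misses on every access, because each set
cycles through three tags with only two ways, while the 4-way geometry takes
only the 384 cold misses of the first pass.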

Fidelity caveat
===============

Only the **2-way / 128-set** geometry is silicon-accurate and cycle-validated
against the oracle. Other geometries change the miss *sequence* (and at any
associativity other than 2 the replacement policy is no longer the P5's), so they
are area/perf experiments, not a verification config. See
:doc:`l1-cache-performance` for what the cache size actually buys on the
cycle-accurate model, and :doc:`srt-divider` for the broader cycle-model
methodology.