Parametric L1 Caches

Ventium’s L1 instruction and data caches are 8 KB, 2-way set-associative, 32-byte line, 128 sets, LRU — the Pentium silicon geometry, and the one the cycle model is validated against. That geometry is now parametric: the set count and the associativity of each L1 can be set at compile time, so the core can be re-built as a 4 KB direct-mapped-ish, a 16 KB 4-way, a 32 KB 8-way, or any power-of-two combination — for FPGA area/Fmax studies and cache design-space exploration. The default build is bit- and cycle-identical to the original.

Compile-time knobs

Geometry is selected by +define flags (no source edit needed). Size of each L1 = sets × ways × 32 B.

Define

Effect

Default

VEN_L1_SETS=N

set count for both I$ and D$

128

VEN_L1_WAYS=N

associativity for both I$ and D$

2

VEN_IC_SETS=N / VEN_IC_WAYS=N

I-cache only (overrides the VEN_L1_* value)

VEN_DC_SETS=N / VEN_DC_WAYS=N

D-cache only (overrides the VEN_L1_* value)

VEN_CACHE_HALF

legacy shortcut: 64 sets (4 KB) both L1s

N is a power of two. Examples (built into a private obj_dir so they never clobber the canonical one):

# 16 KB 4-way (both L1s)
make -C verif/tb rtl OBJDIR=obj_dir_4way VL_EXTRA_DEFINES="+define+VEN_L1_WAYS=4"

# 8-way I$, 256-set 2-way D$
make -C verif/tb rtl OBJDIR=obj_dir_mix \
     VL_EXTRA_DEFINES="+define+VEN_IC_WAYS=8 +define+VEN_DC_SETS=256"

The set-index width ($clog2(SETS)), tag width (32-5-idx) and way-index width ($clog2(WAYS)) all derive automatically; the tag/valid/data arrays and the line store scale with the parameters.

Generalising the LRU without disturbing the default

The original 2-way replacement carried one bit per setlru = the most-recently-used way, with victim = ~lru. That does not generalise: an N-way victim needs the full recency order, not just the MRU way.

The replacement is a per-way age-counter true-LRU. Each way of a set holds a rank age[set][way]0 WAYS-1 (0 = MRU, WAYS-1 = LRU); the ranks of a set are always a permutation of 0 WAYS-1. The victim is the way whose age is WAYS-1. On any access (hit or fill) to way k, every way more recent than k (age < k’s age) ages by one and k becomes MRU (age 0). The icache exposes the chosen victim as ic_victim_o so the spine’s fill-way selection stays uniform (it simply reads ic_victim_o instead of the old ~ic_lru_o).

This encoding reduces exactly to the old behaviour at WAYS=2: reset ages {0,1} make the first victim way 1 — identical to ~lru with lru reset to 0; a hit/fill on way k moves k to MRU and the other way to LRU, exactly as lru <= k did. So the default build’s hit test, victim sequence, and recency updates are byte-for-byte the same — which the cycle gates confirm below.

Default is preserved; non-default is functionally correct

The default 8 KB / 2-way / 128-set build is bit- and cycle-identical to the pre-parameterisation core:

  • make verify — 77 / 77 programs functionally diff-clean vs QEMU.

  • make m5 — every cycle band green, with the cache kernels unchanged: mb_dmiss CPI 2.504 (abs-cyc +0.10 % vs the oracle) and mb_imiss CPI 6.002 (+0.03 %) — the sub-0.1 % deltas are the proof the age-LRU’s victim sequence matches the original 2-way LRU clock-for-clock.

  • the standalone l1d RTL gate passes.

Non-default geometries are functionally correct but are not matched by the fixed 2-way / 128-set p5trace.so cycle oracle, so make verify / M4 / M5 certify only the default geometry. A 16 KB 4-way build was checked directly:

  • it builds clean;

  • a CoreMark free-run on the 4-way cache produces CRC output byte-identical to qemu-native (the cache is architecturally transparent, so any valid geometry yields identical results);

  • associativity does what it should — a 12 KB working set striding one load per 32-byte line over 128 sets puts 3 lines in every set, which thrashes the 2-way cache (CPI 2.50, every access a conflict miss) but fits the 4-way cache (CPI 0.54) — a 4.65× speedup from conflict-miss elimination alone:

Geometry (12 KB working set)

lines/set

CPI

outcome

8 KB, 2-way, 128 sets

3 > 2 ways

2.50

conflict thrash (100 % miss)

16 KB, 4-way, 128 sets

3 ≤ 4 ways

0.54

fits (≈ 0 miss) — 4.65× faster

Fidelity caveat

Only the 2-way / 128-set geometry is silicon-accurate and cycle-validated against the oracle. Other geometries change the miss sequence (and at non-2-way the replacement is no longer the P5’s), so they are area/perf experiments, not a verification config. See L1 cache size vs performance for what the cache size actually buys on the cycle-accurate model, and The r4 SRT divider and the FDIV bug for the broader cycle-model methodology.