Parametric L1 Caches¶
Ventium’s L1 instruction and data caches are 8 KB, 2-way set-associative, 32-byte line, 128 sets, LRU — the Pentium silicon geometry, and the one the cycle model is validated against. That geometry is now parametric: the set count and the associativity of each L1 can be set at compile time, so the core can be re-built as a 4 KB direct-mapped-ish, a 16 KB 4-way, a 32 KB 8-way, or any power-of-two combination — for FPGA area/Fmax studies and cache design-space exploration. The default build is bit- and cycle-identical to the original.
Compile-time knobs¶
Geometry is selected by +define flags (no source edit needed). Size of each
L1 = sets × ways × 32 B.
Define |
Effect |
Default |
|---|---|---|
|
set count for both I$ and D$ |
128 |
|
associativity for both I$ and D$ |
2 |
|
I-cache only (overrides the |
— |
|
D-cache only (overrides the |
— |
|
legacy shortcut: 64 sets (4 KB) both L1s |
— |
N is a power of two. Examples (built into a private obj_dir so they never
clobber the canonical one):
# 16 KB 4-way (both L1s)
make -C verif/tb rtl OBJDIR=obj_dir_4way VL_EXTRA_DEFINES="+define+VEN_L1_WAYS=4"
# 8-way I$, 256-set 2-way D$
make -C verif/tb rtl OBJDIR=obj_dir_mix \
VL_EXTRA_DEFINES="+define+VEN_IC_WAYS=8 +define+VEN_DC_SETS=256"
The set-index width ($clog2(SETS)), tag width (32-5-idx) and way-index
width ($clog2(WAYS)) all derive automatically; the tag/valid/data arrays and
the line store scale with the parameters.
Generalising the LRU without disturbing the default¶
The original 2-way replacement carried one bit per set — lru = the
most-recently-used way, with victim = ~lru. That does not generalise: an
N-way victim needs the full recency order, not just the MRU way.
The replacement is a per-way age-counter true-LRU. Each way of a set holds a
rank age[set][way] ∈ 0 … WAYS-1 (0 = MRU, WAYS-1 = LRU); the ranks of a
set are always a permutation of 0 … WAYS-1. The victim is the way whose
age is WAYS-1. On any access (hit or fill) to way k, every way more recent
than k (age < k’s age) ages by one and k becomes MRU (age 0). The icache
exposes the chosen victim as ic_victim_o so the spine’s fill-way selection
stays uniform (it simply reads ic_victim_o instead of the old ~ic_lru_o).
This encoding reduces exactly to the old behaviour at WAYS=2: reset ages
{0,1} make the first victim way 1 — identical to ~lru with lru reset
to 0; a hit/fill on way k moves k to MRU and the other way to LRU, exactly as
lru <= k did. So the default build’s hit test, victim sequence, and recency
updates are byte-for-byte the same — which the cycle gates confirm below.
Default is preserved; non-default is functionally correct¶
The default 8 KB / 2-way / 128-set build is bit- and cycle-identical to the pre-parameterisation core:
make verify— 77 / 77 programs functionally diff-clean vs QEMU.make m5— every cycle band green, with the cache kernels unchanged:mb_dmissCPI 2.504 (abs-cyc +0.10 % vs the oracle) andmb_imissCPI 6.002 (+0.03 %) — the sub-0.1 % deltas are the proof the age-LRU’s victim sequence matches the original 2-way LRU clock-for-clock.the standalone
l1dRTL gate passes.
Non-default geometries are functionally correct but are not matched by the
fixed 2-way / 128-set p5trace.so cycle oracle, so make verify / M4 / M5
certify only the default geometry. A 16 KB 4-way build was checked directly:
it builds clean;
a CoreMark free-run on the 4-way cache produces CRC output byte-identical to qemu-native (the cache is architecturally transparent, so any valid geometry yields identical results);
associativity does what it should — a 12 KB working set striding one load per 32-byte line over 128 sets puts 3 lines in every set, which thrashes the 2-way cache (CPI 2.50, every access a conflict miss) but fits the 4-way cache (CPI 0.54) — a 4.65× speedup from conflict-miss elimination alone:
Geometry (12 KB working set) |
lines/set |
CPI |
outcome |
|---|---|---|---|
8 KB, 2-way, 128 sets |
3 > 2 ways |
2.50 |
conflict thrash (100 % miss) |
16 KB, 4-way, 128 sets |
3 ≤ 4 ways |
0.54 |
fits (≈ 0 miss) — 4.65× faster |
Fidelity caveat¶
Only the 2-way / 128-set geometry is silicon-accurate and cycle-validated against the oracle. Other geometries change the miss sequence (and at non-2-way the replacement is no longer the P5’s), so they are area/perf experiments, not a verification config. See L1 cache size vs performance for what the cache size actually buys on the cycle-accurate model, and The r4 SRT divider and the FDIV bug for the broader cycle-model methodology.