UltraRAM L2 Cache
The KV260 part (xck26-sfvc784) carries 64 UltraRAM blocks — 18 Mb that
the integer core and its 8 KB L1s leave entirely idle. ven_l2 turns all of
them into a large, write-back, set-associative L2 that sits between the L1’s
word-granular backing port and the PS-DDR4 AXI master, to hide the long DDR
round-trip behind a multi-megabyte on-chip working set. It is a standalone
block as of this drop — fully synthesizable and unit-checked, but nothing in the
core/SoC instantiates it yet; the RTL lives at rtl/northbridge/ven_l2.sv (the
memory-controller-hub home for the L2 and the DDR path), with the full design
blueprint in fpga/L2_DESIGN.md.
Where it fits: a drop-in for the backing-port seam
The memory subsystem already speaks one word-granular backing port that
ven_l1d drives and ven_axi_master consumes:
core mem_* ─► ven_l1d ─(backing port)─► ven_axi_master ─► AXI4 ─► PS-DDR
ven_l2 is a drop-in for that wire. It is a slave of the backing protocol
on its upstream port (u_*, so ven_l1d’s master plugs straight in) and a
master of the same protocol on its downstream port (d_*, so
ven_axi_master’s slave plugs straight in):
ven_l1d.m_* ─► [u_*] ven_l2 [d_*] ─► ven_axi_master.m_*
Wiring it in later is therefore a pure rewire — no change to ven_l1d or
ven_axi_master. Because the L1 fill is blocking, the L2 only ever sees one
upstream transaction at a time, so it needs no MSHR / hit-under-miss: the
latency hiding comes from capacity, fast on-chip hits, and write-back absorption,
not request-level parallelism.
Geometry → exact UltraRAM tiling
Defaults: 2 MiB, 32-byte line, 8-way. The 32-byte line matches the L1 fill
granularity and the p5trace oracle line, so cache behaviour stays consistent
and one L2 miss is exactly one downstream fill. The data array tiles onto the
URAM column perfectly:
| Quantity | Value |
|---|---|
| lines | 2 MiB / 32 B = 65536 |
| DATA array | 65536 deep × 256 bits wide |
| URAM tiling | 256 b = 4 URAM-wide (64 data bits/tile; each URAM288 word is 72 b → 4×8 = 32 spare parity bits/word, reserved for SECDED ECC) × 65536/4096 = 16 URAM-deep cascade = 64 URAM288 (every block on the device) |
| sets | 65536 / 8 = 8192 (index = |
| tag | |
| META array | 8192 deep × 135 b BRAM: per set = 8 × |
Vivado cascades URAM up to 16-high, so the data array maps to exactly 4 columns × 16 = the 64 URAM288 on the part. Capacity / line / associativity are parameters; the unit gates instantiate a tiny 1 KiB / 2-way L2 so init and evictions simulate fast.
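The tiling arithmetic above can be re-derived in a few lines. This is a minimal sketch, not the RTL's parameter logic: the 32-bit address width and the per-way metadata layout (tag + valid + dirty, plus 7 tree-PLRU bits per set) are assumptions made for illustration, and all names here are hypothetical. Under those assumptions the metadata width comes out at 135 bits, which is consistent with the META array width quoted above but does not confirm the actual layout.

```python
CAPACITY = 2 * 1024 * 1024      # 2 MiB default
LINE_BYTES = 32                 # default line size
WAYS = 8                        # default associativity
ADDR_BITS = 32                  # ASSUMPTION: address width is not stated in the text
URAM_DEPTH = 4096               # URAM288 is 4096 words deep
URAM_DATA_BITS = 64             # 64 of the 72 bits per word used as data

lines = CAPACITY // LINE_BYTES                   # number of cache lines
sets = lines // WAYS                             # number of sets
uram_wide = LINE_BYTES * 8 // URAM_DATA_BITS     # tiles side by side for one 256-bit line
uram_deep = lines // URAM_DEPTH                  # cascade depth
uram_total = uram_wide * uram_deep

offset_bits = (LINE_BYTES - 1).bit_length()      # byte offset within a line
index_bits = (sets - 1).bit_length()             # set index
tag_bits = ADDR_BITS - index_bits - offset_bits  # depends on the ADDR_BITS assumption

# ASSUMPTION: per way = tag + valid + dirty; per set = WAYS - 1 tree-PLRU bits.
meta_bits = WAYS * (tag_bits + 2) + (WAYS - 1)


def decode(addr):
    """Split a byte address into (tag, index, offset) under the assumed layout."""
    offset = addr & (LINE_BYTES - 1)
    index = (addr >> offset_bits) & (sets - 1)
    tag = addr >> (offset_bits + index_bits)
    return tag, index, offset


print(lines, sets, uram_wide, uram_deep, uram_total, tag_bits, meta_bits)
```

Changing `CAPACITY`, `LINE_BYTES`, or `WAYS` shows how the tiling stops being exact, for example when the line count is no longer a multiple of the 4096-word URAM depth.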
Microarchitecture
Policy: write-back, write-allocate. A store hit RMWs the resident line in place and marks it dirty (no DDR); a store miss fills then merges (no DDR write until eviction). A clean victim is dropped; a dirty victim is written back (8 word writes) before its slot is refilled. This absorbs the L1’s write-through store stream: a dirty line reaches DDR once, on eviction, not per store.
Hit path: serial TAG → DATA. Read the indexed set’s packed tags, compare all ways, then read the selected way’s 256-bit line from URAM. A read hit streams the 8 words upstream; a write hit byte-writes the store into the line in place.
Read-miss path: forward-during-fill. Each DDR word is handed straight to the L1 (u_ack = d_ack) as it arrives and simultaneously captured into the fill buffer, so the L2 adds ≈ 0 latency over a bare DDR fill while still populating itself for next time.
Replacement: tree-PLRU, but an invalid way is always preferred (fill cold capacity before evicting anything).
Line-straddling stores are split into line L and line L+1 and run as two sequential transactions with a single upstream ack, mirroring the per-word split ven_axi_master does today, but at line granularity.
Init: URAM/BRAM have no global reset, so an init FSM walks all sets clearing valid bits after reset; ready is low until done.
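Two of the rules above, the invalid-first tree-PLRU victim choice and the line-granular store split, are easy to misread in prose, so here is a minimal behavioural sketch of both. It is an illustration only: the class and function names are hypothetical, and the PLRU bit encoding (heap-ordered nodes, 1 meaning "victim is on the right") and the lowest-index tie-break among invalid ways are assumptions, not taken from ven_l2.sv.

```python
LINE_BYTES = 32   # default L2 line size
WAYS = 8          # default associativity


def split_store(addr, nbytes):
    """Split a store into per-line pieces [(addr, nbytes), ...]."""
    first = min(nbytes, LINE_BYTES - addr % LINE_BYTES)
    pieces = [(addr, first)]
    if first < nbytes:                       # the store spills into line L+1
        pieces.append((addr + first, nbytes - first))
    return pieces


class TreePLRU:
    """Tree-PLRU state for one 8-way set: 7 bits, one per tree node."""

    def __init__(self):
        self.bits = [0] * (WAYS - 1)         # heap order: children of n are 2n+1, 2n+2

    def touch(self, way):
        """On an access, point every node on the path away from `way`."""
        node = 0
        for level in (2, 1, 0):              # walk the way index MSB first
            went_right = (way >> level) & 1
            self.bits[node] = 1 - went_right
            node = 2 * node + 1 + went_right

    def victim(self, valid):
        """Prefer an invalid way (lowest index); otherwise follow the tree bits."""
        for way, is_valid in enumerate(valid):
            if not is_valid:
                return way
        node, way = 0, 0
        for _ in range(3):
            go_right = self.bits[node]
            way = (way << 1) | go_right
            node = 2 * node + 1 + go_right
        return way


plru = TreePLRU()
for w in range(WAYS):
    plru.touch(w)                            # touch ways 0..7 in order
print(plru.victim([True] * WAYS))            # all valid: the tree picks the victim
print(split_store(0x1E, 4))                  # a 4-byte store crossing a line boundary
```

Checking the valid bits before consulting the tree is what makes a cold set fill all eight ways before anything is evicted.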
A stuck DDR is handled out-of-band: ven_axi_master raises its own sticky
bus_err (it keeps holding AXI VALID rather than synthesizing an ack), and that
signal wires straight to the core’s bus_err → S_HALT override; it is not part
of the backing-port wire the L2 splices into. The core halts and the PS resets
the PL, so an L2 fill/writeback FSM left waiting on the missing ack is cleared
by that same reset.
Verification
Three standalone Verilator gates (run from the repo root, all green with
--assert):
verif/l1/run-l2-gate.sh: focused L2 logic against a synthetic DDR with a configurable per-word latency and access counters. Checks: hit = 0 DDR reads, write-back = 0 DDR writes, byte-correct dirty eviction, sub-word / unaligned / cross-line straddle stores, dirty == modified, single-outstanding.
verif/l1/run-l2-axi-gate.sh: end-to-end through the real ven_axi_master + axi_slave_bfm (RD/WR latency + a mid-burst RVALID bubble + the 32→40-bit remap). Checks: fills become one INCR8 burst, dirty writebacks become single-beat writes, an L2 hit issues no AR.
verif/l1/run-l2-stress-gate.sh: an 8× capacity overrun that thrashes the cache (conflict + capacity misses, dirty-eviction churn), golden-checks every read, captures the DDR transaction trace, and ends with a cold-sweep flush that proves the DDR backing equals the golden model word-for-word, so no dirty data is lost across the overrun (e.g. one run: 1449 fills, 656 writebacks, 65 KiB moved).
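The counter-based properties these gates check (a hit costs no DDR read, a store is absorbed until eviction, and a final flush leaves DDR equal to the golden model) can be illustrated with a toy reference model. This is a sketch under stated assumptions, not the gates' actual testbench: the class name is hypothetical, replacement is plain LRU rather than tree-PLRU, DDR is a dict of whole lines, and the counters count lines rather than words. The default 16 sets × 2 ways × 32 B matches the 1 KiB / 2-way size the unit gates are described as using.

```python
LINE = 32  # bytes per line, as in the L2 default


class TinyWriteBackCache:
    """Toy write-back, write-allocate cache over a dict-backed DDR."""

    def __init__(self, ddr, sets=16, ways=2):
        self.ddr, self.sets, self.ways = ddr, sets, ways
        self.lines = [[] for _ in range(sets)]   # per set: [tag, data, dirty], MRU last
        self.ddr_reads = self.ddr_writes = 0     # counted in whole lines

    def _write_back(self, tag, idx, data):
        self.ddr[(tag * self.sets + idx) * LINE] = bytes(data)
        self.ddr_writes += 1

    def _lookup(self, addr):
        tag, idx = divmod(addr // LINE, self.sets)
        ways = self.lines[idx]
        for pos, entry in enumerate(ways):
            if entry[0] == tag:                  # hit: no DDR traffic
                ways.append(ways.pop(pos))       # move to the MRU position
                return entry
        if len(ways) == self.ways:               # miss in a full set: evict the LRU way
            vtag, vdata, vdirty = ways.pop(0)
            if vdirty:                           # clean victims are simply dropped
                self._write_back(vtag, idx, vdata)
        base = addr // LINE * LINE
        entry = [tag, bytearray(self.ddr.get(base, bytes(LINE))), False]
        self.ddr_reads += 1                      # one fill per miss
        ways.append(entry)
        return entry

    def load(self, addr):
        return self._lookup(addr)[1][addr % LINE]

    def store(self, addr, byte):
        entry = self._lookup(addr)               # write-allocate: fill, then merge
        entry[1][addr % LINE] = byte
        entry[2] = True                          # the DDR copy is now stale

    def flush(self):
        for idx, ways in enumerate(self.lines):
            for tag, data, dirty in ways:
                if dirty:
                    self._write_back(tag, idx, data)
            ways.clear()


cache = TinyWriteBackCache(ddr={})
cache.load(0x00)
cache.load(0x04)            # second load hits the resident line
cache.store(0x08, 0xAB)     # store hit: absorbed, no DDR write yet
print(cache.ddr_reads, cache.ddr_writes)
```

Driving a model like this with an address stream several times larger than its capacity, then flushing and comparing the DDR dict against a flat golden array, mirrors the shape of the stress gate at a much smaller scale.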
The design was hardened by a five-dimension adversarial review (backing-port faithfulness, FSM liveness, cache correctness, URAM inference, protocol corners) with every finding cross-checked by independent skeptics.
FPGA results: standalone OOC on the K26
Synthesized out-of-context (the module alone, no core/SoC — its u_*/d_*
ports become OOC pins) and run all the way through place + route on
xck26-sfvc784-2LV-c in Vivado 2025.2, default geometry (2 MiB / 8-way,
RAM_RD_LAT=5). Reproduce with:
vivado -mode batch -source fpga/scripts/synth_l2_ooc.tcl -notrace -tclargs <period_ns>
Fmax ≈ 155 MHz (post-route). Fmax = 1 / (period − WNS):
| Target period | WNS | Fmax | Note |
|---|---|---|---|
| 6.7 ns (149 MHz) | +0.25 ns | 155.0 MHz | meets timing; the trustworthy number |
| 2.5 ns (400 MHz) | −4.18 ns | 149.8 MHz | over-constrained cross-check (agrees) |
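As a quick check of the formula, the two rows can be recomputed from their period and WNS. This is only a sketch of the arithmetic; the function name is hypothetical. With the rounded WNS values shown in the table, the second row evaluates to about 149.7 MHz, within 0.1 MHz of the tabulated 149.8 MHz, a gap that is presumably down to rounding of the reported slack.

```python
def fmax_mhz(period_ns, wns_ns):
    """Fmax = 1 / (period - WNS), with times in ns and the result in MHz."""
    return 1000.0 / (period_ns - wns_ns)


# (target period, worst negative slack) pairs from the table above
for period, wns in [(6.7, 0.25), (2.5, -4.18)]:
    print(f"period {period} ns, WNS {wns:+.2f} ns -> {fmax_mhz(period, wns):.1f} MHz")
```

Positive slack shortens the achievable period and negative slack lengthens it, which is why an over-constrained run can still be used as a cross-check of the timing-clean one.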
Hold is met (WHS +0.10 ns). Area (the block is almost pure memory):
| Resource | Used | Available | Utilisation |
|---|---|---|---|
| URAM288 | 64 | 64 | 100 % |
| Block RAM (RAMB36E2) | 34 | 144 | 23.6 % |
| CLB LUTs | 2 250 | 117 120 | 1.92 % |
| CLB Registers (FF) | 1 261 | 234 240 | 0.54 % |
| CARRY8 / F7 mux | 14 / 46 | - | ≈ 0.1 % |
| DSP | 0 | 1 248 | 0 % |
The headline is the 100 % URAM utilisation — the whole point of the block, now confirmed on silicon-accurate tooling: it consumes every one of the device’s 64 UltraRAM blocks for 2 MiB of data, plus ~24 % of BRAM for tags + PLRU, behind a tiny ~2 % LUT / 0.5 % FF control footprint and zero DSP.
The Fmax limiter: URAM cascade pipelining
The critical path is the URAM cascade read (l2_data_reg_uram_*_Cas_AddrA)
plus the tag compare. Vivado flags the data array as under-pipelined:
Note
[Synth 8-6013] UltraRAM "ven_l2/l2_data_reg" is under-pipelined and may not
meet performance target : Pipeline stages found = 4; Recommended = 9.
A 16-tile-deep URAM column wants its read pipelined roughly one register per few
tiles. The default RAM_RD_LAT=5 lets Vivado absorb 4 cascade stages →
155 MHz; raising RAM_RD_LAT toward ~10 would let it absorb the recommended 9
and lift Fmax further, at the cost of a few more hit-latency cycles. Either way
155 MHz is ~3–4× the core’s ~40–50 MHz operating point (see
The FP execute/commit pipeline for the routed-Fmax wall the integer/FP core hits), so
the L2 is comfortably not a timing bottleneck even under-pipelined.
Status & follow-on
Standalone and verified; not yet wired into the core. The integration path is
an I/D backing-port arbiter (merging the icache and L1-D backing streams) plus a
bus_mode to route through the L2; then SECDED ECC over the 32 spare URAM
parity bits, a writeback buffer so a dirty eviction does not serialize in front
of the fill, burst (vs per-word) dirty writeback, and a next-line/stride
prefetcher.
See Parametric L1 Caches and L1 cache size vs performance for the L1 cache design-space and what cache capacity buys on the cycle-accurate model.