=================================================
UltraRAM L2 Cache
=================================================

The KV260 part (``xck26-sfvc784``) carries **64 UltraRAM blocks (18 Mb)** that
the integer core and its 8 KB L1s leave entirely idle. ``ven_l2`` turns all of
them into a large, write-back, set-associative **L2** that sits between the L1's
word-granular backing port and the PS-DDR4 AXI master, to **hide the long DDR
round-trip** behind a 2 MiB on-chip working set. As of this drop it is a
**standalone** block: fully synthesizable and unit-checked, but not yet
instantiated by the core or the SoC. The RTL lives at
``rtl/northbridge/ven_l2.sv`` (the memory-controller-hub home for the L2 and the
DDR path), with the full design blueprint in ``fpga/L2_DESIGN.md``.

Where it fits — a drop-in for the backing-port seam
===================================================

The memory subsystem already speaks **one** word-granular *backing port* that
``ven_l1d`` drives and ``ven_axi_master`` consumes::

   core mem_* ─► ven_l1d ─(backing port)─► ven_axi_master ─► AXI4 ─► PS-DDR

``ven_l2`` is a drop-in for that wire. It is a **slave** of the backing protocol
on its upstream port (``u_*``, so ``ven_l1d``'s master plugs straight in) and a
**master** of the same protocol on its downstream port (``d_*``, so
``ven_axi_master``'s slave plugs straight in)::

   ven_l1d.m_* ─► [u_*] ven_l2 [d_*] ─► ven_axi_master.m_*

Wiring it in later is therefore a **pure rewire** — no change to ``ven_l1d`` or
``ven_axi_master``. Because the L1 fill is *blocking*, the L2 only ever sees one
upstream transaction at a time, so it needs **no MSHR / hit-under-miss**: the
latency hiding comes from capacity, fast on-chip hits, and write-back absorption,
not request-level parallelism.

Geometry → exact UltraRAM tiling
================================

Defaults: **2 MiB, 32-byte line, 8-way**. The 32-byte line matches the L1 fill
granularity and the ``p5trace`` oracle line, so cache behaviour stays consistent
and one L2 miss is exactly one downstream fill. The data array tiles onto the
URAM column **perfectly**:

.. list-table::
   :header-rows: 1
   :widths: 30 70

   * - Quantity
     - Value
   * - lines
     - 2 MiB / 32 B = 65536
   * - DATA array
     - 65536 deep × **256 bits** wide
   * - **URAM tiling**
     - Width: 256 b / 64 data bits per tile = **4 URAM-wide** (each URAM288 word
       is 72 b, so 4 × 8 = 32 spare parity bits per 256-bit word, reserved for
       SECDED ECC). Depth: 65536 / 4096 = **16 URAM-deep cascade**. Total:
       4 × 16 = **64 URAM288** (every block on the device)
   * - sets
     - 65536 / 8 = 8192 (index = ``addr[17:5]``)
   * - tag
     - ``addr[31:18]`` = 14 b
   * - META array
     - 8192 deep × 135 b in BRAM (per set: 8 × ``{valid,dirty,tag14}`` = 128 b,
       plus 7 tree-PLRU bits), ≈ **1.1 Mb** total

Vivado cascades URAM up to 16-high, so the data array maps to exactly **4 columns
× 16 = the 64 URAM288** on the part. Capacity / line / associativity are
parameters; the unit gates instantiate a tiny 1 KiB / 2-way L2 so init and
evictions simulate fast.
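
The arithmetic in the table can be reproduced with a short Python sketch. It is
an illustration only: the names ``l2_geometry`` and ``decode`` are hypothetical
and do not come from ``ven_l2.sv``. It assumes power-of-two sizes, a 32-bit
address, and the 4096 × 72 b URAM288 organisation with 64 data bits used per
tile.

.. code-block:: python

   URAM_DEPTH = 4096          # words per URAM288
   URAM_WORD_BITS = 72        # physical width of one URAM288 word
   DATA_BITS_PER_TILE = 64    # bits of each word used for data; the rest are spare


   def l2_geometry(capacity_bytes=2 * 1024 * 1024, line_bytes=32, ways=8,
                   addr_bits=32):
       """Derive array shapes and address-field widths (power-of-two sizes assumed)."""
       lines = capacity_bytes // line_bytes
       sets = lines // ways
       offset_bits = line_bytes.bit_length() - 1
       index_bits = sets.bit_length() - 1
       tag_bits = addr_bits - index_bits - offset_bits
       uram_wide = (line_bytes * 8) // DATA_BITS_PER_TILE
       uram_deep = -(-lines // URAM_DEPTH)          # ceiling division
       return {
           "lines": lines,
           "sets": sets,
           "offset_bits": offset_bits,
           "index_bits": index_bits,
           "tag_bits": tag_bits,
           "uram_wide": uram_wide,
           "uram_deep": uram_deep,
           "uram_total": uram_wide * uram_deep,
           "spare_bits_per_line": uram_wide * (URAM_WORD_BITS - DATA_BITS_PER_TILE),
           # per set: ways x {valid, dirty, tag} plus (ways - 1) tree-PLRU bits
           "meta_bits_per_set": ways * (2 + tag_bits) + (ways - 1),
       }


   def decode(addr, geo):
       """Split a byte address into (tag, set index, byte offset within the line)."""
       offset = addr & ((1 << geo["offset_bits"]) - 1)
       index = (addr >> geo["offset_bits"]) & ((1 << geo["index_bits"]) - 1)
       tag = addr >> (geo["offset_bits"] + geo["index_bits"])
       return tag, index, offset


   geo = l2_geometry()
   print(geo)
   print(decode(0x12345678, geo))

Calling ``l2_geometry(1024, 32, 2)`` gives the shape of the tiny unit-gate
configuration under the same assumptions.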

Microarchitecture
=================

* **Policy** — write-back, write-allocate. A store *hit* RMWs the resident line
  in place and marks it dirty (no DDR); a store *miss* fills then merges (no DDR
  write until eviction). A clean victim is dropped; a dirty victim is written
  back (8 word writes) before its slot is refilled. This **absorbs the L1's
  write-through store stream** — a dirty line reaches DDR once, on eviction, not
  per store.
* **Hit path** — serial **TAG → DATA**: read the indexed set's packed tags,
  compare all ways, then read the selected way's 256-bit line from URAM. A read
  hit streams the 8 words upstream; a write hit byte-writes the store into the
  line in place.
* **Read-miss path — forward-during-fill.** Each DDR word is handed straight to
  the L1 (``u_ack = d_ack``) *as it arrives* and simultaneously captured into the
  fill buffer, so the L2 adds ≈ 0 latency over a bare DDR fill while still
  populating itself for next time.
* **Replacement** — tree-PLRU, but an **invalid** way is always preferred (fill
  cold capacity before evicting anything).
* **Line-straddling stores** are split into line *L* and line *L+1* and run as
  two sequential transactions with a single upstream ack — mirroring the per-word
  split ``ven_axi_master`` does today, but at line granularity.
* **Init** — URAM/BRAM have no global reset, so an init FSM walks all sets
  clearing valid bits after reset; ``ready`` is low until done.
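
The replacement rule above (tree-PLRU, with invalid ways taking priority) can be
sketched in Python as follows. This is a behavioural illustration, not the RTL:
the heap-style node numbering, the bit polarity (``1`` means "victim is in the
right half"), the choice of the lowest-numbered invalid way, and the function
names are all assumptions made for the example.

.. code-block:: python

   def plru_victim(bits, ways=8):
       """Follow the tree bits from the root down to the pseudo-LRU way."""
       node, lo, hi = 0, 0, ways
       while hi - lo > 1:
           mid = (lo + hi) // 2
           if bits[node]:            # 1: victim lies in the right half
               lo, node = mid, 2 * node + 2
           else:                     # 0: victim lies in the left half
               hi, node = mid, 2 * node + 1
       return lo


   def plru_touch(bits, way, ways=8):
       """After an access to `way`, point every node on its path away from it."""
       node, lo, hi = 0, 0, ways
       while hi - lo > 1:
           mid = (lo + hi) // 2
           if way < mid:             # accessed the left half: victim goes right
               bits[node] = 1
               hi, node = mid, 2 * node + 1
           else:                     # accessed the right half: victim goes left
               bits[node] = 0
               lo, node = mid, 2 * node + 2


   def choose_victim(valid, bits):
       """Prefer any invalid way (cold capacity); otherwise fall back to tree-PLRU."""
       for way, is_valid in enumerate(valid):
           if not is_valid:
               return way
       return plru_victim(bits, len(valid))


   bits = [0] * 7                    # 7 tree bits cover 8 ways
   for w in range(8):
       plru_touch(bits, w)
   print(choose_victim([True] * 8, bits))

Seven bits suffice for eight ways because the tree has one bit per internal node
(1 + 2 + 4).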

A stuck DDR is handled **out-of-band**: ``ven_axi_master`` raises its own sticky
``bus_err`` (it keeps AXI ``VALID`` asserted and does *not* synthesize an ack),
and that signal wires straight to the core's ``bus_err`` → ``S_HALT`` override;
it is not part of the backing-port wire the L2 splices into. The core halts and
the PS resets the PL, so an L2 fill/writeback FSM left waiting on the missing
ack is cleared by that same reset.
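
Returning to the line-straddling stores in the list above, the split into line
*L* and line *L+1* can be illustrated with a small Python helper. It is a sketch
under assumptions: the store is modelled as a byte string at a byte address, the
name ``split_store`` is hypothetical, and the real block sequences the two
pieces as transactions with one upstream ack rather than returning a list.

.. code-block:: python

   LINE_BYTES = 32


   def split_store(addr, data, line_bytes=LINE_BYTES):
       """Cut a store into per-line pieces: (line base, offset in line, bytes)."""
       pieces = []
       while data:
           offset = addr % line_bytes
           room = line_bytes - offset          # bytes left in the current line
           chunk, data = data[:room], data[room:]
           pieces.append((addr - offset, offset, chunk))
           addr += len(chunk)
       return pieces


   # A 4-byte store starting 2 bytes before a line boundary touches two lines.
   print(split_store(0x101E, b"\x11\x22\x33\x44"))

A store that fits inside one line comes back as a single piece, so the common
case needs no second transaction.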

Verification
============

Three standalone Verilator gates (run from the repo root, all green with
``--assert``):

* ``verif/l1/run-l2-gate.sh`` — focused L2 logic against a synthetic DDR with a
  configurable per-word latency and access counters: read hit = **0 DDR reads**,
  store absorbed by write-back = **0 DDR writes**, byte-correct dirty eviction,
  sub-word / unaligned / cross-line straddle stores, ``dirty == modified``,
  single-outstanding.
* ``verif/l1/run-l2-axi-gate.sh`` — **end-to-end through the real**
  ``ven_axi_master`` **+** ``axi_slave_bfm`` (RD/WR latency + a mid-burst RVALID
  bubble + the 32→40-bit remap): fills become one **INCR8** burst, dirty
  writebacks become single-beat writes, an L2 hit issues **no AR**.
* ``verif/l1/run-l2-stress-gate.sh`` — an **8× capacity overrun** that thrashes
  the cache (conflict + capacity misses, dirty-eviction churn), golden-checks
  every read, and captures the DDR transaction trace; a final cold-sweep flush
  then proves the DDR backing equals the golden model **word-for-word**, with no
  dirty data lost across the overrun (e.g. one run: 1449 fills, 656 writebacks,
  65 KiB moved).
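
The counter-style checks in these gates can be illustrated with a toy Python
model of a write-back, write-allocate cache in front of a counted "DDR". It is
not the testbench and shares no code with it: the class name ``ToyL2`` is
hypothetical, accesses are assumed to be aligned 32-bit words, and it uses true
LRU rather than tree-PLRU (the two coincide for the 2-way unit-gate geometry).

.. code-block:: python

   from collections import OrderedDict


   class ToyL2:
       """Set-associative write-back cache with DDR word-access counters."""

       def __init__(self, capacity=1024, line_bytes=32, ways=2):
           self.line_bytes = line_bytes
           self.words = line_bytes // 4
           self.ways = ways
           self.sets = capacity // line_bytes // ways
           # per set: tag -> [dirty, words], ordered from LRU to MRU
           self.store = [OrderedDict() for _ in range(self.sets)]
           self.ddr = {}                 # byte address of word -> value
           self.ddr_reads = 0
           self.ddr_writes = 0

       def _line(self, addr):
           line_no = addr // self.line_bytes
           idx, tag = line_no % self.sets, line_no // self.sets
           s = self.store[idx]
           if tag in s:                  # hit: no DDR traffic
               s.move_to_end(tag)
               return s[tag]
           if len(s) == self.ways:       # set full: evict the LRU way
               vtag, (dirty, data) = s.popitem(last=False)
               if dirty:                 # only dirty victims are written back
                   vbase = (vtag * self.sets + idx) * self.line_bytes
                   for i, word in enumerate(data):
                       self.ddr[vbase + 4 * i] = word
                       self.ddr_writes += 1
           base = line_no * self.line_bytes
           data = [self.ddr.get(base + 4 * i, 0) for i in range(self.words)]
           self.ddr_reads += self.words  # one fill = one line of word reads
           s[tag] = [False, data]
           return s[tag]

       def read(self, addr):
           return self._line(addr)[1][(addr % self.line_bytes) // 4]

       def write(self, addr, value):
           line = self._line(addr)
           line[1][(addr % self.line_bytes) // 4] = value
           line[0] = True                # mark dirty; DDR is untouched


   l2 = ToyL2()
   l2.write(0x0, 0xAA)                   # store miss: fill, then merge
   for i in range(100):
       l2.write(0x4, i)                  # store hits: absorbed on chip
   print(l2.ddr_reads, l2.ddr_writes)

In this model the only DDR writes come from evicting a dirty line, which is the
behaviour the gates check with their access counters.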

The design was hardened by a five-dimension adversarial review (backing-port
faithfulness, FSM liveness, cache correctness, URAM inference, protocol corners)
with every finding cross-checked by independent skeptics.

FPGA results — standalone OOC on the K26
========================================

Synthesized **out-of-context** (the module alone, no core/SoC — its ``u_*``/``d_*``
ports become OOC pins) and run all the way through **place + route** on
``xck26-sfvc784-2LV-c`` in Vivado 2025.2, default geometry (2 MiB / 8-way,
``RAM_RD_LAT=5``). Reproduce with::

   vivado -mode batch -source fpga/scripts/synth_l2_ooc.tcl -notrace -tclargs <period_ns>

**Fmax ≈ 155 MHz** (post-route). ``Fmax = 1 / (period − WNS)``:

.. list-table::
   :header-rows: 1
   :widths: 34 16 16 34

   * - Target period
     - WNS
     - Fmax
     - Note
   * - 6.7 ns (149 MHz)
     - **+0.25 ns**
     - **155.0 MHz**
     - meets timing — the trustworthy number
   * - 2.5 ns (400 MHz)
     - −4.18 ns
     - 149.8 MHz
     - over-constrained cross-check (agrees)
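
The Fmax column follows directly from the formula above. As a quick check, a
few lines of Python (the helper name ``fmax_mhz`` is made up for this example)
evaluate it for both rows, using the WNS values as rounded in the table:

.. code-block:: python

   def fmax_mhz(period_ns, wns_ns):
       """Fmax = 1 / (period - WNS); periods in ns, result in MHz."""
       return 1000.0 / (period_ns - wns_ns)


   for period_ns, wns_ns in [(6.7, 0.25), (2.5, -4.18)]:
       print(f"{period_ns} ns target: {fmax_mhz(period_ns, wns_ns):.1f} MHz")

The over-constrained row comes out at about 149.7 MHz with the two-decimal WNS
shown, which matches the tabulated 149.8 MHz to within the rounding of WNS.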

Hold is met (WHS +0.10 ns). **Area** (the block is almost pure memory):

.. list-table::
   :header-rows: 1
   :widths: 34 22 22 22

   * - Resource
     - Used
     - Available
     - Utilisation
   * - **URAM288**
     - **64**
     - 64
     - **100 %**
   * - Block RAM (RAMB36E2)
     - 34
     - 144
     - 23.6 %
   * - CLB LUTs
     - 2 250
     - 117 120
     - 1.92 %
   * - CLB Registers (FF)
     - 1 261
     - 234 240
     - 0.54 %
   * - CARRY8 / F7 mux
     - 14 / 46
     - —
     - ≈ 0.1 %
   * - DSP
     - 0
     - 1 248
     - 0 %

The headline is the **100 % URAM utilisation**, which is the whole point of the
block and is now confirmed by a full place-and-route run: the L2 consumes every
one of the device's 64 UltraRAM blocks for 2 MiB of data, plus ~24 % of BRAM for
tags + PLRU, behind a tiny ~2 % LUT / 0.5 % FF control footprint and **zero
DSP**.

The Fmax limiter — URAM cascade pipelining
==========================================

The critical path is the **URAM cascade read** (``l2_data_reg_uram_*_Cas_AddrA``)
plus the tag compare. Vivado flags the data array as *under-pipelined*:

.. note::

   ``[Synth 8-6013] UltraRAM "ven_l2/l2_data_reg" is under-pipelined and may not
   meet performance target : Pipeline stages found = 4; Recommended = 9.``

A 16-tile-deep URAM column wants its read pipelined roughly one register per few
tiles. The default ``RAM_RD_LAT=5`` lets Vivado absorb **4** cascade stages →
155 MHz; raising ``RAM_RD_LAT`` toward ~10 would let it absorb the recommended 9
and lift Fmax further, at the cost of a few more hit-latency cycles. Either way
**155 MHz is ~3–4× the core's ~40–50 MHz operating point** (see
:doc:`fp-commit-pipeline` for the routed-Fmax wall the integer/FP core hits), so
the L2 is comfortably **not** a timing bottleneck even under-pipelined.

Status & follow-on
==================

Standalone and verified; **not yet wired into the core**. Integration needs an
I/D backing-port arbiter (merging the icache and L1-D backing streams) plus a
``bus_mode`` to route through the L2. Planned follow-ons after that:

* SECDED ECC over the 32 spare URAM parity bits;
* a writeback buffer, so a dirty eviction does not serialize in front of the
  fill;
* burst (instead of per-word) dirty writeback;
* a next-line/stride prefetcher.

See :doc:`l1-parametric` and :doc:`l1-cache-performance` for the L1 cache
design-space and what cache capacity buys on the cycle-accurate model.