===================
Verilog build flags
===================

Ventium's RTL is built around a single convention: **the default build — no
defines at all — is the verification baseline**. It is the configuration every
differential gate runs, byte- and cycle-identical to the golden model
(QEMU for architectural state, ``p5trace`` for the cycle traces). Every build
flag in the project *opts in* to something on top of that baseline, and every
opted-in configuration is itself pushed back through the differential battery
before it is trusted.

The flags divide into two kinds. **Behaviour-preserving implementation
options** change *how* a result is computed — an iterative engine instead of a
combinational cone, a BRAM instead of LUTRAM, a pipeline register on a critical
path — and exist purely for FPGA Fmax and area. These must stay architecturally
bit-exact, and either cycle-identical or inside the loose M4/M5 cycle bands;
they are what the KV260 bitstream configurations set. **Behaviour-adding
options** change *what* the core does — the authentic FDIV flaw, eight new
transcendental instructions, a P5 cycle-model refinement, a real AXI bus —
and each one carries its own oracle and its own gate, because the default
golden no longer applies to it.

This page is the complete reference: every conditional-compilation define and
build parameter in the tree, what it gates, who sets it, and how (or whether)
it is verified. File references are given as paths; the authoritative location
is always the ``ifdef`` in the source itself.

.. contents::
   :local:
   :depth: 1

Quick reference
===============

.. list-table::
   :header-rows: 1
   :widths: 24 16 22 38

   * - Flag
     - Kind
     - Default
     - Purpose
   * - **x87 FPU implementation**
     -
     -
     -
   * - ``VEN_FP_PIPE``
     - boolean define
     - off — single-cycle FP execute
     - 1-deep FP-execute commit pipeline (defer commit to N+1) for Fmax.
   * - ``VEN_FP_PIPE2``
     - boolean define
     - off (inert without ``VEN_FP_PIPE``)
     - Second commit stage at the ``f_eval_s1``/``f_eval_s2`` boundary; the K26
       60 MHz close.
   * - ``VEN_FP_OVERLAP``
     - boolean define
     - off — serialized, matches oracle
     - P5 cycle gap 1: independent integer ops overlap an in-flight FDIV
       (own oracle variant).
   * - ``VEN_FXCH_FREE``
     - boolean define
     - off — FXCH costs 1 clock
     - P5 cycle gap 2: an FXCH following a stack push folds in for free
       (own oracle variant).
   * - ``VEN_SRT_ITER``
     - boolean define
     - off — combinational FDIV/FSQRT
     - Multi-cycle iterative SRT divide and square-root engines, bit-exact.
   * - ``VEN_SRT_DIV``
     - boolean define
     - no-op (legacy)
     - Historical name; the genuine SRT divider is now the unconditional
       default.
   * - ``VEN_SRT_FDIV_BUG``
     - boolean define
     - off — correct PLA
     - The authentic Pentium FDIV flaw (buggy quotient-selection PLA).
   * - ``VEN_DIV_EXACT``
     - boolean define
     - off — SRT engine is default
     - Opt *back* to the behavioural wide divide (fast-sim escape hatch).
   * - ``VEN_BCD_ITER``
     - boolean define
     - off — combinational BCD cones
     - Iterative FBSTP/FBLD packed-BCD conversion engines, bit-exact.
   * - ``VEN_BCD_DIV100``
     - boolean define
     - off everywhere
     - One divide-by-100 per BCD step instead of two chained divide-by-10
       (Fmax micro-optimisation inside ``ven_bcd``).
   * - ``VEN_TRANSCENDENTAL``
     - boolean define
     - off — opcodes HALT
     - Adds the 8 x87 transcendental instructions via iterative engines.
   * - ``VEN_TRSC_SILICON``
     - boolean define
     - inert (reserved)
     - Planned silicon-accuracy mode select for the transcendental engines;
       not implemented.
   * - **Front-end and L1 caches**
     -
     -
     -
   * - ``VEN_BTB_PIPE``
     - boolean define
     - off — combinational resolve
     - Registers the BTB resolve inputs; cycle-neutral Fmax fix.
   * - ``VEN_UOPCACHE``
     - boolean define
     - off — live byte-window decode
     - Predecode-on-fill micro-op cache; the fix for the x86 byte-window MUXF
       congestion wall.
   * - ``VEN_UOPCACHE_CHECK``
     - boolean define
     - off
     - Sim scaffold re-instantiating the reference decode beside the µop-cache
       slot read.
   * - ``VEN_FE_PIPE``
     - boolean define
     - off — live TLB lookups
     - Page-keyed fetch/data micro-TLB register; breaks the ``eip`` self-loop
       (+1 clock per page crossing).
   * - ``VEN_IC_BRAM``
     - boolean define
     - off — async LUTRAM lines
     - Block-RAM icache line store with registered reads and predicted-target
       prefetch.
   * - ``VEN_IC_NARROWB``
     - boolean define
     - off — full-width port B
     - Halves the icache straddle read port (distributed-RAM front end only).
   * - ``VEN_CACHE_HALF``
     - boolean define
     - off — 128 sets per L1
     - Both L1s halved to 64 sets (4 KB each); FPGA area/congestion option.
   * - ``VEN_IC_SETS``
     - valued define (int)
     - 128
     - I-cache set-count override for geometry sweeps.
   * - ``VEN_IC_WAYS``
     - valued define (int)
     - 2
     - I-cache associativity override (parametric true LRU).
   * - ``VEN_DC_SETS``
     - valued define (int)
     - 128
     - D-cache timing-model set-count override.
   * - ``VEN_DC_WAYS``
     - valued define (int)
     - 2
     - D-cache timing-model associativity override.
   * - **Integer core and build plumbing**
     -
     -
     -
   * - ``VEN_IDIV_ITER``
     - boolean define
     - off — combinational divide
     - Multi-cycle restoring DIV/IDIV engine replacing the native ``/``/``%``
       cone.
   * - ``SYNTHESIS``
     - boolean define
     - unset in simulation
     - Strips sim-only ``$fatal`` elaboration guards and bound SVA.
   * - ``VTM_NO_DPI``
     - boolean define
     - unset in simulation
     - Elides the four DPI-C retire-observation imports and call sites.
   * - ``VTM_HAVE_DPI_HEADER``
     - C++ macro (self-defined)
     - auto-detected
     - ``dpi_retire.cpp`` header-detection plumbing; never user-set.
   * - ``M7_PROXY_DEBUG``
     - boolean define
     - off
     - ``$display`` taps on the syscall-proxy / sreg-load / indirect-call
       paths.
   * - ``F2_DBG``
     - boolean define
     - off
     - ``$display`` tap in the F2XM1 engine's pack stage.
   * - **Memory system and SoC platform**
     -
     -
     -
   * - ``VEN_L1_AXI``
     - boolean define
     - off — BFM memory port
     - Real L1 data cache plus AXI4 master to PS-DDR (bus mode 2).
   * - ``VEN_L1_SETS``
     - valued define (int)
     - 128
     - Umbrella set-count knob for both in-core L1s at once.
   * - ``VEN_L1_WAYS``
     - valued define (int)
     - 2
     - Umbrella associativity knob for both in-core L1s at once.
   * - ``VEN_AXI_CDC``
     - boolean define
     - off — single-clock bypass
     - Dual-clock async-FIFO bridge inside the L1/AXI subsystem.
   * - ``VEN_KV260_SOC``
     - boolean define
     - off
     - PS-facing seams: ``retire_count``, F3 interrupt injection, DDR-carveout
       address remap.
   * - ``VEN_PS_PROXY``
     - boolean define
     - off — same-clock proxy
     - Stalling ``S_SYSCALL_WAIT`` handshake for the future PS syscall bridge.
   * - ``VEN_PBLOCK``
     - environment variable (Tcl)
     - unset
     - Vivado soft-Pblock floorplan experiment; **not a Verilog define**.
   * - **Peripheral RTL/PS split**
     -
     -
     -
   * - ``VEN_UART_PS``
     - boolean define
     - off — device in RTL
     - COM1 16550 UART served by the PS C model.
   * - ``VEN_VGA_PS``
     - boolean define
     - off — device in RTL
     - VGA register file served by the PS C model (framebuffer goes dormant).
   * - ``VEN_RTC_PS``
     - boolean define
     - off — device in RTL
     - MC146818 RTC/CMOS served by the PS C model.
   * - ``VEN_PIT_PS``
     - boolean define
     - off — keep off
     - Latent hook only; the 8254 PIT is pinned in PL (IRQ0 timebase).
   * - ``VEN_PIC_PS``
     - boolean define
     - off — keep off
     - Latent hook only; the 8259 PIC is pinned in PL (INTR/INTA latency).
   * - ``VEN_I8042_PS``
     - boolean define
     - off — device in RTL
     - 8042 keyboard controller served by the PS C model.
   * - ``VEN_FDC_PS``
     - boolean define
     - off — device in RTL
     - 82077/8272A floppy controller served by the PS C model.
   * - ``VEN_ACPIPM_PS``
     - boolean define
     - off — device in RTL
     - ACPI PM timer served by the PS C model.
   * - ``VEN_PORT92_PS``
     - boolean define
     - off — keep off
     - Latent hook only; port-92 fast-A20 is pinned in PL (gates the address
       bus).
   * - **Peripheral tunables and debug instrumentation**
     -
     -
     -
   * - ``VEN_IDE_DISK_HEX``
     - valued define (path)
     - undefined — no disk load
     - Hex disk-image path ``$readmemh``-loaded into ``ven_ide``'s backing
       store.
   * - ``VEN_IDE_DISK_SECTORS``
     - valued define (int)
     - 128
     - Disk size in sectors; sizes the array, the LBA28 bound and IDENTIFY.
   * - ``VEN_IDE_CYLS``
     - valued define (int)
     - 2
     - IDE logical cylinder count (IDENTIFY geometry).
   * - ``VEN_IDE_HEADS``
     - valued define (int)
     - 16
     - IDE logical head count (IDENTIFY + CHS→LBA translation).
   * - ``VEN_IDE_SECS``
     - valued define (int)
     - 63
     - IDE sectors per track (IDENTIFY + CHS→LBA translation).
   * - ``VEN_IDE_TRACE``
     - boolean define
     - off
     - Five ``$display`` trace points on the IDE register/data paths.
   * - ``VEN_RTC_EXTMEM_KB``
     - valued define (int)
     - 0 — legacy CMOS
     - Seeds the CMOS extended-memory registers SeaBIOS sizes RAM from.
   * - ``VEN_PIT_TICK_DIV``
     - valued define (int)
     - 1024
     - PIT tick prescaler — simulation IRQ0 cadence only.
   * - ``VEN_DBG_WD``
     - boolean define
     - off
     - One-shot 150 000-cycle no-retire watchdog ``$display``.
   * - ``VEN_DBG_SSBASE``
     - boolean define
     - off
     - Real-mode segment-base / high-ESP forensic probes.

x87 FPU implementation
======================

This family covers the flags around the x87 FPU: the radix-4 SRT FDIV/FSQRT
datapath, the FBSTP/FBLD packed-BCD converters, the FP-execute commit
pipeline, the in-core transcendental engines, and two P5 cycle-model "gap"
features (FDIV/integer overlap, free FXCH). The default build uses
single-cycle combinational x87 functions (``rtl/fpu/fpu_x87_pkg.sv``,
``rtl/core/ventium_x87_pkg.sv``) that are bit-exact against QEMU softfloat at
every retire — and note that the genuine SRT divider with the *correct* PLA is
now the ``fx_div`` **default** (still bit-exact); ``VEN_DIV_EXACT`` opts back
to the behavioural wide divide.

The family splits cleanly in two. The implementation/Fmax options
(``VEN_SRT_ITER``, ``VEN_BCD_ITER``, ``VEN_FP_PIPE``, ``VEN_FP_PIPE2``,
``VEN_BCD_DIV100``, ``VEN_DIV_EXACT``) must stay bit-exact and band-safe; the
first four are what the KV260 FPGA builds set, while ``VEN_BCD_DIV100`` and
``VEN_DIV_EXACT`` are manual/sim-side options in the same class. The architectural opt-ins change visible
behaviour and carry their own oracles and gates: ``VEN_SRT_FDIV_BUG`` (the
authentic Pentium FDIV flaw), ``VEN_TRANSCENDENTAL`` (eight new instructions),
and ``VEN_FP_OVERLAP``/``VEN_FXCH_FREE`` (cycle-trace changes graded against
``fpovl=1``/``fxchfree=1`` variants of the ``p5trace`` oracle). Two names in
this family are *not* live ``ifdef``\ s at all: ``VEN_SRT_DIV`` is a legacy flag
whose semantics became the default in commit ``3791e46``, and
``VEN_TRSC_SILICON`` is a reserved, intentionally inert name.

``VEN_FP_PIPE``
---------------

Enables the 1-deep FP-execute commit pipeline that splits the
``eip → icache → decode → f_eval → fpr`` critical cone across two clocks.
``fx_add`` alone is roughly 14 ns of logic, so a single-cycle FP execute can
never clear 66 MHz (problem P0-5 in ``fpga/TIMING_PROBLEMS.md``). Without the
flag, ``f_eval`` is computed *and* committed in the same clock as issue (fast
arm) or as ``S_FEXEC`` dispatch (slow arm); the ``fpp_*`` registers are absent
and the ``we_wabs`` port is tied 0.

On the **fast arm** (cycle-mode ``S_PIPE`` arithmetic), issue captures the
operands, aluop, rounding control and *absolute* destination into ``fpp_*``
pipeline registers; the result is computed the next clock from the registered
operands and written through a new absolute-indexed
``we_wabs``/``we_wabs_fstat`` port on ``rtl/fpu/fpu_top.sv``. A one-clock
read-after-write hazard (``fp_pipe_rd_haz``) stalls a same-clock role-0/1
reader of the in-flight target. On the **slow arm** (``S_FEXEC``
memory-operand arithmetic), the ``s_arf`` cone is dropped and a new FSM state
``S_FEXEC_EX`` evaluates from the registered ``fpp_*`` and commits via
``we_wabs`` *in the same clock as the retire* (``rtl/core/core_fp_exec.svh``).
That same-clock pairing is mandatory: the functional differential harness
checks architectural state at retire, so a plain one-clock defer would read
stale state — the documented functional-mode-retire-check trap. The FP
scoreboard already publishes results at ``issue + latency`` (``fadd`` = 3), so
the N+1 commit lands before any role ≥ 2 consumer reads it and the M5 FP
cycle bands hold by construction. The flag also cut area: LUT utilisation
91.7 % → 82.85 %.

It gates the ``S_FEXEC_EX`` state, the ``fpp_*`` capture/advance registers,
the ``we_wabs`` port group, the suppression of the same-cycle
``fp_we_top``/``fp_we_fstat`` writes, and the issue-side commit bubble — all
in ``rtl/core/core.sv``, ``rtl/core/core_fp_exec.svh``,
``rtl/core/core_fastpath.svh`` and ``rtl/fpu/fpu_top.sv``.

Set by every KV260 SoC bitstream config (``fpga/scripts/impl_kv260_soc.tcl``),
the half-cache and path-probe scripts (``fpga/scripts/apr_hc_pnr.tcl``,
``fpga/scripts/synth_paths_retime.tcl``,
``fpga/scripts/synth_paths_icbram.tcl``), and the A/B cycle gate's base define
set (``verif/fppipe/run-fp-pipe2-ab.sh``). It composes with ``VEN_SRT_ITER``:
with it, the deferred commit uses the split
``f_eval_s2(f_eval_s1(...))`` sharing a single ``round_pack`` cone (−999 LUT),
safe only because the iterative engine owns divide; without it the full
``f_eval`` is kept, because divides do defer through this port. It is the
parent of ``VEN_FP_PIPE2`` (whose ``ifdef`` is nested inside it), and the
``VEN_FXCH_FREE`` fold only absorbs an FXCH after a *push* because the
arithmetic path defers under this flag.

One honest caveat: ``fp_pipe_rd_haz`` protects role-0/1 readers
(FXCH/FLDSTI/slow FST/FCOM), but the icache LRU-touch driver does not
replicate the bubble; the code argues this is band-safe because those readers
never appear in the graded throughput kernels. ``VEN_FP_PIPE`` is therefore
*band-preserving*, not proven byte-cycle-identical for arbitrary role-0/1
sequences. Validation: the default ``make verify`` 75/75 is unaffected; with
the flag, ``make m3`` 75/75, ``make verify`` 75/75, and ``mb_faddchain`` CPI
2.989 / ``mb_fpindep`` 1.152 — both in band.

``VEN_FP_PIPE2``
----------------

The second pipeline stage for the FP commit — the flag behind the K26
half-cache 60 MHz milestone (see :doc:`/microarch/fp-commit-pipeline`). With
``VEN_FP_PIPE`` alone, the deferred commit runs the *whole* ``f_eval``
(``s1`` front → ``s2`` round) in the single commit clock — an ~80-level /
~43-CARRY8 cone from ``fpp_*`` through ``fx_round_pack`` to ``fpr`` that caps
routed Fmax around 52 MHz. ``VEN_FP_PIPE2`` inserts exactly one register
(``fpp2_s1``, of type ``fx_pipe_t``) at the ``f_eval_s1``/``f_eval_s2``
boundary: the captured op's front half (unpack/align/add-or-mul) runs in clock
N+1 and only the short round-pack plus the ``we_wabs`` write runs in N+2. The
result lands in ``fpr`` at issue+2, still at or before the scoreboard publish
at issue+3, so per-retire architectural state and the FP cycle bands are
*identical* to the 1-stage build with no oracle change. The read-after-write
hazard window widens to span both in-flight slots.

Measured on the half-cache OOC build: routed 63.0 MHz (then 65.3 MHz with
``VEN_BCD_DIV100``); synthesis sees the FADD cone leave the critical path
(59.4 → 78.4 MHz synth), the worst path becomes the iterative BCD engine
(``u_bcd``), and only with ``VEN_BCD_DIV100`` added does the binder move to
the ``fpp → fpp2_s1`` stage-1 front (``docs/fpga-synthesis.md``). It is proven
**cycle-safe by a dedicated A/B gate** (``verif/fppipe/run-fp-pipe2-ab.sh``,
``make verify-fppipe2``) that builds two testbenches differing only by this
define and requires the A and B ``--cycle`` traces to be byte-identical on
both kernels (``mb_faddchain`` and ``mb_fpindep``), plus a separate band
check of B against the ``p5trace`` golden (12 % tolerance) on each.

Set by ``fpga/scripts/impl_kv260_soc.tcl`` (production define list), by
``fpga/scripts/apr_hc_pnr.tcl`` when the ``FP_PIPE2=1`` environment knob is
given (output tagged ``_fp2``), and as TB "B" of the A/B gate. It **requires
``VEN_FP_PIPE``** (nested ``ifdef`` — silently inert without it) **and
``VEN_SRT_ITER``**: the split eval's divide arm returns an early zero, which
is only unreachable because the iterative engine owns divide. There is no
compile-time error if ``VEN_SRT_ITER`` is missing — such a build would
silently mis-divide deferred FDIVs; the guard is by convention. Note also that
the A/B gate proves equivalence to the 1-stage build, not directly to the
default — it deliberately inherits ``VEN_FP_PIPE``'s band/oracle validation.

``VEN_FP_OVERLAP``
------------------

Gap 1 of the P5 cycle-model gaps identified from the RTL Engineering
transcripts and shared with the ``p5trace.c`` oracle: the real P5 overlaps
independent integer work with a long FDIV, the default model does not. Without
the flag there is one in-order ``pipe_free_at = issue + occ`` — an FDIV
(occupancy 39) blocks even independent *integer* ops behind it for the full
39 clocks, matching the default oracle.

With the flag, a 32-bit ``fp_busy_cyc`` register models when the single x87
execution unit becomes free. An issuing FP arithmetic op holds the integer
pipe only ``P5_FP_ISSUE_OCC = 2`` clocks (retiring at issue+2) while the real
occupancy window moves onto ``fp_busy_cyc`` — so following independent integer
ops issue and retire inside the FDIV shadow, while a following FP op (FXCH
exempt, as a rename) waits until ``fp_busy_cyc``. The wait is mirrored in all
three replicated guards — the fast-path issue arm
(``rtl/core/core_fastpath.svh``), the icache LRU-touch driver and the
``fp_we_*`` commit driver (``rtl/core/core.sv``) — so spine and drivers agree
on the commit clock.

This is an **architectural (cycle-trace) change**: ``mb_fdivint``'s CPI drops
from the serialized ~1.73 to the 1.20–1.47 band. It therefore has its own
differential gate: ``verif/run-m5.sh`` builds a separate testbench into
``obj_dir_fpovl`` with ``+define+VEN_FP_OVERLAP`` and grades only the manifest
``"fpovl": true`` kernels against an ``fpovl=1`` variant of the ``p5trace``
oracle; every other kernel keeps the default golden and testbench
byte/cycle-identical. ``verif/m5_metrics.py`` enforces a two-part band:
(1) CPI below the 1.50 ceiling of the serialized default (overlap present)
and (2) absolute cycles tracking the ``fpovl=1`` golden within tolerance —
the trailing FP producer still paying the unit-busy wait is a property of
the kernel and its golden (the manifest band description), pinned only
indirectly through the absolute-cycle match. No FPGA config or default
verify build sets it. It must never be
combined with the default golden — and note the inversion of fidelity: this
flag is the *more* faithful P5 model; the default build is the one matching
the unmodified oracle, not real silicon.

``VEN_FXCH_FREE``
-----------------

Gap 2: the real P5 FXCH is a free stack rename; by default every FXCH costs
``occ = 1`` (its own commit clock), so a ``fadd``+``fxch`` pair is 2 clocks —
matching the default oracle. With the flag, decode marks ``D9 C8+i`` with
``is_fxch_free = 1`` and ``fp_occ = 0`` (``rtl/core/decode.sv``; the field
exists unconditionally in ``rtl/core/ventium_decode_pkg.sv``). In the cycle
fast path, a free ``fxch %st(i)`` (i ≠ 0) that directly follows a push
(``FK_FLDC`` constant load or ``FK_FLDSTI``) folds into that push's commit
clock for zero added cycles: the architectural effect is computed as one
combined write — ``we_push`` places the old ``st(i-1)`` into the new ST0 slot
and ``we_sti(i-1)`` places the pushed constant/copy into ``st(i)``, using
pre-edge ``ftop`` addressing — while the spine retires the FXCH as the V-pair
member (``retire2_*``) and advances ``eip`` past both instructions
(``rtl/core/core_fastpath.svh``). A lone or post-arithmetic FXCH falls through
to its own ``occ = 1`` commit; only a push absorbs it, because under
``VEN_FP_PIPE`` the arithmetic result is deferred and not yet in ``fpr`` at
the would-be fold clock.

Architectural cycle-trace change with its own gate: ``verif/run-m5.sh`` builds
a separate testbench into ``obj_dir_fxch`` and grades only the manifest
``"fxchfree": true`` kernels (``mb_fxch``) against the ``fxchfree=1`` oracle.
No FPGA config or default build sets it. Gotchas: ``fxch %st(0)`` is
deliberately *not* folded (the ``fp_sti != 0`` guard), and the combined write
relies on pre-edge ``ftop`` addressing (``we_push`` targets ``fpr[ftop-1]``,
``we_sti(i-1)`` targets ``fpr[ftop+i-1]``) — easy to mis-derive when reasoning
about the swap.

``VEN_SRT_ITER``
----------------

Routes normal-operand FDIV/FDIVR and positive-normal FSQRT through
multi-cycle iterative engines — the D8b/P0-1 rework that made the core fit the
KV260 at all (commit ``3791e46``: LUT 518 % → 111 %, DSP 320 → 95). Without
it, FDIV/FSQRT execute combinationally in one ``S_FEXEC`` clock via
``fx_srt_div`` (36 radix-4 steps fully unrolled — roughly 126 256-bit adders
in synthesis) and ``fx_sqrt`` (a 128-step restoring isqrt): fine for
Verilator, hopeless for FPGA timing and area.

The flag adds the FSM state ``S_FP_BUSY`` and instantiates
``rtl/fpu/fpu_srt_div.sv`` (one radix-4 SRT step per clock, ``NSTEP = 36``,
the per-step body lifted verbatim from ``fx_srt_div`` so the committed
``floatx80`` is bit-identical) and ``fpu_sqrt_iter``. Eligibility is computed
in the ``fp_we_*`` driver: finite-nonzero ``a`` and ``b`` for divide (all six
``FX_AR_*`` forms with aluop 6/7, excluding runtime-errata divides),
finite-nonzero non-negative ST0 for sqrt. Eligible ops start the engine from
``S_FEXEC``, the FSM busy-waits, and result plus ``fstat`` commit through the
same write ports on the engine-done clock, retiring that clock. Because every
operand that would reach the deep loops is engine-routed, both combinational
loops are **stubbed out of synthesis** under the flag (``fx_srt_div`` returns
a zero-significand placeholder for finite-nonzero operands; ``fx_sqrt``'s
finite-nonzero arm likewise stubs out the isqrt loop, returning the operand
unchanged) — the zero/Inf/NaN guards remain live. ``rtl/core/decode.sv``
additionally drops ``fdiv``/``fdivr`` from the cycle fast-path FP
classification so they always take the slow FSM.

Functionally bit-exact: ``make m3`` with the flag passes 74/74 including
``tx_sqrt``, and the standalone ``verify-srt-iter`` gates assert bit-exactness
against the golden for *both* PLAs. Execution latency becomes the engine step
count — an implementation artifact, **not** the P5 occ-39; the FP scoreboard
deliberately keeps the fixed oracle latencies, and the functional harness
needs a larger quiesce window (``verif/run-m3.sh``'s ``QUIESCE`` env knob
defaults to 64 — pass ``QUIESCE=512`` for the ``+VEN_SRT_ITER`` build, the
in-script comment's recommended value; an FSQRT idles ~66 clocks).

Set by every FPGA define list and synthesis probe, by the A/B gate's base
defines, and documented as the canonical ``VL_EXTRA_DEFINES`` example in
``verif/tb/Makefile``. It is the by-convention prerequisite for
``VEN_FP_PIPE``'s split-eval optimisation and for ``VEN_FP_PIPE2``; the
engine's buggy-PLA input mirrors ``VEN_SRT_FDIV_BUG`` (``ENG_DIV_BUGGY``), so
the FDIV flaw survives the iterative rewrite.

Two gotchas. First, bit-exact is not timing-neutral: the quiesce window must
be raised and cycle-mode fdiv bands do not apply. Second, the errata
interplay: a runtime ``ERR_FDIV``-enabled divide is excluded from the engine
route and falls back to ``fx_div_errata → fx_div → fx_srt_div`` — whose
finite-nonzero path is *stubbed* under this flag — so the M6 Erratum-23
runtime mode appears broken under ``+VEN_SRT_ITER`` (even the canonical
published vector takes sign/exponent from the stubbed "clean" value). The M6
errata gates run on the default build; that combination is unverified
territory.

``VEN_SRT_DIV``
---------------

**Legacy, vestigial — defining it today changes nothing**; there is no live
``ifdef VEN_SRT_DIV`` anywhere in the RTL. At M8.5 (commit ``9e6e329``) the
``fx_div`` dispatcher read ``ifdef VEN_SRT_DIV → fx_srt_div`` (with nested
``VEN_SRT_FDIV_BUG`` selecting the buggy PLA), ``else → fx_div_exact``.
Commit ``3791e46`` inverted the polarity: the genuine SRT datapath with the
correct PLA became the unconditional hardware default ("bit-exact vs
``fx_div_exact``/QEMU"), and the opt-*out* became ``VEN_DIV_EXACT``.

The name now survives only in comments and documentation — the feature header
in ``rtl/fpu/fpu_x87_pkg.sv``, a ``Makefile`` note, and *stale* docs that
still describe the old default: ``verif/srt/README.md`` ("with no defines the
divider is ``fx_div_exact``") and ``docs/sphinx/microarch/srt-divider.rst``
both contradict the current code, while ``fpga/TARGETS.md`` has the current
truth. Nothing in the build system sets it; both PLAs are graded by
``verify-srt`` against ``tools/srt/srt_model.py`` with the testbench driving
the ``buggy`` input directly, no define needed. Anyone auditing "which divider
is default" from the README will get the wrong answer.

``VEN_SRT_FDIV_BUG``
--------------------

Selects the **buggy quotient-selection PLA** — the authentic Pentium FDIV
flaw, reproduced from first principles with no operand special-cased.
``fx_srt_pla`` in ``rtl/fpu/fpu_x87_pkg.sv`` models the five PLA cells that
should hold +2 but were never programmed (Edelman: ``8*P_Bad`` in
{23, 27, 31, 35, 39} for divisor columns ``d4`` in {1, 4, 7, 10, 13}); with
``buggy = 1`` those cells return digit 0. The define flips two sites: the
combinational dispatcher ``fx_div`` passes ``buggy = 1'b1``, and — under
``VEN_SRT_ITER`` — the iterative engine's ``ENG_DIV_BUGGY = 1'b1``
(``rtl/core/core.sv``), so the flaw survives the multi-cycle rewrite.

Validated outcomes: ``4195835/3145727`` flaws to
``0x3FFF_AAB7F6392A768638`` (the documented double ``0x3FF556FEC7254ED1``,
wrong at the 13th significant bit, the bug hit at iteration 8);
``5505001/294911`` also flaws; the negative controls ``7654321/3145727`` and
``4195835/3.0`` stay clean. This is an architectural change — deliberately
*not* bit-exact vs QEMU — so it can never join the default verify track. Its
oracle is the single-source Python golden ``tools/srt/srt_model.py``; the
``verify-srt``/``verify-srt-iter`` gates grade both PLAs by driving the
``buggy`` input directly, so they pass without the define being set. It is
distinct from the M6 runtime Erratum-23 model (``fx_div_errata``,
``errata_en`` bit 0), which only reproduces the one published vector and stays
in the default build.

No committed config sets it — purely a user opt-in. Gotchas: QEMU computes
the *correct* quotient, so any differential gate against QEMU will (correctly)
fail under this flag; ``VEN_DIV_EXACT`` is checked first in ``fx_div``, so
combining the two silently yields the exact divider with no flaw; and a
triggering divisor alone is not sufficient to flaw (``7654321/3145727`` is
clean), so spot-checking with one operand pair gives false confidence either
way.

``VEN_DIV_EXACT``
-----------------

The opt-**out** escape hatch: routes ``fx_div`` to ``fx_div_exact``, a plain
behavioural wide-integer divide ("kept as a fast-sim escape hatch; NOT the
hardware path"). Because both ``fx_div_exact`` and the correct-PLA
``fx_srt_div`` are correctly-rounded ``floatx80`` division — validated equal
over an 8 000-divide random corpus — the flag is architecturally neutral;
results stay bit-exact vs QEMU. It exists purely to make pure-Verilator
simulation of the combinational divider cheaper (the unrolled SRT function is
a 36-step loop per call in sim).

It is checked *first* in the ``fx_div`` dispatcher, so it takes silent
precedence over ``VEN_SRT_FDIV_BUG``. Under ``VEN_SRT_ITER`` it is largely
moot for committed results — normal divides are engine-routed before
``fx_div`` is consulted — but the non-engine fallback paths (specials, the
errata path) still call ``fx_div``, which makes ``VEN_DIV_EXACT`` the actual
workaround if runtime ``ERR_FDIV`` semantics were ever needed under
``VEN_SRT_ITER``. Nothing in committed configs sets it (documented in
``fpga/TARGETS.md``). Despite the name it is not "more exact" than the
default — both are correctly rounded — and the stale ``verif/srt/README.md``
still calls ``fx_div_exact`` the default; the code comment in
``rtl/fpu/fpu_x87_pkg.sv`` is authoritative.

``VEN_BCD_ITER``
----------------

Replaces both rare-but-deep packed-BCD conversion cones with multi-cycle
iterative engines. Without it, FBSTP converts FP → packed BCD via the
combinational ``fx_fx_to_bcd`` (18 *chained* divide-by-10 stages, an
~182-deep CARRY8 cone — historically the whole core's worst timing path)
inside the ``fstore_val`` mux, and FBLD pushes ``fx_bcd_to_fx`` (18 chained
multiply-by-10, ~189 levels — the worst pure-logic path once FP arithmetic was
pipelined) directly in ``S_FEXEC``.

The flag adds two FSM states. **FBSTP**: ``S_FEXEC`` starts
``rtl/fpu/ven_bcd.sv``, which performs the FP → int64 conversion once in its
idle state (including the 10^18/int64 overflow check producing the packed-BCD
indefinite and IE), then extracts the 18 BCD digits at two divide-by-10 steps
per clock (~9 clocks); the FSM waits in ``S_BCD_BUSY``, latches
``{ie, pe, bcd}`` on done, feeds it to ``fstore_val``, and the IE/PE ``fstat``
sticky is *deferred* to the engine-done clock, because the flags are not known
at dispatch. **FBLD**: ``S_FEXEC`` defers the push and starts
``rtl/fpu/ven_bcd_to_fp.sv``, which accumulates digits MSD-first at two
multiply-by-10 steps per clock and converts int64 → ``floatx80`` once at the
end; the push of the engine result is driven on the done clock, the *same*
clock ``S_FBLD_BUSY`` retires, so per-retire architectural state is exact.

Both engines are bit-exact to their combinational originals (``make
verify-bcd`` / ``make verify-fbld``; ``make m3`` with the flag is 74/74
including ``tx_bcd_st``/``tx_bcd_ld``). A pure implementation/Fmax-area
option — no architectural change, no dedicated differential oracle needed
beyond the standalone bit-exact gates plus the m3 run. Set by every KV260
config, the half-cache and probe scripts, and the A/B gate base defines. It is
the parent of ``VEN_BCD_DIV100``, and the FBLD engine was motivated by
``VEN_FP_PIPE`` (which made ``fx_bcd_to_fx`` the worst logic path). Gotcha:
FBSTP's IE/PE flags *must* be deferred — the generic store sticky-flag arm is
explicitly skipped for ``FX_FBSTP``; forgetting that pairing is the classic
way to double-set or zero the flags.

``VEN_BCD_DIV100``
------------------

A micro-optimisation *inside* ``ven_bcd``'s per-clock digit-extract step, so
it is meaningless without ``VEN_BCD_ITER``. By default the step extracts two
digits with two chained divide-by-10 operations (~54 levels / ~41 CARRY8 in
series). With the flag it computes a single divide-by-100 (the only wide
reciprocal-multiply on the ``q`` feedback) and derives both low digits from
the 7-bit remainder, halving the per-clock carry cone. It is bit-exact *and*
cycle-neutral — the two-digits-per-clock cadence, state machine and outputs
are unchanged; ``make verify-bcd`` (40 k vectors) passes either way and the
default ``make verify`` stayed 77/77 (commit ``a26045d``).

A pure Fmax option with a narrow applicability window: it only helps when
``u_bcd`` is the worst path, which is true in the OOC
half-cache + ``VEN_FP_PIPE2`` configuration (synthesis 78.4 → 80.6 MHz,
routed 63.0 → 65.3 MHz on the XCK26) — in the full SoC the eip/TLB front-end
cluster binds long before ``u_bcd``, so the committed configurations leave it
**off** there. No committed
script or config sets it; it was applied manually (an extra define on the
``apr_hc_pnr.tcl`` define list) during the half-cache experiments. The
default is deliberate: even though strictly bit-exact and
cycle-neutral, it only pays when ``u_bcd`` is the binding path — which the
full SoC is not.

``VEN_TRANSCENDENTAL``
----------------------

The architectural opt-in for the eight x87 transcendental instructions
(project issue #11). Without it, the ``D9 F0/F1/F2/F3/F9/FB/FE/FF`` opcodes
(F2XM1, FYL2X, FPTAN, FPATAN, FYL2XP1, FSINCOS, FSIN, FCOS) stay
``d_unknown`` in decode and the core **HALTs** on them, keeping ``make
verify`` 77/77 byte-identical.

With the flag, decode arms appear, ``S_FEXEC`` routes the eight fxops to a new
busy-wait state ``S_TRSC_BUSY``, and four engine instances are created:
``rtl/fpu/fpu_f2xm1.sv`` (a verbatim transcription of qemu-8.2.2's
``helper_f2xm1`` and its softfloat routines, with a 65-entry ROM),
``fpu_fpatan``, ``fpu_fyl2x`` (a mode input serves both FYL2X and FYL2XP1)
and ``fpu_fsincos`` (one engine computes sin *and* cos, serving
FSIN/FCOS/FSINCOS/FPTAN; octant reduction via a 3-part Cody-Waite π/2 plus
Taylor). Commit happens on the engine-done clock, which is the retire clock:
F2XM1 overwrites ST0 in place; FPATAN/FYL2X/FYL2XP1 write ST1 then pop;
FSINCOS does ``we_top(sin)`` plus ``we_push(cos)``; FPTAN commits
``tan = fx_div(sin, cos)`` and pushes +1.0; out-of-range trig (\|x\| ≥ 2^63)
sets C2 and leaves ST0 unchanged.

The oracle is split (``verif/trsc/README.md``). **Group B**
(F2XM1/FPATAN/FYL2X/FYL2XP1) is bit-exact vs ``qemu-i386``, whose softfloat is
deterministic there; the executable reference is ``tools/p5xtrans/qref.c``,
itself validated bit-for-bit against the pinned QEMU. **Group A**
(FSIN/FCOS/FSINCOS/FPTAN) deliberately *diverges* from QEMU (which calls the
host glibc at double precision) and is instead bit-exact vs the
silicon-accuracy shared-polynomial model (~1.8 ulp vs ``__float128``; the
core-vs-qemu spread of ~1000 ulp is reported, not gated). The flag therefore
has its own gate suite: standalone engine gates against ``qref`` in all four
rounding modes, plus in-core gates that build the cosim testbench with the
define into ``obj_dir_trsc`` and run the ``tx_*`` differential programs. FP
scoreboard timing is unchanged (engine done-latency is an implementation
artifact per ``docs/m11-transcendental-spec.md``). It is **not** in any FPGA
config.

Interactions: the engines reuse ``fx_mul``/``fx_add`` from ``fpu_x87_pkg``,
and the FPTAN commit calls ``fx_div`` — so the division flags
(``VEN_DIV_EXACT``/``VEN_SRT_FDIV_BUG``) influence FPTAN's tangent. Under
``VEN_SRT_ITER`` that ``fx_div`` call sits *outside* the engine-routed path,
i.e. on the combinational ``fx_div`` whose finite-nonzero SRT body is stubbed
— verify before combining the two flags for FPTAN; the committed trsc gates
run without ``VEN_SRT_ITER``. Gotchas: a naive QEMU diff on FSIN/FCOS fails
by design (the checker compares against the model instead), and the
default-build behaviour for these opcodes is HALT, not a #UD trap — real-world
code using transcendentals requires the flag.

``VEN_TRSC_SILICON``
--------------------

Reserved and currently **unimplemented**: there is no
``ifdef VEN_TRSC_SILICON`` anywhere in the RTL or build system — the name
exists only in comments (``rtl/fpu/fpu_f2xm1.sv``, ``tools/p5xtrans/qref.c``);
the M11 spec documents the ``SILICON`` parameter seam and the design
conclusion without ever naming the macro. The seam it would drive does exist: every transcendental
engine declares ``parameter bit SILICON = 1'b0`` (``fpu_f2xm1``,
``fpu_fpatan``, ``fpu_fyl2x``, ``fpu_fsincos``), and the instances in
``rtl/core/core.sv`` never override it.

The inertness is a design conclusion, not an omission.
``docs/m11-transcendental-spec.md`` records that (a) bit-exact-to-silicon is
unachievable from public data (undumped transcendental microcode, unpublished
Remez coefficients), and (b) for Group B, QEMU's softfloat algorithm *is
itself* the accuracy-faithful silicon one (80-bit table + Horner +
reconstruct, ~0.5 ulp) — a single datapath serves both the bit-exact-vs-qemu
gate and silicon fidelity, so the parameter is reserved/inert. For Group A the
shipped shared-polynomial model already is the silicon-accuracy target.
``tools/p5xtrans/p5xtrans.c`` remains the independent silicon-accuracy oracle
(agrees with ``qref`` to <1 ulp). If the flag were ever implemented it would
be an architectural-mode change requiring its own silicon-model oracle, since
the QEMU differential gates for Group B would no longer apply. Do not assume
it does anything because comments reference it.

Front-end and L1 caches
=======================

This family covers the fetch/decode front end — the BTB, the icache, the
predecoded micro-op cache, the page-keyed fetch micro-TLB — and the parametric
L1 geometry knobs. Per project convention every flag is opt-in: the default
build is the byte/cycle-identical golden-matched reference, and each flag is
one of (a) a bit-exact Fmax/area implementation option whose cycle behaviour
is identical or stays inside the loose M4/M5 bands (``VEN_BTB_PIPE``,
``VEN_IC_NARROWB``, ``VEN_IC_BRAM``); (b) an architecturally bit-exact but
explicitly non-cycle-faithful demonstrator (``VEN_FE_PIPE``,
``VEN_CACHE_HALF``) plus the common-case-correct, not-yet-verify-hardened
``VEN_UOPCACHE``; or
(c) a numeric geometry parameter for cache sweeps (``VEN_IC_SETS``,
``VEN_IC_WAYS``, ``VEN_DC_SETS``, ``VEN_DC_WAYS``, plus the umbrella
``VEN_L1_SETS``/``VEN_L1_WAYS`` pair described with the SoC platform family).

The deployable KV260 SoC bitstream sets ``VEN_BTB_PIPE`` + ``VEN_IC_BRAM`` +
``VEN_UOPCACHE`` + ``VEN_CACHE_HALF`` + ``VEN_FE_PIPE`` (plus the FP and
iterative-engine flags), via ``fpga/scripts/impl_kv260_soc.tcl``; simulation
sweeps set the geometry knobs through ``verif/tb/Makefile``'s
``VL_EXTRA_DEFINES`` passthrough into separate object directories. The
narrative arc documented in ``fpga/TIMING_PROBLEMS.md`` and
``docs/fpga-synthesis.md``: the single-cycle x86 byte-window decode's MUXF
density is the congestion wall, the micro-op cache is the only lever that ever
reduced it (first OOC route, 51.7 MHz), half-cache relieved the tail, and
``VEN_FE_PIPE`` made the full SoC route at all.

``VEN_BTB_PIPE``
----------------

Registers the BTB resolve inputs (``resolve_valid``/``resolve_pc``/
``resolve_taken``) one clock, so the 2-bit-counter update's clock enable comes
from a flop instead of the combinational ``issue_arm`` net
(``rtl/core/bpred_btb.sv``). This lifts the ~13-level BTB tail off the
~63-level ``eip → icache → decode → issue_arm → btb_ctr`` critical path. The
predict ports always read pre-update state (read-before-write off the
registered arrays), so applying the update one clock later only shifts *when*
the predictor warms — it never changes which instructions retire, only (in
principle) prediction timing.

A three-agent investigation established that BRAM was the wrong fix for this
path (the BTB is 64 deep, won't infer BRAM, and is only 21 % of the path); the
registered update is the cycle-safe lever, and it measured **cycle-neutral** —
``mb_brloop``/``mb_brrandom`` absolute cycle counts identical to baseline —
while taking OOC synthesis from 58 to 59.5 MHz (P0-6, commit ``ab3001e``).
Purely an implementation/Fmax option, included in essentially every FPGA
config (the deployable SoC, both ``apr_run.tcl`` configurations,
``apr_hc_pnr.tcl``, ``bd_kv260_soc.tcl``, the path/probe scripts — which mark
it as a removable define) and in the A/B sim builds of the FP_PIPE2 gate, so
the cycle-equivalence gate implicitly exercises it in simulation.

One nuance: the in-module comment claims the delayed update is absorbed by the
loose cycle bands, but P0-6 measured it stronger — absolute cycles identical
on the branch kernels. Strictly, it *can* change a prediction by one resolve,
so treat it as bit-exact-architectural / band-validated-cycle rather than
formally cycle-identical.

``VEN_UOPCACHE``
----------------

The P0-11 predecode-on-fill micro-op cache — the textbook P6/Sandy-Bridge fix
for the x86 byte-window MUXF congestion wall. Without it, the fast path
gathers twelve bytes via the ``ub[]``/``vb[]`` 32:1 byte selects out of the
two icache line buffers and decodes U and the V candidate live each clock with
two ``decode`` instances.

With the flag, ``rtl/mem/uopcache.sv`` is enabled: a 3-state walker FSM
re-runs the *same* pure ``decode`` leaf over a freshly filled 32-byte line
after the multi-cycle ``S_PF`` fill completes (the spine never stalls on the
walker — its ``pd_busy`` output is unconnected) — decoding the fixed bottom
6 bytes of a
``residual`` register and barrel-shifting right by the decoded length, so no
``flin``-indexed wide byte mux ever exists — and stores ``NSLOT = 8``
fixed-width ``fpd_t`` slots plus a 32-entry byte→slot boundary map per
(set, way) in registered-read (``ram_style = "block"``) BRAM stores replicated
A/B. In ``rtl/core/core.sv`` the spine accumulates the fill burst into
``pf_line_buf`` and pulses ``pd_start`` the clock after ``fill_done`` (also
invalidating that set/way's predecode-valid); the fast path then *deletes* the
twelve 32:1 byte gathers and the live decoders, reading
``u_d = slotsA[bslot(flin[4:0])]`` and ``v_d`` as literally the next slot
(killing the ``flin + lenU`` V-base serialization too), with ``br_taken`` —
the only flag-dependent ``fpd_t`` field — re-evaluated against live
``eflags``.

Results: the first-ever MUXF reduction (MUXF7 −22 %, MUXF8 −33 %) and the
first configuration to *route* OOC at 15 ns (51.7 MHz routed, where the
narrowb config dies with 42 392 overlaps). Its registered slot-read timing
mirrors ``VEN_IC_BRAM``, and every setter defines the two together: the
deployable SoC, ``apr_run.tcl`` (CONFIG=uopcache), ``apr_hc_pnr.tcl``, and the
uopcache path/probe scripts.

The major caveat: **not verify-hardened**. The architectural-correctness
backstops are unbuilt — ``uop_hit`` and ``uc_v_avail`` are computed but
consumed *nowhere*, and the planned ``S_PIPE !uop_ready`` stall arm and
branch-into-the-middle re-predecode are explicitly still unbuilt
(``fpga/TIMING_PROBLEMS.md``, ``docs/fpga-synthesis.md``). A fetch from a
resident-but-not-yet-predecoded line, or a branch into the middle of a
predecoded boundary, reads stale or wrong slots with no stall. It is a
synthesis/APR Fmax demonstrator (common-case correct), not a
differential-gate-verified configuration — yet it *is* in the deployable
KV260 bitstream. The default/narrowb build is the verified production
reference. It also conflicts with any I-cache wider than 2 ways: the uopcache way
ports are 1 bit and the store is sized ``IC_SETS*2``, so
``VEN_IC_WAYS``/``VEN_L1_WAYS`` greater than 2 with ``VEN_UOPCACHE`` would
silently alias ways (1-way is safe but wastes half the slot store).

``VEN_UOPCACHE_CHECK``
----------------------

A sim-only structural-equivalence scaffold layered inside the
``VEN_UOPCACHE`` branch: it re-instantiates the live ``ub[]``/``vb[]``
byte-window gather plus two reference decoders (producing
``u_d_ref``/``v_d_ref`` from the actual icache bytes) alongside the slot-read
``u_d``/``v_d``, with the stated intent of asserting that the slot read equals
the live gather decode for every issued ``flin``. The ``ifdef`` block lives
solely in ``rtl/core/core.sv``; ``rtl/mem/uopcache.sv`` carries only a header
comment describing the flag, whose "asserts slot read == live decode" wording
is aspirational. Because it resurrects the 12 × 32:1 gather it must
never appear in a synthesis build.

As of the current tree the define only builds the parallel reference path:
**no assert or comparison statement consumes** ``u_d_ref``/``v_d_ref``
anywhere (true since the introducing commit ``4cdf7eb``). The "equivalence
gate" is scaffolding for manual waveform comparison or future hardening, not
an automated checker; any claim that the uopcache is machine-checked
equivalent via this flag is unsupported. No build script sets it — manual
``VL_EXTRA_DEFINES`` use only, and it is a no-op without ``VEN_UOPCACHE``
(the block is nested). Closing the gap — a real per-issue assert plus the
``uop_ready`` stall — is the documented verify-hardening work.

``VEN_FE_PIPE``
---------------

A page-keyed micro-TLB register that breaks the architectural ``eip``
self-loop (``eip`` → I-TLB hit carry chain → ``ic_present`` → uop slot read →
next ``eip``) — the cone that made the full-SoC KV260 build unroutable at
~41 MHz. Without the flag, ``xlate_miss`` and ``mem_xlate`` read the live
combinational I-TLB/D-TLB lookups (the 16-entry vpn-compare carry chain) on
every access; ``fe_xlate_pend`` is tied 0 so every added guard constant-folds
away and the default build is byte/cycle-identical.

The flag adds one registered translation per side —
``fe_itlb_{page,phys,hit,v}`` and ``fe_dtlb_{page,phys,hit,dirty,v}``, keyed
on ``cur_lin[31:12]`` — sampled by a dedicated ``always_ff`` that also
invalidates on ``tlb_flush`` (CR3 write) and, per side, on any page-walk fill
so a fresh walk is always re-registered. When the current access's page is not
the registered page, ``fe_xlate_pend`` asserts for exactly one clock: the FSM
holds in a dedicated stall arm (no state/eip/issue/bus change that edge) and
the bus driver squashes the request so no stale physical address ever leaves
(``rtl/core/core_bus_driver.svh``); the next clock, ``xlate_miss`` and
``mem_xlate`` read the *registered* hit/frame instead of the live carry chain.

Architecturally bit-exact (``make verify`` GREEN with it on); cycle-wise it
adds one clock per 4 KiB page crossing (under ~1 % perturbation, zero in real
mode or within a page) — an Fmax-only demonstrator, cycle bands not claimed
(option B of ``docs/fe-pipeline-spec.md``). Payoff: the full SoC goes from
"not legally routed, 8 766 node overlaps" at 60 MHz to a clean route
(~50.4 MHz Fmax); the deployed 40 MHz Linux image includes it. The only
setter is ``fpga/scripts/impl_kv260_soc.tcl`` via the ``FE_PIPE=1``
environment knob.

It only matters when paging is on, and coherence couples it to ``tlb_flush``
and the page-walk fill ports — the invalidate-wins-over-register ordering at
one edge is what keeps a freshly walked or flushed page from being cached
stale. The risk profile is the gotcha: a bug here is *silent* (wrong physical
page = data corruption, not a crash), so the spec requires
architectural-bit-exact verification on the paging-heavy gates before trusting
changes to it. Small mispredict/fill cycle shifts on an FE_PIPE build are
expected, not regressions.

``VEN_IC_BRAM``
---------------

Forces the L1 I-cache line store into Block RAM (RAMB36), dissolving the
distributed-RAM read mux. The store becomes two replicated
``ram_style = "block"`` arrays (one per read port, each a single-read SDP
BRAM), reads become *synchronous* (data valid the clock after the address),
and the per-word fill becomes the canonical UltraScale byte-write-enable
template writing both copies (``rtl/mem/icache.sv``). Without it, the line
store is a single distributed-RAM array with asynchronous reads.

Because data now arrives one clock late, the spine grows a content-addressed
front end (``rtl/core/core.sv``): the read addresses are registered as tags in
lock-step with the data; ``ic_byte`` content-addresses whichever buffer holds
the needed set; ``ic_fetch_ready`` reports whether the decode window's line(s)
are currently buffered (sequential fetch is bubble-free because port B always
prefetches the next line, so a sequential line crossing finds the new line
already buffered); and a new ``S_PIPE`` bubble arm burns one no-issue clock
only on a redirect to an un-buffered line (``rtl/core/core_fastpath.svh``).
A sub-step removes even that: when a predicted-taken branch sits in a
non-straddling window, port B is repurposed to prefetch the predicted *target*
line (``pf_redir``/``pf_redir_tgt``) — literally the real P5's BTB-driven
prefetch.

Fully verified: functional 75/75 bit-exact, all 20 cycle bands pass
(``mb_accimm`` +20 % before the target prefetch, +0.39 % after; ``mb_imiss``
+3.97 %, inside the 10 % band), and a Quake 300 k-instruction lockstep run is
EQUIVALENT (``verif/m7/quake_icbram.log``). Area trade: 3 072 LUT-as-memory →
0, +5 RAMB36 in the full core. It did *not* break the OOC congestion wall by
itself — the byte-window MUXF mass lives in the spine, not the cache (the 99 %
``u_icache`` congestion attribution was a flatten-hierarchy artifact) — but it
became the required base for ``VEN_UOPCACHE`` and is in the deployable SoC
config. Set by the deployable SoC, ``apr_run.tcl`` (uopcache config),
``apr_hc_pnr.tcl``, the icbram/uopcache probes, a standalone OOC icache probe,
and ad-hoc sim builds (no checked-in verif script pins it).

Gotchas: it is band-validated, *not* cycle-identical — redirects to
un-buffered lines and the +1 buffer-fill clock after ``S_PF`` shift cycle
counts, so a strict cycle diff against the default build will differ even
though all bands pass. It supersedes ``VEN_IC_NARROWB`` (which lives in its
``else`` branch). And the replicated A/B stores mean every fill writes both
copies — forgetting that when reasoning about SMC/fill coherence is a trap.

``VEN_IC_NARROWB``
------------------

Static width-pruning of the icache's second (straddle) read port in the
*distributed-RAM* (non-BRAM) front end. ``rd_lineB`` is only ever byte-sliced
at low positions — the worst case is ``flin[4:0] = 31`` plus a 6-byte U length
plus index 5, minus 32, i.e. byte 10 — so the flag drives only the low
128 bits and ties the high 128 to zero, letting Vivado prune half of the
256-wide distributed-RAM read port (``rtl/mem/icache.sv``). Since the pruned
bytes are never read, the fetched bytes are bit-identical: verified ``make
verify`` plus ``make m3`` cycle-identical, ``mb_imiss`` +0.03 % noise. It is
the only flag in this family (besides the pure parameters at default values)
that is both bit-exact *and* cycle-identical, so it needs no gate of its own.

Measured: LUT-as-memory 4 096 → 3 072, MUXF7 16 235 → 14 457 / MUXF8
7 147 → 6 202 (−12 %), OOC synthesis WNS −1.808 → −0.587 ns (59.5 → 64.1 MHz)
— the icache left the synthesis critical path — but *placed* Fmax was
unchanged (47.6 MHz) because the irreducible ``rd_lineA`` 256:1 read and the
spine byte-window MUXF remain. It was the "narrowb production" OOC config
until P0-14 showed it does not actually route at 15 ns (42 392 overlaps),
after which the uopcache config took over as the routable base —
``fpga/scripts/impl_kv260_soc.tcl`` documents that narrowb "hits the
single-cycle byte-window MUXF congestion wall and does NOT route on the small
xck26".

Set by ``synth_paths_narrowb.tcl``, ``apr_run.tcl`` (CONFIG=narrowb), several
floorplan/strategy scripts, and the older ``bd_kv260_soc.tcl`` validate flow
(the deployable impl script switched to the uopcache config). It is mutually
dead with ``VEN_IC_BRAM`` — the narrowb code sits in the BRAM ``ifdef``'s
``else`` branch, so defining both silently behaves as BRAM-only (several
scripts list it harmlessly). The safety argument depends on the 6-byte U plus
6-byte V window geometry; if the decode window ever widens, the static bound
must be re-derived.

``VEN_CACHE_HALF``
------------------

An FPGA area/congestion experiment that halves *both* L1s to 64 sets (4 KB
I-cache + 4 KB D-cache), changing the index/tag geometry from idx 7/tag 20 to
idx 6/tag 21. It is the third-priority arm of the ``IC_SETS`` and ``DC_SETS``
precedence chains in ``rtl/core/core.sv``. Because the index/tag split
changes, the hit/miss *sequence* differs from the fixed-128-set ``p5trace``
cycle oracle, so it is explicitly **not** a verification or cycle-fidelity
configuration — the M4/M5 bands are only claimed for the default. It stays
functionally bit-exact (a cache is a hit/miss plus a data store; smaller means
more fills, same bytes): both the default and ``+VEN_CACHE_HALF`` builds pass
``make verify`` 77/77 vs QEMU, run manually via
``VL_EXTRA_DEFINES=+define+VEN_CACHE_HALF make verify``.

FPGA payoff (on the uopcache base, K26, 15 ns): ``u_icache`` placed cells
−45 %, MUXF7/F8 −29 %/−24 %, placer congestion level 6 → 5, failing endpoints
−23 %, TNS −38 % — but routed Fmax roughly flat (51.7 → 50.1 MHz), because the
worst path was the FP deferred-commit cone, not the cache. It was kept as a
free area/tail win and, combined with ``VEN_FP_PIPE2``, became the
63–65.3 MHz OOC configuration and part of the deployable SoC image. Set by
``impl_kv260_soc.tcl``, ``apr_run.tcl`` (``HALF=1`` env), and
``apr_hc_pnr.tcl``; in simulation only by manual ``VL_EXTRA_DEFINES``.

It is overridden by ``VEN_IC_SETS``/``VEN_DC_SETS`` and by ``VEN_L1_SETS``
(the ``elsif`` chain ranks them above it); the geometry flows into the icache,
uopcache and ``dcache_timing`` parameters. Under ``VEN_L1_AXI`` the D-cache
half has no effect at all (``dcache_timing`` is compiled out). Never run the
cycle bands on a half-cache build and call a mismatch a regression — the
in-source comment brands it "NOT a verification config".

``VEN_IC_SETS``
---------------

Per-I-cache set-count override (an integer-valued define), the
highest-priority arm of the ``IC_SETS`` chain
(``VEN_IC_SETS`` > ``VEN_L1_SETS`` > ``VEN_CACHE_HALF`` (64) > 128). The value
must be a power of two, since ``IC_IDXW = $clog2(IC_SETS)`` and
``IC_TAGW = 32-5-IC_IDXW`` derive directly from it; it resizes the parametric
``rtl/mem/icache.sv`` arrays and flows into the uopcache parameter as well.
Non-default values stay functionally correct (the cache is
architecture-transparent) but are *not* matched by the fixed-128-set/2-way
``p5trace`` cycle oracle — they are area/perf experiments.

Its real consumer is the L1 cache-size sweep: ``build/rerun/cache_sweep.sh``
(dhrystone, lockstep replay) and ``build/rerun/dmiss_sweep.sh`` (``mb_dmiss``,
pure ``--cycle``) — local working artifacts in the gitignored ``build/``
tree, not committed — build one testbench per geometry via
``VL_EXTRA_DEFINES="+define+VEN_IC_SETS=$SETS +define+VEN_DC_SETS=$SETS"``
into separate ``obj_dir_sweep_*`` directories and measure CPI vs size — the
sweep that produced the 32 KB-knee result (``mb_dmiss`` CPI 2.50 → 0.53).
Cache timing is cycle-mode-only and cycle mode diverges on syscall/TLS
programs, so sweeps either use flat syscall-free kernels or the PC-keyed
lockstep replay path — that constraint is baked into the scripts. Each
geometry must build into its own object directory or it clobbers the canonical
``obj_dir/tb_ventium`` used by ``make verify``.

``VEN_IC_WAYS``
---------------

Per-I-cache associativity override (chain ``VEN_IC_WAYS`` > ``VEN_L1_WAYS``
> 2; powers of two, with a guard so even 1-way elaborates). The icache was
generalized from the original 1-bit MRU 2-way scheme to a per-way age-counter
**true LRU** (rank 0 = MRU … WAYS−1 = LRU) that provably reduces to the
original behaviour at ``WAYS == 2`` — same hit test, same victim sequence,
same touch/fill recency; the reset encoding (age = way index, reproducing the
old victim = way 1) was deliberately matched, and any "cleanup" there breaks
golden identity. The replacement victim is exported per set as ``ic_victim_o``
so the spine's fill-way selection stays uniform.

No script — committed or local — sets it directly; the associativity
demonstration used the umbrella ``VEN_L1_WAYS=4`` via the local
``build/rerun/nway_4way.sh`` (gitignored), which proved a
12 KB conflict working set thrashes 2-way but fits 4-way (CPI drop) and that
4-way free-runs CoreMark with CRC output identical to qemu-native. It
conflicts with ``VEN_UOPCACHE`` at any value greater than 2 (1-bit way ports,
stores sized ``IC_SETS*2``; 1-way is safe but wastes half the slot store). The fixed-2-way oracle means there is no cycle
gate for other way counts — only functional/CRC validation.

``VEN_DC_SETS``
---------------

Set-count override for the L1 D-cache **timing model**
(``rtl/mem/dcache_timing.sv``) — crucially a timing-only structure: there is
no data array; load data always comes from the BFM/``mem_rdata``, and
``dcache_timing`` just tracks tag/valid/LRU so the spine knows *when* a load
completes (a read miss adds the ``dmiss = 8`` penalty, misalignment +3,
mirroring ``p5_mem()``/``l1_access()`` in the ``p5trace.c`` oracle). It can
therefore never change architectural results — only cycle counts, and only in
cycle mode. Like the I-side knob it is functionally safe but breaks the
fixed-128-set cycle-oracle match; it is used by the same two cache sweeps,
always set equal to ``VEN_IC_SETS``.

It is entirely moot under ``VEN_L1_AXI``: the whole ``dcache_timing`` instance
is compiled out and ``dc_lu_hit`` tied 1 (mode-2 timing is functional-only),
so SoC/board builds ignore it. The D-miss penalty is also deferred one
instruction (``pending_mem_pen``), so per-geometry cycle deltas land on the
*following* retire — relevant when eyeballing sweep traces. Because it is
timing-only, a functional-mode run returns identical data values regardless of
the setting; the sweep scripts exploit exactly this, which is what makes
fixed-``--max-insn`` CPI comparisons valid.

``VEN_DC_WAYS``
---------------

Associativity override for the D-cache timing model, the D-side twin of
``VEN_IC_WAYS`` (chain ``VEN_DC_WAYS`` > ``VEN_L1_WAYS`` > 2).
``dcache_timing`` was generalized to any power-of-two way count with a per-way
age-counter true LRU that reduces *exactly* to the original 1-bit ``dc_lru``
behaviour at 2-way — same hit test, victim sequence and recency update, the
reset encoding reproducing the old ``victim = ~dc_lru``. That exact-reduction
property is the load-bearing reason the default build needed no
re-verification after the parameterization commit. Since the block is
timing-only, changing ways only changes when loads complete in cycle mode,
never what they return.

No script — committed or local — sets it directly; the 4-way demonstration
used the umbrella ``VEN_L1_WAYS=4`` via the local ``build/rerun/nway_4way.sh``
(gitignored), which together with the I-side showed the
conflict-kernel CPI collapse at 4-way and CRC-identical CoreMark output as the
functional proof. Moot under ``VEN_L1_AXI``; usually swept together with the
I-side so the two L1s stay symmetric, mirroring the oracle's single L1
geometry parameterization. Its "differential gate" is a functional/CRC
equivalence run plus a CPI plausibility check — never the cycle bands.

Integer core and build plumbing
===============================

This family covers the integer-core implementation flag ``VEN_IDIV_ITER``
plus the housekeeping tokens (``SYNTHESIS``, ``VTM_NO_DPI``,
``VTM_HAVE_DPI_HEADER``, ``M7_PROXY_DEBUG``, ``F2_DBG``) that separate
simulation builds from FPGA synthesis builds. ``VEN_IDIV_ITER`` swaps the
default single-cycle combinational x86 DIV/IDIV — native SystemVerilog ``/``
and ``%`` inside ``rtl/core/core_exec.svh``'s ``S_EXEC`` arm, which
synthesized to the core's worst path once the FPU went iterative (post-P0-1:
87.8 ns, 667 levels, 585 of them CARRY8) — for a
multi-cycle restoring-divide engine plus a new ``S_DIV_BUSY`` wait state:
architecturally bit-exact, but only cycle-band-equivalent. The non-VEN tokens
follow the convention that the default (no-flag) Verilator build is the
golden-comparable one: ``VTM_NO_DPI`` strips the DPI-C retire-observation
channel so the RTL lints and synthesizes with no C testbench, ``SYNTHESIS``
strips sim-only ``$fatal`` guards and bound SVA, and the rest are debug or
C++-side plumbing. None of the non-VEN tokens change architectural behaviour;
``VEN_IDIV_ITER`` has its own differential gate (``make verify-idiv``) plus
full-suite, cycle-band and ``verify-sys`` #DE re-verification.

``VEN_IDIV_ITER``
-----------------

Without the flag, x86 DIV (``q_md = 6``) and IDIV (``q_md = 7``) execute with
native combinational ``/`` and ``%`` directly in the ``S_EXEC`` arm (64-bit
``{EDX,EAX}`` dividend for the 32-bit form), committing GPRs in one clock and
charging the modelled P5 occupancy as a deferred ``pending_mem_pen`` penalty
burned in ``S_PIPE`` before the next issue (occ−7: DIV 10/18/34, IDIV
15/23/39 for r/m8/16/32). ``rtl/core/ven_idiv.sv`` is always in the filelist
but uninstantiated (``rtl/ventium.f`` notes this is harmless).

With the flag, both ``K_MULDIV`` divide arms reroute: ``S_EXEC`` stops
computing anything and parks in ``S_DIV_BUSY``; a combinational driver block
pulses ``eng_int_start`` that same clock, assembling the dividend from ``gpr``
and the divisor from ``srcv`` (recomputed identically to the exec arm), and
instantiates ``ven_idiv``. The engine is a 4-state FSM doing restoring
division on magnitudes at two radix-2 steps per clock, with the x86 sign
fix-up (quotient sign = dividend ^ divisor, remainder takes the dividend sign,
truncation toward zero) and the *exact* per-width #DE predicates from
``core_exec.svh`` (zero divisor short-circuits; per-width quotient-overflow
checks) — bit-exact to the native path. ``S_DIV_BUSY`` busy-waits on done,
then either delivers #DE vector 0 (system-mode ``start_fault`` through the
IDT, user-mode loud-HALT — identical policy to the inline path) or commits
quotient/remainder into EAX/EDX with the same per-width merge, tops up
``pending_mem_pen`` with a residual (IDIV +6, DIV +1, tuned against ``make
m5`` so issue-to-next-issue equals the P5 occupancies 17/25/41 DIV,
22/30/46 IDIV), sets ``retire_valid`` and returns to ``S_PIPE``.

This removed the post-P0-1 worst path of the whole core (87.8 ns / 667
levels). With the flag, ``make verify`` stays 74/74 green and the
``verify-sys`` "pde" #DE program is EQUIVALENT, while the divide cycle kernels
move ``mb_div8`` +5.45 %, ``div16`` +0.13 %, ``div32`` +2.36 %, ``idiv32``
+2.12 % — all inside the 10 % band. The standalone gate is ``make
verify-idiv`` (``verif/idiv/run-idiv-gate.sh``, ``verif/idiv/tb_idiv.sv``: a
golden mirroring ``core_exec.svh``, directed edge cases plus 80 k random
vectors over all six forms). Set by every full-core FPGA configuration from P0-2 onward
(``bd_kv260_soc.tcl``, ``impl_kv260_soc.tcl``, the APR scripts and the
``synth_paths_*``/``probe_muxf_*``/``probe_lut_quick`` probes); the pre-P0-2
baseline probes ``synth_probe_core.tcl`` and ``synth_probe_core_iter.tcl``
deliberately omit it (they are the before/after measurement configs). The
only verif builder that sets it is the FP_PIPE2 A/B base define set.

Gotchas: cycle-mode A/B diffs against a no-flag build show line deltas on
divide-heavy traces — expected, not a bug. The fixed residual works across
widths only because the engine's run-state clock count scales exactly with the
P5 occupancy deltas; changing steps-per-clock breaks the model. The engine
driver recomputes ``srcv`` combinationally and must stay identical to the exec
arm's expression or operands diverge silently. ``tb_idiv`` deliberately
substitutes a well-defined overflow vector for the INT_MIN/−1 corner (the
testbench's native-``/`` golden would hit C++ UB in Verilator); on real x86
that corner is #DE-overflow, which the engine's predicate covers. And the
busy-wait serializes — no instruction issues until the EDX:EAX commit,
matching the native path's non-pipelined U-pipe hold.

``SYNTHESIS``
-------------

The standard simulation-versus-synthesis fence, used exclusively as
``ifndef SYNTHESIS`` around code Vivado cannot or should not synthesize. Six
sites, all in the L1/AXI/SoC-bridge files and none in the core proper:
``rtl/mem/ven_cdc_afifo.sv`` (initial ``$fatal`` guards requiring DEPTH be a
power of two and ≥ 4, plus SVA forbidding write-while-full/read-while-empty);
``rtl/mem/ven_axi_master.sv`` (``$fatal`` guards converting two
silent-corruption bugs into build failures — ``REMAP_BASE`` must be 32-byte
aligned for the INCR8 4 KiB rule, ``LINE_BEATS`` must be 8 to match the
``ven_l1d`` 32 B line — plus bound AXI protocol SVA: VALID-with-stable-payload
until handshake, 32 B-aligned read-burst base, no 4 KiB crossing, no AR while
a write is outstanding); ``rtl/mem/ven_axi_cdc.sv`` (single-outstanding CDC
invariants); and ``rtl/soc/ven_soc_axil.sv`` (``io_ack`` one-cycle-pulse and
pending-required checks, the F3 INTA seam checks, AXI-Lite BVALID/RVALID
hold). Defining it removes all of this, leaving pure synthesizable logic;
nothing architectural changes, so no gate is needed — but verification must
*not* set it, or the L1/AXI gates lose their checkers (those gates build with
``verilator --assert``).

Set by the Vivado SoC/BD flows (``bd_l1axi.tcl``, ``bd_kv260_soc.tcl``,
``impl_kv260_soc.tcl``, ``synth_cdc.tcl``), always paired with
``VTM_NO_DPI``. The guarded files are only in SoC/L1-AXI builds — none are in
the core filelist ``rtl/ventium.f`` — which is why older core-only probe
scripts set only ``VTM_NO_DPI`` and still synthesize cleanly. Vivado also
pre-defines ``SYNTHESIS`` on its own, but the project sets it explicitly
rather than relying on the tool default. In the default testbench build the
concurrent assertions are compiled but inert, because ``--assert`` is only
added by the dedicated SVA/gate targets — keeping the default build
byte-identical and unburdened by assertion machinery.

``VTM_NO_DPI``
--------------

Elides the entire DPI retire-observation contract so the RTL lints,
elaborates and synthesizes with no C testbench present. Without it, four
``import "DPI-C"`` declarations in ``rtl/ventium_pkg.sv`` — ``vtm_retire``
(integer architectural state), ``vtm_retire_x87`` (80-bit st0–st7 split
lo/hi plus fctrl/fstat/ftag), ``vtm_retire_cycle`` (U/V pipe attribution and
pairing) and ``vtm_retire_sys`` (cr0/cr2/cr3/cr4) — are called once per
retired instruction from the retire points in ``rtl/ventium_top.sv`` and
``rtl/soc/ventium_soc.sv`` (including the dual-issue ``retire2`` second
calls); this *is* the observation channel every differential gate depends on
(one ``.vtrace`` record per ``vtm_retire`` call). A small stub fence also
exists in the M0 selftest top (``verif/tb/selftest/ventium_top.sv``).

With the define, ``retire_n`` bookkeeping still ticks but nothing observes it
— the synthesized core carries zero trace overhead. It is mandatory for
synthesis (``fpga/TIMING_PROBLEMS.md`` P2-2, ``fpga/PLAN.md``). Purely an
observability token: it cannot change architectural behaviour (the calls are
pure observers), but a build with it set produces no trace and therefore
cannot be differentially gated — every golden comparison requires it *unset*.
Set by every ``fpga/scripts/*.tcl`` synthesis/impl/probe script and by the
standalone lint flow documented in ``rtl/README.md``; never by
``verif/tb/Makefile`` or any gate script.

Gotchas: the import signatures are declared verbatim from
``docs/rtl-interface.md`` §2 because the comparator parses by field name — do
not "clean up" the import inside the fence. ``ventium_top.sv`` calls
``vtm_retire_x87`` on *every* retirement, not just FP ops, because QEMU
reports live FP state on every instruction in x87-trace mode; eliding it
per-op would silently break x87 diffs. Re-enabling DPI on a previously
NO_DPI-configured build needs no other change.

``VTM_HAVE_DPI_HEADER``
-----------------------

The C++-side companion of ``VTM_NO_DPI`` in the same DPI contract — **a C++
preprocessor macro, not a Verilog define**, and never passed on any command
line. ``verif/tb/dpi_retire.cpp`` implements the bodies of the four
``vtm_retire*`` functions; it uses ``__has_include`` to detect the
Verilator-generated ``Vventium_top__Dpi.h``. When the header is present it is
included (so the definitions are type-checked against the generated extern-C
prototypes) and ``VTM_HAVE_DPI_HEADER`` is self-defined; when absent (a
standalone compile without ``-Iobj_dir``), the ``#ifndef`` branch supplies
hand-written fallback ``extern "C"`` prototypes matching the documented SV→C
type mapping (``longint unsigned`` → ``unsigned long long``, and so on). Pure
build plumbing with zero behavioural effect; it is listed here because it is
the only other ``VTM_*`` token in the tree. If the fallback prototypes ever
drift from the generated header, builds with the header present fail
type-check — the intended safety net. Grep hits land in ``.cpp`` only.

``M7_PROXY_DEBUG``
------------------

A leave-undefined simulation debug token from the M7 Win95-cosim /
syscall-proxy bring-up. Three ``ifdef``'d ``$display`` sites, tagged
``[M7DBG]``: the int-0x80 proxy fold (pc, retire count, injected EAX, gs
apply/base) in ``rtl/core/core_fetch_decode.svh``; MOV-to-Sreg selector and
hidden-base resolution (the gs-selector-0x33 seam) in
``rtl/core/core_exec.svh``; and indirect-CALL target resolution (``q_pc``,
segment base, EA, loaded target), also in ``core_exec.svh``. Purely
observational — no logic, no state, no architectural effect, no gate needed.
Nothing in the repository sets it (manual ``+define+M7_PROXY_DEBUG`` only),
and ``fpga/TIMING_PROBLEMS.md`` explicitly instructs leaving it undefined for
synthesis. The proxy-fold site lives inside the same decode path that
underlies ``VEN_PS_PROXY``'s ``S_SYSCALL_WAIT``, so it is the natural tap when
debugging PS-proxy syscall injection. Output interleaves with testbench
stdout; enabling it on long runs floods logs — intended for short directed
reproductions.

``F2_DBG``
----------

A leave-undefined debug token for the iterative F2XM1 transcendental engine
(issue #11 bring-up). One site, in ``rtl/fpu/fpu_f2xm1.sv``'s ``S_PACK``
state: a ``$display`` of the polynomial-accumulator internals (iteration,
accumulator exponent/sign/significand halves and the ``y`` exponent) just
before ``norm_round_pack_x`` produces the result. Pure observation, no
behavioural or timing effect, no gate needed; nothing sets it. It is only
meaningful when the engine is actually instantiated, i.e. under
``VEN_TRANSCENDENTAL``.

Testbench value macros and sweep hygiene
----------------------------------------

For completeness, the unit-testbench *value* macros are not core feature
flags: ``BCD_N``, ``SRT_N``, ``SQRT_N``, ``IDIV_N``, ``FBLD_N``, ``FPPIPE_N``,
``F2_NV``/``F2_RC``, ``FA_NV``/``FA_RC``, ``FY_NV``/``FY_RC``/``FY_MODE`` and
``FS_NV``/``FS_OP`` live only in the standalone gate testbenches under
``verif/bcd``, ``verif/srt``, ``verif/idiv``, ``verif/fbld``, ``verif/fppipe``
and ``verif/trsc``, and are vector-count / rounding-mode / operation-select
parameters set by their own gate scripts. A full ``ifdef``/``ifndef`` sweep
over ``rtl/`` finds no other unknown tokens beyond the flags on this page, and
the remaining ``#ifndef`` tokens in ``verif/tb`` C++ are ordinary include
guards. The ad-hoc sweep object directories under ``verif/tb`` split two ways:
``obj_dir_sweep_32`` … ``2048`` and ``obj_dir_4way`` are produced by the
local, uncommitted sweep scripts in the gitignored ``build/rerun/`` tree,
while ``obj_dir_fe`` alone is a manual ``VL_EXTRA_DEFINES`` build with no
script at all.

Memory system and SoC platform
==============================

This family is the build-time ladder that turns the verification core (BFM
memory, same-cycle ack, DPI retire tracing) into a real KV260 FPGA SoC, plus
the umbrella L1 geometry knobs and one physical-implementation switch. The
ladder is strictly layered: ``VEN_L1_AXI`` swaps the bus-functional memory
port for a real stalling L1 data array plus an AXI4 master to PS-DDR
(bus mode 2, runtime ``l1axi_en``); ``VEN_AXI_CDC`` optionally splits that
subsystem into two clock domains via async-FIFO bridges; ``VEN_KV260_SOC``
(nested inside ``VEN_L1_AXI``'s port block, so it *requires* it) adds the
PS-facing seams used by the ``ventium_kv260_core`` full-SoC top; and
``VEN_PS_PROXY`` converts the zero-latency int-0x80 lockstep proxy into a
stalling handshake for the future F4 PS syscall bridge. All of these honour
the project convention that the default build stays byte/cycle-identical,
though the protections vary: the double gate (``ifdef`` absent by default
*and* a runtime enable tied 0) holds for ``VEN_L1_AXI``
(``l1axi_en``/``real_bus``), ``VEN_PS_PROXY`` (``proxy_en``) and
``VEN_KV260_SOC``'s interrupt seam (``soc_en``), while ``VEN_AXI_CDC`` is a
compile-time-only path swap shielded by ``l1axi_en`` at the subsystem level;
per-flag differential gates prove architectural bit-exactness for
``VEN_L1_AXI``, ``VEN_AXI_CDC`` and ``VEN_PS_PROXY`` but *not* for
``VEN_KV260_SOC`` (board-level certification only); only cycle *timing* is
allowed to change (mode-2 timing is explicitly functional-only). ``VEN_L1_SETS``/``VEN_L1_WAYS`` are
valued geometry knobs for the in-core caches, and ``VEN_PBLOCK`` is not a
Verilog define at all but an environment variable read by two Vivado Tcl
scripts.

``VEN_L1_AXI``
--------------

The P1-1 "bus mode 2" build: routes the core's same-cycle-ack 32-bit memory
port through ``rtl/mem/ventium_l1_axi.sv`` — ``ven_l1d`` (an 8 KB / 2-way /
32 B-line write-through L1 data array) plus ``ven_axi_master`` (a burst
engine) — out to a full AXI4 master bundle that either the testbench's
behavioural DDR slave or the PS ``S_AXI_HPC0_FPD`` services. Without the flag
there are no AXI ports, no ``ventium_l1_axi`` instance, no
``real_bus``/``bus_err`` core ports, and the ``dcache_timing`` instance is
kept — the core's memory port wires directly (or via the bus-mode-1 BIU) to
the testbench BFM, byte/cycle-identical to all M0–M14 gates.

The flag adds the ``l1axi_en``/``flush_all``/``m_axi_*``/``bus_err`` port
group to ``rtl/ventium_top.sv`` and the ``u_l1axi`` instance with a 3-way
memory mux that holds the direct ``mem_*`` port inert when ``l1axi_en = 1``.
Inside the core it adds ``real_bus`` and ``bus_err`` inputs and arms the
fast-path miss-stall gates: the dual-issue fast path otherwise latches
``mem_rdata`` the same clock it requests (the BFM assumption), which a real L1
cannot honour on a miss — so I-side fill word-0 capture and the ``S_PF``
launch wait for ``mem_ack``, a D-side register-base load bubbles the whole
issue until the fill completes, and the modelled P5 D-miss penalty is
suppressed (real DDR latency becomes the only timing source — "mode-2 timing
is functional-only"). A fatal AXI fault (watchdog timeout or SLVERR) parks the
core in ``S_HALT`` with ``cpu_hung`` so the PS can observe and reset. As an
area side-effect the build drops the ~4.4 k-cell ``dcache_timing`` instance
(``dc_lu_hit`` tied 1 — functionally inert because the penalty it fed is
suppressed).

Architecturally bit-exact: the entire 77-program suite is re-verified against
the QEMU golden through this path (``verif/l1/run-l1axi-verify.sh``); the
byte-accurate/cross-line L1 fix took mode-2 equivalence from 34/77
(aligned-only) to 76/77, and the final program was recovered by the high
quiesce window (the default 64 falsely declared an idle core), giving
``L1AXI-VERIFY-OK`` 77/77.
All gates are doubly protected (``ifdef`` plus runtime
``l1axi_en``/``real_bus = 0``), so even the flagged build in modes 0/1 is
byte-identical. Set by the ``verif/tb/Makefile`` targets ``l1axi`` and
``l1axi_cdc`` (which append ``rtl/mem/ven_l1d.sv``,
``rtl/mem/ven_axi_master.sv`` and ``rtl/mem/ventium_l1_axi.sv`` — deliberately
absent from ``rtl/ventium.f``, so the macro alone is an elaboration error), by
``run-l1axi-verify.sh``, and by the production KV260 define lists
(``bd_kv260_soc.tcl``, ``impl_kv260_soc.tcl``, together with
``VEN_KV260_SOC``). The ``-DVEN_L1_AXI`` C flag also reaches the C++
testbench, gating the behavioural DDR slave leg in ``tb_main.cpp``.

Gotchas: the cycle oracle is not claimed in mode 2 — only the functional trace
is graded. Verification needs a high quiesce (``--quiesce 200000`` vs the
default 64) because a multi-cycle mode-2 access (an icache fill, a 27-beat
FNSAVE) goes many clocks without retiring. ``ven_l1d`` holds ``m_req``
*continuously* across back-to-back transactions, so any downstream consumer
that waits for ``!m_req`` deadlocks (documented in ``ven_axi_cdc.sv``). The
``flush_all`` input exists for cosim coherency: the syscall proxy writes the
backing memory model directly, bypassing the L1. The standalone unit gates
(``run-l1axi-gate.sh``, ``run-l1d-gate.sh``, ``run-l1axi-wd-gate.sh``) do
*not* need the define — the leaf modules are unconditional; the ``ifdef`` only
controls wiring into ``ventium_top``/``core``.

``VEN_L1_SETS``
---------------

Sets the set count for *both* in-core L1s at once: ``IC_SETS`` for the
functional I-cache and ``DC_SETS`` for the D-cache timing model. It sits in a
precedence chain — the per-cache ``VEN_IC_SETS``/``VEN_DC_SETS`` override it,
it overrides ``VEN_CACHE_HALF`` (the legacy 64-set shortcut), which overrides
the 128 default (the Pentium silicon geometry the cycle oracle is validated
against; at the default, the index/tag slices are exactly
``addr[11:5]``/``addr[31:12]``). Set-index width, tag width and all tag/valid/
data arrays derive automatically; the value must be a power of two.

Changing it changes the miss *sequence*: the build stays functionally
bit-exact — supported by the ``VEN_CACHE_HALF`` ``make verify`` run (the
64-set case, 77/77), the 4-way CoreMark CRC free-run and the functional
set-count sweeps, not by a ``make verify`` run with this define itself — but
the cycle trace no longer matches the fixed-128-set ``p5trace`` oracle, so the
M4/M5 bands are only claimed for the default. It is the documented
convenience form (see :doc:`/microarch/l1-parametric`) with no recorded user:
the cache design-space sweeps that found the 32 KB L1 knee passed the
per-cache ``VEN_IC_SETS``/``VEN_DC_SETS`` instead, and the umbrella pair's
only recorded use is the ways side (``VEN_L1_WAYS=4``). There is no
production setter — the FPGA builds use ``VEN_CACHE_HALF`` instead. An important scope limit: it does
**not** touch ``ven_l1d``, the ``VEN_L1_AXI`` data cache — that module has its
own ``L1_SETS`` module parameter left at its 128 default by
``ventium_l1_axi``, so a geometry sweep changes the in-core I-cache and
D-timing model only. Always build into a private object directory so the
canonical ``obj_dir/tb_ventium`` is never clobbered.

``VEN_L1_WAYS``
---------------

Sets the associativity for both in-core L1s at once, with the same precedence
pattern (``VEN_IC_WAYS``/``VEN_DC_WAYS`` win; default 2, the P5 silicon
associativity, at which the generalized age-rank true LRU reduces
byte-for-byte to the original 1-bit MRU scheme — victim = ``~lru`` —
confirmed by the cycle gates). Supporting arbitrary N required the per-way
age-counter LRU described under ``VEN_IC_WAYS``, exposed as ``ic_victim_o`` so
the spine's fill-way selection stays uniform. Functionally bit-exact but
cycle-oracle-divergent for non-default values — an area/CPI exploration knob,
e.g. the 4-way conflict-pattern experiment in ``build/rerun/nway_4way.sh``
(local, gitignored; a 12 KB conflict set on ``+VEN_L1_WAYS=4``). No production setter; the same
``ven_l1d`` scope limit as ``VEN_L1_SETS`` applies. The value must be a power
of two.

``VEN_AXI_CDC``
---------------

The P1-3 dual-clock option for the L1/AXI subsystem. Without it,
``ventium_l1_axi`` is the single-clock ``CDC_BYPASS`` build: ``ven_l1d``'s
word-granular backing port wires straight to ``ven_axi_master`` with plain
assigns, the raw ``core_rst_n`` is used everywhere, and ``core_clk`` and
``axi_clk`` are expected tied to the same clock — which is what the production
KV260 bitstream ships (a single ``pl_clk0``).

With the flag, the core plus ``ven_l1d`` stay in ``core_clk`` (the slow,
Fmax-limited PL domain) while ``ven_axi_master`` and the AXI4 link run in
``axi_clk``. ``ventium_l1_axi`` instantiates ``rtl/mem/ven_axi_cdc.sv``, which
bridges the single-outstanding backing port with exactly one primitive:
``rtl/mem/ven_cdc_afifo.sv``, a clean-room Cummings 2-FF Gray-pointer
asynchronous FIFO (binary plus Gray pointers, 2-flop synchronizers, FWFT read
off distributed RAM). Two FIFOs are used — a command FIFO core→axi carrying
``{we, addr, wdata, wstrb}`` and a response FIFO axi→core (depth ≥
``LINE_BEATS = 8`` for a full line fill) — with small per-domain FSMs that
preserve ``ven_l1d``'s ack semantics exactly (eight in-order acks per read
fill, one per write), so ``ven_l1d`` and ``ven_axi_master`` are textually
untouched; ``ven_axi_master`` is simply re-clocked via a compile-time static
net alias, never a runtime clock mux. Two ``ven_reset_sync``
async-assert/sync-deassert reset bridges are fed by the AND of *both* raw
resets, so a reset in either domain resets both — an asymmetric
mid-transaction reset would otherwise wipe the producer side of an in-flight
fill and create an undetectable deadlock (the watchdog lives in the just-reset
axi domain). The sticky ``bus_err`` level crosses axi→core through a 2-flop
``ASYNC_REG`` synchronizer.

Purely an implementation/Fmax option, and proven bit-exact: the multi-ratio
data-integrity gate (``verif/l1/run-l1axi-cdc-gate.sh``) runs
read-fill/write-through/evict at four core:axi clock ratios with assertions
live, and ``run-l1axi-cdc-verify.sh`` re-runs the full 77-program suite
through the bridge at the degenerate equal-clock ratio. Set by the Makefile
``l1axi_cdc`` target (together with ``VEN_L1_AXI``), the CDC gates, and
``fpga/scripts/synth_cdc.tcl`` (an OOC synthesis check with two genuinely
asynchronous clocks, 66/200 MHz).

Gotchas: the reset coupling is load-bearing — never split the resets. The
CDC's core-side FSM completes *on* the last response and must not wait for the
request to drop, because ``ven_l1d`` holds ``m_req`` continuously across
back-to-back transactions (FNSAVE's 27 sequential stores) — waiting deadlocks.
The FIFO depth must be a power of two ≥ 4 (a sim-time ``$fatal`` enforces
it). The FIFO is metastability-safe by Gray-code structure, not by simulation
— Verilator only proves the functional protocol across ratios. The CDC source
files are not in ``rtl/ventium.f``, so defining the macro without adding them
is an elaboration error; and the real dual-clock MMCM block design with its
``ven_cdc.xdc`` constraints is the pending follow-up.

``VEN_KV260_SOC``
-----------------

The full-SoC seam set for the KV260 PS-assisted architecture (the PL holds
only the core, L1 and AXI master; the PS A53 emulates every slow peripheral).
Inside ``rtl/ventium_top.sv`` it does three things. First, it adds the
``retire_count[63:0]`` output, wired to the live retire counter, so the PS can
observe forward progress without the sim-only DPI (the F2 milestone). Second,
it adds the F3 PS-driven external-interrupt injection seam — ``soc_en``,
``intr`` (an 8259-INT-analogue level), ``inta_vector[7:0]`` (the PS-staged
vector) inputs and the one-clock ``inta`` acknowledge strobe output — and
wires the core's previously tied-off interrupt divert for real (NMI stays
tied 0). Third, it overrides the L1/AXI address remap: ``L1AXI_REMAP_BASE``
becomes ``40'h0_4000_0000`` instead of identity 0, so the core's x86 physical
space lands in the reserved 256 MB DDR carveout on ``S_AXI_HPC0``. Without the
define, even a ``+VEN_L1_AXI`` build has none of these ports and the
interrupt path is tied off, so every cosim/L1AXI gate stays byte-identical.

It is the build flag for the ``rtl/soc/ventium_kv260_core.sv`` top
(``ventium_top`` plus ``rtl/soc/ven_soc_axil.sv``, the AXI4-Lite
control/status/port-I/O-bridge/interrupt slave on ``M_AXI_HPM0_FPD``), whose
header states "build with ``+VEN_L1_AXI +VEN_KV260_SOC``". At runtime it is
additive-and-inert by default: ``soc_en`` comes from the slave's
``MODE.SOCEN`` bit (default 0), so F2 boot behaviour is unchanged until the PS
opts into F3 interrupt injection, and the remap is pure address translation,
invisible to the core. There is no Verilator differential gate for this define
itself — it is certified by the BD validate/synth bars plus the on-board F2
firmware diff (the 324-record COM1 boot trace), while its constituent seams
(the AXI-Lite slave handshake, the L1/AXI wire protocol) have their own
Verilator gates.

Set by ``fpga/scripts/bd_kv260_soc.tcl`` and
``fpga/scripts/impl_kv260_soc.tcl`` (every archived production run echoes
it). It has a **hard dependency on ``VEN_L1_AXI``**: its entire port block is
nested inside that ``ifdef`` and the core connection block references the new
ports — defining ``VEN_KV260_SOC`` alone does not compile. The carveout
base/size (``0x4000_0000``/``0x1000_0000``) must agree between the RTL
localparam, the BD address segment, and the PS Linux reserved-memory node.
Note the deliberate asymmetry: cosim keeps ``REMAP_BASE = 0`` because the
testbench memory model is x86-phys-indexed — do not "fix" one side to match
the other. Among ``ventium_top``-based builds, only ``+VEN_KV260_SOC`` exposes the
``soc_en`` port the core branches its CPUID identity on (0x663 vs 0x543; the
port is tied 0 otherwise) — but the M8 ``ventium_soc`` simulation top drives
``soc_en = 1`` unconditionally with no define, and that is where the
``soc_en`` CPUID arm is actually differential-verified (the ``psoccpuid`` SoC
gate); the define itself only adds ports and the remap.

``VEN_PS_PROXY``
----------------

The F4 PS-bridge stall variant of the Quake user-mode int-0x80 proxy. Without
it, an int-0x80 in proxy mode commits with *zero latency*: the ``S_DECODE``
arm applies the testbench-driven golden EAX/gs/resume effects combinationally
on the same clock (the M7.1 lockstep contract). On real hardware the PS A53
answers a syscall thousands of core clocks later, so the same-clock assumption
breaks.

The define adds the ``syscall_resp_valid`` input through ``ventium_top`` into
the core, a new FSM state ``S_SYSCALL_WAIT``, and a 4-bit ``q_proxy_len``
register. When ``proxy_en = 1`` and ``S_DECODE`` hits ``cd 80``, instead of
committing immediately the core latches only the instruction length and parks
in ``S_SYSCALL_WAIT`` (``eip`` still at the ``cd 80``), committing nothing;
``syscall_active`` still pulses for that one ``S_DECODE`` clock, and because
no retire ticks the counter, ``syscall_n`` and the ``syscall_arg_*`` register
reads stay stable for the PS for the whole wait. On ``syscall_resp_valid`` the
wait arm performs the *identical* commit the zero-latency arm would have
(EAX, optional gs-base latch, ``eip += q_proxy_len``, the fold-pending state)
and resumes fetch — architecturally equivalent, just later. It mirrors the
``S_IO`` port-I/O stall discipline.

The stall reorders *when* effects commit, so equivalence is non-obvious and
the flag has its own differential gate:
``verif/m7/run-quake-lockstep-proxy.sh`` builds ``make proxy``
(``obj_dir_proxy``; the ``-DVEN_PS_PROXY`` C flag also makes ``tb_main.cpp``
play the PS with an 8-clock service latency) and requires the stalled RTL
trace to be bit-exact against the *same* QEMU golden the zero-latency build
matches (``QUAKE-LOCKSTEP-PROXY-OK``). It is runtime-double-gated
(``proxy_en`` must be 1 *and* an int-0x80 must decode in user mode), so even
the flagged build is byte-identical on non-proxy runs.

It is in no FPGA production list — the F4 hardware syscall bridge is future
work; ``ven_soc_axil`` only *reserves* the syscall register window
0x40–0x6C for it, and ``ventium_kv260_core`` ties the response inputs off.
The eventual hardware path composes with ``VEN_KV260_SOC`` (the reserved
window will drive ``syscall_resp_valid``) and with ``VEN_L1_AXI``'s
``flush_all`` (the PS writes kernel-memory effects to the DDR carveout,
requiring L1 invalidation before resume). Gotchas for a PS-side
implementation: ``syscall_active`` pulses on the ``S_DECODE`` clock, *not* the
commit clock, so the response must not be assumed sampled the same cycle; and
``eip`` parks at the ``cd 80``, so ``q_proxy_len`` is the only state needed to
advance on the late commit — do not re-derive the length in the wait arm.

``VEN_PBLOCK``
--------------

**Not a Verilog define.** It is an environment variable read by two Vivado Tcl
floorplanning-experiment scripts (``fpga/scripts/impl_route_relax.tcl`` and
``fpga/scripts/impl_route_fppipe.tcl``) via
``[info exists ::env(VEN_PBLOCK)] && $::env(VEN_PBLOCK) eq "1"``; it never
appears in any ``verilog_define`` list and never reaches the RTL, so it cannot
change the netlist's function — a pure physical-implementation option. When
``VEN_PBLOCK=1``, after out-of-context synthesis of the bare core the script
creates a soft Pblock ``pb_core`` containing every non-internal cell, resizes
it to ``SLICE_X0Y0:SLICE_X111Y239`` (a compact lower-left-biased region) and
sets ``IS_SOFT TRUE`` — the intent being to stop the placer from spreading the
82 %-utilization core across the whole die so hot nets (the ``eip`` fetch
self-update loop, ``f_mem80 → fpr``) stay short. The output build directory
gets a ``_pblock`` suffix so baseline and Pblock runs coexist.

The recorded outcome is a **closed negative result**
(``fpga/TIMING_PROBLEMS.md``): placed Fmax improved only 42.5 → 44.2 MHz, and
the full route did not converge at 82.85 % utilization under either route
directive — the conclusion that floorplanning alone cannot fix a
device-filling core; the real lever is lower utilization, which redirected the
Fmax work toward logic consolidation and the later half-cache/uop-cache
configurations. No Makefile or CI sets it — a manual experiment knob, and the
comparison is to the exact string ``"1"`` (``VEN_PBLOCK=true`` silently
disables it). It operates on builds using the Fmax define stack and is
orthogonal to every Verilog flag; the separate ``run_pblock_rebalance.tcl``
and ``impl_floorplan.tcl`` experiments do not read it. The production KV260
flow does not use it.

Peripheral RTL/PS split
=======================

The ``VEN_<DEV>_PS`` family is Ventium's per-peripheral RTL(PL)-versus-PS(A53)
placement selector for the SoC top ``rtl/soc/ventium_soc.sv`` (see
:doc:`/soc/peripheral-split` for the architecture). A device whose macro is
defined is *not* synthesized into the SoC: its I/O port range is added to
``io_ps_sel``, and every IN/OUT in that range is forwarded over the byte-wide
``io_ps_*`` bridge seam to a C model, with ``io_ack`` waiting on
``io_ps_ack`` and ``io_rdata`` taking the returned byte. Because the core
stalls in ``S_IO`` until ack, the C model's latency cannot perturb the
per-record instruction stream — only the returned value matters.

The flags are generated, not hand-set: ``fpga/scripts/gen_periph_split.py``
reads ``fpga/periph_split.config`` (``rtl`` or ``ps`` per device) and emits
``fpga/build/periph_split.vdefs`` (one ``+define+VEN_<DEV>_PS`` line per
ps-placed device — currently RTC, I8042, VGA, ACPIPM, UART and FDC) plus the C
dispatch table ``verif/tb/ps_periph_table.inc``. In verification the single
setter is ``verif/soc/run-soc-ps-cosim-gate.sh``, which passes one
``+define+VEN_<DEV>_PS`` into the ``verif/tb/Makefile`` ``soc`` target,
building a per-device ``obj_dir_soc_<dev>ps`` testbench whose ``ps_devs``
table (``verif/tb/tb_soc.cpp``) services the forwarded ports from
``sw/ps_periph/<model>.c``. The default build (no flags) is the all-RTL SoC
with ``io_ps_req`` tied 0 — byte-identical to the pre-split SoC.

Each enabled flag changes implementation *placement* only and is required to
stay per-record bit-exact: every settable flag has its own differential gate
(``run-soc-ps-cosim-gate.sh <dev>`` → "C model == RTL == qemu" EQUIVALENT;
aggregated by ``run-soc-ps-cosim-all.sh`` inside ``run-all-soc-gates.sh``).
Off the differential surface, each flag does delete real SoC functionality —
device IRQ lines into the PIC tie to 0, the i8042 A20 output, the VGA mode
bits feeding the chain-4 framebuffer, the UART tx console seam — which the
docs class as board-integration follow-ups needing a PS-to-PL return path.

Three of the nine flags (``VEN_PIC_PS``, ``VEN_PIT_PS``, ``VEN_PORT92_PS``)
are **forward-only latent hooks**: their decode term exists in the
``io_ps_sel`` chain, but the RTL instance is *not* ``ifndef``-gated (always
instantiated), they are declared bus-critical "keep in PL" in
``fpga/periph_split.config``, the cosim gate script rejects them, and (for
pit/port92) no C model exists — setting them today would produce a broken
dual-commit hybrid. Note also that nothing in ``fpga/scripts/*.tcl`` consumes
``periph_split.vdefs`` yet: the current KV260 board builds compile
``ventium_kv260_top``/``ventium_kv260_core`` — *all* peripherals on the PS via
``ven_soc_axil`` — and not ``ventium_soc.sv`` at all, so these flags are
presently exercised only in the Verilator SoC verification builds, standing
ready for a future full-SoC PL build.

``VEN_UART_PS``
---------------

Places the COM1 16550 UART (ports 0x3F8–0x3FF) on the PS. The ``ifndef`` in
``rtl/soc/ventium_soc.sv`` removes the ``ven_uart16550`` instance (real area
savings in a future full-SoC PL build); the ``else`` arm ties
``uart_rdata = 0``, ``uart_irq4 = 0`` (master-PIC IR4 permanently deasserted)
and kills the board-console tx seam (``uart_tx_valid = 0`` — the seam the RTL
streams THR bytes out on). ``cs_uart`` is OR-ed into ``io_ps_sel``, so every
access in the range stalls the core until the C model's byte returns.

The replacement is ``sw/ps_periph/ven_uart16550.c``, a 1:1 behavioural port of
the RTL register surface (IER/LCR/MCR/LSR/MSR/SCR/FCR/DLL/DLM/RBR, DLAB
banking) where the RTL's clocked read side-effects (the LSR/MSR read-clears)
become inline read side-effects, applied exactly once per access via the
bridge's single-call guarantee. The gate
(``run-soc-ps-cosim-gate.sh uart``, test ``psocuart``) is EQUIVALENT over 110
retired instructions. The tx-seam loss is moot on the board: when the UART is
PS-placed, the PS C model itself holds the THR bytes — it *is* the console.
IRQ4 delivery is tied off in the PL; the differential runs with interrupts
disabled so this is unobserved, but a board build needs the PS to inject IRQ4
through ``ven_soc_axil``. One documentation skew: ``sw/ps_periph/README.md``
references a ``run-soc-uart-ps-gate.sh`` wrapper that does not exist — the
real entry point is ``run-soc-ps-cosim-gate.sh uart``. The port range is
adjacent to the FDC's and to IDE alt-status 0x3F6; the decode boundaries are
exact, with no overlap.

``VEN_VGA_PS``
--------------

Places the VGA legacy register file (window 0x3B0–0x3DF) on the PS. The
``ifndef`` removes the ``ven_vgaregs`` instance (ATTR/MISC/SEQ/DAC/GFX/CRTC/
IS1 registers with qemu-matched masks); the ``else`` arm ties
``vga_rdata = 0`` and — critically — zeroes the four exported mode bytes
(``vga_seq_plane_mask``, ``vga_seq_mem_mode``, ``vga_gfx_mode``,
``vga_gfx_misc``). Those bytes feed the still-instantiated RTL chain-4
framebuffer ``ven_vga_fb``: ``chain4_en`` derives from them, so with the
registers on the PS it is stuck 0 and the 0xA0000 mode-13h memory intercept
never fires — the framebuffer is dormant.

The range forwards to ``sw/ps_periph/ven_vgaregs.c``, which ports the register
surface 1:1 including the deterministic read side-effects (DAC 3-byte palette
auto-increment, ATTR index/data flip-flop reset on an IS1 read, the IS1
dumb-retrace toggle), executed inline exactly once per access. The ``pvga``
differential (292 records, shared with the acpipm gate) proves the C model
byte-identical to qemu — ``pvga`` never touches 0xA0000, so the dormant
framebuffer is invisible to it.

This is the biggest cross-domain coupling of the family: register split and
framebuffer rendering are currently mutually exclusive — a mode-13h workload
(the F4 Quake/DOS path, the ``pvgafb`` gate) must *not* be run on a
``+VEN_VGA_PS`` build until a PS→PL mode-bit return path exists (the explicit
follow-up note in both the RTL and the peripheral-split page).

``VEN_RTC_PS``
--------------

Places the MC146818 RTC/CMOS (ports 0x70 index / 0x71 data) on the PS. The
``ifndef`` removes the ``ven_rtc`` instance; the ``else`` arm ties
``rtc_rdata = 0``, ``rtc_irq8 = 0`` (slave-PIC IR8 permanently low) and
``rtc_nmi_dis = 0`` (the NMI-disable side-band, today a lint sink). The two
ports forward to ``sw/ps_periph/ven_rtc.c``, a 1:1 port of ``ven_rtc.sv``
matched to QEMU 8.2.2's ``mc146818rtc.c``: the 128-byte CMOS array behind the
index latch with the separate non-aliasing NMI-disable bit, index-port reads
returning 0xFF, REG_A UIP read-only, REG_C read-then-clear plus IRQ8 lower (an
inline read side-effect, applied once per access), REG_C/REG_D write-ignore,
and the REG_B IRQF recompute on write.

The ``psocdev`` differential (122 records) reads only the time-invariant
registers (REG_D, REG_B, the index-read 0xFF, a scratch-CMOS round-trip), so
the un-oracled periodic tick can never perturb the compare; with the flag set
the same test passes EQUIVALENT against the C model. This is the cleanest flag
of the family — no PL-consumed output other than the (diff-quiescent) IRQ8 and
NMI-disable lines. The RTC time-byte / UIP / periodic-tick surface is a
documented oracle boundary in *both* placements (host-clock-derived); the
cosim gate only proves the synchronous register surface, and board-build IRQ8
delivery needs PS injection via ``ven_soc_axil``.

``VEN_PIT_PS``
--------------

A forward-only **latent hook — do not set**. The macro exists in the
``gen_periph_split.py`` table (ports 0x40–0x43) and in the ``io_ps_sel`` chain
of ``ventium_soc.sv``, but unlike the six slow devices the ``ven_pit``
instantiation is *not* wrapped in ``ifndef VEN_PIT_PS`` and has no tie-off
arm. Defining it therefore does not remove the RTL PIT: it only reroutes
0x40–0x43 accesses to the bridge while the RTL PIT still receives the same
``cs_pit`` pulses and still drives ``pit_out0`` into PIC IR0. Because a
PS-bridged access holds ``io_req`` (and thus ``cs_pit``) high for multiple
clocks until ``io_ps_ack`` — versus the one-clock chip select the RTL devices
were designed for — RTL-side write commits and read side-effects
(counter-latch and read-LSB/MSB flip-flop state) would double-step, and a PS
shadow copy would diverge from the live RTL PIT that actually generates IRQ0.

There is no ``sw/ps_periph/ven_pit.c`` (the generator would report
"MISSING — needs a C model"), ``run-soc-ps-cosim-gate.sh`` has no ``pit``
branch, and ``fpga/periph_split.config`` pins ``pit rtl`` as bus-critical
("8254 PIT — IRQ0 timebase"); a ``+VEN_PIT_PS`` testbench build would read
open-bus 0xFF for the range. The PIT's RTL coverage is the all-RTL ``pirqsoc``
gate. It is distinct from ``VEN_PIT_TICK_DIV``, the unrelated sim-only IRQ0
cadence knob, and the PS-side PIT for the actual KV260 board path is planned
to feed ``sw/ps_periph/ven_pic.c`` via the ``ven_soc_axil`` interrupt seam — a
different architecture that bypasses ``ventium_soc`` entirely.

``VEN_PIC_PS``
--------------

The second latent hook — **do not set**. Its only RTL appearance is the
``io_ps_sel`` term covering the master/slave/ELCR ports (0x20/0x21, 0xA0/0xA1,
0x4D0/0x4D1); the ``ven_pic`` instantiation is unconditional, hard-wired to
the core's INTR/INTA pins and to all 16 device IRQ inputs. Defining it would
forward PIC register I/O to the PS while the RTL PIC keeps full
interrupt-delivery authority but stops seeing consistent register state (reads
answered by the PS, writes seen by *both* sides, with the multi-clock
chip-select double-commit hazard on its OCW/ICW write sequencing and poll-read
side effects) — the interrupt machine state of the RTL PIC and any PS shadow
would immediately diverge.

Uniquely among the three latent hooks, a C model *does* exist —
``sw/ps_periph/ven_pic.c``, a 1:1 port matched to QEMU 8.2.2's ``i8259.c`` —
but its stated purpose is the F3 KV260 board firmware path, where all
peripherals including the PIC run on the A53 and the PS drives interrupts into
the core through ``ven_soc_axil``'s interrupt-injection seam: an architecture
that does not use ``ventium_soc`` or this macro at all. The PIC is the one
device where "register surface on PS" is architecturally insufficient —
INTR/INTA timing is the reason ``fpga/periph_split.config`` pins it in the PL.
``tb_soc.cpp`` does not even register a PIC entry in ``ps_devs``, so a
``+VEN_PIC_PS`` build would read open-bus 0xFF from 0x20/0x21.

``VEN_I8042_PS``
----------------

Places the 8042 PS/2 keyboard controller (ports 0x60 data / 0x64 cmd-status)
on the PS. The ``ifndef`` removes the ``ven_i8042`` instance; the ``else`` arm
ties ``kbd_rdata``, ``kbd_irq1``, ``mouse_irq12``, ``kbd_reset_req``,
``kbc_a20`` and ``kbd_inj_ready`` to 0. Two real functional consequences
follow. First, **A20**: the effective A20 mask is
``eff_a20 = kbc_a20 | p92_a20``, gating address bit 20 on the memory path;
with the i8042 on the PS its contribution is gone, so A20 tracks port-92 alone
— acceptable for the ``psocdev`` differential because that test drives both
A20 sources together and port-92 stays in RTL, but a board build needs the PS
to drive the i8042 A20 state back into the fabric. Second, the testbench's
``--type-at`` keyboard-injection seam (used for the F3/F4 FreeDOS typing path)
becomes unavailable (``inj_ready = 0``), so the interactive FreeDOS flows must
not be run on a ``+VEN_I8042_PS`` build.

The range forwards to ``sw/ps_periph/ven_i8042.c`` — a port of the
queue-independent controller surface (A20 enable/disable 0xDF/0xDD, output
port, mode byte, status) with the 0x60 OBF-dequeue read side-effect inline;
the asynchronous PS/2 device queues are an oracle boundary exactly as in the
RTL. The ``psocdev`` cosim (122 records, shared with the rtc gate) passes
EQUIVALENT. One cosmetic note: ``tb_soc.cpp`` registers the dispatch range as
0x060–0x064 inclusive, but only 0x60/0x64 are ever forwarded because
``cs_i8042`` decodes exactly those two ports — the wider C-side range is
harmless.

``VEN_FDC_PS``
--------------

Places the 82077/8272A floppy controller (ports 0x3F1–0x3F5 plus 0x3F7 —
deliberately *not* 0x3F0 and not 0x3F6, which is IDE alt-status) on the PS.
The ``ifndef`` removes the ``ven_i8272`` instance; the ``else`` arm ties
``fdc_rdata = 0`` and ``fdc_irq6 = 0`` (PIC IR6 dead — quiescent on the
differential anyway, which runs with interrupts disabled). The range forwards
to ``sw/ps_periph/ven_i8272.c``, a 1:1 port of the synchronous register
surface matched to QEMU 8.2.2's ``fdc.c``, with the result-FIFO advance as an
inline read side-effect. The oracle boundary matches the RTL's: actual
READ/WRITE/FORMAT/READ-ID/RECALIBRATE/SEEK execution, DIR media-change and
motor spin-up are excluded (they need disk/DMA or asynchronous timing). The
``psocfdc`` differential (116 records) passes EQUIVALENT.

The port-range carve-out is the subtle part: the generator encodes the same
two disjoint ranges, and ``tb_soc.cpp`` registers 0x3F1–0x3F7 inclusive with a
comment that 0x3F6 can never arrive because ``cs_fdc`` excludes it — so an IDE
alt-status access is never mis-forwarded even with both the UART (0x3F8 up)
and the FDC on the PS; the three decoders partition 0x3F1–0x3FF exactly
(0x3F0 itself is claimed by none of them and falls to the default open-bus
decode). One
stale comment: ``ven_i8272.c``'s header says "built with ``+VEN_I8272_PS``" —
that macro name is wrong; the actual macro is ``VEN_FDC_PS``.

``VEN_ACPIPM_PS``
-----------------

Places the ACPI PM timer (single port 0x608, PIIX4 PM base+8) on the PS. The
``ifndef`` removes the ``ven_acpipm`` instance (a free-running 24-bit counter
advanced by a fractional accumulator at the PM-timer rate); the ``else`` arm
ties ``acpipm_rdata`` and ``acpipm_rdata32`` to 0. The port forwards to
``sw/ps_periph/ven_acpipm.c``, which models the same surface: a read returns
the address-selected byte of ``{8'h00, count[23:0]}``; a write is ignored
(qemu's ``acpi_pm_tmr_write`` is a no-op). The cosim gate (the ``pvga`` test,
292 records, shared with the VGA flag) passes EQUIVALENT. The cheapest device
of the family; its PS placement is mostly about keeping the configuration
uniform.

A width nuance unique to this flag: in the RTL placement the ``io_rdata`` mux
returns the full 32-bit ``acpipm_rdata32`` so a dword IN gets the whole
counter, whereas the PS bridge is byte-wide and a PS-placed read returns one
zero-extended byte. That narrowing is invisible to every existing gate because
the timer's read *value* is a documented oracle boundary in both placements
(host-clock/clk-derived; the differential only exercises the write-inert OUT)
— but if a future test ever oracles a 32-bit IN from 0x608 it will differ
between placements. Never compare instantaneous counter values across
placements; the cadence is structural (clk-derived in RTL, no clock at all in
the C model).

``VEN_PORT92_PS``
-----------------

The third latent hook — **do not set**. Its only RTL appearance is the
``io_ps_sel`` term for port 0x92; the ``ven_port92`` instantiation is
unconditional with no tie-off. With the flag, an IN/OUT 0x92 would be answered
by the bridge while the RTL port92 still latches the same writes (the
multi-clock chip-select double-commit hazard) and still drives ``p92_a20``
into ``eff_a20`` and ``p92_reset_req``. Since there is no
``sw/ps_periph/ven_port92.c`` (the generator names the model but the file does
not exist) and ``tb_soc.cpp`` registers no port-92 entry, reads would come
back open-bus 0xFF — while the live A20 state silently tracked only the RTL
writes.

The placement rationale is physical: A20 masking sits combinationally on the
core's memory-address path (``eff_a20`` gates address bit 20), which cannot
tolerate an A53 round-trip — hence ``port92 rtl # fast A20 — gates the address
bus`` in ``fpga/periph_split.config`` and no cosim branch. The flag exists so
the ``io_ps_sel`` chain is complete and uniform for all nine non-bus-master
device rows (the generator's ide/dma/dma2 rows define
``VEN_IDE_PS``/``VEN_DMA_PS``/``VEN_DMA2_PS`` but have no ``io_ps_sel`` term
in the RTL at all), and as the seam a future PS→PL A20 return path would
build on. Note that
``VEN_I8042_PS`` implicitly depends on port-92 staying in RTL for A20 coverage
on the ``psocdev`` diff; PS-placing port-92 would orphan A20 entirely. The
device's RTL coverage is the all-RTL ``psocdev`` A20-mask differential plus
the standalone unit self-check (the ``port92`` target of
``verif/soc/Makefile`` — ``make -C verif/soc port92``, driving
``tb_ven_port92.cpp``).

Peripheral tunables and debug instrumentation
=============================================

This family is the SoC peripheral-tunable and debug-instrumentation knob set.
Every flag is ``ifndef``-defaulted in the RTL itself or ``ifdef``-guarded
around observation-only code (pure ``$display`` blocks,
``VEN_IDE_DISK_HEX``'s ``$readmemh`` load, ``VEN_DBG_WD``'s two observation
registers), so a build with no flags is byte/cycle-
identical to the golden-gated default. The tunables (``VEN_IDE_DISK_HEX``,
``VEN_IDE_DISK_SECTORS``/``CYLS``/``HEADS``/``SECS``, ``VEN_RTC_EXTMEM_KB``,
``VEN_PIT_TICK_DIV``) exist because Verilator's ``-G`` cannot reach a
submodule parameter: ``ventium_soc.sv`` threads the ``ifndef``-defaulted IDE
geometry and PIT macros into the ``ven_ide``/``ven_pit`` instance parameters,
while ``VEN_RTC_EXTMEM_KB`` is ``ifndef``-defaulted and consumed directly
inside ``rtl/soc/ven_rtc.sv`` (module-internal localparams; the ``ven_rtc``
instance takes only ``TICK_DIV``). They are
sim-only by design — set only on the ``tb_soc`` Verilator build line, never in
any ``fpga/scripts/*.tcl`` define list — though that is moot for the deployed
bitstream, whose file list omits ``ventium_soc`` and these peripherals
entirely; the one flow that does synthesize ``ven_ide`` (the full-SoC probe)
hard-errors on the ``disk[]`` array regardless (``fpga/TIMING_PROBLEMS.md``
P1-2), and P1-3 records that the sim tick divisors must be retuned for a real
fabric clock.

Only ``VEN_IDE_DISK_HEX`` has committed setters (the ``verif/tb/Makefile``
``soc`` target plus five boot/IDE gate scripts); the geometry, PIT and
extended-memory overrides were passed ad hoc via the Makefile's
``VL_EXTRA_DEFINES`` passthrough during the F3 FreeDOS / F4 Quake boot
bring-up — no committed script pins the non-default values. Architecturally,
the disk geometry and ``VEN_RTC_EXTMEM_KB`` change CPU-observable values
(IDENTIFY words, the out-of-range-abort boundary, the CMOS memory-size bytes),
so the differential gates rely on the defaults; ``VEN_PIT_TICK_DIV`` changes
only the explicitly non-differential IRQ0 cadence; and ``VEN_IDE_TRACE``,
``VEN_DBG_WD`` and ``VEN_DBG_SSBASE`` are pure ``$display`` probes with zero
state-machine effect.

``VEN_IDE_DISK_HEX``
--------------------

The disk-image backing-store path for ``ven_ide``'s behavioural hard disk — a
quoted-string macro consumed by exactly one statement in
``rtl/soc/ven_ide.sv``: ``initial if (HAS_DISK) $readmemh(`VEN_IDE_DISK_HEX,
disk);``, loading the raw byte array ``disk[0:DISK_SECTORS*512-1]`` at sim
start. The path is embedded at *build* time but the file is read at *run*
time, so the ``pide`` gate regenerates the hex before each run and the binary
picks up the fresh image. Undefined, the ``ifdef`` simply removes the
``$readmemh`` and ``disk[]`` stays uninitialized (never read by non-IDE
tests); a lint sink keeps the define-less build UNUSEDPARAM-clean, and
``HAS_DISK = 0`` (the empty ATAPI CD secondary channel) suppresses the load
even when the define is present.

The ``verif/tb/Makefile`` ``soc`` target always passes it, with double
quoting — ``-DVEN_IDE_DISK_HEX='"$(VEN_IDE_DISK_HEX)"'`` — so the macro
expands to a SystemVerilog string literal; the make variable defaults to the
absolute path of ``verif/sys/tests/pide/pide.disk.hex``, and each boot/IDE
gate overrides it with its own disk (``run-soc-ide-gate.sh``,
``run-soc-boot-gate.sh``, ``run-soc-bootrm-gate.sh``,
``run-soc-bootdma-gate.sh``, ``run-soc-seabios-boot-gate.sh``). The FPGA flow
must leave it undefined (``fpga/TIMING_PROBLEMS.md`` P2-2). The disk contents
are part of the differential surface: the same single-source image is fed to
``qemu -drive`` and to ``$readmemh``, and READ SECTORS data is graded
byte-identical per record — a "flag change" here is really test content, and
each gate regenerates its golden against the same image. The value needs the
shell-level quote nesting (``'"path"'``); a bare path breaks the preprocess.

``VEN_IDE_DISK_SECTORS``
------------------------

Sets the ``ven_ide`` ``DISK_SECTORS`` instance parameter (default 128 — the
64 KiB ``pide`` single-source image). It sizes the backing array, derives the
disk byte-address width used by both the PIO drain pointer and the DMA dword
offset, and pins ``NSECT28 = DISK_SECTORS``, the LBA28 out-of-range bound used
by the upfront READ abort, the per-sector mid-transfer abort and the
block-mode first-block check. It also feeds the CPU-visible IDENTIFY words
60/61 (LBA28 total) and 100 (LBA48 total). The F3 widening (commit
``564d4d9``) made it actually scalable: pre-F3 the byte address was a hard
``[15:0]``, so sectors above 64 KiB aliased; with the macro, the FreeDOS
multi-MB disk passes ``+define+VEN_IDE_DISK_SECTORS=20160`` (with
``+CYLS=20``) and addressing widens automatically.

No committed script sets a non-default value; overrides are ad hoc on the
``tb_soc`` build line. It is **architecturally visible** — IDENTIFY words and
the out-of-range boundary change — so any non-default value invalidates the
``pide`` per-record golden; non-default runs are boot bring-up only. It must
agree with the image supplied via ``VEN_IDE_DISK_HEX`` and with the qemu
golden geometry, and is normally changed together with the CHS triple. And it
is deliberately sim-only: the 64 KiB default ``disk[]`` (524 288 bits) is
already a hard synthesis error on the full-SoC probe (it cannot infer BRAM
due to the multi-port writes and ``$readmemh`` init, and is too large to
dissolve into FFs — ``fpga/TIMING_PROBLEMS.md`` P1-2), so raising
``DISK_SECTORS`` in a synthesis context makes that strictly worse; the
recorded path to a real disk on hardware is the DDR-backed ``ven_ide`` rework
in ``fpga/PLAN.md`` §5.5, which scales ``DISK_SECTORS``/geometry/OOR as part
of moving the backing store off-chip.

``VEN_IDE_CYLS``
----------------

Threads the ``ven_ide`` ``CYLS`` parameter (logical cylinder count, default
2). It appears in the CPU-observable IDENTIFY block — word 1 (logical
cylinders), word 54 (current cylinders) and, via
``OLDSIZE = CYLS*HEADS*SECS``, words 57/58 (current capacity) — and bounds how
much of the disk guests compute as CHS-reachable. The default 2/16/63 triple
is exactly what qemu's ``guess_chs_for_size`` produced for the 128-sector
``pide`` image, so the IDENTIFY constants grade byte-identical against a
freshly generated qemu golden each run. The F3 FreeDOS disk used ``+CYLS=20``
(20 × 16 × 63 = 20160, CHS-consistent with the sector count). No committed
setter; sim-only. Unlike HEADS/SECS it does *not* enter the CHS→LBA
arithmetic, so an inconsistent CYLS only lies to the guest's capacity math
rather than corrupting translation — but any non-default value still breaks
the ``pide`` per-record golden.

``VEN_IDE_HEADS``
-----------------

Threads the ``ven_ide`` ``HEADS`` parameter (default 16). Two CPU-observable
roles: IDENTIFY words 3/55 plus the OLDSIZE capacity product, and the **live
CHS→LBA translation** used whenever a guest issues a command with devhead
bit 6 clear — ``chs_lba = (cyl*HEADS + head)*SECS + sector − 1``, the exact
qemu ``ide_get_sector`` formula. A wrong HEADS therefore does not just
mis-report geometry; it makes CHS reads land on the wrong sectors. The
``pide`` gate exercises CHS addressing explicitly (the M8.4c surface), so the
default is gate-proven; the FreeDOS-era multi-MB overrides kept 16. No
committed setter; sim-only. A mismatch between the define used for the golden
``qemu -drive`` geometry and the RTL build corrupts CHS data compares — a much
louder failure than an IDENTIFY word diff.

``VEN_IDE_SECS``
----------------

Threads the ``ven_ide`` ``SECS`` (sectors-per-track) parameter (default 63).
CPU-observable in IDENTIFY words 4 (unformatted bytes/track = 512 × SECS),
6, 56 and the OLDSIZE product, and in the CHS→LBA translation as the track
multiplier and the 1-based sector offset. The default is the qemu
``guess_chs_for_size`` value for the 128-sector image, against which the
M8.4c CHS-addressing differential is graded; the FreeDOS bring-up kept 63. No
committed setter; sim-only. Same consistency set and same CHS-corruption
failure mode as ``VEN_IDE_HEADS`` if the RTL define and the golden geometry
diverge.

``VEN_IDE_TRACE``
-----------------

Sim-only ``$display`` instrumentation inside ``rtl/soc/ven_ide.sv``, used
during the F3 FreeDOS boot gap-walk to watch IDE traffic. Five trace points:
(1) READ SECTORS (0x20/0x21) command acceptance — opcode, LBA-vs-CHS mode bit,
resolved LBA, effective sector count and the raw task-file registers; (2)
every device-control (0x3F6) write — new/previous control byte plus live
status and transfer state, i.e. SRST and nIEN flips in context; (3) a
"ZERO-READ" diagnostic when the CPU reads the data port with DRQ clear — the
case where ``rdata = 0`` silently; (4) the first 12 data words of each
non-IDENTIFY sector drain; (5) the multi-sector block re-arm at each DRQ
window boundary. No registers, ports or FSM arms are added; on/off builds are
bit-exact, so no gate is needed. No committed setter — ad hoc
``+define+VEN_IDE_TRACE`` via ``VL_EXTRA_DEFINES`` only; must not be defined
for synthesis.

One structural note: trace point (3)'s ``ifdef`` wraps an entire ``else if``
arm of the clocked access-priority chain — safe today because the arm is
display-only, so with the define off the chain simply falls through with
identical state — but any future edit adding an *assignment* inside it would
create a define-dependent behavioural fork. The comment marks the
read-with-DRQ-clear case as "the zero-source" precisely because that silent
path once hid a bug.

``VEN_RTC_EXTMEM_KB``
---------------------

The F4 (DOS Quake prep) knob: kilobytes of RAM above 1 MiB to advertise via
CMOS (default 0 — legacy behaviour, the CMOS memory-size registers keep their
all-0x00 reset value and SeaBIOS computes RamSize = 1 MiB; every existing
gate byte-identical). Background, from the ``rtl/soc/ven_rtc.sv`` header:
this SoC has no fw_cfg (ports 0x510/0x511 read open-bus 0xFF), so SeaBIOS
sizes RAM purely from CMOS. When the macro is nonzero, ``ven_rtc``'s reset arm
seeds the qemu ``pc_cmos_init`` register set: 0x15/0x16 = 640 (base memory),
0x17/0x18 and 0x30/0x31 = min(EXTMEM_KB, 65535) (KB above 1 MiB),
0x34/0x35 = ((1 MiB + EXTMEM_KB) − 16 MiB)/64 KiB capped at 65535 (64 KB
units above 16 MiB); the >4G registers stay 0. The seed localparams implement
the exact ``hw/i386/pc.c`` formulas, and the seeding sits inside
``if (EXTMEM_KB != 0)``, which constant-folds away at the default, preserving
the legacy all-zero image bit-for-bit.

Proven effect (commit ``4f6c68b``): with 64512 the SeaBIOS e820 gains a
63 MiB RAM entry on the RTL — the prerequisite for DPMI/Quake. The PS-offload
C model ``sw/ps_periph/ven_rtc.c`` mirrors it 1:1 (same ``#ifndef`` default,
same byte math, plus a runtime ``ven_rtc_set_extmem_kb()`` override hook for
future PS firmware, with no current caller). No committed setter — ad hoc
``VL_EXTRA_DEFINES`` only.

Gotchas: this is **architectural** — CMOS bytes 0x15–0x18/0x30/0x31/0x34/0x35
are CPU-readable, so any nonzero value diverges from the qemu golden unless
qemu is launched with the matching ``-m``; the ``psocdev`` RTC differential
only stays EQUIVALENT because the default is 0. The advertised memory is a lie
unless the SoC actually backs it — the value must not exceed real RAM behind
the bus or DOS/DPMI will scribble into nothing. Values above 65535 KB saturate
the 0x30/0x31 pair by design (qemu does the same); the above-16 MiB portion is
carried only by 0x34/0x35. In a ``+VEN_RTC_PS`` cosim, the C-model copy of the
flag is the one that matters — the two must be set consistently or the RTL/PS
split changes the guest-visible memory size.

``VEN_PIT_TICK_DIV``
--------------------

Scales the i8254 PIT tick prescaler for simulation (default 1024 — one PIT
tick per 1024 core clocks). ``ventium_soc`` instantiates
``ven_pit #(.TICK_DIV(PIT_TICK_DIV), .TICK_INC(1))``; inside ``ven_pit`` a
generate block picks the implementation: ``TICK_DIV <= 1`` means a tick every
clock (the natural hardware wiring where ``clk`` *is* the 1.193182 MHz PIT
clock — what the unit testbench uses), while larger values instantiate a
24-bit fractional accumulator. The PIT's register semantics (counts, latching,
read-back, the OUT formula) are qemu-bit-exact regardless; only the wall-clock
cadence of ticks — hence the IRQ0 edges — depends on this knob, and that
cadence is explicitly structural, not oracled.

The default 1024 was chosen so the ``pirqsoc`` differential is
boundary-independent: with roughly 8 clocks per instruction and a ~12-
instruction IRQ0 handler, count 0x40 gives about 65 k clocks per IRQ0 — wide
margin for the mainline to finish handler-plus-spin between edges, making the
post-spin counter value deterministic. The F3 override exists because
SeaBIOS's ``sti``/``hlt`` disk wait spins on the timer IRQ: at divisor 1024 an
18.2 Hz-equivalent IRQ0 period is ~67 M clocks — glacial in simulation, and it
trips the testbench quiescence detector — so the FreeDOS bring-up passed a
small divisor for a fast timer. Sim-only: the FPGA flow must instead retune
``TICK_DIV``/``TICK_INC`` for the real clock so IRQ0 is genuinely
1.193182 MHz-derived (``fpga/TIMING_PROBLEMS.md``).

Gotchas: not bit-exact across values in SoC runs even though no architectural
register changes — IRQ0 edges move, so retire traces of IRQ-taking programs
differ; this is legal only because the differential checkpoints were designed
boundary-independent. The ``pirqsoc`` gate is byte-identical at the default
and would *not* survive an override. Both bounds matter: too-slow IRQ0 plus
``sti``/``hlt`` falsely trips quiescence; too-fast IRQ0 starves the mainline
and breaks the deterministic-count property. ``TICK_INC`` is fixed at 1 in the
SoC instance, so the macro is a pure integer divider here. The unit gate
(``verif/soc/run_pit.sh``) is independent — its testbench hard-pins
``TICK_DIV = 1``.

``VEN_DBG_WD``
--------------

A sim-only opt-in hang watchdog inside ``rtl/core/core.sv``, built for boot
bring-up gap-walking (finding where SeaBIOS or FreeDOS wedges without a
waveform). It adds two registers: ``dbg_stall``, a cycles-since-last-retire
counter (any retire zeroes it), and ``dbg_printed``, a one-shot latch. When
the counter hits exactly 150 000 with the latch clear, it ``$display``\ s a
single ``[VEN-WD] STUCK`` line dumping the wedged FSM state *by name*
(``state.name()``), the outstanding ``mem_addr``, ``cr0``,
sys-mode/v86/real-mode, ``eflags``, the IDT/GDT bases, the in-flight
``int_vec``, CPL, CS base and ``eip`` — exactly the registers needed to
localize a real/protected-mode delivery wedge. Pure observation: no output
ports, no FSM influence — retire streams are bit-identical with or without it.

The threshold was deliberately *raised* to 150 000 in commit ``564d4d9`` when
interruptible HLT landed, so legitimate long yield-halts between timer IRQs do
not false-fire; it is co-tuned with ``VEN_PIT_TICK_DIV`` (at divisor 1024 an
``sti``/``hlt`` wait can legally sit ~67 M clocks without retiring, which is
why the watchdog is opt-in and bring-up runs pair it with a fast PIT divisor).
No committed setter — manual ``+define+VEN_DBG_WD`` works for both the ``rtl``
and ``soc`` targets. Gotchas: it is one-shot — only the first stall per reset
is reported. It uses ``state.name()``, fine under Verilator but another reason
it must never reach synthesis. A multi-cycle iterative engine (an SRT divide,
~66 clocks) is far below threshold, but artificially lowering the constant can
false-fire on legitimate long microcoded sequences; 150 000 is the empirically
safe value. It is distinct from the testbench-level quiescence detector
(always on) and from the architectural ``cpu_hung`` F00F-erratum latch (always
built).

``VEN_DBG_SSBASE``
------------------

A sim-only F3 gap-walk probe pair compiled into the top of the ``S_DECODE``
arm (``rtl/core/core_fetch_decode.svh``). Probe 1 (``SSBASE-DESYNC``): on
every instruction decode while ``seg_real`` is set (real mode or V86), it
checks that the SS, DS and ES segment *bases* equal their selector << 4 — the
real-mode invariant — and prints ``eip`` plus all three selector/base pairs on
any mismatch. This is the probe class that root-caused the F3 FreeDOS stall: a
stale protected-mode base surviving into real-mode addressing. Probe 2
(``ESP-HIGH``): while ``seg_real``, it checks ``ESP[31:16] == 0`` — 16-bit
code never legitimately carries high ESP bits, so a hit means a 32-bit-ESP
stack op leaked past 64 K (the missing real-mode SP wrap, explicitly noted as
the audit's remaining finding). Both are pure ``$display`` statements
evaluated once per ``S_DECODE``; no state, port or timing change — bit-exact
on/off. No committed setter; F3 forensics use only, paired naturally with
``VEN_DBG_WD`` and ``VEN_IDE_TRACE``.

Read the output with care: probe 1 false-positives *by design* on unreal-mode
idioms — SeaBIOS's big-real transitions deliberately carry a protected-mode
data selector with base 0 into real mode, tripping the ``sel << 4`` check
while being exactly what the firmware intended — so its output is a
lead-generator, not an assertion. It fires per decoded instruction while the
condition persists, so a genuine desync floods the log (useful as a timeline,
expensive in throughput). And probe 2 documents a *known modelling gap* rather
than a guest bug — a hit may indict the core, not the software. User-mode
protected/flat corpora never assert ``seg_real``, so both probes are silent
there.

Deployed configurations
=======================

This section records the define sets the committed build entry points actually
use.

The default verification battery
--------------------------------

The canonical builds set **no defines at all**. The ``verif/tb/Makefile``
``rtl`` target leaves ``VL_EXTRA_DEFINES`` empty and builds into ``obj_dir``;
it is invoked define-free by ``make verify`` (``verif/verify.sh``), the
milestone runners (``run-m0-smoke.sh`` through ``run-m5.sh``), ``make
verify-sys`` (``verif/sys/run-sys-golden.sh``), the errata suite (``make
m6``), the Quake and Win95 lockstep/cosim runs (``verif/m7``), and the
benchmark sweeps (``verif/bench``). The ``emu`` and ``rtl-sva`` targets use
the same zero-define set (``rtl-sva`` adds only ``--assert`` and the
``biu_p5_sva`` bind). Everything this page calls "the default build" is this
configuration; it is the one byte/cycle-identical to the golden model.

Variant verification builds layer exactly the define under test on top:
``make rtl`` with ``+define+VEN_FP_OVERLAP`` into ``obj_dir_fpovl`` and
``+define+VEN_FXCH_FREE`` into ``obj_dir_fxch`` (the M5 gap kernels);
``+define+VEN_TRANSCENDENTAL`` into ``obj_dir_trsc`` (the in-core
transcendental gates); the ``l1axi`` target (``+define+VEN_L1_AXI``, plus the
C-side ``-DVEN_L1_AXI``) into ``obj_dir_l1axi`` and ``l1axi_cdc``
(``+define+VEN_AXI_CDC``) into ``obj_dir_l1axi_cdc``; the ``proxy`` target
(``+define+VEN_PS_PROXY``) into ``obj_dir_proxy``; and the FP_PIPE2 A/B gate,
whose base set is ``VEN_SRT_ITER VEN_IDIV_ITER VEN_BCD_ITER VEN_FP_PIPE
VEN_BTB_PIPE`` with build B adding ``VEN_FP_PIPE2`` (``obj_dir_fp1``/
``obj_dir_fp2``). The standalone unit-engine gates pass only testbench-local
value macros (``SRT_N``, ``BCD_N``, ``IDIV_N``, …), no core feature flags.
Every variant gets its own object directory precisely so the canonical
``obj_dir/tb_ventium`` is never clobbered.

The SoC gate battery
--------------------

The ``verif/tb/Makefile`` ``soc`` target (top ``ventium_soc``,
``obj_dir_soc``) always defines exactly one macro: ``VEN_IDE_DISK_HEX``,
defaulting to the ``pide`` disk image path. The ``make verify-soc`` aggregate
(``verif/soc/run-all-soc-gates.sh``) builds it that way for all the
peripheral, firmware-boot and real-mode gates; the disk-booting gates
(``run-soc-ide-gate.sh``, ``run-soc-boot-gate.sh``, ``run-soc-bootrm-gate.sh``,
``run-soc-bootdma-gate.sh``, ``run-soc-seabios-boot-gate.sh``) override only
the disk path. The PS-offload cosim gates add exactly one placement flag each:
``run-soc-ps-cosim-gate.sh <dev>`` builds
``{VEN_IDE_DISK_HEX, VEN_<DEV>_PS}`` into ``obj_dir_soc_<dev>ps`` for ``uart``,
``rtc``, ``i8042``, ``acpipm``, ``fdc`` and ``vga``; ``VEN_PIC_PS``,
``VEN_PIT_PS`` and ``VEN_PORT92_PS`` are not in the map — no gate sets them.

The shipped KV260 full-SoC bitstream
------------------------------------

The deployable image (``kv260_soc_impl_linux_40``, built by
``fpga/scripts/impl_kv260_soc.tcl`` with ``FE_PIPE=1`` at ``PL0_MHZ=40``) sets
the full production stack::

   SYNTHESIS  VTM_NO_DPI
   VEN_SRT_ITER  VEN_IDIV_ITER  VEN_BCD_ITER
   VEN_FP_PIPE  VEN_FP_PIPE2  VEN_BTB_PIPE
   VEN_IC_BRAM  VEN_UOPCACHE  VEN_CACHE_HALF
   VEN_L1_AXI  VEN_KV260_SOC
   VEN_FE_PIPE          (appended by the FE_PIPE=1 environment knob)

Reading it as layers: the two housekeeping tokens strip the DPI channel and
the sim-only assertions; the three iterative engines and the two-stage FP
commit pipeline are the bit-exact Fmax substrate; ``VEN_BTB_PIPE``,
``VEN_IC_BRAM``, ``VEN_UOPCACHE`` and ``VEN_CACHE_HALF`` are the front-end
congestion stack; ``VEN_L1_AXI`` + ``VEN_KV260_SOC`` are the platform layer
(real bus, PS seams, DDR-carveout remap); and ``VEN_FE_PIPE`` is the final cut
that lets the full SoC route. Three footnotes from the setters data:
``VEN_BCD_DIV100`` is **absent** from this list (and from
``apr_hc_pnr.tcl``) — the ÷100 step is documented in ``ven_bcd.sv`` and
``docs/fpga-synthesis.md``, but no committed Tcl carries the define; the
65.3 MHz half-cache result added it ad hoc; the older
``fpga/scripts/bd_kv260_soc.tcl`` validate-and-synth flow still carries the
pre-uopcache config (``VEN_IC_NARROWB`` instead of
``VEN_IC_BRAM``/``VEN_UOPCACHE``/``VEN_CACHE_HALF``, and no
``VEN_FP_PIPE2``); and no committed Tcl consumes
``fpga/build/periph_split.vdefs`` yet — the board path runs all peripherals on
the PS via ``ven_soc_axil``, so the ``VEN_<DEV>_PS`` flags are exercised only
in the Verilator SoC builds today.

``VEN_DBG_CORE`` — on-die debug / trace / step / breakpoint
-----------------------------------------------------------

A single opt-in flag arms a comprehensive forensic unit. It is **OFF by
default** (``make verify`` stays 77/77 byte/cycle-identical; the FPGA close is
unchanged). Turn it on for a *debug bitstream* (``DBG_CORE=1`` in
``fpga/scripts/impl_kv260_soc.tcl``) or a debug sim (``+define+VEN_DBG_CORE``;
the ``l1axi_kv_dos`` tb target carries it).

What it adds (all observers except step/breakpoint, which only act once the PS
arms them — so even a ``VEN_DBG_CORE`` build is cycle-identical until then):

* **Committed-state taps** — the EIP / CS / ESP / EFLAGS of the last retired
  instruction, the FSM micro-state (``state_e``), the last exception/IRQ vector
  and its source EIP, and live CR0 (PE/PG = real/protected/paging). Surfaced as
  ``ventium_top`` debug-bundle ports (read directly by ``tb_main``; on a stop it
  prints a one-line ``[VEN_DBG]`` dump naming *where and why* the core stopped).
* **PC ring buffer** (``ven_soc_dbg``, 32 × {state,CS,EIP}, BRAM) — the last N
  retired PCs, read back N-back via ``R_DBG_TRACE_IDX``/``_PC``/``_AUX``. On a
  board freeze this is the instruction *trail* into a derail.
* **Freeze detector** — a stall counter (reset on each retire); crossing the
  PS-set ``R_DBG_FREEZE_TH`` latches a snapshot (EIP/FSM/vector) + a sticky
  ``frozen`` bit, distinguishing a silent stall from a clean halt.
* **Performance counters** — cycles, retired, no-retire (stall) cycles, S_IO
  cycles, external IRQs (``R_DBG_PERF_*``). CPI = cyc/retired.
* **Single-step / breakpoint** — ``R_DBG_RUNCTL`` parks the core at the S_DECODE
  instruction boundary (``halt_req``), grants one instruction (``step``), or a
  PC breakpoint (``bp_en`` + ``R_DBG_BP_ADDR``) halts at the target. The park
  sits *below* SMI/NMI/INTR (no interrupt is lost) and is modeled on the
  resumable S_HLTWAIT halt; in-flight loads/fills advance untouched.

The PS reads it all through the ``ven_soc_axil`` 0x80+ window (mirrored in
``sw/ps/ven_soc_app/ven_soc_regs.h``); ``ven_soc_app`` probes ``R_DBG_CAP`` ==
``0xDB01_0020`` and ``dbg_dump_core()`` prints the full state + ring on a
``CPU_HUNG`` or a silent-stall auto-trigger. Verified: the deployed top lints +
``make verify`` 77/77 (off); in sim ``--dbg-step N`` walks the EIP stream one
instruction at a time and ``--dbg-bp <eip>`` parks at the target.