Verilog build flags¶
Ventium’s RTL is built around a single convention: the default build — no
defines at all — is the verification baseline. It is the configuration every
differential gate runs, byte- and cycle-identical to the golden model
(QEMU for architectural state, p5trace for the cycle traces). Every build
flag in the project opts in to something on top of that baseline, and every
opted-in configuration is itself pushed back through the differential battery
before it is trusted.
The flags divide into two kinds. Behaviour-preserving implementation options change how a result is computed — an iterative engine instead of a combinational cone, a BRAM instead of LUTRAM, a pipeline register on a critical path — and exist purely for FPGA Fmax and area. These must stay architecturally bit-exact, and either cycle-identical or inside the loose M4/M5 cycle bands; they are what the KV260 bitstream configurations set. Behaviour-adding options change what the core does — the authentic FDIV flaw, eight new transcendental instructions, a P5 cycle-model refinement, a real AXI bus — and each one carries its own oracle and its own gate, because the default golden no longer applies to it.
This page is the complete reference: every conditional-compilation define and
build parameter in the tree, what it gates, who sets it, and how (or whether)
it is verified. File references are given as paths; the authoritative location
is always the ifdef in the source itself.
Quick reference¶
Flag |
Kind |
Default |
Purpose |
|---|---|---|---|
x87 FPU implementation |
|||
|
boolean define |
off — single-cycle FP execute |
1-deep FP-execute commit pipeline (defer commit to N+1) for Fmax. |
|
boolean define |
off (inert without |
Second commit stage at the |
|
boolean define |
off — serialized, matches oracle |
P5 cycle gap 1: independent integer ops overlap an in-flight FDIV (own oracle variant). |
|
boolean define |
off — FXCH costs 1 clock |
P5 cycle gap 2: an FXCH following a stack push folds in for free (own oracle variant). |
|
boolean define |
off — combinational FDIV/FSQRT |
Multi-cycle iterative SRT divide and square-root engines, bit-exact. |
|
boolean define |
no-op (legacy) |
Historical name; the genuine SRT divider is now the unconditional default. |
|
boolean define |
off — correct PLA |
The authentic Pentium FDIV flaw (buggy quotient-selection PLA). |
|
boolean define |
off — SRT engine is default |
Opt back to the behavioural wide divide (fast-sim escape hatch). |
|
boolean define |
off — combinational BCD cones |
Iterative FBSTP/FBLD packed-BCD conversion engines, bit-exact. |
|
boolean define |
off everywhere |
One divide-by-100 per BCD step instead of two chained divide-by-10
(Fmax micro-optimisation inside |
|
boolean define |
off — opcodes HALT |
Adds the 8 x87 transcendental instructions via iterative engines. |
|
boolean define |
inert (reserved) |
Planned silicon-accuracy mode select for the transcendental engines; not implemented. |
Front-end and L1 caches |
|||
|
boolean define |
off — combinational resolve |
Registers the BTB resolve inputs; cycle-neutral Fmax fix. |
|
boolean define |
off — live byte-window decode |
Predecode-on-fill micro-op cache; the fix for the x86 byte-window MUXF congestion wall. |
|
boolean define |
off |
Sim scaffold re-instantiating the reference decode beside the µop-cache slot read. |
|
boolean define |
off — live TLB lookups |
Page-keyed fetch/data micro-TLB register; breaks the |
|
boolean define |
off — async LUTRAM lines |
Block-RAM icache line store with registered reads and predicted-target prefetch. |
|
boolean define |
off — full-width port B |
Halves the icache straddle read port (distributed-RAM front end only). |
|
boolean define |
off — 128 sets per L1 |
Both L1s halved to 64 sets (4 KB each); FPGA area/congestion option. |
|
valued define (int) |
128 |
I-cache set-count override for geometry sweeps. |
|
valued define (int) |
2 |
I-cache associativity override (parametric true LRU). |
|
valued define (int) |
128 |
D-cache timing-model set-count override. |
|
valued define (int) |
2 |
D-cache timing-model associativity override. |
Integer core and build plumbing |
|||
|
boolean define |
off — combinational divide |
Multi-cycle restoring DIV/IDIV engine replacing the native |
|
boolean define |
unset in simulation |
Strips sim-only |
|
boolean define |
unset in simulation |
Elides the four DPI-C retire-observation imports and call sites. |
|
C++ macro (self-defined) |
auto-detected |
|
|
boolean define |
off |
|
|
boolean define |
off |
|
Memory system and SoC platform |
|||
|
boolean define |
off — BFM memory port |
Real L1 data cache plus AXI4 master to PS-DDR (bus mode 2). |
|
valued define (int) |
128 |
Umbrella set-count knob for both in-core L1s at once. |
|
valued define (int) |
2 |
Umbrella associativity knob for both in-core L1s at once. |
|
boolean define |
off — single-clock bypass |
Dual-clock async-FIFO bridge inside the L1/AXI subsystem. |
|
boolean define |
off |
PS-facing seams: |
|
boolean define |
off — same-clock proxy |
Stalling |
|
environment variable (Tcl) |
unset |
Vivado soft-Pblock floorplan experiment; not a Verilog define. |
Peripheral RTL/PS split |
|||
|
boolean define |
off — device in RTL |
COM1 16550 UART served by the PS C model. |
|
boolean define |
off — device in RTL |
VGA register file served by the PS C model (framebuffer goes dormant). |
|
boolean define |
off — device in RTL |
MC146818 RTC/CMOS served by the PS C model. |
|
boolean define |
off — keep off |
Latent hook only; the 8254 PIT is pinned in PL (IRQ0 timebase). |
|
boolean define |
off — keep off |
Latent hook only; the 8259 PIC is pinned in PL (INTR/INTA latency). |
|
boolean define |
off — device in RTL |
8042 keyboard controller served by the PS C model. |
|
boolean define |
off — device in RTL |
82077/8272A floppy controller served by the PS C model. |
|
boolean define |
off — device in RTL |
ACPI PM timer served by the PS C model. |
|
boolean define |
off — keep off |
Latent hook only; port-92 fast-A20 is pinned in PL (gates the address bus). |
Peripheral tunables and debug instrumentation |
|||
|
valued define (path) |
undefined — no disk load |
Hex disk-image path |
|
valued define (int) |
128 |
Disk size in sectors; sizes the array, the LBA28 bound and IDENTIFY. |
|
valued define (int) |
2 |
IDE logical cylinder count (IDENTIFY geometry). |
|
valued define (int) |
16 |
IDE logical head count (IDENTIFY + CHS→LBA translation). |
|
valued define (int) |
63 |
IDE sectors per track (IDENTIFY + CHS→LBA translation). |
|
boolean define |
off |
Five |
|
valued define (int) |
0 — legacy CMOS |
Seeds the CMOS extended-memory registers SeaBIOS sizes RAM from. |
|
valued define (int) |
1024 |
PIT tick prescaler — simulation IRQ0 cadence only. |
|
boolean define |
off |
One-shot 150 000-cycle no-retire watchdog |
|
boolean define |
off |
Real-mode segment-base / high-ESP forensic probes. |
x87 FPU implementation¶
This family covers the flags around the x87 FPU: the radix-4 SRT FDIV/FSQRT
datapath, the FBSTP/FBLD packed-BCD converters, the FP-execute commit
pipeline, the in-core transcendental engines, and two P5 cycle-model “gap”
features (FDIV/integer overlap, free FXCH). The default build uses
single-cycle combinational x87 functions (rtl/fpu/fpu_x87_pkg.sv,
rtl/core/ventium_x87_pkg.sv) that are bit-exact against QEMU softfloat at
every retire — and note that the genuine SRT divider with the correct PLA is
now the fx_div default (still bit-exact); VEN_DIV_EXACT opts back
to the behavioural wide divide.
The family splits cleanly in two. The implementation/Fmax options
(VEN_SRT_ITER, VEN_BCD_ITER, VEN_FP_PIPE, VEN_FP_PIPE2,
VEN_BCD_DIV100, VEN_DIV_EXACT) must stay bit-exact and band-safe; the
first four are what the KV260 FPGA builds set, while VEN_BCD_DIV100 and
VEN_DIV_EXACT are manual/sim-side options in the same class. The architectural opt-ins change visible
behaviour and carry their own oracles and gates: VEN_SRT_FDIV_BUG (the
authentic Pentium FDIV flaw), VEN_TRANSCENDENTAL (eight new instructions),
and VEN_FP_OVERLAP/VEN_FXCH_FREE (cycle-trace changes graded against
fpovl=1/fxchfree=1 variants of the p5trace oracle). Two names in
this family are not live ifdefs at all: VEN_SRT_DIV is a legacy flag
whose semantics became the default in commit 3791e46, and
VEN_TRSC_SILICON is a reserved, intentionally inert name.
VEN_FP_PIPE¶
Enables the 1-deep FP-execute commit pipeline that splits the
eip → icache → decode → f_eval → fpr critical cone across two clocks.
fx_add alone is roughly 14 ns of logic, so a single-cycle FP execute can
never clear 66 MHz (problem P0-5 in fpga/TIMING_PROBLEMS.md). Without the
flag, f_eval is computed and committed in the same clock as issue (fast
arm) or as S_FEXEC dispatch (slow arm); the fpp_* registers are absent
and the we_wabs port is tied 0.
On the fast arm (cycle-mode S_PIPE arithmetic), issue captures the
operands, aluop, rounding control and absolute destination into fpp_*
pipeline registers; the result is computed the next clock from the registered
operands and written through a new absolute-indexed
we_wabs/we_wabs_fstat port on rtl/fpu/fpu_top.sv. A one-clock
read-after-write hazard (fp_pipe_rd_haz) stalls a same-clock role-0/1
reader of the in-flight target. On the slow arm (S_FEXEC
memory-operand arithmetic), the s_arf cone is dropped and a new FSM state
S_FEXEC_EX evaluates from the registered fpp_* and commits via
we_wabs in the same clock as the retire (rtl/core/core_fp_exec.svh).
That same-clock pairing is mandatory: the functional differential harness
checks architectural state at retire, so a plain one-clock defer would read
stale state — the documented functional-mode-retire-check trap. The FP
scoreboard already publishes results at issue + latency (fadd = 3), so
the N+1 commit lands before any role ≥ 2 consumer reads it and the M5 FP
cycle bands hold by construction. The flag also cut area: LUT utilisation
91.7 % → 82.85 %.
It gates the S_FEXEC_EX state, the fpp_* capture/advance registers,
the we_wabs port group, the suppression of the same-cycle
fp_we_top/fp_we_fstat writes, and the issue-side commit bubble — all
in rtl/core/core.sv, rtl/core/core_fp_exec.svh,
rtl/core/core_fastpath.svh and rtl/fpu/fpu_top.sv.
Set by every KV260 SoC bitstream config (fpga/scripts/impl_kv260_soc.tcl),
the half-cache and path-probe scripts (fpga/scripts/apr_hc_pnr.tcl,
fpga/scripts/synth_paths_retime.tcl,
fpga/scripts/synth_paths_icbram.tcl), and the A/B cycle gate’s base define
set (verif/fppipe/run-fp-pipe2-ab.sh). It composes with VEN_SRT_ITER:
with it, the deferred commit uses the split
f_eval_s2(f_eval_s1(...)) sharing a single round_pack cone (−999 LUT),
safe only because the iterative engine owns divide; without it the full
f_eval is kept, because divides do defer through this port. It is the
parent of VEN_FP_PIPE2 (whose ifdef is nested inside it), and the
VEN_FXCH_FREE fold only absorbs an FXCH after a push because the
arithmetic path defers under this flag.
One honest caveat: fp_pipe_rd_haz protects role-0/1 readers
(FXCH/FLDSTI/slow FST/FCOM), but the icache LRU-touch driver does not
replicate the bubble; the code argues this is band-safe because those readers
never appear in the graded throughput kernels. VEN_FP_PIPE is therefore
band-preserving, not proven byte-cycle-identical for arbitrary role-0/1
sequences. Validation: the default make verify 75/75 is unaffected; with
the flag, make m3 75/75, make verify 75/75, and mb_faddchain CPI
2.989 / mb_fpindep 1.152 — both in band.
VEN_FP_PIPE2¶
The second pipeline stage for the FP commit — the flag behind the K26
half-cache 60 MHz milestone (see The FP execute/commit pipeline). With
VEN_FP_PIPE alone, the deferred commit runs the whole f_eval
(s1 front → s2 round) in the single commit clock — an ~80-level /
~43-CARRY8 cone from fpp_* through fx_round_pack to fpr that caps
routed Fmax around 52 MHz. VEN_FP_PIPE2 inserts exactly one register
(fpp2_s1, of type fx_pipe_t) at the f_eval_s1/f_eval_s2
boundary: the captured op’s front half (unpack/align/add-or-mul) runs in clock
N+1 and only the short round-pack plus the we_wabs write runs in N+2. The
result lands in fpr at issue+2, still at or before the scoreboard publish
at issue+3, so per-retire architectural state and the FP cycle bands are
identical to the 1-stage build with no oracle change. The read-after-write
hazard window widens to span both in-flight slots.
Measured on the half-cache OOC build: routed 63.0 MHz (then 65.3 MHz with
VEN_BCD_DIV100); synthesis sees the FADD cone leave the critical path
(59.4 → 78.4 MHz synth), the worst path becomes the iterative BCD engine
(u_bcd), and only with VEN_BCD_DIV100 added does the binder move to
the fpp → fpp2_s1 stage-1 front (docs/fpga-synthesis.md). It is proven
cycle-safe by a dedicated A/B gate (verif/fppipe/run-fp-pipe2-ab.sh,
make verify-fppipe2) that builds two testbenches differing only by this
define and requires the A and B --cycle traces to be byte-identical on
both kernels (mb_faddchain and mb_fpindep), plus a separate band
check of B against the p5trace golden (12 % tolerance) on each.
Set by fpga/scripts/impl_kv260_soc.tcl (production define list), by
fpga/scripts/apr_hc_pnr.tcl when the FP_PIPE2=1 environment knob is
given (output tagged _fp2), and as TB “B” of the A/B gate. It requires
``VEN_FP_PIPE`` (nested ifdef — silently inert without it) and
``VEN_SRT_ITER``: the split eval’s divide arm returns an early zero, which
is only unreachable because the iterative engine owns divide. There is no
compile-time error if VEN_SRT_ITER is missing — such a build would
silently mis-divide deferred FDIVs; the guard is by convention. Note also that
the A/B gate proves equivalence to the 1-stage build, not directly to the
default — it deliberately inherits VEN_FP_PIPE’s band/oracle validation.
VEN_FP_OVERLAP¶
Gap 1 of the P5 cycle-model gaps identified from the RTL Engineering
transcripts and shared with the p5trace.c oracle: the real P5 overlaps
independent integer work with a long FDIV, the default model does not. Without
the flag there is one in-order pipe_free_at = issue + occ — an FDIV
(occupancy 39) blocks even independent integer ops behind it for the full
39 clocks, matching the default oracle.
With the flag, a 32-bit fp_busy_cyc register models when the single x87
execution unit becomes free. An issuing FP arithmetic op holds the integer
pipe only P5_FP_ISSUE_OCC = 2 clocks (retiring at issue+2) while the real
occupancy window moves onto fp_busy_cyc — so following independent integer
ops issue and retire inside the FDIV shadow, while a following FP op (FXCH
exempt, as a rename) waits until fp_busy_cyc. The wait is mirrored in all
three replicated guards — the fast-path issue arm
(rtl/core/core_fastpath.svh), the icache LRU-touch driver and the
fp_we_* commit driver (rtl/core/core.sv) — so spine and drivers agree
on the commit clock.
This is an architectural (cycle-trace) change: mb_fdivint’s CPI drops
from the serialized ~1.73 to the 1.20–1.47 band. It therefore has its own
differential gate: verif/run-m5.sh builds a separate testbench into
obj_dir_fpovl with +define+VEN_FP_OVERLAP and grades only the manifest
"fpovl": true kernels against an fpovl=1 variant of the p5trace
oracle; every other kernel keeps the default golden and testbench
byte/cycle-identical. verif/m5_metrics.py enforces a two-part band:
(1) CPI below the 1.50 ceiling of the serialized default (overlap present)
and (2) absolute cycles tracking the fpovl=1 golden within tolerance —
the trailing FP producer still paying the unit-busy wait is a property of
the kernel and its golden (the manifest band description), pinned only
indirectly through the absolute-cycle match. No FPGA config or default
verify build sets it. It must never be
combined with the default golden — and note the inversion of fidelity: this
flag is the more faithful P5 model; the default build is the one matching
the unmodified oracle, not real silicon.
VEN_FXCH_FREE¶
Gap 2: the real P5 FXCH is a free stack rename; by default every FXCH costs
occ = 1 (its own commit clock), so a fadd``+``fxch pair is 2 clocks —
matching the default oracle. With the flag, decode marks D9 C8+i with
is_fxch_free = 1 and fp_occ = 0 (rtl/core/decode.sv; the field
exists unconditionally in rtl/core/ventium_decode_pkg.sv). In the cycle
fast path, a free fxch %st(i) (i ≠ 0) that directly follows a push
(FK_FLDC constant load or FK_FLDSTI) folds into that push’s commit
clock for zero added cycles: the architectural effect is computed as one
combined write — we_push places the old st(i-1) into the new ST0 slot
and we_sti(i-1) places the pushed constant/copy into st(i), using
pre-edge ftop addressing — while the spine retires the FXCH as the V-pair
member (retire2_*) and advances eip past both instructions
(rtl/core/core_fastpath.svh). A lone or post-arithmetic FXCH falls through
to its own occ = 1 commit; only a push absorbs it, because under
VEN_FP_PIPE the arithmetic result is deferred and not yet in fpr at
the would-be fold clock.
Architectural cycle-trace change with its own gate: verif/run-m5.sh builds
a separate testbench into obj_dir_fxch and grades only the manifest
"fxchfree": true kernels (mb_fxch) against the fxchfree=1 oracle.
No FPGA config or default build sets it. Gotchas: fxch %st(0) is
deliberately not folded (the fp_sti != 0 guard), and the combined write
relies on pre-edge ftop addressing (we_push targets fpr[ftop-1],
we_sti(i-1) targets fpr[ftop+i-1]) — easy to mis-derive when reasoning
about the swap.
VEN_SRT_ITER¶
Routes normal-operand FDIV/FDIVR and positive-normal FSQRT through
multi-cycle iterative engines — the D8b/P0-1 rework that made the core fit the
KV260 at all (commit 3791e46: LUT 518 % → 111 %, DSP 320 → 95). Without
it, FDIV/FSQRT execute combinationally in one S_FEXEC clock via
fx_srt_div (36 radix-4 steps fully unrolled — roughly 126 256-bit adders
in synthesis) and fx_sqrt (a 128-step restoring isqrt): fine for
Verilator, hopeless for FPGA timing and area.
The flag adds the FSM state S_FP_BUSY and instantiates
rtl/fpu/fpu_srt_div.sv (one radix-4 SRT step per clock, NSTEP = 36,
the per-step body lifted verbatim from fx_srt_div so the committed
floatx80 is bit-identical) and fpu_sqrt_iter. Eligibility is computed
in the fp_we_* driver: finite-nonzero a and b for divide (all six
FX_AR_* forms with aluop 6/7, excluding runtime-errata divides),
finite-nonzero non-negative ST0 for sqrt. Eligible ops start the engine from
S_FEXEC, the FSM busy-waits, and result plus fstat commit through the
same write ports on the engine-done clock, retiring that clock. Because every
operand that would reach the deep loops is engine-routed, both combinational
loops are stubbed out of synthesis under the flag (fx_srt_div returns
a zero-significand placeholder for finite-nonzero operands; fx_sqrt’s
finite-nonzero arm likewise stubs out the isqrt loop, returning the operand
unchanged) — the zero/Inf/NaN guards remain live. rtl/core/decode.sv
additionally drops fdiv/fdivr from the cycle fast-path FP
classification so they always take the slow FSM.
Functionally bit-exact: make m3 with the flag passes 74/74 including
tx_sqrt, and the standalone verify-srt-iter gates assert bit-exactness
against the golden for both PLAs. Execution latency becomes the engine step
count — an implementation artifact, not the P5 occ-39; the FP scoreboard
deliberately keeps the fixed oracle latencies, and the functional harness
needs a larger quiesce window (verif/run-m3.sh’s QUIESCE env knob
defaults to 64 — pass QUIESCE=512 for the +VEN_SRT_ITER build, the
in-script comment’s recommended value; an FSQRT idles ~66 clocks).
Set by every FPGA define list and synthesis probe, by the A/B gate’s base
defines, and documented as the canonical VL_EXTRA_DEFINES example in
verif/tb/Makefile. It is the by-convention prerequisite for
VEN_FP_PIPE’s split-eval optimisation and for VEN_FP_PIPE2; the
engine’s buggy-PLA input mirrors VEN_SRT_FDIV_BUG (ENG_DIV_BUGGY), so
the FDIV flaw survives the iterative rewrite.
Two gotchas. First, bit-exact is not timing-neutral: the quiesce window must
be raised and cycle-mode fdiv bands do not apply. Second, the errata
interplay: a runtime ERR_FDIV-enabled divide is excluded from the engine
route and falls back to fx_div_errata → fx_div → fx_srt_div — whose
finite-nonzero path is stubbed under this flag — so the M6 Erratum-23
runtime mode appears broken under +VEN_SRT_ITER (even the canonical
published vector takes sign/exponent from the stubbed “clean” value). The M6
errata gates run on the default build; that combination is unverified
territory.
VEN_SRT_DIV¶
Legacy, vestigial — defining it today changes nothing; there is no live
ifdef VEN_SRT_DIV anywhere in the RTL. At M8.5 (commit 9e6e329) the
fx_div dispatcher read ifdef VEN_SRT_DIV → fx_srt_div (with nested
VEN_SRT_FDIV_BUG selecting the buggy PLA), else → fx_div_exact.
Commit 3791e46 inverted the polarity: the genuine SRT datapath with the
correct PLA became the unconditional hardware default (“bit-exact vs
fx_div_exact/QEMU”), and the opt-out became VEN_DIV_EXACT.
The name now survives only in comments and documentation — the feature header
in rtl/fpu/fpu_x87_pkg.sv, a Makefile note, and stale docs that
still describe the old default: verif/srt/README.md (“with no defines the
divider is fx_div_exact”) and docs/sphinx/microarch/srt-divider.rst
both contradict the current code, while fpga/TARGETS.md has the current
truth. Nothing in the build system sets it; both PLAs are graded by
verify-srt against tools/srt/srt_model.py with the testbench driving
the buggy input directly, no define needed. Anyone auditing “which divider
is default” from the README will get the wrong answer.
VEN_SRT_FDIV_BUG¶
Selects the buggy quotient-selection PLA — the authentic Pentium FDIV
flaw, reproduced from first principles with no operand special-cased.
fx_srt_pla in rtl/fpu/fpu_x87_pkg.sv models the five PLA cells that
should hold +2 but were never programmed (Edelman: 8*P_Bad in
{23, 27, 31, 35, 39} for divisor columns d4 in {1, 4, 7, 10, 13}); with
buggy = 1 those cells return digit 0. The define flips two sites: the
combinational dispatcher fx_div passes buggy = 1'b1, and — under
VEN_SRT_ITER — the iterative engine’s ENG_DIV_BUGGY = 1'b1
(rtl/core/core.sv), so the flaw survives the multi-cycle rewrite.
Validated outcomes: 4195835/3145727 flaws to
0x3FFF_AAB7F6392A768638 (the documented double 0x3FF556FEC7254ED1,
wrong at the 13th significant bit, the bug hit at iteration 8);
5505001/294911 also flaws; the negative controls 7654321/3145727 and
4195835/3.0 stay clean. This is an architectural change — deliberately
not bit-exact vs QEMU — so it can never join the default verify track. Its
oracle is the single-source Python golden tools/srt/srt_model.py; the
verify-srt/verify-srt-iter gates grade both PLAs by driving the
buggy input directly, so they pass without the define being set. It is
distinct from the M6 runtime Erratum-23 model (fx_div_errata,
errata_en bit 0), which only reproduces the one published vector and stays
in the default build.
No committed config sets it — purely a user opt-in. Gotchas: QEMU computes
the correct quotient, so any differential gate against QEMU will (correctly)
fail under this flag; VEN_DIV_EXACT is checked first in fx_div, so
combining the two silently yields the exact divider with no flaw; and a
triggering divisor alone is not sufficient to flaw (7654321/3145727 is
clean), so spot-checking with one operand pair gives false confidence either
way.
VEN_DIV_EXACT¶
The opt-out escape hatch: routes fx_div to fx_div_exact, a plain
behavioural wide-integer divide (“kept as a fast-sim escape hatch; NOT the
hardware path”). Because both fx_div_exact and the correct-PLA
fx_srt_div are correctly-rounded floatx80 division — validated equal
over an 8 000-divide random corpus — the flag is architecturally neutral;
results stay bit-exact vs QEMU. It exists purely to make pure-Verilator
simulation of the combinational divider cheaper (the unrolled SRT function is
a 36-step loop per call in sim).
It is checked first in the fx_div dispatcher, so it takes silent
precedence over VEN_SRT_FDIV_BUG. Under VEN_SRT_ITER it is largely
moot for committed results — normal divides are engine-routed before
fx_div is consulted — but the non-engine fallback paths (specials, the
errata path) still call fx_div, which makes VEN_DIV_EXACT the actual
workaround if runtime ERR_FDIV semantics were ever needed under
VEN_SRT_ITER. Nothing in committed configs sets it (documented in
fpga/TARGETS.md). Despite the name it is not “more exact” than the
default — both are correctly rounded — and the stale verif/srt/README.md
still calls fx_div_exact the default; the code comment in
rtl/fpu/fpu_x87_pkg.sv is authoritative.
VEN_BCD_ITER¶
Replaces both rare-but-deep packed-BCD conversion cones with multi-cycle
iterative engines. Without it, FBSTP converts FP → packed BCD via the
combinational fx_fx_to_bcd (18 chained divide-by-10 stages, an
~182-deep CARRY8 cone — historically the whole core’s worst timing path)
inside the fstore_val mux, and FBLD pushes fx_bcd_to_fx (18 chained
multiply-by-10, ~189 levels — the worst pure-logic path once FP arithmetic was
pipelined) directly in S_FEXEC.
The flag adds two FSM states. FBSTP: S_FEXEC starts
rtl/fpu/ven_bcd.sv, which performs the FP → int64 conversion once in its
idle state (including the 10^18/int64 overflow check producing the packed-BCD
indefinite and IE), then extracts the 18 BCD digits at two divide-by-10 steps
per clock (~9 clocks); the FSM waits in S_BCD_BUSY, latches
{ie, pe, bcd} on done, feeds it to fstore_val, and the IE/PE fstat
sticky is deferred to the engine-done clock, because the flags are not known
at dispatch. FBLD: S_FEXEC defers the push and starts
rtl/fpu/ven_bcd_to_fp.sv, which accumulates digits MSD-first at two
multiply-by-10 steps per clock and converts int64 → floatx80 once at the
end; the push of the engine result is driven on the done clock, the same
clock S_FBLD_BUSY retires, so per-retire architectural state is exact.
Both engines are bit-exact to their combinational originals (make
verify-bcd / make verify-fbld; make m3 with the flag is 74/74
including tx_bcd_st/tx_bcd_ld). A pure implementation/Fmax-area
option — no architectural change, no dedicated differential oracle needed
beyond the standalone bit-exact gates plus the m3 run. Set by every KV260
config, the half-cache and probe scripts, and the A/B gate base defines. It is
the parent of VEN_BCD_DIV100, and the FBLD engine was motivated by
VEN_FP_PIPE (which made fx_bcd_to_fx the worst logic path). Gotcha:
FBSTP’s IE/PE flags must be deferred — the generic store sticky-flag arm is
explicitly skipped for FX_FBSTP; forgetting that pairing is the classic
way to double-set or zero the flags.
VEN_BCD_DIV100¶
A micro-optimisation inside ven_bcd’s per-clock digit-extract step, so
it is meaningless without VEN_BCD_ITER. By default the step extracts two
digits with two chained divide-by-10 operations (~54 levels / ~41 CARRY8 in
series). With the flag it computes a single divide-by-100 (the only wide
reciprocal-multiply on the q feedback) and derives both low digits from
the 7-bit remainder, halving the per-clock carry cone. It is bit-exact and
cycle-neutral — the two-digits-per-clock cadence, state machine and outputs
are unchanged; make verify-bcd (40 k vectors) passes either way and the
default make verify stayed 77/77 (commit a26045d).
A pure Fmax option with a narrow applicability window: it only helps when
u_bcd is the worst path, which is true in the OOC
half-cache + VEN_FP_PIPE2 configuration (synthesis 78.4 → 80.6 MHz,
routed 63.0 → 65.3 MHz on the XCK26) — in the full SoC the eip/TLB front-end
cluster binds long before u_bcd, so the committed configurations leave it
off there. No committed
script or config sets it; it was applied manually (an extra define on the
apr_hc_pnr.tcl define list) during the half-cache experiments. The
default is deliberate: even though strictly bit-exact and
cycle-neutral, it only pays when u_bcd is the binding path — which the
full SoC is not.
VEN_TRANSCENDENTAL¶
The architectural opt-in for the eight x87 transcendental instructions
(project issue #11). Without it, the D9 F0/F1/F2/F3/F9/FB/FE/FF opcodes
(F2XM1, FYL2X, FPTAN, FPATAN, FYL2XP1, FSINCOS, FSIN, FCOS) stay
d_unknown in decode and the core HALTs on them, keeping make
verify 77/77 byte-identical.
With the flag, decode arms appear, S_FEXEC routes the eight fxops to a new
busy-wait state S_TRSC_BUSY, and four engine instances are created:
rtl/fpu/fpu_f2xm1.sv (a verbatim transcription of qemu-8.2.2’s
helper_f2xm1 and its softfloat routines, with a 65-entry ROM),
fpu_fpatan, fpu_fyl2x (a mode input serves both FYL2X and FYL2XP1)
and fpu_fsincos (one engine computes sin and cos, serving
FSIN/FCOS/FSINCOS/FPTAN; octant reduction via a 3-part Cody-Waite π/2 plus
Taylor). Commit happens on the engine-done clock, which is the retire clock:
F2XM1 overwrites ST0 in place; FPATAN/FYL2X/FYL2XP1 write ST1 then pop;
FSINCOS does we_top(sin) plus we_push(cos); FPTAN commits
tan = fx_div(sin, cos) and pushes +1.0; out-of-range trig (|x| ≥ 2^63)
sets C2 and leaves ST0 unchanged.
The oracle is split (verif/trsc/README.md). Group B
(F2XM1/FPATAN/FYL2X/FYL2XP1) is bit-exact vs qemu-i386, whose softfloat is
deterministic there; the executable reference is tools/p5xtrans/qref.c,
itself validated bit-for-bit against the pinned QEMU. Group A
(FSIN/FCOS/FSINCOS/FPTAN) deliberately diverges from QEMU (which calls the
host glibc at double precision) and is instead bit-exact vs the
silicon-accuracy shared-polynomial model (~1.8 ulp vs __float128; the
core-vs-qemu spread of ~1000 ulp is reported, not gated). The flag therefore
has its own gate suite: standalone engine gates against qref in all four
rounding modes, plus in-core gates that build the cosim testbench with the
define into obj_dir_trsc and run the tx_* differential programs. FP
scoreboard timing is unchanged (engine done-latency is an implementation
artifact per docs/m11-transcendental-spec.md). It is not in any FPGA
config.
Interactions: the engines reuse fx_mul/fx_add from fpu_x87_pkg,
and the FPTAN commit calls fx_div — so the division flags
(VEN_DIV_EXACT/VEN_SRT_FDIV_BUG) influence FPTAN’s tangent. Under
VEN_SRT_ITER that fx_div call sits outside the engine-routed path,
i.e. on the combinational fx_div whose finite-nonzero SRT body is stubbed
— verify before combining the two flags for FPTAN; the committed trsc gates
run without VEN_SRT_ITER. Gotchas: a naive QEMU diff on FSIN/FCOS fails
by design (the checker compares against the model instead), and the
default-build behaviour for these opcodes is HALT, not a #UD trap — real-world
code using transcendentals requires the flag.
VEN_TRSC_SILICON¶
Reserved and currently unimplemented: there is no
ifdef VEN_TRSC_SILICON anywhere in the RTL or build system — the name
exists only in comments (rtl/fpu/fpu_f2xm1.sv, tools/p5xtrans/qref.c);
the M11 spec documents the SILICON parameter seam and the design
conclusion without ever naming the macro. The seam it would drive does exist: every transcendental
engine declares parameter bit SILICON = 1'b0 (fpu_f2xm1,
fpu_fpatan, fpu_fyl2x, fpu_fsincos), and the instances in
rtl/core/core.sv never override it.
The inertness is a design conclusion, not an omission.
docs/m11-transcendental-spec.md records that (a) bit-exact-to-silicon is
unachievable from public data (undumped transcendental microcode, unpublished
Remez coefficients), and (b) for Group B, QEMU’s softfloat algorithm is
itself the accuracy-faithful silicon one (80-bit table + Horner +
reconstruct, ~0.5 ulp) — a single datapath serves both the bit-exact-vs-qemu
gate and silicon fidelity, so the parameter is reserved/inert. For Group A the
shipped shared-polynomial model already is the silicon-accuracy target.
tools/p5xtrans/p5xtrans.c remains the independent silicon-accuracy oracle
(agrees with qref to <1 ulp). If the flag were ever implemented it would
be an architectural-mode change requiring its own silicon-model oracle, since
the QEMU differential gates for Group B would no longer apply. Do not assume
it does anything because comments reference it.
Front-end and L1 caches¶
This family covers the fetch/decode front end — the BTB, the icache, the
predecoded micro-op cache, the page-keyed fetch micro-TLB — and the parametric
L1 geometry knobs. Per project convention every flag is opt-in: the default
build is the byte/cycle-identical golden-matched reference, and each flag is
one of (a) a bit-exact Fmax/area implementation option whose cycle behaviour
is identical or stays inside the loose M4/M5 bands (VEN_BTB_PIPE,
VEN_IC_NARROWB, VEN_IC_BRAM); (b) an architecturally bit-exact but
explicitly non-cycle-faithful demonstrator (VEN_FE_PIPE,
VEN_CACHE_HALF) plus the common-case-correct, not-yet-verify-hardened
VEN_UOPCACHE; or
(c) a numeric geometry parameter for cache sweeps (VEN_IC_SETS,
VEN_IC_WAYS, VEN_DC_SETS, VEN_DC_WAYS, plus the umbrella
VEN_L1_SETS/VEN_L1_WAYS pair described with the SoC platform family).
The deployable KV260 SoC bitstream sets VEN_BTB_PIPE + VEN_IC_BRAM +
VEN_UOPCACHE + VEN_CACHE_HALF + VEN_FE_PIPE (plus the FP and
iterative-engine flags), via fpga/scripts/impl_kv260_soc.tcl; simulation
sweeps set the geometry knobs through verif/tb/Makefile’s
VL_EXTRA_DEFINES passthrough into separate object directories. The
narrative arc documented in fpga/TIMING_PROBLEMS.md and
docs/fpga-synthesis.md: the single-cycle x86 byte-window decode’s MUXF
density is the congestion wall, the micro-op cache is the only lever that ever
reduced it (first OOC route, 51.7 MHz), half-cache relieved the tail, and
VEN_FE_PIPE made the full SoC route at all.
VEN_BTB_PIPE¶
Registers the BTB resolve inputs (resolve_valid/resolve_pc/
resolve_taken) one clock, so the 2-bit-counter update’s clock enable comes
from a flop instead of the combinational issue_arm net
(rtl/core/bpred_btb.sv). This lifts the ~13-level BTB tail off the
~63-level eip → icache → decode → issue_arm → btb_ctr critical path. The
predict ports always read pre-update state (read-before-write off the
registered arrays), so applying the update one clock later only shifts when
the predictor warms — it never changes which instructions retire, only (in
principle) prediction timing.
A three-agent investigation established that BRAM was the wrong fix for this
path (the BTB is 64 deep, won’t infer BRAM, and is only 21 % of the path); the
registered update is the cycle-safe lever, and it measured cycle-neutral —
mb_brloop/mb_brrandom absolute cycle counts identical to baseline —
while taking OOC synthesis from 58 to 59.5 MHz (P0-6, commit ab3001e).
Purely an implementation/Fmax option, included in essentially every FPGA
config (the deployable SoC, both apr_run.tcl configurations,
apr_hc_pnr.tcl, bd_kv260_soc.tcl, the path/probe scripts — which mark
it as a removable define) and in the A/B sim builds of the FP_PIPE2 gate, so
the cycle-equivalence gate implicitly exercises it in simulation.
One nuance: the in-module comment claims the delayed update is absorbed by the loose cycle bands, but P0-6 measured it stronger — absolute cycles identical on the branch kernels. Strictly, it can change a prediction by one resolve, so treat it as bit-exact-architectural / band-validated-cycle rather than formally cycle-identical.
VEN_UOPCACHE¶
The P0-11 predecode-on-fill micro-op cache — the textbook P6/Sandy-Bridge fix
for the x86 byte-window MUXF congestion wall. Without it, the fast path
gathers twelve bytes via the ub[]/vb[] 32:1 byte selects out of the
two icache line buffers and decodes U and the V candidate live each clock with
two decode instances.
With the flag, rtl/mem/uopcache.sv is enabled: a 3-state walker FSM
re-runs the same pure decode leaf over a freshly filled 32-byte line
after the multi-cycle S_PF fill completes (the spine never stalls on the
walker — its pd_busy output is unconnected) — decoding the fixed bottom
6 bytes of a
residual register and barrel-shifting right by the decoded length, so no
flin-indexed wide byte mux ever exists — and stores NSLOT = 8
fixed-width fpd_t slots plus a 32-entry byte→slot boundary map per
(set, way) in registered-read (ram_style = "block") BRAM stores replicated
A/B. In rtl/core/core.sv the spine accumulates the fill burst into
pf_line_buf and pulses pd_start the clock after fill_done (also
invalidating that set/way’s predecode-valid); the fast path then deletes the
twelve 32:1 byte gathers and the live decoders, reading
u_d = slotsA[bslot(flin[4:0])] and v_d as literally the next slot
(killing the flin + lenU V-base serialization too), with br_taken —
the only flag-dependent fpd_t field — re-evaluated against live
eflags.
Results: the first-ever MUXF reduction (MUXF7 −22 %, MUXF8 −33 %) and the
first configuration to route OOC at 15 ns (51.7 MHz routed, where the
narrowb config dies with 42 392 overlaps). Its registered slot-read timing
mirrors VEN_IC_BRAM, and every setter defines the two together: the
deployable SoC, apr_run.tcl (CONFIG=uopcache), apr_hc_pnr.tcl, and the
uopcache path/probe scripts.
The major caveat: not verify-hardened. The architectural-correctness
backstops are unbuilt — uop_hit and uc_v_avail are computed but
consumed nowhere, and the planned S_PIPE !uop_ready stall arm and
branch-into-the-middle re-predecode are explicitly still unbuilt
(fpga/TIMING_PROBLEMS.md, docs/fpga-synthesis.md). A fetch from a
resident-but-not-yet-predecoded line, or a branch into the middle of a
predecoded boundary, reads stale or wrong slots with no stall. It is a
synthesis/APR Fmax demonstrator (common-case correct), not a
differential-gate-verified configuration — yet it is in the deployable
KV260 bitstream. The default/narrowb build is the verified production
reference. It also conflicts with any I-cache wider than 2 ways: the uopcache way
ports are 1 bit and the store is sized IC_SETS*2, so
VEN_IC_WAYS/VEN_L1_WAYS greater than 2 with VEN_UOPCACHE would
silently alias ways (1-way is safe but wastes half the slot store).
VEN_UOPCACHE_CHECK¶
A sim-only structural-equivalence scaffold layered inside the
VEN_UOPCACHE branch: it re-instantiates the live ub[]/vb[]
byte-window gather plus two reference decoders (producing
u_d_ref/v_d_ref from the actual icache bytes) alongside the slot-read
u_d/v_d, with the stated intent of asserting that the slot read equals
the live gather decode for every issued flin. The ifdef block lives
solely in rtl/core/core.sv; rtl/mem/uopcache.sv carries only a header
comment describing the flag, whose “asserts slot read == live decode” wording
is aspirational. Because it resurrects the 12 × 32:1 gather it must
never appear in a synthesis build.
As of the current tree the define only builds the parallel reference path:
no assert or comparison statement consumes u_d_ref/v_d_ref
anywhere (true since the introducing commit 4cdf7eb). The “equivalence
gate” is scaffolding for manual waveform comparison or future hardening, not
an automated checker; any claim that the uopcache is machine-checked
equivalent via this flag is unsupported. No build script sets it — manual
VL_EXTRA_DEFINES use only, and it is a no-op without VEN_UOPCACHE
(the block is nested). Closing the gap — a real per-issue assert plus the
uop_ready stall — is the documented verify-hardening work.
VEN_FE_PIPE¶
A page-keyed micro-TLB register that breaks the architectural eip
self-loop (eip → I-TLB hit carry chain → ic_present → uop slot read →
next eip) — the cone that made the full-SoC KV260 build unroutable at
~41 MHz. Without the flag, xlate_miss and mem_xlate read the live
combinational I-TLB/D-TLB lookups (the 16-entry vpn-compare carry chain) on
every access; fe_xlate_pend is tied 0 so every added guard constant-folds
away and the default build is byte/cycle-identical.
The flag adds one registered translation per side —
fe_itlb_{page,phys,hit,v} and fe_dtlb_{page,phys,hit,dirty,v}, keyed
on cur_lin[31:12] — sampled by a dedicated always_ff that also
invalidates on tlb_flush (CR3 write) and, per side, on any page-walk fill
so a fresh walk is always re-registered. When the current access’s page is not
the registered page, fe_xlate_pend asserts for exactly one clock: the FSM
holds in a dedicated stall arm (no state/eip/issue/bus change that edge) and
the bus driver squashes the request so no stale physical address ever leaves
(rtl/core/core_bus_driver.svh); the next clock, xlate_miss and
mem_xlate read the registered hit/frame instead of the live carry chain.
Architecturally bit-exact (make verify GREEN with it on); cycle-wise it
adds one clock per 4 KiB page crossing (under ~1 % perturbation, zero in real
mode or within a page) — an Fmax-only demonstrator, cycle bands not claimed
(option B of docs/fe-pipeline-spec.md). Payoff: the full SoC goes from
“not legally routed, 8 766 node overlaps” at 60 MHz to a clean route
(~50.4 MHz Fmax); the deployed 40 MHz Linux image includes it. The only
setter is fpga/scripts/impl_kv260_soc.tcl via the FE_PIPE=1
environment knob.
It only matters when paging is on, and coherence couples it to tlb_flush
and the page-walk fill ports — the invalidate-wins-over-register ordering at
one edge is what keeps a freshly walked or flushed page from being cached
stale. The risk profile is the gotcha: a bug here is silent (wrong physical
page = data corruption, not a crash), so the spec requires
architectural-bit-exact verification on the paging-heavy gates before trusting
changes to it. Small mispredict/fill cycle shifts on an FE_PIPE build are
expected, not regressions.
VEN_IC_BRAM¶
Forces the L1 I-cache line store into Block RAM (RAMB36), dissolving the
distributed-RAM read mux. The store becomes two replicated
ram_style = "block" arrays (one per read port, each a single-read SDP
BRAM), reads become synchronous (data valid the clock after the address),
and the per-word fill becomes the canonical UltraScale byte-write-enable
template writing both copies (rtl/mem/icache.sv). Without it, the line
store is a single distributed-RAM array with asynchronous reads.
Because data now arrives one clock late, the spine grows a content-addressed
front end (rtl/core/core.sv): the read addresses are registered as tags in
lock-step with the data; ic_byte content-addresses whichever buffer holds
the needed set; ic_fetch_ready reports whether the decode window’s line(s)
are currently buffered (sequential fetch is bubble-free because port B always
prefetches the next line, so a sequential line crossing finds the new line
already buffered); and a new S_PIPE bubble arm burns one no-issue clock
only on a redirect to an un-buffered line (rtl/core/core_fastpath.svh).
A sub-step removes even that: when a predicted-taken branch sits in a
non-straddling window, port B is repurposed to prefetch the predicted target
line (pf_redir/pf_redir_tgt) — literally the real P5’s BTB-driven
prefetch.
Fully verified: functional 75/75 bit-exact, all 20 cycle bands pass
(mb_accimm +20 % before the target prefetch, +0.39 % after; mb_imiss
+3.97 %, inside the 10 % band), and a Quake 300 k-instruction lockstep run is
EQUIVALENT (verif/m7/quake_icbram.log). Area trade: 3 072 LUT-as-memory →
0, +5 RAMB36 in the full core. It did not break the OOC congestion wall by
itself — the byte-window MUXF mass lives in the spine, not the cache (the 99 %
u_icache congestion attribution was a flatten-hierarchy artifact) — but it
became the required base for VEN_UOPCACHE and is in the deployable SoC
config. Set by the deployable SoC, apr_run.tcl (uopcache config),
apr_hc_pnr.tcl, the icbram/uopcache probes, a standalone OOC icache probe,
and ad-hoc sim builds (no checked-in verif script pins it).
Gotchas: it is band-validated, not cycle-identical — redirects to
un-buffered lines and the +1 buffer-fill clock after S_PF shift cycle
counts, so a strict cycle diff against the default build will differ even
though all bands pass. It supersedes VEN_IC_NARROWB (which lives in its
else branch). And the replicated A/B stores mean every fill writes both
copies — forgetting that when reasoning about SMC/fill coherence is a trap.
VEN_IC_NARROWB¶
Static width-pruning of the icache’s second (straddle) read port in the
distributed-RAM (non-BRAM) front end. rd_lineB is only ever byte-sliced
at low positions — the worst case is flin[4:0] = 31 plus a 6-byte U length
plus index 5, minus 32, i.e. byte 10 — so the flag drives only the low
128 bits and ties the high 128 to zero, letting Vivado prune half of the
256-wide distributed-RAM read port (rtl/mem/icache.sv). Since the pruned
bytes are never read, the fetched bytes are bit-identical: verified make
verify plus make m3 cycle-identical, mb_imiss +0.03 % noise. It is
the only flag in this family (besides the pure parameters at default values)
that is both bit-exact and cycle-identical, so it needs no gate of its own.
Measured: LUT-as-memory 4 096 → 3 072, MUXF7 16 235 → 14 457 / MUXF8
7 147 → 6 202 (−12 %), OOC synthesis WNS −1.808 → −0.587 ns (59.5 → 64.1 MHz)
— the icache left the synthesis critical path — but placed Fmax was
unchanged (47.6 MHz) because the irreducible rd_lineA 256:1 read and the
spine byte-window MUXF remain. It was the “narrowb production” OOC config
until P0-14 showed it does not actually route at 15 ns (42 392 overlaps),
after which the uopcache config took over as the routable base —
fpga/scripts/impl_kv260_soc.tcl documents that narrowb “hits the
single-cycle byte-window MUXF congestion wall and does NOT route on the small
xck26”.
Set by synth_paths_narrowb.tcl, apr_run.tcl (CONFIG=narrowb), several
floorplan/strategy scripts, and the older bd_kv260_soc.tcl validate flow
(the deployable impl script switched to the uopcache config). It is mutually
dead with VEN_IC_BRAM — the narrowb code sits in the BRAM ifdef’s
else branch, so defining both silently behaves as BRAM-only (several
scripts list it harmlessly). The safety argument depends on the 6-byte U plus
6-byte V window geometry; if the decode window ever widens, the static bound
must be re-derived.
VEN_CACHE_HALF¶
An FPGA area/congestion experiment that halves both L1s to 64 sets (4 KB
I-cache + 4 KB D-cache), changing the index/tag geometry from idx 7/tag 20 to
idx 6/tag 21. It is the third-priority arm of the IC_SETS and DC_SETS
precedence chains in rtl/core/core.sv. Because the index/tag split
changes, the hit/miss sequence differs from the fixed-128-set p5trace
cycle oracle, so it is explicitly not a verification or cycle-fidelity
configuration — the M4/M5 bands are only claimed for the default. It stays
functionally bit-exact (a cache is a hit/miss plus a data store; smaller means
more fills, same bytes): both the default and +VEN_CACHE_HALF builds pass
make verify 77/77 vs QEMU, run manually via
VL_EXTRA_DEFINES=+define+VEN_CACHE_HALF make verify.
FPGA payoff (on the uopcache base, K26, 15 ns): u_icache placed cells
−45 %, MUXF7/F8 −29 %/−24 %, placer congestion level 6 → 5, failing endpoints
−23 %, TNS −38 % — but routed Fmax roughly flat (51.7 → 50.1 MHz), because the
worst path was the FP deferred-commit cone, not the cache. It was kept as a
free area/tail win and, combined with VEN_FP_PIPE2, became the
63–65.3 MHz OOC configuration and part of the deployable SoC image. Set by
impl_kv260_soc.tcl, apr_run.tcl (HALF=1 env), and
apr_hc_pnr.tcl; in simulation only by manual VL_EXTRA_DEFINES.
It is overridden by VEN_IC_SETS/VEN_DC_SETS and by VEN_L1_SETS
(the elsif chain ranks them above it); the geometry flows into the icache,
uopcache and dcache_timing parameters. Under VEN_L1_AXI the D-cache
half has no effect at all (dcache_timing is compiled out). Never run the
cycle bands on a half-cache build and call a mismatch a regression — the
in-source comment brands it “NOT a verification config”.
VEN_IC_SETS¶
Per-I-cache set-count override (an integer-valued define), the
highest-priority arm of the IC_SETS chain
(VEN_IC_SETS > VEN_L1_SETS > VEN_CACHE_HALF (64) > 128). The value
must be a power of two, since IC_IDXW = $clog2(IC_SETS) and
IC_TAGW = 32-5-IC_IDXW derive directly from it; it resizes the parametric
rtl/mem/icache.sv arrays and flows into the uopcache parameter as well.
Non-default values stay functionally correct (the cache is
architecture-transparent) but are not matched by the fixed-128-set/2-way
p5trace cycle oracle — they are area/perf experiments.
Its real consumer is the L1 cache-size sweep: build/rerun/cache_sweep.sh
(dhrystone, lockstep replay) and build/rerun/dmiss_sweep.sh (mb_dmiss,
pure --cycle) — local working artifacts in the gitignored build/
tree, not committed — build one testbench per geometry via
VL_EXTRA_DEFINES="+define+VEN_IC_SETS=$SETS +define+VEN_DC_SETS=$SETS"
into separate obj_dir_sweep_* directories and measure CPI vs size — the
sweep that produced the 32 KB-knee result (mb_dmiss CPI 2.50 → 0.53).
Cache timing is cycle-mode-only and cycle mode diverges on syscall/TLS
programs, so sweeps either use flat syscall-free kernels or the PC-keyed
lockstep replay path — that constraint is baked into the scripts. Each
geometry must build into its own object directory or it clobbers the canonical
obj_dir/tb_ventium used by make verify.
VEN_IC_WAYS¶
Per-I-cache associativity override (chain VEN_IC_WAYS > VEN_L1_WAYS
> 2; powers of two, with a guard so even 1-way elaborates). The icache was
generalized from the original 1-bit MRU 2-way scheme to a per-way age-counter
true LRU (rank 0 = MRU … WAYS−1 = LRU) that provably reduces to the
original behaviour at WAYS == 2 — same hit test, same victim sequence,
same touch/fill recency; the reset encoding (age = way index, reproducing the
old victim = way 1) was deliberately matched, and any “cleanup” there breaks
golden identity. The replacement victim is exported per set as ic_victim_o
so the spine’s fill-way selection stays uniform.
No script — committed or local — sets it directly; the associativity
demonstration used the umbrella VEN_L1_WAYS=4 via the local
build/rerun/nway_4way.sh (gitignored), which proved a
12 KB conflict working set thrashes 2-way but fits 4-way (CPI drop) and that
4-way free-runs CoreMark with CRC output identical to qemu-native. It
conflicts with VEN_UOPCACHE at any value greater than 2 (1-bit way ports,
stores sized IC_SETS*2; 1-way is safe but wastes half the slot store). The fixed-2-way oracle means there is no cycle
gate for other way counts — only functional/CRC validation.
VEN_DC_SETS¶
Set-count override for the L1 D-cache timing model
(rtl/mem/dcache_timing.sv) — crucially a timing-only structure: there is
no data array; load data always comes from the BFM/mem_rdata, and
dcache_timing just tracks tag/valid/LRU so the spine knows when a load
completes (a read miss adds the dmiss = 8 penalty, misalignment +3,
mirroring p5_mem()/l1_access() in the p5trace.c oracle). It can
therefore never change architectural results — only cycle counts, and only in
cycle mode. Like the I-side knob it is functionally safe but breaks the
fixed-128-set cycle-oracle match; it is used by the same two cache sweeps,
always set equal to VEN_IC_SETS.
It is entirely moot under VEN_L1_AXI: the whole dcache_timing instance
is compiled out and dc_lu_hit tied 1 (mode-2 timing is functional-only),
so SoC/board builds ignore it. The D-miss penalty is also deferred one
instruction (pending_mem_pen), so per-geometry cycle deltas land on the
following retire — relevant when eyeballing sweep traces. Because it is
timing-only, a functional-mode run returns identical data values regardless of
the setting; the sweep scripts exploit exactly this, which is what makes
fixed---max-insn CPI comparisons valid.
VEN_DC_WAYS¶
Associativity override for the D-cache timing model, the D-side twin of
VEN_IC_WAYS (chain VEN_DC_WAYS > VEN_L1_WAYS > 2).
dcache_timing was generalized to any power-of-two way count with a per-way
age-counter true LRU that reduces exactly to the original 1-bit dc_lru
behaviour at 2-way — same hit test, victim sequence and recency update, the
reset encoding reproducing the old victim = ~dc_lru. That exact-reduction
property is the load-bearing reason the default build needed no
re-verification after the parameterization commit. Since the block is
timing-only, changing ways only changes when loads complete in cycle mode,
never what they return.
No script — committed or local — sets it directly; the 4-way demonstration
used the umbrella VEN_L1_WAYS=4 via the local build/rerun/nway_4way.sh
(gitignored), which together with the I-side showed the
conflict-kernel CPI collapse at 4-way and CRC-identical CoreMark output as the
functional proof. Moot under VEN_L1_AXI; usually swept together with the
I-side so the two L1s stay symmetric, mirroring the oracle’s single L1
geometry parameterization. Its “differential gate” is a functional/CRC
equivalence run plus a CPI plausibility check — never the cycle bands.
Integer core and build plumbing¶
This family covers the integer-core implementation flag VEN_IDIV_ITER
plus the housekeeping tokens (SYNTHESIS, VTM_NO_DPI,
VTM_HAVE_DPI_HEADER, M7_PROXY_DEBUG, F2_DBG) that separate
simulation builds from FPGA synthesis builds. VEN_IDIV_ITER swaps the
default single-cycle combinational x86 DIV/IDIV — native SystemVerilog /
and % inside rtl/core/core_exec.svh’s S_EXEC arm, which
synthesized to the core’s worst path once the FPU went iterative (post-P0-1:
87.8 ns, 667 levels, 585 of them CARRY8) — for a
multi-cycle restoring-divide engine plus a new S_DIV_BUSY wait state:
architecturally bit-exact, but only cycle-band-equivalent. The non-VEN tokens
follow the convention that the default (no-flag) Verilator build is the
golden-comparable one: VTM_NO_DPI strips the DPI-C retire-observation
channel so the RTL lints and synthesizes with no C testbench, SYNTHESIS
strips sim-only $fatal guards and bound SVA, and the rest are debug or
C++-side plumbing. None of the non-VEN tokens change architectural behaviour;
VEN_IDIV_ITER has its own differential gate (make verify-idiv) plus
full-suite, cycle-band and verify-sys #DE re-verification.
VEN_IDIV_ITER¶
Without the flag, x86 DIV (q_md = 6) and IDIV (q_md = 7) execute with
native combinational / and % directly in the S_EXEC arm (64-bit
{EDX,EAX} dividend for the 32-bit form), committing GPRs in one clock and
charging the modelled P5 occupancy as a deferred pending_mem_pen penalty
burned in S_PIPE before the next issue (occ−7: DIV 10/18/34, IDIV
15/23/39 for r/m8/16/32). rtl/core/ven_idiv.sv is always in the filelist
but uninstantiated (rtl/ventium.f notes this is harmless).
With the flag, both K_MULDIV divide arms reroute: S_EXEC stops
computing anything and parks in S_DIV_BUSY; a combinational driver block
pulses eng_int_start that same clock, assembling the dividend from gpr
and the divisor from srcv (recomputed identically to the exec arm), and
instantiates ven_idiv. The engine is a 4-state FSM doing restoring
division on magnitudes at two radix-2 steps per clock, with the x86 sign
fix-up (quotient sign = dividend ^ divisor, remainder takes the dividend sign,
truncation toward zero) and the exact per-width #DE predicates from
core_exec.svh (zero divisor short-circuits; per-width quotient-overflow
checks) — bit-exact to the native path. S_DIV_BUSY busy-waits on done,
then either delivers #DE vector 0 (system-mode start_fault through the
IDT, user-mode loud-HALT — identical policy to the inline path) or commits
quotient/remainder into EAX/EDX with the same per-width merge, tops up
pending_mem_pen with a residual (IDIV +6, DIV +1, tuned against make
m5 so issue-to-next-issue equals the P5 occupancies 17/25/41 DIV,
22/30/46 IDIV), sets retire_valid and returns to S_PIPE.
This removed the post-P0-1 worst path of the whole core (87.8 ns / 667
levels). With the flag, make verify stays 74/74 green and the
verify-sys “pde” #DE program is EQUIVALENT, while the divide cycle kernels
move mb_div8 +5.45 %, div16 +0.13 %, div32 +2.36 %, idiv32
+2.12 % — all inside the 10 % band. The standalone gate is make
verify-idiv (verif/idiv/run-idiv-gate.sh, verif/idiv/tb_idiv.sv: a
golden mirroring core_exec.svh, directed edge cases plus 80 k random
vectors over all six forms). Set by every full-core FPGA configuration from P0-2 onward
(bd_kv260_soc.tcl, impl_kv260_soc.tcl, the APR scripts and the
synth_paths_*/probe_muxf_*/probe_lut_quick probes); the pre-P0-2
baseline probes synth_probe_core.tcl and synth_probe_core_iter.tcl
deliberately omit it (they are the before/after measurement configs). The
only verif builder that sets it is the FP_PIPE2 A/B base define set.
Gotchas: cycle-mode A/B diffs against a no-flag build show line deltas on
divide-heavy traces — expected, not a bug. The fixed residual works across
widths only because the engine’s run-state clock count scales exactly with the
P5 occupancy deltas; changing steps-per-clock breaks the model. The engine
driver recomputes srcv combinationally and must stay identical to the exec
arm’s expression or operands diverge silently. tb_idiv deliberately
substitutes a well-defined overflow vector for the INT_MIN/−1 corner (the
testbench’s native-/ golden would hit C++ UB in Verilator); on real x86
that corner is #DE-overflow, which the engine’s predicate covers. And the
busy-wait serializes — no instruction issues until the EDX:EAX commit,
matching the native path’s non-pipelined U-pipe hold.
SYNTHESIS¶
The standard simulation-versus-synthesis fence, used exclusively as
ifndef SYNTHESIS around code Vivado cannot or should not synthesize. Six
sites, all in the L1/AXI/SoC-bridge files and none in the core proper:
rtl/mem/ven_cdc_afifo.sv (initial $fatal guards requiring DEPTH be a
power of two and ≥ 4, plus SVA forbidding write-while-full/read-while-empty);
rtl/mem/ven_axi_master.sv ($fatal guards converting two
silent-corruption bugs into build failures — REMAP_BASE must be 32-byte
aligned for the INCR8 4 KiB rule, LINE_BEATS must be 8 to match the
ven_l1d 32 B line — plus bound AXI protocol SVA: VALID-with-stable-payload
until handshake, 32 B-aligned read-burst base, no 4 KiB crossing, no AR while
a write is outstanding); rtl/mem/ven_axi_cdc.sv (single-outstanding CDC
invariants); and rtl/soc/ven_soc_axil.sv (io_ack one-cycle-pulse and
pending-required checks, the F3 INTA seam checks, AXI-Lite BVALID/RVALID
hold). Defining it removes all of this, leaving pure synthesizable logic;
nothing architectural changes, so no gate is needed — but verification must
not set it, or the L1/AXI gates lose their checkers (those gates build with
verilator --assert).
Set by the Vivado SoC/BD flows (bd_l1axi.tcl, bd_kv260_soc.tcl,
impl_kv260_soc.tcl, synth_cdc.tcl), always paired with
VTM_NO_DPI. The guarded files are only in SoC/L1-AXI builds — none are in
the core filelist rtl/ventium.f — which is why older core-only probe
scripts set only VTM_NO_DPI and still synthesize cleanly. Vivado also
pre-defines SYNTHESIS on its own, but the project sets it explicitly
rather than relying on the tool default. In the default testbench build the
concurrent assertions are compiled but inert, because --assert is only
added by the dedicated SVA/gate targets — keeping the default build
byte-identical and unburdened by assertion machinery.
VTM_NO_DPI¶
Elides the entire DPI retire-observation contract so the RTL lints,
elaborates and synthesizes with no C testbench present. Without it, four
import "DPI-C" declarations in rtl/ventium_pkg.sv — vtm_retire
(integer architectural state), vtm_retire_x87 (80-bit st0–st7 split
lo/hi plus fctrl/fstat/ftag), vtm_retire_cycle (U/V pipe attribution and
pairing) and vtm_retire_sys (cr0/cr2/cr3/cr4) — are called once per
retired instruction from the retire points in rtl/ventium_top.sv and
rtl/soc/ventium_soc.sv (including the dual-issue retire2 second
calls); this is the observation channel every differential gate depends on
(one .vtrace record per vtm_retire call). A small stub fence also
exists in the M0 selftest top (verif/tb/selftest/ventium_top.sv).
With the define, retire_n bookkeeping still ticks but nothing observes it
— the synthesized core carries zero trace overhead. It is mandatory for
synthesis (fpga/TIMING_PROBLEMS.md P2-2, fpga/PLAN.md). Purely an
observability token: it cannot change architectural behaviour (the calls are
pure observers), but a build with it set produces no trace and therefore
cannot be differentially gated — every golden comparison requires it unset.
Set by every fpga/scripts/*.tcl synthesis/impl/probe script and by the
standalone lint flow documented in rtl/README.md; never by
verif/tb/Makefile or any gate script.
Gotchas: the import signatures are declared verbatim from
docs/rtl-interface.md §2 because the comparator parses by field name — do
not “clean up” the import inside the fence. ventium_top.sv calls
vtm_retire_x87 on every retirement, not just FP ops, because QEMU
reports live FP state on every instruction in x87-trace mode; eliding it
per-op would silently break x87 diffs. Re-enabling DPI on a previously
NO_DPI-configured build needs no other change.
VTM_HAVE_DPI_HEADER¶
The C++-side companion of VTM_NO_DPI in the same DPI contract — a C++
preprocessor macro, not a Verilog define, and never passed on any command
line. verif/tb/dpi_retire.cpp implements the bodies of the four
vtm_retire* functions; it uses __has_include to detect the
Verilator-generated Vventium_top__Dpi.h. When the header is present it is
included (so the definitions are type-checked against the generated extern-C
prototypes) and VTM_HAVE_DPI_HEADER is self-defined; when absent (a
standalone compile without -Iobj_dir), the #ifndef branch supplies
hand-written fallback extern "C" prototypes matching the documented SV→C
type mapping (longint unsigned → unsigned long long, and so on). Pure
build plumbing with zero behavioural effect; it is listed here because it is
the only other VTM_* token in the tree. If the fallback prototypes ever
drift from the generated header, builds with the header present fail
type-check — the intended safety net. Grep hits land in .cpp only.
M7_PROXY_DEBUG¶
A leave-undefined simulation debug token from the M7 Win95-cosim /
syscall-proxy bring-up. Three ifdef’d $display sites, tagged
[M7DBG]: the int-0x80 proxy fold (pc, retire count, injected EAX, gs
apply/base) in rtl/core/core_fetch_decode.svh; MOV-to-Sreg selector and
hidden-base resolution (the gs-selector-0x33 seam) in
rtl/core/core_exec.svh; and indirect-CALL target resolution (q_pc,
segment base, EA, loaded target), also in core_exec.svh. Purely
observational — no logic, no state, no architectural effect, no gate needed.
Nothing in the repository sets it (manual +define+M7_PROXY_DEBUG only),
and fpga/TIMING_PROBLEMS.md explicitly instructs leaving it undefined for
synthesis. The proxy-fold site lives inside the same decode path that
underlies VEN_PS_PROXY’s S_SYSCALL_WAIT, so it is the natural tap when
debugging PS-proxy syscall injection. Output interleaves with testbench
stdout; enabling it on long runs floods logs — intended for short directed
reproductions.
F2_DBG¶
A leave-undefined debug token for the iterative F2XM1 transcendental engine
(issue #11 bring-up). One site, in rtl/fpu/fpu_f2xm1.sv’s S_PACK
state: a $display of the polynomial-accumulator internals (iteration,
accumulator exponent/sign/significand halves and the y exponent) just
before norm_round_pack_x produces the result. Pure observation, no
behavioural or timing effect, no gate needed; nothing sets it. It is only
meaningful when the engine is actually instantiated, i.e. under
VEN_TRANSCENDENTAL.
Testbench value macros and sweep hygiene¶
For completeness, the unit-testbench value macros are not core feature
flags: BCD_N, SRT_N, SQRT_N, IDIV_N, FBLD_N, FPPIPE_N,
F2_NV/F2_RC, FA_NV/FA_RC, FY_NV/FY_RC/FY_MODE and
FS_NV/FS_OP live only in the standalone gate testbenches under
verif/bcd, verif/srt, verif/idiv, verif/fbld, verif/fppipe
and verif/trsc, and are vector-count / rounding-mode / operation-select
parameters set by their own gate scripts. A full ifdef/ifndef sweep
over rtl/ finds no other unknown tokens beyond the flags on this page, and
the remaining #ifndef tokens in verif/tb C++ are ordinary include
guards. The ad-hoc sweep object directories under verif/tb split two ways:
obj_dir_sweep_32 … 2048 and obj_dir_4way are produced by the
local, uncommitted sweep scripts in the gitignored build/rerun/ tree,
while obj_dir_fe alone is a manual VL_EXTRA_DEFINES build with no
script at all.
Memory system and SoC platform¶
This family is the build-time ladder that turns the verification core (BFM
memory, same-cycle ack, DPI retire tracing) into a real KV260 FPGA SoC, plus
the umbrella L1 geometry knobs and one physical-implementation switch. The
ladder is strictly layered: VEN_L1_AXI swaps the bus-functional memory
port for a real stalling L1 data array plus an AXI4 master to PS-DDR
(bus mode 2, runtime l1axi_en); VEN_AXI_CDC optionally splits that
subsystem into two clock domains via async-FIFO bridges; VEN_KV260_SOC
(nested inside VEN_L1_AXI’s port block, so it requires it) adds the
PS-facing seams used by the ventium_kv260_core full-SoC top; and
VEN_PS_PROXY converts the zero-latency int-0x80 lockstep proxy into a
stalling handshake for the future F4 PS syscall bridge. All of these honour
the project convention that the default build stays byte/cycle-identical,
though the protections vary: the double gate (ifdef absent by default
and a runtime enable tied 0) holds for VEN_L1_AXI
(l1axi_en/real_bus), VEN_PS_PROXY (proxy_en) and
VEN_KV260_SOC’s interrupt seam (soc_en), while VEN_AXI_CDC is a
compile-time-only path swap shielded by l1axi_en at the subsystem level;
per-flag differential gates prove architectural bit-exactness for
VEN_L1_AXI, VEN_AXI_CDC and VEN_PS_PROXY but not for
VEN_KV260_SOC (board-level certification only); only cycle timing is
allowed to change (mode-2 timing is explicitly functional-only). VEN_L1_SETS/VEN_L1_WAYS are
valued geometry knobs for the in-core caches, and VEN_PBLOCK is not a
Verilog define at all but an environment variable read by two Vivado Tcl
scripts.
VEN_L1_AXI¶
The P1-1 “bus mode 2” build: routes the core’s same-cycle-ack 32-bit memory
port through rtl/mem/ventium_l1_axi.sv — ven_l1d (an 8 KB / 2-way /
32 B-line write-through L1 data array) plus ven_axi_master (a burst
engine) — out to a full AXI4 master bundle that either the testbench’s
behavioural DDR slave or the PS S_AXI_HPC0_FPD services. Without the flag
there are no AXI ports, no ventium_l1_axi instance, no
real_bus/bus_err core ports, and the dcache_timing instance is
kept — the core’s memory port wires directly (or via the bus-mode-1 BIU) to
the testbench BFM, byte/cycle-identical to all M0–M14 gates.
The flag adds the l1axi_en/flush_all/m_axi_*/bus_err port
group to rtl/ventium_top.sv and the u_l1axi instance with a 3-way
memory mux that holds the direct mem_* port inert when l1axi_en = 1.
Inside the core it adds real_bus and bus_err inputs and arms the
fast-path miss-stall gates: the dual-issue fast path otherwise latches
mem_rdata the same clock it requests (the BFM assumption), which a real L1
cannot honour on a miss — so I-side fill word-0 capture and the S_PF
launch wait for mem_ack, a D-side register-base load bubbles the whole
issue until the fill completes, and the modelled P5 D-miss penalty is
suppressed (real DDR latency becomes the only timing source — “mode-2 timing
is functional-only”). A fatal AXI fault (watchdog timeout or SLVERR) parks the
core in S_HALT with cpu_hung so the PS can observe and reset. As an
area side-effect the build drops the ~4.4 k-cell dcache_timing instance
(dc_lu_hit tied 1 — functionally inert because the penalty it fed is
suppressed).
Architecturally bit-exact: the entire 77-program suite is re-verified against
the QEMU golden through this path (verif/l1/run-l1axi-verify.sh); the
byte-accurate/cross-line L1 fix took mode-2 equivalence from 34/77
(aligned-only) to 76/77, and the final program was recovered by the high
quiesce window (the default 64 falsely declared an idle core), giving
L1AXI-VERIFY-OK 77/77.
All gates are doubly protected (ifdef plus runtime
l1axi_en/real_bus = 0), so even the flagged build in modes 0/1 is
byte-identical. Set by the verif/tb/Makefile targets l1axi and
l1axi_cdc (which append rtl/mem/ven_l1d.sv,
rtl/mem/ven_axi_master.sv and rtl/mem/ventium_l1_axi.sv — deliberately
absent from rtl/ventium.f, so the macro alone is an elaboration error), by
run-l1axi-verify.sh, and by the production KV260 define lists
(bd_kv260_soc.tcl, impl_kv260_soc.tcl, together with
VEN_KV260_SOC). The -DVEN_L1_AXI C flag also reaches the C++
testbench, gating the behavioural DDR slave leg in tb_main.cpp.
Gotchas: the cycle oracle is not claimed in mode 2 — only the functional trace
is graded. Verification needs a high quiesce (--quiesce 200000 vs the
default 64) because a multi-cycle mode-2 access (an icache fill, a 27-beat
FNSAVE) goes many clocks without retiring. ven_l1d holds m_req
continuously across back-to-back transactions, so any downstream consumer
that waits for !m_req deadlocks (documented in ven_axi_cdc.sv). The
flush_all input exists for cosim coherency: the syscall proxy writes the
backing memory model directly, bypassing the L1. The standalone unit gates
(run-l1axi-gate.sh, run-l1d-gate.sh, run-l1axi-wd-gate.sh) do
not need the define — the leaf modules are unconditional; the ifdef only
controls wiring into ventium_top/core.
VEN_L1_SETS¶
Sets the set count for both in-core L1s at once: IC_SETS for the
functional I-cache and DC_SETS for the D-cache timing model. It sits in a
precedence chain — the per-cache VEN_IC_SETS/VEN_DC_SETS override it,
it overrides VEN_CACHE_HALF (the legacy 64-set shortcut), which overrides
the 128 default (the Pentium silicon geometry the cycle oracle is validated
against; at the default, the index/tag slices are exactly
addr[11:5]/addr[31:12]). Set-index width, tag width and all tag/valid/
data arrays derive automatically; the value must be a power of two.
Changing it changes the miss sequence: the build stays functionally
bit-exact — supported by the VEN_CACHE_HALF make verify run (the
64-set case, 77/77), the 4-way CoreMark CRC free-run and the functional
set-count sweeps, not by a make verify run with this define itself — but
the cycle trace no longer matches the fixed-128-set p5trace oracle, so the
M4/M5 bands are only claimed for the default. It is the documented
convenience form (see Parametric L1 Caches) with no recorded user:
the cache design-space sweeps that found the 32 KB L1 knee passed the
per-cache VEN_IC_SETS/VEN_DC_SETS instead, and the umbrella pair’s
only recorded use is the ways side (VEN_L1_WAYS=4). There is no
production setter — the FPGA builds use VEN_CACHE_HALF instead. An important scope limit: it does
not touch ven_l1d, the VEN_L1_AXI data cache — that module has its
own L1_SETS module parameter left at its 128 default by
ventium_l1_axi, so a geometry sweep changes the in-core I-cache and
D-timing model only. Always build into a private object directory so the
canonical obj_dir/tb_ventium is never clobbered.
VEN_L1_WAYS¶
Sets the associativity for both in-core L1s at once, with the same precedence
pattern (VEN_IC_WAYS/VEN_DC_WAYS win; default 2, the P5 silicon
associativity, at which the generalized age-rank true LRU reduces
byte-for-byte to the original 1-bit MRU scheme — victim = ~lru —
confirmed by the cycle gates). Supporting arbitrary N required the per-way
age-counter LRU described under VEN_IC_WAYS, exposed as ic_victim_o so
the spine’s fill-way selection stays uniform. Functionally bit-exact but
cycle-oracle-divergent for non-default values — an area/CPI exploration knob,
e.g. the 4-way conflict-pattern experiment in build/rerun/nway_4way.sh
(local, gitignored; a 12 KB conflict set on +VEN_L1_WAYS=4). No production setter; the same
ven_l1d scope limit as VEN_L1_SETS applies. The value must be a power
of two.
VEN_AXI_CDC¶
The P1-3 dual-clock option for the L1/AXI subsystem. Without it,
ventium_l1_axi is the single-clock CDC_BYPASS build: ven_l1d’s
word-granular backing port wires straight to ven_axi_master with plain
assigns, the raw core_rst_n is used everywhere, and core_clk and
axi_clk are expected tied to the same clock — which is what the production
KV260 bitstream ships (a single pl_clk0).
With the flag, the core plus ven_l1d stay in core_clk (the slow,
Fmax-limited PL domain) while ven_axi_master and the AXI4 link run in
axi_clk. ventium_l1_axi instantiates rtl/mem/ven_axi_cdc.sv, which
bridges the single-outstanding backing port with exactly one primitive:
rtl/mem/ven_cdc_afifo.sv, a clean-room Cummings 2-FF Gray-pointer
asynchronous FIFO (binary plus Gray pointers, 2-flop synchronizers, FWFT read
off distributed RAM). Two FIFOs are used — a command FIFO core→axi carrying
{we, addr, wdata, wstrb} and a response FIFO axi→core (depth ≥
LINE_BEATS = 8 for a full line fill) — with small per-domain FSMs that
preserve ven_l1d’s ack semantics exactly (eight in-order acks per read
fill, one per write), so ven_l1d and ven_axi_master are textually
untouched; ven_axi_master is simply re-clocked via a compile-time static
net alias, never a runtime clock mux. Two ven_reset_sync
async-assert/sync-deassert reset bridges are fed by the AND of both raw
resets, so a reset in either domain resets both — an asymmetric
mid-transaction reset would otherwise wipe the producer side of an in-flight
fill and create an undetectable deadlock (the watchdog lives in the just-reset
axi domain). The sticky bus_err level crosses axi→core through a 2-flop
ASYNC_REG synchronizer.
Purely an implementation/Fmax option, and proven bit-exact: the multi-ratio
data-integrity gate (verif/l1/run-l1axi-cdc-gate.sh) runs
read-fill/write-through/evict at four core:axi clock ratios with assertions
live, and run-l1axi-cdc-verify.sh re-runs the full 77-program suite
through the bridge at the degenerate equal-clock ratio. Set by the Makefile
l1axi_cdc target (together with VEN_L1_AXI), the CDC gates, and
fpga/scripts/synth_cdc.tcl (an OOC synthesis check with two genuinely
asynchronous clocks, 66/200 MHz).
Gotchas: the reset coupling is load-bearing — never split the resets. The
CDC’s core-side FSM completes on the last response and must not wait for the
request to drop, because ven_l1d holds m_req continuously across
back-to-back transactions (FNSAVE’s 27 sequential stores) — waiting deadlocks.
The FIFO depth must be a power of two ≥ 4 (a sim-time $fatal enforces
it). The FIFO is metastability-safe by Gray-code structure, not by simulation
— Verilator only proves the functional protocol across ratios. The CDC source
files are not in rtl/ventium.f, so defining the macro without adding them
is an elaboration error; and the real dual-clock MMCM block design with its
ven_cdc.xdc constraints is the pending follow-up.
VEN_KV260_SOC¶
The full-SoC seam set for the KV260 PS-assisted architecture (the PL holds
only the core, L1 and AXI master; the PS A53 emulates every slow peripheral).
Inside rtl/ventium_top.sv it does three things. First, it adds the
retire_count[63:0] output, wired to the live retire counter, so the PS can
observe forward progress without the sim-only DPI (the F2 milestone). Second,
it adds the F3 PS-driven external-interrupt injection seam — soc_en,
intr (an 8259-INT-analogue level), inta_vector[7:0] (the PS-staged
vector) inputs and the one-clock inta acknowledge strobe output — and
wires the core’s previously tied-off interrupt divert for real (NMI stays
tied 0). Third, it overrides the L1/AXI address remap: L1AXI_REMAP_BASE
becomes 40'h0_4000_0000 instead of identity 0, so the core’s x86 physical
space lands in the reserved 256 MB DDR carveout on S_AXI_HPC0. Without the
define, even a +VEN_L1_AXI build has none of these ports and the
interrupt path is tied off, so every cosim/L1AXI gate stays byte-identical.
It is the build flag for the rtl/soc/ventium_kv260_core.sv top
(ventium_top plus rtl/soc/ven_soc_axil.sv, the AXI4-Lite
control/status/port-I/O-bridge/interrupt slave on M_AXI_HPM0_FPD), whose
header states “build with +VEN_L1_AXI +VEN_KV260_SOC”. At runtime it is
additive-and-inert by default: soc_en comes from the slave’s
MODE.SOCEN bit (default 0), so F2 boot behaviour is unchanged until the PS
opts into F3 interrupt injection, and the remap is pure address translation,
invisible to the core. There is no Verilator differential gate for this define
itself — it is certified by the BD validate/synth bars plus the on-board F2
firmware diff (the 324-record COM1 boot trace), while its constituent seams
(the AXI-Lite slave handshake, the L1/AXI wire protocol) have their own
Verilator gates.
Set by fpga/scripts/bd_kv260_soc.tcl and
fpga/scripts/impl_kv260_soc.tcl (every archived production run echoes
it). It has a hard dependency on ``VEN_L1_AXI``: its entire port block is
nested inside that ifdef and the core connection block references the new
ports — defining VEN_KV260_SOC alone does not compile. The carveout
base/size (0x4000_0000/0x1000_0000) must agree between the RTL
localparam, the BD address segment, and the PS Linux reserved-memory node.
Note the deliberate asymmetry: cosim keeps REMAP_BASE = 0 because the
testbench memory model is x86-phys-indexed — do not “fix” one side to match
the other. Among ventium_top-based builds, only +VEN_KV260_SOC exposes the
soc_en port the core branches its CPUID identity on (0x663 vs 0x543; the
port is tied 0 otherwise) — but the M8 ventium_soc simulation top drives
soc_en = 1 unconditionally with no define, and that is where the
soc_en CPUID arm is actually differential-verified (the psoccpuid SoC
gate); the define itself only adds ports and the remap.
VEN_PS_PROXY¶
The F4 PS-bridge stall variant of the Quake user-mode int-0x80 proxy. Without
it, an int-0x80 in proxy mode commits with zero latency: the S_DECODE
arm applies the testbench-driven golden EAX/gs/resume effects combinationally
on the same clock (the M7.1 lockstep contract). On real hardware the PS A53
answers a syscall thousands of core clocks later, so the same-clock assumption
breaks.
The define adds the syscall_resp_valid input through ventium_top into
the core, a new FSM state S_SYSCALL_WAIT, and a 4-bit q_proxy_len
register. When proxy_en = 1 and S_DECODE hits cd 80, instead of
committing immediately the core latches only the instruction length and parks
in S_SYSCALL_WAIT (eip still at the cd 80), committing nothing;
syscall_active still pulses for that one S_DECODE clock, and because
no retire ticks the counter, syscall_n and the syscall_arg_* register
reads stay stable for the PS for the whole wait. On syscall_resp_valid the
wait arm performs the identical commit the zero-latency arm would have
(EAX, optional gs-base latch, eip += q_proxy_len, the fold-pending state)
and resumes fetch — architecturally equivalent, just later. It mirrors the
S_IO port-I/O stall discipline.
The stall reorders when effects commit, so equivalence is non-obvious and
the flag has its own differential gate:
verif/m7/run-quake-lockstep-proxy.sh builds make proxy
(obj_dir_proxy; the -DVEN_PS_PROXY C flag also makes tb_main.cpp
play the PS with an 8-clock service latency) and requires the stalled RTL
trace to be bit-exact against the same QEMU golden the zero-latency build
matches (QUAKE-LOCKSTEP-PROXY-OK). It is runtime-double-gated
(proxy_en must be 1 and an int-0x80 must decode in user mode), so even
the flagged build is byte-identical on non-proxy runs.
It is in no FPGA production list — the F4 hardware syscall bridge is future
work; ven_soc_axil only reserves the syscall register window
0x40–0x6C for it, and ventium_kv260_core ties the response inputs off.
The eventual hardware path composes with VEN_KV260_SOC (the reserved
window will drive syscall_resp_valid) and with VEN_L1_AXI’s
flush_all (the PS writes kernel-memory effects to the DDR carveout,
requiring L1 invalidation before resume). Gotchas for a PS-side
implementation: syscall_active pulses on the S_DECODE clock, not the
commit clock, so the response must not be assumed sampled the same cycle; and
eip parks at the cd 80, so q_proxy_len is the only state needed to
advance on the late commit — do not re-derive the length in the wait arm.
VEN_PBLOCK¶
Not a Verilog define. It is an environment variable read by two Vivado Tcl
floorplanning-experiment scripts (fpga/scripts/impl_route_relax.tcl and
fpga/scripts/impl_route_fppipe.tcl) via
[info exists ::env(VEN_PBLOCK)] && $::env(VEN_PBLOCK) eq "1"; it never
appears in any verilog_define list and never reaches the RTL, so it cannot
change the netlist’s function — a pure physical-implementation option. When
VEN_PBLOCK=1, after out-of-context synthesis of the bare core the script
creates a soft Pblock pb_core containing every non-internal cell, resizes
it to SLICE_X0Y0:SLICE_X111Y239 (a compact lower-left-biased region) and
sets IS_SOFT TRUE — the intent being to stop the placer from spreading the
82 %-utilization core across the whole die so hot nets (the eip fetch
self-update loop, f_mem80 → fpr) stay short. The output build directory
gets a _pblock suffix so baseline and Pblock runs coexist.
The recorded outcome is a closed negative result
(fpga/TIMING_PROBLEMS.md): placed Fmax improved only 42.5 → 44.2 MHz, and
the full route did not converge at 82.85 % utilization under either route
directive — the conclusion that floorplanning alone cannot fix a
device-filling core; the real lever is lower utilization, which redirected the
Fmax work toward logic consolidation and the later half-cache/uop-cache
configurations. No Makefile or CI sets it — a manual experiment knob, and the
comparison is to the exact string "1" (VEN_PBLOCK=true silently
disables it). It operates on builds using the Fmax define stack and is
orthogonal to every Verilog flag; the separate run_pblock_rebalance.tcl
and impl_floorplan.tcl experiments do not read it. The production KV260
flow does not use it.
Peripheral RTL/PS split¶
The VEN_<DEV>_PS family is Ventium’s per-peripheral RTL(PL)-versus-PS(A53)
placement selector for the SoC top rtl/soc/ventium_soc.sv (see
SoC Peripherals PL/PS Split for the architecture). A device whose macro is
defined is not synthesized into the SoC: its I/O port range is added to
io_ps_sel, and every IN/OUT in that range is forwarded over the byte-wide
io_ps_* bridge seam to a C model, with io_ack waiting on
io_ps_ack and io_rdata taking the returned byte. Because the core
stalls in S_IO until ack, the C model’s latency cannot perturb the
per-record instruction stream — only the returned value matters.
The flags are generated, not hand-set: fpga/scripts/gen_periph_split.py
reads fpga/periph_split.config (rtl or ps per device) and emits
fpga/build/periph_split.vdefs (one +define+VEN_<DEV>_PS line per
ps-placed device — currently RTC, I8042, VGA, ACPIPM, UART and FDC) plus the C
dispatch table verif/tb/ps_periph_table.inc. In verification the single
setter is verif/soc/run-soc-ps-cosim-gate.sh, which passes one
+define+VEN_<DEV>_PS into the verif/tb/Makefile soc target,
building a per-device obj_dir_soc_<dev>ps testbench whose ps_devs
table (verif/tb/tb_soc.cpp) services the forwarded ports from
sw/ps_periph/<model>.c. The default build (no flags) is the all-RTL SoC
with io_ps_req tied 0 — byte-identical to the pre-split SoC.
Each enabled flag changes implementation placement only and is required to
stay per-record bit-exact: every settable flag has its own differential gate
(run-soc-ps-cosim-gate.sh <dev> → “C model == RTL == qemu” EQUIVALENT;
aggregated by run-soc-ps-cosim-all.sh inside run-all-soc-gates.sh).
Off the differential surface, each flag does delete real SoC functionality —
device IRQ lines into the PIC tie to 0, the i8042 A20 output, the VGA mode
bits feeding the chain-4 framebuffer, the UART tx console seam — which the
docs class as board-integration follow-ups needing a PS-to-PL return path.
Three of the nine flags (VEN_PIC_PS, VEN_PIT_PS, VEN_PORT92_PS)
are forward-only latent hooks: their decode term exists in the
io_ps_sel chain, but the RTL instance is not ifndef-gated (always
instantiated), they are declared bus-critical “keep in PL” in
fpga/periph_split.config, the cosim gate script rejects them, and (for
pit/port92) no C model exists — setting them today would produce a broken
dual-commit hybrid. Note also that nothing in fpga/scripts/*.tcl consumes
periph_split.vdefs yet: the current KV260 board builds compile
ventium_kv260_top/ventium_kv260_core — all peripherals on the PS via
ven_soc_axil — and not ventium_soc.sv at all, so these flags are
presently exercised only in the Verilator SoC verification builds, standing
ready for a future full-SoC PL build.
VEN_UART_PS¶
Places the COM1 16550 UART (ports 0x3F8–0x3FF) on the PS. The ifndef in
rtl/soc/ventium_soc.sv removes the ven_uart16550 instance (real area
savings in a future full-SoC PL build); the else arm ties
uart_rdata = 0, uart_irq4 = 0 (master-PIC IR4 permanently deasserted)
and kills the board-console tx seam (uart_tx_valid = 0 — the seam the RTL
streams THR bytes out on). cs_uart is OR-ed into io_ps_sel, so every
access in the range stalls the core until the C model’s byte returns.
The replacement is sw/ps_periph/ven_uart16550.c, a 1:1 behavioural port of
the RTL register surface (IER/LCR/MCR/LSR/MSR/SCR/FCR/DLL/DLM/RBR, DLAB
banking) where the RTL’s clocked read side-effects (the LSR/MSR read-clears)
become inline read side-effects, applied exactly once per access via the
bridge’s single-call guarantee. The gate
(run-soc-ps-cosim-gate.sh uart, test psocuart) is EQUIVALENT over 110
retired instructions. The tx-seam loss is moot on the board: when the UART is
PS-placed, the PS C model itself holds the THR bytes — it is the console.
IRQ4 delivery is tied off in the PL; the differential runs with interrupts
disabled so this is unobserved, but a board build needs the PS to inject IRQ4
through ven_soc_axil. One documentation skew: sw/ps_periph/README.md
references a run-soc-uart-ps-gate.sh wrapper that does not exist — the
real entry point is run-soc-ps-cosim-gate.sh uart. The port range is
adjacent to the FDC’s and to IDE alt-status 0x3F6; the decode boundaries are
exact, with no overlap.
VEN_VGA_PS¶
Places the VGA legacy register file (window 0x3B0–0x3DF) on the PS. The
ifndef removes the ven_vgaregs instance (ATTR/MISC/SEQ/DAC/GFX/CRTC/
IS1 registers with qemu-matched masks); the else arm ties
vga_rdata = 0 and — critically — zeroes the four exported mode bytes
(vga_seq_plane_mask, vga_seq_mem_mode, vga_gfx_mode,
vga_gfx_misc). Those bytes feed the still-instantiated RTL chain-4
framebuffer ven_vga_fb: chain4_en derives from them, so with the
registers on the PS it is stuck 0 and the 0xA0000 mode-13h memory intercept
never fires — the framebuffer is dormant.
The range forwards to sw/ps_periph/ven_vgaregs.c, which ports the register
surface 1:1 including the deterministic read side-effects (DAC 3-byte palette
auto-increment, ATTR index/data flip-flop reset on an IS1 read, the IS1
dumb-retrace toggle), executed inline exactly once per access. The pvga
differential (292 records, shared with the acpipm gate) proves the C model
byte-identical to qemu — pvga never touches 0xA0000, so the dormant
framebuffer is invisible to it.
This is the biggest cross-domain coupling of the family: register split and
framebuffer rendering are currently mutually exclusive — a mode-13h workload
(the F4 Quake/DOS path, the pvgafb gate) must not be run on a
+VEN_VGA_PS build until a PS→PL mode-bit return path exists (the explicit
follow-up note in both the RTL and the peripheral-split page).
VEN_RTC_PS¶
Places the MC146818 RTC/CMOS (ports 0x70 index / 0x71 data) on the PS. The
ifndef removes the ven_rtc instance; the else arm ties
rtc_rdata = 0, rtc_irq8 = 0 (slave-PIC IR8 permanently low) and
rtc_nmi_dis = 0 (the NMI-disable side-band, today a lint sink). The two
ports forward to sw/ps_periph/ven_rtc.c, a 1:1 port of ven_rtc.sv
matched to QEMU 8.2.2’s mc146818rtc.c: the 128-byte CMOS array behind the
index latch with the separate non-aliasing NMI-disable bit, index-port reads
returning 0xFF, REG_A UIP read-only, REG_C read-then-clear plus IRQ8 lower (an
inline read side-effect, applied once per access), REG_C/REG_D write-ignore,
and the REG_B IRQF recompute on write.
The psocdev differential (122 records) reads only the time-invariant
registers (REG_D, REG_B, the index-read 0xFF, a scratch-CMOS round-trip), so
the un-oracled periodic tick can never perturb the compare; with the flag set
the same test passes EQUIVALENT against the C model. This is the cleanest flag
of the family — no PL-consumed output other than the (diff-quiescent) IRQ8 and
NMI-disable lines. The RTC time-byte / UIP / periodic-tick surface is a
documented oracle boundary in both placements (host-clock-derived); the
cosim gate only proves the synchronous register surface, and board-build IRQ8
delivery needs PS injection via ven_soc_axil.
VEN_PIT_PS¶
A forward-only latent hook — do not set. The macro exists in the
gen_periph_split.py table (ports 0x40–0x43) and in the io_ps_sel chain
of ventium_soc.sv, but unlike the six slow devices the ven_pit
instantiation is not wrapped in ifndef VEN_PIT_PS and has no tie-off
arm. Defining it therefore does not remove the RTL PIT: it only reroutes
0x40–0x43 accesses to the bridge while the RTL PIT still receives the same
cs_pit pulses and still drives pit_out0 into PIC IR0. Because a
PS-bridged access holds io_req (and thus cs_pit) high for multiple
clocks until io_ps_ack — versus the one-clock chip select the RTL devices
were designed for — RTL-side write commits and read side-effects
(counter-latch and read-LSB/MSB flip-flop state) would double-step, and a PS
shadow copy would diverge from the live RTL PIT that actually generates IRQ0.
There is no sw/ps_periph/ven_pit.c (the generator would report
“MISSING — needs a C model”), run-soc-ps-cosim-gate.sh has no pit
branch, and fpga/periph_split.config pins pit rtl as bus-critical
(“8254 PIT — IRQ0 timebase”); a +VEN_PIT_PS testbench build would read
open-bus 0xFF for the range. The PIT’s RTL coverage is the all-RTL pirqsoc
gate. It is distinct from VEN_PIT_TICK_DIV, the unrelated sim-only IRQ0
cadence knob, and the PS-side PIT for the actual KV260 board path is planned
to feed sw/ps_periph/ven_pic.c via the ven_soc_axil interrupt seam — a
different architecture that bypasses ventium_soc entirely.
VEN_PIC_PS¶
The second latent hook — do not set. Its only RTL appearance is the
io_ps_sel term covering the master/slave/ELCR ports (0x20/0x21, 0xA0/0xA1,
0x4D0/0x4D1); the ven_pic instantiation is unconditional, hard-wired to
the core’s INTR/INTA pins and to all 16 device IRQ inputs. Defining it would
forward PIC register I/O to the PS while the RTL PIC keeps full
interrupt-delivery authority but stops seeing consistent register state (reads
answered by the PS, writes seen by both sides, with the multi-clock
chip-select double-commit hazard on its OCW/ICW write sequencing and poll-read
side effects) — the interrupt machine state of the RTL PIC and any PS shadow
would immediately diverge.
Uniquely among the three latent hooks, a C model does exist —
sw/ps_periph/ven_pic.c, a 1:1 port matched to QEMU 8.2.2’s i8259.c —
but its stated purpose is the F3 KV260 board firmware path, where all
peripherals including the PIC run on the A53 and the PS drives interrupts into
the core through ven_soc_axil’s interrupt-injection seam: an architecture
that does not use ventium_soc or this macro at all. The PIC is the one
device where “register surface on PS” is architecturally insufficient —
INTR/INTA timing is the reason fpga/periph_split.config pins it in the PL.
tb_soc.cpp does not even register a PIC entry in ps_devs, so a
+VEN_PIC_PS build would read open-bus 0xFF from 0x20/0x21.
VEN_I8042_PS¶
Places the 8042 PS/2 keyboard controller (ports 0x60 data / 0x64 cmd-status)
on the PS. The ifndef removes the ven_i8042 instance; the else arm
ties kbd_rdata, kbd_irq1, mouse_irq12, kbd_reset_req,
kbc_a20 and kbd_inj_ready to 0. Two real functional consequences
follow. First, A20: the effective A20 mask is
eff_a20 = kbc_a20 | p92_a20, gating address bit 20 on the memory path;
with the i8042 on the PS its contribution is gone, so A20 tracks port-92 alone
— acceptable for the psocdev differential because that test drives both
A20 sources together and port-92 stays in RTL, but a board build needs the PS
to drive the i8042 A20 state back into the fabric. Second, the testbench’s
--type-at keyboard-injection seam (used for the F3/F4 FreeDOS typing path)
becomes unavailable (inj_ready = 0), so the interactive FreeDOS flows must
not be run on a +VEN_I8042_PS build.
The range forwards to sw/ps_periph/ven_i8042.c — a port of the
queue-independent controller surface (A20 enable/disable 0xDF/0xDD, output
port, mode byte, status) with the 0x60 OBF-dequeue read side-effect inline;
the asynchronous PS/2 device queues are an oracle boundary exactly as in the
RTL. The psocdev cosim (122 records, shared with the rtc gate) passes
EQUIVALENT. One cosmetic note: tb_soc.cpp registers the dispatch range as
0x060–0x064 inclusive, but only 0x60/0x64 are ever forwarded because
cs_i8042 decodes exactly those two ports — the wider C-side range is
harmless.
VEN_FDC_PS¶
Places the 82077/8272A floppy controller (ports 0x3F1–0x3F5 plus 0x3F7 —
deliberately not 0x3F0 and not 0x3F6, which is IDE alt-status) on the PS.
The ifndef removes the ven_i8272 instance; the else arm ties
fdc_rdata = 0 and fdc_irq6 = 0 (PIC IR6 dead — quiescent on the
differential anyway, which runs with interrupts disabled). The range forwards
to sw/ps_periph/ven_i8272.c, a 1:1 port of the synchronous register
surface matched to QEMU 8.2.2’s fdc.c, with the result-FIFO advance as an
inline read side-effect. The oracle boundary matches the RTL’s: actual
READ/WRITE/FORMAT/READ-ID/RECALIBRATE/SEEK execution, DIR media-change and
motor spin-up are excluded (they need disk/DMA or asynchronous timing). The
psocfdc differential (116 records) passes EQUIVALENT.
The port-range carve-out is the subtle part: the generator encodes the same
two disjoint ranges, and tb_soc.cpp registers 0x3F1–0x3F7 inclusive with a
comment that 0x3F6 can never arrive because cs_fdc excludes it — so an IDE
alt-status access is never mis-forwarded even with both the UART (0x3F8 up)
and the FDC on the PS; the three decoders partition 0x3F1–0x3FF exactly
(0x3F0 itself is claimed by none of them and falls to the default open-bus
decode). One
stale comment: ven_i8272.c’s header says “built with +VEN_I8272_PS” —
that macro name is wrong; the actual macro is VEN_FDC_PS.
VEN_ACPIPM_PS¶
Places the ACPI PM timer (single port 0x608, PIIX4 PM base+8) on the PS. The
ifndef removes the ven_acpipm instance (a free-running 24-bit counter
advanced by a fractional accumulator at the PM-timer rate); the else arm
ties acpipm_rdata and acpipm_rdata32 to 0. The port forwards to
sw/ps_periph/ven_acpipm.c, which models the same surface: a read returns
the address-selected byte of {8'h00, count[23:0]}; a write is ignored
(qemu’s acpi_pm_tmr_write is a no-op). The cosim gate (the pvga test,
292 records, shared with the VGA flag) passes EQUIVALENT. The cheapest device
of the family; its PS placement is mostly about keeping the configuration
uniform.
A width nuance unique to this flag: in the RTL placement the io_rdata mux
returns the full 32-bit acpipm_rdata32 so a dword IN gets the whole
counter, whereas the PS bridge is byte-wide and a PS-placed read returns one
zero-extended byte. That narrowing is invisible to every existing gate because
the timer’s read value is a documented oracle boundary in both placements
(host-clock/clk-derived; the differential only exercises the write-inert OUT)
— but if a future test ever oracles a 32-bit IN from 0x608 it will differ
between placements. Never compare instantaneous counter values across
placements; the cadence is structural (clk-derived in RTL, no clock at all in
the C model).
VEN_PORT92_PS¶
The third latent hook — do not set. Its only RTL appearance is the
io_ps_sel term for port 0x92; the ven_port92 instantiation is
unconditional with no tie-off. With the flag, an IN/OUT 0x92 would be answered
by the bridge while the RTL port92 still latches the same writes (the
multi-clock chip-select double-commit hazard) and still drives p92_a20
into eff_a20 and p92_reset_req. Since there is no
sw/ps_periph/ven_port92.c (the generator names the model but the file does
not exist) and tb_soc.cpp registers no port-92 entry, reads would come
back open-bus 0xFF — while the live A20 state silently tracked only the RTL
writes.
The placement rationale is physical: A20 masking sits combinationally on the
core’s memory-address path (eff_a20 gates address bit 20), which cannot
tolerate an A53 round-trip — hence port92 rtl # fast A20 — gates the address
bus in fpga/periph_split.config and no cosim branch. The flag exists so
the io_ps_sel chain is complete and uniform for all nine non-bus-master
device rows (the generator’s ide/dma/dma2 rows define
VEN_IDE_PS/VEN_DMA_PS/VEN_DMA2_PS but have no io_ps_sel term
in the RTL at all), and as the seam a future PS→PL A20 return path would
build on. Note that
VEN_I8042_PS implicitly depends on port-92 staying in RTL for A20 coverage
on the psocdev diff; PS-placing port-92 would orphan A20 entirely. The
device’s RTL coverage is the all-RTL psocdev A20-mask differential plus
the standalone unit self-check (the port92 target of
verif/soc/Makefile — make -C verif/soc port92, driving
tb_ven_port92.cpp).
Peripheral tunables and debug instrumentation¶
This family is the SoC peripheral-tunable and debug-instrumentation knob set.
Every flag is ifndef-defaulted in the RTL itself or ifdef-guarded
around observation-only code (pure $display blocks,
VEN_IDE_DISK_HEX’s $readmemh load, VEN_DBG_WD’s two observation
registers), so a build with no flags is byte/cycle-
identical to the golden-gated default. The tunables (VEN_IDE_DISK_HEX,
VEN_IDE_DISK_SECTORS/CYLS/HEADS/SECS, VEN_RTC_EXTMEM_KB,
VEN_PIT_TICK_DIV) exist because Verilator’s -G cannot reach a
submodule parameter: ventium_soc.sv threads the ifndef-defaulted IDE
geometry and PIT macros into the ven_ide/ven_pit instance parameters,
while VEN_RTC_EXTMEM_KB is ifndef-defaulted and consumed directly
inside rtl/soc/ven_rtc.sv (module-internal localparams; the ven_rtc
instance takes only TICK_DIV). They are
sim-only by design — set only on the tb_soc Verilator build line, never in
any fpga/scripts/*.tcl define list — though that is moot for the deployed
bitstream, whose file list omits ventium_soc and these peripherals
entirely; the one flow that does synthesize ven_ide (the full-SoC probe)
hard-errors on the disk[] array regardless (fpga/TIMING_PROBLEMS.md
P1-2), and P1-3 records that the sim tick divisors must be retuned for a real
fabric clock.
Only VEN_IDE_DISK_HEX has committed setters (the verif/tb/Makefile
soc target plus five boot/IDE gate scripts); the geometry, PIT and
extended-memory overrides were passed ad hoc via the Makefile’s
VL_EXTRA_DEFINES passthrough during the F3 FreeDOS / F4 Quake boot
bring-up — no committed script pins the non-default values. Architecturally,
the disk geometry and VEN_RTC_EXTMEM_KB change CPU-observable values
(IDENTIFY words, the out-of-range-abort boundary, the CMOS memory-size bytes),
so the differential gates rely on the defaults; VEN_PIT_TICK_DIV changes
only the explicitly non-differential IRQ0 cadence; and VEN_IDE_TRACE,
VEN_DBG_WD and VEN_DBG_SSBASE are pure $display probes with zero
state-machine effect.
VEN_IDE_DISK_HEX¶
The disk-image backing-store path for ven_ide’s behavioural hard disk — a
quoted-string macro consumed by exactly one statement in
rtl/soc/ven_ide.sv: initial if (HAS_DISK) $readmemh(`VEN_IDE_DISK_HEX,
disk);, loading the raw byte array disk[0:DISK_SECTORS*512-1] at sim
start. The path is embedded at build time but the file is read at run
time, so the pide gate regenerates the hex before each run and the binary
picks up the fresh image. Undefined, the ifdef simply removes the
$readmemh and disk[] stays uninitialized (never read by non-IDE
tests); a lint sink keeps the define-less build UNUSEDPARAM-clean, and
HAS_DISK = 0 (the empty ATAPI CD secondary channel) suppresses the load
even when the define is present.
The verif/tb/Makefile soc target always passes it, with double
quoting — -DVEN_IDE_DISK_HEX='"$(VEN_IDE_DISK_HEX)"' — so the macro
expands to a SystemVerilog string literal; the make variable defaults to the
absolute path of verif/sys/tests/pide/pide.disk.hex, and each boot/IDE
gate overrides it with its own disk (run-soc-ide-gate.sh,
run-soc-boot-gate.sh, run-soc-bootrm-gate.sh,
run-soc-bootdma-gate.sh, run-soc-seabios-boot-gate.sh). The FPGA flow
must leave it undefined (fpga/TIMING_PROBLEMS.md P2-2). The disk contents
are part of the differential surface: the same single-source image is fed to
qemu -drive and to $readmemh, and READ SECTORS data is graded
byte-identical per record — a “flag change” here is really test content, and
each gate regenerates its golden against the same image. The value needs the
shell-level quote nesting ('"path"'); a bare path breaks the preprocess.
VEN_IDE_DISK_SECTORS¶
Sets the ven_ide DISK_SECTORS instance parameter (default 128 — the
64 KiB pide single-source image). It sizes the backing array, derives the
disk byte-address width used by both the PIO drain pointer and the DMA dword
offset, and pins NSECT28 = DISK_SECTORS, the LBA28 out-of-range bound used
by the upfront READ abort, the per-sector mid-transfer abort and the
block-mode first-block check. It also feeds the CPU-visible IDENTIFY words
60/61 (LBA28 total) and 100 (LBA48 total). The F3 widening (commit
564d4d9) made it actually scalable: pre-F3 the byte address was a hard
[15:0], so sectors above 64 KiB aliased; with the macro, the FreeDOS
multi-MB disk passes +define+VEN_IDE_DISK_SECTORS=20160 (with
+CYLS=20) and addressing widens automatically.
No committed script sets a non-default value; overrides are ad hoc on the
tb_soc build line. It is architecturally visible — IDENTIFY words and
the out-of-range boundary change — so any non-default value invalidates the
pide per-record golden; non-default runs are boot bring-up only. It must
agree with the image supplied via VEN_IDE_DISK_HEX and with the qemu
golden geometry, and is normally changed together with the CHS triple. And it
is deliberately sim-only: the 64 KiB default disk[] (524 288 bits) is
already a hard synthesis error on the full-SoC probe (it cannot infer BRAM
due to the multi-port writes and $readmemh init, and is too large to
dissolve into FFs — fpga/TIMING_PROBLEMS.md P1-2), so raising
DISK_SECTORS in a synthesis context makes that strictly worse; the
recorded path to a real disk on hardware is the DDR-backed ven_ide rework
in fpga/PLAN.md §5.5, which scales DISK_SECTORS/geometry/OOR as part
of moving the backing store off-chip.
VEN_IDE_CYLS¶
Threads the ven_ide CYLS parameter (logical cylinder count, default
2). It appears in the CPU-observable IDENTIFY block — word 1 (logical
cylinders), word 54 (current cylinders) and, via
OLDSIZE = CYLS*HEADS*SECS, words 57/58 (current capacity) — and bounds how
much of the disk guests compute as CHS-reachable. The default 2/16/63 triple
is exactly what qemu’s guess_chs_for_size produced for the 128-sector
pide image, so the IDENTIFY constants grade byte-identical against a
freshly generated qemu golden each run. The F3 FreeDOS disk used +CYLS=20
(20 × 16 × 63 = 20160, CHS-consistent with the sector count). No committed
setter; sim-only. Unlike HEADS/SECS it does not enter the CHS→LBA
arithmetic, so an inconsistent CYLS only lies to the guest’s capacity math
rather than corrupting translation — but any non-default value still breaks
the pide per-record golden.
VEN_IDE_HEADS¶
Threads the ven_ide HEADS parameter (default 16). Two CPU-observable
roles: IDENTIFY words 3/55 plus the OLDSIZE capacity product, and the live
CHS→LBA translation used whenever a guest issues a command with devhead
bit 6 clear — chs_lba = (cyl*HEADS + head)*SECS + sector − 1, the exact
qemu ide_get_sector formula. A wrong HEADS therefore does not just
mis-report geometry; it makes CHS reads land on the wrong sectors. The
pide gate exercises CHS addressing explicitly (the M8.4c surface), so the
default is gate-proven; the FreeDOS-era multi-MB overrides kept 16. No
committed setter; sim-only. A mismatch between the define used for the golden
qemu -drive geometry and the RTL build corrupts CHS data compares — a much
louder failure than an IDENTIFY word diff.
VEN_IDE_SECS¶
Threads the ven_ide SECS (sectors-per-track) parameter (default 63).
CPU-observable in IDENTIFY words 4 (unformatted bytes/track = 512 × SECS),
6, 56 and the OLDSIZE product, and in the CHS→LBA translation as the track
multiplier and the 1-based sector offset. The default is the qemu
guess_chs_for_size value for the 128-sector image, against which the
M8.4c CHS-addressing differential is graded; the FreeDOS bring-up kept 63. No
committed setter; sim-only. Same consistency set and same CHS-corruption
failure mode as VEN_IDE_HEADS if the RTL define and the golden geometry
diverge.
VEN_IDE_TRACE¶
Sim-only $display instrumentation inside rtl/soc/ven_ide.sv, used
during the F3 FreeDOS boot gap-walk to watch IDE traffic. Five trace points:
(1) READ SECTORS (0x20/0x21) command acceptance — opcode, LBA-vs-CHS mode bit,
resolved LBA, effective sector count and the raw task-file registers; (2)
every device-control (0x3F6) write — new/previous control byte plus live
status and transfer state, i.e. SRST and nIEN flips in context; (3) a
“ZERO-READ” diagnostic when the CPU reads the data port with DRQ clear — the
case where rdata = 0 silently; (4) the first 12 data words of each
non-IDENTIFY sector drain; (5) the multi-sector block re-arm at each DRQ
window boundary. No registers, ports or FSM arms are added; on/off builds are
bit-exact, so no gate is needed. No committed setter — ad hoc
+define+VEN_IDE_TRACE via VL_EXTRA_DEFINES only; must not be defined
for synthesis.
One structural note: trace point (3)’s ifdef wraps an entire else if
arm of the clocked access-priority chain — safe today because the arm is
display-only, so with the define off the chain simply falls through with
identical state — but any future edit adding an assignment inside it would
create a define-dependent behavioural fork. The comment marks the
read-with-DRQ-clear case as “the zero-source” precisely because that silent
path once hid a bug.
VEN_RTC_EXTMEM_KB¶
The F4 (DOS Quake prep) knob: kilobytes of RAM above 1 MiB to advertise via
CMOS (default 0 — legacy behaviour, the CMOS memory-size registers keep their
all-0x00 reset value and SeaBIOS computes RamSize = 1 MiB; every existing
gate byte-identical). Background, from the rtl/soc/ven_rtc.sv header:
this SoC has no fw_cfg (ports 0x510/0x511 read open-bus 0xFF), so SeaBIOS
sizes RAM purely from CMOS. When the macro is nonzero, ven_rtc’s reset arm
seeds the qemu pc_cmos_init register set: 0x15/0x16 = 640 (base memory),
0x17/0x18 and 0x30/0x31 = min(EXTMEM_KB, 65535) (KB above 1 MiB),
0x34/0x35 = ((1 MiB + EXTMEM_KB) − 16 MiB)/64 KiB capped at 65535 (64 KB
units above 16 MiB); the >4G registers stay 0. The seed localparams implement
the exact hw/i386/pc.c formulas, and the seeding sits inside
if (EXTMEM_KB != 0), which constant-folds away at the default, preserving
the legacy all-zero image bit-for-bit.
Proven effect (commit 4f6c68b): with 64512 the SeaBIOS e820 gains a
63 MiB RAM entry on the RTL — the prerequisite for DPMI/Quake. The PS-offload
C model sw/ps_periph/ven_rtc.c mirrors it 1:1 (same #ifndef default,
same byte math, plus a runtime ven_rtc_set_extmem_kb() override hook for
future PS firmware, with no current caller). No committed setter — ad hoc
VL_EXTRA_DEFINES only.
Gotchas: this is architectural — CMOS bytes 0x15–0x18/0x30/0x31/0x34/0x35
are CPU-readable, so any nonzero value diverges from the qemu golden unless
qemu is launched with the matching -m; the psocdev RTC differential
only stays EQUIVALENT because the default is 0. The advertised memory is a lie
unless the SoC actually backs it — the value must not exceed real RAM behind
the bus or DOS/DPMI will scribble into nothing. Values above 65535 KB saturate
the 0x30/0x31 pair by design (qemu does the same); the above-16 MiB portion is
carried only by 0x34/0x35. In a +VEN_RTC_PS cosim, the C-model copy of the
flag is the one that matters — the two must be set consistently or the RTL/PS
split changes the guest-visible memory size.
VEN_PIT_TICK_DIV¶
Scales the i8254 PIT tick prescaler for simulation (default 1024 — one PIT
tick per 1024 core clocks). ventium_soc instantiates
ven_pit #(.TICK_DIV(PIT_TICK_DIV), .TICK_INC(1)); inside ven_pit a
generate block picks the implementation: TICK_DIV <= 1 means a tick every
clock (the natural hardware wiring where clk is the 1.193182 MHz PIT
clock — what the unit testbench uses), while larger values instantiate a
24-bit fractional accumulator. The PIT’s register semantics (counts, latching,
read-back, the OUT formula) are qemu-bit-exact regardless; only the wall-clock
cadence of ticks — hence the IRQ0 edges — depends on this knob, and that
cadence is explicitly structural, not oracled.
The default 1024 was chosen so the pirqsoc differential is
boundary-independent: with roughly 8 clocks per instruction and a ~12-
instruction IRQ0 handler, count 0x40 gives about 65 k clocks per IRQ0 — wide
margin for the mainline to finish handler-plus-spin between edges, making the
post-spin counter value deterministic. The F3 override exists because
SeaBIOS’s sti/hlt disk wait spins on the timer IRQ: at divisor 1024 an
18.2 Hz-equivalent IRQ0 period is ~67 M clocks — glacial in simulation, and it
trips the testbench quiescence detector — so the FreeDOS bring-up passed a
small divisor for a fast timer. Sim-only: the FPGA flow must instead retune
TICK_DIV/TICK_INC for the real clock so IRQ0 is genuinely
1.193182 MHz-derived (fpga/TIMING_PROBLEMS.md).
Gotchas: not bit-exact across values in SoC runs even though no architectural
register changes — IRQ0 edges move, so retire traces of IRQ-taking programs
differ; this is legal only because the differential checkpoints were designed
boundary-independent. The pirqsoc gate is byte-identical at the default
and would not survive an override. Both bounds matter: too-slow IRQ0 plus
sti/hlt falsely trips quiescence; too-fast IRQ0 starves the mainline
and breaks the deterministic-count property. TICK_INC is fixed at 1 in the
SoC instance, so the macro is a pure integer divider here. The unit gate
(verif/soc/run_pit.sh) is independent — its testbench hard-pins
TICK_DIV = 1.
VEN_DBG_WD¶
A sim-only opt-in hang watchdog inside rtl/core/core.sv, built for boot
bring-up gap-walking (finding where SeaBIOS or FreeDOS wedges without a
waveform). It adds two registers: dbg_stall, a cycles-since-last-retire
counter (any retire zeroes it), and dbg_printed, a one-shot latch. When
the counter hits exactly 150 000 with the latch clear, it $displays a
single [VEN-WD] STUCK line dumping the wedged FSM state by name
(state.name()), the outstanding mem_addr, cr0,
sys-mode/v86/real-mode, eflags, the IDT/GDT bases, the in-flight
int_vec, CPL, CS base and eip — exactly the registers needed to
localize a real/protected-mode delivery wedge. Pure observation: no output
ports, no FSM influence — retire streams are bit-identical with or without it.
The threshold was deliberately raised to 150 000 in commit 564d4d9 when
interruptible HLT landed, so legitimate long yield-halts between timer IRQs do
not false-fire; it is co-tuned with VEN_PIT_TICK_DIV (at divisor 1024 an
sti/hlt wait can legally sit ~67 M clocks without retiring, which is
why the watchdog is opt-in and bring-up runs pair it with a fast PIT divisor).
No committed setter — manual +define+VEN_DBG_WD works for both the rtl
and soc targets. Gotchas: it is one-shot — only the first stall per reset
is reported. It uses state.name(), fine under Verilator but another reason
it must never reach synthesis. A multi-cycle iterative engine (an SRT divide,
~66 clocks) is far below threshold, but artificially lowering the constant can
false-fire on legitimate long microcoded sequences; 150 000 is the empirically
safe value. It is distinct from the testbench-level quiescence detector
(always on) and from the architectural cpu_hung F00F-erratum latch (always
built).
VEN_DBG_SSBASE¶
A sim-only F3 gap-walk probe pair compiled into the top of the S_DECODE
arm (rtl/core/core_fetch_decode.svh). Probe 1 (SSBASE-DESYNC): on
every instruction decode while seg_real is set (real mode or V86), it
checks that the SS, DS and ES segment bases equal their selector << 4 — the
real-mode invariant — and prints eip plus all three selector/base pairs on
any mismatch. This is the probe class that root-caused the F3 FreeDOS stall: a
stale protected-mode base surviving into real-mode addressing. Probe 2
(ESP-HIGH): while seg_real, it checks ESP[31:16] == 0 — 16-bit
code never legitimately carries high ESP bits, so a hit means a 32-bit-ESP
stack op leaked past 64 K (the missing real-mode SP wrap, explicitly noted as
the audit’s remaining finding). Both are pure $display statements
evaluated once per S_DECODE; no state, port or timing change — bit-exact
on/off. No committed setter; F3 forensics use only, paired naturally with
VEN_DBG_WD and VEN_IDE_TRACE.
Read the output with care: probe 1 false-positives by design on unreal-mode
idioms — SeaBIOS’s big-real transitions deliberately carry a protected-mode
data selector with base 0 into real mode, tripping the sel << 4 check
while being exactly what the firmware intended — so its output is a
lead-generator, not an assertion. It fires per decoded instruction while the
condition persists, so a genuine desync floods the log (useful as a timeline,
expensive in throughput). And probe 2 documents a known modelling gap rather
than a guest bug — a hit may indict the core, not the software. User-mode
protected/flat corpora never assert seg_real, so both probes are silent
there.
Deployed configurations¶
This section records the define sets the committed build entry points actually use.
The default verification battery¶
The canonical builds set no defines at all. The verif/tb/Makefile
rtl target leaves VL_EXTRA_DEFINES empty and builds into obj_dir;
it is invoked define-free by make verify (verif/verify.sh), the
milestone runners (run-m0-smoke.sh through run-m5.sh), make
verify-sys (verif/sys/run-sys-golden.sh), the errata suite (make
m6), the Quake and Win95 lockstep/cosim runs (verif/m7), and the
benchmark sweeps (verif/bench). The emu and rtl-sva targets use
the same zero-define set (rtl-sva adds only --assert and the
biu_p5_sva bind). Everything this page calls “the default build” is this
configuration; it is the one byte/cycle-identical to the golden model.
Variant verification builds layer exactly the define under test on top:
make rtl with +define+VEN_FP_OVERLAP into obj_dir_fpovl and
+define+VEN_FXCH_FREE into obj_dir_fxch (the M5 gap kernels);
+define+VEN_TRANSCENDENTAL into obj_dir_trsc (the in-core
transcendental gates); the l1axi target (+define+VEN_L1_AXI, plus the
C-side -DVEN_L1_AXI) into obj_dir_l1axi and l1axi_cdc
(+define+VEN_AXI_CDC) into obj_dir_l1axi_cdc; the proxy target
(+define+VEN_PS_PROXY) into obj_dir_proxy; and the FP_PIPE2 A/B gate,
whose base set is VEN_SRT_ITER VEN_IDIV_ITER VEN_BCD_ITER VEN_FP_PIPE
VEN_BTB_PIPE with build B adding VEN_FP_PIPE2 (obj_dir_fp1/
obj_dir_fp2). The standalone unit-engine gates pass only testbench-local
value macros (SRT_N, BCD_N, IDIV_N, …), no core feature flags.
Every variant gets its own object directory precisely so the canonical
obj_dir/tb_ventium is never clobbered.
The SoC gate battery¶
The verif/tb/Makefile soc target (top ventium_soc,
obj_dir_soc) always defines exactly one macro: VEN_IDE_DISK_HEX,
defaulting to the pide disk image path. The make verify-soc aggregate
(verif/soc/run-all-soc-gates.sh) builds it that way for all the
peripheral, firmware-boot and real-mode gates; the disk-booting gates
(run-soc-ide-gate.sh, run-soc-boot-gate.sh, run-soc-bootrm-gate.sh,
run-soc-bootdma-gate.sh, run-soc-seabios-boot-gate.sh) override only
the disk path. The PS-offload cosim gates add exactly one placement flag each:
run-soc-ps-cosim-gate.sh <dev> builds
{VEN_IDE_DISK_HEX, VEN_<DEV>_PS} into obj_dir_soc_<dev>ps for uart,
rtc, i8042, acpipm, fdc and vga; VEN_PIC_PS,
VEN_PIT_PS and VEN_PORT92_PS are not in the map — no gate sets them.
The shipped KV260 full-SoC bitstream¶
The deployable image (kv260_soc_impl_linux_40, built by
fpga/scripts/impl_kv260_soc.tcl with FE_PIPE=1 at PL0_MHZ=40) sets
the full production stack:
SYNTHESIS VTM_NO_DPI
VEN_SRT_ITER VEN_IDIV_ITER VEN_BCD_ITER
VEN_FP_PIPE VEN_FP_PIPE2 VEN_BTB_PIPE
VEN_IC_BRAM VEN_UOPCACHE VEN_CACHE_HALF
VEN_L1_AXI VEN_KV260_SOC
VEN_FE_PIPE (appended by the FE_PIPE=1 environment knob)
Reading it as layers: the two housekeeping tokens strip the DPI channel and
the sim-only assertions; the three iterative engines and the two-stage FP
commit pipeline are the bit-exact Fmax substrate; VEN_BTB_PIPE,
VEN_IC_BRAM, VEN_UOPCACHE and VEN_CACHE_HALF are the front-end
congestion stack; VEN_L1_AXI + VEN_KV260_SOC are the platform layer
(real bus, PS seams, DDR-carveout remap); and VEN_FE_PIPE is the final cut
that lets the full SoC route. Three footnotes from the setters data:
VEN_BCD_DIV100 is absent from this list (and from
apr_hc_pnr.tcl) — the ÷100 step is documented in ven_bcd.sv and
docs/fpga-synthesis.md, but no committed Tcl carries the define; the
65.3 MHz half-cache result added it ad hoc; the older
fpga/scripts/bd_kv260_soc.tcl validate-and-synth flow still carries the
pre-uopcache config (VEN_IC_NARROWB instead of
VEN_IC_BRAM/VEN_UOPCACHE/VEN_CACHE_HALF, and no
VEN_FP_PIPE2); and no committed Tcl consumes
fpga/build/periph_split.vdefs yet — the board path runs all peripherals on
the PS via ven_soc_axil, so the VEN_<DEV>_PS flags are exercised only
in the Verilator SoC builds today.
VEN_DBG_CORE — on-die debug / trace / step / breakpoint¶
A single opt-in flag arms a comprehensive forensic unit. It is OFF by
default (make verify stays 77/77 byte/cycle-identical; the FPGA close is
unchanged). Turn it on for a debug bitstream (DBG_CORE=1 in
fpga/scripts/impl_kv260_soc.tcl) or a debug sim (+define+VEN_DBG_CORE;
the l1axi_kv_dos tb target carries it).
What it adds (all observers except step/breakpoint, which only act once the PS
arms them — so even a VEN_DBG_CORE build is cycle-identical until then):
Committed-state taps — the EIP / CS / ESP / EFLAGS of the last retired instruction, the FSM micro-state (
state_e), the last exception/IRQ vector and its source EIP, and live CR0 (PE/PG = real/protected/paging). Surfaced asventium_topdebug-bundle ports (read directly bytb_main; on a stop it prints a one-line[VEN_DBG]dump naming where and why the core stopped).PC ring buffer (
ven_soc_dbg, 32 × {state,CS,EIP}, BRAM) — the last N retired PCs, read back N-back viaR_DBG_TRACE_IDX/_PC/_AUX. On a board freeze this is the instruction trail into a derail.Freeze detector — a stall counter (reset on each retire); crossing the PS-set
R_DBG_FREEZE_THlatches a snapshot (EIP/FSM/vector) + a stickyfrozenbit, distinguishing a silent stall from a clean halt.Performance counters — cycles, retired, no-retire (stall) cycles, S_IO cycles, external IRQs (
R_DBG_PERF_*). CPI = cyc/retired.Single-step / breakpoint —
R_DBG_RUNCTLparks the core at the S_DECODE instruction boundary (halt_req), grants one instruction (step), or a PC breakpoint (bp_en+R_DBG_BP_ADDR) halts at the target. The park sits below SMI/NMI/INTR (no interrupt is lost) and is modeled on the resumable S_HLTWAIT halt; in-flight loads/fills advance untouched.
The PS reads it all through the ven_soc_axil 0x80+ window (mirrored in
sw/ps/ven_soc_app/ven_soc_regs.h); ven_soc_app probes R_DBG_CAP ==
0xDB01_0020 and dbg_dump_core() prints the full state + ring on a
CPU_HUNG or a silent-stall auto-trigger. Verified: the deployed top lints +
make verify 77/77 (off); in sim --dbg-step N walks the EIP stream one
instruction at a time and --dbg-bp <eip> parks at the target.