Debug Instrumentation¶
This page catalogs every debugging and bring-up feature built into Ventium: the on-die debug/trace unit, the diagnostic operating modes, the PS-side daemon controls, the host tools, and the testbench probes. These are the instruments used to localize silicon-only bugs — corruption that the idealized simulator never reproduces — down to a single instruction on a running board, and to tell timing faults from functional/synthesis faults from software faults without guesswork.
The compile-time defines that gate these features (VEN_DBG_CORE,
DBG_CORE=1, …) are cataloged in Verilog build flags; the bus/register
plumbing they ride on is in Bus & Peripheral Theory of Operation. This page documents
what each instrument observes and how to drive it.
Overview: three layers of observability¶
Ventium is debuggable at three layers, from closest-to-silicon to host-side:
- On-die (PL fabric): the VEN_DBG_CORE debug/trace unit (ven_soc_dbg) taps committed architectural state, a program-counter ring, a freeze detector, performance counters, and an AXI read-beat trace, and exposes single-step and hardware-breakpoint control. The PS reads/writes it over the AXI-Lite control window. Present only in a DBG_CORE=1 bitstream.
- PS-side software (A53): ven_soc_app (the SoC bring-up daemon) drives the on-die unit through command-line flags (breakpoint, single-step, peek, poke, flush, mode toggles) and adds ven_systrace, a non-destructive observability layer (rings, histograms, a watchdog, a SIGUSR1 snapshot).
- Host tools & testbench: vpeek reads the DDR carveout, venclk.sh sets the PL clock, and the Verilator testbench (tb_main.cpp) provides single-step, framebuffer dump, a faithful AXI read slave for beat-timing fuzzing, randomized X-initialization, and $display forensic probes.
A typical silicon-bug hunt walks down these layers: reproduce on the board, freeze/breakpoint at the fault, read the committed state and PC ring, then use the diagnostic modes (clock-lower, golden-L1, placement-swap) to classify the fault and the testbench probes to reproduce and fix it.
On-die debug/trace unit (VEN_DBG_CORE)¶
rtl/soc/ven_soc_dbg.sv is instantiated by ventium_kv260_core only
under `ifdef VEN_DBG_CORE (env DBG_CORE=1). It costs a little BRAM and
routing, so it is off for the timing-critical production close and on for
a forensic / bring-up bitstream. Probe R_DBG_CAP first: it reads
0xDB01_0020 (magic 0xDB, version 1, ring depth 0x20 = 32) when
the unit is present, and 0 otherwise.
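The presence probe can be sketched as a small decode helper. This is a minimal sketch: the field boundaries are inferred from the documented value 0xDB01_0020 and are an assumption, and the struct and function names are hypothetical rather than taken from ven_soc_regs.h.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Field split inferred from the documented value 0xDB01_0020.
 * The exact bit ranges are an assumption, not the real header. */
struct dbg_cap {
    uint8_t  magic;      /* assumed bits [31:24], expected 0xDB */
    uint8_t  version;    /* assumed bits [23:16] */
    uint16_t ring_depth; /* assumed bits [15:0], PC-ring entries */
};

#define DBG_CAP_MAGIC 0xDBu

/* Split a raw R_DBG_CAP word; returns true when the debug unit is present. */
static inline bool dbg_cap_decode(uint32_t raw, struct dbg_cap *out)
{
    assert(out != NULL);
    out->magic      = (uint8_t)(raw >> 24);
    out->version    = (uint8_t)(raw >> 16);
    out->ring_depth = (uint16_t)(raw & 0xFFFFu);
    return out->magic == DBG_CAP_MAGIC;
}
```

A caller would read R_DBG_CAP once at start-up and skip every other debug access when the helper returns false, which matches the 0 that a non-DBG_CORE bitstream returns.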
The unit is mapped in the AXI-Lite control slave (ven_soc_axil) at the HPM0
aperture base 0xA000_0000, register block 0x80+. All offsets below are
from sw/ps/ven_soc_app/ven_soc_regs.h (which must stay in lock-step with
ven_soc_axil.sv).
| Offset | Name | Access | Meaning |
|---|---|---|---|
|  |  | RO | Capability magic |
|  |  | RO | Last-retired EIP (the committed program counter). |
|  |  | RO | Last-retired CS selector. |
|  |  | RO | Last-retired ESP. |
|  |  | RO | Last-retired EFLAGS. |
|  |  | RO (live) |  |
|  |  | RO (live) | EIP that sourced the current exception/IRQ. |
|  |  | RO (live) | Live CR0. |
|  |  | RO | EIP and |
|  |  | RW | Write |
|  |  | RO |  |
|  |  | W1P |  |
|  |  | RW | Stall-cycle threshold → take a freeze snapshot ( |
|  |  | RO |  |
|  |  | RW | Hardware breakpoint EIP. |
|  |  | RW |  |
|  |  | RO | AXI read-beat trace ring: the DDR address and raw RDATA of each accepted |
Committed-state taps¶
EIP / CS / ESP / EFLAGS (0x84–0x90) are the retired
architectural state — what previously required a faithful-sim rebuild to
recover. On a halt or freeze they name exactly where the core stopped.
R_DBG_STATE (0x94) adds the live FSM micro-state, the mode bits (real /
protected / paged), the last vector, and the io_pending / cpu_hung /
frozen status in one read.
PC ring¶
A 32-deep ring of the most-recently-retired program counters (plus
{fsm,cs} aux). Walk it newest-to-oldest by writing R_DBG_TRACE_IDX
(0=newest) and reading R_DBG_TRACE_PC / _AUX (remember the 1-clock
BRAM read latency). The ring reconstructs the control-flow trail into a stall or
derail. R_DBG_TRACE_IDX[13:8] reports how many entries are valid.
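The walk described above can be sketched as follows. This is a sketch under stated assumptions: the register window is already mapped as an array of 32-bit words, and the word indices of the index / PC / aux registers are passed in by the caller because this page does not give their offsets. The struct and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Word indices of the ring registers inside the mapped window.
 * Hypothetical: the caller supplies them from ven_soc_regs.h. */
struct dbg_ring_regs {
    size_t idx; /* R_DBG_TRACE_IDX */
    size_t pc;  /* R_DBG_TRACE_PC  */
    size_t aux; /* R_DBG_TRACE_AUX */
};

/* Copy the valid ring entries, newest first, into pcs[] (and auxs[] if
 * non-NULL). Returns the number of entries copied. */
static inline size_t dbg_ring_walk(volatile uint32_t *regs,
                                   struct dbg_ring_regs r,
                                   uint32_t *pcs, uint32_t *auxs, size_t cap)
{
    assert(regs != NULL && pcs != NULL);
    size_t valid = (regs[r.idx] >> 8) & 0x3Fu; /* IDX[13:8] = valid count */
    if (valid > cap)
        valid = cap;
    for (size_t i = 0; i < valid; i++) {
        regs[r.idx] = (uint32_t)i; /* select entry, 0 = newest */
        (void)regs[r.idx];         /* dummy read-back before the data read,
                                      to cover the 1-clock BRAM latency */
        pcs[i] = regs[r.pc];
        if (auxs != NULL)
            auxs[i] = regs[r.aux];
    }
    return valid;
}
```

Printing the result in reverse order (oldest first) gives the control-flow trail in execution order.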
Freeze detector¶
Set R_DBG_FREEZE_TH (0xB8) to a stall-cycle count: if the core goes that
many cycles with no retire, the unit latches a one-shot snapshot of the EIP
and {fsm,vec} into R_DBG_FROZEN_EIP / _ST (0xA0 / 0xA4) and
sets STATE[18]. This catches a transient wedge whose state would otherwise
be overwritten before software can poll it. Clear with R_DBG_CTRL[0].
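Choosing a threshold is simple arithmetic: cycles = clock in MHz × timeout in microseconds. The helper below is a minimal sketch that assumes R_DBG_FREEZE_TH is a 32-bit cycle count (the width is not stated on this page) and saturates rather than wraps. The function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Convert a stall timeout into a cycle count for R_DBG_FREEZE_TH.
 * Assumes a 32-bit threshold register; saturates at UINT32_MAX. */
static inline uint32_t freeze_threshold_cycles(uint32_t clk_mhz,
                                               uint32_t timeout_us)
{
    assert(clk_mhz > 0);
    uint64_t cycles = (uint64_t)clk_mhz * timeout_us; /* MHz * us = cycles */
    return cycles > UINT32_MAX ? UINT32_MAX : (uint32_t)cycles;
}

/* STATE[18] is the frozen flag described above. */
static inline bool dbg_state_frozen(uint32_t state)
{
    return ((state >> 18) & 1u) != 0;
}
```

For example, a 10 ms no-retire window at the 40 MHz operating point is 400,000 cycles.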
Performance counters¶
Free-running 64-bit cycle and retired-instruction counters, plus stall / S_IO
/ IRQ counters (0xBC–0xD4). Used both for CPI measurement and as a
no-progress watchdog input. Cleared by R_DBG_CTRL[0].
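CPI over an interval is the cycle delta divided by the retired-instruction delta. A minimal sketch, assuming the two 64-bit counters have already been read into a sample struct (the struct and function names are hypothetical, and how the 64-bit values are assembled from the register window is not shown):

```c
#include <assert.h>
#include <stdint.h>

/* One snapshot of the free-running counters. */
struct perf_sample {
    uint64_t cycles;
    uint64_t retired;
};

/* Cycles per instruction between two samples. Returns a negative value
 * when nothing retired in the interval, which is also the no-progress
 * condition a watchdog would look for. */
static inline double perf_cpi(struct perf_sample a, struct perf_sample b)
{
    /* Counters only move forward unless R_DBG_CTRL[0] cleared them. */
    assert(b.cycles >= a.cycles && b.retired >= a.retired);
    uint64_t dc = b.cycles - a.cycles;
    uint64_t dr = b.retired - a.retired;
    if (dr == 0)
        return -1.0;
    return (double)dc / (double)dr;
}
```

Samples taken across a counter clear are not comparable, so a caller should discard the earlier sample after pulsing R_DBG_CTRL[0].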
Single-step and hardware breakpoint¶
R_DBG_RUNCTL (0xDC) is the execution-control register:
- Single-step: pulse halt_req | step (0x3). The core executes exactly one instruction and re-parks (halted=1). The printed committed EIP sequence is the executed instruction stream, an on-silicon instruction trace for a bug the simulator cannot reproduce.
- Hardware breakpoint: write R_DBG_BP_ADDR and set bp_en (0x4) before releasing the core’s reset, so the core halts the first time the PC reaches the address. Combine with single-step to walk the trajectory from the breakpoint forward.
Note
The single-step loop does not service S_SYSCALL_WAIT: if a step lands on
an int 0x80 the core parks waiting for the proxy and will not re-halt.
To step past a syscall, place the breakpoint after the syscall (the
run-up to the breakpoint is continuous and is serviced by the main loop) and
step from there.
AXI read-beat trace¶
RD_ADDR / RD_DATA / RD_COUNT (0xE0–0xE8) capture the DDR
address and raw RDATA of every accepted ven_axi_master read beat into a ring
indexed by R_DBG_TRACE_IDX. It is the instrument for post-morteming a
corrupted cache line fill: read back the exact {addr, data} the fabric
delivered for each beat and compare against the staged DDR image.
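The comparison against the staged image can be sketched like this. It assumes each captured beat is a 32-bit little-endian word at a byte address, and that the beats have already been read out of the ring into an array. The struct and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One captured read beat, as read back from RD_ADDR / RD_DATA. */
struct rd_beat {
    uint32_t addr; /* DDR byte address of the beat */
    uint32_t data; /* raw RDATA (assumed 32-bit, little-endian) */
};

/* Return the index of the first beat whose data differs from the staged
 * image, or -1 if every checkable beat matches. Beats that fall outside
 * the image are skipped. */
static inline long first_bad_beat(const struct rd_beat *beats, size_t n,
                                  const uint8_t *image, size_t image_len,
                                  uint32_t image_base)
{
    assert(beats != NULL && image != NULL);
    for (size_t i = 0; i < n; i++) {
        if (beats[i].addr < image_base)
            continue;
        size_t off = (size_t)(beats[i].addr - image_base);
        if (off + 4 > image_len)
            continue;
        uint32_t want = (uint32_t)image[off]
                      | (uint32_t)image[off + 1] << 8
                      | (uint32_t)image[off + 2] << 16
                      | (uint32_t)image[off + 3] << 24;
        if (beats[i].data != want)
            return (long)i;
    }
    return -1;
}
```

The index of the first mismatching beat, together with its address, points at the cache-line fill to inspect.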
Diagnostic operating modes¶
Three runtime modes (and one host knob) deliberately degrade or alter the machine to classify a fault. Each is a controlled experiment.
Golden / “L1-useless” mode (MODE.GOLDEN_L1)¶
Set bit 7 of the R_MODE register (0x10), VEN_MODE_GOLDEN_L1. When
set, ven_l1d drops all cached state — every line’s valid bit and the
registered read window (rd_window_q / rd_armed / rd_addr_q) — after
every completed core transaction, so the next access fully re-fills from the
backing and re-captures fresh. No L1 state survives a transaction.
This tanks performance (every access misses), but produces a result that is uncontaminated by L1 line-staleness or read-window bugs:
- If a fetch/data corruption persists with golden=1, the bug is not in L1’s carried state: it is downstream (the core fetch buffer / decode) or in a path shared by every access.
- If it vanishes, L1’s carried state was the cause.
Driven from the daemon with --golden (see below) or by writing R_MODE
directly; with the bit at 0, the L1 runs in normal cached operation.
Cycle (dual-issue / fast-fetch) mode (MODE.CYCLE)¶
Bit 1 of R_MODE, VEN_MODE_CYCLE. In functional mode every fetch is
routed to the slow S_FETCH path; in cycle mode the dual-issue U/V pipeline
and the fast uop-cache fetch path are used. Toggling it tells whether a
fetch/decode fault lives in the slow path, the fast path, or the shared L1 read
they both feed (a fault present in both modes is in the shared path). Driven
with --cycle.
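Both mode bits live in the same R_MODE word, so a toggle should be a read-modify-write that leaves the other bits alone. A minimal sketch using the bit positions given on this page (bit 1 and bit 7); the helper name is hypothetical and the actual register access is left to the caller:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define VEN_MODE_CYCLE     (1u << 1) /* dual-issue / fast-fetch */
#define VEN_MODE_GOLDEN_L1 (1u << 7) /* drop L1 state after each transaction */

/* Return the R_MODE value with one diagnostic bit set or cleared,
 * preserving every other bit. */
static inline uint32_t ven_mode_with(uint32_t mode, uint32_t bit, bool on)
{
    assert(bit == VEN_MODE_CYCLE || bit == VEN_MODE_GOLDEN_L1);
    return on ? (mode | bit) : (mode & ~bit);
}
```

Running the same repro with each of the four combinations of the two bits gives a small experiment matrix: the combinations in which the fault appears indicate which path carries it.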
L1 invalidate (CTRL.FLUSH_REQ)¶
A W1P pulse on the control register clears every L1 valid bit. Used after a PS
write that bypassed the L1 (the syscall proxy writes the carveout directly), and
as a debug knob (--flush-each-step) to force every fetch to miss and refill —
isolating a stale-line bug from a fill-corruption bug.
Clock control (venclk.sh)¶
Not a core feature but the key timing discriminator. sw/ps/ven_soc_app/venclk.sh
sets the PL0 clock divider via the PS CRL register 0xFF5E_00C0:
| Command | Result | Use |
|---|---|---|
|  | 40 MHz | The timing-closure / sign-off operating point. |
|  | 30 MHz | Clock-lower test. |
|  | 20 MHz | Aggressive clock-lower test (period doubled). |
|  | 40→50 MHz | Overclock smoke-test (above 40 MHz is silicon-lottery). |
Clock-lower test: if a fault is byte-identical at 20/30 MHz and 40 MHz, it is not a setup-timing violation (more period would relieve setup). A fault that only appears at higher clocks is a setup-margin/overclock effect.
PS-side daemon controls (ven_soc_app)¶
The bring-up daemon (sw/ps/ven_soc_app/ven_soc_app.c) drives the on-die unit
and the carveout. The debug-relevant flags:
| Flag | Effect |
|---|---|
|  | Arm the on-die hardware breakpoint before releasing the core; on the hit, single-step |
|  | Number of instructions to single-step after the breakpoint hit (default 80). |
|  | Pulse |
|  | After staging, hexdump the carveout at the folded guest address (and exit). Tells a staging error (wrong bytes written to DDR) from a fetch-read bug (DDR correct, core reads wrong). |
|  | Overwrite the staged carveout with a controlled instruction stream, then run / single-step. Probes the fetch/decode datapath with a known byte sequence on real silicon (e.g. a marching pattern, or a minimal repro). |
|  | Set |
|  | Set |
dbg_dump_core() prints the full debug bundle — committed EIP/CS/ESP/EFLAGS,
FSM micro-state and mode, fault vector, CR0, the freeze snapshot, the perf
counters (with CPI), and the PC ring — on a watchdog fire, a cpu_hung, a
SIGTERM, or a SIGUSR1. It is a no-op on a non-DBG_CORE bitstream
(R_DBG_CAP ≠ magic).
ven_systrace — non-destructive PS observability¶
sw/ps/ven_soc_app/ven_systrace.{c,h} is an env-gated observability layer that
does not duplicate dbg_dump_core(): a syscall/video event ring, a
heartbeat, a no-progress watchdog (fires if neither syscalls nor retire
advance for a timeout — a wedged proxy handshake or stuck core), a commit-verify
(SYS_PEND must clear after RESP_VALID), and a non-destructive
SIGUSR1 snapshot (prints syscalls / video bytes / first-frame status
without stopping the core). All recording is gated off by default so the
production hot path is unaffected.
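The no-progress rule (fire only when neither the syscall count nor the retire count has advanced for a whole timeout) can be sketched as a small polled state machine. This is an illustration of the rule as described, not the ven_systrace implementation; the names are hypothetical and the timeout is expressed as a number of polls.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct progress_wd {
    uint64_t last_syscalls;
    uint64_t last_retired;
    uint32_t idle_polls;  /* consecutive polls with no progress */
    uint32_t limit_polls; /* fire after this many idle polls */
};

/* Call once per poll interval. Returns true when the watchdog fires. */
static inline bool progress_wd_poll(struct progress_wd *w,
                                    uint64_t syscalls, uint64_t retired)
{
    assert(w != NULL);
    if (syscalls != w->last_syscalls || retired != w->last_retired) {
        /* Either counter moved: record it and restart the idle count. */
        w->last_syscalls = syscalls;
        w->last_retired  = retired;
        w->idle_polls    = 0;
        return false;
    }
    w->idle_polls++;
    return w->idle_polls >= w->limit_polls;
}
```

Requiring both counters to stall avoids a false alarm while the core spins in a long compute loop with no syscalls.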
Host tools¶
vpeek — carveout reader¶
sw/ps/ven_soc_app/vpeek.c is a non-destructive reader of the 256 MB DDR
carveout, mapped exactly like ven_soc_app (open /dev/mem with
O_RDWR | O_SYNC, mmap the whole carveout at VEN_CARVEOUT_BASE
0x4000_0000, MAP_SHARED). This matters: busybox devmem maps a single
page and returns byte-lane-shifted/duplicated garbage for this carveout, so it is
useless for reading guest memory. vpeek folds the guest address into the
carveout (addr & 0x0FFF_FFFF) and prints each word plus its four little-endian
bytes (so a byte-lane bug is unmistakable). Read-only + MAP_SHARED, so it is
safe to run alongside a halted ven_soc_app. Requires sudo:
    vpeek 0x08048000 8     # expect 464C457F 00010101 ... ("\x7fELF")
    vpeek 0x40c34674 24    # the guest stack at a halt
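The address fold and the per-byte print can be sketched as below. The mask comes from the text above; the output layout and the function names are hypothetical and are not vpeek's actual format.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define CARVEOUT_MASK 0x0FFFFFFFu /* 256 MB window: addr & 0x0FFF_FFFF */

/* Fold a guest address into a byte offset inside the carveout. */
static inline uint32_t carveout_fold(uint32_t guest_addr)
{
    return guest_addr & CARVEOUT_MASK;
}

/* Format one word followed by its four bytes in little-endian order,
 * so a byte-lane swap is visible at a glance. Returns snprintf's result. */
static inline int format_word(char *buf, size_t len,
                              uint32_t guest_addr, uint32_t word)
{
    assert(buf != NULL);
    return snprintf(buf, len, "%08X: %08X  %02X %02X %02X %02X",
                    (unsigned)guest_addr, (unsigned)word,
                    (unsigned)(word & 0xFFu),
                    (unsigned)((word >> 8) & 0xFFu),
                    (unsigned)((word >> 16) & 0xFFu),
                    (unsigned)((word >> 24) & 0xFFu));
}
```

With this layout the ELF header word 0x464C457F prints its bytes as 7F 45 4C 46, and the stack address 0x40c34674 folds to offset 0x00c34674.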
Testbench debug features (tb_main.cpp)¶
The Verilator testbench mirrors the on-die controls (so a sim repro and a board repro use the same vocabulary) and adds sim-only forensics.
| Flag / knob | Effect |
|---|---|
|  | Park the core from reset and single-step |
|  | Arm a PC breakpoint and run free until the core parks at it. |
|  | Drive |
|  | Dump the Quake framebuffer ( |
|  | Make the testbench AXI read slave faithful: mid-burst |
|  | Randomized X-initialization (the testbench passes |
Sim-only `ifdef-gated $display probes (inert by default, ignored by
synthesis):
- VEN_DERAIL_TRIP / VEN_TRIP_N (dpi_retire.cpp): a tripwire that rings the last 64 committed {n, pc, esp} and trips when the committed EIP leaves the valid code range, dumping the window around the divergence.
- VEN_DBG_L1PROBE (ven_l1d.sv): dumps {c_addr, boff, xline, rd_hit, st, rd_armed, rd_addr_q, rd_match, c_ack, c_rdata} per L1 access in a target address window. This is the L1 read-path forensic probe.
- VEN_DBG_FETCHPROBE (core_fetch_decode.svh): dumps {eip, ibuf, op0, d_len, eff_opsize} in S_DECODE. This is the instruction-length decode forensic probe.
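The derail tripwire idea (keep the last 64 retires, trip when the PC leaves the code range) can be illustrated with a small ring buffer. This is a sketch of the technique, not the code in dpi_retire.cpp; all names and the half-open range convention are assumptions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TRIP_DEPTH 64u

struct retire_rec {
    uint64_t n; /* retire sequence number */
    uint32_t pc;
    uint32_t esp;
};

struct tripwire {
    struct retire_rec ring[TRIP_DEPTH];
    uint64_t count;   /* total retires recorded */
    uint32_t code_lo; /* valid code range is [code_lo, code_hi) */
    uint32_t code_hi;
};

/* Record one retire. Returns true when the PC is outside the code range. */
static inline bool tripwire_retire(struct tripwire *t, uint32_t pc, uint32_t esp)
{
    assert(t != NULL);
    struct retire_rec *r = &t->ring[t->count % TRIP_DEPTH];
    r->n = t->count;
    r->pc = pc;
    r->esp = esp;
    t->count++;
    return pc < t->code_lo || pc >= t->code_hi;
}

/* Entry k retires ago (0 = newest), or NULL if it is no longer held. */
static inline const struct retire_rec *tripwire_ago(const struct tripwire *t,
                                                    unsigned k)
{
    assert(t != NULL);
    uint64_t held = t->count < TRIP_DEPTH ? t->count : TRIP_DEPTH;
    if (k >= held)
        return NULL;
    return &t->ring[(t->count - 1 - k) % TRIP_DEPTH];
}
```

When the tripwire fires, walking tripwire_ago from the oldest held entry to 0 prints the window leading into the divergence.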
Fault-classification playbook¶
The instruments above compose into a decision procedure for an observed corruption. In order of cheapest-first:
1. Staging vs. read: vpeek the staged region. If DDR is byte-correct, the write/staging path is clean and the fault is in the read/fetch path.
2. Reproduce minimally: --poke a small controlled instruction stream and use --bp + --steps to get a deterministic, few-instruction repro.
3. Timing vs. functional: venclk.sh set 50 (20 MHz). Byte-identical at 20 MHz ⇒ not setup timing (functional/synthesis).
4. Staleness vs. fill: --flush-each-step or MODE.GOLDEN_L1. Fault gone ⇒ L1 carried state; fault persists ⇒ downstream of L1.
5. Physical vs. structural: re-flash a different-placement bitstream of the same RTL and re-run the repro. Byte-identical across placements ⇒ structural synth-sim mismatch (not placement/hold timing).
6. Reproduce in sim: once classified, reproduce in the testbench with AXI_RGAP for beat-timing, +verilator+rand+reset for uninitialized state, or a gate-level sim of the synthesized netlist for a structural divergence, then fix and re-verify against the differential corpus.
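The classification steps of the playbook can be written down as a small decision function. This is a simplified sketch: it records each experiment's outcome as an independent flag, the names are hypothetical, and real hunts will need judgment that a truth table does not capture.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum {
    FAULT_STAGING      = 1u << 0, /* staged DDR image is itself wrong */
    FAULT_SETUP_TIMING = 1u << 1, /* fault changes when the clock is lowered */
    FAULT_L1_STATE     = 1u << 2, /* fault vanishes with L1 state dropped */
    FAULT_PLACEMENT    = 1u << 3, /* fault changes across placements */
    FAULT_STRUCTURAL   = 1u << 4  /* fault identical across placements */
};

/* Outcomes of the board experiments, in playbook order. */
struct fault_obs {
    bool ddr_image_correct;           /* step 1: vpeek */
    bool identical_at_low_clock;      /* step 3: 20 MHz run */
    bool gone_with_l1_state_dropped;  /* step 4: golden / flush-each-step */
    bool identical_across_placements; /* step 5: re-placed bitstream */
};

static inline unsigned fault_classify(const struct fault_obs *o)
{
    assert(o != NULL);
    if (!o->ddr_image_correct)
        return FAULT_STAGING; /* nothing downstream is meaningful yet */
    unsigned f = 0;
    if (!o->identical_at_low_clock)
        f |= FAULT_SETUP_TIMING;
    if (o->gone_with_l1_state_dropped)
        f |= FAULT_L1_STATE;
    f |= o->identical_across_placements ? FAULT_STRUCTURAL : FAULT_PLACEMENT;
    return f;
}
```

A result of FAULT_STRUCTURAL alone (identical at low clock, identical across placements) is the signature that sends the hunt to a gate-level simulation in step 6.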
This procedure was used to localize the F4 board-only instruction-fetch
corruption to a placement-independent structural synth-sim mismatch in the
ven_l1d read mux (a live-vs-registered byte-select split), reproduced as a
12-byte --poke program and fixed by selecting the read window with the
registered capture address.