The FP execute/commit pipeline¶

The Pentium retires a pipelined FADD/FSUB/FMUL at throughput 1, latency 3: a new arithmetic op can issue every clock, and its result becomes available to a dependent op three clocks later. Ventium reproduces that timing with an emergent FP scoreboard (core.sv) — the result of an arithmetic op is published at issue + latency (fadd = 3), and any dependent (role ≥ 2) consumer is stalled until that cycle.

That scoreboard is also the seam that lets the FPGA build run fast. This page documents how the FP execute/commit path is pipelined behind compile-time flags without disturbing a single observable cycle — and how the second stage (+VEN_FP_PIPE2) takes the half-cache KV260 (XCK26 / K26) build across the 60 MHz line.

The deferred fast-arm commit (`+VEN_FP_PIPE`)¶

In cycle mode the fast arm issues an arithmetic op and would, naively, compute its whole value combinationally in the issue clock — f_eval(op, st0, sti) — and write the x87 register file the same edge. On the FPGA that ties the entire FADD datapath (operand select → align → add → normalize → round → pack) onto the eip → icache → decode → issue front-end path: one enormous cone in one clock.

+VEN_FP_PIPE breaks it with a 1-deep defer. The op’s operands are captured at issue (cycle N) into pipeline registers fpp_a/fpp_b/fpp_aluop/… and the instruction retires normally; one clock later (N+1) the value is evaluated from the registered operands and written to the absolute target via a dedicated we_wabs port on fpu_top. The trick that makes this free:

The scoreboard already publishes the result at issue + 3. The deferred commit lands at N+1 — well before any dependent can read it — so the architectural state at every retire is unchanged. The defer is invisible to correctness; it only moves where the combinational cone sits in time.

A 1-clock read-after-write hazard guards the one window the defer exposes: an FP op that reads the in-flight target the same clock the deferred result is being written would see the pre-edge stale value, so it is stalled one clock. Role ≥ 2 consumers are already scoreboard-stalled past that window, so the hazard only bites FXCH/FLDSTI/slow FST/FCOM — none of which appear in the throughput microbenchmarks, so the FP cycle bands (mb_faddchain CPI ≈ 3, mb_fpindep at throughput 1) are preserved.

The wall: a latency-1 cone P&R cannot shorten ¶

On the half-cache K26 build the deferred commit itself becomes the critical path: fpp → f_eval_s1 → f_eval_s2 (fx_round_pack) → fpr — about 80 logic levels and 43 CARRY8 in series, ~58 % of it routing. A full timing-driven place-and-route sweep (one synthesis, many directives, all over-constrained) lifts the routed Fmax from the production congestion directives’ 50.1 MHz to 52.6 MHz (ExtraNetDelay_high placement) — and then stops:

Placement can only redistribute the route delay on the cone; a Pblock that clusters the FP datapath pushes the FADD cone off the critical path but merely exposes the next one (the µop-cache store_bmap → eip fill path at ~51 MHz).
Retiming cannot help: the deferred commit is a latency-1 path — one register layer between fpp and fpr — so there is nothing to balance. Retiming slides a register along a cone; it does not add pipeline depth.
Synthesis itself caps at 59.4 MHz on this cone.

The binder is logic depth, not routing — so the fix has to shorten the cone.

The 2-stage split (`+VEN_FP_PIPE2`)¶

f_eval is already factored into two functions: f_eval_s1 (special-operand detection + the add/sub/mul front, fx_add_s1/fx_mul_s1) returning an fx_pipe_t carrier, and f_eval_s2 (fx_round_pack + flag assembly). The 1-stage defer composes them in one clock. +VEN_FP_PIPE2 inserts one register at that boundary:

The commit now lands at N+2 instead of N+1. Because the scoreboard still publishes at issue + 3, N+2 ≤ N+3 — the value is in fpr before any dependent reads it. The cone halves: the add/mul front in one clock, the round-pack + write in the next. The read-hazard simply widens to the now-2-deep in-flight window (stall a reader of either the stage-0 or stage-1 target).

Cycle-safety: verified, not asserted ¶

The change is gated entirely under ifdef VEN_FP_PIPE2 and the default build is byte/cycle-identical. The split itself is proven equivalent to the validated 1-stage by make verify-fppipe2 (verif/fppipe/run-fp-pipe2-ab.sh), which builds two cycle testbenches differing only by the flag and runs both on the FP cycle kernels:

mb_faddchain (dependent FADD chain): the --cycle traces are byte-identical — the 2-stage is cycle-for-cycle the 1-stage.
mb_fpindep (independent FADDs at throughput 1): identical final architectural state (no stale read slipped through) and +0.09 % total cycles from the one-clock-wider hazard — comfortably inside the M5 FP band.
Both meet the M5 FP CPI bands (mb_faddchain ≈ 3.0, mb_fpindep below it); make verify stays GREEN on the default build; the 1M-vector f_eval_s2 ∘ f_eval_s1 == f_eval composition gate is bit-exact.

The Fmax result (K26 half-cache)¶

Metric (routed OOC, XCK26, 15 ns)	1-stage `VEN_FP_PIPE`	`+VEN_FP_PIPE2`	Δ
Fmax — synthesis (WNS @ 15 ns)	59.4 MHz	78.4 MHz	+32 %
Fmax — routed (best directive)	52.6 MHz	63.0 MHz	+20 %
CLB LUTs (synth)	77,975	76,700	−1.6 %
CLB Registers	25,565	25,838	+273 FF
Worst path	`fpp → fx_round_pack → fpr`	`u_bcd` (FBLD/FBSTP)	cone gone

Splitting the cone moves the FADD datapath off the critical path entirely (synthesis 59 → 78 MHz) and routes the half-cache build to 63.0 MHz — clearing the 60 MHz target on the K26 — for the price of one ``fx_pipe_t`` stage register (~273 flip-flops), LUT-neutral. The worst path is now the rare iterative BCD engine (ven_bcd, the FBLD/FBSTP packed-decimal conversion), with the µop-cache fill-to-PC path just below it — both well above 60 MHz.

A follow-up cycle-neutral fix takes the next worst path too: ven_bcd’s per-clock step ran two chained divide-by-10; computing one divide-by-100 and extracting the two low digits from q % 100 (a 0..99 value) halves that cone, bit-exact (make verify-bcd). Synthesis rises to 80.6 MHz and the routed half-cache build to 65.3 MHz (ExtraNetDelay_high placement). The remaining wall is the route-bound µop-cache fill → front-end cluster (store_bmap → eip/ic_age), which is placement-variable — the diffuse front-end congestion that in-context floorplanning, not a single-cone fix, addresses.

See Parametric L1 Caches for the +VEN_CACHE_HALF geometry knob this build sits on, and docs/fpga-synthesis.md for the full place-and-route sweep, device views, and reproduction commands.

The FP execute/commit pipeline¶

The deferred fast-arm commit (`+VEN_FP_PIPE`)¶

The wall: a latency-1 cone P&R cannot shorten ¶

The 2-stage split (`+VEN_FP_PIPE2`)¶

Cycle-safety: verified, not asserted ¶

The Fmax result (K26 half-cache)¶

Ventium

Navigation

Related Topics

The FP execute/commit pipeline¶

The deferred fast-arm commit (+VEN_FP_PIPE)¶

The wall: a latency-1 cone P&R cannot shorten¶

The 2-stage split (+VEN_FP_PIPE2)¶

Cycle-safety: verified, not asserted¶

The Fmax result (K26 half-cache)¶

The deferred fast-arm commit (`+VEN_FP_PIPE`)¶

The wall: a latency-1 cone P&R cannot shorten ¶

The 2-stage split (`+VEN_FP_PIPE2`)¶

Cycle-safety: verified, not asserted ¶