The FP execute/commit pipeline¶
The Pentium retires a pipelined FADD/FSUB/FMUL at throughput 1,
latency 3: a new arithmetic op can issue every clock, and its result becomes
available to a dependent op three clocks later. Ventium reproduces that timing
with an emergent FP scoreboard (core.sv) — the result of an arithmetic op
is published at issue + latency (fadd = 3), and any dependent (role ≥ 2)
consumer is stalled until that cycle.
That scoreboard is also the seam that lets the FPGA build run fast. This page
documents how the FP execute/commit path is pipelined behind compile-time flags
without disturbing a single observable cycle — and how the second stage
(+VEN_FP_PIPE2) takes the half-cache KV260 (XCK26 / K26) build across the
60 MHz line.
The deferred fast-arm commit (+VEN_FP_PIPE)¶
In cycle mode the fast arm issues an arithmetic op and would, naively, compute its
whole value combinationally in the issue clock — f_eval(op, st0, sti) — and
write the x87 register file the same edge. On the FPGA that ties the entire FADD
datapath (operand select → align → add → normalize → round → pack) onto the
eip → icache → decode → issue front-end path: one enormous cone in one clock.
+VEN_FP_PIPE breaks it with a 1-deep defer. The op’s operands are captured
at issue (cycle N) into pipeline registers fpp_a/fpp_b/fpp_aluop/… and the
instruction retires normally; one clock later (N+1) the value is evaluated from the
registered operands and written to the absolute target via a dedicated
we_wabs port on fpu_top. The trick that makes this free:
The scoreboard already publishes the result at
issue + 3. The deferred commit lands atN+1— well before any dependent can read it — so the architectural state at every retire is unchanged. The defer is invisible to correctness; it only moves where the combinational cone sits in time.
A 1-clock read-after-write hazard guards the one window the defer exposes: an FP
op that reads the in-flight target the same clock the deferred result is being
written would see the pre-edge stale value, so it is stalled one clock. Role ≥ 2
consumers are already scoreboard-stalled past that window, so the hazard only bites
FXCH/FLDSTI/slow FST/FCOM — none of which appear in the throughput
microbenchmarks, so the FP cycle bands (mb_faddchain CPI ≈ 3, mb_fpindep at
throughput 1) are preserved.
The wall: a latency-1 cone P&R cannot shorten¶
On the half-cache K26 build the deferred commit itself becomes the critical path:
fpp → f_eval_s1 → f_eval_s2 (fx_round_pack) → fpr — about 80 logic levels and
43 CARRY8 in series, ~58 % of it routing. A full timing-driven place-and-route
sweep (one synthesis, many directives, all over-constrained) lifts the routed Fmax
from the production congestion directives’ 50.1 MHz to 52.6 MHz
(ExtraNetDelay_high placement) — and then stops:
Placement can only redistribute the route delay on the cone; a Pblock that clusters the FP datapath pushes the FADD cone off the critical path but merely exposes the next one (the µop-cache
store_bmap → eipfill path at ~51 MHz).Retiming cannot help: the deferred commit is a latency-1 path — one register layer between
fppandfpr— so there is nothing to balance. Retiming slides a register along a cone; it does not add pipeline depth.Synthesis itself caps at 59.4 MHz on this cone.
The binder is logic depth, not routing — so the fix has to shorten the cone.
The 2-stage split (+VEN_FP_PIPE2)¶
f_eval is already factored into two functions: f_eval_s1 (special-operand
detection + the add/sub/mul front, fx_add_s1/fx_mul_s1) returning an
fx_pipe_t carrier, and f_eval_s2 (fx_round_pack + flag assembly). The
1-stage defer composes them in one clock. +VEN_FP_PIPE2 inserts one
register at that boundary:
The commit now lands at N+2 instead of N+1. Because the scoreboard still
publishes at issue + 3, N+2 ≤ N+3 — the value is in fpr before any
dependent reads it. The cone halves: the add/mul front in one clock, the
round-pack + write in the next. The read-hazard simply widens to the now-2-deep
in-flight window (stall a reader of either the stage-0 or stage-1 target).
Cycle-safety: verified, not asserted¶
The change is gated entirely under ifdef VEN_FP_PIPE2 and the default build
is byte/cycle-identical. The split itself is proven equivalent to the validated
1-stage by make verify-fppipe2 (verif/fppipe/run-fp-pipe2-ab.sh), which
builds two cycle testbenches differing only by the flag and runs both on the FP
cycle kernels:
mb_faddchain(dependent FADD chain): the--cycletraces are byte-identical — the 2-stage is cycle-for-cycle the 1-stage.mb_fpindep(independent FADDs at throughput 1): identical final architectural state (no stale read slipped through) and +0.09 % total cycles from the one-clock-wider hazard — comfortably inside the M5 FP band.Both meet the M5 FP CPI bands (
mb_faddchain≈ 3.0,mb_fpindepbelow it);make verifystays GREEN on the default build; the 1M-vectorf_eval_s2 ∘ f_eval_s1 == f_evalcomposition gate is bit-exact.
The Fmax result (K26 half-cache)¶
Metric (routed OOC, XCK26, 15 ns) |
1-stage |
|
Δ |
|---|---|---|---|
Fmax — synthesis (WNS @ 15 ns) |
59.4 MHz |
78.4 MHz |
+32 % |
Fmax — routed (best directive) |
52.6 MHz |
63.0 MHz |
+20 % |
CLB LUTs (synth) |
77,975 |
76,700 |
−1.6 % |
CLB Registers |
25,565 |
25,838 |
+273 FF |
Worst path |
|
|
cone gone |
Splitting the cone moves the FADD datapath off the critical path entirely
(synthesis 59 → 78 MHz) and routes the half-cache build to 63.0 MHz — clearing
the 60 MHz target on the K26 — for the price of one ``fx_pipe_t`` stage register
(~273 flip-flops), LUT-neutral. The worst path is now the rare iterative BCD
engine (ven_bcd, the FBLD/FBSTP packed-decimal conversion), with the
µop-cache fill-to-PC path just below it — both well above 60 MHz.
A follow-up cycle-neutral fix takes the next worst path too: ven_bcd’s per-clock
step ran two chained divide-by-10; computing one divide-by-100 and extracting
the two low digits from q % 100 (a 0..99 value) halves that cone, bit-exact
(make verify-bcd). Synthesis rises to 80.6 MHz and the routed half-cache build
to 65.3 MHz (ExtraNetDelay_high placement). The remaining wall is the
route-bound µop-cache fill → front-end cluster (store_bmap → eip/ic_age),
which is placement-variable — the diffuse front-end congestion that in-context
floorplanning, not a single-cone fix, addresses.
See Parametric L1 Caches for the +VEN_CACHE_HALF geometry knob this
build sits on, and docs/fpga-synthesis.md for the full place-and-route sweep,
device views, and reproduction commands.