==================================================================
The FP execute/commit pipeline
==================================================================

The Pentium retires a pipelined ``FADD``/``FSUB``/``FMUL`` at **throughput 1,
latency 3**: a new arithmetic op can issue every clock, and its result becomes
available to a dependent op three clocks later. Ventium reproduces that timing
with an **emergent FP scoreboard** (``core.sv``) — the result of an arithmetic op
is *published* at ``issue + latency`` (``fadd`` = 3), and any dependent (role ≥ 2)
consumer is stalled until that cycle.

That scoreboard is also the seam that lets the FPGA build run fast. This page
documents how the FP *execute/commit* path is pipelined behind compile-time flags
without disturbing a single observable cycle — and how the second stage
(``+VEN_FP_PIPE2``) takes the half-cache KV260 (XCK26 / **K26**) build across the
**60 MHz** line.

.. contents::
   :local:
   :depth: 1

The deferred fast-arm commit (``+VEN_FP_PIPE``)
===============================================

In cycle mode the fast arm issues an arithmetic op and would, naively, compute its
whole value combinationally in the issue clock — ``f_eval(op, st0, sti)`` — and
write the x87 register file the same edge. On the FPGA that ties the **entire FADD
datapath** (operand select → align → add → normalize → round → pack) onto the
``eip → icache → decode → issue`` front-end path: one enormous cone in one clock.

``+VEN_FP_PIPE`` breaks it with a **1-deep defer**. The op's operands are *captured*
at issue (cycle N) into pipeline registers ``fpp_a/fpp_b/fpp_aluop/…`` and the
instruction retires normally; one clock later (N+1) the value is evaluated from the
**registered** operands and written to the absolute target via a dedicated
``we_wabs`` port on ``fpu_top``. The trick that makes this free:

   The scoreboard already publishes the result at ``issue + 3``. The deferred
   commit lands at ``N+1`` — well before any dependent can read it — so the
   architectural state at every retire is unchanged. The defer is **invisible to
   correctness**; it only moves where the combinational cone sits in time.

A 1-clock read-after-write *hazard* guards the one window the defer exposes: an FP
op that reads the in-flight target the same clock the deferred result is being
written would see the pre-edge stale value, so it is stalled one clock. Role ≥ 2
consumers are already scoreboard-stalled past that window, so the hazard only bites
``FXCH``/``FLDSTI``/slow ``FST``/``FCOM`` — none of which appear in the throughput
microbenchmarks, so the FP cycle bands (``mb_faddchain`` CPI ≈ 3, ``mb_fpindep`` at
throughput 1) are preserved.

The wall: a latency-1 cone P&R cannot shorten
=============================================

On the half-cache K26 build the deferred commit *itself* becomes the critical path:
``fpp → f_eval_s1 → f_eval_s2 (fx_round_pack) → fpr`` — about **80 logic levels and
43 CARRY8 in series**, ~58 % of it routing. A full timing-driven place-and-route
sweep (one synthesis, many directives, all over-constrained) lifts the routed Fmax
from the production congestion directives' 50.1 MHz to **52.6 MHz**
(``ExtraNetDelay_high`` placement) — and then stops:

* **Placement** can only redistribute the route delay on the cone; a Pblock that
  clusters the FP datapath pushes the FADD cone off the critical path but merely
  exposes the next one (the µop-cache ``store_bmap → eip`` fill path at ~51 MHz).
* **Retiming** cannot help: the deferred commit is a **latency-1** path — one
  register layer between ``fpp`` and ``fpr`` — so there is nothing to balance.
  Retiming *slides* a register along a cone; it does not *add* pipeline depth.
* **Synthesis** itself caps at 59.4 MHz on this cone.

The binder is **logic depth**, not routing — so the fix has to shorten the cone.

The 2-stage split (``+VEN_FP_PIPE2``)
=====================================

``f_eval`` is already factored into two functions: ``f_eval_s1`` (special-operand
detection + the add/sub/mul *front*, ``fx_add_s1``/``fx_mul_s1``) returning an
``fx_pipe_t`` carrier, and ``f_eval_s2`` (``fx_round_pack`` + flag assembly). The
1-stage defer composes them in **one** clock. ``+VEN_FP_PIPE2`` inserts **one
register** at that boundary:

.. graphviz::

   digraph fp2 {
     rankdir=LR; node [shape=box, fontsize=10, fontname="monospace"];
     issue   [label="issue N\ncapture fpp_*"];
     s1      [label="N+1\nf_eval_s1\n(front)"];
     reg     [label="fpp2_s1\n(fx_pipe_t reg)", shape=box, style=filled, fillcolor="#cde"];
     s2      [label="N+2\nf_eval_s2\n(round_pack)\n-> we_wabs -> fpr"];
     pub     [label="issue+3\nscoreboard\npublishes", shape=ellipse, style=filled, fillcolor="#dfd"];
     issue -> s1 -> reg -> s2 -> pub [style=dashed,label="<= in time"];
   }

The commit now lands at **N+2** instead of N+1. Because the scoreboard still
publishes at ``issue + 3``, ``N+2 ≤ N+3`` — the value is in ``fpr`` before any
dependent reads it. The cone halves: the add/mul front in one clock, the
round-pack + write in the next. The read-hazard simply widens to the now-2-deep
in-flight window (stall a reader of either the stage-0 or stage-1 target).

Cycle-safety: verified, not asserted
=====================================

The change is gated entirely under ``ifdef VEN_FP_PIPE2`` and the **default build
is byte/cycle-identical**. The split itself is proven equivalent to the validated
1-stage by ``make verify-fppipe2`` (``verif/fppipe/run-fp-pipe2-ab.sh``), which
builds two cycle testbenches differing only by the flag and runs both on the FP
cycle kernels:

* ``mb_faddchain`` (dependent FADD chain): the ``--cycle`` traces are
  **byte-identical** — the 2-stage is cycle-for-cycle the 1-stage.
* ``mb_fpindep`` (independent FADDs at throughput 1): **identical final
  architectural state** (no stale read slipped through) and **+0.09 %** total
  cycles from the one-clock-wider hazard — comfortably inside the M5 FP band.
* Both meet the M5 FP CPI bands (``mb_faddchain`` ≈ 3.0, ``mb_fpindep`` below it);
  ``make verify`` stays GREEN on the default build; the 1M-vector
  ``f_eval_s2 ∘ f_eval_s1 == f_eval`` composition gate is bit-exact.

The Fmax result (K26 half-cache)
================================

.. list-table::
   :header-rows: 1
   :widths: 30 18 18 14

   * - Metric (routed OOC, XCK26, 15 ns)
     - 1-stage ``VEN_FP_PIPE``
     - ``+VEN_FP_PIPE2``
     - Δ
   * - Fmax — synthesis (WNS @ 15 ns)
     - 59.4 MHz
     - **78.4 MHz**
     - +32 %
   * - Fmax — routed (best directive)
     - 52.6 MHz
     - **63.0 MHz**
     - +20 %
   * - CLB LUTs (synth)
     - 77,975
     - 76,700
     - −1.6 %
   * - CLB Registers
     - 25,565
     - 25,838
     - +273 FF
   * - Worst path
     - ``fpp → fx_round_pack → fpr``
     - ``u_bcd`` (FBLD/FBSTP)
     - cone gone

Splitting the cone moves the FADD datapath off the critical path entirely
(synthesis 59 → 78 MHz) and routes the half-cache build to **63.0 MHz** — clearing
the 60 MHz target on the K26 — for the price of **one ``fx_pipe_t`` stage register**
(~273 flip-flops), LUT-neutral. The worst path is now the rare iterative **BCD
engine** (``ven_bcd``, the FBLD/FBSTP packed-decimal conversion), with the
µop-cache fill-to-PC path just below it — both well above 60 MHz.

A follow-up cycle-neutral fix takes the next worst path too: ``ven_bcd``'s per-clock
step ran **two chained divide-by-10**; computing **one divide-by-100** and extracting
the two low digits from ``q % 100`` (a 0..99 value) halves that cone, bit-exact
(``make verify-bcd``). Synthesis rises to **80.6 MHz** and the routed half-cache build
to **65.3 MHz** (``ExtraNetDelay_high`` placement). The remaining wall is the
**route-bound µop-cache fill → front-end cluster** (``store_bmap → eip``/``ic_age``),
which is placement-variable — the diffuse front-end congestion that in-context
floorplanning, not a single-cone fix, addresses.

See :doc:`/microarch/l1-parametric` for the ``+VEN_CACHE_HALF`` geometry knob this
build sits on, and ``docs/fpga-synthesis.md`` for the full place-and-route sweep,
device views, and reproduction commands.