============================================ The r4 SRT divider and the FDIV bug ============================================ The Pentium does not divide floating-point numbers with a one-shot wide divide; it runs the **SRT algorithm** (Sweeney–Robertson–Tocher) in **base 4**, two quotient bits per iteration, in a small carry-save datapath driven by a hard-wired quotient-selection table. That table is the famous **PLA** Ken Shirriff photographed and reverse-engineered from the die, and five missing entries in it are the **FDIV bug** of 1994. Ventium ships the *real* algorithm as an **optional, compile-time** division engine (``fpu_x87_pkg::fx_srt_div``). With its **correct** PLA it produces the correctly-rounded ``floatx80`` quotient — bit-identical to QEMU and to the default fast divider. With its **buggy** PLA it reproduces the FDIV flaw **from first principles**: no operand is special-cased, the famous wrong answer simply *emerges* from the missing table cells. This page documents the algorithm, the datapath, the bug, and how Ventium models and verifies all of it. The primary references are Ken Shirriff, *"Intel's $475 million error: the silicon behind the Pentium division bug"* (righto.com, Dec 2024), and Alan Edelman, *"The Mathematics of the Pentium Division Bug"* (SIAM Review 39(1), 1997), which formalises Tim Coe & Ping Tak Peter Tang's model. Why SRT, and the recurrence =========================== Restoring/non-restoring division yields one quotient bit per step. **Radix-4 SRT** yields *two* — twice as fast — by allowing a **redundant**, signed digit set ``q ∈ {-2, -1, 0, 1, 2}``. With both operands normalised to ``[1, 2)`` and ``p`` the dividend, ``d`` the divisor, the iteration is:: p_0 = p p_k+1 = 4 * (p_k - q_k * d) choosing q_k so that |p_k+1| <= (8/3)*d p / d = sum_{k>=0} q_k / 4^k The redundancy is what makes it cheap: ``q_k`` need not be computed exactly, only *selected* from a coarse, truncated view of the partial remainder and the divisor — that selection is the lookup table. The slack ``|p| <= (8/3)d`` (the "8/3" is ``base 4 minus 1``, scaled) is exactly the headroom that lets a coarse choice still converge. .. note:: A popular misconception is that SRT's redundant digits let it *correct* mistakes. They do not. If an out-of-set ``q_k`` is ever chosen, the partial remainder leaves the representable interval and the algorithm **cannot recover** — which is precisely how five wrong table cells corrupt the result. The quotient-selection PLA ========================== The digit is chosen from a **two-dimensional table** indexed by a *truncated* divisor and a *truncated* partial remainder (Edelman §4): * **Divisor** ``D`` — the top four fraction bits, ``1.dddd`` (16 columns, ``D = 16 + mb[62:59]`` in sixteenths, i.e. ``D/16 ∈ [1, 2)``). * **Partial remainder** ``P`` — a **4-integer-bit, 3-fraction-bit** field (``xxxx.yyy``), i.e. an index in units of ``1/8``. Within a column the five quotient digits occupy five bands (``+2`` at the top down to ``-2``), separated by three-cell **overlap** regions where either adjacent digit is a valid choice (the redundancy). Ventium encodes each column as four lower-bound thresholds and selects with a ladder — *exactly* the behaviour of ``fx_srt_pla``:: q = (P >= T2) ? +2 : (P >= T1) ? +1 : (P >= T0) ? 0 : (P >= Tm1) ? -1 : -2 The 16-column threshold table is generated from Edelman's formal table by ``tools/srt/srt_model.py`` (``python3 tools/srt/srt_model.py pla`` re-emits the SystemVerilog ``case`` block), so the RTL table is never hand-transcribed. The carry-save datapath ======================= The subtle part — and the reason the bug exists at all — is **how** ``P`` is formed. The partial remainder is kept in **ones-complement carry-save**: a sum word ``S`` and a carry word ``C`` whose value is ``S + C``. Each step (``fx_srt_div``): #. **Index.** Take the 4-int/3-frac field of ``S`` and of ``C`` and add them (7-bit, modular) to form ``P``. Because each word is truncated *independently* and then summed, the index can read up to ``1/4`` low — "one cell lower than the correct cell" — and the ones-complement modular wraparound is what makes a post-error excursion land deterministically. #. **Select** ``q`` from the PLA at ``(P, D)``. #. **Subtract.** Form ``-q*d`` by shifting and, for positive ``q``, ones- complementing ``d`` (or ``2d``); add it to ``(S, C)`` with a 3:2 carry-save adder. The ``+1`` ones-complement correction is *delayed* and injected into the carry word's least-significant bit on this step. #. **Shift.** Multiply the new ``(S, C)`` by 4 (left-shift two) for the next iteration. Ventium runs ``NSTEP = 36`` iterations (72 quotient bits), accumulates the signed digits into a quotient register, and rounds to the 64-bit ``floatx80`` significand using the **sign of the final partial remainder** as the round-to-nearest tiebreaker (the hardware-realistic final rounding). With the *correct* PLA this is provably correctly-rounded; validated bit-exact against exact rational division over a 10 000-divide corpus. The FDIV bug ============ Intel's table had **16 omitted entries**; five of them are reachable and cause the bug, eleven are unreachable and harmless. The five bad cells are the *top* of the ``+2`` region (``8 * P_Bad ∈ {23, 27, 31, 35, 39}`` in ``1/8`` units) in the five divisor columns whose significand begins .. list-table:: :header-rows: 1 :widths: 18 14 14 * - significand prefix - column ``D`` - ``8 * P_Bad`` * - ``1.0001…`` - 17/16 - 23 * - ``1.0100…`` - 20/16 - 27 * - ``1.0111…`` - 23/16 - 31 * - ``1.1010…`` - 26/16 - 35 * - ``1.1101…`` - 29/16 - 39 (Edelman further proves these columns are exactly the divisors with six consecutive ``1`` bits in positions 5–10 — the Coe–Tang result.) In those cells the silicon's PLA returns **0 instead of +2**. When a divide's partial remainder lands there, it picks ``0``, the remainder jumps to roughly ``10*d`` — far outside ``[-8/3 d, 8/3 d]`` — and, per the no-recovery property above, the final quotient is wrong, typically at the **13th significant bit** (≈ ``5e-5`` relative worst case for operands in ``[1, 2)``). The bug is rare (Intel quoted ~1 in 9 billion random divides) because *reaching* a bad cell needs an unlucky carry-save trajectory: Edelman shows it can only be entered from the cell just below it (the "foothold"), via a run of ``+2`` digits, and never before the ninth iteration. A triggering **divisor** is therefore *necessary but not sufficient* — most divides by such a divisor are still correct. How Ventium models it ===================== ``fx_div`` is a compile-time dispatcher. The defaults leave the project bit-exact vs QEMU; the SRT engine is opt-in: .. list-table:: :header-rows: 1 :widths: 34 66 * - Build - Division datapath * - *(no define)* — **default** - ``fx_div_exact`` — fast behavioural wide divide (correctly rounded). * - ``+define+VEN_SRT_DIV`` - ``fx_srt_div`` with the **correct** PLA — the genuine SRT engine; still correctly-rounded / bit-exact vs QEMU. * - ``+define+VEN_SRT_DIV`` ``+define+VEN_SRT_FDIV_BUG`` - ``fx_srt_div`` with the **buggy** PLA — the FDIV flaw, reproduced for *all* operands from first principles. With the buggy PLA the canonical pairs flaw exactly as documented, while a triggering divisor alone does not: .. list-table:: :header-rows: 1 :widths: 26 18 56 * - operands - column - result (buggy PLA) * - ``4195835 / 3145727`` - ``1.0111…`` - flawed ``floatx80 0x3FFF_AAB7F6392A768638`` → double ``0x3FF556FEC7254ED1`` = ``1.3337390689…`` (wrong at the 13th bit); bug hit at iteration 8. * - ``5505001 / 294911`` - ``1.0001…`` - also flaws (the second widely-quoted pair). * - ``7654321 / 3145727`` - ``1.0111…`` - **clean** — triggering divisor, but the trajectory never hits a bad cell. * - ``4195835 / 3.0`` - ``1.1000…`` - clean — non-triggering divisor. Verification ============ ``make verify-srt`` (``verif/srt/``) is a standalone Verilator gate, independent of the core/SoC build. It regenerates golden vectors from the single-source model ``tools/srt/srt_model.py`` — a faithful Python implementation of the same datapath, validated correctly-rounded against exact rational division — and asserts ``fx_srt_div`` is **bit-exact for both PLAs** over the famous FDIV pairs, the negative controls, and a random corpus (**609 vectors × 2 PLAs**, ``SRT-GATE-OK``). The default tracks are unaffected: ``make verify`` 69/69 and ``make verify-soc`` 5/5 stay green. Relationship to Erratum 23 ========================== The **runtime** FDIV erratum (``fx_div_errata``, gated by ``errata_en[ERR_FDIV]`` — see the :doc:`Instruction Catalog <../isa/index>` and ``docs/m6-errata-spec.md``) is unchanged. It returns the *documented* flawed value for the one published bit-exact operand pair and is the honest fallback for a model that has **no oracle** for the general flaw. The SRT engine *is* that oracle: it bit-reproduces the documented flaw (and a second published pair) with no operand special-cased — the very thing the erratum model's own comments noted "would require bit-reproducing Intel's exact buggy SRT iteration." The two coexist: ``fx_div_errata`` remains the default-build, self-checking anchor; ``fx_srt_div`` is the opt-in, first-principles reproduction.