The r4 SRT divider and the FDIV bug

The Pentium does not divide floating-point numbers with a one-shot wide divide; it runs the SRT algorithm (Sweeney–Robertson–Tocher) in base 4, two quotient bits per iteration, in a small carry-save datapath driven by a hard-wired quotient-selection table. That table is the famous PLA Ken Shirriff photographed and reverse-engineered from the die, and five missing entries in it are the FDIV bug of 1994.

Ventium ships the real algorithm as an optional, compile-time division engine (fpu_x87_pkg::fx_srt_div). With its correct PLA it produces the correctly-rounded floatx80 quotient — bit-identical to QEMU and to the default fast divider. With its buggy PLA it reproduces the FDIV flaw from first principles: no operand is special-cased, the famous wrong answer simply emerges from the missing table cells.

This page documents the algorithm, the datapath, the bug, and how Ventium models and verifies all of it. The primary references are Ken Shirriff, “Intel’s $475 million error: the silicon behind the Pentium division bug” (righto.com, Dec 2024), and Alan Edelman, “The Mathematics of the Pentium Division Bug” (SIAM Review 39(1), 1997), which formalises Tim Coe & Ping Tak Peter Tang’s model.

Why SRT, and the recurrence

Restoring/non-restoring division yields one quotient bit per step. Radix-4 SRT yields two — twice as fast — by allowing a redundant, signed digit set q {-2, -1, 0, 1, 2}. With both operands normalised to [1, 2) and p the dividend, d the divisor, the iteration is:

p_0   = p
p_k+1 = 4 * (p_k - q_k * d)          choosing q_k so that |p_k+1| <= (8/3)*d
p / d = sum_{k>=0} q_k / 4^k

The redundancy is what makes it cheap: q_k need not be computed exactly, only selected from a coarse, truncated view of the partial remainder and the divisor — that selection is the lookup table. The slack |p| <= (8/3)d (the “8/3” is base 4 minus 1, scaled) is exactly the headroom that lets a coarse choice still converge.

Note

A popular misconception is that SRT’s redundant digits let it correct mistakes. They do not. If an out-of-set q_k is ever chosen, the partial remainder leaves the representable interval and the algorithm cannot recover — which is precisely how five wrong table cells corrupt the result.

The quotient-selection PLA

The digit is chosen from a two-dimensional table indexed by a truncated divisor and a truncated partial remainder (Edelman §4):

  • Divisor D — the top four fraction bits, 1.dddd (16 columns, D = 16 + mb[62:59] in sixteenths, i.e. D/16 [1, 2)).

  • Partial remainder P — a 4-integer-bit, 3-fraction-bit field (xxxx.yyy), i.e. an index in units of 1/8.

Within a column the five quotient digits occupy five bands (+2 at the top down to -2), separated by three-cell overlap regions where either adjacent digit is a valid choice (the redundancy). Ventium encodes each column as four lower-bound thresholds and selects with a ladder — exactly the behaviour of fx_srt_pla:

q = (P >= T2) ? +2 : (P >= T1) ? +1 : (P >= T0) ? 0 : (P >= Tm1) ? -1 : -2

The 16-column threshold table is generated from Edelman’s formal table by tools/srt/srt_model.py (python3 tools/srt/srt_model.py pla re-emits the SystemVerilog case block), so the RTL table is never hand-transcribed.

The carry-save datapath

The subtle part — and the reason the bug exists at all — is how P is formed. The partial remainder is kept in ones-complement carry-save: a sum word S and a carry word C whose value is S + C. Each step (fx_srt_div):

  1. Index. Take the 4-int/3-frac field of S and of C and add them (7-bit, modular) to form P. Because each word is truncated independently and then summed, the index can read up to 1/4 low — “one cell lower than the correct cell” — and the ones-complement modular wraparound is what makes a post-error excursion land deterministically.

  2. Select q from the PLA at (P, D).

  3. Subtract. Form -q*d by shifting and, for positive q, ones- complementing d (or 2d); add it to (S, C) with a 3:2 carry-save adder. The +1 ones-complement correction is delayed and injected into the carry word’s least-significant bit on this step.

  4. Shift. Multiply the new (S, C) by 4 (left-shift two) for the next iteration.

Ventium runs NSTEP = 36 iterations (72 quotient bits), accumulates the signed digits into a quotient register, and rounds to the 64-bit floatx80 significand using the sign of the final partial remainder as the round-to-nearest tiebreaker (the hardware-realistic final rounding). With the correct PLA this is provably correctly-rounded; validated bit-exact against exact rational division over a 10 000-divide corpus.

The FDIV bug

Intel’s table had 16 omitted entries; five of them are reachable and cause the bug, eleven are unreachable and harmless. The five bad cells are the top of the +2 region (8 * P_Bad {23, 27, 31, 35, 39} in 1/8 units) in the five divisor columns whose significand begins

significand prefix

column D

8 * P_Bad

1.0001…

17/16

23

1.0100…

20/16

27

1.0111…

23/16

31

1.1010…

26/16

35

1.1101…

29/16

39

(Edelman further proves these columns are exactly the divisors with six consecutive 1 bits in positions 5–10 — the Coe–Tang result.) In those cells the silicon’s PLA returns 0 instead of +2. When a divide’s partial remainder lands there, it picks 0, the remainder jumps to roughly 10*d — far outside [-8/3 d, 8/3 d] — and, per the no-recovery property above, the final quotient is wrong, typically at the 13th significant bit (≈ 5e-5 relative worst case for operands in [1, 2)).

The bug is rare (Intel quoted ~1 in 9 billion random divides) because reaching a bad cell needs an unlucky carry-save trajectory: Edelman shows it can only be entered from the cell just below it (the “foothold”), via a run of +2 digits, and never before the ninth iteration. A triggering divisor is therefore necessary but not sufficient — most divides by such a divisor are still correct.

How Ventium models it

fx_div is a compile-time dispatcher. The defaults leave the project bit-exact vs QEMU; the SRT engine is opt-in:

Build

Division datapath

(no define)default

fx_div_exact — fast behavioural wide divide (correctly rounded).

+define+VEN_SRT_DIV

fx_srt_div with the correct PLA — the genuine SRT engine; still correctly-rounded / bit-exact vs QEMU.

+define+VEN_SRT_DIV +define+VEN_SRT_FDIV_BUG

fx_srt_div with the buggy PLA — the FDIV flaw, reproduced for all operands from first principles.

With the buggy PLA the canonical pairs flaw exactly as documented, while a triggering divisor alone does not:

operands

column

result (buggy PLA)

4195835 / 3145727

1.0111…

flawed floatx80 0x3FFF_AAB7F6392A768638 → double 0x3FF556FEC7254ED1 = 1.3337390689… (wrong at the 13th bit); bug hit at iteration 8.

5505001 / 294911

1.0001…

also flaws (the second widely-quoted pair).

7654321 / 3145727

1.0111…

clean — triggering divisor, but the trajectory never hits a bad cell.

4195835 / 3.0

1.1000…

clean — non-triggering divisor.

Verification

make verify-srt (verif/srt/) is a standalone Verilator gate, independent of the core/SoC build. It regenerates golden vectors from the single-source model tools/srt/srt_model.py — a faithful Python implementation of the same datapath, validated correctly-rounded against exact rational division — and asserts fx_srt_div is bit-exact for both PLAs over the famous FDIV pairs, the negative controls, and a random corpus (609 vectors × 2 PLAs, SRT-GATE-OK). The default tracks are unaffected: make verify 69/69 and make verify-soc 5/5 stay green.

Relationship to Erratum 23

The runtime FDIV erratum (fx_div_errata, gated by errata_en[ERR_FDIV] — see the Instruction Catalog and docs/m6-errata-spec.md) is unchanged. It returns the documented flawed value for the one published bit-exact operand pair and is the honest fallback for a model that has no oracle for the general flaw. The SRT engine is that oracle: it bit-reproduces the documented flaw (and a second published pair) with no operand special-cased — the very thing the erratum model’s own comments noted “would require bit-reproducing Intel’s exact buggy SRT iteration.” The two coexist: fx_div_errata remains the default-build, self-checking anchor; fx_srt_div is the opt-in, first-principles reproduction.