The r4 SRT divider and the FDIV bug¶
The Pentium does not divide floating-point numbers with a one-shot wide divide; it runs the SRT algorithm (Sweeney–Robertson–Tocher) in base 4, two quotient bits per iteration, in a small carry-save datapath driven by a hard-wired quotient-selection table. That table is the famous PLA Ken Shirriff photographed and reverse-engineered from the die, and five missing entries in it are the FDIV bug of 1994.
Ventium ships the real algorithm as an optional, compile-time division
engine (fpu_x87_pkg::fx_srt_div). With its correct PLA it produces the
correctly-rounded floatx80 quotient — bit-identical to QEMU and to the
default fast divider. With its buggy PLA it reproduces the FDIV flaw from
first principles: no operand is special-cased, the famous wrong answer simply
emerges from the missing table cells.
This page documents the algorithm, the datapath, the bug, and how Ventium models and verifies all of it. The primary references are Ken Shirriff, “Intel’s $475 million error: the silicon behind the Pentium division bug” (righto.com, Dec 2024), and Alan Edelman, “The Mathematics of the Pentium Division Bug” (SIAM Review 39(1), 1997), which formalises Tim Coe & Ping Tak Peter Tang’s model.
Why SRT, and the recurrence¶
Restoring/non-restoring division yields one quotient bit per step. Radix-4
SRT yields two — twice as fast — by allowing a redundant, signed digit
set q ∈ {-2, -1, 0, 1, 2}. With both operands normalised to [1, 2) and
p the dividend, d the divisor, the iteration is:
p_0 = p
p_k+1 = 4 * (p_k - q_k * d) choosing q_k so that |p_k+1| <= (8/3)*d
p / d = sum_{k>=0} q_k / 4^k
The redundancy is what makes it cheap: q_k need not be computed exactly, only
selected from a coarse, truncated view of the partial remainder and the
divisor — that selection is the lookup table. The slack |p| <= (8/3)d (the
“8/3” is base 4 minus 1, scaled) is exactly the headroom that lets a coarse
choice still converge.
Note
A popular misconception is that SRT’s redundant digits let it correct
mistakes. They do not. If an out-of-set q_k is ever chosen, the partial
remainder leaves the representable interval and the algorithm cannot
recover — which is precisely how five wrong table cells corrupt the result.
The quotient-selection PLA¶
The digit is chosen from a two-dimensional table indexed by a truncated divisor and a truncated partial remainder (Edelman §4):
Divisor
D— the top four fraction bits,1.dddd(16 columns,D = 16 + mb[62:59]in sixteenths, i.e.D/16 ∈ [1, 2)).Partial remainder
P— a 4-integer-bit, 3-fraction-bit field (xxxx.yyy), i.e. an index in units of1/8.
Within a column the five quotient digits occupy five bands (+2 at the top
down to -2), separated by three-cell overlap regions where either
adjacent digit is a valid choice (the redundancy). Ventium encodes each column
as four lower-bound thresholds and selects with a ladder — exactly the
behaviour of fx_srt_pla:
q = (P >= T2) ? +2 : (P >= T1) ? +1 : (P >= T0) ? 0 : (P >= Tm1) ? -1 : -2
The 16-column threshold table is generated from Edelman’s formal table by
tools/srt/srt_model.py (python3 tools/srt/srt_model.py pla re-emits the
SystemVerilog case block), so the RTL table is never hand-transcribed.
The carry-save datapath¶
The subtle part — and the reason the bug exists at all — is how P is
formed. The partial remainder is kept in ones-complement carry-save: a sum
word S and a carry word C whose value is S + C. Each step
(fx_srt_div):
Index. Take the 4-int/3-frac field of
Sand ofCand add them (7-bit, modular) to formP. Because each word is truncated independently and then summed, the index can read up to1/4low — “one cell lower than the correct cell” — and the ones-complement modular wraparound is what makes a post-error excursion land deterministically.Select
qfrom the PLA at(P, D).Subtract. Form
-q*dby shifting and, for positiveq, ones- complementingd(or2d); add it to(S, C)with a 3:2 carry-save adder. The+1ones-complement correction is delayed and injected into the carry word’s least-significant bit on this step.Shift. Multiply the new
(S, C)by 4 (left-shift two) for the next iteration.
Ventium runs NSTEP = 36 iterations (72 quotient bits), accumulates the signed
digits into a quotient register, and rounds to the 64-bit floatx80
significand using the sign of the final partial remainder as the
round-to-nearest tiebreaker (the hardware-realistic final rounding). With the
correct PLA this is provably correctly-rounded; validated bit-exact against
exact rational division over a 10 000-divide corpus.
The FDIV bug¶
Intel’s table had 16 omitted entries; five of them are reachable and cause
the bug, eleven are unreachable and harmless. The five bad cells are the top
of the +2 region (8 * P_Bad ∈ {23, 27, 31, 35, 39} in 1/8 units) in
the five divisor columns whose significand begins
significand prefix |
column |
|
|---|---|---|
|
17/16 |
23 |
|
20/16 |
27 |
|
23/16 |
31 |
|
26/16 |
35 |
|
29/16 |
39 |
(Edelman further proves these columns are exactly the divisors with six
consecutive 1 bits in positions 5–10 — the Coe–Tang result.) In those cells
the silicon’s PLA returns 0 instead of +2. When a divide’s partial remainder
lands there, it picks 0, the remainder jumps to roughly 10*d — far
outside [-8/3 d, 8/3 d] — and, per the no-recovery property above, the final
quotient is wrong, typically at the 13th significant bit (≈ 5e-5 relative
worst case for operands in [1, 2)).
The bug is rare (Intel quoted ~1 in 9 billion random divides) because reaching
a bad cell needs an unlucky carry-save trajectory: Edelman shows it can only be
entered from the cell just below it (the “foothold”), via a run of +2 digits,
and never before the ninth iteration. A triggering divisor is therefore
necessary but not sufficient — most divides by such a divisor are still
correct.
How Ventium models it¶
fx_div is a compile-time dispatcher. The defaults leave the project
bit-exact vs QEMU; the SRT engine is opt-in:
Build |
Division datapath |
|---|---|
(no define) — default |
|
|
|
|
|
With the buggy PLA the canonical pairs flaw exactly as documented, while a triggering divisor alone does not:
operands |
column |
result (buggy PLA) |
|---|---|---|
|
|
flawed |
|
|
also flaws (the second widely-quoted pair). |
|
|
clean — triggering divisor, but the trajectory never hits a bad cell. |
|
|
clean — non-triggering divisor. |
Verification¶
make verify-srt (verif/srt/) is a standalone Verilator gate, independent
of the core/SoC build. It regenerates golden vectors from the single-source model
tools/srt/srt_model.py — a faithful Python implementation of the same
datapath, validated correctly-rounded against exact rational division — and
asserts fx_srt_div is bit-exact for both PLAs over the famous FDIV pairs,
the negative controls, and a random corpus (609 vectors × 2 PLAs,
SRT-GATE-OK). The default tracks are unaffected: make verify 69/69 and
make verify-soc 5/5 stay green.
Relationship to Erratum 23¶
The runtime FDIV erratum (fx_div_errata, gated by
errata_en[ERR_FDIV] — see the Instruction Catalog and
docs/m6-errata-spec.md) is unchanged. It returns the documented flawed
value for the one published bit-exact operand pair and is the honest fallback for
a model that has no oracle for the general flaw. The SRT engine is that
oracle: it bit-reproduces the documented flaw (and a second published pair) with
no operand special-cased — the very thing the erratum model’s own comments noted
“would require bit-reproducing Intel’s exact buggy SRT iteration.” The two
coexist: fx_div_errata remains the default-build, self-checking anchor;
fx_srt_div is the opt-in, first-principles reproduction.