Instruction Catalog
This is the per-instruction reference for the Ventium P5/P54C replica. Every instruction the integer and x87 cores decode is listed under its category, together with how it drives the pipeline datapath and which pipe (U or V) it may issue to.
Read the Primer and Legend first: they define the five-stage dual-issue
datapath and the U/V pairing classes (UV / PU / PV / NP) that
every table column below refers to. Each category then has a list-table
giving, per instruction, the mnemonic, encoding, U/V class, a one-line datapath
usage summary, and an honest status; prose under each table fills in what the
instruction computes and the full datapath story where it does not fit in a
cell.
Primer: the five-stage dual-issue datapath
Ventium models the Pentium’s classic in-order, dual-issue, five-stage integer pipeline:
| Stage | What happens |
|---|---|
| PF Prefetch | Instruction bytes are fetched from the L1 I-cache into the prefetch/instruction buffer; prefixes are consumed by the prefix machine. |
| D1 Decode-1 | The opcode is decoded. The fast-path decoder ( |
| D2 Decode-2 | Operands are read from the GPR file; the AGU forms effective addresses for memory operands (and is the source of AGI interlocks). |
| EX Execute | The shared single-cycle ALU / shifter / branch-resolve logic runs in whichever pipe hosts the instruction. Memory loads issue to the L1 D-cache here. |
| WB Writeback | Results are committed to the register file and EFLAGS; EFLAGS / register results are forwarded by the full |
Two parallel execution slots run this pipeline: the U pipe (the primary slot, which may lead a pair) and the V pipe (the secondary slot, which fills behind a U-pipe instruction). When two adjacent instructions satisfy the pairing rules they dual-issue and retire together (up to 2/clock); otherwise the instruction issues alone in U with the V slot idle.
Not every instruction takes this fast path. Byte-operand forms, memory
read-modify-write forms, immediate+ModR/M forms, all 0F-prefixed (two-byte)
opcodes, string primitives, the x87 escapes, and every system/microcoded op
instead decode on the slow multi-cycle FSM in core.sv
(S_DECODE → S_LOAD → S_EXEC → S_STORE and friends, plus the S_USEQ
microsequencer for multi-beat ops). The slow path is functionally identical
where the fast path also exists (it shares the same alu_result /
flags_next combinational logic), but it serializes: an instruction on the
slow FSM issues alone, holding the in-order pipe until it retires. Many
instructions are therefore “AP-500 pairable by class” yet, in this RTL, only
ever run on the slow path — the catalog records both facts honestly.
Legend: the U/V pairing classes
The Pentium optimization reference (informally “AP-500”) classifies each instruction by how it may participate in a U/V pair. Ventium uses the same four classes. The class is a property of the instruction’s datapath needs:
- UV (pairable in either pipe)
  A simple, single-cycle ALU/MOV op with no carry-chain input, no shifter, and no microcode. It can lead a pair (as the U member) or fill the V slot, subject only to the ordinary RAW/WAW/displacement-plus-immediate pairing checks. Datapath rationale: both the U and V ALUs implement the same one-cycle alu_result / flags_next logic, so either can host it.
- PU (pairable U-pipe only)
  The op may lead a pair (U member) but can never fill the V slot. Datapath rationale: it needs a resource only the U pipe provides, namely the forwarded/latched carry (ADC/SBB/RCL/RCR) or the shifter and its CF/OF latch (SHL/SHR/SAR). The V ALU has no carry forwarding and no shift unit, so an op placed there would read the stale architectural EFLAGS[0] (carry) and corrupt state, or have no shifter at all. A prefixed instruction is also U-only per AP-500 §5.6.2.3 (the prefix-decode slot is a U-pipe resource).
- PV (pairable V-pipe only)
  The op may fill the V slot (behind a leading U op) but can never lead a pair. Datapath rationale: branches. A taken branch redirects the fetch stream, so nothing can issue after it in the same clock; it must be the trailing member. This is exactly the cmp;jcc / dec;jnz special pair: a flags-writing U op forwards its result flags to the V-slot branch, which resolves against the new flags.
- NP (not pairable)
  The op always issues alone in U with V idle. Datapath rationale: it is microcoded / multi-cycle (MUL/DIV, string ops, PUSHA), a unary or two-source op outside the simple template (NEG/NOT/SHLD), a memory read-modify-write or two-access op, a system / privileged op, or simply not whitelisted by the fast-path decoder, so it falls to the serializing slow FSM.
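The four classes can be read as a small admission rule. The sketch below is a simplified Python model, not the RTL: the `Op` record, `can_pair`, and `issue_slots` are hypothetical names, only GPR RAW/WAW hazards are checked, and the displacement-plus-immediate veto, load restrictions, and flags forwarding are deliberately left out.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Op:
    name: str
    pair_class: str                      # "UV", "PU", "PV", or "NP"
    reads: frozenset = frozenset()       # GPRs read (EFLAGS is not modeled)
    writes: frozenset = frozenset()      # GPRs written

def can_pair(u, v):
    """True if u may lead in the U pipe while v fills the V slot."""
    if u.pair_class not in ("UV", "PU"):   # PV and NP ops never lead a pair
        return False
    if v.pair_class not in ("UV", "PV"):   # PU and NP ops never fill V
        return False
    if u.writes & v.reads:                 # RAW: v needs u's result
        return False
    if u.writes & v.writes:                # WAW: both write the same GPR
        return False
    return True

def issue_slots(program):
    """Greedy in-order issue: pair adjacent ops when allowed, else issue alone.

    Counts issue slots only; multi-cycle latency of NP ops is ignored.
    """
    slots, i = 0, 0
    while i < len(program):
        paired = i + 1 < len(program) and can_pair(program[i], program[i + 1])
        i += 2 if paired else 1
        slots += 1
    return slots

mov = Op("mov eax,ebx", "UV", frozenset({"ebx"}), frozenset({"eax"}))
add = Op("add ecx,edx", "UV", frozenset({"ecx", "edx"}), frozenset({"ecx"}))
adc = Op("adc esi,edi", "PU", frozenset({"esi", "edi"}), frozenset({"esi"}))
mov2 = Op("mov ebp,ebx", "UV", frozenset({"ebx"}), frozenset({"ebp"}))
mul = Op("mul ebx", "NP", frozenset({"eax", "ebx"}), frozenset({"eax", "edx"}))
cmp_ = Op("cmp eax,ecx", "UV", frozenset({"eax", "ecx"}))
jnz = Op("jnz loop", "PV")

program = [mov, add, adc, mov2, mul, cmp_, jnz]
print(issue_slots(program))   # 4: (mov,add) (adc,mov) (mul) (cmp,jnz)
```

In this model the seven instructions occupy four issue slots: the PU op leads its pair, the NP op goes alone, and the branch trails the compare.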
Note
Two distinct facts are tracked throughout. AP-500 class is the
architectural pairing class the real Pentium assigns. Realized pairing
is whether this Ventium RTL actually dual-issues the instruction on its
fast path. They often agree, but where the fast-path decoder does not
whitelist an otherwise-pairable form (so it runs on the slow FSM and
serializes), the U/V-class cell states the AP-500 class and the prose /
status notes the divergence. A status of “slow FSM” means functionally
correct but serialized; “deferred / HALTs” means the opcode is not
decoded and traps to a loud S_HALT rather than mis-execute.
INT-ALU — integer arithmetic and logic
The integer ALU group is the core of the fast path. The simple register-form
01/03-style ops are single-cycle and pairable in either pipe; the
carry-chain ops (ADC / SBB) are pinned to U; the unary ops (NEG /
NOT) and most byte / memory / accumulator-immediate forms run on the slow
multi-cycle FSM. All share the same combinational alu_result /
flags_next logic, so the slow forms are bit-identical to the fast ones —
they simply do not pair.
| Mnemonic | Encoding | U/V class | Datapath usage | Status |
|---|---|---|---|---|
|  |  | UV | Single-cycle U/V ALU_ADD; reg-form | implemented |
|  |  | UV | Same ALU, dst = reg field; | implemented |
|  |  | UV (AP-500) | Slow FSM only: no fast-path arm, so unpaired in practice. | implemented (slow FSM) |
|  |  | UV (AP-500) | Slow FSM group1; reg write or mem load→ALU→store RMW. | implemented (slow FSM) |
|  |  | UV | Canonical pairable imm form: imm-only (no disp), single-cycle U/V; mem on slow FSM. | implemented |
|  |  | PU | Carry-chain ALU_ADC; reg-form fast-pathed U-only (V has no CF forwarding). | implemented (PU) |
|  |  | PU |  | implemented |
|  |  | UV | Single-cycle ALU_SUB, no CF input; reg-form fast-path, byte/mem slow FSM. | implemented |
|  |  | UV |  | implemented |
|  |  | PU | Borrow-chain ALU_SBB; reg-form fast-pathed U-only (the ADC twin). | implemented (PU) |
|  |  | PU |  | implemented |
|  |  | UV | Single-cycle ALU_AND (CF/OF/AF cleared); reg-form fast-path. | implemented |
|  |  | UV |  | implemented |
|  |  | UV | Single-cycle ALU_OR; reg-form fast-path, byte/mem slow FSM. | implemented |
|  |  | UV |  | implemented |
|  |  | UV | Single-cycle ALU_XOR; the | implemented |
|  |  | UV |  | implemented |
|  |  | UV | ALU_SUB, EFLAGS only (no reg write); forwards flags U→V to a paired Jcc. | implemented |
|  |  | UV |  | implemented |
|  |  | UV (AP-500) | ALU_TEST, EFLAGS only; slow FSM only, so unpaired in practice. | implemented (slow FSM) |
|  |  | UV |  | implemented |
|  |  | NP | group3 slow FSM; reg-form only; memory form sets | partial |
|  |  | UV | ALU_INC (CF preserved), | implemented |
|  |  | UV | ALU_DEC (CF preserved); | implemented |
|  |  | UV (AP-500) | group4/5 slow FSM; reg write or mem RMW; unpaired in practice. | implemented (slow FSM) |
|  |  | NP | Unary ALU_NEG on the slow FSM; never paired (matches AP-500 NP). | implemented (slow FSM, NP) |
|  |  | NP | Unary ALU_NOT, no flags, on the slow FSM; never paired. | implemented (slow FSM, NP) |
- ADD (ADD r/m,r 00/01; ADD r,r/m 02/03)
  Computes dst <- dst + src and sets all six status flags (CF/PF/AF/ZF/SF/OF). alu_op = ALU_ADD; the result is a+b, CF is the carry-out at the operand width, and OF is signed overflow. Datapath: the register-form 32-bit op (01/03 with mod==11) is the canonical fast-path case. D1 decodes it as simple, the pairing checker admits it (simple, no displacement+immediate, no RAW/WAW), D2 reads both GPRs, EX runs the shared single-cycle alu_result / flags_next in whichever pipe hosts it, and WB commits the register and EFLAGS. Full EX→EX and WB→EX bypass lets a dependent ADD chain run at 1/clock; it is UV because it needs no carry input, so either ALU can host it. The byte forms (00/02) and all memory-operand forms drop to the slow FSM (load+ALU, or load→ALU→store RMW): functionally identical, but serialized.
- ADD with immediate (04/05 acc; 80/81 /0; 83 /0)
  dst <- dst + imm, same flags. Only the 83 /0 sign-extended-imm8 register form is fast-pathed and pairable: it is imm-only (no displacement), so the disp_imm pairing veto does not fire, and it runs single-cycle in U or V. The accumulator forms (04/05) and the full-width group1 immediate forms (80/81) have no fast-path arm; they decode only on the slow FSM and therefore issue alone, even though AP-500 rates them UV.
- ADC / SBB, the carry-chain ops (10–13, 18–1B, group1 /2, /3)
  ADC: dst <- dst + src + CF; SBB: dst <- dst - src - CF. These are the multi-word add/subtract primitives, and they are PU (U-pipe only). This is the central AP-500 finding the project grounds on: the carry input cfin comes from EFLAGS[0], and only the U pipe forwards/latches the carry. The decoder makes this explicit: for ADC/SBB it sets pairs_first=1 but pairs_second=0 (via pairs_second = !(b0[5:3]==010 || ==011)), with the RTL comment stating that “the V ALU path has no CF forwarding, so an adc/sbb in V would consume the STALE architectural carry and corrupt arch state.” So an ADC may lead a pair (U member) provided the V slot holds a non-ADC/SBB op, but can never fill V. The 83 /2 and 83 /3 reg forms are fast-pathed PU; the accumulator (14/15, 1C/1D) and group1 immediate forms run on the slow FSM with the same stale-carry rationale.
- SUB / AND / OR / XOR (28–2B etc.)
  Simple ALU ops with no carry input, so all UV. SUB is a-b with borrow → CF and signed-subtract overflow → OF, setting all six flags. AND/OR/XOR force CF=OF=AF=0 and set ZF/SF/PF from the result. Reg-reg forms (29/2B, 21/23, 09/0B, 31/33) are fast-pathed; byte and memory forms run on the slow FSM. Note that xor eax,eax still reads and writes eax, so it will not fill V behind a U op that writes eax (RAW/WAW masks), but is otherwise UV.
- CMP / TEST (38–3B, 3C/3D, 83 /7; 84/85, A8/A9, F6/F7 /0,/1)
  CMP computes a-b like SUB but discards the result and writes only EFLAGS (the register write mask is 0). It is UV, and is the U member of the cmp;jcc special pair: when CMP leads and a Jcc fills V, the core computes u_flags_eff and forwards the new result flags so the paired branch resolves against them. The 39/3B and 83 /7 reg forms are fast-pathed with this U→V flags forwarding; other forms run on the slow FSM. TEST is a non-storing AND (ALU_TEST, reusing the AND result), EFLAGS only. Only the A9 eAX,imm32 form is fast-pathed UV; 84/85 and A8 run on the slow FSM. TEST r/m,imm (F6/F7 /0,/1) is NP per AP-500 (only the reg,reg / mem,reg / imm,acc forms are UV); its register form runs on the slow FSM, and its memory form is marked d_unknown and deferred.
- INC / DEC (40+r, 48+r; FE/FF /0,/1)
  dst <- dst ± 1 with CF preserved. Unlike the unary NEG/NOT below, INC/DEC are UV. The ALU reuses a+b with the second operand forced to 1; the ALU_INC / ALU_DEC flag arms keep CF unchanged and set OF/AF/ZF/SF/PF. The 32-bit 40+r / 48+r encodings are 1-byte and fast-pathed (a 1-byte first member is always allowed to pair per the I-cache-split exception). dec;jnz forwards its flags U→V exactly like CMP (the loop idiom). The FE/FF r/m forms (including the byte INC/DEC) decode only on the slow FSM (reg write or mem RMW) and so issue alone here.
- NEG / NOT (F6/F7 /3, F6/F7 /2)
  Unary ops, both NP. AP-500 explicitly classes them not-pairable despite looking ALU-like, and Ventium has no fast-path arm for them, so they always serialize on the group3 slow FSM. NEG is 0 - dst (two’s-complement) and sets CF=(dst≠0) plus OF/AF/ZF/SF/PF. NOT is ~dst (one’s-complement) and affects no flags (d_writes_flags stays 0). Both have a reg form (writes the GPR) and a mem form (load→modify→store RMW).
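The flag rules above (carry-out at the operand width, signed overflow, CF as the ADC/SBB input) can be checked with a small behavioral model. This is an illustrative Python sketch, not the RTL's alu_result / flags_next logic; the function names `alu_add`, `alu_sub`, and `add64` are hypothetical.

```python
def _common_flags(res, width):
    """ZF/SF/PF from the result; PF covers only the low byte."""
    sign = 1 << (width - 1)
    return {
        "ZF": int(res == 0),
        "SF": int(bool(res & sign)),
        "PF": int(bin(res & 0xFF).count("1") % 2 == 0),
    }

def alu_add(a, b, cf_in=0, width=32):
    """ADD (cf_in=0) or ADC (cf_in=CF). Returns (result, flags)."""
    mask, sign = (1 << width) - 1, 1 << (width - 1)
    a, b = a & mask, b & mask
    full = a + b + cf_in
    res = full & mask
    flags = _common_flags(res, width)
    flags["CF"] = int(full > mask)                    # carry out of the top bit
    flags["AF"] = ((a ^ b ^ res) >> 4) & 1            # carry out of bit 3
    flags["OF"] = int(bool(~(a ^ b) & (a ^ res) & sign))
    return res, flags

def alu_sub(a, b, cf_in=0, width=32):
    """SUB/CMP (cf_in=0) or SBB (cf_in=CF): a - b - cf_in."""
    mask, sign = (1 << width) - 1, 1 << (width - 1)
    a, b = a & mask, b & mask
    res = (a - b - cf_in) & mask
    flags = _common_flags(res, width)
    flags["CF"] = int(a < b + cf_in)                  # borrow
    flags["AF"] = ((a ^ b ^ res) >> 4) & 1
    flags["OF"] = int(bool((a ^ b) & (a ^ res) & sign))
    return res, flags

def add64(a_lo, a_hi, b_lo, b_hi):
    """Multi-word add: ADD the low halves, then ADC the high halves."""
    lo, f = alu_add(a_lo, b_lo)
    hi, f = alu_add(a_hi, b_hi, cf_in=f["CF"])
    return lo, hi, f

print(add64(0xFFFFFFFF, 0, 1, 0)[:2])   # (0, 1): the carry ripples into the high word
```

The `add64` helper shows why ADC depends on the previous instruction's CF, which is the dependency the PU classification protects.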
DATA-MOV — data movement
Data movement spans the simplest pairable op in the machine (NOP) and the
most microcoded (segment-register loads). The plain register MOV and
load-immediate forms are fast-pathed UV; register-base loads and LEA [base]
are fast-pathed but interact with the AGU (and the AGI interlock); MOVZX /
MOVSX / XCHG / LAHF / CBW are NP slow-FSM ops; segment-register
MOV is a microcoded system path; and XLAT is undecoded.
| Mnemonic | Encoding | U/V class | Datapath usage | Status |
|---|---|---|---|---|
|  |  | UV | Reg-form | implemented |
|  |  | UV |  | implemented |
|  |  | UV |  | implemented |
|  |  | UV (AP-500) | No fast-path arm → slow FSM only (the disp+imm form the checker excludes); issues alone. | implemented (slow FSM) |
|  |  | UV (+ erratum) | Slow-FSM functional; | implemented |
|  |  | NP | System path ( | partial |
|  |  | NP | System path ( | partial |
|  |  | NP | 0F-prefixed slow FSM ( | implemented (slow FSM, NP) |
|  |  | NP | Read-modify-write swap ( | implemented (slow FSM, NP) |
|  |  | UV | Zero side-effect 1-byte op; fast-pathed, pairs in either slot with anything. | implemented |
|  |  | UV |  | implemented |
|  |  | NP | Dedicated flags↔AH transfer ( | implemented (slow FSM, NP) |
|  |  | NP | Accumulator sign-extend convert ( | implemented (slow FSM, NP) |
|  |  | NP | Table-lookup load; not decoded; falls to | deferred / HALTs |
| Segment-override prefixes |  | PU/NP | Prefix machine redirects the next memory ref’s segment; prefixed op stays on slow FSM. | implemented (prefix) |
| Operand/address-size prefixes |  | PU/NP | Prefix machine folds size into | implemented (prefix) |
- MOV register and immediate forms (88/89, 8A/8B, B0-BF)
  Copies a value with no flag effect: alu_op = ALU_MOV returns the source operand straight through the ALU. The register-form 89 (store reg→reg, mod==11) and 8B (load reg←reg) are fast-pathed UV: D2 reads the source GPR, EX passes it through the U or V ALU, and WB writes the destination, single-cycle with full bypass and no flags. B8+r (imm32) is the only fast-pathed load-immediate: it takes the literal straight to WB and, reading no operand, is never an AGI source. Sub-32-bit destinations use reg_merge to preserve the unwritten bytes (high-8 AH..BH via d_dst_high8 / d_src_high8). All byte (88/8A/B0-B7), 16-bit, and memory-destination forms decode only on the slow FSM (K_ALU / S_STORE) and serialize. MOV r/m,imm (C6/C7) has no fast-path arm at all (its often disp+imm encoding is exactly what the pairing checker excludes), so it is slow-FSM-only.
- MOV r, r/m as a load (8B, mod==00)
  The register-base load sub-form is a real L1 D-cache access. decode.sv sets is_load=1, base=rm, addr_mask=onehot(rm); D2’s AGU forms the address from gpr[base], EX issues the load, WB writes the destination, and the 2-way LRU hit/miss state machine defers a miss penalty (P5_DMISS, plus P5_MISALIGN for a split) to the next instruction. Because a V-slot load is forbidden by the conservative issue_uv checker (v.is_load ⇒ no pair), this load is UV only when leading a pair (a U-member load); it cannot fill V. AGI: if base was written in the immediately preceding clock, pipe_agi fires a 1-cycle stall. Any disp / SIB load (and the byte 8A) goes to the slow FSM.
- MOV moffs (A0-A3) and the Erratum-59 model
  Absolute-displacement MOV between (e)AX and the memory at a 32-bit moffs (no ModR/M; the 4-byte displacement is the EA). Functional execution is slow-FSM only (A0/A1 load AL/eAX; A2/A3 store them). The fast-path decoder recognises A2/A3 only in cycle mode (is_moffs, reads EAX) purely to model the retire/pairing and Erratum 59: with errata_en[ERR_MOFFS] set, a moffs store fails to pair when the following (V) instruction references EAX, a false EAX dependency the modeled P5 instruction unit injects (the moffs_falsedep check suppresses pairing). With errata off the core pairs them normally.
- MOV to/from segment registers (8C, 8E)
  NP: AP-500 excludes seg-register MOV from the UV data-MOV class. A selector access is a microcoded system datapath, not the simple ALU. 8C (SYS_MOVSREG_FROM) writes the zero-extended 16-bit selector of Sreg into r/m16; 8E (SYS_MOVSREG_TO) loads Sreg from r/m16 (real/flat mode: base=sel<<4, limit=0xFFFF, attr=0x93; the protected-mode descriptor-load fault is computed but delivery is a later milestone, so a fault can only HALT). Both are reg-form-only; a memory operand raises d_unknown and HALTs.
- MOVZX / MOVSX (0F B6/B7/BE/BF)
  NP: AP-500 lists these as not-pairable (0F-prefixed, 3+ cycles), and the two-byte 0F escape is not in the fast-path decoder, so they always run on the slow FSM (K_EXT). They load r/m8 or r/m16 into a 16/32-bit register, zero-extended (B6/B7) or sign-extended (BE/BF); d_ext_srcw selects source width and d_ext_signed the extension, with reg_merge into the destination at the operand width. No flags; high-8 byte sources are handled via d_src_high8.
- XCHG / NOP (86/87, 90+r, 90)
  XCHG is NP: a read-modify-write swap (two register/memory writes, an implicit LOCK on memory forms), not a simple single-write ALU op, so it runs on the slow FSM (K_XCHG). The reg-form cross-writes both GPRs; the mem-form does a locked load+store. NOP (90 with no 0x66) is the exception: it is UV, is_nop=1, fast-pathed, and writes nothing (no register, memory, or flag), so it pairs trivially in either slot; being 1-byte it also satisfies the I-cache-split pairing exception.
- LEA (8D)
  UV: LEA computes an effective address in the AGU and writes it to a register without a memory access or flag write, so it slots cleanly into U or V. Only the simplest [base] form (mod==00, rm not 100/101) is fast-pathed: is_lea, base=rm, addr_mask=onehot(rm), and the U/V commit writes gpr[dst] <= gpr[base] directly (EA == base value), single-cycle with no memory port used. AGI: addr_mask drives pipe_agi, so an LEA whose base was written the prior clock takes the 1-cycle AGI stall (it consumes the AGU). Full SIB/disp/index LEA runs on the slow FSM (gpr[dst] <= q_ea).
- LAHF / SAHF and CBW/CWDE / CWD/CDQ (9F/9E, 98/99)
  All NP. LAHF copies EFLAGS[7:0] into AH; SAHF writes the five status flags back from AH (K_STKMISC, no memory). 98 sign-extends AL→AX (CBW) or AX→EAX (CWDE); 99 sign-extends AX→DX:AX (CWD) or EAX→EDX:EAX (CDQ) (K_CONV). All are microcoded converts / flag transfers on the slow FSM, serializing.
- XLAT (D7)
  NP and not decoded: a table-lookup load (AL <- [(E)BX + AL], implicit addressing) that has no opcode arm at all; D7 falls through to the one-byte default d_unknown and HALTs loudly, so it never reaches an execute datapath.
- Prefixes (segment-override 2E/36/3E/26/64/65; 66/67)
  A prefix is PU on the slow path (AP-500 §5.6.2.3 makes a prefixed instruction U-only-pairable) but effectively NP in this RTL, because the fast-path decoder only recognises unprefixed opcodes, so any prefixed instruction misses the fast path and serializes. The segment overrides record d_pfx_seg_en / d_pfx_seg_idx and redirect the next memory reference’s segment in the AGU. 0x66 / 0x67 toggle operand / address size (with the real-mode def16 inversion, so 0x66 selects 32-bit there), feeding d_w and the length functions across every decode arm; they compute nothing themselves, but the instruction they prefix then executes on the slow FSM at the chosen width.
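The extension and partial-write behavior described for MOVZX/MOVSX, CBW/CWDE and sub-32-bit MOV can be sketched as follows. This is an assumed Python model for illustration only: `reg_merge` borrows the RTL's name, but its signature and the helpers `sign_extend` and `zero_extend` are hypothetical.

```python
MASK32 = 0xFFFFFFFF

def zero_extend(value, from_bits):
    """MOVZX-style: keep the low from_bits, upper bits become 0."""
    return value & ((1 << from_bits) - 1)

def sign_extend(value, from_bits, to_bits=32):
    """MOVSX/CBW-style: replicate bit from_bits-1 up to to_bits."""
    value &= (1 << from_bits) - 1
    if value & (1 << (from_bits - 1)):
        value -= 1 << from_bits
    return value & ((1 << to_bits) - 1)

def reg_merge(old, new, width, high8=False):
    """Write `new` into part of a 32-bit register, preserving the other bytes.

    width is 8, 16 or 32; high8 selects the AH/CH/DH/BH byte lane.
    """
    if high8:
        return (old & 0xFFFF00FF) | ((new & 0xFF) << 8)
    mask = (1 << width) - 1
    return (old & ~mask & MASK32) | (new & mask)

eax = 0x12345680
print(hex(sign_extend(0x80, 8)))                             # MOVSX r32, r/m8
print(hex(zero_extend(0x80, 8)))                             # MOVZX r32, r/m8
print(hex(reg_merge(eax, sign_extend(eax, 8, 16), 16)))      # CBW: AL -> AX
print(hex(reg_merge(eax, 0xAB, 8, high8=True)))              # MOV AH, 0xAB
```

The CBW line shows why a merge step is needed: only AX changes, while the upper half of EAX is preserved.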
STACK — push, pop, and frame management
The stack group exercises the store/load AGU with the implicit ESP
register. A subtlety threads the whole group: the decoder masks ESP out
of the reads/writes bitmasks (_onehot returns 0 for R_ESP), so a
push/push or push/call sequence never trips a false ESP RAW/WAW —
the AP-500 §5.6.4 special-pair rule. Consequently PUSH/POP of a register
or immediate are UV by class. However, none of the stack ops are
whitelisted by the fast-path decoder (decode.sv does not decode them), so in
this Ventium they all run on the slow FSM and do not emergently dual-issue
— the catalog records the AP-500 class, not a realized pairing. The
memory-form, multi-register, and flags/segment forms are genuinely NP /
microcoded.
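The ESP-masking trick can be illustrated with one-hot register masks. The sketch below is a Python approximation under stated assumptions: `onehot` mirrors the described behavior of the decoder's _onehot (returning 0 for R_ESP), while `naive_onehot`, `push_masks`, and `has_hazard` are hypothetical helpers invented for the comparison.

```python
R_EAX, R_ECX, R_EDX, R_EBX, R_ESP, R_EBP, R_ESI, R_EDI = range(8)

def onehot(reg):
    """One-hot GPR mask with ESP deliberately masked out."""
    return 0 if reg == R_ESP else 1 << reg

def naive_onehot(reg):
    """Unmasked variant, shown only for contrast."""
    return 1 << reg

def push_masks(src, mask_fn):
    """(reads, writes) bitmasks for PUSH src: reads src and ESP, writes ESP."""
    return mask_fn(src) | mask_fn(R_ESP), mask_fn(R_ESP)

def has_hazard(u, v):
    """RAW or WAW between a leading op u and a trailing op v."""
    u_reads, u_writes = u
    v_reads, v_writes = v
    return bool((u_writes & v_reads) | (u_writes & v_writes))

# push eax ; push ebx
masked = has_hazard(push_masks(R_EAX, onehot), push_masks(R_EBX, onehot))
naive = has_hazard(push_masks(R_EAX, naive_onehot), push_masks(R_EBX, naive_onehot))
print(masked, naive)   # False True
```

With ESP masked, back-to-back pushes show no register dependency; without the mask, the implicit ESP update would register as both a RAW and a WAW hazard.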
| Mnemonic | Encoding | U/V class | Datapath usage | Status |
|---|---|---|---|---|
|  |  | UV (AP-500) | Pre-decrement ESP store; SS-based AGU; slow-FSM single store (not fast-pathed). | implemented (slow FSM) |
|  |  | UV (AP-500) | Load from [SS:ESP] then ESP += w; slow-FSM single load (not fast-pathed). | implemented (slow FSM) |
|  |  | UV (AP-500) | Pre-decrement ESP store of the latched immediate; slow-FSM single store. | implemented (slow FSM) |
|  |  | NP | Two-access op (operand load + stack store); slow FSM, issues alone. | implemented (slow FSM, NP) |
|  |  | NP | Stack load + store-to-EA (two memory accesses); slow FSM, issues alone. | implemented (slow FSM, NP) |
|  |  | NP | 8-beat micro-sequence ( | implemented (microcoded) |
|  |  | NP | 8-beat ascending load micro-sequence ( | implemented (microcoded) |
|  |  | NP | Stores EFLAGS as the datum (serializing implicit source); slow-FSM single store. | implemented (slow FSM, NP) |
|  |  | NP | Masked EFLAGS rewrite (may set TF/IF/DF); slow-FSM load + flag write. | implemented (slow FSM, NP) |
|  |  | NP | Fused ESP←EBP then POP EBP (EBP-based load); slow FSM, issues alone. | implemented (slow FSM, NP) |
|  |  | NP | Microcoded frame-build; not decoded; | deferred / HALTs |
|  |  | NP | Segment-selector push; not decoded; | deferred / HALTs |
|  |  | NP | Descriptor-reloading segment pop; not decoded; | deferred / HALTs |
- PUSH / POP register and immediate (50+r, 58+r, 68/6A)
  PUSH pre-decrements ESP by the operand width then stores the source to [SS:ESP]; POP loads from [SS:ESP] into the destination then post-increments ESP (a POP into ESP itself suppresses the +w). PUSH imm stores the decode-latched immediate (6A sign-extends its imm8). Datapath: the AGU forms the descending store address gpr[ESP] - w (pre-decrement adder in D2) or the ascending load address gpr[ESP], with the SS base applied; ESP is rewritten at WB. These are UV by AP-500 class (ESP masked from contention), but because decode.sv does not decode them, simple stays 0 and S_PIPE hands them to the slow FSM (S_DECODE → S_EXEC → S_STORE, or → S_LOAD → S_EXEC), a single ~1-cycle memory access that serializes. So they are pairable by class but do not emergently dual-issue in this RTL.
- PUSH / POP r/m (FF /6, 8F /0)
  Both NP. PUSH r/m32 for a memory operand needs two memory accesses (load [EA], then store to [SS:ESP]), which exceeds the single-cycle template, so it serializes alone in U (the register form routes through the same NP slow path). POP r/m32 likewise pops the stack word and stores it to a memory destination (load-from-stack + store-to-EA), a two-memory-access op. Both run multi-cycle on the slow FSM.
- PUSHA / POPA (60, 61)
  NP / microcoded. PUSHA pushes all eight GPRs in order (EAX, ECX, EDX, EBX, original ESP, EBP, ESI, EDI) as 8 sequential stores over the S_USEQ micro-sequencer; the original ESP is latched into pusha_esp on entry so every descending address (and the pushed ESP slot) is computed off the fixed value, and at step 7 it commits ESP <= original - 32. POPA is the inverse: 8 ascending loads into EDI, ESI, EBP, (skip ESP), EBX, EDX, ECX, EAX, then ESP += 32. Each is ~8+ cycles gated on mem_ack and holds the in-order pipe for the whole run.
- PUSHF / POPF (9C, 9D)
  Both NP. PUSHF reads the architectural EFLAGS as its store datum (a serializing implicit source; the V pipe has no EFLAGS forwarding for this) and writes it to [SS:ESP]-w. POPF loads a word and writes it into EFLAGS under the user-mode writability mask (0x00244DD5: CF|PF|AF|ZF|SF|TF|DF|OF|NT|AC|ID; IF/IOPL/VM/RF preserved, bit 1 forced 1), then ESP += w. Because POPF can set TF, the pipe carries the issue-time flags for the step-trap decision.
- LEAVE / ENTER (C9, C8)
  LEAVE is NP: a fused two-step frame teardown (ESP <- EBP, then POP EBP) with an EBP-based load (SM_LEAVE: read [EBP], write EBP and ESP <= old-EBP + slot), serializing on the slow FSM (both 32-bit and 66h 16-bit forms). ENTER is NP and not implemented: opcode 0xC8 has no decode arm, so it hits d_unknown and HALTs loudly; there is no frame-build / display-copy micro-sequence.
- PUSH / POP segment registers (06/0E/16/1E, 07/17/1F, 0F A0/A1/A8/A9)
  NP by class and unimplemented: none of these one-byte or 0F-prefixed forms have a decode arm, so they resolve to d_unknown and HALT. A segment-register push is a special-source store the P5 does not pair, and a segment pop triggers a descriptor reload (a microcoded system action); neither datapath exists for these opcodes here.
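The PUSHA/POPA ordering and the POPF writability mask can be exercised with a toy model. The Python below is a sketch under assumptions: memory is a plain dict of 32-bit words keyed by address, only the 32-bit forms are modeled, and the function names are hypothetical. The `base` local plays the role the text assigns to the latched pusha_esp.

```python
MASK32 = 0xFFFFFFFF
PUSHA_ORDER = ["EAX", "ECX", "EDX", "EBX", "ESP", "EBP", "ESI", "EDI"]
POPF_USER_MASK = 0x00244DD5   # CF|PF|AF|ZF|SF|TF|DF|OF|NT|AC|ID

def pusha(regs, mem):
    """Eight descending stores computed off the ESP value latched on entry."""
    base = regs["ESP"]
    for step, name in enumerate(PUSHA_ORDER):
        addr = (base - 4 * (step + 1)) & MASK32
        mem[addr] = base if name == "ESP" else regs[name]
    regs["ESP"] = (base - 32) & MASK32          # committed once, at the end

def popa(regs, mem):
    """Eight ascending loads in reverse order; the ESP slot is skipped."""
    base = regs["ESP"]
    for step, name in enumerate(reversed(PUSHA_ORDER)):
        value = mem[(base + 4 * step) & MASK32]
        if name != "ESP":
            regs[name] = value
    regs["ESP"] = (base + 32) & MASK32

def popf(eflags, popped):
    """Masked EFLAGS rewrite: unmasked bits keep their value, bit 1 reads as 1."""
    return ((eflags & ~POPF_USER_MASK) | (popped & POPF_USER_MASK) | 0x2) & MASK32

regs = {"EAX": 1, "ECX": 2, "EDX": 3, "EBX": 4,
        "ESP": 0x1000, "EBP": 6, "ESI": 7, "EDI": 8}
saved, mem = dict(regs), {}
pusha(regs, mem)
print(hex(regs["ESP"]), hex(mem[0x1000 - 20]))   # new ESP, pushed original ESP
for name in PUSHA_ORDER:
    if name != "ESP":
        regs[name] = 0                            # clobber, then restore
popa(regs, mem)
print(regs == saved)
```

Latching the entry ESP keeps every store address independent of the in-flight ESP update, which is why the pushed ESP slot holds the original value.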
Note
LAHF / SAHF (9F/9E) share the K_STKMISC micro-op group
and decode path with the stack ops, but touch no memory and do not
move ESP. They are NP flag↔AH transfers (LAHF: AH <- EFLAGS[7:0];
SAHF rebuilds the five status flags from AH), implemented on the slow
FSM. They are documented in full under DATA-MOV — data movement.
SHIFT-BIT — shifts, rotates, and bit operations
The shift group is the home of the PU class. The immediate-count
SHL/SHR/SAR register forms (C1 /4-/7) are fast-pathed but
U-pipe only: the shifter and its CF/OF latch live on the U-pipe EX datapath,
and the V ALU has neither, so a shift may lead a pair but never fill V. The
by-1 (D0/D1) and by-CL (D2/D3) forms, the rotates, the
double-precision SHLD/SHRD, the bit-test family, BSF/BSR,
BSWAP, and SETcc are all slow-FSM-only (and the 0F-prefixed ones are NP
by class), so they serialize.
| Mnemonic | Encoding | U/V class | Datapath usage | Status |
|---|---|---|---|---|
|  |  | PU | Reg-form fast-pathed U-only single-cycle shifter; mem-form slow | implemented |
|  |  | PU | Reg-form fast-pathed U-only; mem-form slow RMW. | implemented |
|  |  | PU | Reg-form fast-pathed U-only (sign-replicating | implemented |
|  |  | PU (AP-500) | Not fast-pathed → slow | implemented (slow FSM) |
|  |  | NP | Implicit CL read + variable latency; slow | implemented (slow FSM, NP) |
|  |  | PU/NP (AP-500) | Intentionally not fast-pathed (richer OF); all forms slow | implemented (slow FSM) |
|  |  | PU/NP (AP-500) | Carry-through rotate; slow | implemented (slow FSM) |
|  |  | NP | Double-precision two-source shift; slow | implemented (reg dst) |
|  |  | NP | Double-precision right; slow | implemented (reg dst) |
|  |  | NP | Bit test → CF, no write; slow | implemented (reg dst) |
|  |  | NP | Test-and-set; slow | implemented (reg dst) |
|  |  | NP | Test-and-reset; slow | implemented (reg dst) |
|  |  | NP | Test-and-complement; slow | implemented (reg dst) |
|  |  | NP | Forward bit-scan (priority encoder); slow | implemented (slow FSM, NP) |
|  |  | NP | Reverse bit-scan; slow | implemented (slow FSM, NP) |
|  |  | NP | Byte-reverse permute, no flags; slow | implemented (slow FSM, NP) |
|  |  | NP | Condition → byte ( | implemented (slow FSM, NP) |
- SHL / SHR / SAR by immediate (C1 /4-/7)
  Shift the operand by an immediate count (masked to 5 bits): SHL/SAL zero-fills from the right, SHR zero-fills from the left, SAR sign-replicates. CF is the last bit shifted out; SF/ZF/PF come from the result, AF=0, and OF (defined for the cnt==1 case) is MSB(shifted-by-cnt-1) XOR MSB(result). Datapath / why PU: the register form (mod==11) is fast-pathed (is_shift, shrot=reg_f, shimm=b2[4:0]) and flows PF→D1→D2 (read dst)→EX→WB in the U pipe in a single cycle, with shrot_result / shrot_cf computing the value and CF and sbit giving OF, full bypass. It is PU because decode.sv sets pairs_first=1, pairs_second=0: only the U pipe has the shifter and the CF/OF latch, so a shift can lead a pair but never fill V. count==0 changes nothing. The memory form drops to the slow K_SHIFT load-modify-store RMW and serializes.
- Shift by 1 and by CL (D0/D1, D2/D3)
  AP-500 rates by-1 shifts the same PU as by-imm, but in this RTL the by-1 forms have no fast-path arm: they decode only on the slow K_SHIFT FSM (shift_one → sh_cnt=1), so they serialize (NP-effective) even though architecturally PU. The by-CL forms are genuinely NP: the count is an implicit CL read with variable latency, so AP-500 excludes them from PU; sh_cnt={0, gpr[ECX][4:0]} is read in EX and the op runs on the slow FSM, issuing alone (cnt==0 is a no-op).
- ROL / ROR / RCL / RCR (C0/C1, D0-D3 /0-/3)
  Rotates affect only CF and OF. ROL/ROR rotate without fill bits; RCL/RCR rotate through the carry, seeding the per-bit loop with cfin=EFLAGS[0] (the architectural carry, which is the same reason a pairable rotate-through-carry would be U-only). Architecturally by-1/by-imm rotates are PU and by-CL rotates NP, but decode.sv deliberately does not fast-path any rotate (a comment notes the SHL/SHR/SAL/SAR group is fast-pathed but the rotates keep their richer OF semantics on the slow path). So all rotate forms here run on the slow K_SHIFT FSM and serialize.
- SHLD / SHRD (0F A4/A5, 0F AC/AD)
  NP: double-precision shifts read two source registers plus a count and drive a wide multi-bit shifter, outside the simple/pairable set, and the 0F escape is not in the fast-path decoder anyway. SHLD shifts dst left by cnt filling from the top of src; SHRD shifts right filling from the bottom. CF is the last bit shifted out of dst; SF/ZF/PF come from the result, AF=0. Implemented on the slow K_SHLDRD FSM for a register destination only (imm8 and CL counts); a memory destination sets d_unknown and HALTs (deferred).
- BT / BTS / BTR / BTC (0F A3, 0F AB/B3/BB, 0F BA /4-/7)
  NP: 0F-prefixed single-bit select-and-test, not in the simple set. All copy the selected bit (index mod operand-size) into CF and define only CF (SF/ZF/PF/OF/AF unchanged). BT does not modify the destination; BTS/BTR/BTC then set / reset / complement the bit and write the destination at operand width. Implemented on the slow K_BITTEST FSM for a register-direct destination only; the memory bit-string form (which would need full-index byte addressing) sets d_unknown and HALTs.
- BSF / BSR / BSWAP (0F BC, 0F BD, 0F C8+r)
  All NP. BSF/BSR scan for the lowest / highest set bit of the source, writing its index to the destination and setting ZF iff the source is zero (destination unchanged then); this RTL also fills CF=OF=AF=0 and SF/PF from the source even though those are architecturally undefined. They run on the slow K_BITSCAN FSM (reg or mem source, 16/32-bit). BSWAP reverses the 4 bytes of a 32-bit register (endianness flip), no flags, on the slow K_BSWAP FSM.
- SETcc (0F 90+cc)
  NP: a 0F-prefixed op that evaluates an EFLAGS condition into a byte. It writes 1 to the r/m8 destination if the condition (cc nibble vs. EFLAGS via the shared cond_true helper) holds, else 0; no flags affected. All 16 conditions are handled on the slow K_SETCC FSM, for both register (high-8-aware) and memory destinations.
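The immediate-shift flag rules (CF is the last bit shifted out, OF is MSB(shifted-by-cnt-1) XOR MSB(result), count masked to 5 bits, count 0 changes nothing) can be sketched as a behavioral model. This Python is illustrative and is not the RTL's shrot_result / shrot_cf logic; `shift_imm` is a hypothetical name, and returning `None` for the flags when the count is 0 is a convention of this sketch meaning "EFLAGS untouched".

```python
def shift_imm(op, value, count, width=32):
    """Model SHL/SHR/SAR by immediate. Returns (result, flags or None)."""
    mask, msb = (1 << width) - 1, width - 1
    value &= mask
    count &= 0x1F                       # 5-bit count mask
    if count == 0:
        return value, None              # nothing changes, flags included

    def shift(v, n):
        if op == "SHL":
            return (v << n) & mask
        if op == "SHR":
            return v >> n
        if op == "SAR":                 # arithmetic: replicate the sign bit
            signed = v - (1 << width) if (v >> msb) & 1 else v
            return (signed >> n) & mask
        raise ValueError(op)

    prev = shift(value, count - 1)      # operand shifted by cnt-1
    res = shift(prev, 1)                # final single step
    flags = {
        "CF": (prev >> msb) & 1 if op == "SHL" else prev & 1,
        "OF": ((prev >> msb) ^ (res >> msb)) & 1,
        "SF": (res >> msb) & 1,
        "ZF": int(res == 0),
        "PF": int(bin(res & 0xFF).count("1") % 2 == 0),
        "AF": 0,
    }
    return res, flags

res, flags = shift_imm("SAR", 0x80000000, 4)
print(hex(res), flags["CF"], flags["SF"])
```

Computing the value shifted by cnt-1 first makes both CF and the OF formula fall out of one intermediate, mirroring how the text states the OF rule.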
MULDIV-BCD — multiply / divide and BCD/ASCII adjust
Every instruction in this group is NP. The multiplies and divides are
microcoded, produce or consume the implicit EDX:EAX double-width pair (which
has no V-pipe writeback path), and are never whitelisted by the fast-path
decoder — issue_uv.fp_can_pair() returns 0 immediately on !u.simple, so
they hold the U pipe alone. They execute on the slow FSM in a single
combinational S_EXEC arm (the arithmetic uses the native Verilog *, / and %
operators, not an iterative shift-add or SRT hardware loop, and not the
S_USEQ microsequencer; the multi-cycle character is the per-instruction
FETCH/DECODE/LOAD/EXEC FSM stepping). The BCD/ASCII adjusts (AAA,
AAS, AAM, AAD, DAA, DAS) are not decoded at all and HALT.
| Mnemonic | Encoding | U/V class | Datapath usage | Status |
|---|---|---|---|---|
|  |  | NP | Unsigned | implemented (slow FSM) |
|  |  | NP | Signed one-operand into | implemented (slow FSM) |
|  |  | NP | Signed, truncated to width into a single dest reg ( | implemented (slow FSM) |
|  |  | NP | Signed | implemented (slow FSM) |
|  |  | NP | Signed | implemented (slow FSM) |
|  |  | NP | Unsigned | implemented (slow FSM) |
|  |  | NP | Signed divide; | implemented (slow FSM) |
|  |  | NP | ASCII-adjust-after-add; not decoded; | deferred / HALTs |
|  |  | NP | ASCII-adjust-after-sub; not decoded; | deferred / HALTs |
|  |  | NP | ASCII-adjust-after-mul; not decoded; | deferred / HALTs |
|  |  | NP | ASCII-adjust-before-div; not decoded; | deferred / HALTs |
|  |  | NP | Decimal-adjust-after-add; not decoded; | deferred / HALTs |
|  |  | NP | Decimal-adjust-after-sub; not decoded; | deferred / HALTs |
- MUL / IMUL one-operand (
F6/F7 /4,F6/F7 /5) Multiply the accumulator by
r/minto the implicit double-width result: 8-bitAX = AL * r/m8, 16-bitDX:AX = AX * r/m16, 32-bitEDX:EAX = EAX * r/m32.MULis unsigned,IMUL(one-operand) signed. CF=OF=1 iff the upper half of the product is significant (forIMUL, iff the full product is not the sign-extension of its low half), else 0; SF/ZF/AF/PF are architecturally undefined but Ventium computes ZF/SF/PF from the low half and AF=0 (matching QEMU). Datapath: slow path, one combinationalS_EXECarm — the product is the native{32'd0,EAX}*{32'd0,src}(or$signed*$signed), split intoEDX(high) andEAX(low) with partial-width upper bits preserved. The AGU / D2 forms a load address only for a memory operand.EDX:EAXis the implicit dest/source pair — no V-pipe path, no forwarding.- IMUL two- and three-operand (
0F AF, 69, 6B) These write a single destination register (not EDX:EAX); the high half of the signed product is discarded. The two-operand form is dst = dst * r/m; the three-operand forms are dst = r/m * imm32 (69) or dst = r/m * sign-ext(imm8) (6B). CF=OF=1 iff truncating to the operand width lost significant bits; SF/ZF/PF are filled from the low result, AF=0. All decode as K_IMUL2 on the slow FSM (the three-operand forms set imul_3op and latch the immediate at decode) and compute $signed * $signed, with the low half written back through reg_merge. They are NP: 0F AF is a two-byte op and 69/6B are absent from the fast-path casez, so d.simple=0.
- DIV / IDIV (
F6/F7 /6, F6/F7 /7) Divide the implicit double-width dividend by r/m: quotient → EAX/AL (or AX), remainder → EDX/AH (or DX). DIV is unsigned, IDIV signed (operands $signed-extended, remainder takes the dividend’s sign). #DE is raised on divide-by-zero or quotient overflow; all status flags are undefined and the RTL leaves EFLAGS unchanged (flags_we=0). Datapath: slow path, one combinational S_EXEC arm using native / and %, not an iterative non-restoring / SRT radix-4 hardware loop and not in S_USEQ. These are the longest-latency integer ops, microcoded and EDX:EAX-coupled, hence NP.
Note
The famous Pentium SRT radix-4 divider (and the FDIV erratum) live in the x87 path, not here: decode.sv models x87 FDIV/FDIVR (D8 /6, /7) at latency/occupancy 39 with the fdiv_err (Erratum 23) hook, and the genuine radix-4 SRT engine itself (fx_srt_div) is an optional, compile-time division datapath; see The r4 SRT divider and the FDIV bug. The integer DIV/IDIV above do not touch that SRT datapath. The test corpus is expected to avoid #DE by keeping divisors safe; a #DE that did occur would surface as SIGFPE under user-mode QEMU, as it would from the OS on real hardware.
- BCD / ASCII adjusts (
AAA 37, AAS 3F, AAM D4, AAD D5, DAA 27, DAS 2F) All NP and not implemented. None of these opcodes has a decode arm in either the fast-path decoder or the slow one-byte / two-byte casez; each falls through to default: d_unknown=1 and the FSM goes to S_HALT (a loud, out-of-scope HALT rather than a mis-execution). Architecturally they adjust AL/AX to valid (unpacked or packed) BCD digits after an add, subtract, or multiply, or before a divide, but no EX-stage BCD-adjust datapath exists: they never reach an execute arm and never pair. The test corpus is advised to avoid them.
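The flag and fault rules in the multiply and divide entries above can be checked against a small behavioral model. The following is a minimal Python sketch, not the RTL: the function names (`mul_one_operand`, `imul_truncating`, `div_unsigned`) and the `DivideError` exception are illustrative assumptions, and only the CF=OF rule and the #DE conditions described above are modelled.

```python
def mask(width):
    """All-ones mask for an operand width in bits."""
    return (1 << width) - 1


def to_signed(value, width):
    """Interpret the low `width` bits of value as two's complement."""
    value &= mask(width)
    return value - (1 << width) if value >> (width - 1) else value


def mul_one_operand(acc, src, width=32, signed=False):
    """One-operand MUL/IMUL: return (high, low, cf_of)."""
    if signed:
        product = to_signed(acc, width) * to_signed(src, width)
    else:
        product = (acc & mask(width)) * (src & mask(width))
    low = product & mask(width)
    high = (product >> width) & mask(width)
    if signed:
        # CF=OF=1 iff the product is not the sign-extension of its low half.
        carry = product != to_signed(low, width)
    else:
        # CF=OF=1 iff the upper half is significant.
        carry = high != 0
    return high, low, int(carry)


def imul_truncating(a, b, width=32):
    """Two/three-operand IMUL: return (low, cf_of); the high half is dropped."""
    product = to_signed(a, width) * to_signed(b, width)
    low = product & mask(width)
    return low, int(product != to_signed(low, width))


class DivideError(Exception):
    """Stands in for #DE in this sketch (hypothetical name)."""


def div_unsigned(high, low, src, width=32):
    """Unsigned DIV of high:low by src: return (quotient, remainder)."""
    dividend = ((high & mask(width)) << width) | (low & mask(width))
    divisor = src & mask(width)
    if divisor == 0:
        raise DivideError("divide by zero")
    quotient, remainder = divmod(dividend, divisor)
    if quotient > mask(width):
        raise DivideError("quotient overflow")
    return quotient, remainder
```

For example, `mul_one_operand(0xFFFFFFFF, 2)` sets CF=OF because the unsigned product spills into the high half, while the same bit patterns with `signed=True` are -1 * 2 = -2, which fits in the low half and leaves CF=OF clear.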
CONTROL-FLOW: branches, calls, returns, loops, interrupts
This is the home of the PV class. The short conditional Jcc rel8
(70-7F) and short unconditional JMP rel8 (EB) are fast-pathed
V-only: a branch resolves and redirects in the V slot, so it can fill V
behind a leading flags-writer (the cmp;jcc / dec;jnz special pair) but
can never lead (a taken branch ends the issue window). Everything else — the
rel32 branches, indirect and far transfers, calls and returns, the loop
family, and the interrupt/return ops — runs on the slow FSM (or is undecoded),
so it is NP or NP-effective in this RTL.
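The classification described in the paragraph above can be sketched as a lookup. This is a hedged Python illustration, not the decoder: `control_flow_class` and `can_pair` are hypothetical names, only control-flow opcodes are considered, prefixes are ignored, and `can_pair` checks pairing class alone (no RAW/WAW or other pairing rules).

```python
# (may lead a pair in U, may fill V) for each pairing class.
PAIRING = {
    "UV": (True, True),
    "PU": (True, False),
    "PV": (False, True),
    "NP": (False, False),
}


def control_flow_class(opcode_bytes):
    """Class of an unprefixed control-flow instruction in this RTL."""
    first = opcode_bytes[0]
    # Only Jcc rel8 (70-7F) and JMP rel8 (EB) have a fast-path arm.
    if 0x70 <= first <= 0x7F or first == 0xEB:
        return "PV"
    # rel32 branches, CALL/RET, LOOPx, INT, etc. run on the slow FSM.
    return "NP"


def can_pair(lead_class, trail_class):
    """True if the classes alone allow lead in U and trail in V."""
    return PAIRING[lead_class][0] and PAIRING[trail_class][1]
```

So a `UV` flags-writer followed by a `70-7F` branch passes the class check, whereas the same branch can never lead, and a `0F 8x` near branch never pairs at all.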
Mnemonic |
Encoding |
U/V class |
Datapath usage |
Status |
|---|---|---|---|---|
|
|
PV |
Fast-path V-member; D1 BTB lookup + 2-bit predict; re-evaluates under forwarded U flags. |
implemented |
|
|
NP (this RTL) |
No fast-path 0F arm → slow FSM; functional taken/target, no BTB; serializes. |
implemented (slow FSM) |
|
|
PV |
Fast-path V-member, unconditionally taken; BTB-tracked; uncond mispredict = 3 bubbles. |
implemented |
|
|
NP (this RTL) |
No fast-path E9 arm → slow FSM; functional, no BTB. |
implemented (slow FSM) |
|
|
NP |
Data-dependent target from reg/load ( |
implemented (slow FSM) |
|
|
NP |
CS reload + GDT-descriptor microsequence ( |
implemented (system mode) |
|
|
NP (this RTL) |
Pushes return addr (store) then redirects ( |
implemented (slow FSM) |
|
|
NP |
Push + data-dependent target (store + load) ( |
implemented (slow FSM) |
|
|
NP |
Far call — not decoded; |
deferred / HALTs |
|
|
NP |
Pops return IP (load) then redirects ( |
implemented (slow FSM) |
|
|
NP |
RET plus |
implemented (slow FSM) |
|
|
NP |
Far return — not decoded; |
deferred / HALTs |
|
|
NP |
Dec (E)CX + conditional branch RMW; slow FSM; no flags written. |
implemented (slow FSM) |
|
|
NP |
|
implemented (slow FSM) |
|
|
NP |
|
implemented (slow FSM) |
|
|
NP |
Tests the count register (GPR read, not flags) ( |
implemented (slow FSM) |
|
|
NP |
IDT[3] trap microsequence (system mode); user mode HALTs. |
implemented (system mode) |
|
|
NP |
IDT[n] trap with DPL≥CPL check (system); user mode INT 0x80 → HALT, others HALT. |
implemented (system mode) |
|
|
NP |
Conditional #OF trap through IDT[4] (system); user mode HALTs. |
implemented (system mode) |
|
|
NP |
Pops EIP/CS/EFLAGS ( |
implemented (system mode) |
|
|
NP |
Bounds-check + conditional #BR — not decoded; |
deferred / HALTs |
- Jcc rel8 — the PV branch (
70-7F) Tests EFLAGS per the condition nibble (cond_true against CF/PF/ZF/SF/OF) and, if true, sets EIP = next_eip + sign-ext(rel8); otherwise falls through. It reads no GPR, writes no register, and consumes flags only. Datapath / why PV: the 70-7F arm is fast-pathed and single-cycle. D1 decodes it as a conditional branch with the BTB looked up (predict-taken iff the 2-bit counter ≥ 2), and EX needs no ALU/AGU (the taken bit is just cond_true(cc, flags)). When paired into V behind a flags-writer (cmp/test/dec/sub), it re-evaluates under U’s forwarded result flags (u_flags_eff → v_br_taken_eff); this is the P5 cmp;jcc / add;jnz special pair. It is PV (pairs_first=0, pairs_second=1) because a taken branch redirects the fetch stream, so it must be the trailing member. On resolve, btb_update_taken saturates the counter; a conditional mispredict costs P5_MISPREDICT_V = 4 V-pipe bubbles, a correct prediction 0.
- Jcc rel32 (
0F 80-8F) and JMP rel32 (E9) AP-500 classes the near rel32 branch / direct JMP as PV like the short forms, but the Ventium fast path only whitelists the one-byte-opcode 70-7F / EB encodings. The 0F-prefixed Jcc rel32 and the 5-byte E9 JMP rel32 have no fast-path arm, so they fall to the slow FSM and issue alone: NP in this implementation. They are functionally correct (the taken decision is computed at decode from the live EFLAGS, and retire commits next_eip + rel), but they do not update the BTB and are not part of the dual-issue cycle model.
- JMP rel8 (
EB) and indirect / far JMP (FF /4, EA) JMP rel8 is PV: like a Jcc but unconditionally taken (br_cond=0, br_taken=1); the BTB is warmed strongly-taken on first allocation, and an unconditional mispredict costs P5_MISPREDICT_UNCOND = 3 bubbles regardless of pipe (distinct from the conditional V penalty of 4). JMP r/m (FF /4) is NP (CT_JMPIND): the target comes from a register read or a load, is data-dependent (so the BTB cannot usefully predict it), and runs on the slow FSM (memory form: S_LOAD the pointer, S_EXEC sets the new EIP). JMP ptr16:16/32 (EA, SYS_LJMP) is NP and system-mode only: it reloads CS, which in protected mode means fetching and validating an 8-byte GDT descriptor, a long microcoded segment-load sequence used by the real→protected bootstrap.
- CALL (
E8 near direct, FF /2 indirect, 9A far) AP-500 makes near-direct CALL PV (it can fill V, the push;call special pair), but Ventium implements CALL only on the slow FSM because it must push the return address (a memory store) before redirecting, so E8 issues alone here (NP-effective): S_EXEC (store the next_eip) → S_STORE (write [ESP-w], ESP -= w, EIP <= next_eip + rel). The 0x66 form truncates the target to 16 bits. CALL r/m (FF /2, CT_CALLIND) is genuinely NP: it both pushes (store) and takes a data-dependent target (register read or load), two serialized memory micro-ops, with the AGU used twice. CALL ptr16:16/32 (9A) is NP and not decoded (deferred to the M2S system milestone) → d_unknown → HALT.
- RET / RET imm16 / RETF (
C3, C2, CB/CA) RET is NP: it pops the return IP from [ESP] (a load) and redirects on the slow FSM, with no return-stack predictor in Ventium (CT_RETN: S_LOAD [ESP] → S_EXEC sets EIP from the popped word, ESP += w). RET imm16 adds the decode-latched imm16 to the ESP increment (releasing caller args). RETF / RETF imm16 (far) would pop CS:EIP and reload the code segment; it is NP and not decoded (deferred to M2S) → d_unknown → HALT.
- LOOP / LOOPE / LOOPNE / JCXZ (
E0-E3) All NP: a microcoded read-modify-write on (E)CX plus a conditional branch, not a simple ALU op, and not BTB-tracked. LOOP decrements the count register (writeback via NBA), zero-tests it to gate the redirect, and writes no flags. LOOPE/LOOPZ additionally requires ZF==1, LOOPNE/LOOPNZ requires ZF==0 (gated from EFLAGS in S_EXEC). JCXZ/JECXZ (E3, CT_JECXZ) is distinct: it tests the count register (a GPR read) for zero rather than EFLAGS, so unlike Jcc it is NP (not a simple flag-branch); it writes no flags and uses no ALU result. The 0x67 address-size prefix selects the 16-bit CX count path.
- INT3 / INT n / INTO / IRET (
CC, CD, CE, CF) All NP / privileged. In system mode they vector through the IDT (gate fetch + exception-frame push), a long microsequence: INT3 → S_INT_GATE (read the 8-byte IDT[3] gate) → S_INT_PUSH (push EFLAGS, CS, the next EIP for a TRAP) → redirect; INT n adds the gate DPL≥CPL check; INTO traps through IDT[4] only if OF is set (else a plain EIP advance); IRET is the inverse (S_IRET pops EIP/CS/EFLAGS). In user mode there is no IDT, so INT3/INTO/IRET and most INT n HALT, with INT 0x80 treated as the syscall-exit HALT, all preserving the M0-M6 user-gate bit-identity.
- BOUND (
62 /r) NP and not decoded. It would read a two-word bounds pair from memory, do two compares, and conditionally raise #BR through the IDT (a microcoded compare-and-maybe-fault sequence). Opcode 62 has no decode arm, so it hits the top-level default d_unknown and HALTs.
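The Jcc rel8 behaviour described in this list (condition-nibble test, sign-extended target, 2-bit prediction) can be illustrated with a short Python sketch. It is a behavioral model under stated assumptions, not the RTL: `cond_true` reuses the name from the text, but its body follows the architectural x86 condition encoding; `jcc_rel8_target`, `predict_taken` and `update_counter` are hypothetical names; and the decrement on a not-taken branch is assumed from the usual 2-bit saturating-counter scheme, since the text only states that the counter saturates.

```python
# EFLAGS bit positions used by the condition codes.
CF, PF, ZF, SF, OF = 1 << 0, 1 << 2, 1 << 6, 1 << 7, 1 << 11


def cond_true(cc, flags):
    """Evaluate the 4-bit x86 condition nibble against EFLAGS."""
    cf, pf, zf, sf, of = (bool(flags & bit) for bit in (CF, PF, ZF, SF, OF))
    base = [
        of,                    # 0/1: O  / NO
        cf,                    # 2/3: B  / AE
        zf,                    # 4/5: E  / NE
        cf or zf,              # 6/7: BE / A
        sf,                    # 8/9: S  / NS
        pf,                    # A/B: P  / NP
        sf != of,              # C/D: L  / GE
        zf or (sf != of),      # E/F: LE / G
    ][(cc >> 1) & 7]
    # The low bit of the nibble inverts the base condition.
    return base != bool(cc & 1)


def jcc_rel8_target(next_eip, rel8):
    """Taken target: next_eip + sign-ext(rel8), wrapped to 32 bits."""
    disp = rel8 - 0x100 if rel8 & 0x80 else rel8
    return (next_eip + disp) & 0xFFFFFFFF


def predict_taken(counter):
    """Predict taken iff the 2-bit counter is 2 or 3."""
    return counter >= 2


def update_counter(counter, taken):
    """Saturating update (the not-taken decrement is an assumption)."""
    return min(counter + 1, 3) if taken else max(counter - 1, 0)
```

With these definitions, `jcc_rel8_target(0x1000, 0xFE)` yields `0x0FFE`, the classic two-byte backward branch to itself.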
STRING: string primitives and REP prefixes
Every string primitive is NP: decode.sv never sets simple for the
A4-AF string opcodes, so issue_uv (which requires u.simple /
v.simple) can never make one a U-member or V-candidate. They are microcoded,
multi-cycle ops on the K_STR slow path that hold the in-order pipe for their
whole run. Each REP element is its own retire record at the same PC: a
non-final iteration sets new_eip = q_pc so the FSM re-enters the same
instruction. Direction is from DF (EFLAGS[10]): str_step = DF ? -w :
+w. The REP/REPE/REPNE prefixes are themselves NP (a prefixed
K_STR op). The port-I/O string forms (INS/OUTS) and the register
port-I/O ops are not decoded (no I/O space is modelled) and HALT.
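The direction rule above (`str_step = DF ? -w : +w`) can be made concrete with a minimal Python sketch. The helper names `advance` and `movs_element` are hypothetical, segments are ignored (a flat `bytearray` stands in for memory), and pointers wrap at 32 bits.

```python
DF_BIT = 1 << 10  # EFLAGS[10], the direction flag


def str_step(eflags, width_bytes):
    """Per-element pointer step: -w when DF is set, +w otherwise."""
    return -width_bytes if eflags & DF_BIT else width_bytes


def advance(pointer, eflags, width_bytes):
    """Advance ESI/EDI by one element, wrapping at 32 bits."""
    return (pointer + str_step(eflags, width_bytes)) & 0xFFFFFFFF


def movs_element(memory, esi, edi, eflags, width_bytes):
    """Copy one element [ESI] -> [EDI], then step both pointers."""
    memory[edi:edi + width_bytes] = memory[esi:esi + width_bytes]
    return (advance(esi, eflags, width_bytes),
            advance(edi, eflags, width_bytes))
```

With DF clear a byte copy moves both pointers up by 1; with DF set (`eflags = 0x400`) a word copy moves them down by 2.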
Mnemonic |
Encoding |
U/V class |
Datapath usage |
Status |
|---|---|---|---|---|
|
|
NP |
Copy DS:[ESI]→ES:[EDI], advance both per DF; |
implemented (slow FSM) |
|
|
NP |
Word/dword MOVS ( |
implemented (slow FSM) |
|
|
NP |
Store AL→ES:[EDI] (no load), advance EDI; |
implemented (slow FSM) |
|
|
NP |
Store AX/EAX→ES:[EDI]; store-only microsequence ( |
implemented (slow FSM) |
|
|
NP |
Load DS:[ESI]→AL ( |
implemented (slow FSM) |
|
|
NP |
Load DS:[ESI]→AX/EAX (width-correct merge); load-only. |
implemented (slow FSM) |
|
|
NP |
|
implemented (slow FSM) |
|
|
NP |
Width-correct CMP vs ES:[EDI]; REPE/REPNE early-out. |
implemented (slow FSM) |
|
|
NP |
|
implemented (slow FSM) |
|
|
NP |
Width-correct two-load compare; REPE/REPNE early-out. |
implemented (slow FSM) |
|
|
NP |
Repeat MOVS/STOS/LODS ECX times (no ZF early-out); prefixed |
implemented |
|
|
NP |
Repeat SCAS/CMPS while ZF==1; stop on first non-equal element. |
implemented |
|
|
NP |
Repeat SCAS/CMPS while ZF==0; stop on first matching element. |
implemented |
|
|
NP |
Port-I/O input — not decoded; |
deferred / HALTs |
|
|
NP |
Port-I/O output — not decoded; |
deferred / HALTs |
|
|
NP |
Register port-I/O — not decoded; |
deferred / HALTs |
- MOVS (
A4/A5) Copies one element from DS:[ESI] to ES:[EDI], then advances both pointers by ±w per DF; flags are unaffected. Datapath: per element the slow FSM runs S_DECODE → S_LOAD (read DS:[ESI] into mem_load_data) → S_EXEC (str_wdata = mem_load_data, ESI += str_step, EDI += str_step; the pre-increment [EDI] is latched into str_store_addr because EDI updates via NBA the same cycle) → S_STORE (write ES:[EDI]). With REP, ECX is decremented and the run terminates at ECX==0 (no ZF test for MOVS); a leading REP with ECX==0 retires a single no-op advancing EIP.
- STOS / LODS (
AA/AB, AC/AD) STOS stores AL/AX/EAX to ES:[EDI] (no load: S_DECODE goes straight to S_EXEC since d_mem_read=0), advances EDI, and with REP fills memory. LODS loads DS:[ESI] into AL/AX/EAX via reg_merge (partial-register-correct: the low byte / word is merged, upper bytes preserved) and advances ESI (no store stage); REP LODS is legal but pointless (only the final element survives), and Ventium still iterates ECX times. Neither has a ZF early-out.
- SCAS / CMPS (
AE/AF, A6/A7) Both are non-storing compares that set all six status flags via flags_next(ALU_CMP, ...). SCAS computes (AL/AX/EAX) - [ES:EDI] (one load from ES:[EDI]) and advances EDI. CMPS computes [DS:ESI] - [ES:EDI] and is the only string op needing two loads (S_LOAD reads DS:[ESI] into mem_load_data, then the CMPS-only S_LOAD2 reads ES:[EDI] into mem_load_data2) before the compare, advancing both ESI and EDI. With a REPE/REPNE prefix, after each compare last_iter = (ECX-1==0) || cmp_term, where cmp_term is (ZF==0) under REPE and (ZF==1) under REPNE; the ZF early-out is taken from this element’s freshly computed flags before deciding to re-enter q_pc. Each CMPS element makes two D-cache accesses (each can take a miss/misalign penalty in cycle mode).
- REP / REPE / REPNE (
F3, F3, F2) The prefix machine decodes F3 → pfx_rep=3 (q_rep) and F2 → pfx_rep=2 (q_repne). REP and REPE share the F3 byte; the primitive that follows determines which behaviour applies. On the non-comparing primitives (MOVS/STOS/LODS) there is no early-out; the run terminates only when ECX reaches 0. On the comparing primitives (SCAS/CMPS), REPE stops when ZF==0 and REPNE stops when ZF==1 (in addition to ECX==0). All are NP prefixed K_STR ops: ECX is decremented each element, a non-final element sets new_eip = q_pc to re-enter the same instruction, and the op holds the in-order pipe (issues alone) for the whole run.
- INS / OUTS and IN / OUT (
6C-6F, E4-E7, EC-EF) All NP by AP-500 (I/O instructions) and not decoded: Ventium models a flat user environment with no I/O port space. None of these opcodes appears in the opcode case, so each falls through to default: d_unknown=1 and the core HALTs loudly rather than mis-execute; no K_STR microsequence is generated. (Note: the hex values E4/E5/EC/ED only have meaning elsewhere as the second byte of a D9 x87 escape, e.g. D9 E4 = FTST, but the standalone primary opcodes are undecoded.)
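The REPE/REPNE termination rule in the SCAS / CMPS and REP entries above can be traced with a small Python model of a byte scan. This is a sketch under assumptions: `rep_scasb` is a hypothetical helper, DF is taken as 0, memory is a flat `bytes` object, and only ZF of the compare is tracked. The `pfx_rep` values follow the text (F3 → 3, F2 → 2).

```python
REPNE, REPE = 2, 3  # pfx_rep encodings from the prefix machine


def last_iter(ecx, zf, pfx_rep):
    """True if this element is the final one of the REP run."""
    cmp_term = ((pfx_rep == REPE and zf == 0) or
                (pfx_rep == REPNE and zf == 1))
    return ((ecx - 1) & 0xFFFFFFFF) == 0 or cmp_term


def rep_scasb(memory, edi, ecx, al, pfx_rep, zf=0):
    """Model REPE/REPNE SCASB with DF=0: return (edi, ecx, zf)."""
    # A leading ECX==0 runs no element and leaves ZF unchanged.
    while ecx != 0:
        zf = int(memory[edi] == (al & 0xFF))  # ZF of AL - [EDI]
        edi += 1
        done = last_iter(ecx, zf, pfx_rep)    # uses this element's ZF
        ecx -= 1
        if done:
            break
    return edi, ecx, zf
```

Scanning `b"abcXdef"` for `X` with REPNE stops after the fourth element: EDI points one past the match, ECX holds the three elements not yet visited, and ZF is set.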
SYSTEM: control/debug registers, descriptor tables, and serializing ops
The system group is almost entirely NP: control- and debug-register
moves, descriptor-table loads, flag-bit mutators, HLT, and the serializing
CPUID/RDTSC/MSR ops are all microcoded or privileged, never whitelisted
by the fast-path decoder, and serialize on the slow FSM. The lone exception is
NOP (90), the canonical zero-side-effect UV op. Many forms are
gated on sys_mode (in user mode they stay d_unknown → HALT, preserving
the pre-system bit-identity), and several are deferred (undecoded → HALT).
Mnemonic |
Encoding |
U/V class |
Datapath usage |
Status |
|---|---|---|---|---|
|
|
NP |
Control-register read ( |
implemented |
|
|
NP |
CR write ( |
implemented |
|
|
NP |
Debug-register read ( |
implemented (sys mode) |
|
|
NP |
DR write ( |
implemented (sys mode) |
|
|
NP |
6-byte pseudo-descriptor read ( |
implemented |
|
|
NP |
Store GDTR/IDTR — not matched (only /2,/3); |
deferred / HALTs |
|
|
NP |
CR0-alias store/load — not matched; |
deferred / HALTs |
|
|
NP |
TSS-descriptor read microsequence ( |
implemented (sys mode, reg) |
|
|
NP |
TR-selector store ( |
implemented (sys mode, reg) |
|
|
NP |
LDTR load/store — not matched (only /1,/3); |
deferred / HALTs |
|
|
NP |
Clear CR0.TS — not decoded; |
deferred / HALTs |
|
|
NP |
Access-rights / limit query — not decoded; |
deferred / HALTs |
|
|
NP |
Segment-verify — not matched; |
deferred / HALTs |
|
|
NP |
RPL adjust — not decoded; |
deferred / HALTs |
|
|
NP |
Far-pointer load into DS/ES — not decoded; |
deferred / HALTs |
|
|
NP |
Far-pointer load into SS/FS/GS — not decoded; |
deferred / HALTs |
|
|
NP |
Direct EFLAGS.IF clear/set ( |
implemented |
|
|
NP |
Direct EFLAGS.DF clear/set; one |
implemented |
|
|
NP |
Direct EFLAGS.CF clear/set/complement (V has no CF forwarding); serializes. |
implemented |
|
|
NP |
Clean stop → |
implemented |
|
|
UV |
Zero-side-effect 1-byte op; fast-pathed; pairs in either slot. |
implemented |
|
|
NP |
FP-sync barrier — standalone 0x9B not decoded; |
deferred / HALTs |
|
|
NP |
CPU-ID leaf dispatch — not decoded; |
deferred / HALTs |
|
|
NP |
Time-stamp read — not decoded; |
deferred / HALTs |
|
|
NP |
MSR read/write — not decoded (no MSR file); |
deferred / HALTs |
|
|
NP |
SMRAM state restore microsequence ( |
implemented (SMM) |
|
|
NP |
Guaranteed #UD via IDT (system); user mode |
implemented (system mode) |
|
|
PU/NP |
Atomic-RMW prefix ( |
implemented (prefix) |
- MOV to/from CRn (
0F 20, 0F 22) NP: a control-register access is a microcoded slow-FSM op outside the U/V ALU datapath, with no fast-path uop, so it can never pair. 0F 20 reads CRn (CR0/CR2/CR3/CR4 selected by ModR/M.reg) into a GPR via a single S_EXEC mux; 0F 22 writes CRn from a GPR. A CR3 write additionally invalidates all ITLB/DTLB entries (per MOV-CR3 semantics). Neither writes flags (flags_we=0). Architecturally a CR0.PE / CR0.PG write is a serializing mode-change event.
- MOV to/from DRn (
0F 21, 0F 23) NP, sys_mode-gated (in user mode 0F 21 / 0F 23 stay d_unknown → HALT, byte-identical to pre-M2S). 0F 21 reads debug register DRn (DR0-DR3 breakpoint addresses, DR6 status, DR7 control; DR4/DR5 alias DR6/DR7 when CR4.DE=0) into a GPR; 0F 23 writes them (forcing the reserved-1 masks DR6_FIXED_1=0xFFFF0FF0 and DR7_FIXED_1=0x400 so read-back is deterministic). Both carry a pre-execute fault path: with CR4.DE=1 an access to DR4/DR5 diverts to #UD delivery before any register access; the DR7.GD #DB is decoded but gated off (DBG_GD_ENABLE=0) to match the QEMU golden.
- LGDT / LIDT and the rest of the 0F 01 group (
0F 01 /2, /3; /0, /1, /4, /6) LGDT/LIDT are NP microcoded: they read a 6-byte in-memory pseudo-descriptor (2-byte limit + 4-byte base) across two bus beats in the S_LGDT microsequence (beat 0 reads the low word, beat 1 at +4 supplies the high base bits) and load the hidden GDTR/IDTR, then retire once. Only /2 and /3 of the 0F 01 group are decoded; SGDT/SIDT (/0, /1) and SMSW/LMSW (/4, /6) fall to the else-arm d_unknown and HALT (explicitly deferred).
- LTR / STR and the rest of the 0F 00 group (
0F 00 /1, /3; /0, /2, /4, /5) LTR (/3, NP, sys_mode-gated, reg-form only) loads the task register TR from a GDT TSS selector: a multi-beat S_LTR descriptor read that populates tr_base/tr_limit and sets the busy bit. STR (/1, NP, sys-mode, reg-form only) stores the current TR selector (zero-extended) into r/m16 via one S_EXEC reg_merge. The other 0F 00 sub-ops, SLDT/LLDT (/0, /2) and VERR/VERW (/4, /5), and any memory form are not matched and HALT (deferred).
- Undecoded protection ops (
CLTS 0F 06, LAR/LSL 0F 02/03, ARPL 63, LDS/LES C5/C4, LSS/LFS/LGS 0F B2/B4/B5) All NP by class and not decoded: CLTS (clear CR0.TS), LAR/LSL (load access-rights / segment-limit), ARPL (adjust RPL), and the far-pointer segment loads LDS/LES/LSS/LFS/LGS are each absent from their decode casez and resolve to d_unknown → S_HALT rather than mis-execute. (Segment state itself is reachable via MOV Sreg and far JMP, but these far-pointer-load opcodes are not implemented.)
- Flag-bit mutators (
CLI/STI FA/FB, CLD/STD FC/FD, CLC/STC/CMC F8/F9/F5) All NP single-byte ops decoded by the slow FSM (never simple, so fp_can_pair fails on !u.simple). Each directly mutates one EFLAGS bit in a single S_EXEC retire with flags_we=0 (a direct write, not via the ALU flag path) and no AGU/ALU/register write: CLI/STI clear/set IF (±0x200), CLD/STD clear/set DF (±0x400), CLC/STC/CMC clear/set/complement CF (±1/^1). The CF ops could not fill V in any case, for the same datapath reason as ADC/SBB: the V ALU path has no CF forwarding.
- HLT and NOP (
F4, 90) HLT is NP: it stops instruction retirement entirely and cannot pair because no following instruction issues. S_DECODE routes d_halt to S_HALT; in cycle mode it first emits one retire record (so the trace matches the oracle’s terminating-instruction record), then halts. This is a clean stop, distinct from the loud no-retire d_unknown HALT. NOP (90 with no 0x66) is the one UV op in this category: fast-pathed with empty reads/writes masks, it flows through PF/D1/D2/EX/WB as a 1-cycle uop, pairs in U or V with any partner (no possible RAW/WAW/disp+imm conflict), and retires at up to 2/clock. It is implemented in both the fast path and the slow FSM.
- Serializing / privileged ops (
WAIT 9B, CPUID 0F A2, RDTSC 0F 31, RDMSR/WRMSR 0F 32/30) All NP. WAIT/FWAIT (standalone 0x9B) is an FP-sync barrier that is not in the top-level opcode decoder, so it HALTs (d_unknown); the FX_FWAIT x87 sub-op handles the FP-escape case as a no-op, but that is a different decode path. CPUID (0F A2, distinct from the single-byte A2 = MOV moffs8,AL), RDTSC (0F 31), and RDMSR/WRMSR (0F 32 / 0F 30) are all absent from the two-byte casez and HALT as d_unknown (no MSR file is modelled).
- RSM / UD2 (
0F AA, 0F 0B) RSM is NP and heavily microcoded: it leaves SMM and restores the entire CPU state (CR0/CR3/CR4/CR2, EFLAGS, EIP, all GPRs, all segment selectors and hidden descriptors, GDTR/IDTR, SMBASE) from the SMRAM save-state map in the long S_RSM microsequence (many bus beats into holding registers, then a single-clock commit), gated on sys_mode && smm_active (outside SMM → d_unknown → HALT). UD2 (0F 0B) is the guaranteed-invalid opcode delivering #UD (vector 6, a fault) through the IDT in system mode; in user mode there is no IDT, so it is a HALT (byte-identical to pre-M2S).
- LOCK prefix (
F0) AP-500 §5.6.2.3 makes a prefixed instruction U-only (PU: may lead a pair, never fill V), but in Ventium the fast path decodes only unprefixed forms, so any LOCK-prefixed instruction has simple=0 and serializes outright (NP-effective). A locked atomic RMW must hold the U pipe through its memory access. As a prefix it only adjusts pfx_len/m_idx; the instruction it guards runs on the slow FSM. The one special architectural model is Erratum 81 (F00F): a LOCK CMPXCHG8B with a register destination (0F C7 /1, mod==11) sets d_f00f and, with errata_en[ERR_F00F] and pfx_lock, enters the documented bus-lock hang S_F00F_HANG instead of a clean HALT. CMPXCHG8B itself is not implemented (both forms are d_unknown; the locked reg-dst form additionally hangs under errata).
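Two of the bit-level rules in this list, the single-bit EFLAGS mutators and the debug-register write masks, are easy to check with a short Python sketch. It is illustrative only: `apply_flag_op`, `resolve_dr_index`, `write_dr` and the `UndefinedOpcode` exception are hypothetical names, while the opcodes, bit values and the `DR6_FIXED_1` / `DR7_FIXED_1` constants are the ones quoted above.

```python
IF_BIT, DF_BIT, CF_BIT = 0x200, 0x400, 0x1

# opcode -> (action, EFLAGS bit)
FLAG_OPS = {
    0xFA: ("clear", IF_BIT),   # CLI
    0xFB: ("set", IF_BIT),     # STI
    0xFC: ("clear", DF_BIT),   # CLD
    0xFD: ("set", DF_BIT),     # STD
    0xF8: ("clear", CF_BIT),   # CLC
    0xF9: ("set", CF_BIT),     # STC
    0xF5: ("toggle", CF_BIT),  # CMC
}


def apply_flag_op(opcode, eflags):
    """Mutate exactly one EFLAGS bit, leaving every other bit alone."""
    action, bit = FLAG_OPS[opcode]
    if action == "clear":
        return eflags & ~bit & 0xFFFFFFFF
    if action == "set":
        return eflags | bit
    return eflags ^ bit


DR6_FIXED_1 = 0xFFFF0FF0
DR7_FIXED_1 = 0x400


class UndefinedOpcode(Exception):
    """Stands in for #UD delivery in this sketch (hypothetical name)."""


def resolve_dr_index(index, cr4_de):
    """DR4/DR5 alias DR6/DR7 when CR4.DE=0 and fault when CR4.DE=1."""
    if index in (4, 5):
        if cr4_de:
            raise UndefinedOpcode(f"DR{index} with CR4.DE=1")
        return index + 2
    return index


def write_dr(dr, index, value, cr4_de=False):
    """Write a debug register, forcing the reserved-1 bits on DR6/DR7."""
    index = resolve_dr_index(index, cr4_de)
    if index == 6:
        value |= DR6_FIXED_1
    elif index == 7:
        value |= DR7_FIXED_1
    dr[index] = value & 0xFFFFFFFF
    return dr
```

Writing zero through the `DR4` alias with `CR4.DE=0` therefore reads back as `0xFFFF0FF0` from `DR6`, which is what makes the read-back deterministic.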
X87-FPU: floating-point stack
Every x87 escape runs alone in the U pipe: issue_uv only pairs simple
integer uops, and an x87 escape is never simple, so in the functional FSM
they are all NP. The engine is the core’s slow microsequenced path
(S_FLOAD → S_FEXEC → S_FSTORE) plus the fpu_x87_pkg helpers operating on
an 80-bit (floatx80) stack file (fpr[8] + ftop, with
st(i) = fpr[(ftop+i)&7]); the fpu_top.sv 8-stage pipe is an M0 stub.
AP-500 classes many of these as FX (pair with a trailing FXCH), but the
RTL does not implement that parallel-pairing case — the divergence is noted
per instruction. In cycle-mode a small whitelist (FK_*) models result
latency and occupancy (e.g. FADD lat 3 / FDIV lat 39) but still issues
U-alone. Tier markers (Tier-1/2/3) follow the FPU spec’s accuracy tiers; any
non-extended precision control (PC != 11) HALTs.
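The stack addressing rule above, `st(i) = fpr[(ftop+i)&7]`, can be sketched in a few lines of Python. The class is a behavioral illustration under assumptions, not the RTL: `X87Stack` and its method names are hypothetical, plain Python values stand in for `floatx80`, and `None` stands in for an empty tag.

```python
class X87Stack:
    """Eight physical slots addressed relative to a 3-bit TOP."""

    def __init__(self):
        self.fpr = [None] * 8  # None models an empty slot
        self.ftop = 0

    def phys(self, i):
        """Physical index of ST(i)."""
        return (self.ftop + i) & 7

    def st(self, i):
        """Read ST(i)."""
        return self.fpr[self.phys(i)]

    def push(self, value):
        """FLD-style push: pre-decrement TOP, then write the new ST(0)."""
        self.ftop = (self.ftop - 1) & 7
        self.fpr[self.ftop] = value

    def pop(self):
        """FSTP-style pop: read ST(0), mark it empty, increment TOP."""
        value = self.st(0)
        self.fpr[self.ftop] = None
        self.ftop = (self.ftop + 1) & 7
        return value

    def fxch(self, i=1):
        """Swap ST(0) with ST(i); TOP does not move."""
        a, b = self.ftop, self.phys(i)
        self.fpr[a], self.fpr[b] = self.fpr[b], self.fpr[a]
```

Starting from reset (`ftop = 0`), two pushes leave `ftop = 6`, with the most recent value in `ST(0)` and the earlier one in `ST(1)`.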
Mnemonic |
Encoding |
U/V class |
Datapath usage |
Status |
|---|---|---|---|---|
|
|
NP |
Push: pre-dec TOP, load ST(0); |
implemented (Tier-1) |
|
|
NP |
Store ST(0) (convert/round), FSTP pops; |
implemented (Tier-1/2) |
|
|
NP |
Integer-mem load/store with exact/rounded int↔floatx80; FIST overflow-errata hook. |
implemented (Tier-1) |
|
|
NP |
Packed-BCD load/store — not routed (DF /4,/6); |
deferred / HALTs |
|
|
NP |
Real 80-bit regfile cross-swap in one clock; parallel-FXCH bypass not modeled. |
implemented (Tier-1) |
|
|
NP |
Stack-management (tag/TOP) ops, no data move; one |
implemented (Tier-1) |
|
|
NP |
Arith via |
implemented (Tier-2) |
|
|
NP |
sqrt(ST0) via |
implemented (Tier-2/3) |
|
|
NP |
Exact sign-bit clear / toggle on ST(0); one |
implemented (Tier-1) |
|
|
NP |
Not enumerated in the D9 reg-form case; |
deferred / HALTs |
|
|
NP |
Ordered (signaling) compare → C3/C2/C0, IE on any NaN; |
implemented (Tier-1) |
|
|
NP |
Unordered (quiet) compare; IE only on SNaN; |
implemented (Tier-1) |
|
|
NP |
Signaling compare of ST(0) vs +0.0 → C3/C2/C0; one |
implemented (Tier-1) |
|
|
NP |
Classify ST(0) (sign + class) into C3/C2/C1/C0 from value + tag. |
implemented (Tier-1) |
|
|
NP |
Integer-mem signaling compare → C3/C2/C0; |
implemented (Tier-1) |
|
|
NP |
Push an 80-bit ROM constant ( |
implemented (Tier-1) |
|
|
NP |
Reset FPU state (TOP/ctrl/status/tags); FNINIT works, FINIT’s 0x9B HALTs. |
implemented (FNINIT) |
|
|
NP |
Clear exception/busy bits, keep C0-C3 + TOP; FNCLEX works, FCLEX 0x9B HALTs. |
implemented (FNCLEX) |
|
|
NP |
Load 16-bit control word (RC/PC/masks); |
implemented (Tier-1) |
|
|
NP |
Store control word; |
implemented (FNSTCW) |
|
|
NP |
Store status word (TOP overlaid) to AX or mem; FSTSW’s 0x9B HALTs. |
implemented |
|
|
NP |
Standalone 0x9B not decoded; |
deferred / HALTs |
|
|
NP |
14/28-byte environment image — not enumerated; |
deferred / HALTs |
|
|
NP |
94/108-byte full state image — not routed (DD /4,/6); |
deferred / HALTs |
Transcendentals ( |
|
NP |
Polynomial/constant-ROM engine — not enumerated; |
deferred / HALTs |
- FLD / FST / FSTP (
D9/DD /0, DB /5, D9 C0+i; D9/DD /2, /3, DB /7, DD D0+i/D8+i) FLD pushes a value: TOP is pre-decremented and the new ST(0) loaded. m32/m64 are converted to floatx80 (via fx_from_f32/fx_from_f64), m80 is loaded canonically, and FLD ST(i) pushes a copy of the old ST(i). Datapath: memory forms set d_f_mem_read + d_f_mbytes (4/8/10), the AGU computes the address on the same ModR/M/SIB path as integer loads, S_FLOAD reads 1-3 bus beats into f_mem80, and S_FEXEC decrements TOP, clears the new tag, and writes the converted value. FST/FSTP store ST(0) (FSTP then pops): m32/m64 round per RC (setting PE on inexact, IE on overflow), m80 is exact; S_FEXEC latches the store value and sticky flags, and S_FSTORE drives 1-3 beats, applying the pop on the last. NP in the functional FSM; AP-500 rates FLD m32/m64/ST(i) and FST/FSTP as FX, but the parallel-pairing case is not implemented.
- FILD / FIST / FISTP (
DF/DB /0, DF/DB /2, /3, DF /5, /7) FILD pushes a signed 16/32/64-bit integer converted exactly to floatx80 (fx_from_int); FIST/FISTP convert ST(0) to a signed integer (rounded per RC), store it, and (FISTP) pop, with the documented Pentium FIST overflow erratum hook (fist_errata_overflow / fx_to_int_errata) gated behind errata_en. All NP (AP-500 explicitly classes FILD/FIST/FISTP as NP, not FX).
- FXCH (
D9 C8+i), the documented divergence. Exchanges ST(0) with ST(i) (default ST(1)); no flags or arithmetic, just a swap. Datapath: a real 80-bit register-file cross-write (fpr[ftop] <= fst(i); fpr[fri(i)] <= fst(0)) in one U-pipe clock, with no AGU/ALU/flag/TOP change. On a real P5, FXCH is PV/FX and executes for free in the WF stage paired with a preceding FX op (giving 0 effective latency). Ventium does not model that parallel-FXCH bypass; it does a plain 1-cycle U-pipe swap. The swap value is correct (Tier-1); the cycle-level free-pairing special case is the noted divergence.
- FFREE / FINCSTP / FDECSTP / FNOP (
DD C0+i, D9 F7/F6/D0) All NP stack-management ops with no data move and no memory: FFREE marks ST(i)’s tag empty; FINCSTP/FDECSTP rotate TOP ±1 without touching tags or data (and clear C1/C2/C3); FNOP is a true no-op. Each is one S_FEXEC step.
- FADD / FSUB / FSUBR / FMUL / FDIV / FDIVR (
D8/DC, D8 /r, DC /r, DA/DE, pop forms) ST(dst) op= operand. ST0-dest forms compute ST0 = ST0 op src; ST(i)-dest forms compute ST(i) = ST(i) op ST0 with the classic x87 SUBR/SUB and DIVR/DIV sense-swap (decode flips the aluop bit for reg = 4..7). The FIADD/FISUB/… forms take a 16/32-bit integer memory operand converted to floatx80; the p/ip encodings pop after the op. Datapath: memory/int forms S_FLOAD the operand, then S_FEXEC calls f_eval → {ie, ze, inexact, result} (fx_add/fx_mul/fx_div, round-to-nearest, 64-bit extended), writing the dest slot and latching sticky PE/IE/ZE; f_eval models QEMU specials bit-exactly (x/0 → ±Inf+ZE, 0/0 → QNaN indefinite+IE). The Pentium FDIV SRT erratum is reproduced two ways: the runtime fx_div_errata anchor for the one published bit-exact operand pair (gated by errata_en[ERR_FDIV]), and, optionally at compile time (+define+VEN_SRT_DIV+VEN_SRT_FDIV_BUG), the genuine radix-4 SRT divider fx_srt_div, which reproduces the flaw from first principles for all operands. See The r4 SRT divider and the FDIV bug. All NP functionally (AP-500 FX, not modeled); in cycle-mode the D8 reg-form models result latency (3 add/sub, 3 mul, 39 div) and occupancy so a dependent fadd chain emerges at CPI ~3. Precision control: any arith with PC != 11 sets f_pc_bad → S_HALT (the datapath only implements 64-bit extended).
- FSQRT / FABS / FCHS (
D9 FA, D9 E1, D9 E0) FSQRT computes sqrt(ST0) (fx_isqrt fixed-point + round per RC): sqrt(±0) = ±0 (C2 set), sqrt(negative finite) = real-indefinite QNaN + IE, positive operands take the normal path (PE on inexact); the same PC != 11 → HALT gate applies. FABS clears ST(0)’s sign bit and FCHS toggles it: exact bit twiddles with no flag/IE update. All NP (AP-500 lists FABS/FCHS as FX; not modeled).
- FCOM / FUCOM / FTST / FXAM / FICOM families (
D8/DC /2, /3, DD E0+i/E8+i, D9 E4/E5, DA/DE /2, /3, DE D9, DA E9) All compares set the condition codes C3:C2:C0 (000 >, 001 <, 100 =, 111 unordered) via apply_cmp (which preserves C1). FCOM/FCOMP/FCOMPP are ordered (signaling): IE on any NaN operand (FCOMP pops once, FCOMPP twice). FUCOM/FUCOMP/FUCOMPP are unordered (quiet): IE only on a signaling NaN (QNaN allowed). FTST compares ST(0) against +0.0 (signaling). FXAM classifies ST(0) (reading the TOP tag for empty), reporting sign + class in C3/C2/C1/C0. FICOM/FICOMP compare against a signed integer memory operand (signaling). All NP (AP-500 rates the compares as FX).
- FLD-constants (
D9 E8..EE) Push an 80-bit ROM constant: FLD1=1.0, FLDL2T=log2(10), FLDL2E=log2(e), FLDPI=π, FLDLG2=log10(2), FLDLN2=ln(2), FLDZ=0.0. S_FEXEC pre-decrements TOP and writes fconst(sel) (a hard-coded floatx80 ROM). All NP; the cycle model gives them occ/lat 2 (vs 1 for FLD ST(i)/mem) to match the oracle.
- FPU control ops (
FNINIT DB E3, FNCLEX DB E2, FLDCW D9 /5, FNSTCW D9 /7, FNSTSW DF E0 / DD /7) All NP. FNINIT resets the FPU (TOP=0, control word 0x037F, status 0, tag word all-empty). FNCLEX clears the exception/busy bits while preserving C0-C3 and TOP (fstat &= 0x7f00). FLDCW loads the 16-bit control word from memory (feeding f_rc and the f_pc_bad check). FNSTCW stores it. FNSTSW stores the status word with the live TOP overlaid, either to AX (a cross-unit write into the integer GPR file, preserving EAX[31:16]; the canonical post-compare flag read) or to a 16-bit memory operand. The FWAIT-prefixed wait-forms (FINIT, FCLEX, FSTCW, FSTSW, FSTENV, FSAVE) all require the standalone 0x9B byte, which is undecoded → HALT, so only the no-wait FN* siblings execute.
- Deferred x87 (
FBLD/FBSTP, FXTRACT family, FWAIT, FLDENV/FNSTENV, FSAVE/FRSTOR, transcendentals) All NP and not reached. The packed-BCD FBLD/FBSTP (DF /4, /6), the FXTRACT/FPREM1/FPREM/FRNDINT/FSCALE family (D9 F4/F5/F8/FC/FD), the standalone FWAIT (0x9B, with the FX_FWAIT execute arm as dead code), the environment ops FLDENV/FNSTENV (D9 /4, /6), the full-state ops FRSTOR/FNSAVE (DD /4, /6, plus the FSAVE wait-form), and the transcendentals F2XM1/FYL2X/FPTAN/FPATAN/FYL2XP1/FSINCOS/FSIN/FCOS (D9 F0-F3/F9/FB/FE/FF) are each absent from their decode case (or hit a default: d_unknown), so an assembled instance clears d_is_x87 and takes the loud unknown-opcode HALT rather than mis-execute. These are spec’d as later-milestone work (BCD, environment/state, and an ulp-tolerance transcendental oracle).
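The status-word handling described in the compare and control entries above (the C3:C2:C0 encoding, the `fstat &= 0x7f00` FNCLEX mask, and the TOP overlay in FNSTSW) can be exercised with a small Python sketch. The bit positions are the architectural x87 status-word layout (C0 = bit 8, C1 = bit 9, C2 = bit 10, TOP = bits 13:11, C3 = bit 14); the function names are hypothetical, and `apply_cmp_model` is only an illustration of what the text attributes to `apply_cmp`.

```python
C0, C1, C2, C3 = 1 << 8, 1 << 9, 1 << 10, 1 << 14
TOP_SHIFT = 11
TOP_MASK = 7 << TOP_SHIFT

# C3:C2:C0 encodings for the four compare outcomes.
CMP_CODES = {
    "gt": 0,                     # 000
    "lt": C0,                    # 001
    "eq": C3,                    # 100
    "unordered": C3 | C2 | C0,   # 111
}


def apply_cmp_model(fstat, relation):
    """Write C3/C2/C0 for a compare outcome; C1 and other bits are kept."""
    return (fstat & ~(C3 | C2 | C0) & 0xFFFF) | CMP_CODES[relation]


def fnclex(fstat):
    """Clear exception/busy bits, keeping C0-C3 and the TOP field."""
    return fstat & 0x7F00


def fnstsw(fstat, ftop):
    """Status word with the live TOP overlaid into bits 13:11."""
    return ((fstat & ~TOP_MASK) | ((ftop & 7) << TOP_SHIFT)) & 0xFFFF


def fnstsw_ax(eax, fstat, ftop):
    """FNSTSW AX: replace AX, preserving EAX[31:16]."""
    return (eax & 0xFFFF0000) | fnstsw(fstat, ftop)
```

After a compare, software typically reads the result with `FNSTSW AX`; in this model `fnstsw_ax(0x12340000, C3, 7)` returns `0x12347800`, the "equal" code with TOP=7 in the low half and the upper half of EAX untouched.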