=================== Instruction Catalog =================== This is the per-instruction reference for the Ventium P5/P54C replica. Every instruction the integer and x87 cores decode is listed under its category, together with how it drives the pipeline datapath and which pipe (U or V) it may issue to. Read the **Primer** and **Legend** first: they define the five-stage dual-issue datapath and the U/V pairing classes (``UV`` / ``PU`` / ``PV`` / ``NP``) that every table column below refers to. Each category then has a *list-table* giving, per instruction, the mnemonic, encoding, U/V class, a one-line datapath usage summary, and an honest status; prose under each table fills in what the instruction computes and the full datapath story where it does not fit in a cell. .. _primer: Primer: the five-stage dual-issue datapath ========================================== Ventium models the Pentium's classic **in-order, dual-issue, five-stage** integer pipeline: ================ ============================================================ Stage What happens ================ ============================================================ **PF** Prefetch Instruction bytes are fetched from the L1 I-cache into the prefetch/instruction buffer; prefixes are consumed by the prefix machine. **D1** Decode-1 The opcode is decoded. The *fast-path* decoder (``decode.sv``) recognises the simple, single-cycle forms and sets ``simple=1`` plus the pairing hints ``pairs_first`` / ``pairs_second``. The pairing checker (``issue_uv``) decides whether two adjacent instructions may dual-issue. The BTB is looked up here for branches. **D2** Decode-2 Operands are read from the GPR file; the AGU forms effective addresses for memory operands (and is the source of AGI interlocks). **EX** Execute The shared single-cycle ALU / shifter / branch-resolve logic runs in whichever pipe hosts the instruction. Memory loads issue to the L1 D-cache here. 
**WB** Writeback Results are committed to the register file and EFLAGS; EFLAGS / register results are forwarded by the full ``EX→EX`` and ``WB→EX`` bypass network so a dependent chain of simple ops runs at 1/clock. ================ ============================================================ Two parallel execution slots run this pipeline: the **U pipe** (the primary slot, which may lead a pair) and the **V pipe** (the secondary slot, which fills behind a U-pipe instruction). When two adjacent instructions satisfy the pairing rules they *dual-issue* and retire together (up to 2/clock); otherwise the instruction issues alone in U with the V slot idle. Not every instruction takes this **fast path**. Byte-operand forms, memory read-modify-write forms, immediate+ModR/M forms, all 0F-prefixed (two-byte) opcodes, string primitives, the x87 escapes, and every system/microcoded op instead decode on the **slow multi-cycle FSM** in ``core.sv`` (``S_DECODE → S_LOAD → S_EXEC → S_STORE`` and friends, plus the ``S_USEQ`` microsequencer for multi-beat ops). The slow path is functionally identical where the fast path also exists (it shares the same ``alu_result`` / ``flags_next`` combinational logic), but it *serializes*: an instruction on the slow FSM issues alone, holding the in-order pipe until it retires. Many instructions are therefore "AP-500 pairable by class" yet, in *this* RTL, only ever run on the slow path — the catalog records both facts honestly. .. _legend: Legend: the U/V pairing classes =============================== The Pentium optimization reference (informally "AP-500") classifies each instruction by *how* it may participate in a U/V pair. Ventium uses the same four classes. The class is a property of the instruction's datapath needs: ``UV`` — pairable in **either** pipe A simple, single-cycle ALU/MOV op with no carry-chain input, no shifter, and no microcode. 
It can **lead** a pair (as the U member) or **fill** the V slot, subject only to the ordinary RAW/WAW/displacement-plus-immediate pairing checks. *Datapath rationale:* both the U and V ALUs implement the same one-cycle ``alu_result`` / ``flags_next`` logic, so either can host it. ``PU`` — pairable **U-pipe only** The op may **lead** a pair (U member) but can **never** fill the V slot. *Datapath rationale:* it needs a resource only the U pipe provides — the forwarded/latched carry (``ADC`` / ``SBB`` / ``RCL`` / ``RCR``) or the shifter and its CF/OF latch (``SHL`` / ``SHR`` / ``SAR``). The V ALU has no carry forwarding and no shift unit, so an op placed there would read the *stale* architectural ``EFLAGS[0]`` (carry) and corrupt state, or have no shifter at all. A prefixed instruction is also U-only per AP-500 §5.6.2.3 (the prefix-decode slot is a U-pipe resource). ``PV`` — pairable **V-pipe only** The op may **fill** the V slot (behind a leading U op) but can **never** lead a pair. *Datapath rationale:* branches. A taken branch redirects the fetch stream, so nothing can issue *after* it in the same clock — it must be the *trailing* member. This is exactly the ``cmp;jcc`` / ``dec;jnz`` special pair: a flags-writing U op forwards its result flags to the V-slot branch, which resolves against the new flags. ``NP`` — **not pairable** The op always issues alone in U with V idle. *Datapath rationale:* it is microcoded / multi-cycle (``MUL`` / ``DIV``, string ops, ``PUSHA``), a unary or two-source op outside the simple template (``NEG`` / ``NOT`` / ``SHLD``), a memory read-modify-write or two-access op, a system / privileged op, or simply not whitelisted by the fast-path decoder so it falls to the serializing slow FSM. .. note:: Two distinct facts are tracked throughout. **AP-500 class** is the *architectural* pairing class the real Pentium assigns. **Realized pairing** is whether *this* Ventium RTL actually dual-issues the instruction on its fast path. 
They often agree, but where the fast-path decoder does not whitelist an otherwise-pairable form (so it runs on the slow FSM and serializes), the U/V-class cell states the AP-500 class and the prose / status notes the divergence. A status of *"slow FSM"* means *functionally correct but serialized*; *"deferred / HALTs"* means the opcode is not decoded and traps to a loud ``S_HALT`` rather than mis-execute. .. _int-alu: INT-ALU — integer arithmetic and logic ====================================== The integer ALU group is the core of the fast path. The simple register-form ``01/03``-style ops are single-cycle and pairable in either pipe; the carry-chain ops (``ADC`` / ``SBB``) are pinned to U; the unary ops (``NEG`` / ``NOT``) and most byte / memory / accumulator-immediate forms run on the slow multi-cycle FSM. All share the same combinational ``alu_result`` / ``flags_next`` logic, so the slow forms are bit-identical to the fast ones — they simply do not pair. .. list-table:: :header-rows: 1 :widths: 22 22 12 30 14 * - Mnemonic - Encoding - U/V class - Datapath usage - Status * - ``ADD r/m, r`` - ``00 /r`` (8-bit); ``01 /r`` (16/32) - UV - Single-cycle U/V ALU_ADD; reg-form ``01`` fast-pathed, byte/mem on slow FSM. - implemented * - ``ADD r, r/m`` - ``02 /r``; ``03 /r`` - UV - Same ALU, dst = reg field; ``03`` reg-form fast-path, reg←mem on slow FSM. - implemented * - ``ADD AL/eAX, imm`` - ``04 ib``; ``05 iz`` - UV (AP-500) - Slow FSM only — no fast-path arm, so unpaired in practice. - implemented (slow FSM) * - ``ADD r/m, imm`` - ``80 /0 ib``; ``81 /0 iz`` - UV (AP-500) - Slow FSM group1; reg write or mem load→ALU→store RMW. - implemented (slow FSM) * - ``ADD r/m, imm8`` (sign-ext) - ``83 /0 ib`` - UV - Canonical pairable imm form: imm-only (no disp), single-cycle U/V; mem on slow FSM. - implemented * - ``ADC r/m, r`` · ``ADC r, r/m`` - ``10`` / ``11`` / ``12`` / ``13 /r`` - **PU** - Carry-chain ALU_ADC; reg-form fast-pathed **U-only** (V has no CF forwarding). 
- implemented (PU) * - ``ADC AL/eAX, imm`` · ``r/m, imm`` · ``r/m, imm8`` - ``14 ib`` ``15 iz``; ``80 /2`` ``81 /2`` ``83 /2`` - **PU** - ``83 /2`` reg-form fast-pathed PU; acc/group1-imm on slow FSM. - implemented * - ``SUB r/m, r`` · ``SUB r, r/m`` - ``28`` / ``29`` / ``2A`` / ``2B /r`` - UV - Single-cycle ALU_SUB, no CF input; reg-form fast-path, byte/mem slow FSM. - implemented * - ``SUB AL/eAX, imm`` · ``r/m, imm`` · ``r/m, imm8`` - ``2C ib`` ``2D iz``; ``80 /5`` ``81 /5`` ``83 /5`` - UV - ``83 /5`` reg-form fast-pathed UV; acc/group1-imm on slow FSM. - implemented * - ``SBB r/m, r`` · ``SBB r, r/m`` - ``18`` / ``19`` / ``1A`` / ``1B /r`` - **PU** - Borrow-chain ALU_SBB; reg-form fast-pathed **U-only** (the ADC twin). - implemented (PU) * - ``SBB AL/eAX, imm`` · ``r/m, imm`` · ``r/m, imm8`` - ``1C ib`` ``1D iz``; ``80 /3`` ``81 /3`` ``83 /3`` - **PU** - ``83 /3`` reg-form fast-pathed PU; acc/group1-imm on slow FSM. - implemented * - ``AND r/m, r`` · ``AND r, r/m`` - ``20`` / ``21`` / ``22`` / ``23 /r`` - UV - Single-cycle ALU_AND (CF/OF/AF cleared); reg-form fast-path. - implemented * - ``AND AL/eAX, imm`` · ``r/m, imm`` · ``r/m, imm8`` - ``24 ib`` ``25 iz``; ``80 /4`` ``81 /4`` ``83 /4`` - UV - ``83 /4`` reg-form fast-pathed UV; acc/group1-imm on slow FSM. - implemented * - ``OR r/m, r`` · ``OR r, r/m`` - ``08`` / ``09`` / ``0A`` / ``0B /r`` - UV - Single-cycle ALU_OR; reg-form fast-path, byte/mem slow FSM. - implemented * - ``OR AL/eAX, imm`` · ``r/m, imm`` · ``r/m, imm8`` - ``0C ib`` ``0D iz``; ``80 /1`` ``81 /1`` ``83 /1`` - UV - ``83 /1`` reg-form fast-pathed UV; acc/group1-imm on slow FSM. - implemented * - ``XOR r/m, r`` · ``XOR r, r/m`` - ``30`` / ``31`` / ``32`` / ``33 /r`` - UV - Single-cycle ALU_XOR; the ``xor reg,reg`` zero idiom pairs. - implemented * - ``XOR AL/eAX, imm`` · ``r/m, imm`` · ``r/m, imm8`` - ``34 ib`` ``35 iz``; ``80 /6`` ``81 /6`` ``83 /6`` - UV - ``83 /6`` reg-form fast-pathed UV; acc/group1-imm on slow FSM. 
- implemented * - ``CMP r/m, r`` · ``CMP r, r/m`` - ``38`` / ``39`` / ``3A`` / ``3B /r`` - UV - ALU_SUB, EFLAGS only (no reg write); forwards flags U→V to a paired Jcc. - implemented * - ``CMP AL/eAX, imm`` · ``r/m, imm`` · ``r/m, imm8`` - ``3C ib`` ``3D iz``; ``80 /7`` ``81 /7`` ``83 /7`` - UV - ``83 /7`` reg-form fast-pathed UV (flags-forward to Jcc); others slow FSM. - implemented * - ``TEST r/m, r`` - ``84 /r``; ``85 /r`` - UV (AP-500) - ALU_TEST, EFLAGS only; slow FSM only — unpaired in practice. - implemented (slow FSM) * - ``TEST eAX, imm`` - ``A8 ib``; ``A9 iz`` - UV - ``A9`` (eAX,imm32) fast-pathed UV, EFLAGS only; ``A8`` byte on slow FSM. - implemented * - ``TEST r/m, imm`` - ``F6 /0,/1 ib``; ``F7 /0,/1 iz`` - **NP** - group3 slow FSM; reg-form only — memory form sets ``d_unknown`` (deferred). - partial * - ``INC r`` - ``40+r`` (40..47) - UV - ALU_INC (CF preserved), ``b_op`` forced to 1; 1-byte op, fast-pathed. - implemented * - ``DEC r`` - ``48+r`` (48..4F) - UV - ALU_DEC (CF preserved); ``dec;jnz`` forwards flags U→V (loop idiom). - implemented * - ``INC/DEC r/m`` - ``FE /0,/1``; ``FF /0,/1`` - UV (AP-500) - group4/5 slow FSM; reg write or mem RMW — unpaired in practice. - implemented (slow FSM) * - ``NEG r/m`` - ``F6 /3``; ``F7 /3`` - **NP** - Unary ALU_NEG on the slow FSM; never paired (matches AP-500 NP). - implemented (slow FSM, NP) * - ``NOT r/m`` - ``F6 /2``; ``F7 /2`` - **NP** - Unary ALU_NOT, no flags, on the slow FSM; never paired. - implemented (slow FSM, NP) ADD (``ADD r/m,r`` ``00/01``; ``ADD r,r/m`` ``02/03``) Computes ``dst <- dst + src`` and sets all six status flags (CF/PF/AF/ZF/SF/OF). ``alu_op = ALU_ADD``; the result is ``a+b``, CF is the carry-out at the operand width, OF is signed overflow. 
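The flag rules stated for ``ADD`` can be sketched in a few lines of Python. This is a behavioural model under stated assumptions, not the Ventium ``alu_result`` / ``flags_next`` RTL: the function name ``add_flags`` and the dictionary layout are hypothetical, and the PF/AF definitions follow the standard x86 rules (even parity of the low byte, carry out of bit 3).

```python
def add_flags(a: int, b: int, width: int = 32) -> dict:
    """Model dst <- dst + src and the six status flags at the given width.

    Hypothetical helper for illustration; not taken from the RTL.
    """
    mask = (1 << width) - 1
    sign = 1 << (width - 1)
    a &= mask
    b &= mask
    full = a + b                 # unbounded sum keeps the carry-out bit
    result = full & mask
    return {
        "result": result,
        "CF": int(full > mask),                                # carry-out at operand width
        "PF": int(bin(result & 0xFF).count("1") % 2 == 0),     # even parity of the low byte
        "AF": int(((a ^ b ^ result) & 0x10) != 0),             # carry out of bit 3
        "ZF": int(result == 0),
        "SF": int((result & sign) != 0),
        "OF": int(((a ^ result) & (b ^ result) & sign) != 0),  # signed overflow
    }
```

For example, ``add_flags(0x7FFFFFFF, 1)`` sets OF and SF without CF, while ``add_flags(0xFFFFFFFF, 1)`` sets CF and ZF without OF.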
**Datapath:** the register-form 32-bit op (``01`` / ``03`` with ``mod==11``) is the canonical fast-path case — D1 decodes it ``simple``, the pairing checker admits it (simple, no displacement+immediate, no RAW/WAW), D2 reads both GPRs, EX runs the shared single-cycle ``alu_result``/``flags_next`` in whichever pipe hosts it, and WB commits the register and EFLAGS. Full ``EX→EX`` and ``WB→EX`` bypass lets a dependent ``ADD`` chain run at 1/clock; **UV** because it needs no carry input, so either ALU can host it. The byte forms (``00`` / ``02``) and *all* memory-operand forms drop to the slow FSM (load+ALU, or load→ALU→store RMW): functionally identical, but serialized. ADD with immediate (``04/05`` acc; ``80/81 /0``; ``83 /0``) ``dst <- dst + imm``, same flags. Only the **``83 /0``** sign-extended-imm8 *register* form is fast-pathed and pairable: it is imm-only (no displacement), so the ``disp_imm`` pairing veto does not fire, and it runs single-cycle in U or V. The accumulator forms (``04``/``05``) and the full-width group1 immediate forms (``80``/``81``) have *no* fast-path arm — they decode only on the slow FSM and therefore issue alone, even though AP-500 rates them UV. ADC / SBB — the carry-chain ops (``10–13``, ``18–1B``, group1 ``/2`` ``/3``) ``ADC: dst <- dst + src + CF``; ``SBB: dst <- dst - src - CF``. These are the multi-word add/subtract primitives, and they are **PU — U-pipe only**. This is the central AP-500 finding the project grounds on: the carry input ``cfin`` comes from ``EFLAGS[0]``, and **only the U pipe forwards/latches the carry**. The decoder makes this explicit — for ``ADC``/``SBB`` it sets ``pairs_first=1`` but ``pairs_second=0`` (via ``pairs_second = !(b0[5:3]==010 || ==011)``), with the RTL comment stating that "the V ALU path has no CF forwarding, so an ``adc``/``sbb`` in V would consume the STALE architectural carry and corrupt arch state." 
So an ``ADC`` may *lead* a pair (U member) provided the V slot holds a non-ADC/SBB op, but can never *fill* V. The ``83 /2`` and ``83 /3`` reg forms are fast-pathed PU; the accumulator (``14``/``15``, ``1C``/``1D``) and group1 immediate forms run on the slow FSM with the same stale-carry rationale. SUB / AND / OR / XOR (``28–2B`` etc.) Simple ALU ops with **no carry input**, so all **UV**. ``SUB`` is ``a-b`` with borrow → CF and signed-subtract overflow → OF, setting all six flags. ``AND``/``OR``/``XOR`` force CF=OF=AF=0 and set ZF/SF/PF from the result. Reg-reg forms (``29``/``2B``, ``21``/``23``, ``09``/``0B``, ``31``/``33``) are fast-pathed; byte and memory forms run on the slow FSM. Note ``xor eax,eax`` still reads and writes ``eax``, so it will not fill V behind a U op that writes ``eax`` (RAW/WAW masks), but is otherwise UV. CMP / TEST (``38–3B``, ``3C/3D``, ``83 /7``; ``84/85``, ``A8/A9``, ``F6/F7 /0,/1``) ``CMP`` computes ``a-b`` like ``SUB`` but discards the result and writes only EFLAGS (the register write mask is 0). It is **UV**, and is the U member of the ``cmp;jcc`` special pair: when ``CMP`` leads and a ``Jcc`` fills V, the core computes ``u_flags_eff`` and *forwards* the new result flags so the paired branch resolves against them. The ``39``/``3B`` and ``83 /7`` reg forms are fast-pathed with this U→V flags forwarding; other forms run on the slow FSM. ``TEST`` is a non-storing ``AND`` (ALU_TEST, reusing the AND result), EFLAGS only. Only the ``A9 eAX,imm32`` form is fast-pathed UV; ``84``/``85`` and ``A8`` run on the slow FSM. ``TEST r/m,imm`` (``F6/F7 /0,/1``) is **NP** per AP-500 (only the reg,reg/mem,reg/imm,acc forms are UV); its register form runs on the slow FSM, and its *memory* form is marked ``d_unknown`` and deferred. INC / DEC (``40+r``, ``48+r``; ``FE/FF /0,/1``) ``dst <- dst ± 1`` with **CF preserved** — distinct from the unary ``NEG``/``NOT`` below, ``INC``/``DEC`` *are* **UV**. 
The ALU reuses ``a+b`` with the second operand forced to 1; the ``ALU_INC``/``ALU_DEC`` flag arms keep CF unchanged and set OF/AF/ZF/SF/PF. The 32-bit ``40+r``/``48+r`` encodings are 1-byte and fast-pathed (a 1-byte first member is always allowed to pair per the I-cache-split exception). ``dec;jnz`` forwards its flags U→V exactly like ``CMP`` — the loop idiom. The ``FE``/``FF`` r/m forms (including the byte ``INC``/``DEC``) decode only on the slow FSM (reg write or mem RMW) and so issue alone here. NEG / NOT (``F6/F7 /3``, ``F6/F7 /2``) Unary ops, both **NP** — AP-500 explicitly classes them not-pairable despite looking ALU-like, and Ventium has no fast-path arm for them, so they always serialize on the group3 slow FSM. ``NEG`` is ``0 - dst`` (two's-complement) and sets CF=(dst≠0) plus OF/AF/ZF/SF/PF. ``NOT`` is ``~dst`` (one's-complement) and affects **no flags** (``d_writes_flags`` stays 0). Both have a reg form (writes the GPR) and a mem form (load→modify→store RMW). .. _data-mov: DATA-MOV — data movement ======================== Data movement spans the simplest pairable op in the machine (``NOP``) and the most microcoded (segment-register loads). The plain register ``MOV`` and load-immediate forms are fast-pathed UV; register-base loads and ``LEA [base]`` are fast-pathed but interact with the AGU (and the AGI interlock); ``MOVZX`` / ``MOVSX`` / ``XCHG`` / ``LAHF`` / ``CBW`` are NP slow-FSM ops; segment-register ``MOV`` is a microcoded system path; and ``XLAT`` is undecoded. .. list-table:: :header-rows: 1 :widths: 26 20 12 28 14 * - Mnemonic - Encoding - U/V class - Datapath usage - Status * - ``MOV r/m, r`` - ``88 /r``; ``89 /r`` - UV - Reg-form ``89`` single-cycle ALU_MOV (no flags); ``88``/16-bit/mem on slow FSM. - implemented * - ``MOV r, r/m`` - ``8A /r``; ``8B /r`` - UV - ``8B`` reg-form & reg-base load fast-pathed (load is U-member only); ``8A``/disp/SIB slow FSM. 
- implemented * - ``MOV r, imm`` - ``B0+rb``; ``B8+rd`` - UV - ``B8+r`` imm32 single-cycle (no operand read → never an AGI source); ``B0-B7``/16-bit slow FSM. - implemented * - ``MOV r/m, imm`` - ``C6 /0 ib``; ``C7 /0 iz`` - UV (AP-500) - No fast-path arm → slow FSM only (the disp+imm form the checker excludes); issues alone. - implemented (slow FSM) * - ``MOV AL/eAX, moffs`` / ``moffs, AL/eAX`` - ``A0`` ``A1`` (load); ``A2`` ``A3`` (store) - UV (+ erratum) - Slow-FSM functional; ``A2``/``A3`` modeled in cycle-mode for pairing + Erratum 59. - implemented * - ``MOV r/m16, Sreg`` - ``8C /r`` - **NP** - System path (``SYS_MOVSREG_FROM``); reg-form only, mem dst deferred. - partial * - ``MOV Sreg, r/m16`` - ``8E /r`` - **NP** - System path (``SYS_MOVSREG_TO``); reg-form source only, mem source deferred. - partial * - ``MOVZX`` / ``MOVSX`` - ``0F B6/B7`` (zx); ``0F BE/BF`` (sx) - **NP** - 0F-prefixed slow FSM (``K_EXT``); zero/sign-extend then reg_merge; never paired. - implemented (slow FSM, NP) * - ``XCHG`` - ``86 /r``; ``87 /r``; ``90+rd`` - **NP** - Read-modify-write swap (``K_XCHG``); two writes / locked mem RMW; slow FSM. - implemented (slow FSM, NP) * - ``NOP`` - ``90`` (no 66h) - UV - Zero side-effect 1-byte op; fast-pathed, pairs in either slot with anything. - implemented * - ``LEA r, m`` - ``8D /r`` - UV - ``[base]`` form fast-pathed: AGU writes ``gpr[base]`` to dst, no mem port; AGI-aware. SIB/disp on slow FSM. - implemented * - ``LAHF`` / ``SAHF`` - ``9F``; ``9E`` - **NP** - Dedicated flags↔AH transfer (``K_STKMISC``); slow FSM, no memory. - implemented (slow FSM, NP) * - ``CBW/CWDE`` · ``CWD/CDQ`` - ``98``; ``99`` - **NP** - Accumulator sign-extend convert (``K_CONV``); slow FSM; no memory. - implemented (slow FSM, NP) * - ``XLAT`` / ``XLATB`` - ``D7`` - **NP** - Table-lookup load — *not decoded*; falls to ``d_unknown`` → HALT. 
- deferred / HALTs * - Segment-override prefixes - ``2E 36 3E 26 64 65`` - PU/NP - Prefix machine redirects the next memory ref's segment; prefixed op stays on slow FSM. - implemented (prefix) * - Operand/address-size prefixes - ``66``; ``67`` - PU/NP - Prefix machine folds size into ``eff_opsize``/``eff_addr`` + length; prefixed op on slow FSM. - implemented (prefix) MOV register and immediate forms (``88/89``, ``8A/8B``, ``B0-BF``) Copies a value with no flag effect: ``alu_op = ALU_MOV`` returns the source operand straight through the ALU. The **register-form** ``89`` (store reg→reg, ``mod==11``) and ``8B`` (load reg←reg) are fast-pathed **UV** — D2 reads the source GPR, EX passes it through the U or V ALU, WB writes the destination, single-cycle with full bypass and no flags. ``B8+r`` (imm32) is the only fast-pathed *load-immediate*: it takes the literal straight to WB and, reading no operand, is never an AGI source. Sub-32-bit destinations use ``reg_merge`` to preserve the unwritten bytes (high-8 ``AH..BH`` via ``d_dst_high8``/``d_src_high8``). All byte (``88``/``8A``/``B0-B7``), 16-bit, and memory-destination forms decode only on the slow FSM (``K_ALU`` / ``S_STORE``) and serialize. ``MOV r/m,imm`` (``C6``/``C7``) has no fast-path arm at all — its (often disp+imm) encoding is exactly what the pairing checker excludes — so it is slow-FSM-only. MOV r, r/m as a load (``8B``, ``mod==00``) The register-base **load** sub-form is a real L1 D-cache access. ``decode.sv`` sets ``is_load=1``, ``base=rm``, ``addr_mask=onehot(rm)``; D2's AGU forms the address from ``gpr[base]``, EX issues the load, WB writes the destination, and the 2-way LRU hit/miss state machine *defers* a miss penalty (``P5_DMISS``, plus ``P5_MISALIGN`` for a split) to the next instruction. Because a V-slot load is forbidden by the conservative ``issue_uv`` checker (``v.is_load`` ⇒ no pair), this load is **UV only when leading** a pair (a U-member load); it cannot fill V. 
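The pairing rules described so far (both members ``simple``, the ``pairs_first`` / ``pairs_second`` hints, the displacement-plus-immediate veto, the RAW/WAW masks, and the "no load in V" rule) can be collected into one small predicate. The sketch below is a simplified Python model, not the ``issue_uv`` RTL: the ``Insn`` record, the ``reads`` / ``writes`` one-hot mask fields, and the function name ``can_pair`` are assumptions made for illustration, and special pairs such as flags forwarding to a branch are left out.

```python
from dataclasses import dataclass


@dataclass
class Insn:
    """Hypothetical decode record; field names mirror the hints named in the text."""
    simple: bool = True        # recognised by the fast-path decoder
    pairs_first: bool = True   # may lead a pair (U member)
    pairs_second: bool = True  # may fill the V slot
    is_load: bool = False      # register-base load
    disp_imm: bool = False     # carries both a displacement and an immediate
    reads: int = 0             # one-hot GPR read mask (assumed layout)
    writes: int = 0            # one-hot GPR write mask (assumed layout)


def can_pair(u: Insn, v: Insn) -> bool:
    """Return True if u (U slot) and v (V slot) may dual-issue in this model."""
    if not (u.simple and v.simple):
        return False           # a slow-FSM instruction issues alone
    if not (u.pairs_first and v.pairs_second):
        return False           # e.g. ADC/SBB can lead but never fill V
    if u.disp_imm or v.disp_imm:
        return False           # displacement+immediate veto
    if v.is_load:
        return False           # conservative rule: no load in the V slot
    if u.writes & (v.reads | v.writes):
        return False           # RAW or WAW on a GPR written by U
    return True
```

With one-hot masks such as ``EAX = 1`` and ``ECX = 2``, two independent register ``ADD`` instructions pair, while the same ``ADD`` followed by an ``ADC`` does not, because the ``ADC`` record has ``pairs_second=False``.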
**AGI:** if ``base`` was written the immediately-preceding clock, ``pipe_agi`` fires a 1-cycle stall. Any disp / SIB load (and the byte ``8A``) goes to the slow FSM. MOV moffs (``A0-A3``) and the Erratum-59 model Absolute-displacement ``MOV`` between ``(e)AX`` and the memory at a 32-bit moffs (no ModR/M — the 4-byte displacement *is* the EA). Functional execution is **slow-FSM only** (``A0``/``A1`` load ``AL``/``eAX``; ``A2``/``A3`` store them). The fast-path decoder recognises ``A2``/``A3`` *only in cycle mode* (``is_moffs``, reads ``EAX``) purely to **model** the retire/pairing and **Erratum 59**: with ``errata_en[ERR_MOFFS]`` set, a moffs store fails to pair when the following (V) instruction references ``EAX`` — a false ``EAX`` dependency the modeled P5 instruction unit injects (the ``moffs_falsedep`` check suppresses pairing). With errata off the core pairs them normally. MOV to/from segment registers (``8C``, ``8E``) **NP** — AP-500 excludes seg-register ``MOV`` from the UV data-MOV class. A selector access is a microcoded **system** datapath, not the simple ALU. ``8C`` (``SYS_MOVSREG_FROM``) writes the zero-extended 16-bit selector of ``Sreg`` into ``r/m16``; ``8E`` (``SYS_MOVSREG_TO``) loads ``Sreg`` from ``r/m16`` (real/flat mode: ``base=sel<<4``, ``limit=0xFFFF``, ``attr=0x93``; the protected-mode descriptor-load fault is *computed* but delivery is a later milestone, so a fault can only HALT). Both are reg-form-only — a memory operand raises ``d_unknown`` and HALTs. MOVZX / MOVSX (``0F B6/B7/BE/BF``) **NP** — AP-500 lists these as not-pairable (0F-prefixed, 3+ cycles), and the two-byte ``0F`` escape is not in the fast-path decoder, so they always run on the slow FSM (``K_EXT``). They load ``r/m8`` or ``r/m16`` into a 16/32-bit register, zero-extended (``B6``/``B7``) or sign-extended (``BE``/``BF``); ``d_ext_srcw`` selects source width and ``d_ext_signed`` the extension, with ``reg_merge`` into the destination at the operand width. 
``MOVZX`` / ``MOVSX`` affect no flags; high-8 byte sources are handled via ``d_src_high8``.

XCHG / NOP (``86/87``, ``90+r``, ``90``)
    ``XCHG`` is **NP**: a read-modify-write *swap* (two register/memory
    writes, an implicit ``LOCK`` on memory forms), not a simple single-write
    ALU op, so it runs on the slow FSM (``K_XCHG``). The reg-form cross-writes
    both GPRs; the mem-form does a locked load+store. ``NOP`` (``90`` with no
    ``0x66``) is the exception: it is **UV**, ``is_nop=1``, fast-pathed, and
    writes *nothing* (no register, memory, or flag), so it pairs trivially in
    either slot; being 1-byte it also satisfies the I-cache-split pairing
    exception.

LEA (``8D``)
    **UV**: ``LEA`` computes an effective address in the AGU and writes it to
    a register *without* a memory access or flag write, so it slots cleanly
    into U or V. Only the simplest ``[base]`` form (``mod==00``, ``rm`` not
    ``100``/``101``) is fast-pathed: ``is_lea``, ``base=rm``,
    ``addr_mask=onehot(rm)``, and the U/V commit writes
    ``gpr[dst] <= gpr[base]`` directly (EA == base value), single-cycle with
    no memory port used. **AGI:** ``addr_mask`` drives ``pipe_agi``, so an
    ``LEA`` whose base was written the prior clock takes the 1-cycle AGI stall
    (it consumes the AGU). Full SIB/disp/index ``LEA`` runs on the slow FSM
    (``gpr[dst] <= q_ea``).

LAHF / SAHF and CBW/CWDE / CWD/CDQ (``9F/9E``, ``98/99``)
    All **NP**. ``LAHF`` copies ``EFLAGS[7:0]`` into ``AH``; ``SAHF`` writes
    the five status flags back from ``AH`` (``K_STKMISC``, no memory). ``98``
    sign-extends ``AL→AX`` (CBW) or ``AX→EAX`` (CWDE); ``99`` sign-extends
    ``AX→DX:AX`` (CWD) or ``EAX→EDX:EAX`` (CDQ) (``K_CONV``). All are
    microcoded converts / flag transfers on the slow FSM, serializing.

XLAT (``D7``)
    **NP** and **not decoded**: a table-lookup load (``AL <- [(E)BX + AL]``,
    implicit addressing) that has no opcode arm at all; ``D7`` falls through
    to the one-byte default ``d_unknown`` and HALTs loudly, so it never
    reaches an execute datapath.
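As a reference point for the ``LAHF`` / ``SAHF`` and sign-extend entries above, the following Python sketch models their architectural effect on plain integers. It is an illustration under stated assumptions, not the ``K_STKMISC`` / ``K_CONV`` RTL: all function names are hypothetical, and the status-flag mask ``0xD5`` (SF, ZF, AF, PF, CF) follows the standard x86 EFLAGS layout.

```python
STATUS_MASK = 0xD5  # SF | ZF | AF | PF | CF in the low EFLAGS byte


def lahf(eflags: int) -> int:
    """Return the byte LAHF loads into AH (EFLAGS[7:0])."""
    return eflags & 0xFF


def sahf(eflags: int, ah: int) -> int:
    """Return EFLAGS after SAHF: only the five status flags come from AH."""
    return (eflags & (0xFFFFFFFF ^ STATUS_MASK)) | (ah & STATUS_MASK)


def sign_extend(value: int, from_bits: int, to_bits: int) -> int:
    """Sign-extend the low from_bits of value to to_bits (unsigned result)."""
    value &= (1 << from_bits) - 1
    if value & (1 << (from_bits - 1)):
        value |= ((1 << to_bits) - 1) ^ ((1 << from_bits) - 1)
    return value


def cbw(eax: int) -> int:
    """98 with 16-bit operand size: AL -> AX, upper half of EAX unchanged."""
    return (eax & 0xFFFF0000) | sign_extend(eax, 8, 16)


def cwde(eax: int) -> int:
    """98 with 32-bit operand size: AX -> EAX."""
    return sign_extend(eax, 16, 32)


def cdq(eax: int) -> int:
    """99 with 32-bit operand size: return the new EDX (sign fill of EAX)."""
    return 0xFFFFFFFF if eax & 0x80000000 else 0
```

Note that ``sahf`` leaves every bit outside the five status flags untouched, which is the property that distinguishes it from a full ``POPF``-style rewrite.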
Prefixes (segment-override ``2E/36/3E/26/64/65``; ``66``/``67``) A prefix is **PU** on the slow path (AP-500 §5.6.2.3 makes a prefixed instruction U-only-pairable) but **effectively NP** in *this* RTL, because the fast-path decoder only recognises *unprefixed* opcodes — so any prefixed instruction misses the fast path and serializes. The segment overrides record ``d_pfx_seg_en``/``d_pfx_seg_idx`` and redirect the next memory reference's segment in the AGU. ``0x66``/``0x67`` toggle operand / address size (with the real-mode ``def16`` inversion, so ``0x66`` selects 32-bit there), feeding ``d_w`` and the length functions across every decode arm; they compute nothing themselves but the instruction they prefix then executes on the slow FSM at the chosen width. .. _stack: STACK — push, pop, and frame management ======================================= The stack group exercises the store/load AGU with the implicit ``ESP`` register. A subtlety threads the whole group: the decoder *masks ``ESP`` out* of the reads/writes bitmasks (``_onehot`` returns 0 for ``R_ESP``), so a ``push/push`` or ``push/call`` sequence never trips a false ``ESP`` RAW/WAW — the AP-500 §5.6.4 special-pair rule. Consequently ``PUSH``/``POP`` of a register or immediate are **UV by class**. *However*, none of the stack ops are whitelisted by the fast-path decoder (``decode.sv`` does not decode them), so in *this* Ventium they all run on the slow FSM and do **not** emergently dual-issue — the catalog records the AP-500 class, not a realized pairing. The memory-form, multi-register, and flags/segment forms are genuinely NP / microcoded. .. list-table:: :header-rows: 1 :widths: 24 22 14 28 12 * - Mnemonic - Encoding - U/V class - Datapath usage - Status * - ``PUSH r16/r32`` - ``50+rd`` (66h → r16) - UV (AP-500) - Pre-decrement ESP store; SS-based AGU; slow-FSM single store (not fast-pathed). 
- implemented (slow FSM) * - ``POP r16/r32`` - ``58+rd`` (66h → r16) - UV (AP-500) - Load from [SS:ESP] then ESP += w; slow-FSM single load (not fast-pathed). - implemented (slow FSM) * - ``PUSH imm`` - ``68 id``; ``6A ib`` (sign-ext) - UV (AP-500) - Pre-decrement ESP store of the latched immediate; slow-FSM single store. - implemented (slow FSM) * - ``PUSH r/m32`` - ``FF /6`` - **NP** - Two-access op (operand load + stack store); slow FSM, issues alone. - implemented (slow FSM, NP) * - ``POP r/m32`` - ``8F /0`` - **NP** - Stack load + store-to-EA (two memory accesses); slow FSM, issues alone. - implemented (slow FSM, NP) * - ``PUSHA / PUSHAD`` - ``60`` - **NP** - 8-beat micro-sequence (``S_USEQ``) off the latched original ESP; serializes. - implemented (microcoded) * - ``POPA / POPAD`` - ``61`` - **NP** - 8-beat ascending load micro-sequence (``S_USEQ``), ESP slot skipped; serializes. - implemented (microcoded) * - ``PUSHF / PUSHFD`` - ``9C`` - **NP** - Stores EFLAGS as the datum (serializing implicit source); slow-FSM single store. - implemented (slow FSM, NP) * - ``POPF / POPFD`` - ``9D`` - **NP** - Masked EFLAGS rewrite (may set TF/IF/DF); slow-FSM load + flag write. - implemented (slow FSM, NP) * - ``LEAVE`` - ``C9`` (66h → 16-bit) - **NP** - Fused ESP←EBP then POP EBP (EBP-based load); slow FSM, issues alone. - implemented (slow FSM, NP) * - ``ENTER imm16, imm8`` - ``C8 iw ib`` - **NP** - Microcoded frame-build — *not decoded*; ``d_unknown`` → HALT. - deferred / HALTs * - ``PUSH sreg`` (ES/CS/SS/DS/FS/GS) - ``06 0E 16 1E``; ``0F A0/A8`` - **NP** - Segment-selector push — *not decoded*; ``d_unknown`` → HALT. - deferred / HALTs * - ``POP sreg`` (ES/SS/DS/FS/GS) - ``07 17 1F``; ``0F A1/A9`` - **NP** - Descriptor-reloading segment pop — *not decoded*; ``d_unknown`` → HALT. 
- deferred / HALTs

PUSH / POP register and immediate (``50+r``, ``58+r``, ``68``/``6A``)
    ``PUSH`` pre-decrements ``ESP`` by the operand width then stores the
    source to ``[SS:ESP]``; ``POP`` loads from ``[SS:ESP]`` into the
    destination then post-increments ``ESP`` (a ``POP`` into ``ESP`` itself
    suppresses the ``+w``). ``PUSH imm`` stores the decode-latched immediate
    (``6A`` sign-extends its imm8). **Datapath:** the AGU forms the
    descending store address ``gpr[ESP] - w`` (pre-decrement adder in D2) or
    the ascending load address ``gpr[ESP]``, with the ``SS`` base applied;
    ``ESP`` is rewritten at WB. These are **UV by AP-500 class** (``ESP``
    masked from contention), but because ``decode.sv`` does not decode them,
    ``simple`` stays 0 and ``S_PIPE`` hands them to the slow FSM
    (``S_DECODE → S_EXEC → S_STORE``, or ``→ S_LOAD → S_EXEC``), a single
    ~1-cycle memory access that **serializes**. So they are pairable by class
    but do not emergently dual-issue in this RTL.

PUSH / POP r/m (``FF /6``, ``8F /0``)
    Both **NP**. ``PUSH r/m32`` for a *memory* operand needs *two* memory
    accesses (load ``[EA]`` then store to ``[SS:ESP]``), which exceeds the
    single-cycle template, so it serializes alone in U (the register form
    routes through the same NP slow path). ``POP r/m32`` likewise pops the
    stack word *and* stores it to a memory destination (load-from-stack +
    store-to-EA), a two-memory-access op. Both run multi-cycle on the slow
    FSM.

PUSHA / POPA (``60``, ``61``)
    **NP / microcoded.** ``PUSHA`` pushes all eight GPRs in order (``EAX``,
    ``ECX``, ``EDX``, ``EBX``, original ``ESP``, ``EBP``, ``ESI``, ``EDI``)
    as 8 sequential stores over the ``S_USEQ`` micro-sequencer; the
    *original* ``ESP`` is latched into ``pusha_esp`` on entry so every
    descending address (and the pushed ``ESP`` slot) is computed off the
    fixed value, and at step 7 it commits ``ESP <= original - 32``. ``POPA``
    is the inverse: 8 ascending loads into ``EDI``, ``ESI``, ``EBP``, (skip
    ``ESP``), ``EBX``, ``EDX``, ``ECX``, ``EAX``, then ``ESP += 32``.
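The ``PUSHA`` / ``POPA`` ordering and the role of the latched original ``ESP`` can be checked with a small Python model. This is a sketch assuming 32-bit operand size and a dictionary standing in for memory; the names ``pusha``, ``popa``, ``regs``, and ``mem`` are hypothetical, and the per-beat ``mem_ack`` handshake of the ``S_USEQ`` micro-sequencer is not modelled.

```python
GPR_ORDER = ["EAX", "ECX", "EDX", "EBX", "ESP", "EBP", "ESI", "EDI"]
MASK32 = 0xFFFFFFFF


def pusha(regs: dict, mem: dict) -> None:
    """Eight descending stores, every address computed off the latched ESP."""
    base = regs["ESP"]  # latched on entry (the role pusha_esp plays in the text)
    for step, name in enumerate(GPR_ORDER):
        addr = (base - 4 * (step + 1)) & MASK32
        # The ESP slot receives the original ESP, not a partially updated one.
        mem[addr] = base if name == "ESP" else regs[name]
    regs["ESP"] = (base - 32) & MASK32  # single commit at the last step


def popa(regs: dict, mem: dict) -> None:
    """Eight ascending slots; the ESP slot is skipped, then ESP += 32."""
    base = regs["ESP"]
    for step, name in enumerate(reversed(GPR_ORDER)):  # EDI first, EAX last
        if name != "ESP":
            regs[name] = mem[(base + 4 * step) & MASK32]
    regs["ESP"] = (base + 32) & MASK32
```

Running ``pusha`` and then ``popa`` on the same ``regs`` / ``mem`` pair restores every register, including ``ESP``, which is a quick sanity check of the slot ordering.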
Each of ``PUSHA`` and ``POPA`` takes ~8+ cycles, gated on ``mem_ack``, and holds the in-order pipe for the whole run.

PUSHF / POPF (``9C``, ``9D``)
    Both **NP**. ``PUSHF`` reads the architectural ``EFLAGS`` as its store
    datum (a serializing implicit source: the V pipe has no ``EFLAGS``
    forwarding for this) and writes it to ``[SS:ESP-w]``. ``POPF`` loads a
    word and writes it into ``EFLAGS`` under the user-mode writability mask
    (``0x00244DD5``, that is ``CF|PF|AF|ZF|SF|TF|DF|OF|NT|AC|ID``;
    ``IF``/``IOPL``/``VM``/``RF`` preserved, bit 1 forced 1), then
    ``ESP += w``. Because ``POPF`` can set ``TF``, the pipe carries the
    issue-time flags for the step-trap decision.

LEAVE / ENTER (``C9``, ``C8``)
    ``LEAVE`` is **NP**: a fused two-step frame teardown (``ESP <- EBP`` then
    ``POP EBP``) with an ``EBP``-based load (``SM_LEAVE``: read ``[EBP]``,
    write ``EBP`` and ``ESP <= old-EBP + slot``), serializing on the slow FSM
    (both 32-bit and 66h 16-bit forms). ``ENTER`` is **NP** *and not
    implemented*: opcode ``0xC8`` has no decode arm, so it hits
    ``d_unknown`` and HALTs loudly; there is no frame-build / display-copy
    micro-sequence.

PUSH / POP segment registers (``06/0E/16/1E``, ``07/17/1F``, ``0F A0/A1/A8/A9``)
    **NP** by class and *unimplemented*: none of these one-byte or
    0F-prefixed forms have a decode arm, so they resolve to ``d_unknown`` and
    HALT. A segment-register push is a special-source store the P5 does not
    pair, and a segment *pop* triggers a descriptor reload (a microcoded
    system action); neither datapath exists for these opcodes here.

.. note::

   ``LAHF`` / ``SAHF`` (``9F``/``9E``) share the ``K_STKMISC`` micro-op group
   and decode path with the stack ops, but touch **no** memory and do **not**
   move ``ESP``. They are **NP** flag↔AH transfers
   (``LAHF: AH <- EFLAGS[7:0]``; ``SAHF`` rebuilds the five status flags from
   ``AH``), implemented on the slow FSM. They are documented in full under
   :ref:`data-mov`.

.. 
_shift-bit: SHIFT-BIT — shifts, rotates, and bit operations =============================================== The shift group is the home of the **PU** class. The immediate-count ``SHL``/``SHR``/``SAR`` *register* forms (``C1 /4-/7``) are fast-pathed but **U-pipe only**: the shifter and its CF/OF latch live on the U-pipe EX datapath, and the V ALU has neither, so a shift may *lead* a pair but never *fill* V. The by-1 (``D0``/``D1``) and by-CL (``D2``/``D3``) forms, the rotates, the double-precision ``SHLD``/``SHRD``, the bit-test family, ``BSF``/``BSR``, ``BSWAP``, and ``SETcc`` are all slow-FSM-only (and the 0F-prefixed ones are NP by class), so they serialize. .. list-table:: :header-rows: 1 :widths: 24 24 14 26 12 * - Mnemonic - Encoding - U/V class - Datapath usage - Status * - ``SHL/SAL r/m, imm8`` - ``C1 /4`` (``/6`` alias) - **PU** - Reg-form fast-pathed **U-only** single-cycle shifter; mem-form slow ``K_SHIFT`` RMW. - implemented * - ``SHR r/m, imm8`` - ``C1 /5`` - **PU** - Reg-form fast-pathed U-only; mem-form slow RMW. - implemented * - ``SAR r/m, imm8`` - ``C1 /7`` - **PU** - Reg-form fast-pathed U-only (sign-replicating ``>>>``); mem-form slow RMW. - implemented * - ``SHL/SHR/SAR/SAL r/m, 1`` - ``D0 /4-/7``; ``D1 /4-/7`` - PU (AP-500) - *Not* fast-pathed → slow ``K_SHIFT`` (``shift_one``); serializes (NP-effective). - implemented (slow FSM) * - ``SHL/SHR/SAR/SAL r/m, CL`` - ``D2 /4-/7``; ``D3 /4-/7`` - **NP** - Implicit CL read + variable latency; slow ``K_SHIFT``, issues alone. - implemented (slow FSM, NP) * - ``ROL / ROR r/m, count`` - ``C0/C1 /0,/1``; ``D0-D3 /0,/1`` - PU/NP (AP-500) - Intentionally **not** fast-pathed (richer OF); all forms slow ``K_SHIFT``. - implemented (slow FSM) * - ``RCL / RCR r/m, count`` - ``C0/C1 /2,/3``; ``D0-D3 /2,/3`` - PU/NP (AP-500) - Carry-through rotate; slow ``K_SHIFT``, seeds ``cfin=EFLAGS[0]``; serializes. 
- implemented (slow FSM) * - ``SHLD r/m, r, imm8/CL`` - ``0F A4 /r ib``; ``0F A5 /r`` - **NP** - Double-precision two-source shift; slow ``K_SHLDRD``; reg-dst only (mem → HALT). - implemented (reg dst) * - ``SHRD r/m, r, imm8/CL`` - ``0F AC /r ib``; ``0F AD /r`` - **NP** - Double-precision right; slow ``K_SHLDRD``; reg-dst only (mem → HALT). - implemented (reg dst) * - ``BT r/m, r/imm8`` - ``0F A3 /r``; ``0F BA /4 ib`` - **NP** - Bit test → CF, no write; slow ``K_BITTEST``; reg-direct only (mem → HALT). - implemented (reg dst) * - ``BTS r/m, r/imm8`` - ``0F AB /r``; ``0F BA /5 ib`` - **NP** - Test-and-set; slow ``K_BITTEST`` (``cur | 1<