From bf8397b08d382f8460fed334d51a26fe64e4563b Mon Sep 17 00:00:00 2001
From: tevador <tevador@gmail.com>
Date: Mon, 31 Dec 2018 19:27:31 +0100
Subject: [PATCH] Updated specification

---
 doc/isa.md | 180 +++++++++++++++++++++++++++++------------------------
 doc/vm.md  |   9 +--
 2 files changed, 103 insertions(+), 86 deletions(-)

diff --git a/doc/isa.md b/doc/isa.md
index b5841f8..0c0ab7b 100644
--- a/doc/isa.md
+++ b/doc/isa.md
@@ -1,5 +1,4 @@
 
-
 ## RandomX instruction set
 RandomX uses a simple low-level language (instruction set), which was designed so that any random bitstring forms a valid program.
 
@@ -10,16 +9,19 @@ Each RandomX instruction has a length of 128 bits. The encoding is following:
 *All flags are aligned to an 8-bit boundary for easier decoding.*
 
 #### Opcode
-There are 256 opcodes, which are distributed between various operations depending on their weight (how often they will occur in the program on average). The distribution of opcodes is following:
+There are 256 opcodes, which are distributed between 30 instructions based on their weight (how often they will occur in the program on average). Instructions are divided into 5 groups:
 
-|operation|number of opcodes||
-|---------|-----------------|----|
-|ALU operations|136|53.1%|
-|FPU operations|78|30.5%|
-|Control flow |42|16.4%|
+|group|number of opcodes||comment|
+|---------|-----------------|----|------|
+|IA|115|44.9%|integer arithmetic operations
+|IS|21|8.2%|bitwise shift and rotate
+|FA|70|27.4%|floating point arithmetic operations
+|FS|8|3.1%|floating point single-input operations
+|CF|42|16.4%|control flow instructions (branches)
+||**256**|**100%**
 
 #### Operand A
-The first operand is read from memory. The location is determined by the `loc(a)` flag:
+The first 64-bit operand is read from memory. The location is determined by the `loc(a)` flag:
 
 |loc(a)[2:0]|read A from|address size (W)
 |---------|-|-|
@@ -40,45 +42,56 @@ read_addr = reg(a)[W-1:0]
 `W` is the address width from the above table. For reading from the scratchpad, `read_addr` is multiplied by 8 for 8-byte aligned access.
 
 #### Operand B
-The second operand is loaded either from a register or from an immediate value encoded within the instruction. The `reg(b)` flag encodes an integer register (ALU operations) or a floating point register (FPU operations).
+The second operand is loaded either from a register or from an immediate value encoded within the instruction. The `reg(b)` flag encodes an integer register (instruction groups IA and IS) or a floating point register (instruction group FA). Instruction group FS doesn't use operand B.
 
-|loc(b)[2:0]|read B from|
-|---------|-|
-|000|register `reg(b)`|
-|001|register `reg(b)`|
-|010|register `reg(b)`|
-|011|register `reg(b)`|
-|100|register `reg(b)`|
-|101|register `reg(b)`|
-|110|`imm8` or `imm32`|
-|111|`imm8` or `imm32`|
+|loc(b)[2:0]|B (IA)|B (IS)|B (FA)|B (FS)
+|---------|-|-|-|-|
+|000|integer `reg(b)`|integer `reg(b)`|floating point `reg(b)`|-
+|001|integer `reg(b)`|integer `reg(b)`|floating point `reg(b)`|-
+|010|integer `reg(b)`|integer `reg(b)`|floating point `reg(b)`|-
+|011|integer `reg(b)`|integer `reg(b)`|floating point `reg(b)`|-
+|100|integer `reg(b)`|`imm8`|floating point `reg(b)`|-
+|101|integer `reg(b)`|`imm8`|floating point `reg(b)`|-
+|110|`imm32`|`imm8`|floating point `reg(b)`|-
+|111|`imm32`|`imm8`|floating point `reg(b)`|-
 
-`imm8` is an 8-bit immediate value, which is used for shift and rotate ALU operations.
+`imm8` is an 8-bit immediate value, which is used for shift and rotate integer instructions (group IS). Only bits 0-5 are used.
 
-`imm32` is a 32-bit immediate value which is used for most operations. For operands larger than 32 bits, the value is sign-extended. For FPU instructions, the value is considered a signed 32-bit integer and then converted to a double precision floating point format.
+`imm32` is a 32-bit immediate value which is used for integer instructions from group IA.
+
+Floating point instructions don't use immediate values.
 
 #### Operand C
-The third operand is the location where the result is stored.
+The third operand is the location where the result is stored. It can be a register or a 64-bit scratchpad location, depending on the value of flag `loc(c)`.
 
-|loc\(c\)[2:0]|write C to|address size (W)
-|---------|-|-|
-|000|scratchpad|15 bits|
-|001|scratchpad|11 bits|
-|010|scratchpad|11 bits|
-|011|scratchpad|11 bits|
-|100|register `reg(c)`|-|
-|101|register `reg(c)`|-|
-|110|register `reg(c)`|-|
-|111|register `reg(c)`|-|
+|loc\(c\)[2:0]|address size (W)| C (IA, IS)|C (FA, FS)
+|---------|-|-|-|-|-|
+|000|15 bits|scratchpad|floating point `reg(c)`
+|001|11 bits|scratchpad|floating point `reg(c)`
+|010|11 bits|scratchpad|floating point `reg(c)`
+|011|11 bits|scratchpad|floating point `reg(c)`
+|100|15 bits|integer `reg(c)`|floating point `reg(c)`, scratchpad
+|101|11 bits|integer `reg(c)`|floating point `reg(c)`, scratchpad
+|110|11 bits|integer `reg(c)`|floating point `reg(c)`, scratchpad
+|111|11 bits|integer `reg(c)`|floating point `reg(c)`, scratchpad
 
-The `reg(c)` flag encodes an integer register (ALU operations) or a floating point register (FPU operations).  For writing to the scratchpad, an integer register is always used and the write address is calculated as:
+Integer operations write either to the scratchpad or to a register. Floating point operations always write to a register and can also write to the scratchpad. In that case, bit 3 of the `loc(c)` flag determines if the low or high half of the register is written:
+
+|loc\(c\)[3]|write to scratchpad|
+|------------|-----------------------|
+|0|floating point `reg(c)[63:0]`
+|1|floating point `reg(c)[127:64]`
+
+The FPROUND instruction is an exception and always writes the low half of the register.
+
+For writing to the scratchpad, an integer register is always used to calculate the address:
 ```
 write_addr = 8 * (addr(c) XOR reg(c)[31:0])[W-1:0]
 ```
-*CPUs are typically designed for a 2:1 load:store ratio, so each VM instruction performs on average 1 memory read and 0.5 write to memory.*
+*CPUs are typically designed for a 2:1 load:store ratio, so each VM instruction performs on average 1 memory read and 0.5 writes to memory.*
 
 #### imm8
-An 8-bit immediate value that is used as the shift/rotate count by some ALU instructions and as the jump offset of the CALL instruction.
+An 8-bit immediate value that is used as the shift/rotate count by group IS instructions and as the jump offset of the CALL instruction.
 
 #### addr(a)
 A 32-bit address mask that is used to calculate the read address for the A operand. It's sign-extended to 64 bits.
@@ -88,33 +101,33 @@ A 32-bit address mask that is used to calculate the write address for the C oper
 
 ### ALU instructions
 
-|weight|instruction|signed|A width|B width|C|C width|
-|-|-|-|-|-|-|-|
-|10|ADD_64|no|64|64|A + B|64|
-|2|ADD_32|no|32|32|A + B|32|
-|10|SUB_64|no|64|64|A - B|64|
-|2|SUB_32|no|32|32|A - B|32|
-|21|MUL_64|no|64|64|A * B|64|
-|10|MULH_64|no|64|64|A * B|64|
-|15|MUL_32|no|32|32|A * B|64|
-|15|IMUL_32|yes|32|32|A * B|64|
-|10|IMULH_64|yes|64|64|A * B|64|
-|1|DIV_64|no|64|32|A / B|32|
-|1|IDIV_64|yes|64|32|A / B|32|
-|4|AND_64|no|64|64|A & B|64|
-|2|AND_32|no|32|32|A & B|32|
-|4|OR_64|no|64|64|A &#124; B|64|
-|2|OR_32|no|32|32|A &#124; B|32|
-|4|XOR_64|no|64|64|A ^ B|64|
-|2|XOR_32|no|32|32|A ^ B|32|
-|3|SHL_64|no|64|6|A << B|64|
-|3|SHR_64|no|64|6|A >> B|64|
-|3|SAR_64|yes|64|6|A >> B|64|
-|6|ROL_64|no|64|6|A <<< B|64|
-|6|ROR_64|no|64|6|A >>> B|64|
+|weight|instruction|group|signed|A width|B width|C|C width|
+|-|-|-|-|-|-|-|-|
+|10|ADD_64|IA|no|64|64|`A + B`|64|
+|2|ADD_32|IA|no|32|32|`A + B`|32|
+|10|SUB_64|IA|no|64|64|`A - B`|64|
+|2|SUB_32|IA|no|32|32|`A - B`|32|
+|21|MUL_64|IA|no|64|64|`A * B`|64|
+|10|MULH_64|IA|no|64|64|`A * B`|64|
+|15|MUL_32|IA|no|32|32|`A * B`|64|
+|15|IMUL_32|IA|yes|32|32|`A * B`|64|
+|10|IMULH_64|IA|yes|64|64|`A * B`|64|
+|1|DIV_64|IA|no|64|32|`A / B`|32|
+|1|IDIV_64|IA|yes|64|32|`A / B`|32|
+|4|AND_64|IA|no|64|64|`A & B`|64|
+|2|AND_32|IA|no|32|32|`A & B`|32|
+|4|OR_64|IA|no|64|64|`A | B`|64|
+|2|OR_32|IA|no|32|32|`A | B`|32|
+|4|XOR_64|IA|no|64|64|`A ^ B`|64|
+|2|XOR_32|IA|no|32|32|`A ^ B`|32|
+|3|SHL_64|IS|no|64|6|`A << B`|64|
+|3|SHR_64|IS|no|64|6|`A >> B`|64|
+|3|SAR_64|IS|yes|64|6|`A >> B`|64|
+|6|ROL_64|IS|no|64|6|`A <<< B`|64|
+|6|ROR_64|IS|no|64|6|`A >>> B`|64|
 
 ##### 32-bit operations
-Instructions ADD_32, SUB_32, AND_32, OR_32, XOR_32 only use the low-order 32 bits of the input operands. The result of these operations is 32 bits long and bits 32-63 of C are zero.
+Instructions ADD_32, SUB_32, AND_32, OR_32, XOR_32 only use the low-order 32 bits of the input operands. The result of these operations is 32 bits long and bits 32-63 of C are set to zero.
 
 ##### Multiplication
 There are 5 different multiplication operations. MUL_64 and MULH_64 both take 64-bit unsigned operands, but MUL_64 produces the low 64 bits of the result and MULH_64 produces the high 64 bits. MUL_32 and IMUL_32 use only the low-order 32 bits of the operands and produce a 64-bit result. The signed variant interprets the arguments as signed integers. IMULH_64 takes two 64-bit signed operands and produces the high-order 64 bits of the result.
@@ -129,24 +142,27 @@ The shift/rotate instructions use just the bottom 6 bits of the `B` operand (`im
 
 ### FPU instructions
 
-|weight|instruction|conversion method|C|
+|weight|instruction|group|C|
 |-|-|-|-|
-|20|FPADD|`convertSigned52`|A + B|
-|20|FPSUB|`convertSigned52`|A - B|
-|22|FPMUL|`convertSigned51`|A * B|
-|8|FPDIV|`convertSigned51`|A / B|
-|6|FPSQRT|`convert52`|sqrt(A)|
-|2|FPROUND|`convertSigned52`|A|
+|20|FPADD|FA|`A + B`|
+|20|FPSUB|FA|`A - B`|
+|22|FPMUL|FA|`A * B`|
+|8|FPDIV|FA|`A / B`|
+|6|FPSQRT|FS|`sqrt(abs(A))`|
+|2|FPROUND|FS|`convertSigned52(A)`|
 
-#### Rounding
-FPU instructions conform to the IEEE-754 specification, so they must give correctly rounded results. Initial rounding mode is *roundTiesToEven*. Rounding mode can be changed by the `FPROUND` instruction. Denormal values are not be produced by any operation.
+All floating point instructions apart FPROUND are vector instructions that operate on two packed double precision floating point values.
 
 #### Conversion of operand A
-Operand A is loaded from memory as a 64-bit signed integer and then converted to a double-precision floating point format using one of the following 3 methods:
+Operand A is loaded from memory as a 64-bit value. All floating point instructions apart FPROUND interpret A as two packed 32-bit signed integers and convert them into two packed double precision floating point values.
 
-* *convertSigned52* - Clears the 11 least-significant bits before conversion. This is done so the number fits exactly into the 52-bit mantissa without rounding.
-* *convertSigned51* - Clears the 11 least-significant bits and sets the 12th bit before conversion. This is done so the number fits exactly into the 52-bit mantissa without rounding and avoids 0.
-* *convert52* - Clears the 11 least-significant bits and the sign bit before conversion. This is done so the number fits exactly into the 52-bit mantissa without rounding and avoids negative values.
+The FPROUND instruction has a scalar output and interprets A as a 64-bit signed integer. The 11 least-significant bits are cleared before conversion to a double precision format. This is done so the number fits exactly into the 52-bit mantissa without rounding. Output of FPROUND is always written into the lower half of the result register and only this lower half may be written into the scratchpad.
+
+#### Rounding
+FPU instructions conform to the IEEE-754 specification, so they must give correctly rounded results. Initial rounding mode is *roundTiesToEven*. Rounding mode can be changed by the `FPROUND` instruction. Denormal values must be flushed to zero.
+
+#### NaN
+If an operation produces NaN, the result is converted into positive zero. NaN results may never be written into registers or memory. Only division and multiplication must be checked for NaN results (`0.0 / 0.0` and `0.0 * Infinity` result in NaN).
 
 ##### FPROUND
 The FPROUND instruction changes the rounding mode for all subsequent FPU operations depending on the two least-significant bits of A.
@@ -165,12 +181,15 @@ The rounding modes are defined by the IEEE-754 standard.
 ### Control instructions
 The following 2 control instructions are supported:
 
-|weight|instruction|function|
-|-|-|-|
-|24|CALL|near procedure call|
-|18|RET|return from procedure|
+|weight|instruction|function|condition|
+|-|-|-|-|
+|20|CALL|near procedure call|(see condition table below)
+|22|RET|return from procedure|stack is not empty
 
-Both instructions are conditional. The condition function takes the lower 32 bits of integer register `reg(b)` and the value `imm32` and evaluates a condition based on the `loc(b)` flag: 
+Both instructions are conditional. If the condition evaluates to `false`, CALL and RET behave as "arithmetic no-op" and simply copy operand A into destination C without jumping.
+
+##### CALL
+The CALL instruction uses a condition function, which takes the lower 32 bits of integer register `reg(b)` and the value `imm32` and evaluates a condition based on the `loc(b)` flag: 
 
 |loc(b)[2:0]|signed|jump condition|probability|*x86*|*ARM*
 |---|---|----------|-----|--|----|
@@ -185,13 +204,10 @@ Both instructions are conditional. The condition function takes the lower 32 bit
 
 The 'signed' column specifies if the operands are interpreted as signed or unsigned 32-bit numbers. Column 'probability' lists the expected jump probability (range means that the actual value for a specific instruction depends on `imm32`). *Columns 'x86' and 'ARM' list the corresponding hardware instructions (following a `CMP` instruction).*
 
-In case the branch is not taken, both CALL and RET become "arithmetic no-op" `C = A`.
-
-##### CALL
 Taken CALL instruction pushes the values `A` and `pc` (program counter) onto the stack and then performs a forward jump relative to the value of `pc`. The forward offset is equal to `16 * (imm8[6:0] + 1)`. Maximum jump distance is therefore 128 instructions forward (this means that at least 4 correctly spaced CALL instructions are needed to form a loop in the program).
 
 ##### RET
-The RET instruction behaves like "not taken" when the stack is empty. Taken RET instruction pops the return address `raddr` from the stack (it's the instruction following the previous CALL), then pops a return value `retval` from the stack and sets `C = A XOR retval`. Finally, the instruction jumps back to `raddr`.
+The RET instruction is taken only if the stack is not empty. Taken RET instruction pops the return address `raddr` from the stack (it's the instruction following the previous CALL), then pops a return value `retval` from the stack and sets `C = A XOR retval`. Finally, the instruction jumps back to `raddr`.
 
 ## Reference implementation
 A portable C++ implementation of all ALU and FPU instructions is available in [instructionsPortable.cpp](../src/instructionsPortable.cpp).
\ No newline at end of file
diff --git a/doc/vm.md b/doc/vm.md
index 24b324c..92ee2da 100644
--- a/doc/vm.md
+++ b/doc/vm.md
@@ -1,4 +1,5 @@
 
+
 ## RandomX virtual machine
 RandomX is intended to be run efficiently on a general-purpose CPU. The virtual machine (VM) which runs RandomX code attempts to simulate a generic CPU using the following set of components:
 
@@ -37,10 +38,10 @@ The control unit (CU) controls the execution of the program. It reads instructio
 #### Stack
 To simulate function calls, the VM uses a stack structure. The program interacts with the stack using the CALL and RET instructions. The stack has unlimited size and each stack element is 64 bits wide.
 
-*Although there is no explicit limit of the stack size, the maximum theoretical size of the stack is 16 MiB for a program that contains only unconditional CALL instructions (the probability of randomly generating such program is about 5×10<sup>-912</sup>). In reality, the stack size will rarely exceed 1 MiB.*
+*Although there is no explicit limit of the stack size, the maximum theoretical size of the stack is 16 MiB. Most programs will use around 4 KiB of stack.*
 
 #### Register file
-The VM has 8 integer registers `r0`-`r7` and 8 floating point registers `f0`-`f7`. All registers are 64 bits wide.
+The VM has 8 integer registers `r0`-`r7`  and 8 floating point registers `f0`-`f7`. The integer registers are 64 bits wide. The floating point registers are 128 bits wide and each stores two packed double precision numbers.
 
 *The number of registers is low enough so that they can be stored in actual hardware registers on most CPUs.*
 
@@ -48,7 +49,7 @@ The VM has 8 integer registers `r0`-`r7` and 8 floating point registers `f0`-`f7
 The arithmetic logic unit (ALU) performs integer operations. The ALU can perform binary integer operations from 7 groups (addition, subtraction, multiplication, division, bitwise operations, shift, rotation) with operand sizes of 64 or 32 bits.
 
 #### FPU
-The floating-point unit performs IEEE-754 compliant math using 64-bit double precision floating point numbers. Five basic operations are available: addition, subtraction, multiplication, division and square root.
+The floating-point unit performs IEEE-754 compliant math using 64-bit double precision floating point numbers. Five basic operations are available: addition, subtraction, multiplication, division and square root. All operations work with two packed double precision numbers.
 
 #### Binary encoding
-The VM stores and loads all data in little-endian byte order. Signed numbers are represented using two's complement.
\ No newline at end of file
+The VM stores and loads all data in little-endian byte order. Signed integer numbers are represented using two's complement.
\ No newline at end of file