Interpreter with bytecode

Fixed some undefined behavior with signed types
Fixed different results on big endian systems
Removed unused code files
Restored FNEG_R instructions
Updated documentation
2024-08-15 00:23:14 +00:00 · 2019-02-09 15:45:26 +01:00
commit 32d827d0a6
parent a586751f6b
41 changed files with 1517 additions and 3621 deletions
--- a/doc/dataset.md
+++ b/doc/dataset.md
@@ -1,13 +1,13 @@

 ## Dataset

-The dataset serves as the source of the first operand of all instructions and provides the memory-hardness of RandomX. The size of the dataset is fixed at 4 GiB and it's divided into 65536 blocks, each 64 KiB in size.
+The dataset is randomly accessed 16384 times during each hash calculation, which significantly increases the memory-hardness of RandomX. The size of the dataset is fixed at 4 GiB and it's divided into 67108864 blocks of 64 bytes each.

-In order to allow PoW verification with less than 4 GiB of memory, the dataset is constructed from a 64 MiB cache, which can be used to calculate dataset blocks on the fly. To facilitate this, all random reads from the dataset are aligned to the beginning of a block.
+In order to allow PoW verification with less than 4 GiB of memory, the dataset is constructed from a 256 MiB cache, which can be used to calculate dataset rows on the fly.

-Because the initialization of the dataset is computationally intensive, it's recalculated on average every 1024 blocks (~34 hours). The following figure visualizes the construction of the dataset:
+Because the initialization of the dataset is computationally intensive, it is recalculated only every 1024 blocks (~34 hours). The following figure visualizes the construction of the dataset:

-![Imgur](https://i.imgur.com/JgLCjeq.png)
+![Imgur](https://i.imgur.com/b9WHOwo.png)

 ### Seed block
 The whole dataset is constructed from a 256-bit hash of the last block whose height is divisible by 1024 **and** has at least 64 confirmations.
@@ -21,7 +21,7 @@ The whole dataset is constructed from a 256-bit hash of the last block whose hei

 ### Cache construction

-The 32-byte seed block hash is expanded into the 64 MiB cache using the "memory fill" function of Argon2d. [Argon2](https://github.com/P-H-C/phc-winner-argon2) is a memory-hard password hashing function, which is highly customizable. The variant with "d" suffix uses a data-dependent memory access pattern and provides the highest resistance against time-memory tradeoffs.
+The 32-byte seed block hash is expanded into the 256 MiB cache using the "memory fill" function of Argon2d. [Argon2](https://github.com/P-H-C/phc-winner-argon2) is a memory-hard password hashing function, which is highly customizable. The variant with "d" suffix uses a data-dependent memory access pattern and provides the highest resistance against time-memory tradeoffs.

 Argon2 is used with the following parameters:

@@ -29,8 +29,8 @@ Argon2 is used with the following parameters:
 |------------|--|
 |parallelism|1|
 |output size|0|
-|memory|65536 (64 MiB)|
-|iterations|12|
+|memory|262144 (256 MiB)|
+|iterations|3|
 |version|`0x13`|
 |hash type|0 (Argon2d)
 |password|seed block hash (32 bytes)
@@ -40,43 +40,66 @@ Argon2 is used with the following parameters:

 The finalizer and output calculation steps of Argon2 are omitted. The output is the filled memory array.

-The use of 12 iterations makes time-memory tradeoffs infeasible and thus 64 MiB is the minimum amount of memory required by RandomX.
-
-When the memory fill is complete, the whole memory array is cyclically shifted backwards by 512 bytes (i.e. bytes 0-511 are moved to the end of the array). This is done to misalign the array so that each 1024-byte cache block spans two subsequent Argon2 blocks.
+The use of 3 iterations makes time-memory tradeoffs infeasible and thus 256 MiB is the minimum amount of memory required by RandomX.

 ### Dataset block generation
-The full 4 GiB dataset can be generated from the 64 MiB cache. Each block is generated separately: a 1024 byte block of the cache is expanded into 64 KiB of the dataset. The algorithm has 3 steps: expansion, AES and shuffle.
+The full 4 GiB dataset can be generated from the 256 MiB cache. Each 64-byte block is generated independently by XORing 16 pseudorandom Cache blocks selected by the `SquareHash` function.

-#### Expansion
-The 1024 cache bytes are split into 128 quadwords and interleaved with 504-byte chunks of null bytes. The resulting sequence is: 8 cache bytes + 504 null bytes + 8 cache bytes + 504 null bytes etc. Total length of the expanded block is 65536 bytes.
+#### SquareHash
+`SquareHash` is a custom hash function with 64-bit input and 64-bit output. It is calculated by repeatedly squaring the input, splitting the 128-bit result into two 64-bit halves and subtracting the high half from the low half. This is repeated 42 times. It's available as a [portable C implementation](../src/squareHash.h) and [x86-64 assembly version](../src/asm/squareHash.inc).

-#### AES
-The 256-bit seed block hash is expanded into 10 AES round keys `k0`-`k9`. Let `i = 0...65535` be the index of the block that is being expanded. If `i` is an even number, this step uses AES *decryption* and if `i` is an odd number, it uses AES *encryption*.  Since both encryption and decryption scramble random data, no distinction is made between them in the text below.
+Properties of `SquareHash`:

-The AES encryption is performed with 10 identical rounds using round keys `k0`-`k9`. Note that this is different from the typical AES procedure, which uses a different key schedule for decryption and a modified last round.
+* It achieves full [Avalanche effect](https://en.wikipedia.org/wiki/Avalanche_effect).
+* Since the whole calculation is a long dependency chain, which uses only multiplication and subtraction, the performance gains by using custom hardware are very limited.
+* A single `SquareHash` calculation takes 40-80 ns, which is about the same time as DRAM access latency. Devices using low-latency memory will be bottlenecked by `SquareHash`, while CPUs will finish the hash calculation in about the same time it takes to fetch data from RAM.

-Before the AES encryption is applied, each 16-byte chunk is XORed with the ciphertext of the previous chunk. This is similar to the [AES-CBC](https://en.wikipedia.org/wiki/Block_cipher_mode_of_operation#Cipher_Block_Chaining_%28CBC%29) mode of operation and forces the encryption to be sequential. For XORing the initial block, an initialization vector is formed by zero-extending `i` to 128 bits.
+The output of 16 chained SquareHash calculations is used to determine Cache blocks that are XORed together to produce a Dataset block:

-#### Shuffle
-When the AES step is complete, the last 16-byte chunk of the block is used to initialize a PCG32 random number generator. Bits 0-63 are used as the initial state and bits 64-127 are used as the increment. The least-significant bit of the increment is always set to 1 to form an odd number.
+```c++
+void initBlock(const uint8_t* cache, uint8_t* out, uint32_t blockNumber) {
+  uint64_t r0, r1, r2, r3, r4, r5, r6, r7;

-The whole block is then divided into 16384 doublewords (4 bytes) and the [Fisher–Yates shuffle](https://en.wikipedia.org/wiki/Fisher%E2%80%93Yates_shuffle) algorithm is applied to it. The algorithm generates a random in-place permutation of the 16384 doublewords. The result of the shuffle is the `i`-th block of the dataset.
+  r0 = 4ULL * blockNumber;
+  r1 = r2 = r3 = r4 = r5 = r6 = r7 = 0;

-The shuffle algorithm requires a uniform distribution of random numbers. The output of the PCG32 generator is always properly filtered to avoid the [modulo bias](https://en.wikipedia.org/wiki/Fisher%E2%80%93Yates_shuffle#Modulo_bias).
+  constexpr uint32_t mask = (CacheSize - 1) & CacheLineAlignMask;
+
+  for (auto i = 0; i < DatasetIterations; ++i) {
+    const uint8_t* mixBlock = cache + (r0 & mask);
+    PREFETCHNTA(mixBlock);
+    r0 = squareHash(r0);
+    r0 ^= load64(mixBlock + 0);
+    r1 ^= load64(mixBlock + 8);
+    r2 ^= load64(mixBlock + 16);
+    r3 ^= load64(mixBlock + 24);
+    r4 ^= load64(mixBlock + 32);
+    r5 ^= load64(mixBlock + 40);
+    r6 ^= load64(mixBlock + 48);
+    r7 ^= load64(mixBlock + 56);
+  }
+
+  store64(out + 0, r0);
+  store64(out + 8, r1);
+  store64(out + 16, r2);
+  store64(out + 24, r3);
+  store64(out + 32, r4);
+  store64(out + 40, r5);
+  store64(out + 48, r6);
+  store64(out + 56, r7);
+}
+```
+
+*Note: `SquareHash` doesn't calculate squaring modulo 2<sup>64</sup>+1 because the subtraction is performed modulo 2<sup>64</sup>. Squaring modulo 2<sup>64</sup>+1 can be calculated by adding the carry bit in every iteration (i.e. the sequence in x86-64 assembly would have to be: `mul rax; sub rax, rdx; adc rax, 0`), but this would decrease ASIC-resistance of `SquareHash`.*

 ### Performance
-The initial 64-MiB cache construction using Argon2d takes around 1 second using an older laptop with an Intel i5-3230M CPU (Ivy Bridge). Cache generation is strictly serial and cannot be easily parallelized.
+The initial 256-MiB cache construction using Argon2d takes around 1 second using an older laptop with an Intel i5-3230M CPU (Ivy Bridge). Cache generation is strictly serial and cannot be parallelized.

-Dataset generation performance depends on the support of the AES-NI instruction set. The following table lists the generation runtimes using the same Ivy Bridge laptop with a single thread:
+On the same laptop, full Dataset initialization takes around 100 seconds using a single thread (1.5 µs per block).

-|AES|4 GiB dataset generation|single block generation|
-|-----|-----------------------------|----------------|
-|hardware (AES-NI)|25 s|380 µs|
-|software|53 s|810 µs|
+While the generation of a single block is strictly serial, multiple blocks can be easily generated in parallel, so the Dataset generation time decreases linearly with the number of threads. Using an 8-core AMD Ryzen CPU, the whole dataset can be generated in under 10 seconds.

-While the generation of a single block is strictly serial, multiple blocks can be easily generated in parallel, so the dataset generation time decreases linearly with the number of threads. Using a recent 6-core CPU with AES-NI support, the whole dataset can be generated in about 4 seconds.
-
-Moreover, the seed block hash is known up to 64 blocks in advance, so miners can slowly precalculate the whole dataset by generating ~512 dataset blocks per minute (corresponds to less than 1% utilization of a single CPU core).
+Moreover, the seed block hash is known up to 64 blocks in advance, so miners can slowly precalculate the whole dataset by generating 524288 dataset blocks per minute (corresponds to about 1% utilization of a single CPU core).

 ### Light clients
-Light clients, who cannot or do not want to generate and keep the whole dataset in memory, can generate just the cache and then generate blocks on the fly as the program is being executed. In this case, the program execution time will be increased by roughly 100 times the single block generation time. For the Intel Ivy Bridge laptop, this amounts to around 40 milliseconds per program.
+Light clients, who cannot or do not want to generate and keep the whole dataset in memory, can generate just the cache and then generate blocks on the fly during hash calculation. In this case, the hash calculation time will be increased by 16384 times the single block generation time. For the Intel Ivy Bridge laptop, this amounts to around 24.5 milliseconds per hash.
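
The timing figures quoted in the new `doc/dataset.md` text can be cross-checked with a few lines of arithmetic. The sketch below is only an illustration and is not part of the commit. All constant and function names are hypothetical, and the 2-minute block interval is an assumption inferred from "1024 blocks (~34 hours)".

```c++
#include <cstdint>

// Figures quoted in the documentation text.
constexpr uint64_t kDatasetSize  = 4ULL << 30; // 4 GiB
constexpr uint64_t kBlockSize    = 64;         // bytes per Dataset block
constexpr uint64_t kReadsPerHash = 16384;      // random Dataset reads per hash
constexpr double   kBlockTimeUs  = 1.5;        // single-block time on the Ivy Bridge laptop

// Assumption: one chain block every ~2 minutes, so the 64 blocks of
// advance notice for the seed hash correspond to about 128 minutes.
constexpr uint64_t kAdvanceMinutes = 64 * 2;

// Number of 64-byte blocks in the 4 GiB Dataset.
constexpr uint64_t datasetBlocks() { return kDatasetSize / kBlockSize; }

// Single-threaded time to initialize the whole Dataset, in seconds.
constexpr double fullInitSeconds() { return datasetBlocks() * kBlockTimeUs / 1e6; }

// Extra time per hash for a light client that derives every block on the fly, in ms.
constexpr double lightOverheadMs() { return kReadsPerHash * kBlockTimeUs / 1e3; }

// Blocks per minute needed to finish precalculation within the advance window.
constexpr uint64_t blocksPerMinute() { return datasetBlocks() / kAdvanceMinutes; }

// Fraction of one CPU core consumed by that background precalculation.
constexpr double precalcUtilization() { return blocksPerMinute() * kBlockTimeUs / 60e6; }

static_assert(datasetBlocks() == 67108864, "4 GiB / 64 B");
static_assert(blocksPerMinute() == 524288, "matches the rate quoted in the text");
```

Under these assumptions the numbers come out at roughly 100.7 s for full initialization, 24.6 ms of light-client overhead per hash, and about 1.3% of one core for background precalculation, which agrees with the rounded values in the text.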
--- a/doc/isa-ops.md
+++ b/doc/isa-ops.md
@@ -1,130 +1,103 @@
-
 # RandomX instruction listing
-There are 31 unique instructions divided into 3 groups:
-
-|group|# operations|# opcodes||
-|---------|-----------------|----|-|
-|integer (IA)|22|144|56.3%|
-|floating point (FP)|5|76|29.7%|
-|control (CL)|4|36|14.0%
-||**31**|**256**|**100%**
-

 ## Integer instructions
-There are 22 integer instructions. They are divided into 3 classes (MATH, DIV, SHIFT) with different B operand selection rules.
+For integer instructions, the destination is always an integer register (register group R). The source operand (if applicable) can be either an integer register or a memory value. If `dst` and `src` refer to the same register, most instructions use `imm32` as the source operand instead of the register. This is indicated in the 'src == dst' column.

-|# opcodes|instruction|class|signed|A width|B width|C|C width|
+Memory operands are loaded as 8-byte values from the address indicated by `src`.  This indirect addressing is marked with square brackets: `[src]`.
+
+|frequency|instruction|dst|src|`src == dst ?`|operation|
 |-|-|-|-|-|-|-|-|
-|12|ADD_64|MATH|no|64|64|`A + B`|64|
-|2|ADD_32|MATH|no|32|32|`A + B`|32|
-|12|SUB_64|MATH|no|64|64|`A - B`|64|
-|2|SUB_32|MATH|no|32|32|`A - B`|32|
-|21|MUL_64|MATH|no|64|64|`A * B`|64|
-|10|MULH_64|MATH|no|64|64|`A * B`|64|
-|15|MUL_32|MATH|no|32|32|`A * B`|64|
-|15|IMUL_32|MATH|yes|32|32|`A * B`|64|
-|10|IMULH_64|MATH|yes|64|64|`A * B`|64|
-|4|DIV_64|DIV|no|64|32|`A / B`|64|
-|4|IDIV_64|DIV|yes|64|32|`A / B`|64|
-|4|AND_64|MATH|no|64|64|`A & B`|64|
-|2|AND_32|MATH|no|32|32|`A & B`|32|
-|4|OR_64|MATH|no|64|64|`A | B`|64|
-|2|OR_32|MATH|no|32|32|`A | B`|32|
-|4|XOR_64|MATH|no|64|64|`A ^ B`|64|
-|2|XOR_32|MATH|no|32|32|`A ^ B`|32|
-|3|SHL_64|SHIFT|no|64|6|`A << B`|64|
-|3|SHR_64|SHIFT|no|64|6|`A >> B`|64|
-|3|SAR_64|SHIFT|yes|64|6|`A >> B`|64|
-|6|ROL_64|SHIFT|no|64|6|`A <<< B`|64|
-|6|ROR_64|SHIFT|no|64|6|`A >>> B`|64|
+|12/256|IADD_R|R|R|`src = imm32`|`dst = dst + src`|
+|7/256|IADD_M|R|mem|`src = imm32`|`dst = dst + [src]`|
+|16/256|IADD_RC|R|R|`src = dst`|`dst = dst + src + imm32`|
+|12/256|ISUB_R|R|R|`src = imm32`|`dst = dst - src`|
+|7/256|ISUB_M|R|mem|`src = imm32`|`dst = dst - [src]`|
+|9/256|IMUL_9C|R|-|-|`dst = 9 * dst + imm32`|
+|16/256|IMUL_R|R|R|`src = imm32`|`dst = dst * src`|
+|4/256|IMUL_M|R|mem|`src = imm32`|`dst = dst * [src]`|
+|4/256|IMULH_R|R|R|`src = dst`|`dst = (dst * src) >> 64`|
+|1/256|IMULH_M|R|mem|`src = imm32`|`dst = (dst * [src]) >> 64`|
+|4/256|ISMULH_R|R|R|`src = dst`|`dst = (dst * src) >> 64` (signed)|
+|1/256|ISMULH_M|R|mem|`src = imm32`|`dst = (dst * [src]) >> 64` (signed)|
+|4/256|IDIV_C|R|-|-|`dst = dst + dst / imm32`|
+|4/256|ISDIV_C|R|-|-|`dst = dst + dst / imm32` (signed)|
+|2/256|INEG_R|R|-|-|`dst = -dst`|
+|16/256|IXOR_R|R|R|`src = imm32`|`dst = dst ^ src`|
+|4/256|IXOR_M|R|mem|`src = imm32`|`dst = dst ^ [src]`|
+|10/256|IROR_R|R|R|`src = imm32`|`dst = dst >>> src`|
+|4/256|ISWAP_R|R|R|`src = dst`|`temp = src; src = dst; dst = temp`|

-#### 32-bit operations
-Instructions ADD_32, SUB_32, AND_32, OR_32, XOR_32 only use the low-order 32 bits of the input operands. The result of these operations is 32 bits long and bits 32-63 of C are set to zero.
+#### IMULH and ISMULH
+These instructions output the high 64 bits of the whole 128-bit multiplication result. The result differs for signed and unsigned multiplication (`IMULH` is unsigned, `ISMULH` is signed). The variants with a register source operand do not use `imm32` (they perform a squaring operation if `dst` equals `src`).

-#### Multiplication
-There are 5 different multiplication operations. MUL_64 and MULH_64 both take 64-bit unsigned operands, but MUL_64 produces the low 64 bits of the result and MULH_64 produces the high 64 bits. MUL_32 and IMUL_32 use only the low-order 32 bits of the operands and produce a 64-bit result. The signed variant interprets the arguments as signed integers. IMULH_64 takes two 64-bit signed operands and produces the high-order 64 bits of the result.
+#### IDIV_C and ISDIV_C
+The division instructions use a constant divisor, so they can be optimized into a [multiplication by fixed-point reciprocal](https://en.wikipedia.org/wiki/Division_algorithm#Division_by_a_constant). `IDIV_C` performs unsigned division (`imm32` is zero-extended to 64 bits), while `ISDIV_C` performs signed division. In the case of division by zero, the instructions become a no-op. In the very rare case of signed overflow, the destination register is set to zero.

-#### Division
-For the division instructions, the dividend is 64 bits long and the divisor 32 bits long. The IDIV_64 instruction interprets both operands as signed integers. In case of division by zero or signed overflow, the result is equal to the dividend `A`.
-
-75% of division instructions use a runtime-constant divisor and can be optimized using a multiplication and shifts.
-
-#### Shift and rotate
-The shift/rotate instructions use just the bottom 6 bits of the `B` operand (`imm8` is used as the immediate value). All treat `A` as unsigned except SAR_64, which performs an arithmetic right shift by copying the sign bit.
+#### ISWAP_R
+This instruction swaps the values of two registers. If source and destination refer to the same register, the result is a no-op.

 ## Floating point instructions
-There are 5 floating point instructions. All floating point instructions are vector instructions that operate on two packed double precision floating point values.
+For floating point instructions, the destination can be a group F or group E register. Source operand is either a group A register or a memory value.

-|# opcodes|instruction|C|
-|-|-|-|
-|20|FPADD|`A + B`|
-|20|FPSUB|`A - B`|
-|22|FPMUL|`A * B`|
-|8|FPDIV|`A / B`|
-|6|FPSQRT|`sqrt(abs(A))`|
+Memory operands are loaded as 8-byte values from the address indicated by `src`. The 8-byte value is interpreted as two 32-bit signed integers and implicitly converted to floating point format. The lower and upper memory operands are marked as `[src][0]` and `[src][1]`.

-#### Conversion of operand A
-Operand A is loaded from memory as a 64-bit value. All floating point instructions interpret A as two packed 32-bit signed integers and convert them into two packed double precision floating point values.
+|frequency|instruction|dst|src|operation|
+|-|-|-|-|-|
+|8/256|FSWAP_R|F+E|-|`(dst0, dst1) = (dst1, dst0)`|
+|20/256|FADD_R|F|A|`(dst0, dst1) = (dst0 + src0, dst1 + src1)`|
+|5/256|FADD_M|F|mem|`(dst0, dst1) = (dst0 + [src][0], dst1 + [src][1])`|
+|20/256|FSUB_R|F|A|`(dst0, dst1) = (dst0 - src0, dst1 - src1)`|
+|5/256|FSUB_M|F|mem|`(dst0, dst1) = (dst0 - [src][0], dst1 - [src][1])`|
+|6/256|FNEG_R|F|-|`(dst0, dst1) = (-dst0, -dst1)`|
+|20/256|FMUL_R|E|A|`(dst0, dst1) = (dst0 * src0, dst1 * src1)`|
+|4/256|FDIV_M|E|mem|`(dst0, dst1) = (dst0 / [src][0], dst1 / [src][1])`|
+|6/256|FSQRT_R|E|-|`(dst0, dst1) = (√dst0, √dst1)`|
+
+#### Denormal and NaN values
+Due to restrictions on the values of the floating point registers, no operation results in `NaN`.
+`FDIV_M` can produce a denormal result. In that case, the result is set to `DBL_MIN = 2.22507385850720138309e-308`, which is the smallest positive normal number.

 #### Rounding
-FPU instructions conform to the IEEE-754 specification, so they must give correctly rounded results. Initial rounding mode is *roundTiesToEven*. Rounding mode can be changed by the `FPROUND` control instruction. Denormal values must be always flushed to zero.
+All floating point instructions give correctly rounded results. The rounding mode depends on the value of the `fprc` register:

-#### NaN
-If an operation produces NaN, the result is converted into positive zero. NaN results may never be written into registers or memory. Only division and multiplication must be checked for NaN results (`0.0 / 0.0` and `0.0 * Infinity` result in NaN).
-
-## Control instructions
-There are 4 control instructions.
-
-|# opcodes|instruction|description|condition|
-|-|-|-|-|
-|2|FPROUND|change floating point rounding mode|-
-|11|JUMP|conditional jump|(see condition table below)
-|11|CALL|conditional procedure call|(see condition table below)
-|12|RET|return from procedure|stack is not empty
-
-All control instructions behave as 'arithmetic no-op' and simply copy the input operand A into the destination C.
-
-The JUMP and CALL instructions use a condition function, which takes the lower 32 bits of operand B (register) and the value `imm32` and evaluates a condition based on the `B.LOC.C` flag: 
-
-|`B.LOC.C`|signed|jump condition|probability|*x86*|*ARM*
-|---|---|----------|-----|--|----|
-|0|no|`B <= imm32`|0% - 100%|`JBE`|`BLS`
-|1|no|`B > imm32`|0% - 100%|`JA`|`BHI`
-|2|yes|`B - imm32 < 0`|50%|`JS`|`BMI`
-|3|yes|`B - imm32 >= 0`|50%|`JNS`|`BPL`
-|4|yes|`B - imm32` overflows|0% - 50%|`JO`|`BVS`
-|5|yes|`B - imm32` doesn't overflow|50% - 100%|`JNO`|`BVC`
-|6|yes|`B < imm32`|0% - 100%|`JL`|`BLT`
-|7|yes|`B >= imm32`|0% - 100%|`JGE`|`BGE`
-
-The 'signed' column specifies if the operands are interpreted as signed or unsigned 32-bit numbers. Column 'probability' lists the expected jump probability (range means that the actual value for a specific instruction depends on `imm32`). *Columns 'x86' and 'ARM' list the corresponding hardware instructions (following a `CMP` instruction).*
-
-### FPROUND
-The FPROUND instruction changes the rounding mode for all subsequent FPU operations depending on a two-bit flag. The flag is calculated by rotating A `imm8` bits to the right and taking the two least-significant bits:
-
-```
-rounding flag = (A >>> imm8)[1:0]
-```
-
-|rounding flag|rounding mode|
+|`fprc`|rounding mode|
 |-------|------------|
-|00|roundTiesToEven|
-|01|roundTowardNegative|
-|10|roundTowardPositive|
-|11|roundTowardZero|
+|0|roundTiesToEven|
+|1|roundTowardNegative|
+|2|roundTowardPositive|
+|3|roundTowardZero|

 The rounding modes are defined by the IEEE-754 standard.

-*The two-bit flag value exactly corresponds to bits 13-14 of the x86 `MXCSR` register and bits 23 and 22 (reversed) of the ARM `FPSCR` register.*
+## Other instructions
+There are 4 special instructions that either have more than one source operand or write their result to memory.

-### JUMP
-If the jump condition is `true`, the JUMP instruction performs a forward jump relative to the value of `pc`. The forward offset is equal to `16 * (imm8[6:0] + 1)` bytes (1-128 instructions forward).
+|frequency|instruction|dst|src|operation|
+|-|-|-|-|-|
+|7/256|COND_R|R|R, `imm32`|`if(condition(src, imm32)) dst = dst + 1`
+|1/256|COND_M|R|mem, `imm32`|`if(condition([src], imm32)) dst = dst + 1`
+|1/256|CFROUND|`fprc`|R, `imm32`|`fprc = src >>> imm32`
+|16/256|ISTORE|mem|R|`[dst] = src`

-### CALL
-If the jump condition is `true`, the CALL instruction pushes the value of `pc` (program counter) onto the stack and then performs a forward jump relative to the value of `pc`. The forward offset is equal to `16 * (imm8[6:0] + 1)` bytes (1-128 instructions forward).
+#### COND

-### RET
-If the stack is not empty, the RET instruction pops the return address from the stack (it's the instruction following the previous CALL) and jumps to it.
+These instructions conditionally increment the destination register. The condition function depends on the `mod.cond` flag and takes the lower 32 bits of the source operand and the value `imm32`.

-## Reference implementation
-A portable C++ implementation of all integer and floating point instructions is available in [instructionsPortable.cpp](../src/instructionsPortable.cpp).
+|`mod.cond`|signed|`condition`|probability|*x86*|*ARM*
+|---|---|----------|-----|--|----|
+|0|no|`src <= imm32`|0% - 100%|`JBE`|`BLS`
+|1|no|`src > imm32`|0% - 100%|`JA`|`BHI`
+|2|yes|`src - imm32 < 0`|50%|`JS`|`BMI`
+|3|yes|`src - imm32 >= 0`|50%|`JNS`|`BPL`
+|4|yes|`src - imm32` overflows|0% - 50%|`JO`|`BVS`
+|5|yes|`src - imm32` doesn't overflow|50% - 100%|`JNO`|`BVC`
+|6|yes|`src < imm32`|0% - 100%|`JL`|`BLT`
+|7|yes|`src >= imm32`|0% - 100%|`JGE`|`BGE`
+
+The 'signed' column specifies if the operands are interpreted as signed or unsigned 32-bit numbers. Column 'probability' lists the expected probability that the condition is true (a range means that the actual value for a specific instruction depends on `imm32`). *Columns 'x86' and 'ARM' list the corresponding hardware instructions (following a `CMP` instruction).*
+
+#### CFROUND
+This instruction sets the value of the `fprc` register to the 2 least significant bits of the source register rotated right by `imm32`. This changes the rounding mode of all subsequent floating point instructions.
+
+#### ISTORE
+The `ISTORE` instruction stores the value of the source integer register to the memory at the address specified by the destination register. The `src` and `dst` registers can be the same.
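
The `COND` condition table and the `CFROUND` rule in the new `doc/isa-ops.md` text can be sketched in portable C++ as follows. This is an illustration only and is not taken from the commit. The function names are hypothetical, the flag-based formulation (sign and overflow of `src - imm32`) is chosen to mirror the x86 column of the table, and reducing the rotation count modulo 64 is an assumption.

```c++
#include <cstdint>

// Evaluates the COND condition for a given mod.cond value (0-7).
// Only the lower 32 bits of the source operand are used, as in the text.
bool condition(uint32_t modCond, uint64_t srcValue, uint32_t imm32) {
    const uint32_t src  = static_cast<uint32_t>(srcValue);
    const uint32_t diff = src - imm32;           // wraps modulo 2^32
    const bool sign = (diff >> 31) != 0;         // sign flag of src - imm32
    // Signed overflow: the operands differ in sign and the result's sign differs from src.
    const bool overflow = (((src ^ imm32) & (src ^ diff)) >> 31) != 0;
    switch (modCond & 7) {
        case 0:  return src <= imm32;            // unsigned, JBE
        case 1:  return src > imm32;             // unsigned, JA
        case 2:  return sign;                    // JS
        case 3:  return !sign;                   // JNS
        case 4:  return overflow;                // JO
        case 5:  return !overflow;               // JNO
        case 6:  return sign != overflow;        // signed src < imm32, JL
        default: return sign == overflow;        // signed src >= imm32, JGE
    }
}

// CFROUND: the new fprc value is the 2 least significant bits of
// the source register rotated right by imm32.
uint32_t cfround(uint64_t src, uint32_t imm32) {
    const unsigned r = imm32 & 63;               // assumption: count taken modulo 64
    const uint64_t rotated = (src >> r) | (src << ((64 - r) & 63));
    return static_cast<uint32_t>(rotated & 3);
}
```

Deriving conditions 6 and 7 from the sign and overflow flags avoids converting unsigned values to signed types, which is the kind of portability pitfall the commit message mentions.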
--- a/doc/isa.md
+++ b/doc/isa.md
@@ -1,182 +1,91 @@
-# RandomX instruction encoding
-The instruction set was designed in such way that any random 16-byte word is a valid instruction and any sequence of valid instructions is a valid program. There are no syntax rules.

-The encoding of each 128-bit instruction word is following:
+# RandomX instruction set architecture
+RandomX VM is a complex instruction set computer ([CISC](https://en.wikipedia.org/wiki/Complex_instruction_set_computer)). All data are loaded and stored in little-endian byte order. Signed integer numbers are represented using [two's complement](https://en.wikipedia.org/wiki/Two%27s_complement). Floating point numbers are represented using the [IEEE-754 double precision format](https://en.wikipedia.org/wiki/Double-precision_floating-point_format).

-![Imgur](https://i.imgur.com/xi8zuAZ.png)
+## Registers

-## opcode
-There are 256 opcodes, which are distributed between 3 groups of instructions. There are 31 distinct operations (each operation can be encoded using multiple opcodes - for example opcodes `0x00` to `0x0d` correspond to integer addition).
+RandomX has 8 integer registers `r0`-`r7` (group R) and a total of 12 floating point registers split into 3 groups: `a0`-`a3` (group A), `f0`-`f3` (group F) and `e0`-`e3` (group E). Integer registers are 64 bits wide, while floating point registers are 128 bits wide and contain a pair of floating point numbers. The lower and upper half of floating point registers are not separately addressable.

-**Table 1: Instruction groups**
+*Table 1: Addressable register groups*

-|group|# operations|# opcodes||
+|index|R|A|F|E|F+E|
+|--|--|--|--|--|--|
+|0|`r0`|`a0`|`f0`|`e0`|`f0`|
+|1|`r1`|`a1`|`f1`|`e1`|`f1`|
+|2|`r2`|`a2`|`f2`|`e2`|`f2`|
+|3|`r3`|`a3`|`f3`|`e3`|`f3`|
+|4|`r4`||||`e0`|
+|5|`r5`||||`e1`|
+|6|`r6`||||`e2`|
+|7|`r7`||||`e3`|
+
+Besides the directly addressable registers above, there is a 2-bit `fprc` register for rounding control, which is an implicit destination register of the `CFROUND` instruction, and two architectural 32-bit registers `ma` and `mx`, which are not accessible to any instruction. 
+
+Integer registers `r0`-`r7` can be the source or the destination operands of integer instructions or may be used as address registers for loading the source operand from the memory (scratchpad).
+
+Floating point registers `a0`-`a3` are read-only and may not be written to except at the moment a program is loaded into the VM. They can be the source operand of any floating point instruction. The value of these registers is restricted to the interval `[1, 4294967296)`.
+
+Floating point registers `f0`-`f3` are the *additive* registers, which can be the destination of floating point addition and subtraction instructions. The absolute value of these registers will not exceed `1.0e+12`.
+
+Floating point registers `e0`-`e3` are the *multiplicative* registers, which can be the destination of floating point multiplication, division and square root instructions. Their value is always positive.
+
+## Instruction encoding
+
+Each instruction word is 64 bits long and has the following format:
+
+![Imgur](https://i.imgur.com/FtkWRwe.png)
+
+### opcode
+There are 256 opcodes, which are distributed between 35 distinct instructions. Each instruction can be encoded using multiple opcodes (the number of opcodes specifies the frequency of the instruction in a random program).
+
+*Table 2: Instruction groups*
+
+|group|# instructions|# opcodes||
 |---------|-----------------|----|-|
-|integer (IA)|22|144|56.3%|
-|floating point (FP)|5|76|29.7%|
-|control (CL)|4|36|14.0%
-||**31**|**256**|**100%**
+|integer |20|143|55.9%|
+|floating point |11|88|34.4%|
+|other |4|25|9.7%|
+||**35**|**256**|**100%**

 Full description of all instructions: [isa-ops.md](isa-ops.md).

-## A.LOC
-**Table 2: `A.LOC` encoding**
+### dst
+Destination register. Only bits 0-1 (register groups A, F, E) or 0-2 (groups R, F+E) are used to encode a register according to Table 1.

-|bits|description|
+### src
+
+The `src` flag encodes a source operand register according to Table 1 (only bits 0-1 or 0-2 are used).
+
+Immediate value `imm32` is used as the source operand in cases when `dst` and `src` encode the same register.
+
+For register-memory instructions, the source operand determines the `address_base` value for calculating the memory address (see below).
+
+### mod
+
+The `mod` flag is encoded as:
+
+*Table 3: mod flag encoding*
+
+|`mod`|description|
 |----|--------|
-|0-1|`A.LOC.W` flag|
-|2-5|Reserved|
-|6-7|`A.LOC.X` flag|
+|0-1|`mod.mem` flag|
+|2-4|`mod.cond` flag|
+|5-7|Reserved|

-The `A.LOC.W` flag determines the address width when reading operand A from the scratchpad:
+The `mod.mem` flag determines the address mask when reading from or writing to memory:

-**Table 3: Operand A read address width**
+*Table 4: memory address mask*

-|`A.LOC.W`|address width (W)|
-|---------|-|
-|0|15 bits (256 KiB)|
-|1-3|11 bits (16 KiB)|
+|`mod.mem`|`address_mask`|(scratchpad level)|
+|---------|-|---|
+|0|262136|(L2)|
+|1-3|16376|(L1)|

-If the `A.LOC.W` flag is zero, the address space covers the whole 256 KiB scratchpad. Otherwise, just the first 16 KiB of the scratchpad are addressed.
+Table 4 applies to all memory accesses except for cases when the source operand is an immediate value. In that case, `address_mask` is equal to 2097144 (L3).

-If the `A.LOC.X` flag is zero, the instruction mixes the scratchpad read address into the `mx` register using XOR. This mixing happens before the address is truncated to W bits (see pseudocode below).
+The address for reading/writing is calculated by applying bitwise AND operation to `address_base` and `address_mask`.

-## A.REG
-**Table 4: `A.REG` encoding**
+The `mod.cond` flag is used only by the `COND` instruction to select a condition to be tested.

-|bits|description|
-|----|--------|
-|0-2|`A.REG.R` flag|
-|3-7|Reserved|
-
-The `A.REG.R` flag encodes "readAddressRegister", which is an integer register  `r0`-`r7` to be used for scratchpad read address generation. Read address is generated as follows (pseudocode):
-
-```python
-readAddressRegister = IntegerRegister(A.REG.R)
-readAddressRegister = readAddressRegister XOR SignExtend(A.mask32)
-readAddress = readAddressRegister[31:0]
-# dataset is read if the ic register is divisible by 64
-IF ic mod 64 == 0:
-  DatasetRead(readAddress)
-# optional mixing into the mx register
-IF A.LOC.X == 0:
-  mx = mx XOR readAddress
-# truncate to W bits
-W = GetAddressWidth(A.LOC.W)
-readAddress = readAddress[W-1:0]
-```
-
-Note that the value of the read address register is modified during address generation.
-
-## B.LOC
-**Table 5: `B.LOC` encoding**
-
-|bits|description|
-|----|--------|
-|0-1|`B.LOC.L` flag|
-|0-2|`B.LOC.C` flag|
-|3-7|Reserved|
-
-The `B.LOC.L` flag determines the B operand. It can be either a register or immediate value.
-
-**Table 6: Operand B**
-
-|`B.LOC.L`|IA/DIV|IA/SHIFT|IA/MATH|FP|CL|
-|----|--------|----|------|----|---|
-|0|register|`imm8`|`imm32`|register|register|
-|1|`imm32`|register|register|register|register|
-|2|`imm32`|`imm8`|register|register|register|
-|3|`imm32`|register|register|register|register|
-
-Integer instructions are split into 3 classes: integer division (IA/DIV), shift and rotate (IA/SHIFT) and other (IA/MATH). Floating point (FP) and control (CL) instructions always use a register operand.
-
-Register to be used as operand B is encoded in the `B.REG.R` flag (see below).
-
-The `B.LOC.C` flag determines the condition for the JUMP and CALL instructions. The flag partially overlaps with the `B.LOC.L` flag.
-
-## B.REG
-**Table 7: `B.REG` encoding**
-
-|bits|description|
-|----|--------|
-|0-2|`B.REG.R` flag|
-|3-7|Reserved|
-
-Register encoded by the `B.REG.R` depends on the instruction group:
-
-**Table 8: Register operands by group**
-
-|group|registers|
-|----|--------|
-|IA|`r0`-`r7`|
-|FP|`f0`-`f7`|
-|CL|`r0`-`r7`|
-
-##  C.LOC
-**Table 9: `C.LOC` encoding**
-
-|bits|description|
-|----|--------|
-|0-1|`C.LOC.W` flag|
-|2|`C.LOC.R` flag|
-|3-6|Reserved|
-|7|`C.LOC.H` flag|
-
-The `C.LOC.W` flag determines the address width when writing operand C to the scratchpad:
-
-**Table 10: Operand C write address width**
-
-|`C.LOC.W`|address width (W)|
-|---------|-|
-|0|15 bits (256 KiB)|
-|1-3|11 bits (16 KiB)|
-
-If the `C.LOC.W` flag is zero, the address space covers the whole 256 KiB scratchpad. Otherwise, just the first 16 KiB of the scratchpad are addressed.
-
-The `C.LOC.R` determines the destination where operand C is written:
-
-**Table 11: Operand C destination**
-
-|`C.LOC.R`|groups IA, CL|group FP
-|---------|-|-|
-|0|scratchpad|register
-|1|register|register + scratchpad
-
-Integer and control instructions (groups IA and CL) write either to the scratchpad or to a register. Floating point instructions always write to a register and can also write to the scratchpad. In that case, flag `C.LOC.H` determines if the low or high half of the register is written:
-
-**Table 12: Floating point register write**
-
-|`C.LOC.H`|write bits|
-|---------|----------|
-|0|0-63|
-|1|64-127|
-
-## C.REG
-**Table 13: `C.REG` encoding**
-
-|bits|description|
-|----|--------|
-|0-2|`C.REG.R` flag|
-|3-7|Reserved|
-
-The destination register encoded in the `C.REG.R` flag encodes both the write address register (if writing to the scratchpad) and the destination register (if writing to a register). The destination register depends on the instruction group (see Table 8). Write address is always generated from an integer register:
-
-```python
-writeAddressRegister = IntegerRegister(C.REG.R)
-writeAddress = writeAddressRegister[31:0] XOR C.mask32
-# truncate to W bits
-W = GetAddressWidth(C.LOC.W)
-writeAddress = writeAddress [W-1:0]
-```
-
-## imm8
-`imm8` is an 8-bit immediate value that is used as the B operand by IA/SHIFT instructions (see Table 6). Additionally, it's used by some control instructions.
-
-## A.mask32
-`A.mask32` is a 32-bit address mask that is used to calculate the read address for the A operand. It's sign-extended to 64 bits before use.
-
-## imm32
-`imm32` is a 32-bit immediate value which is used for integer instructions from groups IA/DIV and IA/OTHER (see Table 6). The immediate value is sign-extended for instructions that expect 64-bit operands.
-
-## C.mask32
-`C.mask32` is a 32-bit address mask that is used to calculate the write address for the C operand. `C.mask32` is equal to `imm32`.
+### imm32
+A 32-bit immediate value that can be used as the source operand. The immediate value is sign-extended to 64 bits in most cases.
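
The `mod` flag decoding and address calculation described in the new `doc/isa.md` text can be sketched as below. This is a hedged illustration, not code from the commit. The names are hypothetical, and the scratchpad level sizes (16 KiB, 256 KiB, 2 MiB) are inferred from the listed masks, each of which equals a power of two minus 8.

```c++
#include <cstdint>

// Inferred scratchpad level sizes (assumption: mask = size - 8).
constexpr uint32_t kScratchpadL1 = 16 * 1024;
constexpr uint32_t kScratchpadL2 = 256 * 1024;
constexpr uint32_t kScratchpadL3 = 2 * 1024 * 1024;

// mod.mem occupies bits 0-1 of mod, mod.cond occupies bits 2-4.
constexpr uint32_t modMem(uint8_t mod)  { return mod & 3; }
constexpr uint32_t modCond(uint8_t mod) { return (mod >> 2) & 7; }

// Selects address_mask: L3 when the source operand is an immediate value,
// otherwise L2 for mod.mem == 0 and L1 for mod.mem == 1-3.
constexpr uint32_t addressMask(uint8_t mod, bool srcIsImmediate) {
    return srcIsImmediate ? kScratchpadL3 - 8
         : modMem(mod) == 0 ? kScratchpadL2 - 8
                            : kScratchpadL1 - 8;
}

// address = address_base AND address_mask
constexpr uint32_t memoryAddress(uint64_t addressBase, uint8_t mod, bool srcIsImmediate) {
    return static_cast<uint32_t>(addressBase) & addressMask(mod, srcIsImmediate);
}

static_assert(addressMask(0, false) == 262136,  "L2 mask from the table");
static_assert(addressMask(1, false) == 16376,   "L1 mask from the table");
static_assert(addressMask(0, true)  == 2097144, "L3 mask for immediate sources");
```

Because the three low bits of every mask are zero, each resulting address is 8-byte aligned, which matches the 8-byte memory operands described in `isa-ops.md`.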