Updated specs

2018-11-05 18:27:48 +01:00 · 2018-11-05 18:27:48 +01:00 · 880f728ca7
parent 5114d6b5fe
commit 880f728ca7
1 changed files with 132 additions and 61 deletions
--- a/README.md
+++ b/README.md
@@ -2,50 +2,64 @@
 # RandomX
 RandomX ("random ex") is an experimental proof of work (PoW) algorithm that uses random code execution to achieve ASIC resistance.

-RandomX uses a simple low-level language (instruction set) to describe a variety of random programs. The instruction set was designed specifically for this proof of work algorithm, because existing languages and instruction sets are designed for a different goal (actual software development) and thus usually have a complex syntax and unnecessary flexibility.
+RandomX uses a simple low-level language (instruction set), which was designed so that any random bitstring forms a valid program.
+
+*Software implementation details and design notes are written in italics.*

 ## Virtual machine
 RandomX is intended to be run efficiently and easily on a general-purpose CPU. The virtual machine (VM) which runs RandomX code attempts to simulate a generic CPU using the following set of components:

-![Imgur](https://i.imgur.com/41MKtMl.png)
+![Imgur](https://i.imgur.com/Of1tGPm.png)

 #### DRAM
-The VM has access to 4 GiB of external memory in read-only mode. The DRAM memory blob is static within a single PoW epoch. The exact algorithm to generate the DRAM blob and its update schedule is to be determined.
+The VM has access to 4 GiB of external memory in read-only mode. The DRAM memory blob is generated from the hash of the previous block using AES encryption (TBD). The contents of the DRAM blob change on average every 2 minutes. The DRAM blob is read with a maximum rate of 2.5 GiB/s per thread.
+
+*The DRAM blob can be generated in 0.1-0.3 seconds using 8 threads with hardware-accelerated AES and dual channel DDR3 or DDR4 memory. Dual channel DDR4 memory has enough bandwidth to support up to 16 mining threads.*

 #### MMU
-The memory management unit (MMU) interfaces the CPU with the DRAM blob. The purpose of the MMU is to translate the random memory accesses generated by the random program into a DRAM-friendly access pattern, where memory reads are not bound by access latency. The MMU splits the 4 GiB DRAM blob into 64-byte blocks (corresponding to the most common L1 cache line size). Data within one block is always read sequentially in eight reads (8×8 bytes). Blocks are read mostly sequentially apart from occasional random jumps that happen on average every 256 blocks. The address of the next block to be read is determined 1 block ahead of time to enable efficient prefetching. The MMU uses three internal registers:
+The memory management unit (MMU) interfaces the CPU with the DRAM blob. The purpose of the MMU is to translate the random memory accesses generated by the random program into a DRAM-friendly access pattern, where memory reads are not bound by access latency. The MMU accepts a 32-bit address `addr` and outputs a 64-bit value from DRAM. The MMU splits the 4 GiB DRAM blob into 256-byte blocks. Data within one block is always read sequentially in 32 reads (32×8 bytes). Blocks are read mostly sequentially apart from occasional random jumps that happen on average every 256 blocks. The address of the next block to be read is determined 1 block ahead of time to enable efficient prefetching. The MMU uses three internal registers:
 * **m0** - Address of the next quadword to be read from memory (32-bit, 8-byte aligned).
-* **m1** - Address of the next block to be read from memory (32-bit, 64-byte aligned).
-* **mx** - Random 64-bit counter that determines if reading continues sequentially or jumps to a random block. When an address `addr` is passed to the MMU, it performs `mx ^= addr` and checks if the last 8 bits of `mx` are zero. If yes, the adjacent 32 bits are copied to register `m1` and 64-byte aligned.
+* **m1** - Address of the next block to be read from memory (32-bit, 256-byte aligned).
+* **mx** - Random 32-bit counter that determines if reading continues sequentially or jumps to a random block. After each read, the read address is mixed with the counter: `mx ^= addr`. When the last quadword of the current block is read (the value of the `m0` register ends with `0xF8`), the MMU checks if the last 8 bits of `mx` are zero. If yes, the value of the `mx` register is copied into register `m1`.

-#### Cache
-The VM contains 256 KiB of cache. The cache is split into two segments (16 KiB and 240 KiB). The cache is randomly accessed for both reading and writing. 75% of accesses are into the first 16 KiB.
+*When the value of the `m1` register is changed, the memory location can be preloaded into CPU cache using the x86 `PREFETCH` instruction or ARM `PRFM` instruction. The average length of a sequential DRAM read is 64 KiB. Implicit prefetch should ensure that sequentially accessed memory is already in the cache.*
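The MMU behaviour described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the class name `MMU`, the `dram_read` callback and the exact way `m0` and `m1` advance at a block boundary are hypothetical and not fixed by the text, which only specifies the `mx ^= addr` mixing, the check of the low 8 bits of `mx` at the end of a block and the 256-byte block size. Because `m0` is 8-byte aligned, the sketch treats offset `0xF8` as the last quadword of a block.

```python
MASK32 = 0xFFFFFFFF
BLOCK_SIZE = 256  # bytes per DRAM block


class MMU:
    """Hypothetical model of the MMU registers m0, m1 and mx."""

    def __init__(self, m0, dram_read):
        self.m0 = m0 & ~0xFF & MASK32              # start of the first block
        self.m1 = (self.m0 + BLOCK_SIZE) & MASK32  # next block, known one block ahead
        self.mx = 0
        self.dram_read = dram_read                 # callback: address -> 64-bit value

    def read(self, addr):
        value = self.dram_read(self.m0)
        self.mx ^= addr & MASK32                   # mix the requested address into mx
        if self.m0 & 0xFF == 0xF8:                 # last quadword of the current block
            next_block = self.m1
            if self.mx & 0xFF == 0:
                self.m1 = self.mx                  # random jump (already 256-byte aligned)
            else:
                # Assumption: otherwise reading simply continues sequentially.
                self.m1 = (self.m1 + BLOCK_SIZE) & MASK32
            self.m0 = next_block
        else:
            self.m0 += 8
        return value
```

With a uniformly distributed `mx`, the jump condition holds for one block in 256, which gives the 64 KiB average sequential read length (256 blocks of 256 bytes).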
+
+#### Scratchpad
+The VM contains a 256 KiB scratchpad, which is accessed randomly both for reading and writing. The scratchpad is split into two segments (16 KiB and 240 KiB). 75% of accesses are into the first 16 KiB.
+
+*The scratchpad access pattern mimics the usual CPU cache structure. The first 16 KiB should be covered by the L1 cache, while the remaining accesses should hit the L2 cache. In some cases, the read address can be calculated in advance (see below), which should limit the impact of L1 cache misses.*

 #### Program
 The actual program is stored in an 8 KiB ring buffer structure. Each program consists of 1024 random 64-bit instructions. The ring buffer structure makes sure that the program forms a closed infinite loop.

+*For high-performance mining, the program should be translated directly into machine code. The whole program will typically fit into the L1 instruction cache and hot execution paths should stay in the µOP cache that is used by newer x86 CPUs. This should limit the number of front-end stalls and keep the CPU busy most of the time.*
+
 #### Control unit
 The control unit (CU) controls the execution of the program. It reads instructions from the program buffer and sends commands to the other units. The CU contains 3 internal registers:
 * **pc** - Address of the next instruction in the program buffer to be executed (64-bit, 8 byte aligned).
 * **sp** - Address of the top of the stack (64-bit, 8 byte aligned).
-* **ic** - Instruction counter = the number of instructions to execute before terminating. Initial value is 65536 and the register is decremented after each executed instruction.
+* **ic** - Instruction counter contains the number of instructions to execute before terminating. The register is decremented after each instruction and the program execution stops when `ic` reaches `0`.
+
+*A fixed number of executed instructions per program should ensure a roughly equal runtime for each random program.*

 #### Stack
 To simulate function calls, the VM uses a stack structure. The program interacts with the stack using the CALL, DCALL and RET instructions. The stack has unlimited size and each stack element is 64 bits wide.

 #### Register file
-The VM has 8 integer registers r0-r7 and 8 floating point registers f0-f7. All registers are 64 bits wide.
+The VM has 8 integer registers `r0`-`r7` (each 64 bits wide), 8 floating point registers `f0`-`f7` (each 64 bits wide) and 4 memory address registers `g0`-`g3` (each 32 bits wide).
+
+*The number of registers is low enough so that they can be stored in actual hardware registers on most CPUs. The memory address registers `g0`-`g3` can be stored in a single 128-bit vector register (`xmm0`-`xmm15` registers for x86 and `Q0`-`Q15` in ARM) for efficient address generation (see below).*

 #### ALU
-The arithmetic logic unit (ALU) performs integer operations. The ALU can perform binary integer operations from 11 groups (ADD, SUB, MUL, DIV, AND, OR, XOR, SHL, SHR, ROL, ROR) with various operand sizes.
+The arithmetic logic unit (ALU) performs integer operations. The ALU can perform binary integer operations from 11 groups (ADD, SUB, MUL, DIV, AND, OR, XOR, SHL, SHR, ROL, ROR) with various operand sizes of 64, 32 or 16 bits.

 #### FPU
 The floating-point unit performs IEEE-754 compliant math using 64-bit double precision floating point numbers.

 ## Instruction set
-The instruction set was designed so that any bitstring is a valid program. The 64-bit instruction is encoded as follows:
+The 64-bit instruction is encoded as follows:

-![Imgur](https://i.imgur.com/TlgeYfk.png)
+![Imgur](https://i.imgur.com/FwYyKBB.png)

 #### Opcode (8 bits)
 There are 256 opcodes, which are distributed between various operations depending on their weight (how often they will occur in the program on average). The distribution of opcodes is as follows:
@@ -56,40 +70,105 @@ There are 256 opcodes, which are distributed between various operations dependin
 |FPU operations|TBD|TBD|
 |Control flow |32|12.5%|

-#### Parameters a, b, c (8 bits)
-`a` and `b` encode the instruction operands and `c` is the destination. All have the same encoding:
+#### Operand a (8 bits)
+`a` encodes the first operand, which is read from memory.

-![Imgur](https://i.imgur.com/Gj9Bolw.png)
+![Imgur](https://i.imgur.com/JNIadYc.png)

-Register number is encoded in the top 3 bits. ALU instructions use registers r0-r7, while FPU instructions use registers f0-f7. Addresses are always loaded from registers r0-r7. The bottom 3 bits determine where the operand is loaded from/result saved to:
+The `loc(a)` flag determines where the operand `A` is read from and where the result `C` is saved to (see Result writeback below):

-|location|A|B|C|
-|---------|-|-|-
-|000|register|register|register|
-|001|register|register|register|
-|010|register|register|register|
-|011|cache|register|cache|
-|100|cache|register|cache|
-|101|DRAM|register|cache|
-|110|DRAM|imm1|cache|
-|111|DRAM|imm1|cache|
-* **register** - Direct register read/write.
-* **cache** - The value of the register is used as an address to read from/write to the cache. The bottom 3 bits of the address are cleared and the address is truncated to the following length depending on the cache bits:
+|loc(a)|read A from|read address|write C to|write address|
+|---------|-|-|-|-|
+|000|DRAM|32 bits|scratchpad|18 bits|
+|001|DRAM|32 bits|scratchpad|14 bits|
+|010|DRAM|32 bits|register `x(b)`|-|
+|011|DRAM|32 bits|register `x(b)`|-|
+|100|scratchpad|18 bits|scratchpad|14 bits|
+|101|scratchpad|14 bits|scratchpad|14 bits|
+|110|scratchpad|14 bits|register `x(b)`|-|
+|111|scratchpad|14 bits|register `x(b)`|-|

-|cache|address length|
-|---------|-|
-|00|18 bits (whole 256 KiB)|
-|01, 10, 11|14 bits (first 16 KiB)|
+The `r(a)` flag encodes an integer register (`r0`-`r7`). The value of the register is first XORed with the value of the `g0` register. The read address `addr` is then equal to the bottom 32 bits of `r(a)`. Additionally, the value of the register and all memory address registers are rotated.

-* **DRAM** - The value of the register is used as an address to pass to the MMU for reading from DRAM.
-* **imm1** - 32-bit immediate value encoded within the instruction. For ALU instructions that use operands shorter than 32 bits, the value is truncated. For operands larger than 32 bits, the value is zero-extended for unsigned instructions and sign-extended for signed instructions. For FPU instructions, the value is treated as a signed 32-bit integer, first converted to a single precision floating point format and then to a double precision format.
+The `addr` value is then truncated to the required length (32, 18 or 14 bits). For reading from and writing to the scratchpad, the address is 8-byte aligned by clearing the bottom 3 bits.
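A minimal Python sketch of the `loc(a)` table and the truncation rule above. The names `LOC_A`, `truncate_address` and `operand_addresses` are hypothetical helpers introduced for illustration only.

```python
# loc(a) -> (read source, read address bits, write destination, write address bits)
LOC_A = {
    0b000: ("DRAM", 32, "scratchpad", 18),
    0b001: ("DRAM", 32, "scratchpad", 14),
    0b010: ("DRAM", 32, "register", None),
    0b011: ("DRAM", 32, "register", None),
    0b100: ("scratchpad", 18, "scratchpad", 14),
    0b101: ("scratchpad", 14, "scratchpad", 14),
    0b110: ("scratchpad", 14, "register", None),
    0b111: ("scratchpad", 14, "register", None),
}


def truncate_address(addr, bits, scratchpad):
    """Keep the bottom `bits` bits; scratchpad addresses are also 8-byte aligned."""
    addr &= (1 << bits) - 1
    if scratchpad:
        addr &= ~0x7
    return addr


def operand_addresses(loc_a, addr):
    """Return (read source, read address, write destination, write address)."""
    src, read_bits, dst, write_bits = LOC_A[loc_a]
    read_addr = truncate_address(addr, read_bits, src == "scratchpad")
    write_addr = None
    if write_bits is not None:
        write_addr = truncate_address(addr, write_bits, True)
    return src, read_addr, dst, write_addr
```

For example, `operand_addresses(0b100, 0x55668877)` reads from scratchpad address `0x28870` and writes to scratchpad address `0x0870`. If all eight `loc(a)` values are equally likely, 6 of the 8 scratchpad accesses in the table use 14-bit addresses, which is consistent with the 75% figure given for the first 16 KiB.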
+
+If the `gen` flag is equal to `00`, this instruction performs the Address generation step (see below).
+
+Pseudocode:
+```
+FUNCTION GET_ADDRESS
+	r(a) ^= g0
+	addr = r(a)[31:0]
+	r(a) = r(a) <<< 32
+	tmp = g0
+	g0 = g1
+	g1 = g2
+	g2 = g3
+	g3 = tmp
+	IF gen == 0b00 THEN GENERATE_ADDRESSES
+	RETURN addr
+END FUNCTION
+```
+*The rotation of registers `g0`-`g3` can be performed with a single `SHUFPS` x86 instruction.*
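The address calculation can also be written as a short Python sketch. It assumes that the intended rotation moves the old value of `g0` into `g3`, and it models the registers as plain lists; the function name and the optional `regenerate` callback (standing in for the address generation step) are hypothetical.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF


def get_address(r, g, a, gen, regenerate=None):
    """r: list of 8 64-bit integers, g: list of 4 32-bit integers, a: index of r(a)."""
    r[a] ^= g[0]                                     # mix g0 into the bottom 32 bits
    addr = r[a] & MASK32                             # read address = bottom 32 bits
    r[a] = ((r[a] << 32) | (r[a] >> 32)) & MASK64    # rotate r(a) by 32 bits
    g.append(g.pop(0))                               # g0<-g1, g1<-g2, g2<-g3, g3<-old g0
    if gen == 0b00 and regenerate is not None:
        regenerate(r, g, a)                          # address generation step
    return addr
```

Rotating `r(a)` by 32 bits swaps its two halves, so the next read through the same register uses the other half of its value.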
+
+
+
+#### Operand b (8 bits)
+`b` encodes the second operand, which is either a register or immediate value.
+
+![Imgur](https://i.imgur.com/ppEiUfh.png)
+
+|loc(b)|read B from|
+|---------|-|
+|000|register `x(b)`|
+|001|register `x(b)`|
+|010|register `x(b)`|
+|011|register `x(b)`|
+|100|register `x(b)`|
+|101|register `x(b)`|
+|110|`imm1`|
+|111|`imm1`|
+
+The `x(b)` flag encodes a register. For ALU operations, this is an integer register (`r0`-`r7`) and for FPU operations, it's a floating point register (`f0`-`f7`).
+
+`imm1` is a 32-bit immediate value encoded within the instruction. For ALU instructions that use operands shorter than 32 bits, the value is truncated. For operands larger than 32 bits, the value is zero-extended for unsigned instructions and sign-extended for signed instructions. For FPU instructions, the value is treated as a signed 32-bit integer, first converted to a single precision floating point format and then to a double precision format.
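The handling of `imm1` can be illustrated with the following Python sketch. The function names are hypothetical, and the single precision step is emulated with a `struct` round trip, which rounds to the nearest representable value.

```python
import struct


def imm1_for_alu(imm1, width, signed):
    """Truncate or extend the 32-bit immediate to an operand of `width` bits."""
    imm1 &= 0xFFFFFFFF
    if width <= 32:
        return imm1 & ((1 << width) - 1)       # truncation for short operands
    if signed and imm1 & 0x80000000:
        imm1 -= 1 << 32                        # sign extension
    return imm1 & ((1 << width) - 1)           # zero extension otherwise


def imm1_for_fpu(imm1):
    """Signed 32-bit integer -> single precision -> double precision."""
    imm1 &= 0xFFFFFFFF
    value = imm1 - (1 << 32) if imm1 & 0x80000000 else imm1
    single = struct.unpack("<f", struct.pack("<f", float(value)))[0]
    return float(single)
```

The intermediate single precision step matters for large immediates: `imm1_for_fpu(0x01000001)` gives `16777216.0`, because 16777217 is not representable in single precision.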

 #### imm0 (8 bits)
 An 8-bit immediate value that is used to calculate the jump offset of the CALL and DCALL instructions.

-### ALU instructions
+#### Result writeback

-All ALU instructions take 2 operands `A` and `B` and produce result `C`. If `C` is shorter than 64 bits, it is zero-extended to 64 bits. 
+All instructions take the operands `A` and `B` and produce a result `C`. Firstly, if `C` is shorter than 64 bits, it is zero-extended to 64 bits. The value of `C` is then written back either to the register `x(b)` or to the scratchpad using the same address `addr` from operand a (see table above).
+
+*CPUs are typically designed for a 2:1 load:store ratio, so each VM instruction performs on average 1 memory read and 0.5 memory writes.*
+
+#### Address generation
+
+To ensure that the values of the memory address registers remain pseudorandom, the values of the registers are regenerated on average once in every 4 instructions.
+
+During address generation, the 4 registers `g0`-`g3` are combined into one 128-bit register `G` and the registers `r(a)` and `x(b)` are combined into a 128-bit register `K`. `G` is then encrypted with a single [AES](https://en.wikipedia.org/wiki/Advanced_Encryption_Standard) round using `K` as the round key.
+
+In pseudocode:
+```
+PROCEDURE GENERATE_ADDRESSES
+	G[127:96] = g3
+	G[95:64] = g2
+	G[63:32] = g1
+	G[31:0] = g0
+	K[127:64] = r(a)
+	K[63:0] = x(b)
+	G = AES_ROUND(G, K)
+	g3 = G[127:96]
+	g2 = G[95:64]
+	g1 = G[63:32]
+	g0 = G[31:0]
+END PROCEDURE
+```
+`AES_ROUND` consists of the ShiftRows, SubBytes and MixColumns steps followed by XOR with `K`.
+
+*For x86 CPUs, address generation requires 2-3 move instructions to construct the key and a single `AESENC` instruction for encryption. ARM requires two separate instructions `AESE` and `AESMC` (for MixColumns). The whole address generation can run in parallel with the currently executed instruction.*
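For reference, the address generation step can be sketched in pure Python. This is a slow illustration, not a mining implementation. It assumes the byte order of the x86 `AESENC` instruction (byte 0 of the 128-bit value is the least significant byte), which the text does not state explicitly; all function names are hypothetical.

```python
def _make_sbox():
    """Build the AES S-box from its definition (field inverse plus affine map)."""
    rotl8 = lambda x, n: ((x << n) | (x >> (8 - n))) & 0xFF
    sbox = [0] * 256
    p = q = 1
    while True:
        p = (p ^ (p << 1) ^ (0x1B if p & 0x80 else 0)) & 0xFF  # p *= 3 in GF(2^8)
        q ^= (q << 1) & 0xFF                                   # q /= 3, so q == 1/p
        q ^= (q << 2) & 0xFF
        q ^= (q << 4) & 0xFF
        if q & 0x80:
            q ^= 0x09
        sbox[p] = q ^ rotl8(q, 1) ^ rotl8(q, 2) ^ rotl8(q, 3) ^ rotl8(q, 4) ^ 0x63
        if p == 1:
            break
    sbox[0] = 0x63
    return sbox


SBOX = _make_sbox()


def _xtime(b):
    """Multiply by 2 in GF(2^8)."""
    b <<= 1
    return b ^ 0x11B if b & 0x100 else b


def aes_round(state, key):
    """ShiftRows, SubBytes and MixColumns, followed by XOR with the round key."""
    s = state.to_bytes(16, "little")
    # Byte i sits in row i % 4, column i // 4; row r is shifted left by r columns.
    t = [SBOX[s[4 * ((c + r) % 4) + r]] for c in range(4) for r in range(4)]
    out = []
    for c in range(4):
        a = t[4 * c:4 * c + 4]
        for r in range(4):
            out.append(_xtime(a[r]) ^ _xtime(a[(r + 1) % 4]) ^ a[(r + 1) % 4]
                       ^ a[(r + 2) % 4] ^ a[(r + 3) % 4])
    return int.from_bytes(bytes(out), "little") ^ key


def generate_addresses(g, r_a, x_b):
    """g: [g0, g1, g2, g3]; r_a and x_b: 64-bit register values. Returns the new g."""
    G = g[0] | (g[1] << 32) | (g[2] << 64) | (g[3] << 96)
    K = (r_a << 64) | x_b
    G = aes_round(G, K)
    return [(G >> (32 * i)) & 0xFFFFFFFF for i in range(4)]
```

When `x(b)` is a floating point register, `x_b` would be its raw 64-bit pattern; that interpretation is an assumption of this sketch.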
+
+
+### ALU instructions

 |opcodes|instruction|signed|A width|B width|C|C width|
 |-|-|-|-|-|-|-|
@@ -126,22 +205,8 @@ All ALU instructions take 2 operands `A` and `B` and produce result `C`. If `C`
 ##### Division
 For the division instructions, the divisor is half the length of the dividend. The result `C` consists of both the quotient and the remainder (the remainder is put in the upper bits). The result of division by zero is equal to the dividend.

-##### Register scrambling
-Because the values of the integer registers are used as read and write addresses, they must stay pseudorandom. To achieve this, every ALU instruction has a scrambling step at the end. The values of the integer registers `r(a)` and `r(c)` corresponding to operands `A` and `C` are concatenated to form a 128-bit value `D`. The value of the integer register `r(b)` corresponding to the `B` operand is concatenated with its corresponding FPU register `f(b)` to form a 128-bit value `K`. `D` is then encrypted with a single [AES](https://en.wikipedia.org/wiki/Advanced_Encryption_Standard) round using `K` as the round key and the result is saved into registers `r(a)` and `r(c)`. 
-
-In pseudocode:
-```
-D[127:64] = r(a)
-D[63:0] = r(c)
-K[127:64] = r(b)
-K[63:0] = f(b)
-E = AES_ROUND(D, K)
-r(a) = E[127:64]
-r(c) = E[63:0]
-```
-`AES_ROUND` consists of the ShiftRows, SubBytes and MixColumns steps followed by XOR with `K`.
-
 ### FPU instructions
+
 |opcodes|instruction|C|
 |-|-|-|
 |TBD|FADD|A + B|
@@ -149,15 +214,19 @@ r(c) = E[63:0]
 |TBD|FMUL|A * B|
 |TBD|FDIV|A / B|
 |TBD|FSQRT|sqrt(A)|
-|TBD|FROUND|-|
+|TBD|FROUND|A|

-FPU instructions conform to the IEEE-754 specification, so they must give bit-exact correctly rounded results. Initial rounding mode is RN (Round to Nearest). Denormal values are treated as zero (this corresponds to setting the FTZ flag in x86 SSE and ARM Neon engines).
+FPU instructions conform to the IEEE-754 specification, so they must give correctly rounded results. Initial rounding mode is RN (Round to Nearest). Denormal values are treated as zero.
+
+*Denormals can be disabled by setting the FTZ flag in x86 SSE and ARM Neon engines. This is done for performance reasons.*

 Operands loaded from memory are treated as signed 64-bit integers and converted to double precision floating point format. Operands loaded from floating point registers are used directly.

 ##### FSQRT
 The sign bit of the FSQRT operand is always cleared first, so only non-negative values are used.

+*In x86, the `SQRTSD` instruction must be used. The legacy `FSQRT` instruction doesn't produce correctly rounded results in all cases.*
+
 ##### FROUND
 The FROUND instruction changes the rounding mode for all subsequent FPU operations depending on the two right-most bits of A:

@@ -168,6 +237,7 @@ The FROUND instruction changes the rounding mode for all subsequent FPU operatio
 |10|Round towards Minus Infinity (RM) mode
 |11|Round towards Zero (RZ) mode

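+*The two-bit flag value exactly corresponds to bits 22-23 (`RMode`) of the ARM `FPSCR` register. The x86 `MXCSR` register keeps its rounding control in bits 13-14, but it encodes rounding towards minus infinity as `01` and rounding towards plus infinity as `10`, so these two values have to be swapped on x86.*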

 ### Control flow instructions
 The following 3 control flow instructions are supported:
@@ -184,17 +254,18 @@ All three instructions are conditional in 75% of cases. The jump is taken only i
 Taken CALL and DCALL instructions push the values `A` and `pc` (program counter) onto the stack and then perform a forward jump relative to the value of `pc`. The forward offset is equal to `8 * (imm0 + 1)` for the CALL instruction and `8 * ((imm0 ^ (A >> 56)) + 1)` for the DCALL instruction. Maximum jump distance is therefore 256 instructions forward (this means that at least 4 correctly spaced CALL/DCALL instructions are needed to form a loop in the program).
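The jump offsets can be computed as in the following Python sketch. The function names are hypothetical, and the wrap-around in `jump_target` is an assumption based on the program being stored in an 8 KiB ring buffer.

```python
PROGRAM_SIZE = 8 * 1024  # bytes: 1024 instructions of 8 bytes each


def call_offset(imm0):
    """Forward offset in bytes for a taken CALL."""
    return 8 * ((imm0 & 0xFF) + 1)


def dcall_offset(imm0, a):
    """Forward offset in bytes for a taken DCALL; `a` is the 64-bit operand A."""
    return 8 * (((imm0 ^ (a >> 56)) & 0xFF) + 1)


def jump_target(pc, offset):
    """Assumed wrap-around of the program counter inside the ring buffer."""
    return (pc + offset) % PROGRAM_SIZE
```

The offsets range from 8 bytes (1 instruction) to 2048 bytes (256 instructions), so four maximal jumps cover the whole 8 KiB buffer.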

 ##### RET
-Taken RET instruction pops the return address `raddr` from the stack (it's the instruction following the previous CALL or DCALL), then pops a return value `retval` from the stack and sets `C = retval`. Finally, the instruction jumps back to `raddr`.
+A taken RET instruction pops the return address `raddr` from the stack (it's the instruction following the previous CALL or DCALL), then pops a return value `retval` from the stack and sets `C = A ^ retval`. Finally, the instruction jumps back to `raddr`.

 ## Program generation
 The program is initialized from a 256-bit seed value using a [PCG random number generator](http://www.pcg-random.org/). The program is generated in this order:
 1. All 1024 instructions are generated as a list of random 64-bit integers.
-2. Initial values of all integer registers r0-r7 are generated as random 64-bit integers.
-3. Initial values of all floating point registers f0-f7 are generated as random 64-bit signed integers converted to a double precision floating point format.
-4. The initial value of the `m0` register is generated as a random 32-bit value with the last 6 bits cleared (64-byte aligned).
-5. The 256 KiB cache is initialized using AES encryption (TBD).
-6. The remaining registers are initialized as `pc = 0`, `sp = 0`, `ic = 65536`, `m1 = m0 + 64`, `mx = 0`.
+2. Initial values of all integer registers `r0`-`r7` are generated as random 64-bit integers.
+3. Initial values of all floating point registers `f0`-`f7` are generated as random 64-bit signed integers converted to a double precision floating point format.
+4. Initial values of all memory address registers `g0`-`g3` are generated as random 32-bit integers.
+5. The initial value of the `m0` register is generated as a random 32-bit value with the last 8 bits cleared (256-byte aligned).
+6. The 256 KiB scratchpad is initialized (TBD).
+7. The remaining registers are initialized as `pc = 0`, `sp = 0`, `ic = 65536` (TBD), `m1 = m0 + 256`, `mx = 0`.
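The initialization order above can be sketched in Python. Note the assumptions: Python's `random.Random` is used purely as a stand-in for the PCG generator, the scratchpad is left zeroed because its initialization is still TBD, and `m1` is assumed to wrap around at 32 bits. The function and key names are hypothetical.

```python
import random


def to_signed64(value):
    """Interpret a 64-bit pattern as a two's complement signed integer."""
    return value - (1 << 64) if value & (1 << 63) else value


def generate_program(seed):
    rng = random.Random(seed)  # stand-in only; the specification calls for PCG
    instructions = [rng.getrandbits(64) for _ in range(1024)]
    r = [rng.getrandbits(64) for _ in range(8)]
    f = [float(to_signed64(rng.getrandbits(64))) for _ in range(8)]
    g = [rng.getrandbits(32) for _ in range(4)]
    m0 = rng.getrandbits(32) & ~0xFF           # 256-byte aligned
    return {
        "instructions": instructions,
        "r": r, "f": f, "g": g,
        "scratchpad": bytearray(256 * 1024),   # placeholder, initialization is TBD
        "pc": 0, "sp": 0, "ic": 65536,         # ic value is TBD in the text
        "m0": m0,
        "m1": (m0 + 256) & 0xFFFFFFFF,         # assumed to wrap at 32 bits
        "mx": 0,
    }
```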


 ## Result
-When the program terminates (the value of `ic` register reaches 0), the cache, the register file and the stack are hashed using the Blake2b hash function to get the final PoW value. The generation/execution can be chained multiple times to discourage mining strategies that search for programs with particular properties.
+When the program terminates (the value of `ic` register reaches 0), the scratchpad, the register file and the stack are hashed using the Blake2b hash function to get the final PoW value. The generation/execution can be chained multiple times to discourage mining strategies that search for programs with particular properties.
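A minimal sketch of the final hashing step using Python's `hashlib.blake2b`. The serialization order, the little-endian encoding and the 32-byte digest size are assumptions made for illustration; the text only states that the scratchpad, the register file and the stack are hashed with Blake2b.

```python
import hashlib
import struct


def pow_hash(scratchpad, r, f, g, stack):
    """Hash the final VM state; the layout of the hashed data is hypothetical."""
    h = hashlib.blake2b(digest_size=32)
    h.update(bytes(scratchpad))                          # 256 KiB scratchpad
    h.update(struct.pack("<8Q", *r))                     # integer registers r0-r7
    h.update(struct.pack("<8d", *f))                     # floating point registers f0-f7
    h.update(struct.pack("<4I", *g))                     # address registers g0-g3
    h.update(struct.pack("<%dQ" % len(stack), *stack))   # 64-bit stack elements
    return h.digest()
```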