17 KiB
RandomX
RandomX ("random ex") is an experimental proof of work (PoW) algorithm that uses random code execution to achieve ASIC resistance.
RandomX uses a simple low-level language (instruction set), which was designed so that any random bitstring forms a valid program.
Software implementation details and design notes are written in italics.
Virtual machine
RandomX is intended to be run efficiently and easily on a general-purpose CPU. The virtual machine (VM) which runs RandomX code attempts to simulate a generic CPU using the following set of components:
DRAM
The VM has access to 4 GiB of external memory in read-only mode. The DRAM memory blob is generated from the hash of the previous block using AES encryption (TBD). The contents of the DRAM blob change on average every 2 minutes. The DRAM blob is read with a maximum rate of 2.5 GiB/s per thread.
The DRAM blob can be generated in 0.1-0.3 seconds using 8 threads with hardware-accelerated AES and dual channel DDR3 or DDR4 memory. Dual channel DDR4 memory has enough bandwidth to support up to 16 mining threads.
MMU
The memory management unit (MMU) interfaces the CPU with the DRAM blob. The purpose of the MMU is to translate the random memory accesses generated by the random program into a DRAM-friendly access pattern, where memory reads are not bound by access latency. The MMU accepts a 32-bit address addr
and outputs a 64-bit value from DRAM. The MMU splits the 4 GiB DRAM blob into 256-byte blocks. Data within one block is always read sequentially in 32 reads (32×8 bytes). When a block has been consumed, reading jumps to a random block. The address of the next block is calculated 8 reads before the current block is exhausted to enable efficient prefetching. The MMU uses three internal registers:
- m0 - Address of the next quadword to be read from memory (32-bit, 8-byte aligned).
- m1 - Address of the next block to be read from memory (32-bit, 256-byte aligned).
- mx - Random 32-bit counter that determines the address of the next block. After each read, the read address is mixed with the counter:
mx ^= addr
. When the 24th quadword of the current block is read (the value of them0
register ends with0xC0
), the value of themx
register is copied into registerm1
and the last 8 bits ofm1
are cleared.
When the value of the m1
register is changed, the memory location can be preloaded into CPU cache using the x86 PREFETCH
instruction or ARM PRFM
instruction. Implicit prefetch should ensure that sequentially accessed memory is already in the cache.
Scratchpad
The VM contains a 256 KiB scratchpad, which is accessed randomly both for reading and writing. The scratchpad is split into two segments (16 KiB and 240 KiB). 75% of accesses are into the first 16 KiB.
The scratchpad access pattern mimics the usual CPU cache structure. The first 16 KiB should be covered by the L1 cache, while the remaining accesses should hit the L2 cache. In some cases, the read address can be calculated in advance (see below), which should limit the impact of L1 cache misses.
Program
The actual program is stored in a 8 KiB ring buffer structure. Each program consists of 1024 random 64-bit instructions. The ring buffer structure makes sure that the program forms a closed infinite loop.
For high-performance mining, the program should be translated directly into machine code. The whole program should fit into the L1 instruction cache and hot execution paths should stay in the µOP cache that is used by newer x86 CPUs. This should limit the number of front-end stalls and keep the CPU busy most of the time.
Control unit
The control unit (CU) controls the execution of the program. It reads instructions from the program buffer and sends commands to the other units. The CU contains 3 internal registers:
- pc - Address of the next instruction in the program buffer to be executed (64-bit, 8 byte aligned).
- sp - Address of the top of the stack (64-bit, 8 byte aligned).
- ic - Instruction counter contains the number of instructions to execute before terminating. The register is decremented after each instruction and the program execution stops when
ic
reaches0
.
Fixed number of executed instructions per program should ensure roughly equal runtime of each random program.
Stack
To simulate function calls, the VM uses a stack structure. The program interacts with the stack using the CALL and RET instructions. The stack has unlimited size and each stack element is 64 bits wide.
Register file
The VM has 8 integer registers r0
-r7
(each 64 bits wide), 8 floating point registers f0
-f7
(each 64 bits wide) and 4 memory address registers g0
-g3
(each 32 bits wide).
The number of registers is low enough so that they can be stored in actual hardware registers on most CPUs. The memory address registers g0
-g3
can be stored in a single 128-bit vector register (xmm0
-xmm15
registers for x86 and Q0
-Q15
in ARM) for efficient address generation (see below).
ALU
The arithmetic logic unit (ALU) performs integer operations. The ALU can perform binary integer operations from 11 groups (ADD, SUB, MUL, DIV, AND, OR, XOR, SHL, SHR, ROL, ROR) with various operand sizes of 64, 32 or 16 bits.
FPU
The floating-point unit performs IEEE-754 compliant math using 64-bit double precision floating point numbers.
Endianness
The VM stores and loads all data in little-endian byte order.
Instruction set
The 64-bit instruction is encoded as follows:
Opcode (8 bits)
There are 256 opcodes, which are distributed between various operations depending on their weight (how often they will occur in the program on average). The distribution of opcodes is following (TBD):
operation | number of opcodes | |
---|---|---|
ALU operations | 158 | 61.7% |
FPU operations | 66 | 25.8% |
Control flow | 32 | 12.5% |
Operand a (8 bits)
a
encodes the first operand, which is read from memory.
The loc(a)
flag determines where the operand A
is read from where the result C
is saved to (see Result write-back below):
loc(a) | read A from | read address | write C to | write address |
---|---|---|---|---|
000 | DRAM | 32 bits | scratchpad | 18 bits |
001 | DRAM | 32 bits | scratchpad | 14 bits |
010 | DRAM | 32 bits | register x(b) |
- |
011 | DRAM | 32 bits | register x(b) |
- |
100 | scratchpad | 18 bits | scratchpad | 14 bits |
101 | scratchpad | 14 bits | scratchpad | 14 bits |
110 | scratchpad | 14 bits | register x(b) |
- |
111 | scratchpad | 14 bits | register x(b) |
- |
The r(a)
flag encodes an integer register (r0
-r7
). The value of the register is first XORed with the value of the g0
register. The read address addr
is then equal to the bottom 32 bits of r(a)
. Additionally, the value of the register and all memory address registers are rotated.
The addr
value is then truncated to the required length (32, 18 or 14 bits). For reading from and writing to the scratchpad, the address is 8-byte aligned by clearing the bottom 3 bits.
If the gen
flag is equal to 00
, this instruction performs the Address generation step (see below).
Pseudocode:
FUNCTION GET_ADDRESS
r(a) ^= g0
addr = r(a)
r(a) <<<= 32
g0 = g1
g1 = g2
g2 = g3
g3 = g0
IF gen == 0b00 THEN GENERATE_ADDRESSES
return addr
END FUNCTION
The rotation of registers g0
-g3
can be performed with a single PSHUFD
x86 instruction.
Operand b (8 bits)
b
encodes the second operand, which is either a register or immediate value.
loc(b) | read B from |
---|---|
000 | register x(b) |
001 | register x(b) |
010 | register x(b) |
011 | register x(b) |
100 | register x(b) |
101 | register x(b) |
110 | imm1 |
111 | imm1 |
The x(b)
flag encodes a register. For ALU operations, this is an integer register (r0
-r7
) and for FPU operations, it's a floating point register (f0
-f7
).
imm1
is a 32-bit immediate value encoded within the instruction. For ALU instructions that use operands shorter than 32 bits, the value is truncated. For operands larger than 32 bits, the value is zero-extended for unsigned instructions and sign-extended for signed instructions. For FPU instructions, the value is treated as a signed 32-bit integer and converted to a double precision floating point format.
imm0 (8 bits)
An 8-bit immediate value that is used to calculate the jump offset of the CALL instruction.
Result writeback
All instructions take the operands A
and B
and produce a result C
. Firstly, if C
is shorter than 64 bits, it is zero-extended to 64 bits. The value of C
is then written back either to the register x(b)
or to the scratchpad using the same address addr
from operand a (see table above).
CPUs are typically designed for a 2:1 load:store ratio, so each VM instruction performs on average 1 memory read and 0.5 write to memory.
Address generation
To ensure that the values of the memory address registers remain pseudorandom, the values of the registers are regenerated on average once in every 4 instructions.
During address generation, the 4 registers g0
-g3
are combined into one 128-bit register G
and the registers r(a)
and x(b)
are combined into a 128-bit register K
. G
is then encrypted with a single AES round using K
as the round key.
In pseudocode:
PROCEDURE GENERATE_ADDRESSES
G[127:96] = g3
G[95:64] = g2
G[63:32] = g1
G[31:0] = g0
K[127:64] = r(a)
K[63:0] = x(b)
G = AES_ROUND(G, K)
g3 = G[127:96]
g2 = G[95:64]
g1 = G[63:32]
g0 = G[31:0]
END PROCEDURE
AES_ROUND
consists of the ShiftRows, SubBytes and MixColumns steps followed by XOR with K
.
For x86 CPUs, address generation requires 2-3 move instructions to construct the key and a single AESENC
instruction for encryption. ARM requires two separate instructions AESE
and AESMC
(for MixColumns). The whole address generation can run in parallel with the currently executed instruction.
ALU instructions
opcodes | instruction | signed | A width | B width | C | C width |
---|---|---|---|---|---|---|
0-13 | ADD_64 | no | 64 | 64 | A + B | 64 |
14-20 | ADD_32 | no | 32 | 32 | A + B | 32 |
21-34 | SUB_64 | no | 64 | 64 | A - B | 64 |
35-41 | SUB_32 | no | 32 | 32 | A - B | 32 |
42-45 | MUL_64 | no | 64 | 64 | A * B | 64 |
46-49 | MULH_64 | no | 64 | 64 | A * B | 64 |
50-53 | MUL_32 | no | 32 | 32 | A * B | 64 |
54-57 | IMUL_32 | yes | 32 | 32 | A * B | 64 |
58-61 | IMULH_64 | yes | 64 | 64 | A * B | 64 |
62 | DIV_64 | no | 64 | 32 | A / B | 32 |
63 | IDIV_64 | yes | 64 | 32 | A / B | 32 |
64-76 | AND_64 | no | 64 | 64 | A & B | 64 |
77-82 | AND_32 | no | 32 | 32 | A & B | 32 |
83-95 | OR_64 | no | 64 | 64 | A | B | 64 |
96-101 | OR_32 | no | 32 | 32 | A | B | 32 |
102-115 | XOR_64 | no | 64 | 64 | A ^ B | 64 |
116-121 | XOR_32 | no | 32 | 32 | A ^ B | 32 |
122-128 | SHL_64 | no | 64 | 6 | A << B | 64 |
129-132 | SHR_64 | no | 64 | 6 | A >> B | 64 |
133-135 | SAR_64 | yes | 64 | 6 | A >> B | 64 |
136-146 | ROL_64 | no | 64 | 6 | A <<< B | 64 |
147-157 | ROR_64 | no | 64 | 6 | A >>> B | 64 |
32-bit operations
Instructions ADD_32, SUB_32, AND_32, OR_32, XOR_32 only use the low-order 32 bits of the input operands. The result of these operations are 32 bits long and bits 32-63 of C are zero.
Multiplication
There are 5 different multiplication operations. MUL_64 and MULH_64 both take 64-bit unsigned operands, but MUL_64 produces the low 64 bits of the result and MULH_64 produces the high 64 bits. MUL_32 and IMUL_32 use only the low-order 32 bits of the operands and produce a 64-bit result. The signed variant interprets the arguments as signed integers. IMULH_64 takes two 64-bit signed operands and produces the high-order 64 bits of the result.
The following table shows an example of the output of the 5 multiplication instructions for inputs A = 0xBC550E96BA88A72B
and B = 0xF5391FA9F18D6273
:
instruction | A interpreted as | B interpreted as | result C |
---|---|---|---|
MUL_64 | 13570769092688258859 | 17670189427727360627 | 0x28723424A9108E51 |
MULH_64 | 13570769092688258859 | 17670189427727360627 | 0xB4676D31D2B34883 |
MUL_32 | 3129517867 | 4052574835 | 0xB001AA5FA9108E51 |
IMUL_32 | -1165449429 | -242392461 | 0x03EBA0C1A9108E51 |
IMULH_64 | -4875974981021292757 | -776554645982190989 | 0x02D93EF1269D3EE5 |
Division
For the division instructions, the dividend is 64 bits long and the divisor 32 bits long. The IDIV_64 instruction interprets both operands as signed integers. In case of division by zero or signed overflow, the result is equal to the dividend A
.
Division by zero can be handled without branching by conditional move (IF B == 0 THEN B = 1
). Signed overflow happens only for the signed variant when the minimum negative value is divided by -1. In this extremely rare case, ARM produces the "correct" result, but x86 throws a hardware exception, which must be handled.
Shift and rotate
The shift/rotate instructions use just the bottom 6 bits of the B
operand. All treat A
as unsigned except SAR_64, which performs an arithmetic right shift by copying the sign bit.
FPU instructions
opcodes | instruction | C |
---|---|---|
158-175 | FADD | A + B |
176-193 | FSUB | A - B |
194-211 | FMUL | A * B |
212-214 | FDIV | A / B |
215-221 | FSQRT | sqrt(A) |
222-223 | FROUND | A |
FPU instructions conform to the IEEE-754 specification, so they must give correctly rounded results. Initial rounding mode is RN (Round to Nearest). Denormal values may not be produced by any operation.
Denormals can be disabled by setting the FTZ flag in x86 SSE and ARM Neon engines. This is done for performance reasons.
Operands loaded from memory are treated as signed 64-bit integers and converted to double precision floating point format. Operands loaded from floating point registers are used directly.
FSQRT
The sign bit of the FSQRT operand is always cleared first, so only non-negative values are used.
In x86, the SQRTSD
instruction must be used. The legacy FSQRT
instruction doesn't produce correctly rounded results in all cases.
FROUND
The FROUND instruction changes the rounding mode for all subsequent FPU operations depending on the two right-most bits of A:
A[1:0] | rounding mode |
---|---|
00 | Round to Nearest (RN) mode |
01 | Round towards Plus Infinity (RP) mode |
10 | Round towards Minus Infinity (RM) mode |
11 | Round towards Zero (RZ) mode |
The two-bit flag value exactly corresponds to bits 13-14 of the x86 MXCSR
register and bits 22-23 of the ARM FPSCR
register.
Control flow instructions
The following 2 control flow instructions are supported:
opcodes | instruction | function |
---|---|---|
224-240 | CALL | near procedure call |
241-255 | RET | return from procedure |
Both instructions are conditional in 75% of cases. The jump is taken only if B <= imm1
. For the 25% of cases when B
is equal to imm1
, the jump is unconditional. In case the branch is not taken, both instructions become "arithmetic no-op" C = A
.
CALL
Taken CALL instruction pushes the values A
and pc
(program counter) onto the stack and then performs a forward jump relative to the value of pc
. The forward offset is equal to 8 * (imm0 + 1)
. Maximum jump distance is therefore 256 instructions forward (this means that at least 4 correctly spaced CALL instructions are needed to form a loop in the program).
RET
The RET instruction behaves like "not taken" when the stack is empty. Taken RET instruction pops the return address raddr
from the stack (it's the instruction following the previous CALL), then pops a return value retval
from the stack and sets C = A ^ retval
. Finally, the instruction jumps back to raddr
.
Program generation
The program is initialized from a 256-bit seed value using a PCG random number generator. The program is generated in this order:
- All 1024 instructions are generated as a list of random 64-bit integers.
- Initial values of all integer registers
r0
-r7
are generated as random 64-bit integers. - Initial values of all floating point registers
f0
-f7
are generated as random 64-bit signed integers converted to a double precision floating point format. - Initial values of all memory address registers
g0
-g3
are generated as random 32-bit integers. - The initial value of the
m0
register is generated as a random 32-bit value with the last 8 bits cleared (256-byte aligned). - A random 128-byte scratchpad seed is generated.
- The initial 256-bit seed is used to generate 10 AES round keys.
- The 256 KiB scratchpad is initialized by repeated 10-round AES encryption starting with the scratchpad seed.
- The remaining registers are initialized as
pc = 0
,sp = 0
,ic = 65536
(TBD),mx = 0
.
Result
When the program terminates (the value of ic
register reaches 0), the final result is calculated as follows:
- The register file is hashed using the Blake2b 256-bit hash function. The order of registers is:
r0
-r7
,f0
-f7
,g0
-g3
(total of 144 bytes). - The 256-bit hash is expanded into 10 AES round keys.
- The 256 KiB scratchpad is imploded into 128 bytes using 10-round AES decryption.
- The 128 byte scratchpad digest is hashed again using the Blake2b 256-bit hash function. This is the result of the PoW.
The stack is not included in the result calculation to enable platform-specific return addresses.
Chaining
The program generation, execution and result calculation can be chained multiple times to discourage mining strategies that search for programs with particular properties.