Updated specification

tevador 2018-11-18 11:38:33 +01:00
parent 7e582c2815
commit ec2d378fce
1 changed files with 121 additions and 24 deletions

145
README.md

@ -1,3 +1,4 @@
# RandomX
RandomX ("random ex") is an experimental proof of work (PoW) algorithm that uses random code execution to achieve ASIC resistance.
@ -13,7 +14,7 @@ RandomX is intended to be run efficiently and easily on a general-purpose CPU. T
#### DRAM
The VM has access to 4 GiB of external memory in read-only mode. The DRAM memory blob is generated from the hash of the previous block using AES encryption (TBD). The contents of the DRAM blob change on average every 2 minutes. The DRAM blob is read with a maximum rate of 2.5 GiB/s per thread.
*The DRAM blob can be generated in 0.1-0.3 seconds using 8 threads with hardware-accelerated AES and dual channel DDR3 or DDR4 memory. Dual channel DDR4 memory has enough bandwidth to support up to 16 mining threads.*
*CPUs without hardware AES support can use a GPU to generate the DRAM blob quickly. Dual channel DDR4 memory has enough bandwidth to support up to 16 mining threads.*
#### MMU
The memory management unit (MMU) interfaces the CPU with the DRAM blob. The purpose of the MMU is to translate the random memory accesses generated by the random program into a DRAM-friendly access pattern, where memory reads are not bound by access latency. The MMU accepts a 32-bit address `addr` and outputs a 64-bit value from DRAM. The DRAM blob is read mostly sequentially. After an average of 8192 sequential reads, a random read is performed. An average program reads a total of 4 MiB of DRAM and has 64 random reads.
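The figures quoted above are mutually consistent, which a short arithmetic check makes visible. This is only a sanity check of the stated averages; the constant names are illustrative and not part of the specification.

```python
# Sanity check of the MMU averages quoted in the specification.
READ_SIZE = 8                 # bytes per MMU read (one 64-bit value)
TOTAL_READ = 4 * 1024 * 1024  # 4 MiB read by an average program
SEQUENTIAL_RUN = 8192         # average sequential reads between random reads

total_reads = TOTAL_READ // READ_SIZE         # number of 64-bit reads
random_reads = total_reads // SEQUENTIAL_RUN  # random jumps per program
run_bytes = SEQUENTIAL_RUN * READ_SIZE        # bytes read per sequential run

print(total_reads, random_reads, run_bytes)
```

Each sequential run therefore covers 64 KiB on average, and 64 such runs add up to the 4 MiB total.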
@ -27,7 +28,7 @@ The MMU uses two internal registers:
#### Scratchpad
The VM contains a 256 KiB scratchpad, which is accessed randomly both for reading and writing. The scratchpad is split into two segments (16 KiB and 240 KiB). 75% of accesses are into the first 16 KiB.
*The scratchpad access pattern mimics the usual CPU cache structure. The first 16 KiB should be covered by the L1 cache, while the remaining accesses should hit the L2 cache. In some cases, the read address can be calculated in advance (see below), which should limit the impact of L1 cache misses.*
*The scratchpad access pattern mimics the usual CPU cache structure. The first 16 KiB should be covered by the L1 cache, while the remaining accesses should hit the L2 cache. In some cases, the read address can be calculated in advance, which should limit the impact of L1 cache misses.*
#### Program
The actual program is stored in an 8 KiB ring buffer structure. Each program consists of 512 random 128-bit instructions. The ring buffer structure makes sure that the program forms a closed infinite loop.
@ -40,21 +41,23 @@ The control unit (CU) controls the execution of the program. It reads instructio
* **sp** - Address of the top of the stack (64-bit, 8 byte aligned).
* **ic** - Instruction counter contains the number of instructions to execute before terminating. The register is decremented after each instruction and the program execution stops when `ic` reaches `0`.
*Fixed number of executed instructions per program should ensure roughly equal runtime of each random program.*
*A fixed number of executed instructions should ensure a roughly equal runtime for each random program.*
#### Stack
To simulate function calls, the VM uses a stack structure. The program interacts with the stack using the CALL and RET instructions. The stack has unlimited size and each stack element is 64 bits wide.
*Although there is no explicit limit of the stack size, the maximum theoretical size of the stack is 16 MiB for a program that contains only unconditional CALL instructions (the probability of randomly generating such program is about 5×10<sup>-912</sup>). In reality, the stack size will rarely exceed 1 MiB.*
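The 16 MiB bound can be reproduced with a short calculation, assuming `ic = 1048576` executed instructions and that every taken CALL pushes two 64-bit values (`A` and `pc`), as described in the CALL section. The constant names are illustrative only.

```python
# Worst-case stack size if every executed instruction were a taken CALL.
IC = 1048576         # instructions executed per program
VALUES_PER_CALL = 2  # a taken CALL pushes A and pc
ELEMENT_BYTES = 8    # each stack element is 64 bits wide

max_stack_bytes = IC * VALUES_PER_CALL * ELEMENT_BYTES
max_stack_mib = max_stack_bytes // (1024 * 1024)

print(max_stack_mib)
```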
#### Register file
The VM has 8 integer registers `r0`-`r7` and 8 floating point registers `f0`-`f7`. All registers are 64 bits wide.
*The number of registers is low enough so that they can be stored in actual hardware registers on most CPUs.*
#### ALU
The arithmetic logic unit (ALU) performs integer operations. The ALU can perform binary integer operations from 11 groups (ADD, SUB, MUL, DIV, AND, OR, XOR, SHL, SHR, ROL, ROR) with operand sizes of 64 or 32 bits.
The arithmetic logic unit (ALU) performs integer operations. The ALU can perform binary integer operations from 7 groups (addition, subtraction, multiplication, division, bitwise operations, shift, rotation) with operand sizes of 64 or 32 bits.
#### FPU
The floating-point unit performs IEEE-754 compliant math using 64-bit double precision floating point numbers.
The floating-point unit performs IEEE-754 compliant math using 64-bit double precision floating point numbers. Five basic operations are available: addition, subtraction, multiplication, division and square root.
#### Endianness
The VM stores and loads all data in little-endian byte order.
@ -64,6 +67,8 @@ The 128-bit instruction is encoded as follows:
![Imgur](https://i.imgur.com/thpvVHN.png)
*All flags are aligned to an 8-bit boundary for easier decoding.*
#### Opcode
There are 256 opcodes, which are distributed between various operations depending on their weight (how often they will occur in the program on average). The opcodes are distributed as follows:
@ -110,7 +115,7 @@ The second operand is loaded either from a register or from an immediate value e
`imm0` is an 8-bit immediate value, which is used for shift and rotate ALU operations.
`imm1` is a 32-bit immediate value which is used for most operations. For operands larger than 32 bits, the value is zero-extended for unsigned instructions and sign-extended for signed instructions. For FPU instructions, the value is treated as a signed 32-bit integer and converted to a double precision floating point format.
`imm1` is a 32-bit immediate value which is used for most operations. For operands larger than 32 bits, the value is zero-extended for unsigned instructions and sign-extended for signed instructions. For FPU instructions, the value is first left-shifted by 32 bits, treated as a signed 64-bit integer and converted to a double precision floating point format.
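A minimal sketch of the updated FPU conversion rule, assuming `imm1` is given as an unsigned 32-bit integer. The function name `imm1_to_double` is hypothetical and used only for illustration.

```python
def imm1_to_double(imm1: int) -> float:
    """Convert a 32-bit immediate to a double for FPU instructions."""
    if not 0 <= imm1 <= 0xFFFFFFFF:
        raise ValueError("imm1 must fit in 32 bits")
    # Left-shift by 32 bits and keep the result within 64 bits.
    shifted = (imm1 << 32) & 0xFFFFFFFFFFFFFFFF
    # Reinterpret the 64-bit pattern as a signed (two's complement) integer.
    if shifted & (1 << 63):
        shifted -= 1 << 64
    # The low 32 bits are zero, so the conversion to double is exact.
    return float(shifted)
```

Under this rule an immediate of `1` becomes 2<sup>32</sup>, and an immediate with the top bit set becomes a large negative value.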
#### Operand C
The third operand is the location where the result is stored.
@ -242,27 +247,119 @@ Taken CALL instruction pushes the values `A` and `pc` (program counter) onto the
##### RET
The RET instruction behaves like "not taken" when the stack is empty. Taken RET instruction pops the return address `raddr` from the stack (it's the instruction following the previous CALL), then pops a return value `retval` from the stack and sets `C = A ^ retval`. Finally, the instruction jumps back to `raddr`.
## Program generation
## Proof of work
### Hash functions
#### Blake2b
The primary cryptographically secure hash function used by RandomX is [Blake2b](https://blake2.net/) with an output size of 256 bits. Blake2b was specifically designed to be fast in software, especially on modern 64-bit processors, where it's around three times faster than SHA-3 and can run at a speed of around 3 clock cycles per byte of input.
`Blake2b(X)` refers to the 256-bit plain hash and `Blake2b(K, X)` refers to the 256-bit keyed hash.
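Both forms can be tried with Python's standard `hashlib` module, which provides Blake2b with a configurable digest size and an optional key. The wrapper names below are illustrative, not part of the specification.

```python
import hashlib

def blake2b_256(x: bytes) -> bytes:
    """Blake2b(X): plain 256-bit hash."""
    return hashlib.blake2b(x, digest_size=32).digest()

def blake2b_256_keyed(k: bytes, x: bytes) -> bytes:
    """Blake2b(K, X): keyed 256-bit hash (the key may be up to 64 bytes)."""
    return hashlib.blake2b(x, digest_size=32, key=k).digest()

digest = blake2b_256(b"example input")
keyed = blake2b_256_keyed(digest, b"example input")
```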
#### HighwayHash
[HighwayHash](https://github.com/google/highwayhash) is a fast keyed pseudorandom function, which can take advantage of SIMD instructions available in modern CPUs. It's used to calculate the scratchpad digest. HighwayHash can run at a speed of about 0.3 clocks per byte using SSE 4.1.
The function is called as `HighwayHash(K, X)`, where `K` is a 256-bit key.
### Pseudo-random number generator
RandomX uses a permuted congruential generator (PCG) for VM initialization. A minimal C implementation is available [here](http://www.pcg-random.org/download.html#minimal-c-implementation). The generator has an internal state of 64 bits and additional 63 bits are used to select the output stream. The generator produces 32 random bits per call.
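The following is a sketch of the generator, ported from the step function of the minimal C implementation. It assumes the state and increment are set directly from the given values; whether RandomX also performs the extra seeding steps of the reference `pcg32_srandom_r` is not stated here, so treat the constructor as an assumption.

```python
MASK64 = (1 << 64) - 1
MULTIPLIER = 6364136223846793005  # LCG multiplier from the minimal C implementation

class Pcg32:
    def __init__(self, state: int, increment: int):
        # Assumption: the state is used as-is, without the reference seeding steps.
        self.state = state & MASK64
        # The increment must be odd, which leaves 63 bits to select the stream.
        self.inc = (increment | 1) & MASK64

    def next(self) -> int:
        """Advance the generator and return 32 random bits."""
        old = self.state
        self.state = (old * MULTIPLIER + self.inc) & MASK64
        xorshifted = (((old >> 18) ^ old) >> 27) & 0xFFFFFFFF
        rot = old >> 59
        # 32-bit right rotation of xorshifted by rot.
        return ((xorshifted >> rot) | (xorshifted << ((-rot) & 31))) & 0xFFFFFFFF
```

Two generators created with the same state and increment produce the same sequence, while a different increment selects a different output stream.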
### DRAM blob initialization
TBD
### VM initialization
#### Scratchpad initialization
The scratchpad is initialized by copying a 256 KiB block from the DRAM blob. The starting offset of the block is `262144 * i`, where `i` is a 14-bit input parameter.
Pseudocode:
```python
# initializes the scratchpad
def InitializeScratchpad(i):
    memcpy(Scratchpad, DRAM + 262144 * i, 262144)
```
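The 14-bit parameter exactly covers the 4 GiB blob: 2<sup>14</sup> blocks of 2<sup>18</sup> bytes give 2<sup>32</sup> bytes. A runnable sketch of the same copy, with `dram` standing in for the DRAM blob as any bytes-like object (the helper names are illustrative):

```python
SCRATCHPAD_SIZE = 262144   # 256 KiB
DRAM_SIZE = 4 * 1024 ** 3  # 4 GiB

def scratchpad_offset(i: int) -> int:
    """Starting offset of the block selected by the 14-bit parameter i."""
    if not 0 <= i < (1 << 14):
        raise ValueError("i must be a 14-bit value")
    return SCRATCHPAD_SIZE * i

def initialize_scratchpad(dram: bytes, i: int) -> bytearray:
    """Copy one 256 KiB block of the DRAM blob into a new scratchpad."""
    offset = scratchpad_offset(i)
    return bytearray(dram[offset:offset + SCRATCHPAD_SIZE])
```

The last valid block (`i = 16383`) ends exactly at the end of the blob, so no index can read out of bounds.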
#### Program initialization
The program is initialized from a 256-bit seed value `S`.
1. A [pcg32](http://www.pcg-random.org/) random number generator is initialized with state `S[63:0]`.
2. The generator is used to generate 128 random bytes `R1`.
3. Integer registers `r0`-`r7` are initialized using bytes 0-63 of `R1`.
4. Floating point registers `f0`-`f7` are initialized using bytes 64-127 of `R1` interpreted as 8 64-bit signed integers converted to a double precision floating point format.
5. The initial value of the `ma` register is set to `S[95:64]` and the last 3 bits are cleared (8-byte aligned).
6. `S` is expanded into 10 AES round keys `K0`-`K9`.
7. `R1` is exploded into a 264 KiB buffer `B` by repeated 10-round AES encryption.
8. The scratchpad is set to the first 256 KiB of `B`.
9. The program buffer is set to the final 8 KiB of `B`.
10. The remaining registers are initialized as `pc = 0`, `sp = 0`, `ic = 1048576` (TBD), `mx = 0`.
1. The PCG random number generator is initialized with state `S[63:0]` and increment `S[127:64] | 1` (odd number).
2. The generator is used to generate 8324 random bytes.
3. The integer registers `r0`-`r7` are initialized with bytes 0-63.
4. Floating point registers `f0`-`f7` are initialized with bytes 64-127 interpreted as 8 64-bit signed integers converted to a double precision floating point format.
5. The program buffer is initialized with bytes 128-8319.
6. The initial value of the `ma` register is set to bytes 8320-8323, XORed with `S[159:128]` and the last 3 bits are cleared (8-byte aligned).
7. The value of the `mx` register is initialized as `S[191:160]`.
8. The remaining registers are initialized with constant values: `pc = 0`, `sp = 0`, `ic = 1048576`.
Pseudocode:
```python
# S is a 256-bit seed value
# initializes the program buffer and registers
def InitializeProgram(S):
    rng = Pcg32(S[63:0], S[127:64] | 1)
    a = []
    loop 2081 times:
        a.append(rng.next())
    r0 = a[0..1]
    r1 = a[2..3]
    r2 = a[4..5]
    r3 = a[6..7]
    r4 = a[8..9]
    r5 = a[10..11]
    r6 = a[12..13]
    r7 = a[14..15]
    f0 = double(a[16..17])
    f1 = double(a[18..19])
    f2 = double(a[20..21])
    f3 = double(a[22..23])
    f4 = double(a[24..25])
    f5 = double(a[26..27])
    f6 = double(a[28..29])
    f7 = double(a[30..31])
    ProgramBuffer = a[32..2079]
    ma = (a[2080] ^ S[159:128]) & 0xFFFFFFF8
    mx = S[191:160]
    pc = 0
    sp = 0
    ic = 1048576
```
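The byte layout behind the pseudocode can be checked with a small runnable sketch: 2081 generator outputs of 4 bytes each give the 8324 bytes that are split into registers, program buffer and the `ma` word. It assumes each 32-bit output is stored in little-endian order, so the first word of a pair is the low half of the 64-bit value; the function name is hypothetical, and the XOR with `S[159:128]` is left out.

```python
import struct

def split_program_bytes(words):
    """Split 2081 32-bit generator outputs into the fields used by the VM."""
    if len(words) != 2081:
        raise ValueError("expected 2081 32-bit words")
    raw = struct.pack("<2081I", *words)                # 8324 bytes in total
    int_regs = list(struct.unpack_from("<8Q", raw, 0)) # bytes 0-63: r0-r7
    # Bytes 64-127: signed 64-bit integers converted to doubles (f0-f7).
    fp_regs = [float(v) for v in struct.unpack_from("<8q", raw, 64)]
    program = raw[128:8320]                            # 8 KiB program buffer
    ma_word = struct.unpack_from("<I", raw, 8320)[0]   # bytes 8320-8323
    return int_regs, fp_regs, program, ma_word
```

The program buffer slice is 8192 bytes long, which matches 512 instructions of 128 bits each.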
## Result
When the program terminates (the value of `ic` register reaches 0), the final result is calculated as follows:
1. The register file is treated as a 128-byte value `R2`.
3. The 256 KiB scratchpad is imploded into a 128-byte digest `D` using 10-round AES decryption with keys `K0`-`K9` and XORing each 128-byte chunk with `R2`.
4. `D` is hashed using the Blake2b 256-bit hash function. This is the result of the PoW.
### PoW hash calculation
RandomX produces a 256-bit final hash value to be used for a Hashcash-style proof evaluation.
The hash of the input (block header for a cryptocurrency) is used for the first VM initialization. The program initialization and program execution are chained three times to discourage mining strategies that search for programs with particular properties. The scratchpad is preserved between the 3 program executions.
Pseudocode:
```python
# H is the input value
# returns a 256-bit PoW hash
def RandomXPoW(H):
    K = Blake2b(H)
    InitializeScratchpad(K[205:192])
    S = K
    loop 3 times:
        InitializeProgram(S)
        ExecuteProgram()
        S = Blake2b(K, RegisterFile)
    W = HighwayHash(K, Scratchpad)
    return Blake2b(K, RegisterFile + W)
```
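The bit-slice notation used in the pseudocode (`K[205:192]`, `S[63:0]` and so on) can be sketched as follows. This assumes that the 256-bit value is interpreted as a little-endian integer and that `X[hi:lo]` denotes an inclusive bit range; neither convention is confirmed by the text, and the helper `bits` is hypothetical.

```python
import hashlib

def bits(value: bytes, hi: int, lo: int) -> int:
    """Return bits hi..lo (inclusive) of a little-endian byte string."""
    n = int.from_bytes(value, "little")
    return (n >> lo) & ((1 << (hi - lo + 1)) - 1)

# K = Blake2b(H) for an arbitrary example input H.
K = hashlib.blake2b(b"example block header", digest_size=32).digest()

scratchpad_index = bits(K, 205, 192)  # 14-bit scratchpad block index
pcg_state = bits(K, 63, 0)            # PCG state for the first program
pcg_increment = bits(K, 127, 64) | 1  # PCG increment, forced to be odd
ma_mask = bits(K, 159, 128)           # XORed into the initial ma value
mx_init = bits(K, 191, 160)           # initial mx value
```

Only bits 0 to 205 of `K` are consumed by these fields during the first initialization; for later iterations the same slices are taken from the new seed `S`.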
*The stack is not included in the result calculation to enable platform-specific return addresses.*
*An average program takes roughly 2.5 ms to execute on a recent CPU (preliminary tests). VM initialization and result calculation should take less than 0.1 ms. The total time to calculate the PoW should be under 10 ms (depends on the overhead of translating RandomX code into machine code).*
### Chaining
The program generation, execution and result calculation can be chained multiple times to discourage mining strategies that search for programs with particular properties.
## Test code
A Python generator is available to generate a random program and output its C source code.
Generate a random program:
```
python rx2c.py > rx-sample.c
```
Compile the program:
```
gcc -O2 -maes -DRAM -DPREF rx-sample.c -o rx-sample
```
*(Note that the test program can be compiled only by the GCC compiler due to the use of non-standard C features such as computed goto.)*
Run the program:
```
./rx-sample
```
*(Note that the test program execution requires more than 4 GiB of available virtual memory and the AES-NI instruction set support.)*