Documentation formatting

This commit is contained in:
tevador 2019-02-09 16:09:55 +01:00
parent 32d827d0a6
commit 9af0cbf108
4 changed files with 28 additions and 29 deletions

View file

@ -24,7 +24,7 @@ Notable parts of the RandomX VM are:
The structure of the VM mimics the components that are found in a typical general purpose computer equipped with a CPU and a large amount of DRAM. The scratchpad is designed to fit into the CPU cache. The first 16 KiB and 256 KiB of the scratchpad are used more often take advantage of the faster L1 and L2 caches. The ratio of random reads from L1/L2/L3 is approximately 9:3:1, which matches the inverse latencies of typical CPU caches. The structure of the VM mimics the components that are found in a typical general purpose computer equipped with a CPU and a large amount of DRAM. The scratchpad is designed to fit into the CPU cache. The first 16 KiB and 256 KiB of the scratchpad are used more often take advantage of the faster L1 and L2 caches. The ratio of random reads from L1/L2/L3 is approximately 9:3:1, which matches the inverse latencies of typical CPU caches.
The VM executes programs in a special instruction set, which was designed in such way that any random 8-byte word is a valid instruction and any sequence of valid instructions is a valid program. For more details see [RandomX ISA documentation](doc/isa.md). Because there are no "syntax" rules, generating a random program is as easy as filling the program buffer with random data. A RandomX program consists of 256 instructions. See [program.inc](../src/program.inc) as an example of a RandomX program translated into x86-64 assembly. The VM executes programs in a special instruction set, which was designed in such way that any random 8-byte word is a valid instruction and any sequence of valid instructions is a valid program. For more details see [RandomX ISA documentation](doc/isa.md). Because there are no "syntax" rules, generating a random program is as easy as filling the program buffer with random data. A RandomX program consists of 256 instructions. See [program.inc](src/program.inc) as an example of a RandomX program translated into x86-64 assembly.
#### Hash calculation #### Hash calculation

View file

@ -1,15 +1,14 @@
# Dataset
## Dataset The dataset is randomly accessed 16384 times during each hash calculation, which significantly increases memory-hardness of RandomX. The size of the dataset is fixed at 4 GiB and it's divided into 67108864 blocks of 64 bytes.
The dataset is randomly accessed 16384 times during each hash calculation, which significantly increases memory-hardness of RandomX. The size of the dataset is fixed at 4 GiB and it's divided into 67108864 block of 64 bytes. In order to allow PoW verification with less than 4 GiB of memory, the dataset is constructed from a 256 MiB cache, which can be used to calculate dataset blocks on the fly.
In order to allow PoW verification with less than 4 GiB of memory, the dataset is constructed from a 256 MiB cache, which can be used to calculate dataset rows on the fly.
Because the initialization of the dataset is computationally intensive, it is recalculated only every 1024 blocks (~34 hours). The following figure visualizes the construction of the dataset: Because the initialization of the dataset is computationally intensive, it is recalculated only every 1024 blocks (~34 hours). The following figure visualizes the construction of the dataset:
![Imgur](https://i.imgur.com/b9WHOwo.png) ![Imgur](https://i.imgur.com/b9WHOwo.png)
### Seed block ## Seed block
The whole dataset is constructed from a 256-bit hash of the last block whose height is divisible by 1024 **and** has at least 64 confirmations. The whole dataset is constructed from a 256-bit hash of the last block whose height is divisible by 1024 **and** has at least 64 confirmations.
|block|Seed block| |block|Seed block|
@ -19,7 +18,7 @@ The whole dataset is constructed from a 256-bit hash of the last block whose hei
|2113-3136|2048| |2113-3136|2048|
|...|... |...|...
### Cache construction ## Cache construction
The 32-byte seed block hash is expanded into the 256 MiB cache using the "memory fill" function of Argon2d. [Argon2](https://github.com/P-H-C/phc-winner-argon2) is a memory-hard password hashing function, which is highly customizable. The variant with "d" suffix uses a data-dependent memory access pattern and provides the highest resistance against time-memory tradeoffs. The 32-byte seed block hash is expanded into the 256 MiB cache using the "memory fill" function of Argon2d. [Argon2](https://github.com/P-H-C/phc-winner-argon2) is a memory-hard password hashing function, which is highly customizable. The variant with "d" suffix uses a data-dependent memory access pattern and provides the highest resistance against time-memory tradeoffs.
@ -42,19 +41,19 @@ The finalizer and output calculation steps of Argon2 are omitted. The output is
The use of 3 iterations makes time-memory tradeoffs infeasible and thus 256 MiB is the minimum amount of memory required by RandomX. The use of 3 iterations makes time-memory tradeoffs infeasible and thus 256 MiB is the minimum amount of memory required by RandomX.
### Dataset block generation ## Dataset block generation
The full 4 GiB dataset can be generated from the 256 MiB cache. Each 64-byte block is generated independently by XORing 16 pseudorandom Cache blocks selected by the `SquareHash` function. The full 4 GiB dataset can be generated from the 256 MiB cache. Each 64-byte block is generated independently by XORing 16 pseudorandom cache blocks selected by the `SquareHash` function.
#### SquareHash ### SquareHash
`SquareHash` is a custom hash function with 64-bit input and 64-bit output. It is calculated by repeatedly squaring the input, splitting the 128-bit result in to two 64-bit halves and subtracting the high half from the low half. This is repeated 42 times. It's available as a [portable C implementation](../src/squareHash.h) and [x86-64 assembly version](../src/asm/squareHash.inc). `SquareHash` is a custom hash function with 64-bit input and 64-bit output. It is calculated by repeatedly squaring the input, splitting the 128-bit result in to two 64-bit halves and subtracting the high half from the low half. This is repeated 42 times. It's available as a [portable C implementation](../src/squareHash.h) and [x86-64 assembly version](../src/asm/squareHash.inc).
Properties of `SquareHash`: Properties of `SquareHash`:
* It achieves full [Avalanche effect](https://en.wikipedia.org/wiki/Avalanche_effect). * It achieves full [Avalanche effect](https://en.wikipedia.org/wiki/Avalanche_effect).
* Since the whole calculation is a long dependency chain, which uses only multiplication and subtraction, the performance gains by using custom hardware are very limited. * Since the whole calculation is a long dependency chain, which uses only multiplication and subtraction, the performance gains by using custom hardware are very limited.
* A single `SquareHash` calculation takes 40-80 ns, which is about the same time as DRAM access latency. Devices using low-latency memory will be bottlenecked by `SquareHash`, while CPUs will finish the hash calculation in about the same time it takes to fetch data from RAM. * A single `SquareHash` calculation takes 40-80 ns, which is about the same time as DRAM access latency. ASIC devices using low-latency memory will be bottlenecked by `SquareHash`, while CPUs will finish the hash calculation in about the same time it takes to fetch data from RAM.
The output of 16 chained SquareHash calculations is used to determine Cache blocks that are XORed together to produce a Dataset block: The output of 16 chained SquareHash calculations is used to determine cache blocks that are XORed together to produce a dataset block:
```c++ ```c++
void initBlock(const uint8_t* cache, uint8_t* out, uint32_t blockNumber) { void initBlock(const uint8_t* cache, uint8_t* out, uint32_t blockNumber) {
@ -92,14 +91,14 @@ void initBlock(const uint8_t* cache, uint8_t* out, uint32_t blockNumber) {
*Note: `SquareHash` doesn't calculate squaring modulo 2<sup>64</sup>+1 because the subtraction is performed modulo 2<sup>64</sup>. Squaring modulo 2<sup>64</sup>+1 can be calculated by adding the carry bit in every iteration (i.e. the sequence in x86-64 assembly would have to be: `mul rax; sub rax, rdx; adc rax, 0`), but this would decrease ASIC-resistance of `SquareHash`.* *Note: `SquareHash` doesn't calculate squaring modulo 2<sup>64</sup>+1 because the subtraction is performed modulo 2<sup>64</sup>. Squaring modulo 2<sup>64</sup>+1 can be calculated by adding the carry bit in every iteration (i.e. the sequence in x86-64 assembly would have to be: `mul rax; sub rax, rdx; adc rax, 0`), but this would decrease ASIC-resistance of `SquareHash`.*
### Performance ## Performance
The initial 256-MiB cache construction using Argon2d takes around 1 second using an older laptop with an Intel i5-3230M CPU (Ivy Bridge). Cache generation is strictly serial and cannot be parallelized. The initial 256-MiB cache construction using Argon2d takes around 1 second using an older laptop with an Intel i5-3230M CPU (Ivy Bridge). Cache generation is strictly serial and cannot be parallelized.
On the same laptop, full Dataset initialization takes around 100 seconds using a single thread (1.5 µs per block). On the same laptop, full dataset initialization takes around 100 seconds using a single thread (1.5 µs per block).
While the generation of a single block is strictly serial, multiple blocks can be easily generated in parallel, so the Dataset generation time decreases linearly with the number of threads. Using an 8-core AMD Ryzen CPU, the whole dataset can be generated in under 10 seconds. While the generation of a single block is strictly serial, multiple blocks can be easily generated in parallel, so the dataset generation time decreases linearly with the number of threads. Using an 8-core AMD Ryzen CPU, the whole dataset can be generated in under 10 seconds.
Moreover, the seed block hash is known up to 64 blocks in advance, so miners can slowly precalculate the whole dataset by generating 524288 dataset blocks per minute (corresponds to about 1% utilization of a single CPU core). Moreover, the seed block hash is known up to 64 blocks in advance, so miners can slowly precalculate the whole dataset by generating 524288 dataset blocks per minute (corresponds to about 1% utilization of a single CPU core).
### Light clients ## Light clients
Light clients, who cannot or do not want to generate and keep the whole dataset in memory, can generate just the cache and then generate blocks on the fly during hash calculation. In this case, the hash calculation time will be increased by 16384 times the single block generation time. For the Intel Ivy Bridge laptop, this amounts to around 24.5 milliseconds per hash. Light clients, who cannot or do not want to generate and keep the whole dataset in memory, can generate just the cache and then generate blocks on the fly during hash calculation. In this case, the hash calculation time will be increased by 16384 times the single block generation time. For the Intel Ivy Bridge laptop, this amounts to around 24.5 milliseconds per hash.

View file

@ -6,7 +6,7 @@ For integer instructions, the destination is always an integer register (registe
Memory operands are loaded as 8-byte values from the address indicated by `src`. This indirect addressing is marked with square brackets: `[src]`. Memory operands are loaded as 8-byte values from the address indicated by `src`. This indirect addressing is marked with square brackets: `[src]`.
|frequency|instruction|dst|src|`src == dst ?`|operation| |frequency|instruction|dst|src|`src == dst ?`|operation|
|-|-|-|-|-|-|-|-| |-|-|-|-|-|-|
|12/256|IADD_R|R|R|`src = imm32`|`dst = dst + src`| |12/256|IADD_R|R|R|`src = imm32`|`dst = dst + src`|
|7/256|IADD_M|R|mem|`src = imm32`|`dst = dst + [src]`| |7/256|IADD_M|R|mem|`src = imm32`|`dst = dst + [src]`|
|16/256|IADD_RC|R|R|`src = dst`|`dst = dst + src + imm32`| |16/256|IADD_RC|R|R|`src = dst`|`dst = dst + src + imm32`|
@ -42,7 +42,7 @@ For floating point instructions, the destination can be a group F or group E reg
Memory operands are loaded as 8-byte values from the address indicated by `src`. The 8 byte value is interpreted as two 32-bit signed integers and implicitly converted to floating point format. The lower and upper memory operands are marked as `[src][0]` and `[src][1]`. Memory operands are loaded as 8-byte values from the address indicated by `src`. The 8 byte value is interpreted as two 32-bit signed integers and implicitly converted to floating point format. The lower and upper memory operands are marked as `[src][0]` and `[src][1]`.
|frequency|instruction|dst|src|operation| |frequency|instruction|dst|src|operation|
|-|-|-|-|-|-|-| |-|-|-|-|-|
|8/256|FSWAP_R|F+E|-|`(dst0, dst1) = (dst1, dst0)`| |8/256|FSWAP_R|F+E|-|`(dst0, dst1) = (dst1, dst0)`|
|20/256|FADD_R|F|A|`(dst0, dst1) = (dst0 + src0, dst1 + src1)`| |20/256|FADD_R|F|A|`(dst0, dst1) = (dst0 + src0, dst1 + src1)`|
|5/256|FADD_M|F|mem|`(dst0, dst1) = (dst0 + [src][0], dst1 + [src][1])`| |5/256|FADD_M|F|mem|`(dst0, dst1) = (dst0 + [src][0], dst1 + [src][1])`|
@ -67,16 +67,16 @@ All floating point instructions give correctly rounded results. The rounding mod
|2|roundTowardPositive| |2|roundTowardPositive|
|3|roundTowardZero| |3|roundTowardZero|
The rounding modes are defined by the IEEE-754 standard. The rounding modes are defined by the IEEE 754 standard.
## Other instructions ## Other instructions
There are 4 special instructions that have more than one source operand or the destination operand is a memory value. There are 4 special instructions that have more than one source operand or the destination operand is a memory value.
|frequency|instruction|dst|src|operation| |frequency|instruction|dst|src|operation|
|-|-|-|-|-| |-|-|-|-|-|
|7/256|COND_R|R|R, `imm32`|`if(condition(src, imm32)) dst = dst + 1` |7/256|COND_R|R|R|`if(condition(src, imm32)) dst = dst + 1`
|1/256|COND_M|R|mem, `imm32`|`if(condition([src], imm32)) dst = dst + 1` |1/256|COND_M|R|mem|`if(condition([src], imm32)) dst = dst + 1`
|1/256|CFROUND|`fprc`|R, `imm32`|`fprc = src >>> imm32` |1/256|CFROUND|`fprc`|R|`fprc = src >>> imm32`
|16/256|ISTORE|mem|R|`[dst] = src` |16/256|ISTORE|mem|R|`[dst] = src`
#### COND #### COND

View file

@ -1,6 +1,6 @@
# RandomX instruction set architecture # RandomX instruction set architecture
RandomX VM is a complex instruction set computer ([CISC](https://en.wikipedia.org/wiki/Complex_instruction_set_computer)). All data are loaded and stored in little-endian byte order. Signed integer numbers are represented using [two's complement](https://en.wikipedia.org/wiki/Two%27s_complement). Floating point numbers are represented using the [IEEE-754 double precision format](https://en.wikipedia.org/wiki/Double-precision_floating-point_format). RandomX VM is a complex instruction set computer ([CISC](https://en.wikipedia.org/wiki/Complex_instruction_set_computer)). All data are loaded and stored in little-endian byte order. Signed integer numbers are represented using [two's complement](https://en.wikipedia.org/wiki/Two%27s_complement). Floating point numbers are represented using the [IEEE 754 double precision format](https://en.wikipedia.org/wiki/Double-precision_floating-point_format).
## Registers ## Registers
@ -36,16 +36,16 @@ Each instruction word is 64 bits long and has the following format:
![Imgur](https://i.imgur.com/FtkWRwe.png) ![Imgur](https://i.imgur.com/FtkWRwe.png)
### opcode ### opcode
There are 256 opcodes, which are distributed between 35 distinct instructions. Each instruction can be encoded using multiple opcodes (the number of opcodes specifies the frequency of the instruction in a random program). There are 256 opcodes, which are distributed between 32 distinct instructions. Each instruction can be encoded using multiple opcodes (the number of opcodes specifies the frequency of the instruction in a random program).
*Table 2: Instruction groups* *Table 2: Instruction groups*
|group|# instructions|# opcodes|| |group|# instructions|# opcodes||
|---------|-----------------|----|-| |---------|-----------------|----|-|
|integer |20|143|55.9%| |integer |19|137|53.5%|
|floating point |11|88|34.4%| |floating point |9|94|36.7%|
|other |4|25|9.7%| |other |4|25|9.8%|
||**35**|**256**|**100%** ||**32**|**256**|**100%**
Full description of all instructions: [isa-ops.md](isa-ops.md). Full description of all instructions: [isa-ops.md](isa-ops.md).
@ -88,4 +88,4 @@ The address for reading/writing is calculated by applying bitwise AND operation
The `mod.cond` flag is used only by the `COND` instruction to select a condition to be tested. The `mod.cond` flag is used only by the `COND` instruction to select a condition to be tested.
### imm32 ### imm32
A 32-bit immediate value that can be used as the source operand. The immediate value is sign-extended to 64 bits in most cases. A 32-bit immediate value that can be used as the source operand. The immediate value is sign-extended to 64 bits unless specified otherwise.