Updated readme

2024-08-15 00:23:14 +00:00 · 2019-03-28 16:40:53 +01:00 · ad7b473388
commit ad7b473388
parent cc70e53bb1
6 changed files with 45 additions and 411 deletions
--- a/README.md
+++ b/README.md
@@ -4,82 +4,73 @@ RandomX is a proof-of-work (PoW) algorithm that is optimized for general-purpose
 * Prevent the development of a single-chip [ASIC](https://en.wikipedia.org/wiki/Application-specific_integrated_circuit)
 * Minimize the efficiency advantage of specialized hardware compared to a general-purpose CPU

+## Specification
+
+Full specification available in [specs.md](doc/specs.md).
+
 ## Design

-The core of RandomX is a virtual machine (VM), which can be summarized by the following schematic:
+Design notes available in [design.md](doc/design.md).

-![Imgur](https://i.imgur.com/8RYNWLk.png)
+## Build

-Notable parts of the RandomX VM are:
+Build using `make`. Requires a C++11 compliant compiler. There are no dependencies.

-* a large read-only 4 GiB dataset
-* a 2 MiB scratchpad (read/write), which is structured into three levels L1, L2 and L3
-* 8 integer and 12 floating point registers
-* an arithmetic logic unit (ALU)
-* a floating point unit (FPU)
-* a 2 KiB program buffer
+Precompiled test binaries are available on the [Releases page](https://github.com/tevador/RandomX/releases).

-The structure of the VM mimics the components that are found in a typical general purpose computer equipped with a CPU and a large amount of DRAM. The scratchpad is designed to fit into the CPU cache. The first 16 KiB and 256 KiB of the scratchpad are used more often take advantage of the faster L1 and L2 caches. The ratio of random reads from L1/L2/L3 is approximately 9:3:1, which matches the inverse latencies of typical CPU caches.
+## Usage

-The VM executes programs in a special instruction set, which was designed in such way that any random 8-byte word is a valid instruction and any sequence of valid instructions is a valid program. For more details see [RandomX ISA documentation](doc/isa.md). Because there are no "syntax" rules, generating a random program is as easy as filling the program buffer with random data. A RandomX program consists of 256 instructions. See [program.inc](src/program.inc) as an example of a RandomX program translated into x86-64 assembly.
+```
+Usage: randomx [OPTIONS]
+Supported options:
+  --help        shows this message
+  --mine        mining mode: 2 GiB, x86-64 JIT compiled VM
+  --verify      verification mode: 256 MiB
+  --jit         x86-64 JIT compiled verification mode (default: interpreter)
+  --largePages  use large pages
+  --softAes     use software AES (default: x86 AES-NI)
+  --threads T   use T threads (default: 1)
+  --init Q      initialize dataset with Q threads (default: 1)
+  --nonces N    run N nonces (default: 1000)
+  --genAsm      generate x86-64 asm code for nonce N
+  --genNative   generate RandomX code for nonce N
+```

-### Hash calculation
+### Mining mode
+Mining mode requires >2 GiB of RAM and optimal performance should be obtained with at least 16 KiB of L1 cache, 256 KiB of L2 cache and 2 MiB of L3 cache per mining thread.

-Calculating a RandomX hash consists of initializing the 2 MiB scratchpad with random data, executing 8 RandomX loops and calculating a hash of the scratchpad.
+The reference miner supports only x86 64-bit CPUs at the moment. [AES-NI](https://en.wikipedia.org/wiki/AES_instruction_set) support is not required, but using the `--softAes` option reduces mining performance by about 40%.

-Each RandomX loop is repeated 2048 times. The loop body has 4 parts:
-1. The values of all registers are loaded randomly from the scratchpad (L3)
-2. The RandomX program is executed
-3. A random block is loaded from the dataset and mixed with integer registers
-4. All register values are stored into the scratchpad (L3)
+It is recommended to use [large pages](https://en.wikipedia.org/wiki/Page_(computer_memory)#Multiple_page_sizes) with the `--largePages` option. Using the default page size can reduce performance by up to 50% due to [TLB thrashing](https://en.wikipedia.org/wiki/Thrashing_(computer_science)#TLB_thrashing).

-Hash of the register state after 2048 interations is used to initialize the random program for the next loop. The use of 8 different programs in the course of a single hash calculation prevents mining strategies that search for "easy" programs.
+[NUMA](https://en.wikipedia.org/wiki/Non-uniform_memory_access) systems should run one instance of RandomX per NUMA node.

-The loads from the dataset are fully prefetched, so they don't slow down the loop.
+### Light mode

-RandomX uses the [Blake2b](https://en.wikipedia.org/wiki/BLAKE_%28hash_function%29#BLAKE2) cryptographic hash function. Special hashing functions `fillAes1Rx4` and `hashAes1Rx4` based on [AES](https://en.wikipedia.org/wiki/Advanced_Encryption_Standard) encryption are used to initialize and hash the scratchpad ([hashAes1Rx4.cpp](src/hashAes1Rx4.cpp)).
-
-### Hash verification
-
-RandomX is a symmetric PoW algorithm, so the verifying party has to repeat the same steps as when a hash is calculated.
-
-However, to allow hash verification on devices that cannot store the whole 4 GiB dataset, RandomX allows a time-memory tradeoff by using just 256 MiB of memory at the cost of 16 times more random memory accesses. See [Dataset initialization](doc/dataset.md) for more details.
+Verification is done in the 'light' mode, which requires only 256 MiB of memory, but runs much slower than the mining mode. Use the `--jit` option on x86-64 CPUs for maximum verification performance. 

 ### Performance
-Preliminary mining performance with the x86-64 JIT compiled VM:
+Preliminary performance using the optimal number of threads and large pages (if possible):

-|CPU|RAM|threads|hashrate [H/s]|comment|
-|-----|-----|----|----------|-----|
-|AMD Ryzen 1700 (desktop)|DDR4-2933|8|4100|
-|Intel i5-3230M (laptop)|DDR3-1333|1|280|without large pages
-|Intel i7-8550U (laptop)|DDR4-2400|4|1650|limited by thermals
-|Intel i5-2500K (desktop)|DDR3-1333|3|1350|
-
-Hash verification is performed using the portable interpreter in "light-client mode" and takes 30-70 ms depending on RAM latency and CPU clock speed. Hash verification in "mining mode" takes 2-4 ms.
-
-### Documentation
-* [RandomX ISA](doc/isa.md)
-* [RandomX instruction listing](doc/isa-ops.md)
-* [Dataset initialization](doc/dataset.md)
+|CPU|RAM|OS|AES|RandomX (mining)|RandomX (light)|
+|---|---|--|---|---------|--------------|
+AMD Ryzen 7 1700|16 GB DDR4|Ubuntu 16.04|HW|4250 H/s (8T)|640 H/s (16T)|
+Intel Core i7-8550U|16 GB DDR4|Windows 10|HW|1660 H/s (4T)|128 H/s (4T)|
+Intel Core i3-3220|2 GB DDR3|Ubuntu 16.04|software|-|187 H/s (4T)|
+Raspberry Pi 3|1 GB DDR2|Ubuntu 16.04|software|-|12.3 H/s (4T)|

 # FAQ

 ### Can RandomX run on a GPU?

-We don't expect GPUs will ever be competitive in mining RandomX. The reference miner is CPU-only.
+RandomX was designed to be efficient on CPUs. Designing an algorithm compatible with both CPUs and GPUs brings too many limitations and ultimately decreases ASIC resistance.

-A rough estimate for AMD Vega 56 GPU gave an upper limit of 1200 H/s, or slightly less than a quad core CPU (details in issue [#24](https://github.com/tevador/RandomX/issues/24)).
+GPUs are expected to be at a disadvantage when running RandomX, but the exact performance has not been determined yet due to lack of a working GPU implementation.

-RandomX was designed to be efficient on CPUs. Designing an algorithm compatible with both CPUs and GPUs brings too many limitations and ultimately decreases ASIC resistance. CPUs have the advantage of not needing proprietary drivers and most CPU architectures support a large common subset of native operations.
-
-Additionally, targeting CPUs allows for more decentralized mining for several reasons:
-
-* Every computer has a CPU and even laptops will be able to mine efficiently.
-* CPU mining is easier to set up - no driver compatibility issues, BIOS flashing etc.
-* CPU mining is more difficult to centralize because computers can usually have only one CPU except for expensive server parts.
+A rough estimate for AMD Vega 56 GPU gave an upper limit of 1200 H/s, comparable to a quad core CPU (details in issue [#24](https://github.com/tevador/RandomX/issues/24)).

 ### Does RandomX facilitate botnets/malware mining or web mining?
-Quite the opposite. Efficient mining requires 4 GiB of memory, which is very difficult to hide in an infected computer and disqualifies many low-end machines. Web mining is nearly impossible due to the large memory requirement and the need for a rather lengthy initialization of the dataset.
+Quite the opposite. Efficient mining requires 2 GiB of memory, which is difficult to hide in an infected computer and disqualifies many low-end machines such as IoT devices. Web mining is nearly impossible due to the large memory requirements and low performance in interpreted mode.

 ### Since RandomX uses floating point calculations, how can it give reproducible results on different platforms?

--- a/doc/dataset.md
+++ b/doc/dataset.md
@@ -1,104 +0,0 @@
-# Dataset
-
-The dataset is randomly accessed 16384 times during each hash calculation, which significantly increases memory-hardness of RandomX. The size of the dataset is fixed at 4 GiB and it's divided into 67108864 blocks of 64 bytes.
-
-In order to allow PoW verification with less than 4 GiB of memory, the dataset is constructed from a 256 MiB cache, which can be used to calculate dataset blocks on the fly.
-
-Because the initialization of the dataset is computationally intensive, it is recalculated only every 1024 blocks (~34 hours). The following figure visualizes the construction of the dataset:
-
-![Imgur](https://i.imgur.com/b9WHOwo.png)
-
-## Seed block
-The whole dataset is constructed from a 256-bit hash of the last block whose height is divisible by 1024 **and** has at least 64 confirmations.
-
-|block|Seed block|
-|------|---------------------------------|
-|1-1088|Genesis block|
-|1088-2112|1024|
-|2113-3136|2048|
-|...|...
-
-## Cache construction
-
-The 32-byte seed block hash is expanded into the 256 MiB cache using the "memory fill" function of Argon2d. [Argon2](https://github.com/P-H-C/phc-winner-argon2) is a memory-hard password hashing function, which is highly customizable. The variant with "d" suffix uses a data-dependent memory access pattern and provides the highest resistance against time-memory tradeoffs.
-
-Argon2 is used with the following parameters:
-
-|parameter|value|
-|------------|--|
-|parallelism|1|
-|output size|0|
-|memory|262144 (256 MiB)|
-|iterations|3|
-|version|`0x13`|
-|hash type|0 (Argon2d)
-|password|seed block hash (32 bytes)
-|salt|`4d 6f 6e 65 72 6f 1a 24` (8 bytes)
-|secret size|0|
-|assoc. data size|0|
-
-The finalizer and output calculation steps of Argon2 are omitted. The output is the filled memory array.
-
-The use of 3 iterations makes time-memory tradeoffs infeasible and thus 256 MiB is the minimum amount of memory required by RandomX.
-
-## Dataset block generation
-The full 4 GiB dataset can be generated from the 256 MiB cache. Each 64-byte block is generated independently by XORing 16 pseudorandom cache blocks selected by the `SquareHash` function.
-
-### SquareHash
-`SquareHash` is a custom hash function with 64-bit input and 64-bit output. It is calculated by repeatedly squaring the input, splitting the 128-bit result in to two 64-bit halves and subtracting the high half from the low half. This is repeated 42 times. It's available as a [portable C implementation](../src/squareHash.h) and [x86-64 assembly version](../src/asm/squareHash.inc).
-
-Properties of `SquareHash`:
-
-* It achieves full [Avalanche effect](https://en.wikipedia.org/wiki/Avalanche_effect).
-* Since the whole calculation is a long dependency chain, which uses only multiplication and subtraction, the performance gains by using custom hardware are very limited.
-* A single `SquareHash` calculation takes 40-80 ns, which is about the same time as DRAM access latency. ASIC devices using low-latency memory will be bottlenecked by `SquareHash`, while CPUs will finish the hash calculation in about the same time it takes to fetch data from RAM.
-
-The output of 16 chained SquareHash calculations is used to determine cache blocks that are XORed together to produce a dataset block:
-
-```c++
-void initBlock(const uint8_t* cache, uint8_t* out, uint32_t blockNumber) {
-  uint64_t r0, r1, r2, r3, r4, r5, r6, r7;
-
-  r0 = 4ULL * blockNumber;
-  r1 = r2 = r3 = r4 = r5 = r6 = r7 = 0;
-
-  constexpr uint32_t mask = (CacheSize - 1) & CacheLineAlignMask;
-
-  for (auto i = 0; i < DatasetIterations; ++i) {
-    const uint8_t* mixBlock = cache + (r0 & mask);
-    PREFETCHNTA(mixBlock);
-    r0 = squareHash(r0);
-    r0 ^= load64(mixBlock + 0);
-    r1 ^= load64(mixBlock + 8);
-    r2 ^= load64(mixBlock + 16);
-    r3 ^= load64(mixBlock + 24);
-    r4 ^= load64(mixBlock + 32);
-    r5 ^= load64(mixBlock + 40);
-    r6 ^= load64(mixBlock + 48);
-    r7 ^= load64(mixBlock + 56);
-  }
-
-  store64(out + 0, r0);
-  store64(out + 8, r1);
-  store64(out + 16, r2);
-  store64(out + 24, r3);
-  store64(out + 32, r4);
-  store64(out + 40, r5);
-  store64(out + 48, r6);
-  store64(out + 56, r7);
-}
-```
-
-*Note: `SquareHash` doesn't calculate squaring modulo 2<sup>64</sup>+1 because the subtraction is performed modulo 2<sup>64</sup>. Squaring modulo 2<sup>64</sup>+1 can be calculated by adding the carry bit in every iteration (i.e. the sequence in x86-64 assembly would have to be: `mul rax; sub rax, rdx; adc rax, 0`), but this would decrease ASIC-resistance of `SquareHash`.*
-
-## Performance
-The initial 256-MiB cache construction using Argon2d takes around 1 second using an older laptop with an Intel i5-3230M CPU (Ivy Bridge). Cache generation is strictly serial and cannot be parallelized.
-
-On the same laptop, full dataset initialization takes around 100 seconds using a single thread (1.5 µs per block).
-
-While the generation of a single block is strictly serial, multiple blocks can be easily generated in parallel, so the dataset generation time decreases linearly with the number of threads. Using an 8-core AMD Ryzen CPU, the whole dataset can be generated in under 10 seconds.
-
-Moreover, the seed block hash is known up to 64 blocks in advance, so miners can slowly precalculate the whole dataset by generating 524288 dataset blocks per minute (corresponds to about 1% utilization of a single CPU core).
-
-## Light clients
-Light clients, who cannot or do not want to generate and keep the whole dataset in memory, can generate just the cache and then generate blocks on the fly during hash calculation. In this case, the hash calculation time will be increased by 16384 times the single block generation time. For the Intel Ivy Bridge laptop, this amounts to around 24.5 milliseconds per hash.
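Review note on the removed `doc/dataset.md`: the prose description of `SquareHash` (square the input, split the 128-bit product into two 64-bit halves, subtract the high half from the low half, repeat 42 times) can be sketched in portable C++ as follows. This is a minimal sketch written from that description only. The names `mulHigh64`, `squareRound` and `squareHashSketch` are made up for illustration, and the actual `src/squareHash.h` may differ in details that the text does not mention.

```cpp
#include <cstdint>

// High 64 bits of the 128-bit product a * b, computed with 32-bit limbs
// so that no compiler-specific 128-bit integer type is required.
inline uint64_t mulHigh64(uint64_t a, uint64_t b) {
    const uint64_t a0 = a & 0xFFFFFFFFu, a1 = a >> 32;
    const uint64_t b0 = b & 0xFFFFFFFFu, b1 = b >> 32;
    const uint64_t p00 = a0 * b0, p01 = a0 * b1, p10 = a1 * b0, p11 = a1 * b1;
    // Carry out of the middle 32-bit column.
    const uint64_t mid = (p00 >> 32) + (p01 & 0xFFFFFFFFu) + (p10 & 0xFFFFFFFFu);
    return p11 + (p01 >> 32) + (p10 >> 32) + (mid >> 32);
}

// One round as described in the text: low half minus high half of x * x,
// with the subtraction performed modulo 2^64 (like `mul rax; sub rax, rdx`).
inline uint64_t squareRound(uint64_t x) {
    const uint64_t lo = x * x;            // low 64 bits (wraps modulo 2^64)
    const uint64_t hi = mulHigh64(x, x);  // high 64 bits
    return lo - hi;
}

// Hypothetical wrapper: chains the round 42 times, per the description.
inline uint64_t squareHashSketch(uint64_t x, int rounds = 42) {
    for (int i = 0; i < rounds; ++i) {
        x = squareRound(x);
    }
    return x;
}
```

Because every round depends on the result of the previous one, the 42 rounds form a single serial dependency chain, which is the property the removed text relies on. As literally described, the inputs 0 and 1 are fixed points of this function.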
--- a/doc/isa-ops.md
+++ b/doc/isa-ops.md
@@ -1,108 +0,0 @@
-# RandomX instruction listing
-
-## Integer instructions
-For integer instructions, the destination is always an integer register (register group R). Source operand (if applicable) can be either an integer register or memory value. If `dst` and `src` refer to the same register, most instructions use `imm32` as the source operand instead of the register. This is indicated in the 'src == dst' column.
-
-Memory operands are loaded as 8-byte values from the address indicated by `src`.  This indirect addressing is marked with square brackets: `[src]`.
-
-|frequency|instruction|dst|src|`src == dst ?`|operation|
-|-|-|-|-|-|-|
-|12/256|IADD_R|R|R|`src = imm32`|`dst = dst + src`|
-|7/256|IADD_M|R|mem|`src = imm32`|`dst = dst + [src]`|
-|16/256|IADD_RC|R|R|`src = dst`|`dst = dst + src + imm32`|
-|12/256|ISUB_R|R|R|`src = imm32`|`dst = dst - src`|
-|7/256|ISUB_M|R|mem|`src = imm32`|`dst = dst - [src]`|
-|9/256|IMUL_9C|R|-|-|`dst = 9 * dst + imm32`|
-|16/256|IMUL_R|R|R|`src = imm32`|`dst = dst * src`|
-|4/256|IMUL_M|R|mem|`src = imm32`|`dst = dst * [src]`|
-|4/256|IMULH_R|R|R|`src = dst`|`dst = (dst * src) >> 64`|
-|1/256|IMULH_M|R|mem|`src = imm32`|`dst = (dst * [src]) >> 64`|
-|4/256|ISMULH_R|R|R|`src = dst`|`dst = (dst * src) >> 64` (signed)|
-|1/256|ISMULH_M|R|mem|`src = imm32`|`dst = (dst * [src]) >> 64` (signed)|
-|8/256|IMUL_RCP|R|-|-|<code>dst = 2<sup>x</sup> / imm32 * dst</code>|
-|2/256|INEG_R|R|-|-|`dst = -dst`|
-|16/256|IXOR_R|R|R|`src = imm32`|`dst = dst ^ src`|
-|4/256|IXOR_M|R|mem|`src = imm32`|`dst = dst ^ [src]`|
-|10/256|IROR_R|R|R|`src = imm32`|`dst = dst >>> src`|
-|4/256|ISWAP_R|R|R|`src = dst`|`temp = src; src = dst; dst = temp`|
-
-#### IMULH and ISMULH
-These instructions output the high 64 bits of the whole 128-bit multiplication result. The result differs for signed and unsigned multiplication (`IMULH` is unsigned, `ISMULH` is signed). The variants with a register source operand do not use `imm32` (they perform a squaring operation if `dst` equals `src`).
-
-#### IMUL_RCP
-This instruction multiplies the destination register by a reciprocal of `imm32`. The reciprocal is calculated as <code>rcp = 2<sup>x</sup> / imm32</code> by choosing the largest integer `x` such that <code>rcp < 2<sup>64</sup></code>. If `imm32` equals 0, this instruction is a no-op.
-
-#### ISWAP_R
-This instruction swaps the values of two registers. If source and destination refer to the same register, the result is a no-op.
-
-## Floating point instructions
-For floating point instructions, the destination can be a group F or group E register. Source operand is either a group A register or a memory value.
-
-Memory operands are loaded as 8-byte values from the address indicated by `src`. The 8 byte value is interpreted as two 32-bit signed integers and implicitly converted to floating point format. The lower and upper memory operands are marked as `[src][0]` and `[src][1]`.
-
-Memory operands for group E registers are loaded as described above, then their sign bit is cleared and their exponent value is set to `0x30F` (corresponds to 2<sup>-240</sup>).
-
-|frequency|instruction|dst|src|operation|
-|-|-|-|-|-|
-|8/256|FSWAP_R|F+E|-|`(dst0, dst1) = (dst1, dst0)`|
-|20/256|FADD_R|F|A|`(dst0, dst1) = (dst0 + src0, dst1 + src1)`|
-|5/256|FADD_M|F|mem|`(dst0, dst1) = (dst0 + [src][0], dst1 + [src][1])`|
-|20/256|FSUB_R|F|A|`(dst0, dst1) = (dst0 - src0, dst1 - src1)`|
-|5/256|FSUB_M|F|mem|`(dst0, dst1) = (dst0 - [src][0], dst1 - [src][1])`|
-|6/256|FSCAL_R|F|-|<code>(dst0, dst1) = (-2<sup>x0</sup> * dst0, -2<sup>x1</sup> * dst1)</code>|
-|20/256|FMUL_R|E|A|`(dst0, dst1) = (dst0 * src0, dst1 * src1)`|
-|4/256|FDIV_M|E|mem|`(dst0, dst1) = (dst0 / [src][0], dst1 / [src][1])`|
-|6/256|FSQRT_R|E|-|`(dst0, dst1) = (√dst0, √dst1)`|
-
-#### FSCAL_R
-This instruction negates the number and multiplies it by <code>2<sup>x</sup></code>. `x` is calculated by taking the 5 least significant digits of the biased exponent and interpreting them as a binary number using the digit set `{+1, -1}` as opposed to the traditional `{0, 1}`. The possible values of `x` are all odd numbers from -31 to +31.
-
-The mathematical operation described above is equivalent to a bitwise XOR of the binary representation with the value of `0x81F0000000000000`.
-
-#### Denormal and NaN values
-Due to restrictions on the values of the floating point registers, no operation results in `NaN` or a denormal number.
-
-#### Rounding
-All floating point instructions give correctly rounded results. The rounding mode depends on the value of the `fprc` register:
-
-|`fprc`|rounding mode|
-|-------|------------|
-|0|roundTiesToEven|
-|1|roundTowardNegative|
-|2|roundTowardPositive|
-|3|roundTowardZero|
-
-The rounding modes are defined by the IEEE 754 standard.
-
-## Other instructions
-There are 4 special instructions that have more than one source operand or the destination operand is a memory value.
-
-|frequency|instruction|dst|src|operation|
-|-|-|-|-|-|
-|7/256|COND_R|R|R|`if(condition(src, imm32)) dst = dst + 1`
-|1/256|COND_M|R|mem|`if(condition([src], imm32)) dst = dst + 1`
-|1/256|CFROUND|`fprc`|R|`fprc = src >>> imm32`
-|16/256|ISTORE|mem|R|`[dst] = src`
-
-#### COND
-
-These instructions conditionally increment the destination register. The condition function depends on the `mod.cond` flag and takes the lower 32 bits of the source operand and the value `imm32`.
-
-|`mod.cond`|signed|`condition`|probability|*x86*|*ARM*
-|---|---|----------|-----|--|----|
-|0|no|`src <= imm32`|0% - 100%|`JBE`|`BLS`
-|1|no|`src > imm32`|0% - 100%|`JA`|`BHI`
-|2|yes|`src - imm32 < 0`|50%|`JS`|`BMI`
-|3|yes|`src - imm32 >= 0`|50%|`JNS`|`BPL`
-|4|yes|`src - imm32` overflows|0% - 50%|`JO`|`BVS`
-|5|yes|`src - imm32` doesn't overflow|50% - 100%|`JNO`|`BVC`
-|6|yes|`src < imm32`|0% - 100%|`JL`|`BLT`
-|7|yes|`src >= imm32`|0% - 100%|`JGE`|`BGE`
-
-The 'signed' column specifies if the operands are interpreted as signed or unsigned 32-bit numbers. Column 'probability' lists the expected probability the condition is true (range means that the actual value for a specific instruction depends on `imm32`). *Columns 'x86' and 'ARM' list the corresponding hardware instructions (following a `CMP` instruction).*
-
-#### CFROUND
-This instruction sets the value of the `fprc` register to the 2 least significant bits of the source register rotated right by `imm32`. This changes the rounding mode of all subsequent floating point instructions.
-
-#### ISTORE
-The `ISTORE` instruction stores the value of the source integer register to the memory at the address specified by the destination register. The `src` and `dst` register can be the same.
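Review note on the removed `doc/isa-ops.md`: the `FSCAL_R` section states that negating a number and scaling it by <code>2<sup>x</sup></code> is equivalent to XORing its binary representation with `0x81F0000000000000`. The sketch below applies that XOR to a single IEEE 754 double. The function name `fscalSketch` is hypothetical, and the real VM operates on register pairs rather than on one value.

```cpp
#include <cstdint>
#include <cstring>

// Flips the sign bit and the 5 least significant bits of the biased exponent,
// which is the XOR mask quoted in the FSCAL_R description.
// Assumes `double` is an IEEE 754 binary64 value, as the ISA document requires.
inline double fscalSketch(double value) {
    uint64_t bits;
    std::memcpy(&bits, &value, sizeof bits);   // well-defined bit copy
    bits ^= 0x81F0000000000000ULL;
    std::memcpy(&value, &bits, sizeof value);
    return value;
}
```

For example, `1.0` has biased exponent `0x3FF`, whose low five bits are all set; clearing them subtracts 31 from the exponent, so the result is -2<sup>-31</sup>. Applying the operation twice restores the original value, since XOR is its own inverse.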
--- a/doc/isa.md
+++ b/doc/isa.md
@@ -1,91 +0,0 @@
-
-# RandomX instruction set architecture
-RandomX VM is a complex instruction set computer ([CISC](https://en.wikipedia.org/wiki/Complex_instruction_set_computer)). All data are loaded and stored in little-endian byte order. Signed integer numbers are represented using [two's complement](https://en.wikipedia.org/wiki/Two%27s_complement). Floating point numbers are represented using the [IEEE 754 double precision format](https://en.wikipedia.org/wiki/Double-precision_floating-point_format).
-
-## Registers
-
-RandomX has 8 integer registers `r0`-`r7` (group R) and a total of 12 floating point registers split into 3 groups: `a0`-`a3` (group A), `f0`-`f3` (group F) and `e0`-`e3` (group E). Integer registers are 64 bits wide, while floating point registers are 128 bits wide and contain a pair of floating point numbers. The lower and upper half of floating point registers are not separately addressable.
-
-*Table 1: Addressable register groups*
-
-|index|R|A|F|E|F+E|
-|--|--|--|--|--|--|
-|0|`r0`|`a0`|`f0`|`e0`|`f0`|
-|1|`r1`|`a1`|`f1`|`e1`|`f1`|
-|2|`r2`|`a2`|`f2`|`e2`|`f2`|
-|3|`r3`|`a3`|`f3`|`e3`|`f3`|
-|4|`r4`||||`e0`|
-|5|`r5`||||`e1`|
-|6|`r6`||||`e2`|
-|7|`r7`||||`e3`|
-
-Besides the directly addressable registers above, there is a 2-bit `fprc` register for rounding control, which is an implicit destination register of the `CFROUND` instruction, and two architectural 32-bit registers `ma` and `mx`, which are not accessible to any instruction. 
-
-Integer registers `r0`-`r7` can be the source or the destination operands of integer instructions or may be used as address registers for loading the source operand from the memory (scratchpad).
-
-Floating point registers `a0`-`a3` are read-only and may not be written to except at the moment a program is loaded into the VM. They can be the source operand of any floating point instruction. The value of these registers is restricted to the interval `[1, 4294967296)`.
-
-Floating point registers `f0`-`f3` are the *additive* registers, which can be the destination of floating point addition and subtraction instructions. The absolute value of these registers will not exceed `1.0e+12`.
-
-Floating point registers `e0`-`e3` are the *multiplicative* registers, which can be the destination of floating point multiplication, division and square root instructions. Their value is always positive.
-
-## Instruction encoding
-
-Each instruction word is 64 bits long and has the following format:
-
-![Imgur](https://i.imgur.com/FtkWRwe.png)
-
-### opcode
-There are 256 opcodes, which are distributed between 32 distinct instructions. Each instruction can be encoded using multiple opcodes (the number of opcodes specifies the frequency of the instruction in a random program).
-
-*Table 2: Instruction groups*
-
-|group|# instructions|# opcodes||
-|---------|-----------------|----|-|
-|integer |19|137|53.5%|
-|floating point |9|94|36.7%|
-|other |4|25|9.8%|
-||**32**|**256**|**100%**
-
-Full description of all instructions: [isa-ops.md](isa-ops.md).
-
-### dst
-Destination register. Only bits 0-1 (register groups A, F, E) or 0-2 (groups R, F+E) are used to encode a register according to Table 1.
-
-### src
-
-The `src` flag encodes a source operand register according to Table 1 (only bits 0-1 or 0-2 are used).
-
-Immediate value `imm32` is used as the source operand in cases when `dst` and `src` encode the same register.
-
-For register-memory instructions, the source operand determines the `address_base` value for calculating the memory address (see below).
-
-### mod
-
-The `mod` flag is encoded as:
-
-*Table 3: mod flag encoding*
-
-|`mod`|description|
-|----|--------|
-|0-1|`mod.mem` flag|
-|2-4|`mod.cond` flag|
-|5-7|Reserved|
-
-The `mod.mem` flag determines the address mask when reading from or writing to memory:
-
-*Table 3: memory address mask*
-
-|`mod.mem`|`address_mask`|(scratchpad level)|
-|---------|-|---|
-|0|262136|(L2)|
-|1-3|16376|(L1)|
-
-Table 3 applies to all memory accesses except for cases when the source operand is an immediate value. In that case, `address_mask` is equal to 2097144 (L3). 
-
-The address for reading/writing is calculated by applying bitwise AND operation to `address_base` and `address_mask`.
-
-The `mod.cond` flag is used only by the `COND` instruction to select a condition to be tested.
-
-### imm32
-A 32-bit immediate value that can be used as the source operand. The immediate value is sign-extended to 64 bits unless specified otherwise.
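Review note on the removed `doc/isa.md`: the address calculation described in the `mod` section (select a mask from `mod.mem`, then AND it with `address_base`) can be sketched as below. The function names are hypothetical, and the sketch assumes that "bits 0-1" of `mod` means the two least significant bits.

```cpp
#include <cstdint>

// Returns the address mask from the removed table:
//   immediate source -> 2097144 (L3, 2 MiB - 8)
//   mod.mem == 0     -> 262136  (L2, 256 KiB - 8)
//   mod.mem == 1..3  -> 16376   (L1, 16 KiB - 8)
inline uint32_t addressMaskSketch(uint8_t mod, bool srcIsImmediate) {
    if (srcIsImmediate) {
        return 2097144u;
    }
    const uint8_t modMem = mod & 0x3;  // assumed: bits 0-1 are the low bits
    return modMem == 0 ? 262136u : 16376u;
}

// Applies the mask to address_base with a bitwise AND.
inline uint32_t scratchpadAddress(uint64_t addressBase, uint8_t mod, bool srcIsImmediate) {
    return static_cast<uint32_t>(addressBase & addressMaskSketch(mod, srcIsImmediate));
}
```

All three masks have their lowest three bits cleared, so every resulting address is 8-byte aligned, matching the 8-byte memory operands. With `mod.mem` values 1 to 3 mapping to L1 and only 0 mapping to L2, a uniformly random `mod` selects the 16 KiB range three times out of four.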
--- a/doc/vm.md
+++ b/doc/vm.md
@@ -1,55 +0,0 @@
-
-
-## RandomX virtual machine
-RandomX is intended to be run efficiently on a general-purpose CPU. The virtual machine (VM) which runs RandomX code attempts to simulate a generic CPU using the following set of components:
-
-![Imgur](https://i.imgur.com/ZAfbX9m.png)
-
-#### Dataset
-The VM has access to a read-only dataset which has a size of 4 GiB and changes every ~34 hours. See [dataset.md](dataset.md) for details how the dataset is generated.
-
-#### MMU
-The memory management unit (MMU) interfaces the CPU with the external memory. The purpose of the MMU is to translate the random memory accesses generated by the random program into a DRAM-friendly access pattern, where memory reads are not bound by access latency. The MMU accepts a 32-bit address `addr` and outputs a 64-bit value from the dataset. The dataset is read mostly sequentially. On average, there is one random read for every 8192 sequential reads. An average program reads a total of 4 MiB of the dataset and has 64 random reads.
-
-The MMU uses two internal registers:
-* **ma** - Address of the next quadword to be read from memory (32-bit, 8-byte aligned).
-* **mx** - A 32-bit counter that determines if the next read is sequential or random. After each read, the read address is XORed with the counter and if bits 3-15 of the register are zero, bits 0-2 are cleared and the value of the `mx` register is copied into register `ma`. Thus, all random reads are aligned to a 64 KiB block boundary.
-
-*When the value of the `ma` register is changed to a random address, the memory location can be preloaded into CPU cache using the x86 `PREFETCH` instruction or ARM `PRFM` instruction. Implicit prefetch should ensure that sequentially accessed memory is already in the cache.*
-
-#### Scratchpad
-The VM contains a 256 KiB scratchpad, which is accessed randomly both for reading and writing. The scratchpad is split into two segments (16 KiB and 240 KiB). 75% of accesses are into the first 16 KiB.
-
-*The scratchpad access pattern mimics the usual CPU cache structure. The first 16 KiB should be covered by the L1 cache, while the remaining accesses should hit the L2 cache. In some cases, the read address can be calculated in advance, which should limit the impact of L1 cache misses.*
-
-#### Program
-The actual program is stored in a 8 KiB ring buffer structure. Each program consists of 512 random 128-bit instructions. The ring buffer structure makes sure that the program forms a closed infinite loop.
-
-*For high-performance mining, the program should be translated directly into machine code. The whole program will fit into the L1 instruction cache and hot execution paths should stay in the µOP cache that is used by newer x86 CPUs. This should limit the number of front-end stalls and keep the CPU busy most of the time.*
-
-#### Control unit
-The control unit (CU) controls the execution of the program. It reads instructions from the program buffer and sends commands to the other units. The CU contains 3 internal registers:
-* **pc** - Address of the next instruction in the program buffer to be executed (64-bit, 8 byte aligned).
-* **sp** - Address of the last element on the stack (64-bit, 8 byte aligned).
-* **ic** - Instruction counter contains the number of instructions to execute before terminating. The register is decremented after each instruction and the program execution stops when `ic` reaches `0`.
-
-*Fixed number of executed instructions ensure roughly equal runtime of each random program.*
-
-#### Stack
-To simulate function calls, the VM uses a stack structure. The program interacts with the stack using the CALL and RET instructions. The stack has unlimited size and each stack element is 64 bits wide.
-
-*Although there is no explicit limit of the stack size, the maximum theoretical size of the stack is 16 MiB. Most programs will use around 4 KiB of stack.*
-
-#### Register file
-The VM has 8 integer registers `r0`-`r7`  and 8 floating point registers `f0`-`f7`. The integer registers are 64 bits wide. The floating point registers are 128 bits wide and each stores two packed double precision numbers.
-
-*The number of registers is low enough so that they can be stored in actual hardware registers on most CPUs.*
-
-#### ALU
-The arithmetic logic unit (ALU) performs integer operations. The ALU can perform binary integer operations from 7 groups (addition, subtraction, multiplication, division, bitwise operations, shift, rotation) with operand sizes of 64 or 32 bits.
-
-#### FPU
-The floating-point unit performs IEEE-754 compliant math using 64-bit double precision floating point numbers. Five basic operations are available: addition, subtraction, multiplication, division and square root. All operations work with two packed double precision numbers.
-
-#### Binary encoding
-The VM stores and loads all data in little-endian byte order. Signed integer numbers are represented using two's complement.
--- a/src/main.cpp
+++ b/src/main.cpp
@@ -115,8 +115,9 @@ void printUsage(const char* executable) {
 	std::cout << "Usage: " << executable << " [OPTIONS]" << std::endl;
 	std::cout << "Supported options:" << std::endl;
 	std::cout << "  --help        shows this message" << std::endl;
-	std::cout << "  --mine        mining mode: 4 GiB, x86-64 compiled VM" << std::endl;
-	std::cout << "  --verify      verification mode: 256 MiB, portable VM" << std::endl;
+	std::cout << "  --mine        mining mode: 2 GiB, x86-64 JIT compiled VM" << std::endl;
+	std::cout << "  --verify      verification mode: 256 MiB" << std::endl;
+	std::cout << "  --jit         x86-64 JIT compiled verification mode (default: interpreter)" << std::endl;
 	std::cout << "  --largePages  use large pages" << std::endl;
 	std::cout << "  --softAes     use software AES (default: x86 AES-NI)" << std::endl;
 	std::cout << "  --threads T   use T threads (default: 1)" << std::endl;
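Review note on the `src/main.cpp` hunk: the diff only shows the help text, not how the new `--jit` flag is read. As an illustration of how flags of this shape could be parsed, here is a minimal sketch. The `Options` struct and `parseOptions` function are hypothetical and are not taken from `src/main.cpp`; only a subset of the listed options is handled.

```cpp
#include <cstdlib>
#include <cstring>

// Hypothetical option holder; defaults follow the help text in the hunk above.
struct Options {
    bool help = false;
    bool mine = false;        // --mine
    bool verify = false;      // --verify
    bool jit = false;         // --jit (default: interpreter)
    bool largePages = false;  // --largePages
    bool softAes = false;     // --softAes
    int threads = 1;          // --threads T (default: 1)
};

// Fills `out` from argv. Returns false on an unknown flag or a missing value.
inline bool parseOptions(int argc, const char* const* argv, Options& out) {
    for (int i = 1; i < argc; ++i) {
        const char* arg = argv[i];
        if (std::strcmp(arg, "--help") == 0) {
            out.help = true;
        } else if (std::strcmp(arg, "--mine") == 0) {
            out.mine = true;
        } else if (std::strcmp(arg, "--verify") == 0) {
            out.verify = true;
        } else if (std::strcmp(arg, "--jit") == 0) {
            out.jit = true;
        } else if (std::strcmp(arg, "--largePages") == 0) {
            out.largePages = true;
        } else if (std::strcmp(arg, "--softAes") == 0) {
            out.softAes = true;
        } else if (std::strcmp(arg, "--threads") == 0) {
            if (i + 1 >= argc) {
                return false;  // value for --threads is missing
            }
            out.threads = std::atoi(argv[++i]);
        } else {
            return false;      // unknown option
        }
    }
    return true;
}
```

With this sketch, a command line such as `randomx --verify --jit --threads 4` would set `verify` and `jit` and leave `mine` unset, which corresponds to the JIT-compiled light mode that the updated README recommends for verification on x86-64.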