RandomWOW/doc/vm.md
2018-12-31 19:27:31 +01:00

RandomX virtual machine

RandomX is intended to be run efficiently on a general-purpose CPU. The virtual machine (VM) which runs RandomX code attempts to simulate a generic CPU using the following set of components:

Dataset

The VM has access to a read-only dataset which has a size of 4 GiB and changes every ~34 hours. See dataset.md for details on how the dataset is generated.

MMU

The memory management unit (MMU) interfaces the CPU with the external memory. The purpose of the MMU is to translate the random memory accesses generated by the random program into a DRAM-friendly access pattern, where memory reads are not bound by access latency. The MMU accepts a 32-bit address addr and outputs a 64-bit value from the dataset. The dataset is read mostly sequentially. On average, there is one random read for every 8192 sequential reads. An average program reads a total of 4 MiB of the dataset and has 64 random reads.
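
The figures in this paragraph are mutually consistent, which a few lines of Python can confirm. The constant names are illustrative only and are not taken from the RandomX source.

```python
QUADWORD_BYTES = 8                    # the MMU returns one 64-bit value per read
BYTES_READ_PER_PROGRAM = 4 * 1024**2  # 4 MiB read by an average program
SEQUENTIAL_PER_RANDOM = 8192          # one random read per 8192 sequential reads

reads_per_program = BYTES_READ_PER_PROGRAM // QUADWORD_BYTES
random_reads = reads_per_program // SEQUENTIAL_PER_RANDOM

print(reads_per_program)  # 524288
print(random_reads)       # 64
```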

The MMU uses two internal registers:

  • ma - Address of the next quadword to be read from memory (32-bit, 8-byte aligned).
  • mx - A 32-bit counter that determines if the next read is sequential or random. After each read, the read address is XORed with the counter. If bits 3-15 of mx are then zero, bits 0-2 are cleared and the value of mx is copied into ma. Because bits 0-15 of the copied value are all zero, every random read is aligned to a 64 KiB block boundary.
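
The register update can be sketched as follows. This is one possible reading of the rule above, not the reference implementation: the `Mmu` class and the `read_quadword` callback are hypothetical, and advancing ma by 8 bytes on a sequential read is an assumption based on ma being the address of the next quadword.

```python
MASK32 = 0xFFFFFFFF
RANDOM_BITS = 0xFFF8  # bits 3-15 of mx decide between sequential and random


class Mmu:
    def __init__(self, ma, mx, read_quadword):
        self.ma = ma & MASK32 & ~7          # 8-byte aligned read address
        self.mx = mx & MASK32
        self.read_quadword = read_quadword  # hypothetical dataset accessor

    def read(self, addr):
        value = self.read_quadword(self.ma)
        self.mx = (self.mx ^ addr) & MASK32
        if (self.mx & RANDOM_BITS) == 0:
            # Random read: clear bits 0-2 and jump to the new address.
            self.mx &= MASK32 & ~7
            self.ma = self.mx
        else:
            # Sequential read (assumed): move on to the next quadword.
            self.ma = (self.ma + 8) & MASK32
        return value


# Toy dataset that simply returns the address it was asked for.
mmu = Mmu(ma=0, mx=0, read_quadword=lambda address: address)
first = mmu.read(0x12345678)                # bits 3-15 of mx are non-zero
after_sequential = mmu.ma                   # 8
mmu.read(0x12345678 ^ 0xABCD0005)           # mx becomes 0xABCD0005
after_random = mmu.ma                       # 0xABCD0000, 64 KiB aligned
```

Since 13 bits of mx have to be zero at once, a uniformly distributed mx would trigger a random read once in 2^13 = 8192 reads on average.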

When the value of the ma register is changed to a random address, the memory location can be preloaded into CPU cache using the x86 PREFETCH instruction or ARM PRFM instruction. Implicit prefetch should ensure that sequentially accessed memory is already in the cache.

Scratchpad

The VM contains a 256 KiB scratchpad, which is accessed randomly both for reading and writing. The scratchpad is split into two segments (16 KiB and 240 KiB). 75% of accesses are into the first 16 KiB.
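
This document does not specify how an access is routed to one of the two segments. The sketch below shows one hypothetical way to obtain the 3:1 split: the `selector` argument, its two-bit encoding and the 8-byte alignment of offsets are all assumptions made for illustration.

```python
SCRATCHPAD_SIZE = 256 * 1024
FIRST_SEGMENT = 16 * 1024
SECOND_SEGMENT = SCRATCHPAD_SIZE - FIRST_SEGMENT  # 240 KiB


def scratchpad_offset(selector, addr):
    """Map a (selector, addr) pair to a byte offset in the scratchpad.

    Assumption: 3 of the 4 possible values of the low two selector bits
    pick the first 16 KiB, which gives 75% of accesses for uniform input.
    """
    if (selector & 3) != 0:
        return (addr % FIRST_SEGMENT) & ~7
    return FIRST_SEGMENT + ((addr % SECOND_SEGMENT) & ~7)


hits_first = sum(scratchpad_offset(s, 0x12345) < FIRST_SEGMENT for s in range(4))
print(hits_first)  # 3
```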

The scratchpad access pattern mimics the usual CPU cache structure. The first 16 KiB should be covered by the L1 cache, while the remaining accesses should hit the L2 cache. In some cases, the read address can be calculated in advance, which should limit the impact of L1 cache misses.

Program

The actual program is stored in an 8 KiB ring buffer structure. Each program consists of 512 random 128-bit instructions. The ring buffer structure makes sure that the program forms a closed infinite loop.
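
As a quick check of the sizes, and to illustrate the wrap-around, here is a minimal sketch that addresses instructions by index rather than by byte address. The names are illustrative.

```python
PROGRAM_LENGTH = 512     # instructions per program
INSTRUCTION_BITS = 128   # size of one instruction

PROGRAM_BYTES = PROGRAM_LENGTH * INSTRUCTION_BITS // 8
print(PROGRAM_BYTES)     # 8192, i.e. 8 KiB


def next_index(index):
    """Index of the following instruction; the last one wraps to the first."""
    return (index + 1) % PROGRAM_LENGTH


print(next_index(0))     # 1
print(next_index(511))   # 0
```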

For high-performance mining, the program should be translated directly into machine code. The whole program will fit into the L1 instruction cache and hot execution paths should stay in the µOP cache that is used by newer x86 CPUs. This should limit the number of front-end stalls and keep the CPU busy most of the time.

Control unit

The control unit (CU) controls the execution of the program. It reads instructions from the program buffer and sends commands to the other units. The CU contains 3 internal registers:

  • pc - Address of the next instruction in the program buffer to be executed (64-bit, 8-byte aligned).
  • sp - Address of the last element on the stack (64-bit, 8-byte aligned).
  • ic - Instruction counter; contains the number of instructions left to execute before the program terminates. The register is decremented after each instruction and the program execution stops when ic reaches 0.

A fixed number of executed instructions ensures a roughly equal runtime for each random program.
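
The role of ic can be sketched with a small interpreter loop. This is a simplified illustration under stated assumptions: the `run` function and the `execute` callback are hypothetical, instructions are addressed by index, and a callback that returns an integer is treated as a jump target.

```python
def run(program, ic, execute):
    """Execute exactly `ic` instructions from the ring buffer `program`."""
    index = 0
    executed = 0
    while ic > 0:
        instruction = program[index]
        index = (index + 1) % len(program)  # fall through, wrapping at the end
        target = execute(instruction)       # hypothetical instruction handler
        if target is not None:
            index = target % len(program)   # taken jump (e.g. CALL or RET)
        ic -= 1
        executed += 1
    return executed


trace = []
count = run(["a", "b", "c"], ic=7, execute=trace.append)
print(count)  # 7
print(trace)  # ['a', 'b', 'c', 'a', 'b', 'c', 'a']
```

Whatever path the program takes through the ring buffer, the loop performs the same number of iterations.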

Stack

To simulate function calls, the VM uses a stack structure. The program interacts with the stack using the CALL and RET instructions. Each stack element is 64 bits wide.

There is no explicit limit on the stack size, but the maximum theoretical size of the stack is 16 MiB. Most programs will use around 4 KiB of stack.
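
A minimal model of the stack might look like the sketch below. The `CallStack` class is hypothetical, and the behaviour of RET on an empty stack is not described here, so falling through to the next instruction is purely an assumption.

```python
MASK64 = (1 << 64) - 1


class CallStack:
    def __init__(self):
        self.items = []  # each element models one 64-bit stack slot

    def call(self, return_address, target):
        """Push the return address and continue at `target`."""
        self.items.append(return_address & MASK64)
        return target

    def ret(self, fallthrough):
        """Pop a return address, or fall through if the stack is empty."""
        if self.items:
            return self.items.pop()
        return fallthrough  # assumption: RET on an empty stack does not jump

    def size_bytes(self):
        return 8 * len(self.items)


stack = CallStack()
print(stack.call(return_address=16, target=400))  # 400
print(stack.size_bytes())                         # 8
print(stack.ret(fallthrough=32))                  # 16
print(stack.ret(fallthrough=32))                  # 32
```

With 8-byte elements, the typical 4 KiB of stack corresponds to 512 stored elements.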

Register file

The VM has 8 integer registers r0-r7 and 8 floating-point registers f0-f7. The integer registers are 64 bits wide. The floating-point registers are 128 bits wide and each stores two packed double-precision numbers.

The number of registers is low enough so that they can be stored in actual hardware registers on most CPUs.

ALU

The arithmetic logic unit (ALU) performs integer operations. The ALU can perform binary integer operations from 7 groups (addition, subtraction, multiplication, division, bitwise operations, shift, rotation) with operand sizes of 64 or 32 bits.
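
The difference between the two operand sizes shows up in how results wrap around. The helpers below are an illustrative sketch, not the VM's instruction set: the function names are made up, and reducing the rotation count modulo the operand size is an assumption.

```python
def add(a, b, bits=64):
    """Wrapping addition for the given operand size."""
    return (a + b) & ((1 << bits) - 1)


def mul(a, b, bits=64):
    """Wrapping multiplication; only the low `bits` bits are kept."""
    return (a * b) & ((1 << bits) - 1)


def ror(a, count, bits=64):
    """Rotate right; the count is reduced modulo the operand size (assumed)."""
    mask = (1 << bits) - 1
    a &= mask
    count %= bits
    return ((a >> count) | (a << (bits - count))) & mask


print(hex(add(0xFFFFFFFF, 1)))           # 0x100000000
print(hex(add(0xFFFFFFFF, 1, bits=32)))  # 0x0
print(hex(ror(1, 1)))                    # 0x8000000000000000
print(hex(ror(1, 1, bits=32)))           # 0x80000000
```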

FPU

The floating-point unit (FPU) performs IEEE-754 compliant math using 64-bit double-precision floating-point numbers. Five basic operations are available: addition, subtraction, multiplication, division and square root. All operations work with two packed double-precision numbers.
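
Packed operation means that each instruction acts on both halves of a register independently. A rough sketch using Python floats, which are IEEE-754 doubles on common platforms, follows. The `packed` and `packed_sqrt` helpers are hypothetical, and rounding modes and exceptional cases are not modelled: Python raises an exception on division by zero or the square root of a negative number instead of returning infinity or NaN.

```python
import math
import operator


def packed(op, a, b):
    """Apply a binary operation to both lanes of two packed registers."""
    return (op(a[0], b[0]), op(a[1], b[1]))


def packed_sqrt(a):
    """Square root of both lanes of a packed register."""
    return (math.sqrt(a[0]), math.sqrt(a[1]))


x = (1.5, -2.0)
y = (0.25, 4.0)

print(packed(operator.add, x, y))      # (1.75, 2.0)
print(packed(operator.sub, x, y))      # (1.25, -6.0)
print(packed(operator.mul, x, y))      # (0.375, -8.0)
print(packed(operator.truediv, x, y))  # (6.0, -0.5)
print(packed_sqrt((4.0, 2.25)))        # (2.0, 1.5)
```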

Binary encoding

The VM stores and loads all data in little-endian byte order. Signed integer numbers are represented using two's complement.
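
For example, a 64-bit value can be converted to and from this representation as shown below. The helper names are illustrative.

```python
def encode_i64(value):
    """Encode a signed 64-bit integer as 8 little-endian bytes."""
    return value.to_bytes(8, byteorder="little", signed=True)


def decode_i64(data):
    """Decode 8 little-endian bytes as a signed 64-bit integer."""
    return int.from_bytes(data, byteorder="little", signed=True)


print(encode_i64(0x0102030405060708).hex())  # 0807060504030201
print(encode_i64(-2).hex())                  # feffffffffffffff
print(decode_i64(bytes.fromhex("feffffffffffffff")))  # -2
```

The least significant byte comes first, and -2 is stored as the two's complement bit pattern 0xFFFFFFFFFFFFFFFE.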