diff --git a/README.md b/README.md
index fc18ed8..fed319c 100644
--- a/README.md
+++ b/README.md
@@ -1,111 +1,87 @@
 # RandomX
-RandomX is an experimental proof of work (PoW) algorithm that uses random code execution.
+RandomX is a proof-of-work (PoW) algorithm that is optimized for general-purpose CPUs. RandomX uses random code execution (hence the name) together with several memory-hard techniques to achieve the following goals:
 
-### Key features
+* Prevent the development of a single-chip [ASIC](https://en.wikipedia.org/wiki/Application-specific_integrated_circuit)
+* Minimize the efficiency advantage of specialized hardware compared to a general-purpose CPU
 
-* Memory-hard (requires  >4 GiB of memory)
-* CPU-friendly (especially for x86 and ARM architectures)
-* arguably ASIC-resistant
-* inefficient on GPUs
-* unusable for web-mining
+## Design
 
-## Virtual machine
+The core of RandomX is a virtual machine (VM), which can be summarized by the following schematic:
 
-RandomX is intended to be run efficiently on a general-purpose CPU. The virtual machine (VM) which runs RandomX code attempts to simulate a generic CPU using the following set of components:
+![Imgur](https://i.imgur.com/8RYNWLk.png)
 
-![Imgur](https://i.imgur.com/ZAfbX9m.png)
+Notable parts of the RandomX VM are:
 
-Full description: [vm.md](doc/vm.md).
+* a large read-only 4 GiB dataset
+* a 2 MiB scratchpad (read/write), which is structured into three levels L1, L2 and L3
+* 8 integer and 12 floating point registers
+* an arithmetic logic unit (ALU)
+* a floating point unit (FPU)
+* a 2 KiB program buffer
 
-## Dataset
+The structure of the VM mimics the components that are found in a typical general-purpose computer equipped with a CPU and a large amount of DRAM. The scratchpad is designed to fit into the CPU cache. The first 16 KiB and 256 KiB of the scratchpad are used more often to take advantage of the faster L1 and L2 caches. The ratio of random reads from L1/L2/L3 is approximately 9:3:1, which matches the inverse latencies of typical CPU caches.
 
-RandomX uses a 4 GiB read-only dataset. The dataset is constructed using a combination of the [Argon2d](https://en.wikipedia.org/wiki/Argon2) hashing function, [AES](https://en.wikipedia.org/wiki/Advanced_Encryption_Standard) encryption/decryption and a random permutation. The dataset is regenerated every ~34 hours.
+The VM executes programs in a special instruction set, which was designed in such a way that any random 8-byte word is a valid instruction and any sequence of valid instructions is a valid program. For more details see the [RandomX ISA documentation](doc/isa.md). Because there are no "syntax" rules, generating a random program is as easy as filling the program buffer with random data. A RandomX program consists of 256 instructions. See [program.inc](src/program.inc) as an example of a RandomX program translated into x86-64 assembly.
 
-Full description: [dataset.md](doc/dataset.md).
+### Hash calculation
 
-## Instruction set
+Calculating a RandomX hash consists of initializing the 2 MiB scratchpad with random data, executing 8 RandomX loops and calculating a hash of the scratchpad.
 
-RandomX uses a simple low-level language (instruction set), which was designed so that any random bitstring forms a valid program. Each RandomX instruction has a length of 128 bits.
+Each RandomX loop is repeated 2048 times. The loop body has 4 parts:
+1. The values of all registers are loaded randomly from the scratchpad (L3)
+2. The RandomX program is executed
+3. A random block is loaded from the dataset and mixed with integer registers
+4. All register values are stored into the scratchpad (L3)
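+
+As a rough sanity check of the numbers above, the sketch below counts the work done per hash. The constant names are illustrative and are not taken from the RandomX sources; the values are the ones quoted in this README.
+
+```c++
+#include <cstdint>
+
+// Values quoted in this README; the names are hypothetical.
+constexpr uint64_t kLoopsPerHash = 8;              // RandomX loops (programs) per hash
+constexpr uint64_t kIterationsPerLoop = 2048;      // repetitions of the loop body
+constexpr uint64_t kInstructionsPerProgram = 256;  // instructions in one program
+constexpr uint64_t kInstructionSizeBytes = 8;      // every 8-byte word is an instruction
+
+// One dataset block is loaded per loop-body iteration (step 3).
+constexpr uint64_t kDatasetReadsPerHash = kLoopsPerHash * kIterationsPerLoop;
+
+// Every iteration runs the whole program once (step 2).
+constexpr uint64_t kInstructionsPerHash = kDatasetReadsPerHash * kInstructionsPerProgram;
+
+static_assert(kInstructionsPerProgram * kInstructionSizeBytes == 2048, "fills the 2 KiB program buffer");
+static_assert(kDatasetReadsPerHash == 16384, "8 * 2048 dataset reads");
+static_assert(kInstructionsPerHash == 4194304, "16384 * 256 VM instructions");
+```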
 
-Full description: [isa.md](doc/isa.md).
+The hash of the register state after 2048 iterations is used to initialize the random program for the next loop. The use of 8 different programs in the course of a single hash calculation prevents mining strategies that search for "easy" programs.
 
-## Implementation
-Proof-of-concept implementation is written in C++.
-```
-> bin/randomx --help
-Usage: bin/randomx [OPTIONS]
-Supported options:
-        --help                  shows this message
-        --compiled              use x86-64 JIT-compiled VM (default: interpreted VM)
-        --lightClient           use 'light-client' mode (default: full dataset mode)
-        --softAes               use software AES (default: x86 AES-NI)
-        --threads T             use T threads (default: 1)
-        --nonces N              run N nonces (default: 1000)
-        --genAsm                generate x86 asm code for nonce N
-```
+The loads from the dataset are fully prefetched, so they don't slow down the loop.
 
-Two RandomX virtual machines are implemented:
+RandomX uses the [Blake2b](https://en.wikipedia.org/wiki/BLAKE_%28hash_function%29#BLAKE2) cryptographic hash function. Special hashing functions `fillAes1Rx4` and `hashAes1Rx4` based on [AES](https://en.wikipedia.org/wiki/Advanced_Encryption_Standard) encryption are used to initialize and hash the scratchpad ([hashAes1Rx4.cpp](src/hashAes1Rx4.cpp)).
 
-### Interpreted VM
-The interpreted VM is the reference implementation, which aims for maximum portability.
+### Hash verification
 
-The VM has been tested for correctness on the following platforms:
-* Linux: x86-64, ARMv7 (32-bit), ARMv8 (64-bit)
-* Windows: x86, x86-64
-* MacOS: x86-64
+RandomX is a symmetric PoW algorithm, so the verifying party has to repeat the same steps as when a hash is calculated.
 
-The interpreted VM supports two modes: "full dataset" mode, which requires more than 4 GiB of virtual memory, and a "light-client" mode, which requires about 64 MiB of memory, but runs significantly slower because dataset blocks are created on the fly rather than simply fetched from memory.
+However, to allow hash verification on devices that cannot store the whole 4 GiB dataset, RandomX allows a time-memory tradeoff by using just 256 MiB of memory at the cost of 16 times more random memory accesses. See [Dataset initialization](doc/dataset.md) for more details.
 
-Software AES implementation is available for CPUs which don't support [AES-NI](https://en.wikipedia.org/wiki/AES_instruction_set).
+### Performance
+Preliminary mining performance with the x86-64 JIT-compiled VM:
 
-The following table lists the performance for Intel Core i5-3230M (Ivy Bridge) CPU using a single core on Windows 64-bit, compiled with Visual Studio 2017:
+|CPU|RAM|threads|hashrate [H/s]|comment|
+|-----|-----|----|----------|-----|
+|AMD Ryzen 1700|DDR4-2933|8|4100||
+|Intel i5-3230M|DDR3-1333|1|280|without large pages|
+|Intel i7-8550U|DDR4-2400|4|1200|limited by thermals|
+|Intel i5-2500K|DDR3-1333|3|1350||
 
-|mode|required memory|AES|initialization time [s]|performance [programs/s]|
-|------|----|-----|-------------------------|------------------|
-|light client|64 MiB|software|1.0|9.2|
-|light client|64 MiB|AES-NI|1.0|16|
-|full dataset|4 GiB|software|54|40|
-|full dataset|4 GiB|AES-NI|26|40|
+Hash verification is performed using the portable interpreter in "light-client mode" and takes 30-70 ms depending on RAM latency and CPU clock speed. Hash verification in "mining mode" takes 2-4 ms.
 
-### JIT-compiled VM
-A JIT compiler is available for x86-64 CPUs. This implementation shows the approximate performance that can be achieved using optimized mining software. The JIT compiler generates generic x86-64 code without any architecture-specific optimizations. Only "full dataset" mode is supported.
+### Documentation
+* [RandomX ISA](doc/isa.md)
+* [RandomX instruction listing](doc/isa-ops.md)
+* [Dataset initialization](doc/dataset.md)
 
-For optimal performance, an x86-64 CPU needs:
-* 32 KiB of L1 instruction cache per thread
-* 16 KiB of L1 data cache per thread
-* 240 KiB of L2 cache (exclusive) per thread
+# FAQ
 
-The following table lists the performance of AMD Ryzen 7 1700 (clock fixed at 3350 MHz, 1.05 Vcore, dual channel DDR4 2400 MHz) on Linux 64-bit (compiled with GCC 5.4.0).
+### Can RandomX run on a GPU?
 
-Power consumption was measured for the whole system using a wall socket wattmeter (±1W). Table lists difference over idle power consumption. [Prime95](https://en.wikipedia.org/wiki/Prime95#Use_for_stress_testing)  (small/in-place FFT) and [Cryptonight V2](https://github.com/monero-project/monero/pull/4218) power consumption are listed for comparison.
+We don't expect GPUs will ever be competitive in mining RandomX. The reference miner is CPU-only.
 
-||threads|initialization time [s]|performance [programs/s]|power [W]
-|-|------|----|-----|-------------------------|
-|RandomX (interpreted)|1|27|52|16|
-|RandomX (interpreted)|8|4.0|390|63|
-|RandomX (interpreted)|16|3.5|620|74|
-|RandomX (compiled)|1|27|407|17|
-|RandomX (compiled)|2|14|810|26|
-|RandomX (compiled)|4|7.3|1620|42|
-|RandomX (compiled)|6|5.1|2410|56|
-|RandomX (compiled)|8|4.0|3200|71|
-|RandomX (compiled)|12|4.0|3670|82|
-|RandomX (compiled)|16|3.5|4110|92|
-|Cryptonight v2|8|-|-|47|
-|Prime95|8|-|-|77|
-|Prime95|16|-|-|81|
+RandomX was designed to be efficient on CPUs. Designing an algorithm compatible with both CPUs and GPUs brings too many limitations and ultimately decreases ASIC resistance. CPUs have the advantage of not needing proprietary drivers and most CPU architectures support a large common subset of primitive operations.
 
-## Proof of work
+Additionally, targeting CPUs allows for more decentralized mining for several reasons:
 
-RandomX VM can be used for PoW using the following steps:
+* Every computer has a CPU and even laptops will be able to mine efficiently.
+* CPU mining is easier to set up - no driver compatibility issues, BIOS flashing etc.
+* CPU mining is more difficult to centralize because computers can usually have only one CPU except for expensive server parts.
 
-1. Initialize the VM using a 256-bit hash of any data.
-2. Execute the RandomX program.
-3. Calculate `blake2b(RegisterFile || t1ha2(Scratchpad))`*
+### Does RandomX facilitate botnets/malware mining or web mining?
+Quite the opposite. Efficient mining requires 4 GiB of memory, which is very difficult to hide in an infected computer and disqualifies many low-end machines. Web mining is nearly impossible due to the large memory requirement and the need for a rather lengthy initialization of the dataset.
 
-\* [blake2b](https://en.wikipedia.org/wiki/BLAKE_%28hash_function%29#BLAKE2) is a cryptographic hash function, [t1ha2](https://github.com/leo-yuriev/t1ha) is a fast hashing function.
+### Since RandomX uses floating point calculations, how can it give reproducible results on different platforms?
 
-The above steps can be chained multiple times to prevent mining strategies that search for programs with particular properties (for example, without division).
+RandomX uses only operations that are guaranteed to give correctly rounded results by the [IEEE 754](https://en.wikipedia.org/wiki/IEEE_754) standard: addition, subtraction, multiplication, division and square root. Special care is taken to avoid corner cases such as NaN values or denormals.
 
 ## Acknowledgements
 The following people have contributed to the design of RandomX:
@@ -114,13 +90,10 @@ The following people have contributed to the design of RandomX:
 
 RandomX uses some source code from the following 3rd party repositories:
 * Argon2d, Blake2b hashing functions: https://github.com/P-H-C/phc-winner-argon2
-* PCG32 random number generator: https://github.com/imneme/pcg-c-basic
 * Software AES implementation https://github.com/fireice-uk/xmr-stak
-* t1ha2 hashing function: https://github.com/leo-yuriev/t1ha
 
 ## Donations
-
 XMR:
 ```
-4B9nWtGhZfAWsTxWujPDGoWfVpJvADxkxJJTmMQp3zk98n8PdLkEKXA5g7FEUjB8JPPHdP959WDWMem3FPDTK2JUU1UbVHo
-```
+845xHUh5GvfHwc2R8DVJCE7BT2sd4YEcmjG8GNSdmeNsP5DTEjXd1CNgxTcjHjiFuthRHAoVEJjM7GyKzQKLJtbd56xbh7V
+```
\ No newline at end of file
diff --git a/doc/dataset.md b/doc/dataset.md
index b3c0ee3..48d62ed 100644
--- a/doc/dataset.md
+++ b/doc/dataset.md
@@ -1,15 +1,14 @@
+# Dataset
 
-## Dataset
+The dataset is randomly accessed 16384 times during each hash calculation, which significantly increases memory-hardness of RandomX. The size of the dataset is fixed at 4 GiB and it's divided into 67108864 blocks of 64 bytes.
 
-The dataset serves as the source of the first operand of all instructions and provides the memory-hardness of RandomX. The size of the dataset is fixed at 4 GiB and it's divided into 65536 blocks, each 64 KiB in size.
+In order to allow PoW verification with less than 4 GiB of memory, the dataset is constructed from a 256 MiB cache, which can be used to calculate dataset blocks on the fly.
 
-In order to allow PoW verification with less than 4 GiB of memory, the dataset is constructed from a 64 MiB cache, which can be used to calculate dataset blocks on the fly. To facilitate this, all random reads from the dataset are aligned to the beginning of a block.
+Because the initialization of the dataset is computationally intensive, it is recalculated only every 1024 blocks (~34 hours). The following figure visualizes the construction of the dataset:
 
-Because the initialization of the dataset is computationally intensive, it's recalculated on average every 1024 blocks (~34 hours). The following figure visualizes the construction of the dataset:
+![Imgur](https://i.imgur.com/b9WHOwo.png)
 
-![Imgur](https://i.imgur.com/JgLCjeq.png)
-
-### Seed block
+## Seed block
 The whole dataset is constructed from a 256-bit hash of the last block whose height is divisible by 1024 **and** has at least 64 confirmations.
 
 |block|Seed block|
@@ -19,9 +18,9 @@ The whole dataset is constructed from a 256-bit hash of the last block whose hei
 |2113-3136|2048|
 |...|...
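+
+The table can be reproduced with a short helper. This is only a sketch: `seedBlockHeight` is a hypothetical name, and the formula is inferred from the rule and the table rows above, assuming that block `h` can only use a seed block at height `h - 65` or lower.
+
+```c++
+#include <cstdint>
+
+// Hypothetical helper: height of the seed block used for block `height`.
+// Only meaningful for height >= 65; earlier blocks are not covered by this sketch.
+constexpr uint64_t seedBlockHeight(uint64_t height) {
+  return ((height - 65) / 1024) * 1024;
+}
+
+static_assert(seedBlockHeight(2113) == 2048, "first block that uses seed block 2048");
+static_assert(seedBlockHeight(3136) == 2048, "last block that uses seed block 2048");
+static_assert(seedBlockHeight(3137) == 3072, "the next seed block takes over here");
+```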
 
-### Cache construction
+## Cache construction
 
-The 32-byte seed block hash is expanded into the 64 MiB cache using the "memory fill" function of Argon2d. [Argon2](https://github.com/P-H-C/phc-winner-argon2) is a memory-hard password hashing function, which is highly customizable. The variant with "d" suffix uses a data-dependent memory access pattern and provides the highest resistance against time-memory tradeoffs.
+The 32-byte seed block hash is expanded into the 256 MiB cache using the "memory fill" function of Argon2d. [Argon2](https://github.com/P-H-C/phc-winner-argon2) is a memory-hard password hashing function, which is highly customizable. The variant with "d" suffix uses a data-dependent memory access pattern and provides the highest resistance against time-memory tradeoffs.
 
 Argon2 is used with the following parameters:
 
@@ -29,8 +28,8 @@ Argon2 is used with the following parameters:
 |------------|--|
 |parallelism|1|
 |output size|0|
-|memory|65536 (64 MiB)|
-|iterations|12|
+|memory|262144 (256 MiB)|
+|iterations|3|
 |version|`0x13`|
 |hash type|0 (Argon2d)
 |password|seed block hash (32 bytes)
@@ -40,43 +39,66 @@ Argon2 is used with the following parameters:
 
 The finalizer and output calculation steps of Argon2 are omitted. The output is the filled memory array.
 
-The use of 12 iterations makes time-memory tradeoffs infeasible and thus 64 MiB is the minimum amount of memory required by RandomX.
+The use of 3 iterations makes time-memory tradeoffs infeasible and thus 256 MiB is the minimum amount of memory required by RandomX.
 
-When the memory fill is complete, the whole memory array is cyclically shifted backwards by 512 bytes (i.e. bytes 0-511 are moved to the end of the array). This is done to misalign the array so that each 1024-byte cache block spans two subsequent Argon2 blocks.
+## Dataset block generation
+The full 4 GiB dataset can be generated from the 256 MiB cache. Each 64-byte block is generated independently by XORing 16 pseudorandom cache blocks selected by the `SquareHash` function.
 
-### Dataset block generation
-The full 4 GiB dataset can be generated from the 64 MiB cache. Each block is generated separately: a 1024 byte block of the cache is expanded into 64 KiB of the dataset. The algorithm has 3 steps: expansion, AES and shuffle.
+### SquareHash
+`SquareHash` is a custom hash function with 64-bit input and 64-bit output. It is calculated by repeatedly squaring the input, splitting the 128-bit result into two 64-bit halves and subtracting the high half from the low half. This is repeated 42 times. It's available as a [portable C implementation](../src/squareHash.h) and an [x86-64 assembly version](../src/asm/squareHash.inc).
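+
+A minimal sketch of the procedure described above, written only from this description and not copied from `squareHash.h`, so the shipped implementation may differ in details. The names `squareRound` and `squareHashSketch` are hypothetical, and `unsigned __int128` is a GCC/Clang extension.
+
+```c++
+#include <cstdint>
+
+// One round: square x into a 128-bit product and return
+// (low half - high half) modulo 2^64.
+inline uint64_t squareRound(uint64_t x) {
+  unsigned __int128 product = (unsigned __int128)x * x;
+  uint64_t lo = (uint64_t)product;
+  uint64_t hi = (uint64_t)(product >> 64);
+  return lo - hi;  // unsigned subtraction wraps modulo 2^64
+}
+
+// 42 chained rounds, as stated in the text.
+inline uint64_t squareHashSketch(uint64_t x) {
+  for (int i = 0; i < 42; ++i) {
+    x = squareRound(x);
+  }
+  return x;
+}
+```
+
+Each round needs the result of the previous one, so the 42 rounds form a single serial dependency chain.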
 
-#### Expansion
-The 1024 cache bytes are split into 128 quadwords and interleaved with 504-byte chunks of null bytes. The resulting sequence is: 8 cache bytes + 504 null bytes + 8 cache bytes + 504 null bytes etc. Total length of the expanded block is 65536 bytes.
+Properties of `SquareHash`:
 
-#### AES
-The 256-bit seed block hash is expanded into 10 AES round keys `k0`-`k9`. Let `i = 0...65535` be the index of the block that is being expanded. If `i` is an even number, this step uses AES *decryption* and if `i` is an odd number, it uses AES *encryption*.  Since both encryption and decryption scramble random data, no distinction is made between them in the text below.
+* It achieves a full [avalanche effect](https://en.wikipedia.org/wiki/Avalanche_effect).
+* Since the whole calculation is a long dependency chain, which uses only multiplication and subtraction, the performance gains by using custom hardware are very limited.
+* A single `SquareHash` calculation takes 40-80 ns, which is about the same time as DRAM access latency. ASIC devices using low-latency memory will be bottlenecked by `SquareHash`, while CPUs will finish the hash calculation in about the same time it takes to fetch data from RAM.
 
-The AES encryption is performed with 10 identical rounds using round keys `k0`-`k9`. Note that this is different from the typical AES procedure, which uses a different key schedule for decryption and a modified last round.
+The output of 16 chained SquareHash calculations is used to determine cache blocks that are XORed together to produce a dataset block:
 
-Before the AES encryption is applied, each 16-byte chunk is XORed with the ciphertext of the previous chunk. This is similar to the [AES-CBC](https://en.wikipedia.org/wiki/Block_cipher_mode_of_operation#Cipher_Block_Chaining_%28CBC%29) mode of operation and forces the encryption to be sequential. For XORing the initial block, an initialization vector is formed by zero-extending `i` to 128 bits.
+```c++
+void initBlock(const uint8_t* cache, uint8_t* out, uint32_t blockNumber) {
+  uint64_t r0, r1, r2, r3, r4, r5, r6, r7;
 
-#### Shuffle
-When the AES step is complete, the last 16-byte chunk of the block is used to initialize a PCG32 random number generator. Bits 0-63 are used as the initial state and bits 64-127 are used as the increment. The least-significant bit of the increment is always set to 1 to form an odd number.
+  r0 = 4ULL * blockNumber;
+  r1 = r2 = r3 = r4 = r5 = r6 = r7 = 0;
 
-The whole block is then divided into 16384 doublewords (4 bytes) and the [Fisher–Yates shuffle](https://en.wikipedia.org/wiki/Fisher%E2%80%93Yates_shuffle) algorithm is applied to it. The algorithm generates a random in-place permutation of the 16384 doublewords. The result of the shuffle is the `i`-th block of the dataset.
+  constexpr uint32_t mask = (CacheSize - 1) & CacheLineAlignMask;
 
-The shuffle algorithm requires a uniform distribution of random numbers. The output of the PCG32 generator is always properly filtered to avoid the [modulo bias](https://en.wikipedia.org/wiki/Fisher%E2%80%93Yates_shuffle#Modulo_bias).
+  for (auto i = 0; i < DatasetIterations; ++i) {
+    const uint8_t* mixBlock = cache + (r0 & mask);
+    PREFETCHNTA(mixBlock);
+    r0 = squareHash(r0);
+    r0 ^= load64(mixBlock + 0);
+    r1 ^= load64(mixBlock + 8);
+    r2 ^= load64(mixBlock + 16);
+    r3 ^= load64(mixBlock + 24);
+    r4 ^= load64(mixBlock + 32);
+    r5 ^= load64(mixBlock + 40);
+    r6 ^= load64(mixBlock + 48);
+    r7 ^= load64(mixBlock + 56);
+  }
 
-### Performance
-The initial 64-MiB cache construction using Argon2d takes around 1 second using an older laptop with an Intel i5-3230M CPU (Ivy Bridge). Cache generation is strictly serial and cannot be easily parallelized.
+  store64(out + 0, r0);
+  store64(out + 8, r1);
+  store64(out + 16, r2);
+  store64(out + 24, r3);
+  store64(out + 32, r4);
+  store64(out + 40, r5);
+  store64(out + 48, r6);
+  store64(out + 56, r7);
+}
+```
 
-Dataset generation performance depends on the support of the AES-NI instruction set. The following table lists the generation runtimes using the same Ivy Bridge laptop with a single thread:
+*Note: `SquareHash` doesn't calculate squaring modulo 2<sup>64</sup>+1 because the subtraction is performed modulo 2<sup>64</sup>. Squaring modulo 2<sup>64</sup>+1 can be calculated by adding the carry bit in every iteration (i.e. the sequence in x86-64 assembly would have to be: `mul rax; sub rax, rdx; adc rax, 0`), but this would decrease ASIC-resistance of `SquareHash`.*
 
-|AES|4 GiB dataset generation|single block generation|
-|-----|-----------------------------|----------------|
-|hardware (AES-NI)|25 s|380 µs|
-|software|53 s|810 µs|
+## Performance
+The initial 256-MiB cache construction using Argon2d takes around 1 second using an older laptop with an Intel i5-3230M CPU (Ivy Bridge). Cache generation is strictly serial and cannot be parallelized.
 
-While the generation of a single block is strictly serial, multiple blocks can be easily generated in parallel, so the dataset generation time decreases linearly with the number of threads. Using a recent 6-core CPU with AES-NI support, the whole dataset can be generated in about 4 seconds.
+On the same laptop, full dataset initialization takes around 100 seconds using a single thread (1.5 µs per block).
 
-Moreover, the seed block hash is known up to 64 blocks in advance, so miners can slowly precalculate the whole dataset by generating ~512 dataset blocks per minute (corresponds to less than 1% utilization of a single CPU core).
+While the generation of a single block is strictly serial, multiple blocks can be easily generated in parallel, so the dataset generation time decreases linearly with the number of threads. Using an 8-core AMD Ryzen CPU, the whole dataset can be generated in under 10 seconds.
 
-### Light clients
-Light clients, who cannot or do not want to generate and keep the whole dataset in memory, can generate just the cache and then generate blocks on the fly as the program is being executed. In this case, the program execution time will be increased by roughly 100 times the single block generation time. For the Intel Ivy Bridge laptop, this amounts to around 40 milliseconds per program.
\ No newline at end of file
+Moreover, the seed block hash is known up to 64 blocks in advance, so miners can slowly precalculate the whole dataset by generating 524288 dataset blocks per minute (corresponds to about 1% utilization of a single CPU core).
+
+## Light clients
+Light clients, who cannot or do not want to generate and keep the whole dataset in memory, can generate just the cache and then generate blocks on the fly during hash calculation. In this case, the hash calculation time will be increased by 16384 times the single block generation time. For the Intel Ivy Bridge laptop, this amounts to around 24.5 milliseconds per hash.
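+
+The figures in this section follow from numbers quoted elsewhere in this document. A small sketch of the arithmetic, with illustrative constant names:
+
+```c++
+#include <cstdint>
+
+constexpr uint64_t kDatasetBytes = 4ULL << 30;    // 4 GiB dataset
+constexpr uint64_t kBlockBytes = 64;              // size of one dataset block
+constexpr uint64_t kDatasetReadsPerHash = 16384;  // dataset accesses per hash
+constexpr uint64_t kBlockTimeNs = 1500;           // ~1.5 µs per block on the i5-3230M
+
+constexpr uint64_t kDatasetBlocks = kDatasetBytes / kBlockBytes;
+
+// Single-threaded time to generate the full dataset.
+constexpr uint64_t kFullInitNs = kDatasetBlocks * kBlockTimeNs;
+
+// Extra time a light client spends generating blocks on the fly for one hash.
+constexpr uint64_t kLightExtraNs = kDatasetReadsPerHash * kBlockTimeNs;
+
+static_assert(kDatasetBlocks == 67108864, "4 GiB / 64 B");
+static_assert(kFullInitNs == 100663296000ULL, "about 100 seconds");
+static_assert(kLightExtraNs == 24576000, "about 24.5 ms per hash");
+```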
\ No newline at end of file
diff --git a/doc/isa-ops.md b/doc/isa-ops.md
new file mode 100644
index 0000000..1ab9591
--- /dev/null
+++ b/doc/isa-ops.md
@@ -0,0 +1,103 @@
+# RandomX instruction listing
+
+## Integer instructions
+For integer instructions, the destination is always an integer register (register group R). Source operand (if applicable) can be either an integer register or memory value. If `dst` and `src` refer to the same register, most instructions use `imm32` as the source operand instead of the register. This is indicated in the 'src == dst' column.
+
+Memory operands are loaded as 8-byte values from the address indicated by `src`.  This indirect addressing is marked with square brackets: `[src]`.
+
+|frequency|instruction|dst|src|`src == dst ?`|operation|
+|-|-|-|-|-|-|
+|12/256|IADD_R|R|R|`src = imm32`|`dst = dst + src`|
+|7/256|IADD_M|R|mem|`src = imm32`|`dst = dst + [src]`|
+|16/256|IADD_RC|R|R|`src = dst`|`dst = dst + src + imm32`|
+|12/256|ISUB_R|R|R|`src = imm32`|`dst = dst - src`|
+|7/256|ISUB_M|R|mem|`src = imm32`|`dst = dst - [src]`|
+|9/256|IMUL_9C|R|-|-|`dst = 9 * dst + imm32`|
+|16/256|IMUL_R|R|R|`src = imm32`|`dst = dst * src`|
+|4/256|IMUL_M|R|mem|`src = imm32`|`dst = dst * [src]`|
+|4/256|IMULH_R|R|R|`src = dst`|`dst = (dst * src) >> 64`|
+|1/256|IMULH_M|R|mem|`src = imm32`|`dst = (dst * [src]) >> 64`|
+|4/256|ISMULH_R|R|R|`src = dst`|`dst = (dst * src) >> 64` (signed)|
+|1/256|ISMULH_M|R|mem|`src = imm32`|`dst = (dst * [src]) >> 64` (signed)|
+|4/256|IDIV_C|R|-|-|`dst = dst + dst / imm32`|
+|4/256|ISDIV_C|R|-|-|`dst = dst + dst / imm32` (signed)|
+|2/256|INEG_R|R|-|-|`dst = -dst`|
+|16/256|IXOR_R|R|R|`src = imm32`|`dst = dst ^ src`|
+|4/256|IXOR_M|R|mem|`src = imm32`|`dst = dst ^ [src]`|
+|10/256|IROR_R|R|R|`src = imm32`|`dst = dst >>> src`|
+|4/256|ISWAP_R|R|R|`src = dst`|`temp = src; src = dst; dst = temp`|
+
+#### IMULH and ISMULH
+These instructions output the high 64 bits of the whole 128-bit multiplication result. The result differs for signed and unsigned multiplication (`IMULH` is unsigned, `ISMULH` is signed). The variants with a register source operand do not use `imm32` (they perform a squaring operation if `dst` equals `src`).
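+
+As an illustration, the high half of the product can be computed with a 128-bit intermediate. This is a sketch, not the reference code: the function names are hypothetical and `__int128` is a GCC/Clang extension.
+
+```c++
+#include <cstdint>
+
+// IMULH: high 64 bits of the unsigned 128-bit product.
+inline uint64_t mulhUnsigned(uint64_t a, uint64_t b) {
+  return (uint64_t)(((unsigned __int128)a * b) >> 64);
+}
+
+// ISMULH: high 64 bits of the signed 128-bit product.
+inline int64_t mulhSigned(int64_t a, int64_t b) {
+  return (int64_t)(((__int128)a * b) >> 64);
+}
+```
+
+For example, if both operands hold the bit pattern `0xFFFFFFFFFFFFFFFF`, the unsigned result is `0xFFFFFFFFFFFFFFFE`, while the signed result is 0 because both operands are interpreted as -1.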
+
+#### IDIV_C and ISDIV_C
+The division instructions use a constant divisor, so they can be optimized into a [multiplication by fixed-point reciprocal](https://en.wikipedia.org/wiki/Division_algorithm#Division_by_a_constant). `IDIV_C` performs unsigned division (`imm32` is zero-extended to 64 bits), while `ISDIV_C` performs signed division. In the case of division by zero, the instructions become a no-op. In the very rare case of signed overflow, the destination register is set to zero.
+
+#### ISWAP_R
+This instruction swaps the values of two registers. If source and destination refer to the same register, the result is a no-op.
+
+## Floating point instructions
+For floating point instructions, the destination can be a group F or group E register. Source operand is either a group A register or a memory value.
+
+Memory operands are loaded as 8-byte values from the address indicated by `src`. The 8 byte value is interpreted as two 32-bit signed integers and implicitly converted to floating point format. The lower and upper memory operands are marked as `[src][0]` and `[src][1]`.
+
+|frequency|instruction|dst|src|operation|
+|-|-|-|-|-|
+|8/256|FSWAP_R|F+E|-|`(dst0, dst1) = (dst1, dst0)`|
+|20/256|FADD_R|F|A|`(dst0, dst1) = (dst0 + src0, dst1 + src1)`|
+|5/256|FADD_M|F|mem|`(dst0, dst1) = (dst0 + [src][0], dst1 + [src][1])`|
+|20/256|FSUB_R|F|A|`(dst0, dst1) = (dst0 - src0, dst1 - src1)`|
+|5/256|FSUB_M|F|mem|`(dst0, dst1) = (dst0 - [src][0], dst1 - [src][1])`|
+|6/256|FNEG_R|F|-|`(dst0, dst1) = (-dst0, -dst1)`|
+|20/256|FMUL_R|E|A|`(dst0, dst1) = (dst0 * src0, dst1 * src1)`|
+|4/256|FDIV_M|E|mem|`(dst0, dst1) = (dst0 / [src][0], dst1 / [src][1])`|
+|6/256|FSQRT_R|E|-|`(dst0, dst1) = (√dst0, √dst1)`|
+
+#### Denormal and NaN values
+Due to restrictions on the values of the floating point registers, no operation results in `NaN`.
+`FDIV_M` can produce a denormal result. In that case, the result is set to `DBL_MIN = 2.22507385850720138309e-308`, which is the smallest positive normal number.
+
+#### Rounding
+All floating point instructions give correctly rounded results. The rounding mode depends on the value of the `fprc` register:
+
+|`fprc`|rounding mode|
+|-------|------------|
+|0|roundTiesToEven|
+|1|roundTowardNegative|
+|2|roundTowardPositive|
+|3|roundTowardZero|
+
+The rounding modes are defined by the IEEE 754 standard.
+
+## Other instructions
+There are 4 special instructions that have more than one source operand or whose destination operand is a memory value.
+
+|frequency|instruction|dst|src|operation|
+|-|-|-|-|-|
+|7/256|COND_R|R|R|`if(condition(src, imm32)) dst = dst + 1`
+|1/256|COND_M|R|mem|`if(condition([src], imm32)) dst = dst + 1`
+|1/256|CFROUND|`fprc`|R|`fprc = src >>> imm32`
+|16/256|ISTORE|mem|R|`[dst] = src`
+
+#### COND
+
+These instructions conditionally increment the destination register. The condition function depends on the `mod.cond` flag and takes the lower 32 bits of the source operand and the value `imm32`.
+
+|`mod.cond`|signed|`condition`|probability|*x86*|*ARM*
+|---|---|----------|-----|--|----|
+|0|no|`src <= imm32`|0% - 100%|`JBE`|`BLS`
+|1|no|`src > imm32`|0% - 100%|`JA`|`BHI`
+|2|yes|`src - imm32 < 0`|50%|`JS`|`BMI`
+|3|yes|`src - imm32 >= 0`|50%|`JNS`|`BPL`
+|4|yes|`src - imm32` overflows|0% - 50%|`JO`|`BVS`
+|5|yes|`src - imm32` doesn't overflow|50% - 100%|`JNO`|`BVC`
+|6|yes|`src < imm32`|0% - 100%|`JL`|`BLT`
+|7|yes|`src >= imm32`|0% - 100%|`JGE`|`BGE`
+
+The 'signed' column specifies if the operands are interpreted as signed or unsigned 32-bit numbers. Column 'probability' lists the expected probability the condition is true (range means that the actual value for a specific instruction depends on `imm32`). *Columns 'x86' and 'ARM' list the corresponding hardware instructions (following a `CMP` instruction).*
+
+#### CFROUND
+This instruction sets the value of the `fprc` register to the 2 least significant bits of the source register rotated right by `imm32`. This changes the rounding mode of all subsequent floating point instructions.
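+
+A sketch of this behavior, assuming the rotation count is reduced modulo 64 (the text does not say how larger `imm32` values are handled). The helper names are hypothetical.
+
+```c++
+#include <cstdint>
+
+// Rotate a 64-bit value right by `count` bits; the count is reduced modulo 64.
+constexpr uint64_t rotr64(uint64_t value, uint32_t count) {
+  return (value >> (count & 63)) | (value << ((64 - (count & 63)) & 63));
+}
+
+// New value of the 2-bit `fprc` register (0-3, see the rounding table above).
+constexpr uint32_t cfround(uint64_t src, uint32_t imm32) {
+  return static_cast<uint32_t>(rotr64(src, imm32) & 3);
+}
+
+static_assert(cfround(0x6, 1) == 3, "0b110 rotated right by 1 ends in 0b11");
+static_assert(cfround(0x1, 1) == 0, "the low bit rotates into bit 63");
+static_assert(cfround(0x2, 0) == 2, "a zero count leaves the value unchanged");
+```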
+
+#### ISTORE
+The `ISTORE` instruction stores the value of the source integer register to the memory at the address specified by the destination register. The `src` and `dst` registers can be the same.
diff --git a/doc/isa.md b/doc/isa.md
index 0c0ab7b..83d4436 100644
--- a/doc/isa.md
+++ b/doc/isa.md
@@ -1,213 +1,91 @@
 
-## RandomX instruction set
-RandomX uses a simple low-level language (instruction set), which was designed so that any random bitstring forms a valid program.
+# RandomX instruction set architecture
+RandomX VM is a complex instruction set computer ([CISC](https://en.wikipedia.org/wiki/Complex_instruction_set_computer)). All data are loaded and stored in little-endian byte order. Signed integer numbers are represented using [two's complement](https://en.wikipedia.org/wiki/Two%27s_complement). Floating point numbers are represented using the [IEEE 754 double precision format](https://en.wikipedia.org/wiki/Double-precision_floating-point_format).
 
-Each RandomX instruction has a length of 128 bits. The encoding is following:
+## Registers
 
-![Imgur](https://i.imgur.com/mbndESz.png)
+RandomX has 8 integer registers `r0`-`r7` (group R) and a total of 12 floating point registers split into 3 groups: `a0`-`a3` (group A), `f0`-`f3` (group F) and `e0`-`e3` (group E). Integer registers are 64 bits wide, while floating point registers are 128 bits wide and contain a pair of floating point numbers. The lower and upper half of floating point registers are not separately addressable.
 
-*All flags are aligned to an 8-bit boundary for easier decoding.*
+*Table 1: Addressable register groups*
 
-#### Opcode
-There are 256 opcodes, which are distributed between 30 instructions based on their weight (how often they will occur in the program on average). Instructions are divided into 5 groups:
+|index|R|A|F|E|F+E|
+|--|--|--|--|--|--|
+|0|`r0`|`a0`|`f0`|`e0`|`f0`|
+|1|`r1`|`a1`|`f1`|`e1`|`f1`|
+|2|`r2`|`a2`|`f2`|`e2`|`f2`|
+|3|`r3`|`a3`|`f3`|`e3`|`f3`|
+|4|`r4`||||`e0`|
+|5|`r5`||||`e1`|
+|6|`r6`||||`e2`|
+|7|`r7`||||`e3`|
 
-|group|number of opcodes||comment|
-|---------|-----------------|----|------|
-|IA|115|44.9%|integer arithmetic operations
-|IS|21|8.2%|bitwise shift and rotate
-|FA|70|27.4%|floating point arithmetic operations
-|FS|8|3.1%|floating point single-input operations
-|CF|42|16.4%|control flow instructions (branches)
-||**256**|**100%**
+Besides the directly addressable registers above, there is a 2-bit `fprc` register for rounding control, which is an implicit destination register of the `CFROUND` instruction, and two architectural 32-bit registers `ma` and `mx`, which are not accessible to any instruction. 
 
-#### Operand A
-The first 64-bit operand is read from memory. The location is determined by the `loc(a)` flag:
+Integer registers `r0`-`r7` can be the source or the destination operands of integer instructions or may be used as address registers for loading the source operand from the memory (scratchpad).
 
-|loc(a)[2:0]|read A from|address size (W)
-|---------|-|-|
-|000|dataset|32 bits|
-|001|dataset|32 bits|
-|010|dataset|32 bits|
-|011|dataset|32 bits|
-|100|scratchpad|15 bits|
-|101|scratchpad|11 bits|
-|110|scratchpad|11 bits|
-|111|scratchpad|11 bits|
+Floating point registers `a0`-`a3` are read-only and may not be written to except at the moment a program is loaded into the VM. They can be the source operand of any floating point instruction. The value of these registers is restricted to the interval `[1, 4294967296)`.
 
-Flag `reg(a)` encodes an integer register `r0`-`r7`.  The read address is calculated as:
-```
-reg(a) = reg(a) XOR signExtend(addr(a))
-read_addr = reg(a)[W-1:0]
-```
-`W` is the address width from the above table. For reading from the scratchpad, `read_addr` is multiplied by 8 for 8-byte aligned access.
+Floating point registers `f0`-`f3` are the *additive* registers, which can be the destination of floating point addition and subtraction instructions. The absolute value of these registers will not exceed `1.0e+12`.
 
-#### Operand B
-The second operand is loaded either from a register or from an immediate value encoded within the instruction. The `reg(b)` flag encodes an integer register (instruction groups IA and IS) or a floating point register (instruction group FA). Instruction group FS doesn't use operand B.
+Floating point registers `e0`-`e3` are the *multiplicative* registers, which can be the destination of floating point multiplication, division and square root instructions. Their value is always positive.
 
-|loc(b)[2:0]|B (IA)|B (IS)|B (FA)|B (FS)
-|---------|-|-|-|-|
-|000|integer `reg(b)`|integer `reg(b)`|floating point `reg(b)`|-
-|001|integer `reg(b)`|integer `reg(b)`|floating point `reg(b)`|-
-|010|integer `reg(b)`|integer `reg(b)`|floating point `reg(b)`|-
-|011|integer `reg(b)`|integer `reg(b)`|floating point `reg(b)`|-
-|100|integer `reg(b)`|`imm8`|floating point `reg(b)`|-
-|101|integer `reg(b)`|`imm8`|floating point `reg(b)`|-
-|110|`imm32`|`imm8`|floating point `reg(b)`|-
-|111|`imm32`|`imm8`|floating point `reg(b)`|-
+## Instruction encoding
 
-`imm8` is an 8-bit immediate value, which is used for shift and rotate integer instructions (group IS). Only bits 0-5 are used.
+Each instruction word is 64 bits long and has the following format:
 
-`imm32` is a 32-bit immediate value which is used for integer instructions from group IA.
+![Imgur](https://i.imgur.com/FtkWRwe.png)
 
-Floating point instructions don't use immediate values.
+### opcode
+There are 256 opcodes, which are distributed among 32 distinct instructions. Each instruction can be encoded using multiple opcodes (the number of opcodes assigned to an instruction determines how frequently it occurs in a random program).
 
-#### Operand C
-The third operand is the location where the result is stored. It can be a register or a 64-bit scratchpad location, depending on the value of flag `loc(c)`.
+*Table 2: Instruction groups*
 
-|loc\(c\)[2:0]|address size (W)| C (IA, IS)|C (FA, FS)
-|---------|-|-|-|-|-|
-|000|15 bits|scratchpad|floating point `reg(c)`
-|001|11 bits|scratchpad|floating point `reg(c)`
-|010|11 bits|scratchpad|floating point `reg(c)`
-|011|11 bits|scratchpad|floating point `reg(c)`
-|100|15 bits|integer `reg(c)`|floating point `reg(c)`, scratchpad
-|101|11 bits|integer `reg(c)`|floating point `reg(c)`, scratchpad
-|110|11 bits|integer `reg(c)`|floating point `reg(c)`, scratchpad
-|111|11 bits|integer `reg(c)`|floating point `reg(c)`, scratchpad
+|group|# instructions|# opcodes|share of opcodes|
+|---------|-----------------|----|-|
+|integer |19|137|53.5%|
+|floating point |9|94|36.7%|
+|other |4|25|9.8%|
+||**32**|**256**|**100%**|
 
-Integer operations write either to the scratchpad or to a register. Floating point operations always write to a register and can also write to the scratchpad. In that case, bit 3 of the `loc(c)` flag determines if the low or high half of the register is written:
+Full description of all instructions: [isa-ops.md](isa-ops.md).
 
-|loc\(c\)[3]|write to scratchpad|
-|------------|-----------------------|
-|0|floating point `reg(c)[63:0]`
-|1|floating point `reg(c)[127:64]`
+### dst
+Destination register. Only bits 0-1 (register groups A, F, E) or 0-2 (groups R, F+E) are used to encode a register according to Table 1.
 
-The FPROUND instruction is an exception and always writes the low half of the register.
+### src
 
-For writing to the scratchpad, an integer register is always used to calculate the address:
-```
-write_addr = 8 * (addr(c) XOR reg(c)[31:0])[W-1:0]
-```
-*CPUs are typically designed for a 2:1 load:store ratio, so each VM instruction performs on average 1 memory read and 0.5 writes to memory.*
+The `src` flag encodes a source operand register according to Table 1 (only bits 0-1 or 0-2 are used).
 
-#### imm8
-An 8-bit immediate value that is used as the shift/rotate count by group IS instructions and as the jump offset of the CALL instruction.
+The immediate value `imm32` is used as the source operand when `dst` and `src` encode the same register.
 
-#### addr(a)
-A 32-bit address mask that is used to calculate the read address for the A operand. It's sign-extended to 64 bits.
+For register-memory instructions, the source operand determines the `address_base` value for calculating the memory address (see below).
 
-#### addr\(c\)
-A 32-bit address mask that is used to calculate the write address for the C operand. `addr(c)` is equal to `imm32`.
+### mod
 
-### ALU instructions
+The `mod` flag is encoded as:
 
-|weight|instruction|group|signed|A width|B width|C|C width|
-|-|-|-|-|-|-|-|-|
-|10|ADD_64|IA|no|64|64|`A + B`|64|
-|2|ADD_32|IA|no|32|32|`A + B`|32|
-|10|SUB_64|IA|no|64|64|`A - B`|64|
-|2|SUB_32|IA|no|32|32|`A - B`|32|
-|21|MUL_64|IA|no|64|64|`A * B`|64|
-|10|MULH_64|IA|no|64|64|`A * B`|64|
-|15|MUL_32|IA|no|32|32|`A * B`|64|
-|15|IMUL_32|IA|yes|32|32|`A * B`|64|
-|10|IMULH_64|IA|yes|64|64|`A * B`|64|
-|1|DIV_64|IA|no|64|32|`A / B`|32|
-|1|IDIV_64|IA|yes|64|32|`A / B`|32|
-|4|AND_64|IA|no|64|64|`A & B`|64|
-|2|AND_32|IA|no|32|32|`A & B`|32|
-|4|OR_64|IA|no|64|64|`A | B`|64|
-|2|OR_32|IA|no|32|32|`A | B`|32|
-|4|XOR_64|IA|no|64|64|`A ^ B`|64|
-|2|XOR_32|IA|no|32|32|`A ^ B`|32|
-|3|SHL_64|IS|no|64|6|`A << B`|64|
-|3|SHR_64|IS|no|64|6|`A >> B`|64|
-|3|SAR_64|IS|yes|64|6|`A >> B`|64|
-|6|ROL_64|IS|no|64|6|`A <<< B`|64|
-|6|ROR_64|IS|no|64|6|`A >>> B`|64|
+*Table 3: mod flag encoding*
 
-##### 32-bit operations
-Instructions ADD_32, SUB_32, AND_32, OR_32, XOR_32 only use the low-order 32 bits of the input operands. The result of these operations is 32 bits long and bits 32-63 of C are set to zero.
+|`mod` bits|description|
+|----|--------|
+|0-1|`mod.mem` flag|
+|2-4|`mod.cond` flag|
+|5-7|Reserved|
 
-##### Multiplication
-There are 5 different multiplication operations. MUL_64 and MULH_64 both take 64-bit unsigned operands, but MUL_64 produces the low 64 bits of the result and MULH_64 produces the high 64 bits. MUL_32 and IMUL_32 use only the low-order 32 bits of the operands and produce a 64-bit result. The signed variant interprets the arguments as signed integers. IMULH_64 takes two 64-bit signed operands and produces the high-order 64 bits of the result.
+The `mod.mem` flag determines the address mask when reading from or writing to memory:
 
-##### Division
-For the division instructions, the dividend is 64 bits long and the divisor 32 bits long. The IDIV_64 instruction interprets both operands as signed integers. In case of division by zero or signed overflow, the result is equal to the dividend `A`.
+*Table 4: memory address mask*
 
-*Division by zero can be handled without branching by a conditional move. Signed overflow happens only for the signed variant when the minimum negative value is divided by -1. This rare case must be handled in x86 (ARM produces the "correct" result).*
+|`mod.mem`|`address_mask`|(scratchpad level)|
+|---------|-|---|
+|0|262136|(L2)|
+|1-3|16376|(L1)|
 
-##### Shift and rotate
-The shift/rotate instructions use just the bottom 6 bits of the `B` operand (`imm8` is used as the immediate value). All treat `A` as unsigned except SAR_64, which performs an arithmetic right shift by copying the sign bit.
+Table 4 applies to all memory accesses except when the source operand is an immediate value. In that case, `address_mask` is equal to 2097144 (L3).
 
-### FPU instructions
+The address for reading/writing is calculated by applying a bitwise AND operation to `address_base` and `address_mask`.
 
-|weight|instruction|group|C|
-|-|-|-|-|
-|20|FPADD|FA|`A + B`|
-|20|FPSUB|FA|`A - B`|
-|22|FPMUL|FA|`A * B`|
-|8|FPDIV|FA|`A / B`|
-|6|FPSQRT|FS|`sqrt(abs(A))`|
-|2|FPROUND|FS|`convertSigned52(A)`|
+The `mod.cond` flag is used only by the `COND` instruction to select a condition to be tested.
 
-All floating point instructions apart FPROUND are vector instructions that operate on two packed double precision floating point values.
-
-#### Conversion of operand A
-Operand A is loaded from memory as a 64-bit value. All floating point instructions apart FPROUND interpret A as two packed 32-bit signed integers and convert them into two packed double precision floating point values.
-
-The FPROUND instruction has a scalar output and interprets A as a 64-bit signed integer. The 11 least-significant bits are cleared before conversion to a double precision format. This is done so the number fits exactly into the 52-bit mantissa without rounding. Output of FPROUND is always written into the lower half of the result register and only this lower half may be written into the scratchpad.
-
-#### Rounding
-FPU instructions conform to the IEEE-754 specification, so they must give correctly rounded results. Initial rounding mode is *roundTiesToEven*. Rounding mode can be changed by the `FPROUND` instruction. Denormal values must be flushed to zero.
-
-#### NaN
-If an operation produces NaN, the result is converted into positive zero. NaN results may never be written into registers or memory. Only division and multiplication must be checked for NaN results (`0.0 / 0.0` and `0.0 * Infinity` result in NaN).
-
-##### FPROUND
-The FPROUND instruction changes the rounding mode for all subsequent FPU operations depending on the two least-significant bits of A.
-
-|A[1:0]|rounding mode|
-|-------|------------|
-|00|roundTiesToEven|
-|01|roundTowardNegative|
-|10|roundTowardPositive|
-|11|roundTowardZero|
-
-The rounding modes are defined by the IEEE-754 standard.
-
-*The two-bit flag value exactly corresponds to bits 13-14 of the x86 `MXCSR` register and bits 23 and 22 (reversed) of the ARM `FPSCR` register.*
-
-### Control instructions
-The following 2 control instructions are supported:
-
-|weight|instruction|function|condition|
-|-|-|-|-|
-|20|CALL|near procedure call|(see condition table below)
-|22|RET|return from procedure|stack is not empty
-
-Both instructions are conditional. If the condition evaluates to `false`, CALL and RET behave as "arithmetic no-op" and simply copy operand A into destination C without jumping.
-
-##### CALL
-The CALL instruction uses a condition function, which takes the lower 32 bits of integer register `reg(b)` and the value `imm32` and evaluates a condition based on the `loc(b)` flag: 
-
-|loc(b)[2:0]|signed|jump condition|probability|*x86*|*ARM*
-|---|---|----------|-----|--|----|
-|000|no|`reg(b)[31:0] <= imm32`|0% - 100%|`JBE`|`BLS`
-|001|no|`reg(b)[31:0] > imm32`|0% - 100%|`JA`|`BHI`
-|010|yes|`reg(b)[31:0] - imm32 < 0`|50%|`JS`|`BMI`
-|011|yes|`reg(b)[31:0] - imm32 >= 0`|50%|`JNS`|`BPL`
-|100|yes|`reg(b)[31:0] - imm32` overflows|0% - 50%|`JO`|`BVS`
-|101|yes|`reg(b)[31:0] - imm32` doesn't overflow|50% - 100%|`JNO`|`BVC`
-|110|yes|`reg(b)[31:0] < imm32`|0% - 100%|`JL`|`BLT`
-|111|yes|`reg(b)[31:0] >= imm32`|0% - 100%|`JGE`|`BGE`
-
-The 'signed' column specifies if the operands are interpreted as signed or unsigned 32-bit numbers. Column 'probability' lists the expected jump probability (range means that the actual value for a specific instruction depends on `imm32`). *Columns 'x86' and 'ARM' list the corresponding hardware instructions (following a `CMP` instruction).*
-
-Taken CALL instruction pushes the values `A` and `pc` (program counter) onto the stack and then performs a forward jump relative to the value of `pc`. The forward offset is equal to `16 * (imm8[6:0] + 1)`. Maximum jump distance is therefore 128 instructions forward (this means that at least 4 correctly spaced CALL instructions are needed to form a loop in the program).
-
-##### RET
-The RET instruction is taken only if the stack is not empty. Taken RET instruction pops the return address `raddr` from the stack (it's the instruction following the previous CALL), then pops a return value `retval` from the stack and sets `C = A XOR retval`. Finally, the instruction jumps back to `raddr`.
-
-## Reference implementation
-A portable C++ implementation of all ALU and FPU instructions is available in [instructionsPortable.cpp](../src/instructionsPortable.cpp).
\ No newline at end of file
+### imm32
+A 32-bit immediate value that can be used as the source operand. The immediate value is sign-extended to 64 bits unless specified otherwise.
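Note on the README hunk above: the source address calculation it describes (register selection by `src`, the `mod.mem` mask, and the `imm32` fallback when `dst` and `src` name the same register) can be sketched roughly as follows. This is only an illustration under stated assumptions, not code from this patch: the names `Insn`, `sourceAddress`, `MaskL1`, `MaskL2` and `MaskL3` are hypothetical, and the struct is not meant to reflect the exact bit layout of the instruction word. The mask values are the ones listed in the README text.

```cpp
#include <cassert>
#include <cstdint>

// Address masks as listed in the README (each is the level size minus 8,
// so every masked address is 8-byte aligned).
constexpr uint32_t MaskL1 = 16376;    // 16 KiB scratchpad level
constexpr uint32_t MaskL2 = 262136;   // 256 KiB scratchpad level
constexpr uint32_t MaskL3 = 2097144;  // full 2 MiB scratchpad

// Hypothetical decoded instruction; field order here is an assumption.
struct Insn {
    uint8_t opcode;
    uint8_t dst;
    uint8_t src;
    uint8_t mod;
    uint32_t imm32;
};

// Scratchpad byte offset of the memory source operand of a
// register-memory instruction, given the integer register file r0-r7.
uint32_t sourceAddress(const Insn& in, const uint64_t (&r)[8]) {
    const unsigned dst = in.dst % 8;  // only bits 0-2 select a group R register
    const unsigned src = in.src % 8;
    uint32_t addr;
    if (src == dst) {
        // dst and src encode the same register: imm32 is the address base
        // and the L3 mask is used.
        addr = in.imm32 & MaskL3;
    } else {
        // mod.mem is bits 0-1 of mod: 0 selects L2, 1-3 select L1.
        const uint32_t mask = (in.mod % 4) ? MaskL1 : MaskL2;
        addr = static_cast<uint32_t>(r[src]) & mask;
    }
    assert((addr & 7) == 0);  // all masks clear the low three bits
    return addr;
}
```

With an all-ones address base, the function simply returns the selected mask, which makes the three cases easy to check by hand.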
diff --git a/makefile b/makefile
index 21584cb..77788dc 100644
--- a/makefile
+++ b/makefile
@@ -11,12 +11,12 @@ SRCDIR=src
 OBJDIR=obj
 LDFLAGS=-lpthread
 TOBJS=$(addprefix $(OBJDIR)/,instructionsPortable.o TestAluFpu.o)
-ROBJS=$(addprefix $(OBJDIR)/,argon2_core.o argon2_ref.o AssemblyGeneratorX86.o blake2b.o CompiledVirtualMachine.o dataset.o JitCompilerX86.o instructionsPortable.o Instruction.o InterpretedVirtualMachine.o main.o Program.o softAes.o VirtualMachine.o t1ha2.o Cache.o)
+ROBJS=$(addprefix $(OBJDIR)/,argon2_core.o argon2_ref.o AssemblyGeneratorX86.o blake2b.o CompiledVirtualMachine.o dataset.o JitCompilerX86.o instructionsPortable.o Instruction.o InterpretedVirtualMachine.o main.o Program.o softAes.o VirtualMachine.o Cache.o virtualMemory.o divideByConstantCodegen.o LightClientAsyncWorker.o hashAes1Rx4.o)
 ifeq ($(PLATFORM),x86_64)
-    ROBJS += $(OBJDIR)/JitCompilerX86-static.o
+    ROBJS += $(OBJDIR)/JitCompilerX86-static.o $(OBJDIR)/squareHash.o
 endif
 
-all: release test
+all: release
 
 release: CXXFLAGS += -march=native -O3 -flto
 release: CCFLAGS += -march=native -O3 -flto
@@ -27,6 +27,11 @@ debug: CCFLAGS += -g
 debug: LDFLAGS += -g
 debug: $(BINDIR)/randomx
 
+profile: CXXFLAGS += -pg
+profile: CCFLAGS += -pg
+profile: LDFLAGS += -pg
+profile: $(BINDIR)/randomx
+
 test: CXXFLAGS += -O0
 test: $(BINDIR)/AluFpuTest
 
@@ -36,7 +41,7 @@ $(BINDIR)/randomx: $(ROBJS) | $(BINDIR)
 $(BINDIR)/AluFpuTest: $(TOBJS) | $(BINDIR)
 	$(CXX) $(TOBJS) $(LDFLAGS) -o $@
   
-$(OBJDIR)/TestAluFpu.o: $(addprefix $(SRCDIR)/,TestAluFpu.cpp instructions.hpp Pcg32.hpp) | $(OBJDIR)
+$(OBJDIR)/TestAluFpu.o: $(addprefix $(SRCDIR)/,TestAluFpu.cpp instructions.hpp) | $(OBJDIR)
 	$(CXX) $(CXXFLAGS) -c $(SRCDIR)/TestAluFpu.cpp -o $@
   
 $(OBJDIR)/argon2_core.o: $(addprefix $(SRCDIR)/,argon2_core.c argon2_core.h blake2/blake2.h blake2/blake2-impl.h) | $(OBJDIR)
@@ -45,40 +50,52 @@ $(OBJDIR)/argon2_core.o: $(addprefix $(SRCDIR)/,argon2_core.c argon2_core.h blak
 $(OBJDIR)/argon2_ref.o: $(addprefix $(SRCDIR)/,argon2_ref.c argon2.h argon2_core.h blake2/blake2.h blake2/blake2-impl.h blake2/blamka-round-ref.h) | $(OBJDIR)
 	$(CC) $(CCFLAGS) -c $(SRCDIR)/argon2_ref.c -o $@
 
-$(OBJDIR)/AssemblyGeneratorX86.o: $(addprefix $(SRCDIR)/,AssemblyGeneratorX86.cpp AssemblyGeneratorX86.hpp Instruction.hpp Pcg32.hpp common.hpp instructions.hpp instructionWeights.hpp) | $(OBJDIR)
+$(OBJDIR)/AssemblyGeneratorX86.o: $(addprefix $(SRCDIR)/,AssemblyGeneratorX86.cpp AssemblyGeneratorX86.hpp Instruction.hpp common.hpp instructionWeights.hpp) | $(OBJDIR)
 	$(CXX) $(CXXFLAGS) -c $(SRCDIR)/AssemblyGeneratorX86.cpp -o $@
 
 $(OBJDIR)/blake2b.o: $(addprefix $(SRCDIR)/blake2/,blake2b.c blake2.h blake2-impl.h) | $(OBJDIR)
 	$(CC) $(CCFLAGS) -c $(SRCDIR)/blake2/blake2b.c -o $@
 
-$(OBJDIR)/CompiledVirtualMachine.o: $(addprefix $(SRCDIR)/,CompiledVirtualMachine.cpp CompiledVirtualMachine.hpp Pcg32.hpp common.hpp instructions.hpp) | $(OBJDIR)
+$(OBJDIR)/CompiledVirtualMachine.o: $(addprefix $(SRCDIR)/,CompiledVirtualMachine.cpp CompiledVirtualMachine.hpp common.hpp) | $(OBJDIR)
 	$(CXX) $(CXXFLAGS) -c $(SRCDIR)/CompiledVirtualMachine.cpp -o $@
   
-$(OBJDIR)/dataset.o: $(addprefix $(SRCDIR)/,dataset.cpp common.hpp Pcg32.hpp) | $(OBJDIR)
+$(OBJDIR)/dataset.o: $(addprefix $(SRCDIR)/,dataset.cpp common.hpp) | $(OBJDIR)
 	$(CXX) $(CXXFLAGS) -c $(SRCDIR)/dataset.cpp -o $@
 
+$(OBJDIR)/divideByConstantCodegen.o: $(addprefix $(SRCDIR)/,divideByConstantCodegen.c divideByConstantCodegen.h) | $(OBJDIR)
+	$(CC) $(CCFLAGS) -c $(SRCDIR)/divideByConstantCodegen.c -o $@
+
+$(OBJDIR)/hashAes1Rx4.o: $(addprefix $(SRCDIR)/,hashAes1Rx4.cpp softAes.h) | $(OBJDIR)
+	$(CXX) $(CXXFLAGS) -c $(SRCDIR)/hashAes1Rx4.cpp -o $@
+
 $(OBJDIR)/JitCompilerX86.o: $(addprefix $(SRCDIR)/,JitCompilerX86.cpp JitCompilerX86.hpp Instruction.hpp instructionWeights.hpp) | $(OBJDIR)
 	$(CXX) $(CXXFLAGS) -c $(SRCDIR)/JitCompilerX86.cpp -o $@
 
-$(OBJDIR)/JitCompilerX86-static.o: $(addprefix $(SRCDIR)/,JitCompilerX86-static.S $(addprefix asm/program_, prologue_linux.inc prologue_load.inc epilogue_linux.inc epilogue_store.inc read_r.inc read_f.inc)) | $(OBJDIR)
+$(OBJDIR)/JitCompilerX86-static.o: $(addprefix $(SRCDIR)/,JitCompilerX86-static.S $(addprefix asm/program_, prologue_linux.inc prologue_load.inc epilogue_linux.inc epilogue_store.inc read_dataset.inc loop_load.inc loop_store.inc xmm_constants.inc)) | $(OBJDIR)
 	$(CXX) -x assembler-with-cpp -c $(SRCDIR)/JitCompilerX86-static.S -o $@
 
-$(OBJDIR)/instructionsPortable.o: $(addprefix $(SRCDIR)/,instructionsPortable.cpp instructions.hpp intrinPortable.h) | $(OBJDIR)
+$(OBJDIR)/squareHash.o: $(addprefix $(SRCDIR)/,squareHash.S $(addprefix asm/, squareHash.inc))  | $(OBJDIR)
+	$(CXX) -x assembler-with-cpp -c $(SRCDIR)/squareHash.S -o $@
+
+$(OBJDIR)/instructionsPortable.o: $(addprefix $(SRCDIR)/,instructionsPortable.cpp intrinPortable.h) | $(OBJDIR)
 	$(CXX) $(CXXFLAGS) -c $(SRCDIR)/instructionsPortable.cpp -o $@
 
 $(OBJDIR)/Instruction.o: $(addprefix $(SRCDIR)/,Instruction.cpp Instruction.hpp instructionWeights.hpp) | $(OBJDIR)
 	$(CXX) $(CXXFLAGS) -c $(SRCDIR)/Instruction.cpp -o $@
   
-$(OBJDIR)/InterpretedVirtualMachine.o: $(addprefix $(SRCDIR)/,InterpretedVirtualMachine.cpp InterpretedVirtualMachine.hpp Pcg32.hpp instructions.hpp instructionWeights.hpp) | $(OBJDIR)
+$(OBJDIR)/InterpretedVirtualMachine.o: $(addprefix $(SRCDIR)/,InterpretedVirtualMachine.cpp InterpretedVirtualMachine.hpp instructionWeights.hpp) | $(OBJDIR)
 	$(CXX) $(CXXFLAGS) -c $(SRCDIR)/InterpretedVirtualMachine.cpp -o $@
+
+$(OBJDIR)/LightClientAsyncWorker.o: $(addprefix $(SRCDIR)/,LightClientAsyncWorker.cpp LightClientAsyncWorker.hpp common.hpp) | $(OBJDIR)
+	$(CXX) $(CXXFLAGS) -c $(SRCDIR)/LightClientAsyncWorker.cpp -o $@
   
 $(OBJDIR)/main.o: $(addprefix $(SRCDIR)/,main.cpp InterpretedVirtualMachine.hpp Stopwatch.hpp blake2/blake2.h) | $(OBJDIR)
 	$(CXX) $(CXXFLAGS) -c $(SRCDIR)/main.cpp -o $@
   
-$(OBJDIR)/Program.o: $(addprefix $(SRCDIR)/,Program.cpp Program.hpp Pcg32.hpp) | $(OBJDIR)
+$(OBJDIR)/Program.o: $(addprefix $(SRCDIR)/,Program.cpp Program.hpp) | $(OBJDIR)
 	$(CXX) $(CXXFLAGS) -c $(SRCDIR)/Program.cpp -o $@
 
-$(OBJDIR)/Cache.o: $(addprefix $(SRCDIR)/,Cache.cpp Cache.hpp Pcg32.hpp argon2_core.h) | $(OBJDIR)
+$(OBJDIR)/Cache.o: $(addprefix $(SRCDIR)/,Cache.cpp Cache.hpp argon2_core.h) | $(OBJDIR)
 	$(CXX) $(CXXFLAGS) -c $(SRCDIR)/Cache.cpp -o $@
   
 $(OBJDIR)/softAes.o: $(addprefix $(SRCDIR)/,softAes.cpp softAes.h) | $(OBJDIR)
@@ -87,8 +104,8 @@ $(OBJDIR)/softAes.o: $(addprefix $(SRCDIR)/,softAes.cpp softAes.h) | $(OBJDIR)
 $(OBJDIR)/VirtualMachine.o: $(addprefix $(SRCDIR)/,VirtualMachine.cpp VirtualMachine.hpp common.hpp dataset.hpp) | $(OBJDIR)
 	$(CXX) $(CXXFLAGS) -c $(SRCDIR)/VirtualMachine.cpp -o $@
 
-$(OBJDIR)/t1ha2.o: $(addprefix $(SRCDIR)/t1ha/,t1ha2.c t1ha.h t1ha_bits.h) | $(OBJDIR)
-	$(CC) $(CCFLAGS) -c $(SRCDIR)/t1ha/t1ha2.c -o $@
+$(OBJDIR)/virtualMemory.o: $(addprefix $(SRCDIR)/,virtualMemory.cpp virtualMemory.hpp) | $(OBJDIR)
+	$(CXX) $(CXXFLAGS) -c $(SRCDIR)/virtualMemory.cpp -o $@
   
 $(OBJDIR):
 	mkdir $(OBJDIR)
diff --git a/src/AssemblyGeneratorX86.cpp b/src/AssemblyGeneratorX86.cpp
index bb0e106..bb50718 100644
--- a/src/AssemblyGeneratorX86.cpp
+++ b/src/AssemblyGeneratorX86.cpp
@@ -17,535 +17,528 @@ You should have received a copy of the GNU General Public License
 along with RandomX.  If not, see<http://www.gnu.org/licenses/>.
 */
 //#define TRACE
+#define MAGIC_DIVISION
 #include "AssemblyGeneratorX86.hpp"
-#include "Pcg32.hpp"
 #include "common.hpp"
-#include "instructions.hpp"
+#ifdef MAGIC_DIVISION
+#include "divideByConstantCodegen.h"
+#endif
+#include "Program.hpp"
 
 namespace RandomX {
 
 	static const char* regR[8] = { "r8", "r9", "r10", "r11", "r12", "r13", "r14", "r15" };
 	static const char* regR32[8] = { "r8d", "r9d", "r10d", "r11d", "r12d", "r13d", "r14d", "r15d" };
-	static const char* regF[8] = { "xmm8", "xmm9", "xmm2", "xmm3", "xmm4", "xmm5", "xmm6", "xmm7" };
+	static const char* regFE[8] = { "xmm0", "xmm1", "xmm2", "xmm3", "xmm4", "xmm5", "xmm6", "xmm7" };
+	static const char* regF[4] = { "xmm0", "xmm1", "xmm2", "xmm3" };
+	static const char* regE[4] = { "xmm4", "xmm5", "xmm6", "xmm7" };
+	static const char* regA[4] = { "xmm8", "xmm9", "xmm10", "xmm11" };
 
-	void AssemblyGeneratorX86::generateProgram(const void* seed) {
+	static const char* fsumInstr[4] = { "paddb", "paddw", "paddd", "paddq" };
+
+	static const char* regA4 = "xmm12";
+	static const char* dblMin = "xmm13";
+	static const char* absMask = "xmm14";
+	static const char* signMask = "xmm15";
+	static const char* regMx = "rbp";
+	static const char* regIc = "rbx";
+	static const char* regIc32 = "ebx";
+	static const char* regIc8 = "bl";
+	static const char* regDatasetAddr = "rdi";
+	static const char* regScratchpadAddr = "rsi";
+
+	void AssemblyGeneratorX86::generateProgram(Program& prog) {
 		asmCode.str(std::string()); //clear
-		Pcg32 gen(seed);
-		for (unsigned i = 0; i < sizeof(RegisterFile) / sizeof(Pcg32::result_type); ++i) {
-			gen();
-		}
-		Instruction instr;
 		for (unsigned i = 0; i < ProgramLength; ++i) {
-			for (unsigned j = 0; j < sizeof(instr) / sizeof(Pcg32::result_type); ++j) {
-				*(((uint32_t*)&instr) + j) = gen();
-			}
+			Instruction& instr = prog(i);
+			instr.src %= RegistersCount;
+			instr.dst %= RegistersCount;
 			generateCode(instr, i);
-			asmCode << std::endl;
+			//asmCode << std::endl;
 		}
-		if(ProgramLength > 0)
-			asmCode << "\tjmp rx_i_0" << std::endl;
 	}
 
 	void AssemblyGeneratorX86::generateCode(Instruction& instr, int i) {
-		asmCode << "rx_i_" << i << ": ;" << instr.getName() << std::endl;
-		asmCode << "\tdec edi" << std::endl;
-		asmCode << "\tjz rx_finish" << std::endl;
+		asmCode << "\t; " << instr;
 		auto generator = engine[instr.opcode];
 		(this->*generator)(instr, i);
 	}
 
-	void AssemblyGeneratorX86::genar(Instruction& instr) {
-		asmCode << "\txor " << regR[instr.rega % RegistersCount] << ", 0" << std::hex << instr.addra << "h" << std::dec << std::endl;
-		switch (instr.loca & 7)
-		{
-		case 0:
-		case 1:
-		case 2:
-		case 3:
-			asmCode << "\tmov ecx, " << regR32[instr.rega % RegistersCount] << std::endl;
-			asmCode << "\tcall rx_read_dataset_r" << std::endl;
-			return;
-
-		case 4:
-			asmCode << "\tmov eax, " << regR32[instr.rega % RegistersCount] << std::endl;
-			asmCode << "\tand eax, " << (ScratchpadL2 - 1) << std::endl;
-			asmCode << "\tmov rax, qword ptr [rsi + rax * 8]" << std::endl;
-			return;
-
-		default:
-			asmCode << "\tmov eax, " << regR32[instr.rega % RegistersCount] << std::endl;
-			asmCode << "\tand eax, " << (ScratchpadL1 - 1) << std::endl;
-			asmCode << "\tmov rax, qword ptr [rsi + rax * 8]" << std::endl;
-			return;
-		}
+	void AssemblyGeneratorX86::genAddressReg(Instruction& instr, const char* reg = "eax") {
+		asmCode << "\tmov " << reg << ", " << regR32[instr.src] << std::endl;
+		asmCode << "\tand " << reg << ", " << ((instr.mod % 4) ? ScratchpadL1Mask : ScratchpadL2Mask) << std::endl;
 	}
 
-
-	void AssemblyGeneratorX86::genaf(Instruction& instr) {
-		asmCode << "\txor " << regR[instr.rega % RegistersCount] << ", 0" << std::hex << instr.addra << "h" << std::dec << std::endl;
-		switch (instr.loca & 7)
-		{
-		case 0:
-		case 1:
-		case 2:
-		case 3:
-			asmCode << "\tmov ecx, " << regR32[instr.rega % RegistersCount] << std::endl;
-			asmCode << "\tcall rx_read_dataset_f" << std::endl;
-			return;
-
-		case 4:
-			asmCode << "\tmov eax, " << regR32[instr.rega % RegistersCount] << std::endl;
-			asmCode << "\tand eax, " << (ScratchpadL2 - 1) << std::endl;
-			asmCode << "\tcvtdq2pd xmm0, qword ptr [rsi + rax * 8]" << std::endl;
-			return;
-
-		default:
-			asmCode << "\tmov eax, " << regR32[instr.rega % RegistersCount] << std::endl;
-			asmCode << "\tand eax, " << (ScratchpadL1 - 1) << std::endl;
-			asmCode << "\tcvtdq2pd xmm0, qword ptr [rsi + rax * 8]" << std::endl;
-			return;
-		}
+	void AssemblyGeneratorX86::genAddressRegDst(Instruction& instr, int maskAlign = 8) {
+		asmCode << "\tmov eax" << ", " << regR32[instr.dst] << std::endl;
+		asmCode << "\tand eax" << ", " << ((instr.mod % 4) ? (ScratchpadL1Mask & (-maskAlign)) : (ScratchpadL2Mask & (-maskAlign))) << std::endl;
 	}
 
-	void AssemblyGeneratorX86::genbr0(Instruction& instr, const char* instrx86) {
-		switch (instr.locb & 7)
-		{
-		case 0:
-		case 1:
-		case 2:
-		case 3:
-			asmCode << "\tmov rcx, " << regR[instr.regb % RegistersCount] << std::endl;
-			asmCode << "\t" << instrx86 << " rax, cl" << std::endl;
-			return;
-		default:
-			asmCode << "\t" << instrx86 << " rax, " << (instr.imm8 & 63) << std::endl;;
-			return;
-		}
+	int32_t AssemblyGeneratorX86::genAddressImm(Instruction& instr) {
+		return (int32_t)instr.imm32 & ScratchpadL3Mask;
 	}
 
-	void AssemblyGeneratorX86::genbr1(Instruction& instr) {
-		switch (instr.locb & 7)
-		{
-		case 0:
-		case 1:
-		case 2:
-		case 3:
-		case 4:
-		case 5:
-			asmCode << regR[instr.regb % RegistersCount] << std::endl;
-			return;
-		default:
-			asmCode  << instr.imm32 << std::endl;;
-			return;
-		}
-	}
-
-	void AssemblyGeneratorX86::genbr132(Instruction& instr) {
-		switch (instr.locb & 7)
-		{
-		case 0:
-		case 1:
-		case 2:
-		case 3:
-		case 4:
-		case 5:
-			asmCode << regR32[instr.regb % RegistersCount] << std::endl;
-			return;
-		default:
-			asmCode << instr.imm32 << std::endl;;
-			return;
-		}
-	}
-
-	void AssemblyGeneratorX86::genbf(Instruction& instr, const char* instrx86) {
-		asmCode << "\t" << instrx86 << " xmm0, " << regF[instr.regb % RegistersCount] << std::endl;
-	}
-
-	void AssemblyGeneratorX86::gencr(Instruction& instr) {
-		switch (instr.locc & 7)
-		{
-		case 0:
-			asmCode << "\tmov rcx, rax" << std::endl;
-			asmCode << "\tmov eax, " << regR32[instr.regc % RegistersCount] << std::endl;
-			asmCode << "\txor eax, 0" << std::hex << instr.addrc << "h" << std::dec << std::endl;
-			asmCode << "\tand eax, " << (ScratchpadL2 - 1) << std::endl;
-			asmCode << "\tmov qword ptr [rsi + rax * 8], rcx" << std::endl;
-			if (trace) {
-				asmCode << "\tmov qword ptr [rsi + rdi * 8 + 262136], rcx" << std::endl;
-			}
-			return;
-
-		case 1:
-		case 2:
-		case 3:
-			asmCode << "\tmov rcx, rax" << std::endl;
-			asmCode << "\tmov eax, " << regR32[instr.regc % RegistersCount] << std::endl;
-			asmCode << "\txor eax, 0" << std::hex << instr.addrc << "h" << std::dec << std::endl;
-			asmCode << "\tand eax, " << (ScratchpadL1 - 1) << std::endl;
-			asmCode << "\tmov qword ptr [rsi + rax * 8], rcx" << std::endl;
-			if (trace) {
-				asmCode << "\tmov qword ptr [rsi + rdi * 8 + 262136], rcx" << std::endl;
-			}
-			return;
-
-		default:
-			asmCode << "\tmov " << regR[instr.regc % RegistersCount] << ", rax" << std::endl;
-			if (trace) {
-				asmCode << "\tmov qword ptr [rsi + rdi * 8 + 262136], rax" << std::endl;
-			}
-		}
-	}
-
-	void AssemblyGeneratorX86::gencf(Instruction& instr, bool alwaysLow = false) {
-		if(!alwaysLow)
-			asmCode << "\tmovaps " << regF[instr.regc % RegistersCount] << ", xmm0" << std::endl;
-		const char* store = (!alwaysLow && (instr.locc & 8)) ? "movhpd" : "movlpd";
-		switch (instr.locc & 7)
-		{
-			case 4:
-				asmCode << "\tmov eax, " << regR32[instr.regc % RegistersCount] << std::endl;
-				asmCode << "\txor eax, 0" << std::hex << instr.addrc << "h" << std::dec << std::endl;
-				asmCode << "\tand eax, " << (ScratchpadL2 - 1) << std::endl;
-				asmCode << "\t" << store << " qword ptr [rsi + rax * 8], " << regF[instr.regc % RegistersCount] << std::endl;
-				break;
-
-			case 5:
-			case 6:
-			case 7:
-				asmCode << "\tmov eax, " << regR32[instr.regc % RegistersCount] << std::endl;
-				asmCode << "\txor eax, 0" << std::hex << instr.addrc << "h" << std::dec << std::endl;
-				asmCode << "\tand eax, " << (ScratchpadL1 - 1) << std::endl;
-				asmCode << "\t" << store << " qword ptr [rsi + rax * 8], " << regF[instr.regc % RegistersCount] << std::endl;
-				break;
-		}
-		if (trace) {
-			asmCode << "\t" << store << " qword ptr [rsi + rdi * 8 + 262136], " << regF[instr.regc % RegistersCount] << std::endl;
-		}
-	}
-
-	void AssemblyGeneratorX86::h_ADD_64(Instruction& instr, int i) {
-		genar(instr);
-		asmCode << "\tadd rax, ";
-		genbr1(instr);
-		gencr(instr);
-	}
-
-	void AssemblyGeneratorX86::h_ADD_32(Instruction& instr, int i) {
-		genar(instr);
-		asmCode << "\tadd eax, ";
-		genbr132(instr);
-		gencr(instr);
-	}
-
-	void AssemblyGeneratorX86::h_SUB_64(Instruction& instr, int i) {
-		genar(instr);
-		asmCode << "\tsub rax, ";
-		genbr1(instr);
-		gencr(instr);
-	}
-
-	void AssemblyGeneratorX86::h_SUB_32(Instruction& instr, int i) {
-		genar(instr);
-		asmCode << "\tsub eax, ";
-		genbr132(instr);
-		gencr(instr);
-	}
-
-	void AssemblyGeneratorX86::h_MUL_64(Instruction& instr, int i) {
-		genar(instr);
-		asmCode << "\timul rax, ";
-		if ((instr.locb & 7) >= 6) {
-			asmCode << "rax, ";
-		}
-		genbr1(instr);
-		gencr(instr);
-	}
-
-	void AssemblyGeneratorX86::h_MULH_64(Instruction& instr, int i) {
-		genar(instr);
-		asmCode << "\tmov rcx, ";
-		genbr1(instr);
-		asmCode << "\tmul rcx" << std::endl;
-		asmCode << "\tmov rax, rdx" << std::endl;
-		gencr(instr);
-	}
-
-	void AssemblyGeneratorX86::h_MUL_32(Instruction& instr, int i) {
-		genar(instr);
-		asmCode << "\tmov ecx, eax" << std::endl;
-		asmCode << "\tmov eax, ";
-		genbr132(instr);
-		asmCode << "\timul rax, rcx" << std::endl;
-		gencr(instr);
-	}
-
-	void AssemblyGeneratorX86::h_IMUL_32(Instruction& instr, int i) {
-		genar(instr);
-		asmCode << "\tmovsxd rcx, eax" << std::endl;
-		if ((instr.locb & 7) >= 6) {
-			asmCode << "\tmov rax, " << instr.imm32 << std::endl;
+	//1 uOP
+	void AssemblyGeneratorX86::h_IADD_R(Instruction& instr, int i) {
+		if (instr.src != instr.dst) {
+			asmCode << "\tadd " << regR[instr.dst] << ", " << regR[instr.src] << std::endl;
 		}
 		else {
-			asmCode << "\tmovsxd rax, " << regR32[instr.regb % RegistersCount] << std::endl;
+			asmCode << "\tadd " << regR[instr.dst] << ", " << (int32_t)instr.imm32 << std::endl;
 		}
-		asmCode << "\timul rax, rcx" << std::endl;
-		gencr(instr);
 	}
 
-	void AssemblyGeneratorX86::h_IMULH_64(Instruction& instr, int i) {
-		genar(instr);
-		asmCode << "\tmov rcx, ";
-		genbr1(instr);
-		asmCode << "\timul rcx" << std::endl;
-		asmCode << "\tmov rax, rdx" << std::endl;
-		gencr(instr);
-	}
-
-	void AssemblyGeneratorX86::h_DIV_64(Instruction& instr, int i) {
-		genar(instr);
-		if ((instr.locb & 7) >= 6) {
-			if (instr.imm32 == 0) {
-				asmCode << "\tmov ecx, 1" << std::endl;
-			}
-			else {
-				asmCode << "\tmov ecx, " << instr.imm32 << std::endl;
-			}
+	//2.75 uOPs
+	void AssemblyGeneratorX86::h_IADD_M(Instruction& instr, int i) {
+		if (instr.src != instr.dst) {
+			genAddressReg(instr);
+			asmCode << "\tadd " << regR[instr.dst] << ", qword ptr [rsi+rax]" << std::endl;
 		}
 		else {
-			asmCode << "\tmov ecx, 1" << std::endl;
-			asmCode << "\tmov edx, " << regR32[instr.regb % RegistersCount] << std::endl;
-			asmCode << "\ttest edx, edx" << std::endl;
-			asmCode << "\tcmovne ecx, edx" << std::endl;
+			asmCode << "\tadd " << regR[instr.dst] << ", qword ptr [rsi+" << genAddressImm(instr) << "]" << std::endl;
 		}
-		asmCode << "\txor edx, edx" << std::endl;
-		asmCode << "\tdiv rcx" << std::endl;
-		gencr(instr);
 	}
 
-	void AssemblyGeneratorX86::h_IDIV_64(Instruction& instr, int i) {
-		genar(instr);
-		asmCode << "\tmov edx, ";
-		genbr132(instr);
-		asmCode << "\tcmp edx, -1" << std::endl;
-		asmCode << "\tjne short safe_idiv_" << i << std::endl;
-		asmCode << "\tmov rcx, rax" << std::endl;
-		asmCode << "\trol rcx, 1" << std::endl;
-		asmCode << "\tdec rcx" << std::endl;
-		asmCode << "\tjz short result_idiv_" << i << std::endl;
-		asmCode << "safe_idiv_" << i << ":" << std::endl;
-		asmCode << "\tmov ecx, 1" << std::endl;
-		asmCode << "\ttest edx, edx" << std::endl;
-		asmCode << "\tcmovne ecx, edx" << std::endl;
-		asmCode << "\tmovsxd rcx, ecx" << std::endl;
-		asmCode << "\tcqo" << std::endl;
-		asmCode << "\tidiv rcx" << std::endl;
-		asmCode << "result_idiv_" << i << ":" << std::endl;
-		gencr(instr);
+	//1 uOP
+	void AssemblyGeneratorX86::h_IADD_RC(Instruction& instr, int i) {
+		asmCode << "\tlea " << regR[instr.dst] << ", [" << regR[instr.dst] << "+" << regR[instr.src] << std::showpos << (int32_t)instr.imm32 << std::noshowpos << "]" << std::endl;
 	}
 
-	void AssemblyGeneratorX86::h_AND_64(Instruction& instr, int i) {
-		genar(instr);
-		asmCode << "\tand rax, ";
-		genbr1(instr);
-		gencr(instr);
+	//1 uOP
+	void AssemblyGeneratorX86::h_ISUB_R(Instruction& instr, int i) {
+		if (instr.src != instr.dst) {
+			asmCode << "\tsub " << regR[instr.dst] << ", " << regR[instr.src] << std::endl;
+		}
+		else {
+			asmCode << "\tsub " << regR[instr.dst] << ", " << (int32_t)instr.imm32 << std::endl;
+		}
 	}
 
-	void AssemblyGeneratorX86::h_AND_32(Instruction& instr, int i) {
-		genar(instr);
-		asmCode << "\tand eax, ";
-		genbr132(instr);
-		gencr(instr);
+	//2.75 uOPs
+	void AssemblyGeneratorX86::h_ISUB_M(Instruction& instr, int i) {
+		if (instr.src != instr.dst) {
+			genAddressReg(instr);
+			asmCode << "\tsub " << regR[instr.dst] << ", qword ptr [rsi+rax]" << std::endl;
+		}
+		else {
+			asmCode << "\tsub " << regR[instr.dst] << ", qword ptr [rsi+" << genAddressImm(instr) << "]" << std::endl;
+		}
 	}
 
-	void AssemblyGeneratorX86::h_OR_64(Instruction& instr, int i) {
-		genar(instr);
-		asmCode << "\tor rax, ";
-		genbr1(instr);
-		gencr(instr);
+	//1 uOP
+	void AssemblyGeneratorX86::h_IMUL_9C(Instruction& instr, int i) {
+		asmCode << "\tlea " << regR[instr.dst] << ", [" << regR[instr.dst] << "+" << regR[instr.dst] << "*8" << std::showpos << (int32_t)instr.imm32 << std::noshowpos << "]" << std::endl;
 	}
 
-	void AssemblyGeneratorX86::h_OR_32(Instruction& instr, int i) {
-		genar(instr);
-		asmCode << "\tor eax, ";
-		genbr132(instr);
-		gencr(instr);
+	//1 uOP
+	void AssemblyGeneratorX86::h_IMUL_R(Instruction& instr, int i) {
+		if (instr.src != instr.dst) {
+			asmCode << "\timul " << regR[instr.dst] << ", " << regR[instr.src] << std::endl;
+		}
+		else {
+			asmCode << "\timul " << regR[instr.dst] << ", " << (int32_t)instr.imm32 << std::endl;
+		}
 	}
 
-	void AssemblyGeneratorX86::h_XOR_64(Instruction& instr, int i) {
-		genar(instr);
-		asmCode << "\txor rax, ";
-		genbr1(instr);
-		gencr(instr);
+	//2.75 uOPs
+	void AssemblyGeneratorX86::h_IMUL_M(Instruction& instr, int i) {
+		if (instr.src != instr.dst) {
+			genAddressReg(instr);
+			asmCode << "\timul " << regR[instr.dst] << ", qword ptr [rsi+rax]" << std::endl;
+		}
+		else {
+			asmCode << "\timul " << regR[instr.dst] << ", qword ptr [rsi+" << genAddressImm(instr) << "]" << std::endl;
+		}
 	}
 
-	void AssemblyGeneratorX86::h_XOR_32(Instruction& instr, int i) {
-		genar(instr);
-		asmCode << "\txor eax, ";
-		genbr132(instr);
-		gencr(instr);
+	//4 uOPs
+	void AssemblyGeneratorX86::h_IMULH_R(Instruction& instr, int i) {
+		asmCode << "\tmov rax, " << regR[instr.dst] << std::endl;
+		asmCode << "\tmul " << regR[instr.src] << std::endl;
+		asmCode << "\tmov " << regR[instr.dst] << ", rdx" << std::endl;
 	}
 
-	void AssemblyGeneratorX86::h_SHL_64(Instruction& instr, int i) {
-		genar(instr);
-		genbr0(instr, "shl");
-		gencr(instr);
+	//5.75 uOPs
+	void AssemblyGeneratorX86::h_IMULH_M(Instruction& instr, int i) {
+		if (instr.src != instr.dst) {
+			genAddressReg(instr, "ecx");
+			asmCode << "\tmov rax, " << regR[instr.dst] << std::endl;
+			asmCode << "\tmul qword ptr [rsi+rcx]" << std::endl;
+		}
+		else {
+			asmCode << "\tmov rax, " << regR[instr.dst] << std::endl;
+			asmCode << "\tmul qword ptr [rsi+" << genAddressImm(instr) << "]" << std::endl;
+		}
+		asmCode << "\tmov " << regR[instr.dst] << ", rdx" << std::endl;
 	}
 
-	void AssemblyGeneratorX86::h_SHR_64(Instruction& instr, int i) {
-		genar(instr);
-		genbr0(instr, "shr");
-		gencr(instr);
+	//4 uOPs
+	void AssemblyGeneratorX86::h_ISMULH_R(Instruction& instr, int i) {
+		asmCode << "\tmov rax, " << regR[instr.dst] << std::endl;
+		asmCode << "\timul " << regR[instr.src] << std::endl;
+		asmCode << "\tmov " << regR[instr.dst] << ", rdx" << std::endl;
 	}
 
-	void AssemblyGeneratorX86::h_SAR_64(Instruction& instr, int i) {
-		genar(instr);
-		genbr0(instr, "sar");
-		gencr(instr);
+	//5.75 uOPs
+	void AssemblyGeneratorX86::h_ISMULH_M(Instruction& instr, int i) {
+		if (instr.src != instr.dst) {
+			genAddressReg(instr, "ecx");
+			asmCode << "\tmov rax, " << regR[instr.dst] << std::endl;
+			asmCode << "\timul qword ptr [rsi+rcx]" << std::endl;
+		}
+		else {
+			asmCode << "\tmov rax, " << regR[instr.dst] << std::endl;
+			asmCode << "\timul qword ptr [rsi+" << genAddressImm(instr) << "]" << std::endl;
+		}
+		asmCode << "\tmov " << regR[instr.dst] << ", rdx" << std::endl;
 	}
 
-	void AssemblyGeneratorX86::h_ROL_64(Instruction& instr, int i) {
-		genar(instr);
-		genbr0(instr, "rol");
-		gencr(instr);
+	//1 uOP
+	void AssemblyGeneratorX86::h_INEG_R(Instruction& instr, int i) {
+		asmCode << "\tneg " << regR[instr.dst] << std::endl;
 	}
 
-	void AssemblyGeneratorX86::h_ROR_64(Instruction& instr, int i) {
-		genar(instr);
-		genbr0(instr, "ror");
-		gencr(instr);
+	//1 uOP
+	void AssemblyGeneratorX86::h_IXOR_R(Instruction& instr, int i) {
+		if (instr.src != instr.dst) {
+			asmCode << "\txor " << regR[instr.dst] << ", " << regR[instr.src] << std::endl;
+		}
+		else {
+			asmCode << "\txor " << regR[instr.dst] << ", " << (int32_t)instr.imm32 << std::endl;
+		}
 	}
 
-	void AssemblyGeneratorX86::h_FPADD(Instruction& instr, int i) {
-		genaf(instr);
-		genbf(instr, "addpd");
-		gencf(instr);
+	//2.75 uOPs
+	void AssemblyGeneratorX86::h_IXOR_M(Instruction& instr, int i) {
+		if (instr.src != instr.dst) {
+			genAddressReg(instr);
+			asmCode << "\txor " << regR[instr.dst] << ", qword ptr [rsi+rax]" << std::endl;
+		}
+		else {
+			asmCode << "\txor " << regR[instr.dst] << ", qword ptr [rsi+" << genAddressImm(instr) << "]" << std::endl;
+		}
 	}
 
-	void AssemblyGeneratorX86::h_FPSUB(Instruction& instr, int i) {
-		genaf(instr);
-		genbf(instr, "subpd");
-		gencf(instr);
+	//1.75 uOPs
+	void AssemblyGeneratorX86::h_IROR_R(Instruction& instr, int i) {
+		if (instr.src != instr.dst) {
+			asmCode << "\tmov ecx, " << regR32[instr.src] << std::endl;
+			asmCode << "\tror " << regR[instr.dst] << ", cl" << std::endl;
+		}
+		else {
+			asmCode << "\tror " << regR[instr.dst] << ", " << (instr.imm32 & 63) << std::endl;
+		}
 	}
 
-	void AssemblyGeneratorX86::h_FPMUL(Instruction& instr, int i) {
-		genaf(instr);
-		genbf(instr, "mulpd");
-		asmCode << "\tmovaps xmm1, xmm0" << std::endl;
-		asmCode << "\tcmpeqpd xmm1, xmm1" << std::endl;
-		asmCode << "\tandps xmm0, xmm1" << std::endl;
-		gencf(instr);
+	//1.75 uOPs
+	void AssemblyGeneratorX86::h_IROL_R(Instruction& instr, int i) {
+		if (instr.src != instr.dst) {
+			asmCode << "\tmov ecx, " << regR32[instr.src] << std::endl;
+			asmCode << "\trol " << regR[instr.dst] << ", cl" << std::endl;
+		}
+		else {
+			asmCode << "\trol " << regR[instr.dst] << ", " << (instr.imm32 & 63) << std::endl;
+		}
 	}
 
-	void AssemblyGeneratorX86::h_FPDIV(Instruction& instr, int i) {
-		genaf(instr);
-		genbf(instr, "divpd");
-		asmCode << "\tmovaps xmm1, xmm0" << std::endl;
-		asmCode << "\tcmpeqpd xmm1, xmm1" << std::endl;
-		asmCode << "\tandps xmm0, xmm1" << std::endl;
-		gencf(instr);
+	//~6 uOPs
+	void AssemblyGeneratorX86::h_IDIV_C(Instruction& instr, int i) {
+		if (instr.imm32 != 0) {
+			uint32_t divisor = instr.imm32;
+			if (divisor & (divisor - 1)) {
+				magicu_info mi = compute_unsigned_magic_info(divisor, sizeof(uint64_t) * 8);
+				if (mi.pre_shift == 0 && !mi.increment) {
+					asmCode << "\tmov rax, " << mi.multiplier << std::endl;
+					asmCode << "\tmul " << regR[instr.dst] << std::endl;
+				}
+				else {
+					asmCode << "\tmov rax, " << regR[instr.dst] << std::endl;
+					if (mi.pre_shift > 0)
+						asmCode << "\tshr rax, " << mi.pre_shift << std::endl;
+					if (mi.increment) {
+						asmCode << "\tadd rax, 1" << std::endl;
+						asmCode << "\tsbb rax, 0" << std::endl;
+					}
+					asmCode << "\tmov rcx, " << mi.multiplier << std::endl;
+					asmCode << "\tmul rcx" << std::endl;
+				}
+				if (mi.post_shift > 0)
+					asmCode << "\tshr rdx, " << mi.post_shift << std::endl;
+				asmCode << "\tadd " << regR[instr.dst] << ", rdx" << std::endl;
+			}
+			else { //divisor is a power of two
+				int shift = 0;
+				while (divisor >>= 1)
+					++shift;
+				if (shift > 0)
+					asmCode << "\tshr " << regR[instr.dst] << ", " << shift << std::endl;
+			}
+		}
 	}
 
-	void AssemblyGeneratorX86::h_FPSQRT(Instruction& instr, int i) {
-		genaf(instr);
-		asmCode << "\tandps xmm0, xmm10" << std::endl;
-		asmCode << "\tsqrtpd xmm0, xmm0" << std::endl;
-		gencf(instr);
+	//~8.5 uOPs
+	void AssemblyGeneratorX86::h_ISDIV_C(Instruction& instr, int i) {
+		int64_t divisor = (int32_t)instr.imm32;
+		if ((divisor & -divisor) == divisor || (divisor & -divisor) == -divisor) {
+			asmCode << "\tmov rax, " << regR[instr.dst] << std::endl;
+			// +/- power of two
+			bool negative = divisor < 0;
+			if (negative)
+				divisor = -divisor;
+			int shift = 0;
+			uint64_t unsignedDivisor = divisor;
+			while (unsignedDivisor >>= 1)
+				++shift;
+			if (shift > 0) {
+				asmCode << "\tmov rcx, rax" << std::endl;
+				asmCode << "\tsar rcx, 63" << std::endl;
+				uint32_t mask = (1ULL << shift) + 0xFFFFFFFF;
+				asmCode << "\tand ecx, 0" << std::hex << mask << std::dec << "h" << std::endl;
+				asmCode << "\tadd rax, rcx" << std::endl;
+				asmCode << "\tsar rax, " << shift << std::endl;
+			}
+			if (negative)
+				asmCode << "\tneg rax" << std::endl;
+			asmCode << "\tadd " << regR[instr.dst] << ", rax" << std::endl;
+		}
+		else if (divisor != 0) {
+			magics_info mi = compute_signed_magic_info(divisor);
+			asmCode << "\tmov rax, " << mi.multiplier << std::endl;
+			asmCode << "\timul " << regR[instr.dst] << std::endl;
+			//asmCode << "\tmov rax, rdx" << std::endl;
+			asmCode << "\txor eax, eax" << std::endl;
+			bool haveSF = false;
+			if (divisor > 0 && mi.multiplier < 0) {
+				asmCode << "\tadd rdx, " << regR[instr.dst] << std::endl;
+				haveSF = true;
+			}
+			if (divisor < 0 && mi.multiplier > 0) {
+				asmCode << "\tsub rdx, " << regR[instr.dst] << std::endl;
+				haveSF = true;
+			}
+			if (mi.shift > 0) {
+				asmCode << "\tsar rdx, " << mi.shift << std::endl;
+				haveSF = true;
+			}
+			if (!haveSF)
+				asmCode << "\ttest rdx, rdx" << std::endl;
+			asmCode << "\tsets al" << std::endl;
+			asmCode << "\tadd rdx, rax" << std::endl;
+			asmCode << "\tadd " << regR[instr.dst] << ", rdx" << std::endl;
+		}
 	}
 
-	void AssemblyGeneratorX86::h_FPROUND(Instruction& instr, int i) {
-		genar(instr);
-		asmCode << "\tmov rcx, rax" << std::endl;
-		asmCode << "\tshl eax, 13" << std::endl;
-		asmCode << "\tand rcx, -2048" << std::endl;
+	//2 uOPs
+	void AssemblyGeneratorX86::h_ISWAP_R(Instruction& instr, int i) {
+		if (instr.src != instr.dst) {
+			asmCode << "\txchg " << regR[instr.dst] << ", " << regR[instr.src] << std::endl;
+		}
+	}
+
+	//1 uOP
+	void AssemblyGeneratorX86::h_FSWAP_R(Instruction& instr, int i) {
+		asmCode << "\tshufpd " << regFE[instr.dst] << ", " << regFE[instr.dst] << ", 1" << std::endl;
+	}
+
+	//1 uOP
+	void AssemblyGeneratorX86::h_FADD_R(Instruction& instr, int i) {
+		instr.dst %= 4;
+		instr.src %= 4;
+		asmCode << "\taddpd " << regF[instr.dst] << ", " << regA[instr.src] << std::endl;
+		//asmCode << "\t" << fsumInstr[instr.mod % 4] << " " << signMask << ", " << regF[instr.dst] << std::endl;
+	}
+
+	//5 uOPs
+	void AssemblyGeneratorX86::h_FADD_M(Instruction& instr, int i) {
+		instr.dst %= 4;
+		genAddressReg(instr);
+		asmCode << "\tcvtdq2pd xmm12, qword ptr [rsi+rax]" << std::endl;
+		asmCode << "\taddpd " << regF[instr.dst] << ", xmm12" << std::endl;
+	}
+
+	//1 uOP
+	void AssemblyGeneratorX86::h_FSUB_R(Instruction& instr, int i) {
+		instr.dst %= 4;
+		instr.src %= 4;
+		asmCode << "\tsubpd " << regF[instr.dst] << ", " << regA[instr.src] << std::endl;
+		//asmCode << "\t" << fsumInstr[instr.mod % 4] << " " << signMask << ", " << regF[instr.dst] << std::endl;
+	}
+
+	//5 uOPs
+	void AssemblyGeneratorX86::h_FSUB_M(Instruction& instr, int i) {
+		instr.dst %= 4;
+		genAddressReg(instr);
+		asmCode << "\tcvtdq2pd xmm12, qword ptr [rsi+rax]" << std::endl;
+		asmCode << "\tsubpd " << regF[instr.dst] << ", xmm12" << std::endl;
+	}
+
+	//1 uOP
+	void AssemblyGeneratorX86::h_FNEG_R(Instruction& instr, int i) {
+		instr.dst %= 4;
+		asmCode << "\txorps " << regF[instr.dst] << ", " << signMask << std::endl;
+	}
+
+	//1 uOP
+	void AssemblyGeneratorX86::h_FMUL_R(Instruction& instr, int i) {
+		instr.dst %= 4;
+		instr.src %= 4;
+		asmCode << "\tmulpd " << regE[instr.dst] << ", " << regA[instr.src] << std::endl;
+	}
+
+	//7 uOPs
+	void AssemblyGeneratorX86::h_FMUL_M(Instruction& instr, int i) {
+		instr.dst %= 4;
+		genAddressReg(instr);
+		asmCode << "\tcvtdq2pd xmm12, qword ptr [rsi+rax]" << std::endl;
+		asmCode << "\tandps xmm12, xmm14" << std::endl;
+		asmCode << "\tmulpd " << regE[instr.dst] << ", xmm12" << std::endl;
+		asmCode << "\tmaxpd " << regE[instr.dst] << ", " << dblMin << std::endl;
+	}
+
+	//2 uOPs
+	void AssemblyGeneratorX86::h_FDIV_R(Instruction& instr, int i) {
+		instr.dst %= 4;
+		instr.src %= 4;
+		asmCode << "\tdivpd " << regE[instr.dst] << ", " << regA[instr.src] << std::endl;
+		asmCode << "\tmaxpd " << regE[instr.dst] << ", " << dblMin << std::endl;
+	}
+
+	//7 uOPs
+	void AssemblyGeneratorX86::h_FDIV_M(Instruction& instr, int i) {
+		instr.dst %= 4;
+		genAddressReg(instr);
+		asmCode << "\tcvtdq2pd xmm12, qword ptr [rsi+rax]" << std::endl;
+		asmCode << "\tandps xmm12, xmm14" << std::endl;
+		asmCode << "\tdivpd " << regE[instr.dst] << ", xmm12" << std::endl;
+		asmCode << "\tmaxpd " << regE[instr.dst] << ", " << dblMin << std::endl;
+	}
+
+	//1 uOP
+	void AssemblyGeneratorX86::h_FSQRT_R(Instruction& instr, int i) {
+		instr.dst %= 4;
+		asmCode << "\tsqrtpd " << regE[instr.dst] << ", " << regE[instr.dst] << std::endl;
+	}
+
+	//6 uOPs
+	void AssemblyGeneratorX86::h_CFROUND(Instruction& instr, int i) {
+		asmCode << "\tmov rax, " << regR[instr.src] << std::endl;
+		int rotate = (13 - (instr.imm32 & 63)) & 63;
+		if (rotate != 0)
+			asmCode << "\trol rax, " << rotate << std::endl;
 		asmCode << "\tand eax, 24576" << std::endl;
-		asmCode << "\tcvtsi2sd " << regF[instr.regc % RegistersCount] << ", rcx" << std::endl;
 		asmCode << "\tor eax, 40896" << std::endl;
-		asmCode << "\tmov dword ptr [rsp - 8], eax" << std::endl;
-		asmCode << "\tldmxcsr dword ptr [rsp - 8]" << std::endl;
-		gencf(instr, true);
+		asmCode << "\tmov dword ptr [rsp-8], eax" << std::endl;
+		asmCode << "\tldmxcsr dword ptr [rsp-8]" << std::endl;
 	}
 
-	static inline const char* jumpCondition(Instruction& instr, bool invert = false) {
-		switch ((instr.locb & 7) ^ invert)
+	static inline const char* condition(Instruction& instr, bool invert = false) {
+		switch (((instr.mod >> 2) & 7) ^ invert)
 		{
 			case 0:
-				return "jbe";
+				return "be";
 			case 1:
-				return "ja";
+				return "a";
 			case 2:
-				return "js";
+				return "s";
 			case 3:
-				return "jns";
+				return "ns";
 			case 4:
-				return "jo";
+				return "o";
 			case 5:
-				return "jno";
+				return "no";
 			case 6:
-				return "jl";
+				return "l";
 			case 7:
-				return "jge";
+				return "ge";
+			default:
+				UNREACHABLE;
 		}
 	}
 
-	void AssemblyGeneratorX86::h_CALL(Instruction& instr, int i) {
-		genar(instr);
-		asmCode << "\tcmp " << regR32[instr.regb % RegistersCount] << ", " << instr.imm32 << std::endl;
-		asmCode << "\t" << jumpCondition(instr);
-		asmCode << " short taken_call_" << i << std::endl;
-		gencr(instr);
-		asmCode << "\tjmp rx_i_" << wrapInstr(i + 1) << std::endl;
-		asmCode << "taken_call_" << i << ":" << std::endl;
-		if (trace) {
-			asmCode << "\tmov qword ptr [rsi + rdi * 8 + 262136], rax" << std::endl;
-		}
-		asmCode << "\tpush rax" << std::endl;
-		asmCode << "\tcall rx_i_" << wrapInstr(i + (instr.imm8 & 127) + 2) << std::endl;
+	//4 uOPs
+	void AssemblyGeneratorX86::h_COND_R(Instruction& instr, int i) {
+		asmCode << "\txor ecx, ecx" << std::endl;
+		asmCode << "\tcmp " << regR32[instr.src] << ", " << (int32_t)instr.imm32 << std::endl;
+		asmCode << "\tset" << condition(instr) << " cl" << std::endl;
+		asmCode << "\tadd " << regR[instr.dst] << ", rcx" << std::endl;
 	}
 
-	void AssemblyGeneratorX86::h_RET(Instruction& instr, int i) {
-		genar(instr);
-		asmCode << "\tcmp rsp, rbp" << std::endl;
-		asmCode << "\tje short not_taken_ret_" << i << std::endl;
-		asmCode << "\txor rax, qword ptr [rsp + 8]" << std::endl;
-		gencr(instr);
-		asmCode << "\tret 8" << std::endl;
-		asmCode << "not_taken_ret_" << i << ":" << std::endl;
-		gencr(instr);
+	//6 uOPs
+	void AssemblyGeneratorX86::h_COND_M(Instruction& instr, int i) {
+		asmCode << "\txor ecx, ecx" << std::endl;
+		genAddressReg(instr);
+		asmCode << "\tcmp dword ptr [rsi+rax], " << (int32_t)instr.imm32 << std::endl;
+		asmCode << "\tset" << condition(instr) << " cl" << std::endl;
+		asmCode << "\tadd " << regR[instr.dst] << ", rcx" << std::endl;
+	}
+
+	//3 uOPs
+	void AssemblyGeneratorX86::h_ISTORE(Instruction& instr, int i) {
+		genAddressRegDst(instr);
+		asmCode << "\tmov qword ptr [rsi+rax], " << regR[instr.src] << std::endl;
+	}
+
+	//3 uOPs
+	void AssemblyGeneratorX86::h_FSTORE(Instruction& instr, int i) {
+		genAddressRegDst(instr, 16);
+		asmCode << "\tmovapd xmmword ptr [rsi+rax], " << regFE[instr.src] << std::endl;
+	}
+
+	void AssemblyGeneratorX86::h_NOP(Instruction& instr, int i) {
+		asmCode << "\tnop" << std::endl;
 	}
 
 #include "instructionWeights.hpp"
 #define INST_HANDLE(x) REPN(&AssemblyGeneratorX86::h_##x, WT(x))
 
 	InstructionGenerator AssemblyGeneratorX86::engine[256] = {
-		INST_HANDLE(ADD_64)
-		INST_HANDLE(ADD_32)
-		INST_HANDLE(SUB_64)
-		INST_HANDLE(SUB_32)
-		INST_HANDLE(MUL_64)
-		INST_HANDLE(MULH_64)
-		INST_HANDLE(MUL_32)
-		INST_HANDLE(IMUL_32)
-		INST_HANDLE(IMULH_64)
-		INST_HANDLE(DIV_64)
-		INST_HANDLE(IDIV_64)
-		INST_HANDLE(AND_64)
-		INST_HANDLE(AND_32)
-		INST_HANDLE(OR_64)
-		INST_HANDLE(OR_32)
-		INST_HANDLE(XOR_64)
-		INST_HANDLE(XOR_32)
-		INST_HANDLE(SHL_64)
-		INST_HANDLE(SHR_64)
-		INST_HANDLE(SAR_64)
-		INST_HANDLE(ROL_64)
-		INST_HANDLE(ROR_64)
-		INST_HANDLE(FPADD)
-		INST_HANDLE(FPSUB)
-		INST_HANDLE(FPMUL)
-		INST_HANDLE(FPDIV)
-		INST_HANDLE(FPSQRT)
-		INST_HANDLE(FPROUND)
-		INST_HANDLE(CALL)
-		INST_HANDLE(RET)
+		//Integer
+		INST_HANDLE(IADD_R)
+		INST_HANDLE(IADD_M)
+		INST_HANDLE(IADD_RC)
+		INST_HANDLE(ISUB_R)
+		INST_HANDLE(ISUB_M)
+		INST_HANDLE(IMUL_9C)
+		INST_HANDLE(IMUL_R)
+		INST_HANDLE(IMUL_M)
+		INST_HANDLE(IMULH_R)
+		INST_HANDLE(IMULH_M)
+		INST_HANDLE(ISMULH_R)
+		INST_HANDLE(ISMULH_M)
+		INST_HANDLE(IDIV_C)
+		INST_HANDLE(ISDIV_C)
+		INST_HANDLE(INEG_R)
+		INST_HANDLE(IXOR_R)
+		INST_HANDLE(IXOR_M)
+		INST_HANDLE(IROR_R)
+		INST_HANDLE(IROL_R)
+		INST_HANDLE(ISWAP_R)
+
+		//Common floating point
+		INST_HANDLE(FSWAP_R)
+
+		//Floating point group F
+		INST_HANDLE(FADD_R)
+		INST_HANDLE(FADD_M)
+		INST_HANDLE(FSUB_R)
+		INST_HANDLE(FSUB_M)
+		INST_HANDLE(FNEG_R)
+
+		//Floating point group E
+		INST_HANDLE(FMUL_R)
+		INST_HANDLE(FMUL_M)
+		INST_HANDLE(FDIV_R)
+		INST_HANDLE(FDIV_M)
+		INST_HANDLE(FSQRT_R)
+
+		//Control
+		INST_HANDLE(COND_R)
+		INST_HANDLE(COND_M)
+		INST_HANDLE(CFROUND)
+
+		INST_HANDLE(ISTORE)
+		INST_HANDLE(FSTORE)
+
+		INST_HANDLE(NOP)
 	};
 }
\ No newline at end of file
diff --git a/src/AssemblyGeneratorX86.hpp b/src/AssemblyGeneratorX86.hpp
index 3097a94..0c1844e 100644
--- a/src/AssemblyGeneratorX86.hpp
+++ b/src/AssemblyGeneratorX86.hpp
@@ -24,13 +24,14 @@ along with RandomX.  If not, see<http://www.gnu.org/licenses/>.
 
 namespace RandomX {
 
+	class Program;
 	class AssemblyGeneratorX86;
 
 	typedef void(AssemblyGeneratorX86::*InstructionGenerator)(Instruction&, int);
 
 	class AssemblyGeneratorX86 {
 	public:
-		void generateProgram(const void* seed);
+		void generateProgram(Program&);
 		void printCode(std::ostream& os) {
 			os << asmCode.rdbuf();
 		}
@@ -38,46 +39,48 @@ namespace RandomX {
 		static InstructionGenerator engine[256];
 		std::stringstream asmCode;
 
-		void genar(Instruction&);
-		void genaf(Instruction&);
-		void genbr0(Instruction&, const char*);
-		void genbr1(Instruction&);
-		void genbr132(Instruction&);
-		void genbf(Instruction&, const char*);
-		void gencr(Instruction&);
-		void gencf(Instruction&, bool);
+		void genAddressReg(Instruction&, const char*);
+		void genAddressRegDst(Instruction&, int);
+		int32_t genAddressImm(Instruction&);
 
 		void generateCode(Instruction&, int);
 
-		void h_ADD_64(Instruction&, int);
-		void h_ADD_32(Instruction&, int);
-		void h_SUB_64(Instruction&, int);
-		void h_SUB_32(Instruction&, int);
-		void h_MUL_64(Instruction&, int);
-		void h_MULH_64(Instruction&, int);
-		void h_MUL_32(Instruction&, int);
-		void h_IMUL_32(Instruction&, int);
-		void h_IMULH_64(Instruction&, int);
-		void h_DIV_64(Instruction&, int);
-		void h_IDIV_64(Instruction&, int);
-		void h_AND_64(Instruction&, int);
-		void h_AND_32(Instruction&, int);
-		void h_OR_64(Instruction&, int);
-		void h_OR_32(Instruction&, int);
-		void h_XOR_64(Instruction&, int);
-		void h_XOR_32(Instruction&, int);
-		void h_SHL_64(Instruction&, int);
-		void h_SHR_64(Instruction&, int);
-		void h_SAR_64(Instruction&, int);
-		void h_ROL_64(Instruction&, int);
-		void h_ROR_64(Instruction&, int);
-		void h_FPADD(Instruction&, int);
-		void h_FPSUB(Instruction&, int);
-		void h_FPMUL(Instruction&, int);
-		void h_FPDIV(Instruction&, int);
-		void h_FPSQRT(Instruction&, int);
-		void h_FPROUND(Instruction&, int);
-		void h_CALL(Instruction&, int);
-		void h_RET(Instruction&, int);
+		void  h_IADD_R(Instruction&, int);
+		void  h_IADD_M(Instruction&, int);
+		void  h_IADD_RC(Instruction&, int);
+		void  h_ISUB_R(Instruction&, int);
+		void  h_ISUB_M(Instruction&, int);
+		void  h_IMUL_9C(Instruction&, int);
+		void  h_IMUL_R(Instruction&, int);
+		void  h_IMUL_M(Instruction&, int);
+		void  h_IMULH_R(Instruction&, int);
+		void  h_IMULH_M(Instruction&, int);
+		void  h_ISMULH_R(Instruction&, int);
+		void  h_ISMULH_M(Instruction&, int);
+		void  h_IDIV_C(Instruction&, int);
+		void  h_ISDIV_C(Instruction&, int);
+		void  h_INEG_R(Instruction&, int);
+		void  h_IXOR_R(Instruction&, int);
+		void  h_IXOR_M(Instruction&, int);
+		void  h_IROR_R(Instruction&, int);
+		void  h_IROL_R(Instruction&, int);
+		void  h_ISWAP_R(Instruction&, int);
+		void  h_FSWAP_R(Instruction&, int);
+		void  h_FADD_R(Instruction&, int);
+		void  h_FADD_M(Instruction&, int);
+		void  h_FSUB_R(Instruction&, int);
+		void  h_FSUB_M(Instruction&, int);
+		void  h_FNEG_R(Instruction&, int);
+		void  h_FMUL_R(Instruction&, int);
+		void  h_FMUL_M(Instruction&, int);
+		void  h_FDIV_R(Instruction&, int);
+		void  h_FDIV_M(Instruction&, int);
+		void  h_FSQRT_R(Instruction&, int);
+		void  h_COND_R(Instruction&, int);
+		void  h_COND_M(Instruction&, int);
+		void  h_CFROUND(Instruction&, int);
+		void  h_ISTORE(Instruction&, int);
+		void  h_FSTORE(Instruction&, int);
+		void  h_NOP(Instruction&, int);
 	};
 }
\ No newline at end of file
diff --git a/src/Cache.cpp b/src/Cache.cpp
index eb03f9d..85d481e 100644
--- a/src/Cache.cpp
+++ b/src/Cache.cpp
@@ -23,7 +23,6 @@ along with RandomX.  If not, see<http://www.gnu.org/licenses/>.
 #include "Cache.hpp"
 #include "softAes.h"
 #include "argon2.h"
-#include "Pcg32.hpp"
 #include "argon2_core.h"
 
 namespace RandomX {
@@ -134,11 +133,6 @@ namespace RandomX {
 		//Argon2d memory fill
 		argonFill(seed, seedSize);
 
-		//Circular shift of the cache buffer by 512 bytes
-		//realized by copying the first 512 bytes to the back 
-		//of the buffer and shifting the start by 512 bytes
-		memcpy(memory + CacheSize, memory, CacheShift);
-
 		//AES keys
 		expandAesKeys<softAes>((__m128i*)seed, keys.data());
 	}
diff --git a/src/Cache.hpp b/src/Cache.hpp
index 7a34ee8..bc3d6ed 100644
--- a/src/Cache.hpp
+++ b/src/Cache.hpp
@@ -23,12 +23,32 @@ along with RandomX.  If not, see<http://www.gnu.org/licenses/>.
 #include <new>
 #include "common.hpp"
 #include "dataset.hpp"
+#include "virtualMemory.hpp"
 
 namespace RandomX {
 
 	class Cache {
 	public:
-		void* operator new(size_t size) {
+		static void* alloc(bool largePages) {
+			if (largePages) {
+				return allocLargePagesMemory(sizeof(Cache));
+			}
+			else {
+				void* ptr = _mm_malloc(sizeof(Cache), sizeof(__m128i));
+				if (ptr == nullptr)
+					throw std::bad_alloc();
+				return ptr;
+			}
+		}
+		static void dealloc(Cache* cache, bool largePages) {
+			if (largePages) {
+				//TODO: memory obtained from allocLargePagesMemory is not released here
+			}
+			else {
+				_mm_free(cache);
+			}
+		}
+		/*void* operator new(size_t size) {
 			void* ptr = _mm_malloc(size, sizeof(__m128i));
 			if (ptr == nullptr)
 				throw std::bad_alloc();
@@ -37,7 +57,7 @@ namespace RandomX {
 
 		void operator delete(void* ptr) {
 			_mm_free(ptr);
-		}
+		}*/
 
 		template<bool softAes>
 		void initialize(const void* seed, size_t seedSize);
@@ -46,12 +66,12 @@ namespace RandomX {
 			return keys;
 		}
 
-		const uint8_t* getCache() {
-			return memory + CacheShift;
+		const uint8_t* getCache() const {
+			return memory;
 		}
 	private:
 		alignas(16) KeysContainer keys;
-		uint8_t memory[CacheSize + CacheShift];
+		uint8_t memory[CacheSize];
 		void argonFill(const void* seed, size_t seedSize);
 	};
 }
\ No newline at end of file
diff --git a/src/CompiledVirtualMachine.cpp b/src/CompiledVirtualMachine.cpp
index 7803003..8cfc364 100644
--- a/src/CompiledVirtualMachine.cpp
+++ b/src/CompiledVirtualMachine.cpp
@@ -18,37 +18,28 @@ along with RandomX.  If not, see<http://www.gnu.org/licenses/>.
 */
 
 #include "CompiledVirtualMachine.hpp"
-#include "Pcg32.hpp"
 #include "common.hpp"
-#include "instructions.hpp"
 #include <stdexcept>
 
 namespace RandomX {
 
-	CompiledVirtualMachine::CompiledVirtualMachine(bool softAes) : VirtualMachine(softAes) {
-
+	CompiledVirtualMachine::CompiledVirtualMachine() {
+		totalSize = 0;
 	}
 
-	void CompiledVirtualMachine::setDataset(dataset_t ds, bool lightClient) {
-		if (lightClient) {
-			throw std::runtime_error("Compiled VM does not support light-client mode");
-		}
-		VirtualMachine::setDataset(ds, lightClient);
+	void CompiledVirtualMachine::setDataset(dataset_t ds) {
+		mem.ds = ds;
 	}
 
-	void CompiledVirtualMachine::initializeProgram(const void* seed) {
-		Pcg32 gen(seed);
-		for (unsigned i = 0; i < sizeof(reg) / sizeof(Pcg32::result_type); ++i) {
-			*(((uint32_t*)&reg) + i) = gen();
-		}
-		compiler.generateProgram(gen);
-		mem.ma = (gen() ^ *(((uint32_t*)seed) + 4)) & ~7;
-		mem.mx = *(((uint32_t*)seed) + 5);
+	void CompiledVirtualMachine::initialize() {
+		VirtualMachine::initialize();
+		compiler.generateProgram(program);
 	}
 
 	void CompiledVirtualMachine::execute() {
-		//executeProgram(reg, mem, scratchpad, readDataset);
-		compiler.getProgramFunc()(reg, mem, scratchpad);
+		//executeProgram(reg, mem, scratchpad, InstructionCount);
+		totalSize += compiler.getCodeSize();
+		compiler.getProgramFunc()(reg, mem, scratchpad, InstructionCount);
 #ifdef TRACEVM
 		for (int32_t i = InstructionCount - 1; i >= 0; --i) {
 			std::cout << std::hex << tracepad[i].u64 << std::endl;
diff --git a/src/CompiledVirtualMachine.hpp b/src/CompiledVirtualMachine.hpp
index 0932cfe..e3b6bf0 100644
--- a/src/CompiledVirtualMachine.hpp
+++ b/src/CompiledVirtualMachine.hpp
@@ -19,24 +19,39 @@ along with RandomX.  If not, see<http://www.gnu.org/licenses/>.
 
 #pragma once
 //#define TRACEVM
+#include <new>
 #include "VirtualMachine.hpp"
 #include "JitCompilerX86.hpp"
+#include "intrinPortable.h"
 
 namespace RandomX {
 
 	class CompiledVirtualMachine : public VirtualMachine {
 	public:
-		CompiledVirtualMachine(bool softAes);
-		void setDataset(dataset_t ds, bool light = false) override;
-		void initializeProgram(const void* seed) override;
+		void* operator new(size_t size) {
+			void* ptr = _mm_malloc(size, 64);
+			if (ptr == nullptr)
+				throw std::bad_alloc();
+			return ptr;
+		}
+		void operator delete(void* ptr) {
+			_mm_free(ptr);
+		}
+		CompiledVirtualMachine();
+		void setDataset(dataset_t ds) override;
+		void initialize() override;
 		virtual void execute() override;
 		void* getProgram() {
 			return compiler.getCode();
 		}
+		uint64_t getTotalSize() {
+			return totalSize;
+		}
 	private:
 #ifdef TRACEVM
 		convertible_t tracepad[InstructionCount];
 #endif
 		JitCompilerX86 compiler;
+		uint64_t totalSize;
 	};
 }
\ No newline at end of file
diff --git a/src/Instruction.cpp b/src/Instruction.cpp
index 4ab128a..bdcaf39 100644
--- a/src/Instruction.cpp
+++ b/src/Instruction.cpp
@@ -18,53 +18,419 @@ along with RandomX.  If not, see<http://www.gnu.org/licenses/>.
 */
 
 #include "Instruction.hpp"
+#include "common.hpp"
 
 namespace RandomX {
 
 	void Instruction::print(std::ostream& os) const {
-			os << "  A: loc = " << std::dec << (loca & 7) << ", reg: " << (rega & 7) << std::endl;
-			os << "  B: loc = " << (locb & 7) << ", reg: " << (regb & 7) << std::endl;
-			os << "  C: loc = " << (locc & 7) << ", reg: " << (regc & 7) << std::endl;
-			os << "  addra = " << std::hex << addra << std::endl;
-			os << "  addrc = " << addrc << std::endl;
-			os << "  imm8 = " << std::dec << (int)imm8 << std::endl;
-			os << "  imm32 = " << imm32 << std::endl;
+		os << names[opcode] << " ";
+		auto handler = engine[opcode];
+		(this->*handler)(os);
+	}
+
+	void Instruction::genAddressReg(std::ostream& os) const {
+		os << ((mod % 4) ? "L1" : "L2") << "[r" << (int)src << "]";
+	}
+
+	void Instruction::genAddressRegDst(std::ostream& os) const {
+		os << ((mod % 4) ? "L1" : "L2") << "[r" << (int)dst << "]";
+	}
+
+	void Instruction::genAddressImm(std::ostream& os) const {
+		os << "L3" << "[" << (imm32 & ScratchpadL3Mask) << "]";
+	}
+
+	void Instruction::h_IADD_R(std::ostream& os) const {
+		if (src != dst) {
+			os << "r" << (int)dst << ", r" << (int)src << std::endl;
 		}
+		else {
+			os << "r" << (int)dst << ", " << (int32_t)imm32 << std::endl;
+		}
+	}
+
+	void Instruction::h_IADD_M(std::ostream& os) const {
+		if (src != dst) {
+			os << "r" << (int)dst << ", ";
+			genAddressReg(os);
+			os << std::endl;
+		}
+		else {
+			os << "r" << (int)dst << ", ";
+			genAddressImm(os);
+			os << std::endl;
+		}
+	}
+
+	void Instruction::h_IADD_RC(std::ostream& os) const {
+		os << "r" << (int)dst << ", r" << (int)src << ", " << (int32_t)imm32 << std::endl;
+	}
+
+	//1 uOP
+	void Instruction::h_ISUB_R(std::ostream& os) const {
+		if (src != dst) {
+			os << "r" << (int)dst << ", r" << (int)src << std::endl;
+		}
+		else {
+			os << "r" << (int)dst << ", " << (int32_t)imm32 << std::endl;
+		}
+	}
+
+	void Instruction::h_ISUB_M(std::ostream& os) const {
+		if (src != dst) {
+			os << "r" << (int)dst << ", ";
+			genAddressReg(os);
+			os << std::endl;
+		}
+		else {
+			os << "r" << (int)dst << ", ";
+			genAddressImm(os);
+			os << std::endl;
+		}
+	}
+
+	void Instruction::h_IMUL_9C(std::ostream& os) const {
+		os << "r" << (int)dst << ", " << (int32_t)imm32 << std::endl;
+	}
+
+	void Instruction::h_IMUL_R(std::ostream& os) const {
+		if (src != dst) {
+			os << "r" << (int)dst << ", r" << (int)src << std::endl;
+		}
+		else {
+			os << "r" << (int)dst << ", " << (int32_t)imm32 << std::endl;
+		}
+	}
+
+	void Instruction::h_IMUL_M(std::ostream& os) const {
+		if (src != dst) {
+			os << "r" << (int)dst << ", ";
+			genAddressReg(os);
+			os << std::endl;
+		}
+		else {
+			os << "r" << (int)dst << ", ";
+			genAddressImm(os);
+			os << std::endl;
+		}
+	}
+
+	void Instruction::h_IMULH_R(std::ostream& os) const {
+		os << "r" << (int)dst << ", r" << (int)src << std::endl;
+	}
+
+	void Instruction::h_IMULH_M(std::ostream& os) const {
+		if (src != dst) {
+			os << "r" << (int)dst << ", ";
+			genAddressReg(os);
+			os << std::endl;
+		}
+		else {
+			os << "r" << (int)dst << ", ";
+			genAddressImm(os);
+			os << std::endl;
+		}
+	}
+
+	void Instruction::h_ISMULH_R(std::ostream& os) const {
+		os << "r" << (int)dst << ", r" << (int)src << std::endl;
+	}
+
+	void Instruction::h_ISMULH_M(std::ostream& os) const {
+		if (src != dst) {
+			os << "r" << (int)dst << ", ";
+			genAddressReg(os);
+			os << std::endl;
+		}
+		else {
+			os << "r" << (int)dst << ", ";
+			genAddressImm(os);
+			os << std::endl;
+		}
+	}
+
+	void Instruction::h_INEG_R(std::ostream& os) const {
+		os << "r" << (int)dst << std::endl;
+	}
+
+	void Instruction::h_IXOR_R(std::ostream& os) const {
+		if (src != dst) {
+			os << "r" << (int)dst << ", r" << (int)src << std::endl;
+		}
+		else {
+			os << "r" << (int)dst << ", " << (int32_t)imm32 << std::endl;
+		}
+	}
+
+	void Instruction::h_IXOR_M(std::ostream& os) const {
+		if (src != dst) {
+			os << "r" << (int)dst << ", ";
+			genAddressReg(os);
+			os << std::endl;
+		}
+		else {
+			os << "r" << (int)dst << ", ";
+			genAddressImm(os);
+			os << std::endl;
+		}
+	}
+
+	void Instruction::h_IROR_R(std::ostream& os) const {
+		if (src != dst) {
+			os << "r" << (int)dst << ", r" << (int)src << std::endl;
+		}
+		else {
+			os << "r" << (int)dst << ", " << (imm32 & 63) << std::endl;
+		}
+	}
+
+	void Instruction::h_IROL_R(std::ostream& os) const {
+		if (src != dst) {
+			os << "r" << (int)dst << ", r" << (int)src << std::endl;
+		}
+		else {
+			os << "r" << (int)dst << ", " << (imm32 & 63) << std::endl;
+		}
+	}
+
+	void Instruction::h_IDIV_C(std::ostream& os) const {
+		os << "r" << (int)dst << ", " << imm32 << std::endl;
+	}
+
+	void Instruction::h_ISDIV_C(std::ostream& os) const {
+		os << "r" << (int)dst << ", " << (int32_t)imm32 << std::endl;
+	}
+
+	void Instruction::h_ISWAP_R(std::ostream& os) const {
+		os << "r" << (int)dst << ", r" << (int)src << std::endl;
+	}
+
+	void Instruction::h_FSWAP_R(std::ostream& os) const {
+		const char reg = (dst >= 4) ? 'e' : 'f';
+		auto dstIndex = dst % 4;
+		os << reg << dstIndex << std::endl;
+	}
+
+	void Instruction::h_FADD_R(std::ostream& os) const {
+		auto dstIndex = dst % 4;
+		auto srcIndex = src % 4;
+		os << "f" << dstIndex << ", a" << srcIndex << std::endl;
+	}
+
+	void Instruction::h_FADD_M(std::ostream& os) const {
+		auto dstIndex = dst % 4;
+		os << "f" << dstIndex << ", ";
+		genAddressReg(os);
+		os << std::endl;
+	}
+
+	void Instruction::h_FSUB_R(std::ostream& os) const {
+		auto dstIndex = dst % 4;
+		auto srcIndex = src % 4;
+		os << "f" << dstIndex << ", a" << srcIndex << std::endl;
+	}
+
+	void Instruction::h_FSUB_M(std::ostream& os) const {
+		auto dstIndex = dst % 4;
+		os << "f" << dstIndex << ", ";
+		genAddressReg(os);
+		os << std::endl;
+	}
+
+	void Instruction::h_FNEG_R(std::ostream& os) const {
+		auto dstIndex = dst % 4;
+		os << "f" << dstIndex << std::endl;
+	}
+
+	void Instruction::h_FMUL_R(std::ostream& os) const {
+		auto dstIndex = dst % 4;
+		auto srcIndex = src % 4;
+		os << "e" << dstIndex << ", a" << srcIndex << std::endl;
+	}
+
+	void Instruction::h_FMUL_M(std::ostream& os) const {
+		auto dstIndex = dst % 4;
+		os << "e" << dstIndex << ", ";
+		genAddressReg(os);
+		os << std::endl;
+	}
+
+	void Instruction::h_FDIV_R(std::ostream& os) const {
+		auto dstIndex = dst % 4;
+		auto srcIndex = src % 4;
+		os << "e" << dstIndex << ", a" << srcIndex << std::endl;
+	}
+
+	void Instruction::h_FDIV_M(std::ostream& os) const {
+		auto dstIndex = dst % 4;
+		os << "e" << dstIndex << ", ";
+		genAddressReg(os);
+		os << std::endl;
+	}
+
+	void Instruction::h_FSQRT_R(std::ostream& os) const {
+		auto dstIndex = dst % 4;
+		os << "e" << dstIndex << std::endl;
+	}
+
+	void Instruction::h_CFROUND(std::ostream& os) const {
+		os << "r" << (int)src << ", " << (imm32 & 63) << std::endl;
+	}
+
+	static inline const char* condition(int index) {
+		switch (index)
+		{
+		case 0:
+			return "be";
+		case 1:
+			return "ab";
+		case 2:
+			return "sg";
+		case 3:
+			return "ns";
+		case 4:
+			return "of";
+		case 5:
+			return "no";
+		case 6:
+			return "lt";
+		case 7:
+			return "ge";
+		default:
+			UNREACHABLE;
+		}
+	}
+
+	void Instruction::h_COND_R(std::ostream& os) const {
+		os << "r" << (int)dst << ", " << condition((mod >> 2) & 7) << "(r" << (int)src << ", " << (int32_t)imm32 << ")" << std::endl;
+	}
+
+	void Instruction::h_COND_M(std::ostream& os) const {
+		os << "r" << (int)dst << ", " << condition((mod >> 2) & 7) << "(";
+		genAddressReg(os);
+		os << ", " << (int32_t)imm32 << ")" << std::endl;
+	}
+
+	void Instruction::h_ISTORE(std::ostream& os) const {
+		genAddressRegDst(os);
+		os << ", r" << (int)src << std::endl;
+	}
+
+	void Instruction::h_FSTORE(std::ostream& os) const {
+		const char reg = (src >= 4) ? 'e' : 'f';
+		genAddressRegDst(os);
+		auto srcIndex = src % 4;
+		os << ", " << reg << srcIndex << std::endl;
+	}
+
+	void Instruction::h_NOP(std::ostream& os) const {
+		os << std::endl;
+	}
 
 #include "instructionWeights.hpp"
 #define INST_NAME(x) REPN(#x, WT(x))
+#define INST_HANDLE(x) REPN(&Instruction::h_##x, WT(x))
 
 	const char* Instruction::names[256] = {
-		INST_NAME(ADD_64)
-		INST_NAME(ADD_32)
-		INST_NAME(SUB_64)
-		INST_NAME(SUB_32)
-		INST_NAME(MUL_64)
-		INST_NAME(MULH_64)
-		INST_NAME(MUL_32)
-		INST_NAME(IMUL_32)
-		INST_NAME(IMULH_64)
-		INST_NAME(DIV_64)
-		INST_NAME(IDIV_64)
-		INST_NAME(AND_64)
-		INST_NAME(AND_32)
-		INST_NAME(OR_64)
-		INST_NAME(OR_32)
-		INST_NAME(XOR_64)
-		INST_NAME(XOR_32)
-		INST_NAME(SHL_64)
-		INST_NAME(SHR_64)
-		INST_NAME(SAR_64)
-		INST_NAME(ROL_64)
-		INST_NAME(ROR_64)
-		INST_NAME(FPADD)
-		INST_NAME(FPSUB)
-		INST_NAME(FPMUL)
-		INST_NAME(FPDIV)
-		INST_NAME(FPSQRT)
-		INST_NAME(FPROUND)
-		INST_NAME(CALL)
-		INST_NAME(RET)
+		//Integer
+		INST_NAME(IADD_R)
+		INST_NAME(IADD_M)
+		INST_NAME(IADD_RC)
+		INST_NAME(ISUB_R)
+		INST_NAME(ISUB_M)
+		INST_NAME(IMUL_9C)
+		INST_NAME(IMUL_R)
+		INST_NAME(IMUL_M)
+		INST_NAME(IMULH_R)
+		INST_NAME(IMULH_M)
+		INST_NAME(ISMULH_R)
+		INST_NAME(ISMULH_M)
+		INST_NAME(IDIV_C)
+		INST_NAME(ISDIV_C)
+		INST_NAME(INEG_R)
+		INST_NAME(IXOR_R)
+		INST_NAME(IXOR_M)
+		INST_NAME(IROR_R)
+		INST_NAME(IROL_R)
+		INST_NAME(ISWAP_R)
+
+		//Common floating point
+		INST_NAME(FSWAP_R)
+
+		//Floating point group F
+		INST_NAME(FADD_R)
+		INST_NAME(FADD_M)
+		INST_NAME(FSUB_R)
+		INST_NAME(FSUB_M)
+		INST_NAME(FNEG_R)
+
+		//Floating point group E
+		INST_NAME(FMUL_R)
+		INST_NAME(FMUL_M)
+		INST_NAME(FDIV_R)
+		INST_NAME(FDIV_M)
+		INST_NAME(FSQRT_R)
+
+		//Control
+		INST_NAME(COND_R)
+		INST_NAME(COND_M)
+		INST_NAME(CFROUND)
+
+		INST_NAME(ISTORE)
+		INST_NAME(FSTORE)
+
+		INST_NAME(NOP)
+	};
+
+	InstructionVisualizer Instruction::engine[256] = {
+		//Integer
+		INST_HANDLE(IADD_R)
+		INST_HANDLE(IADD_M)
+		INST_HANDLE(IADD_RC)
+		INST_HANDLE(ISUB_R)
+		INST_HANDLE(ISUB_M)
+		INST_HANDLE(IMUL_9C)
+		INST_HANDLE(IMUL_R)
+		INST_HANDLE(IMUL_M)
+		INST_HANDLE(IMULH_R)
+		INST_HANDLE(IMULH_M)
+		INST_HANDLE(ISMULH_R)
+		INST_HANDLE(ISMULH_M)
+		INST_HANDLE(IDIV_C)
+		INST_HANDLE(ISDIV_C)
+		INST_HANDLE(INEG_R)
+		INST_HANDLE(IXOR_R)
+		INST_HANDLE(IXOR_M)
+		INST_HANDLE(IROR_R)
+		INST_HANDLE(IROL_R)
+		INST_HANDLE(ISWAP_R)
+
+		//Common floating point
+		INST_HANDLE(FSWAP_R)
+
+		//Floating point group F
+		INST_HANDLE(FADD_R)
+		INST_HANDLE(FADD_M)
+		INST_HANDLE(FSUB_R)
+		INST_HANDLE(FSUB_M)
+		INST_HANDLE(FNEG_R)
+
+		//Floating point group E
+		INST_HANDLE(FMUL_R)
+		INST_HANDLE(FMUL_M)
+		INST_HANDLE(FDIV_R)
+		INST_HANDLE(FDIV_M)
+		INST_HANDLE(FSQRT_R)
+
+		//Control
+		INST_HANDLE(COND_R)
+		INST_HANDLE(COND_M)
+		INST_HANDLE(CFROUND)
+
+		INST_HANDLE(ISTORE)
+		INST_HANDLE(FSTORE)
+
+		INST_HANDLE(NOP)
 	};
 
 }
\ No newline at end of file
diff --git a/src/Instruction.hpp b/src/Instruction.hpp
index 33c2059..5cfd833 100644
--- a/src/Instruction.hpp
+++ b/src/Instruction.hpp
@@ -24,21 +24,57 @@ along with RandomX.  If not, see<http://www.gnu.org/licenses/>.
 
 namespace RandomX {
 
+	class Instruction;
+
+	typedef void(Instruction::*InstructionVisualizer)(std::ostream&) const;
+
+	namespace InstructionType {
+		constexpr int IADD_R = 0;
+		constexpr int IADD_M = 1;
+		constexpr int IADD_RC = 2;
+		constexpr int ISUB_R = 3;
+		constexpr int ISUB_M = 4;
+		constexpr int IMUL_9C = 5;
+		constexpr int IMUL_R = 6;
+		constexpr int IMUL_M = 7;
+		constexpr int IMULH_R = 8;
+		constexpr int IMULH_M = 9;
+		constexpr int ISMULH_R = 10;
+		constexpr int ISMULH_M = 11;
+		constexpr int IDIV_C = 12;
+		constexpr int ISDIV_C = 13;
+		constexpr int INEG_R = 14;
+		constexpr int IXOR_R = 15;
+		constexpr int IXOR_M = 16;
+		constexpr int IROR_R = 17;
+		constexpr int IROL_R = 18;
+		constexpr int ISWAP_R = 19;
+		constexpr int FSWAP_R = 20;
+		constexpr int FADD_R = 21;
+		constexpr int FADD_M = 22;
+		constexpr int FSUB_R = 23;
+		constexpr int FSUB_M = 24;
+		constexpr int FNEG_R = 25;
+		constexpr int FMUL_R = 26;
+		constexpr int FMUL_M = 27;
+		constexpr int FDIV_R = 28;
+		constexpr int FDIV_M = 29;
+		constexpr int FSQRT_R = 30;
+		constexpr int COND_R = 31;
+		constexpr int COND_M = 32;
+		constexpr int CFROUND = 33;
+		constexpr int ISTORE = 34;
+		constexpr int FSTORE = 35;
+		constexpr int NOP = 36;
+	}
+
 	class Instruction {
 	public:
 		uint8_t opcode;
-		uint8_t loca;
-		uint8_t rega;
-		uint8_t locb;
-		uint8_t regb;
-		uint8_t locc;
-		uint8_t regc;
-		uint8_t imm8;
-		int32_t addra;
-		union {
-			uint32_t addrc;
-			int32_t imm32;
-		};
+		uint8_t dst;
+		uint8_t src;
+		uint8_t mod;
+		uint32_t imm32;
 		const char* getName() const {
 			return names[opcode];
 		}
@@ -49,8 +85,51 @@ namespace RandomX {
 	private:
 		void print(std::ostream&) const;
 		static const char* names[256];
+		static InstructionVisualizer engine[256];
+
+		void genAddressReg(std::ostream& os) const;
+		void genAddressImm(std::ostream& os) const;
+		void genAddressRegDst(std::ostream&) const;
+
+		void  h_IADD_R(std::ostream&) const;
+		void  h_IADD_M(std::ostream&) const;
+		void  h_IADD_RC(std::ostream&) const;
+		void  h_ISUB_R(std::ostream&) const;
+		void  h_ISUB_M(std::ostream&) const;
+		void  h_IMUL_9C(std::ostream&) const;
+		void  h_IMUL_R(std::ostream&) const;
+		void  h_IMUL_M(std::ostream&) const;
+		void  h_IMULH_R(std::ostream&) const;
+		void  h_IMULH_M(std::ostream&) const;
+		void  h_ISMULH_R(std::ostream&) const;
+		void  h_ISMULH_M(std::ostream&) const;
+		void  h_IDIV_C(std::ostream&) const;
+		void  h_ISDIV_C(std::ostream&) const;
+		void  h_INEG_R(std::ostream&) const;
+		void  h_IXOR_R(std::ostream&) const;
+		void  h_IXOR_M(std::ostream&) const;
+		void  h_IROR_R(std::ostream&) const;
+		void  h_IROL_R(std::ostream&) const;
+		void  h_ISWAP_R(std::ostream&) const;
+		void  h_FSWAP_R(std::ostream&) const;
+		void  h_FADD_R(std::ostream&) const;
+		void  h_FADD_M(std::ostream&) const;
+		void  h_FSUB_R(std::ostream&) const;
+		void  h_FSUB_M(std::ostream&) const;
+		void  h_FNEG_R(std::ostream&) const;
+		void  h_FMUL_R(std::ostream&) const;
+		void  h_FMUL_M(std::ostream&) const;
+		void  h_FDIV_R(std::ostream&) const;
+		void  h_FDIV_M(std::ostream&) const;
+		void  h_FSQRT_R(std::ostream&) const;
+		void  h_COND_R(std::ostream&) const;
+		void  h_COND_M(std::ostream&) const;
+		void  h_CFROUND(std::ostream&) const;
+		void  h_ISTORE(std::ostream&) const;
+		void  h_FSTORE(std::ostream&) const;
+		void  h_NOP(std::ostream&) const;
 	};
 
-	static_assert(sizeof(Instruction) == 16, "Invalid alignment of struct Instruction");
+	static_assert(sizeof(Instruction) == 8, "Invalid size of struct Instruction");
 
 }
\ No newline at end of file
diff --git a/src/InterpretedVirtualMachine.cpp b/src/InterpretedVirtualMachine.cpp
index c436ef7..c5a6d53 100644
--- a/src/InterpretedVirtualMachine.cpp
+++ b/src/InterpretedVirtualMachine.cpp
@@ -19,16 +19,21 @@ along with RandomX.  If not, see<http://www.gnu.org/licenses/>.
 //#define TRACE
 //#define FPUCHECK
 #include "InterpretedVirtualMachine.hpp"
-#include "Pcg32.hpp"
-#include "instructions.hpp"
+#include "dataset.hpp"
+#include "Cache.hpp"
+#include "LightClientAsyncWorker.hpp"
 #include <iostream>
 #include <iomanip>
 #include <stdexcept>
 #include <sstream>
 #include <cmath>
+#include <cfloat>
+#include <thread>
+#include "intrinPortable.h"
 #ifdef STATS
 #include <algorithm>
 #endif
+#include "divideByConstantCodegen.h"
 
 #ifdef FPUCHECK
 constexpr bool fpuCheck = true;
@@ -38,350 +43,701 @@ constexpr bool fpuCheck = false;
 
 namespace RandomX {
 
-	void InterpretedVirtualMachine::initializeProgram(const void* seed) {
-		Pcg32 gen(seed);
-		for (unsigned i = 0; i < sizeof(reg) / sizeof(Pcg32::result_type); ++i) {
-			*(((uint32_t*)&reg) + i) = gen();
+	InterpretedVirtualMachine::~InterpretedVirtualMachine() {
+		if (asyncWorker) {
+			delete mem.ds.asyncWorker;
 		}
-		FPINIT();
-		for (int i = 0; i < RegistersCount; ++i) {
-			reg.f[i].lo.f64 = (double)reg.f[i].lo.i64;
-			reg.f[i].hi.f64 = (double)reg.f[i].hi.i64;
+	}
+
+	void InterpretedVirtualMachine::setDataset(dataset_t ds) {
+		if (asyncWorker) {
+			if (softAes) {
+				mem.ds.asyncWorker = new LightClientAsyncWorker<true>(ds.cache);
+			}
+			else {
+				mem.ds.asyncWorker = new LightClientAsyncWorker<false>(ds.cache);
+			}
+			readDataset = &datasetReadLightAsync;
+		}
+		else {
+			mem.ds = ds;
+			readDataset = &datasetReadLight;
+		}
+	}
+
+	void InterpretedVirtualMachine::initialize() {
+		VirtualMachine::initialize();
+		for (unsigned i = 0; i < ProgramLength; ++i) {
+			program(i).src %= RegistersCount;
+			program(i).dst %= RegistersCount;
+		}
+	}
+
+	template<int N>
+	void InterpretedVirtualMachine::executeBytecode(int_reg_t(&r)[8], __m128d (&f)[4], __m128d (&e)[4], __m128d (&a)[4]) {
+		executeBytecode(N, r, f, e, a);
+		executeBytecode<N + 1>(r, f, e, a);
+	}
+
+	template<>
+	void InterpretedVirtualMachine::executeBytecode<ProgramLength>(int_reg_t(&r)[8], __m128d (&f)[4], __m128d (&e)[4], __m128d (&a)[4]) {
+	}
+
+	FORCE_INLINE void InterpretedVirtualMachine::executeBytecode(int i, int_reg_t(&r)[8], __m128d (&f)[4], __m128d (&e)[4], __m128d (&a)[4]) {
+		auto& ibc = byteCode[i];
+		switch (ibc.type)
+		{
+			case InstructionType::IADD_R: {
+				*ibc.idst += *ibc.isrc;
+			} break;
+
+			case InstructionType::IADD_M: {
+				*ibc.idst += load64(scratchpad + (*ibc.isrc & ibc.memMask));
+			} break;
+
+			case InstructionType::IADD_RC: {
+				*ibc.idst += *ibc.isrc + ibc.imm;
+			} break;
+
+			case InstructionType::ISUB_R: {
+				*ibc.idst -= *ibc.isrc;
+			} break;
+
+			case InstructionType::ISUB_M: {
+				*ibc.idst -= load64(scratchpad + (*ibc.isrc & ibc.memMask));
+			} break;
+
+			case InstructionType::IMUL_9C: {
+				*ibc.idst += 8 * *ibc.idst + ibc.imm; //dst = 9 * dst + imm
+			} break;
+
+			case InstructionType::IMUL_R: {
+				*ibc.idst *= *ibc.isrc;
+			} break;
+
+			case InstructionType::IMUL_M: {
+				*ibc.idst *= load64(scratchpad + (*ibc.isrc & ibc.memMask));
+			} break;
+
+			case InstructionType::IMULH_R: {
+				*ibc.idst = mulh(*ibc.idst, *ibc.isrc);
+			} break;
+
+			case InstructionType::IMULH_M: {
+				*ibc.idst = mulh(*ibc.idst, load64(scratchpad + (*ibc.isrc & ibc.memMask)));
+			} break;
+
+			case InstructionType::ISMULH_R: {
+				*ibc.idst = smulh(unsigned64ToSigned2sCompl(*ibc.idst), unsigned64ToSigned2sCompl(*ibc.isrc));
+			} break;
+
+			case InstructionType::ISMULH_M: {
+				*ibc.idst = smulh(unsigned64ToSigned2sCompl(*ibc.idst), unsigned64ToSigned2sCompl(load64(scratchpad + (*ibc.isrc & ibc.memMask))));
+			} break;
+
+			case InstructionType::IDIV_C: {
+				if (ibc.signedMultiplier != 0) {
+					int_reg_t dividend = *ibc.idst;
+					int_reg_t quotient = dividend >> ibc.preShift;
+					if (ibc.increment) {
+						quotient = quotient == UINT64_MAX ? UINT64_MAX : quotient + 1;
+					}
+					quotient = mulh(quotient, ibc.signedMultiplier);
+					quotient >>= ibc.postShift;
+					*ibc.idst += quotient;
+				}
+				else {
+					*ibc.idst += *ibc.idst >> ibc.shift;
+				}
+			} break;
+
+			case InstructionType::ISDIV_C: {
+				//signed division by a constant is not implemented here; this case is a no-op
+			} break;
+
+			case InstructionType::INEG_R: {
+				*ibc.idst = ~(*ibc.idst) + 1; //two's complement negative
+			} break;
+
+			case InstructionType::IXOR_R: {
+				*ibc.idst ^= *ibc.isrc;
+			} break;
+
+			case InstructionType::IXOR_M: {
+				*ibc.idst ^= load64(scratchpad + (*ibc.isrc & ibc.memMask));
+			} break;
+
+			case InstructionType::IROR_R: {
+				*ibc.idst = rotr(*ibc.idst, *ibc.isrc & 63);
+			} break;
+
+			case InstructionType::IROL_R: {
+				*ibc.idst = rotl(*ibc.idst, *ibc.isrc & 63);
+			} break;
+
+			case InstructionType::ISWAP_R: {
+				int_reg_t temp = *ibc.isrc;
+				*ibc.isrc = *ibc.idst;
+				*ibc.idst = temp;
+			} break;
+
+			case InstructionType::FSWAP_R: {
+				*ibc.fdst = _mm_shuffle_pd(*ibc.fdst, *ibc.fdst, 1);
+			} break;
+
+			case InstructionType::FADD_R: {
+				*ibc.fdst = _mm_add_pd(*ibc.fdst, *ibc.fsrc);
+			} break;
+
+			case InstructionType::FADD_M: {
+				__m128d fsrc = load_cvt_i32x2(scratchpad + (*ibc.isrc & ibc.memMask));
+				*ibc.fdst = _mm_add_pd(*ibc.fdst, fsrc);
+			} break;
+
+			case InstructionType::FSUB_R: {
+				*ibc.fdst = _mm_sub_pd(*ibc.fdst, *ibc.fsrc);
+			} break;
+
+			case InstructionType::FSUB_M: {
+				__m128d fsrc = load_cvt_i32x2(scratchpad + (*ibc.isrc & ibc.memMask));
+				*ibc.fdst = _mm_sub_pd(*ibc.fdst, fsrc);
+			} break;
+
+			case InstructionType::FNEG_R: {
+				const __m128d signMask = _mm_castsi128_pd(_mm_set1_epi64x(1ULL << 63));
+				*ibc.fdst = _mm_xor_pd(*ibc.fdst, signMask);
+			} break;
+
+			case InstructionType::FMUL_R: {
+				*ibc.fdst = _mm_mul_pd(*ibc.fdst, *ibc.fsrc);
+			} break;
+
+			case InstructionType::FDIV_M: {
+				__m128d fsrc = load_cvt_i32x2(scratchpad + (*ibc.isrc & ibc.memMask));
+				__m128d fdst = _mm_div_pd(*ibc.fdst, fsrc);
+				*ibc.fdst = _mm_max_pd(fdst, _mm_set_pd(DBL_MIN, DBL_MIN));
+			} break;
+
+			case InstructionType::FSQRT_R: {
+				*ibc.fdst = _mm_sqrt_pd(*ibc.fdst);
+			} break;
+
+			case InstructionType::COND_R: {
+				*ibc.idst += condition(*ibc.isrc, ibc.imm, ibc.condition) ? 1 : 0;
+			} break;
+
+			case InstructionType::COND_M: {
+				*ibc.idst += condition(load64(scratchpad + (*ibc.isrc & ibc.memMask)), ibc.imm, ibc.condition) ? 1 : 0;
+			} break;
+
+			case InstructionType::CFROUND: {
+				setRoundMode(rotr(*ibc.isrc, ibc.imm) % 4);
+			} break;
+
+			case InstructionType::ISTORE: {
+				store64(scratchpad + (*ibc.idst & ibc.memMask), *ibc.isrc);
+			} break;
+
+			case InstructionType::NOP: {
+				//nothing
+			} break;
+
+			default:
+				UNREACHABLE;
 		}
-		//std::cout << reg;
-		p.initialize(gen);
-		mem.ma = (gen() ^ *(((uint32_t*)seed) + 4)) & ~7;
-		mem.mx = *(((uint32_t*)seed) + 5);
-		pc = 0;
-		ic = InstructionCount;
-		stack.clear();
 	}
 
 	void InterpretedVirtualMachine::execute() {
-		while (ic > 0) {
-#ifdef STATS
-			count_instructions[pc]++;
-#endif
-			auto& inst = p(pc);
-			if(trace) std::cout << inst.getName() << " (" << std::dec << pc << ")" << std::endl;
-			pc = (pc + 1) % ProgramLength;
-			auto handler = engine[inst.opcode];
-			(this->*handler)(inst);
-			ic--;
-		}
-#ifdef STATS
-		count_endstack += stack.size();
-#endif
-	}
+		int_reg_t r[8] = { 0, 0, 0, 0, 0, 0, 0, 0 };
+		__m128d f[4];
+		__m128d e[4];
+		__m128d a[4];
 
-	convertible_t InterpretedVirtualMachine::loada(Instruction& inst) {
-		convertible_t& rega = reg.r[inst.rega % RegistersCount];
-		rega.i64 ^= inst.addra; //sign-extend addra
-		addr_t addr = rega.u32;
-		switch (inst.loca & 7)
-		{
-			case 0:
-			case 1:
-			case 2:
-			case 3:
-				return readDataset(addr, mem);
+		a[0] = _mm_load_pd(&reg.a[0].lo);
+		a[1] = _mm_load_pd(&reg.a[1].lo);
+		a[2] = _mm_load_pd(&reg.a[2].lo);
+		a[3] = _mm_load_pd(&reg.a[3].lo);
 
-			case 4:
-				return scratchpad[addr % ScratchpadL2];
+		precompileProgram(r, f, e, a);
 
-			case 5:
-			case 6:
-			case 7:
-				return scratchpad[addr % ScratchpadL1];
-		}
-	}
+		uint32_t spAddr0 = mem.mx;
+		uint32_t spAddr1 = mem.ma;
 
-	convertible_t InterpretedVirtualMachine::loadbr1(Instruction& inst) {
-		switch (inst.locb & 7)
-		{
-			case 0:
-			case 1:
-			case 2:
-			case 3:
-			case 4:
-			case 5:
-				return reg.r[inst.regb % RegistersCount];
-			case 6:
-			case 7:
-				convertible_t temp;
-				temp.i64 = inst.imm32; //sign-extend imm32
-				return temp;
-		}
-	}
+		for (unsigned iter = 0; iter < InstructionCount; ++iter) {
+			//std::cout << "Iteration " << iter << std::endl;
+			spAddr0 ^= r[readReg0];
+			spAddr0 &= ScratchpadL3Mask64;
+
+			r[0] ^= load64(scratchpad + spAddr0 + 0);
+			r[1] ^= load64(scratchpad + spAddr0 + 8);
+			r[2] ^= load64(scratchpad + spAddr0 + 16);
+			r[3] ^= load64(scratchpad + spAddr0 + 24);
+			r[4] ^= load64(scratchpad + spAddr0 + 32);
+			r[5] ^= load64(scratchpad + spAddr0 + 40);
+			r[6] ^= load64(scratchpad + spAddr0 + 48);
+			r[7] ^= load64(scratchpad + spAddr0 + 56);
 
-	convertible_t InterpretedVirtualMachine::loadbr0(Instruction& inst) {
-		switch (inst.locb & 7)
-		{
-			case 0:
-			case 1:
-			case 2:
-			case 3:
-				return reg.r[inst.regb % RegistersCount];
-			case 4:
-			case 5:
-			case 6:
-			case 7:
-				convertible_t temp;
-				temp.u64 = inst.imm8;
-				return temp;
-		}
-	}
+			spAddr1 ^= r[readReg1];
+			spAddr1 &= ScratchpadL3Mask64;
 
-	convertible_t& InterpretedVirtualMachine::getcr(Instruction& inst) {
-		addr_t addr;
-		switch (inst.locc & 7)
-		{
-			case 0:
-				addr = reg.r[inst.regc % RegistersCount].u32 ^ inst.addrc;
-				return scratchpad[addr % ScratchpadL2];
+			f[0] = load_cvt_i32x2(scratchpad + spAddr1 + 0);
+			f[1] = load_cvt_i32x2(scratchpad + spAddr1 + 8);
+			f[2] = load_cvt_i32x2(scratchpad + spAddr1 + 16);
+			f[3] = load_cvt_i32x2(scratchpad + spAddr1 + 24);
+			e[0] = _mm_abs(load_cvt_i32x2(scratchpad + spAddr1 + 32));
+			e[1] = _mm_abs(load_cvt_i32x2(scratchpad + spAddr1 + 40));
+			e[2] = _mm_abs(load_cvt_i32x2(scratchpad + spAddr1 + 48));
+			e[3] = _mm_abs(load_cvt_i32x2(scratchpad + spAddr1 + 56));
 
-			case 1:
-			case 2:
-			case 3:
-				addr = reg.r[inst.regc % RegistersCount].u32 ^ inst.addrc;
-				return scratchpad[addr % ScratchpadL1];
+			executeBytecode<0>(r, f, e, a);
 
-			case 4:
-			case 5:
-			case 6:
-			case 7:
-				return reg.r[inst.regc % RegistersCount];
-		}
-	}
-
-	void InterpretedVirtualMachine::writecf(Instruction& inst, fpu_reg_t& regc) {
-		addr_t addr;
-		switch (inst.locc & 7)
-		{
-			case 4:
-				addr = reg.r[inst.regc % RegistersCount].u32 ^ inst.addrc;
-				scratchpad[addr % ScratchpadL2] = (inst.locc & 8) ? regc.hi : regc.lo;
-				break;
-
-			case 5:
-			case 6:
-			case 7:
-				addr = reg.r[inst.regc % RegistersCount].u32 ^ inst.addrc;
-				scratchpad[addr % ScratchpadL1] = (inst.locc & 8) ? regc.hi : regc.lo;
-
-			default:
-				break;
-		}
-	}
-
-	void InterpretedVirtualMachine::writecflo(Instruction& inst, fpu_reg_t& regc) {
-		addr_t addr;
-		switch (inst.locc & 7)
-		{
-			case 4:
-				addr = reg.r[inst.regc % RegistersCount].u32 ^ inst.addrc;
-				scratchpad[addr % ScratchpadL2] = regc.lo;
-				break;
-
-			case 5:
-			case 6:
-			case 7:
-				addr = reg.r[inst.regc % RegistersCount].u32 ^ inst.addrc;
-				scratchpad[addr % ScratchpadL1] = regc.lo;
-
-			default:
-				break;
-		}
-	}
-
-#define ALU_RETIRE(x) x(a, b, c); \
-	if(trace) std::cout << std::hex << /*a.u64 << " " << b.u64 << " " <<*/ c.u64 << std::endl;
-
-#define FPU_RETIRE(x) x(a, b, c); \
-	writecf(inst, c); \
-	if(trace) { \
-		std::cout << std::hex << ((inst.locc & 8) ? c.hi.u64 : c.lo.u64) << std::endl; \
-	} \
-	if(fpuCheck) { \
-		if(c.hi.f64 != c.hi.f64 || c.lo.f64 != c.lo.f64)  { \
-			std::stringstream ss; \
-			ss << "NaN result of " << #x << "(" << std::hex << a.u64 << ", " << b.hi.u64 << " " << b.lo.u64 << ") = " << c.hi.u64 << " " << c.lo.u64 << std::endl; \
-			throw std::runtime_error(ss.str()); \
-		} else if (std::fpclassify(c.hi.f64) == FP_SUBNORMAL || std::fpclassify(c.lo.f64) == FP_SUBNORMAL) {\
-			std::stringstream ss; \
-			ss << "Denormal result of " << #x << "(" << std::hex << a.u64 << ", " << b.hi.u64 << " " << b.lo.u64 << ") = " << c.hi.u64 << " " << c.lo.u64 << std::endl; \
-			throw std::runtime_error(ss.str()); \
-		} \
-	}
-
-#ifdef STATS
-#define INC_COUNT(x) count_##x++;
-#else
-#define INC_COUNT(x)
-#endif
-
-#define FPU_RETIRE_FPSQRT(x) FPSQRT(a, b, c); \
-	writecf(inst, c); \
-	if(trace) std::cout << std::hex << ((inst.locc & 8) ? c.hi.u64 : c.lo.u64) << std::endl;
-
-#define FPU_RETIRE_FPROUND(x) FPROUND(a, b, c); \
-	writecflo(inst, c); \
-	if(trace) std::cout << std::hex << c.lo.u64 << std::endl;
-
-#define ALU_INST(x) void InterpretedVirtualMachine::h_##x(Instruction& inst) { \
-	INC_COUNT(x) \
-	convertible_t a = loada(inst); \
-	convertible_t b = loadbr1(inst); \
-	convertible_t& c = getcr(inst); \
-	ALU_RETIRE(x) \
-	}
-
-#define ALU_INST_SR(x) void InterpretedVirtualMachine::h_##x(Instruction& inst) { \
-	INC_COUNT(x) \
-	convertible_t a = loada(inst); \
-	convertible_t b = loadbr0(inst); \
-	convertible_t& c = getcr(inst); \
-	ALU_RETIRE(x) \
-	}
-
-#define FPU_INST(x) void InterpretedVirtualMachine::h_##x(Instruction& inst) { \
-	INC_COUNT(x) \
-	convertible_t a = loada(inst); \
-	fpu_reg_t& b = reg.f[inst.regb % RegistersCount]; \
-	fpu_reg_t& c = reg.f[inst.regc % RegistersCount]; \
-	FPU_RETIRE(x) \
-	}
-
-#define FPU_INST_NB(x) void InterpretedVirtualMachine::h_##x(Instruction& inst) { \
-	INC_COUNT(x) \
-	convertible_t a = loada(inst); \
-	fpu_reg_t b; \
-	fpu_reg_t& c = reg.f[inst.regc % RegistersCount]; \
-	FPU_RETIRE_##x(x) \
-	}
-
-	ALU_INST(ADD_64)
-	ALU_INST(ADD_32)
-	ALU_INST(SUB_64)
-	ALU_INST(SUB_32)
-	ALU_INST(MUL_64)
-	ALU_INST(MULH_64)
-	ALU_INST(MUL_32)
-	ALU_INST(IMUL_32)
-	ALU_INST(IMULH_64)
-	ALU_INST(DIV_64)
-	ALU_INST(IDIV_64)
-	ALU_INST(AND_64)
-	ALU_INST(AND_32)
-	ALU_INST(OR_64)
-	ALU_INST(OR_32)
-	ALU_INST(XOR_64)
-	ALU_INST(XOR_32)
-
-	ALU_INST_SR(SHL_64)
-	ALU_INST_SR(SHR_64)
-	ALU_INST_SR(SAR_64)
-	ALU_INST_SR(ROL_64)
-	ALU_INST_SR(ROR_64)
-
-	FPU_INST(FPADD)
-	FPU_INST(FPSUB)
-	FPU_INST(FPMUL)
-	FPU_INST(FPDIV)
-
-	FPU_INST_NB(FPSQRT)
-	FPU_INST_NB(FPROUND)
-
-	void InterpretedVirtualMachine::h_CALL(Instruction& inst) {
-		convertible_t a = loada(inst);
-		if (JMP_COND(inst.locb, reg.r[inst.regb % RegistersCount], inst.imm32)) {
-#ifdef STATS
-			count_CALL_taken++;
-			count_jump_taken[inst.locb & 7]++;
-			count_retdepth = std::max(0, count_retdepth - 1);
-#endif
-			stackPush(a);
-			stackPush(pc);
-#ifdef STATS
-			count_max_stack = std::max(count_max_stack, (int)stack.size());
-#endif
-			pc += (inst.imm8 & 127) + 1;
-			pc = pc % ProgramLength;
-			if (trace) std::cout << std::hex << a.u64 << std::endl;
-		}
-		else {
-			convertible_t& c = getcr(inst);
-#ifdef STATS
-			count_CALL_not_taken++;
-			count_jump_not_taken[inst.locb & 7]++;
-#endif
-			c.u64 = a.u64;
-			if (trace) std::cout << std::hex << /*a.u64 << " " <<*/ c.u64 << std::endl;
-		}
-	}
-
-	void InterpretedVirtualMachine::h_RET(Instruction& inst) {
-		convertible_t a = loada(inst);
-		convertible_t b = loadbr1(inst);
-		convertible_t& c = getcr(inst);
-		if (stack.size() > 0) {
-#ifdef STATS
-			count_RET_taken++;
-			count_retdepth++;
-			count_retdepth_max = std::max(count_retdepth_max, count_retdepth);
-#endif
-			auto raddr = stackPopAddress();
-			auto retval = stackPopValue();
-			c.u64 = a.u64 ^ retval.u64;
-			pc = raddr;
-		}
-		else {
-#ifdef STATS
-			if (stack.size() == 0)
-				count_RET_stack_empty++;
-			else {
-				count_RET_not_taken++;
-				count_jump_not_taken[inst.locb & 7]++;
+			if (asyncWorker) {
+				ILightClientAsyncWorker* aw = mem.ds.asyncWorker;
+				const uint64_t* datasetLine = aw->getBlock(mem.ma);
+				for (int i = 0; i < RegistersCount; ++i)
+					r[i] ^= datasetLine[i];
+				mem.mx ^= r[readReg2] ^ r[readReg3];
+				mem.mx &= CacheLineAlignMask; //align to cache line
+				std::swap(mem.mx, mem.ma);
+				aw->prepareBlock(mem.ma);
 			}
-#endif
-			c.u64 = a.u64;
+			else {
+				mem.mx ^= r[readReg2] ^ r[readReg3];
+				mem.mx &= CacheLineAlignMask;
+				Cache* cache = mem.ds.cache;
+				uint64_t datasetLine[CacheLineSize / sizeof(uint64_t)];
+				initBlock(cache->getCache(), (uint8_t*)datasetLine, mem.ma / CacheLineSize, cache->getKeys());
+				for (int i = 0; i < RegistersCount; ++i)
+					r[i] ^= datasetLine[i];
+				std::swap(mem.mx, mem.ma);
+			}
+
+			store64(scratchpad + spAddr1 + 0, r[0]);
+			store64(scratchpad + spAddr1 + 8, r[1]);
+			store64(scratchpad + spAddr1 + 16, r[2]);
+			store64(scratchpad + spAddr1 + 24, r[3]);
+			store64(scratchpad + spAddr1 + 32, r[4]);
+			store64(scratchpad + spAddr1 + 40, r[5]);
+			store64(scratchpad + spAddr1 + 48, r[6]);
+			store64(scratchpad + spAddr1 + 56, r[7]);
+
+			_mm_store_pd((double*)(scratchpad + spAddr0 + 0), _mm_mul_pd(f[0], e[0]));
+			_mm_store_pd((double*)(scratchpad + spAddr0 + 16), _mm_mul_pd(f[1], e[1]));
+			_mm_store_pd((double*)(scratchpad + spAddr0 + 32), _mm_mul_pd(f[2], e[2]));
+			_mm_store_pd((double*)(scratchpad + spAddr0 + 48), _mm_mul_pd(f[3], e[3]));
+
+			spAddr0 = 0;
+			spAddr1 = 0;
 		}
-		if (trace) std::cout << std::hex << /*a.u64 << " " <<*/ c.u64 << std::endl;
+
+		store64(&reg.r[0], r[0]);
+		store64(&reg.r[1], r[1]);
+		store64(&reg.r[2], r[2]);
+		store64(&reg.r[3], r[3]);
+		store64(&reg.r[4], r[4]);
+		store64(&reg.r[5], r[5]);
+		store64(&reg.r[6], r[6]);
+		store64(&reg.r[7], r[7]);
+
+		_mm_store_pd(&reg.f[0].lo, f[0]);
+		_mm_store_pd(&reg.f[1].lo, f[1]);
+		_mm_store_pd(&reg.f[2].lo, f[2]);
+		_mm_store_pd(&reg.f[3].lo, f[3]);
+		_mm_store_pd(&reg.e[0].lo, e[0]);
+		_mm_store_pd(&reg.e[1].lo, e[1]);
+		_mm_store_pd(&reg.e[2].lo, e[2]);
+		_mm_store_pd(&reg.e[3].lo, e[3]);
 	}
 
 #include "instructionWeights.hpp"
-#define INST_HANDLE(x) REPN(&InterpretedVirtualMachine::h_##x, WT(x))
 
-	InstructionHandler InterpretedVirtualMachine::engine[256] = {
-		INST_HANDLE(ADD_64)
-		INST_HANDLE(ADD_32)
-		INST_HANDLE(SUB_64)
-		INST_HANDLE(SUB_32)
-		INST_HANDLE(MUL_64)
-		INST_HANDLE(MULH_64)
-		INST_HANDLE(MUL_32)
-		INST_HANDLE(IMUL_32)
-		INST_HANDLE(IMULH_64)
-		INST_HANDLE(DIV_64)
-		INST_HANDLE(IDIV_64)
-		INST_HANDLE(AND_64)
-		INST_HANDLE(AND_32)
-		INST_HANDLE(OR_64)
-		INST_HANDLE(OR_32)
-		INST_HANDLE(XOR_64)
-		INST_HANDLE(XOR_32)
-		INST_HANDLE(SHL_64)
-		INST_HANDLE(SHR_64)
-		INST_HANDLE(SAR_64)
-		INST_HANDLE(ROL_64)
-		INST_HANDLE(ROR_64)
-		INST_HANDLE(FPADD)
-		INST_HANDLE(FPSUB)
-		INST_HANDLE(FPMUL)
-		INST_HANDLE(FPDIV)
-		INST_HANDLE(FPSQRT)
-		INST_HANDLE(FPROUND)
-		INST_HANDLE(CALL)
-		INST_HANDLE(RET)
-	};
+	void InterpretedVirtualMachine::precompileProgram(int_reg_t(&r)[8], __m128d (&f)[4], __m128d (&e)[4], __m128d (&a)[4]) {
+		for (unsigned i = 0; i < ProgramLength; ++i) {
+			auto& instr = program(i);
+			auto& ibc = byteCode[i];
+			switch (instr.opcode) {
+				CASE_REP(IADD_R) {
+					auto dst = instr.dst % RegistersCount;
+					auto src = instr.src % RegistersCount;
+					ibc.type = InstructionType::IADD_R;
+					ibc.idst = &r[dst];
+					if (src != dst) {
+						ibc.isrc = &r[src];
+					}
+					else {
+						ibc.imm = signExtend2sCompl(instr.imm32);
+						ibc.isrc = &ibc.imm;
+					}
+				} break;
+
+				CASE_REP(IADD_M) {
+					auto dst = instr.dst % RegistersCount;
+					auto src = instr.src % RegistersCount;
+					ibc.type = InstructionType::IADD_M;
+					ibc.idst = &r[dst];
+					if (src != dst) {
+						ibc.isrc = &r[src];
+						ibc.memMask = ((instr.mod % 4) ? ScratchpadL1Mask : ScratchpadL2Mask);
+					}
+					else {
+						ibc.imm = instr.imm32;
+						ibc.isrc = &ibc.imm;
+						ibc.memMask = ScratchpadL3Mask;
+					}
+				} break;
+
+				CASE_REP(IADD_RC) {
+					auto dst = instr.dst % RegistersCount;
+					auto src = instr.src % RegistersCount;
+					ibc.type = InstructionType::IADD_RC;
+					ibc.idst = &r[dst];
+					ibc.isrc = &r[src];
+					ibc.imm = signExtend2sCompl(instr.imm32);
+				} break;
+
+				CASE_REP(ISUB_R) {
+					auto dst = instr.dst % RegistersCount;
+					auto src = instr.src % RegistersCount;
+					ibc.type = InstructionType::ISUB_R;
+					ibc.idst = &r[dst];
+					if (src != dst) {
+						ibc.isrc = &r[src];
+					}
+					else {
+						ibc.imm = signExtend2sCompl(instr.imm32);
+						ibc.isrc = &ibc.imm;
+					}
+				} break;
+
+				CASE_REP(ISUB_M) {
+					auto dst = instr.dst % RegistersCount;
+					auto src = instr.src % RegistersCount;
+					ibc.type = InstructionType::ISUB_M;
+					ibc.idst = &r[dst];
+					if (src != dst) {
+						ibc.isrc = &r[src];
+						ibc.memMask = ((instr.mod % 4) ? ScratchpadL1Mask : ScratchpadL2Mask);
+					}
+					else {
+						ibc.imm = instr.imm32;
+						ibc.isrc = &ibc.imm;
+						ibc.memMask = ScratchpadL3Mask;
+					}
+				} break;
+
+				CASE_REP(IMUL_9C) {
+					auto dst = instr.dst % RegistersCount;
+					ibc.type = InstructionType::IMUL_9C;
+					ibc.idst = &r[dst];
+					ibc.imm = signExtend2sCompl(instr.imm32);
+				} break;
+
+				CASE_REP(IMUL_R) {
+					auto dst = instr.dst % RegistersCount;
+					auto src = instr.src % RegistersCount;
+					ibc.type = InstructionType::IMUL_R;
+					ibc.idst = &r[dst];
+					if (src != dst) {
+						ibc.isrc = &r[src];
+					}
+					else {
+						ibc.imm = signExtend2sCompl(instr.imm32);
+						ibc.isrc = &ibc.imm;
+					}
+				} break;
+
+				CASE_REP(IMUL_M) {
+					auto dst = instr.dst % RegistersCount;
+					auto src = instr.src % RegistersCount;
+					ibc.type = InstructionType::IMUL_M;
+					ibc.idst = &r[dst];
+					if (src != dst) {
+						ibc.isrc = &r[src];
+						ibc.memMask = ((instr.mod % 4) ? ScratchpadL1Mask : ScratchpadL2Mask);
+					}
+					else {
+						ibc.imm = instr.imm32;
+						ibc.isrc = &ibc.imm;
+						ibc.memMask = ScratchpadL3Mask;
+					}
+				} break;
+
+				CASE_REP(IMULH_R) {
+					auto dst = instr.dst % RegistersCount;
+					auto src = instr.src % RegistersCount;
+					ibc.type = InstructionType::IMULH_R;
+					ibc.idst = &r[dst];
+					ibc.isrc = &r[src];
+				} break;
+
+				CASE_REP(IMULH_M) {
+					auto dst = instr.dst % RegistersCount;
+					auto src = instr.src % RegistersCount;
+					ibc.type = InstructionType::IMULH_M;
+					ibc.idst = &r[dst];
+					if (src != dst) {
+						ibc.isrc = &r[src];
+						ibc.memMask = ((instr.mod % 4) ? ScratchpadL1Mask : ScratchpadL2Mask);
+					}
+					else {
+						ibc.imm = instr.imm32;
+						ibc.isrc = &ibc.imm;
+						ibc.memMask = ScratchpadL3Mask;
+					}
+				} break;
+
+				CASE_REP(ISMULH_R) {
+					auto dst = instr.dst % RegistersCount;
+					auto src = instr.src % RegistersCount;
+					ibc.type = InstructionType::ISMULH_R;
+					ibc.idst = &r[dst];
+					ibc.isrc = &r[src];
+				} break;
+
+				CASE_REP(ISMULH_M) {
+					auto dst = instr.dst % RegistersCount;
+					auto src = instr.src % RegistersCount;
+					ibc.type = InstructionType::ISMULH_M;
+					ibc.idst = &r[dst];
+					if (src != dst) {
+						ibc.isrc = &r[src];
+						ibc.memMask = ((instr.mod % 4) ? ScratchpadL1Mask : ScratchpadL2Mask);
+					}
+					else {
+						ibc.imm = instr.imm32;
+						ibc.isrc = &ibc.imm;
+						ibc.memMask = ScratchpadL3Mask;
+					}
+				} break;
+
+				CASE_REP(IDIV_C) {
+					uint32_t divisor = instr.imm32;
+					if (divisor != 0) {
+						auto dst = instr.dst % RegistersCount;
+						ibc.type = InstructionType::IDIV_C;
+						ibc.idst = &r[dst];
+						if (divisor & (divisor - 1)) {
+							magicu_info mi = compute_unsigned_magic_info(divisor, sizeof(uint64_t) * 8);
+							ibc.signedMultiplier = mi.multiplier;
+							ibc.preShift = mi.pre_shift;
+							ibc.postShift = mi.post_shift;
+							ibc.increment = mi.increment;
+						}
+						else {
+							ibc.signedMultiplier = 0;
+							int shift = 0;
+							while (divisor >>= 1)
+								++shift;
+							ibc.shift = shift;
+						}
+					}
+					else {
+						ibc.type = InstructionType::NOP;
+					}
+				} break;
+
+				CASE_REP(ISDIV_C) {
+					ibc.type = InstructionType::NOP;
+				} break;
+
+				CASE_REP(INEG_R) {
+					auto dst = instr.dst % RegistersCount;
+					ibc.type = InstructionType::INEG_R;
+					ibc.idst = &r[dst];
+				} break;
+
+				CASE_REP(IXOR_R) {
+					auto dst = instr.dst % RegistersCount;
+					auto src = instr.src % RegistersCount;
+					ibc.type = InstructionType::IXOR_R;
+					ibc.idst = &r[dst];
+					if (src != dst) {
+						ibc.isrc = &r[src];
+					}
+					else {
+						ibc.imm = signExtend2sCompl(instr.imm32);
+						ibc.isrc = &ibc.imm;
+					}
+				} break;
+
+				CASE_REP(IXOR_M) {
+					auto dst = instr.dst % RegistersCount;
+					auto src = instr.src % RegistersCount;
+					ibc.type = InstructionType::IXOR_M;
+					ibc.idst = &r[dst];
+					if (src != dst) {
+						ibc.isrc = &r[src];
+						ibc.memMask = ((instr.mod % 4) ? ScratchpadL1Mask : ScratchpadL2Mask);
+					}
+					else {
+						ibc.imm = instr.imm32;
+						ibc.isrc = &ibc.imm;
+						ibc.memMask = ScratchpadL3Mask;
+					}
+				} break;
+
+				CASE_REP(IROR_R) {
+					auto dst = instr.dst % RegistersCount;
+					auto src = instr.src % RegistersCount;
+					ibc.type = InstructionType::IROR_R;
+					ibc.idst = &r[dst];
+					if (src != dst) {
+						ibc.isrc = &r[src];
+					}
+					else {
+						ibc.imm = instr.imm32;
+						ibc.isrc = &ibc.imm;
+					}
+				} break;
+
+				CASE_REP(IROL_R) {
+					auto dst = instr.dst % RegistersCount;
+					auto src = instr.src % RegistersCount;
+					ibc.type = InstructionType::IROL_R;
+					ibc.idst = &r[dst];
+					if (src != dst) {
+						ibc.isrc = &r[src];
+					}
+					else {
+						ibc.imm = instr.imm32;
+						ibc.isrc = &ibc.imm;
+					}
+				} break;
+
+				CASE_REP(ISWAP_R) {
+					auto dst = instr.dst % RegistersCount;
+					auto src = instr.src % RegistersCount;
+					if (src != dst) {
+						ibc.idst = &r[dst];
+						ibc.isrc = &r[src];
+						ibc.type = InstructionType::ISWAP_R;
+					}
+					else {
+						ibc.type = InstructionType::NOP;
+					}
+				} break;
+
+				CASE_REP(FSWAP_R) {
+					auto dst = instr.dst % RegistersCount;
+					ibc.type = InstructionType::FSWAP_R;
+					ibc.fdst = (dst < 4) ? &f[dst] : &e[dst - 4]; // f[] and e[] hold 4 registers each; indices 4-7 select group E
+				} break;
+
+				CASE_REP(FADD_R) {
+					auto dst = instr.dst % 4;
+					auto src = instr.src % 4;
+					ibc.type = InstructionType::FADD_R;
+					ibc.fdst = &f[dst];
+					ibc.fsrc = &a[src];
+				} break;
+
+				CASE_REP(FADD_M) {
+					auto dst = instr.dst % 4;
+					auto src = instr.src % 8;
+					ibc.type = InstructionType::FADD_M;
+					ibc.fdst = &f[dst];
+					ibc.isrc = &r[src];
+					ibc.memMask = ((instr.mod % 4) ? ScratchpadL1Mask : ScratchpadL2Mask);
+				} break;
+
+				CASE_REP(FSUB_R) {
+					auto dst = instr.dst % 4;
+					auto src = instr.src % 4;
+					ibc.type = InstructionType::FSUB_R;
+					ibc.fdst = &f[dst];
+					ibc.fsrc = &a[src];
+				} break;
+
+				CASE_REP(FSUB_M) {
+					auto dst = instr.dst % 4;
+					auto src = instr.src % 8;
+					ibc.type = InstructionType::FSUB_M;
+					ibc.fdst = &f[dst];
+					ibc.isrc = &r[src];
+					ibc.memMask = ((instr.mod % 4) ? ScratchpadL1Mask : ScratchpadL2Mask);
+				} break;
+
+				CASE_REP(FNEG_R) {
+					auto dst = instr.dst % 4;
+					ibc.fdst = &f[dst];
+					ibc.type = InstructionType::FNEG_R;
+				} break;
+
+				CASE_REP(FMUL_R) {
+					auto dst = instr.dst % 4;
+					auto src = instr.src % 4;
+					ibc.type = InstructionType::FMUL_R;
+					ibc.fdst = &e[dst];
+					ibc.fsrc = &a[src];
+				} break;
+
+				CASE_REP(FMUL_M) {
+				} break;
+
+				CASE_REP(FDIV_R) {
+				} break;
+
+				CASE_REP(FDIV_M) {
+					auto dst = instr.dst % 4;
+					auto src = instr.src % 8;
+					ibc.type = InstructionType::FDIV_M;
+					ibc.fdst = &e[dst];
+					ibc.isrc = &r[src];
+					ibc.memMask = ((instr.mod % 4) ? ScratchpadL1Mask : ScratchpadL2Mask);
+				} break;
+
+				CASE_REP(FSQRT_R) {
+					auto dst = instr.dst % 4;
+					ibc.type = InstructionType::FSQRT_R;
+					ibc.fdst = &e[dst];
+				} break;
+
+				CASE_REP(COND_R) {
+					auto dst = instr.dst % RegistersCount;
+					auto src = instr.src % RegistersCount;
+					ibc.type = InstructionType::COND_R;
+					ibc.idst = &r[dst];
+					ibc.isrc = &r[src];
+					ibc.condition = (instr.mod >> 2) & 7;
+					ibc.imm = instr.imm32;
+				} break;
+
+				CASE_REP(COND_M) {
+					auto dst = instr.dst % RegistersCount;
+					auto src = instr.src % RegistersCount;
+					ibc.type = InstructionType::COND_M;
+					ibc.idst = &r[dst];
+					ibc.isrc = &r[src];
+					ibc.condition = (instr.mod >> 2) & 7;
+					ibc.imm = instr.imm32;
+					ibc.memMask = ((instr.mod % 4) ? ScratchpadL1Mask : ScratchpadL2Mask);
+				} break;
+
+				CASE_REP(CFROUND) {
+					auto src = instr.src % 8;
+					ibc.isrc = &r[src];
+					ibc.type = InstructionType::CFROUND;
+					ibc.imm = instr.imm32 & 63;
+				} break;
+
+				CASE_REP(ISTORE) {
+					auto dst = instr.dst % RegistersCount;
+					auto src = instr.src % RegistersCount;
+					ibc.type = InstructionType::ISTORE;
+					ibc.idst = &r[dst];
+					ibc.isrc = &r[src];
+				} break;
+
+				CASE_REP(FSTORE) {
+				} break;
+
+				CASE_REP(NOP) {
+					ibc.type = InstructionType::NOP;
+				} break;
+
+				default:
+					UNREACHABLE;
+			}
+		}
+	}
 }
\ No newline at end of file
diff --git a/src/InterpretedVirtualMachine.hpp b/src/InterpretedVirtualMachine.hpp
index b8fd98f..4db4ae4 100644
--- a/src/InterpretedVirtualMachine.hpp
+++ b/src/InterpretedVirtualMachine.hpp
@@ -21,27 +21,57 @@ along with RandomX.  If not, see<http://www.gnu.org/licenses/>.
 //#define STATS
 #include "VirtualMachine.hpp"
 #include "Program.hpp"
-#include <vector>
+#include "intrinPortable.h"
 
 namespace RandomX {
 
+	class ITransform {
+	public:
+		virtual int32_t apply(int32_t) const = 0;
+		virtual const char* getName() const = 0;
+		virtual std::ostream& printAsm(std::ostream&) const = 0;
+		virtual std::ostream& printCxx(std::ostream&) const = 0;
+	};
+
+	struct InstructionByteCode;
 	class InterpretedVirtualMachine;
 
 	typedef void(InterpretedVirtualMachine::*InstructionHandler)(Instruction&);
 
+	struct alignas(16) InstructionByteCode {
+		int_reg_t* idst;
+		int_reg_t* isrc;
+		int_reg_t imm;
+		__m128d* fdst;
+		__m128d* fsrc;
+		uint32_t condition;
+		uint32_t memMask;
+		uint32_t type;
+		union {
+			uint64_t unsignedMultiplier;
+			int64_t signedMultiplier;
+		};
+		unsigned shift;
+		unsigned preShift;
+		unsigned postShift;
+		bool increment;
+	};
+
+	constexpr int InstructionByteCodeSize = sizeof(InstructionByteCode);
+
 	class InterpretedVirtualMachine : public VirtualMachine {
 	public:
-		InterpretedVirtualMachine(bool softAes) : VirtualMachine(softAes) {}
-		virtual void initializeProgram(const void* seed) override;
-		virtual void execute() override;
-		const Program& getProgam() {
-			return p;
-		}
+		InterpretedVirtualMachine(bool soft, bool async) : softAes(soft), asyncWorker(async) {}
+		~InterpretedVirtualMachine();
+		void setDataset(dataset_t ds) override;
+		void initialize() override;
+		void execute() override;
 	private:
 		static InstructionHandler engine[256];
-		Program p;
-		std::vector<convertible_t> stack;
-		uint64_t pc, ic;
+		DatasetReadFunc readDataset;
+		bool softAes, asyncWorker;
+		InstructionByteCode byteCode[ProgramLength];
+		
 #ifdef STATS
 		int count_ADD_64 = 0;
 		int count_ADD_32 = 0;
@@ -65,17 +95,18 @@ namespace RandomX {
 		int count_SAR_64 = 0;
 		int count_ROL_64 = 0;
 		int count_ROR_64 = 0;
-		int count_FPADD = 0;
-		int count_FPSUB = 0;
-		int count_FPMUL = 0;
-		int count_FPDIV = 0;
-		int count_FPSQRT = 0;
+		int count_FADD = 0;
+		int count_FSUB = 0;
+		int count_FMUL = 0;
+		int count_FDIV = 0;
+		int count_FSQRT = 0;
 		int count_FPROUND = 0;
+		int count_JUMP_taken = 0;
+		int count_JUMP_not_taken = 0;
 		int count_CALL_taken = 0;
 		int count_CALL_not_taken = 0;
 		int count_RET_stack_empty = 0;
 		int count_RET_taken = 0;
-		int count_RET_not_taken = 0;
 		int count_jump_taken[8] = { 0 };
 		int count_jump_not_taken[8] = { 0 };
 		int count_max_stack = 0;
@@ -83,66 +114,17 @@ namespace RandomX {
 		int count_retdepth_max = 0;
 		int count_endstack = 0;
 		int count_instructions[ProgramLength] = { 0 };
+		int count_FADD_nop = 0;
+		int count_FADD_nop2 = 0;
+		int count_FSUB_nop = 0;
+		int count_FSUB_nop2 = 0;
+		int count_FMUL_nop = 0;
+		int count_FMUL_nop2 = 0;
+		int datasetAccess[256] = { 0 };
 #endif
-
-		convertible_t loada(Instruction&);
-		convertible_t loadbr0(Instruction&);
-		convertible_t loadbr1(Instruction&);
-		convertible_t& getcr(Instruction&);
-		void writecf(Instruction&, fpu_reg_t&);
-		void writecflo(Instruction&, fpu_reg_t&);
-
-		void stackPush(convertible_t& c) {
-			stack.push_back(c);
-		}
-
-		void stackPush(uint64_t x) {
-			convertible_t c;
-			c.u64 = x;
-			stack.push_back(c);
-		}
-
-		convertible_t stackPopValue() {
-			convertible_t top = stack.back();
-			stack.pop_back();
-			return top;
-		}
-
-		uint64_t stackPopAddress() {
-			convertible_t top = stack.back();
-			stack.pop_back();
-			return top.u64;
-		}
-
-		void h_ADD_64(Instruction&);
-		void h_ADD_32(Instruction&);
-		void h_SUB_64(Instruction&);
-		void h_SUB_32(Instruction&);
-		void h_MUL_64(Instruction&);
-		void h_MULH_64(Instruction&);
-		void h_MUL_32(Instruction&);
-		void h_IMUL_32(Instruction&);
-		void h_IMULH_64(Instruction&);
-		void h_DIV_64(Instruction&);
-		void h_IDIV_64(Instruction&);
-		void h_AND_64(Instruction&);
-		void h_AND_32(Instruction&);
-		void h_OR_64(Instruction&);
-		void h_OR_32(Instruction&);
-		void h_XOR_64(Instruction&);
-		void h_XOR_32(Instruction&);
-		void h_SHL_64(Instruction&);
-		void h_SHR_64(Instruction&);
-		void h_SAR_64(Instruction&);
-		void h_ROL_64(Instruction&);
-		void h_ROR_64(Instruction&);
-		void h_FPADD(Instruction&);
-		void h_FPSUB(Instruction&);
-		void h_FPMUL(Instruction&);
-		void h_FPDIV(Instruction&);
-		void h_FPSQRT(Instruction&);
-		void h_FPROUND(Instruction&);
-		void h_CALL(Instruction&);
-		void h_RET(Instruction&);
+		void precompileProgram(int_reg_t(&r)[8], __m128d (&f)[4], __m128d (&e)[4], __m128d (&a)[4]);
+		template<int N>
+		void executeBytecode(int_reg_t(&r)[8], __m128d (&f)[4], __m128d (&e)[4], __m128d (&a)[4]);
+		void executeBytecode(int i, int_reg_t(&r)[8], __m128d (&f)[4], __m128d (&e)[4], __m128d (&a)[4]);
 	};
 }
\ No newline at end of file
diff --git a/src/JitCompilerX86-static.S b/src/JitCompilerX86-static.S
index be156ef..9bf06ba 100644
--- a/src/JitCompilerX86-static.S
+++ b/src/JitCompilerX86-static.S
@@ -27,32 +27,47 @@
 #define DECL(x) x
 #endif
 .global DECL(randomx_program_prologue)
-.global DECL(randomx_program_begin)
+.global DECL(randomx_program_loop_begin)
+.global DECL(randomx_program_loop_load)
+.global DECL(randomx_program_start)
+.global DECL(randomx_program_read_dataset)
+.global DECL(randomx_program_loop_store)
+.global DECL(randomx_program_loop_end)
 .global DECL(randomx_program_epilogue)
-.global DECL(randomx_program_read_r)
-.global DECL(randomx_program_read_f)
 .global DECL(randomx_program_end)
 
+#define db .byte
+
 .align 64
 DECL(randomx_program_prologue):
 	#include "asm/program_prologue_linux.inc"
 
 .align 64
-DECL(randomx_program_begin):
+	#include "asm/program_xmm_constants.inc"
+
+.align 64
+DECL(randomx_program_loop_begin):
+	nop
+
+DECL(randomx_program_loop_load):
+	#include "asm/program_loop_load.inc"
+
+DECL(randomx_program_start):
+	nop
+
+DECL(randomx_program_read_dataset):
+	#include "asm/program_read_dataset.inc"
+
+DECL(randomx_program_loop_store):
+	#include "asm/program_loop_store.inc"
+
+DECL(randomx_program_loop_end):
 	nop
 
 .align 64
 DECL(randomx_program_epilogue):
 	#include "asm/program_epilogue_linux.inc"
 
-.align 64
-DECL(randomx_program_read_r):
-	#include "asm/program_read_r.inc"
-
-.align 64
-DECL(randomx_program_read_f):
-	#include "asm/program_read_f.inc"
-
 .align 64
 DECL(randomx_program_end):
-	nop
\ No newline at end of file
+	nop
diff --git a/src/JitCompilerX86-static.asm b/src/JitCompilerX86-static.asm
index d7d3d4b..5b2d387 100644
--- a/src/JitCompilerX86-static.asm
+++ b/src/JitCompilerX86-static.asm
@@ -15,13 +15,18 @@
 ;# You should have received a copy of the GNU General Public License
 ;# along with RandomX.  If not, see<http://www.gnu.org/licenses/>.
 
+IFDEF RAX
+
 _RANDOMX_JITX86_STATIC SEGMENT PAGE READ EXECUTE
 
 PUBLIC randomx_program_prologue
-PUBLIC randomx_program_begin
+PUBLIC randomx_program_loop_begin
+PUBLIC randomx_program_loop_load
+PUBLIC randomx_program_start
+PUBLIC randomx_program_read_dataset
+PUBLIC randomx_program_loop_store
+PUBLIC randomx_program_loop_end
 PUBLIC randomx_program_epilogue
-PUBLIC randomx_program_read_r
-PUBLIC randomx_program_read_f
 PUBLIC randomx_program_end
 
 ALIGN 64
@@ -30,25 +35,38 @@ randomx_program_prologue PROC
 randomx_program_prologue ENDP
 
 ALIGN 64
-randomx_program_begin PROC
+	include asm/program_xmm_constants.inc
+
+ALIGN 64
+randomx_program_loop_begin PROC
 	nop
-randomx_program_begin ENDP
+randomx_program_loop_begin ENDP
+
+randomx_program_loop_load PROC
+	include asm/program_loop_load.inc
+randomx_program_loop_load ENDP
+
+randomx_program_start PROC
+	nop
+randomx_program_start ENDP
+
+randomx_program_read_dataset PROC
+	include asm/program_read_dataset.inc
+randomx_program_read_dataset ENDP
+
+randomx_program_loop_store PROC
+	include asm/program_loop_store.inc
+randomx_program_loop_store ENDP
+
+randomx_program_loop_end PROC
+	nop
+randomx_program_loop_end ENDP
 
 ALIGN 64
 randomx_program_epilogue PROC
 	include asm/program_epilogue_win64.inc
 randomx_program_epilogue ENDP
 
-ALIGN 64
-randomx_program_read_r PROC
-	include asm/program_read_r.inc
-randomx_program_read_r ENDP
-
-ALIGN 64
-randomx_program_read_f PROC
-	include asm/program_read_f.inc
-randomx_program_read_f ENDP
-
 ALIGN 64
 randomx_program_end PROC
 	nop
@@ -56,4 +74,6 @@ randomx_program_end ENDP
 
 _RANDOMX_JITX86_STATIC ENDS
 
+ENDIF
+
 END
\ No newline at end of file
diff --git a/src/JitCompilerX86-static.hpp b/src/JitCompilerX86-static.hpp
index 6052283..64abfa3 100644
--- a/src/JitCompilerX86-static.hpp
+++ b/src/JitCompilerX86-static.hpp
@@ -18,10 +18,13 @@ along with RandomX.  If not, see<http://www.gnu.org/licenses/>.
 */
 
 extern "C" {
-  void randomx_program_prologue();
-  void randomx_program_begin();
-  void randomx_program_epilogue();
-  void randomx_program_read_r();
-  void randomx_program_read_f();
-  void randomx_program_end();
+	void randomx_program_prologue();
+	void randomx_program_loop_begin();
+	void randomx_program_loop_load();
+	void randomx_program_start();
+	void randomx_program_read_dataset();
+	void randomx_program_loop_store();
+	void randomx_program_loop_end();
+	void randomx_program_epilogue();
+	void randomx_program_end();
 }
\ No newline at end of file
diff --git a/src/JitCompilerX86.cpp b/src/JitCompilerX86.cpp
index b03a330..0c2fac0 100644
--- a/src/JitCompilerX86.cpp
+++ b/src/JitCompilerX86.cpp
@@ -17,10 +17,14 @@ You should have received a copy of the GNU General Public License
 along with RandomX.  If not, see<http://www.gnu.org/licenses/>.
 */
 
+#define MAGIC_DIVISION
 #include "JitCompilerX86.hpp"
-#include "Pcg32.hpp"
+#include "Program.hpp"
 #include <cstring>
 #include <stdexcept>
+#ifdef MAGIC_DIVISION
+#include "divideByConstantCodegen.h"
+#endif
 
 #ifdef _WIN32
 #include <windows.h>
@@ -36,79 +40,150 @@ namespace RandomX {
 
 #if !defined(_M_X64) && !defined(__x86_64__)
 	JitCompilerX86::JitCompilerX86() {
-		throw std::runtime_error("JIT compiler only supports x86-64 CPUs");
+		//throw std::runtime_error("JIT compiler only supports x86-64 CPUs");
 	}
 
-	void JitCompilerX86::generateProgram(Pcg32& gen) {
+	void JitCompilerX86::generateProgram(Program& p) {
 
 	}
+
+	size_t JitCompilerX86::getCodeSize() {
+		return 0;
+	}
 #else
 
 	/*
-	 REGISTER ALLOCATION:
 
-	 rax -> temporary
-	 rbx -> MemoryRegisters& memory
-	 rcx -> temporary
-	 rdx -> temporary
-	 rsi -> convertible_t* scratchpad
-	 rdi -> "ic" (instruction counter)
-	 rbp -> beginning of VM stack
-	 rsp -> end of VM stack
-	 r8  -> "r0"
-	 r9  -> "r1"
-	 r10 -> "r2"
-	 r11 -> "r3"
-	 r12 -> "r4"
-	 r13 -> "r5"
-	 r14 -> "r6"
-	 r15 -> "r7"
-	 xmm0 -> temporary
-	 xmm1 -> temporary
-	 xmm2 -> "f2"
-	 xmm3 -> "f3"
-	 xmm4 -> "f4"
-	 xmm5 -> "f5"
-	 xmm6 -> "f6"
-	 xmm7 -> "f7"
-	 xmm8 -> "f0"
-	 xmm9 -> "f1"
-	 xmm10 -> absolute value mask 0x7fffffffffffffff7fffffffffffffff
+	REGISTER ALLOCATION:
 
-	 STACK STRUCTURE:
-
-	   |
-	   |
-	   | saved registers
-	   |
-	   v
-	 [rbp] RegisterFile& registerFile
-	   |
-	   |
-	   | VM stack
-	   |
-	   v
-	 [rsp] last element of VM stack
+	; rax -> temporary
+	; rbx -> loop counter "lc"
+	; rcx -> temporary
+	; rdx -> temporary
+	; rsi -> scratchpad pointer
+	; rdi -> dataset pointer
+	; rbp -> memory registers "ma" (high 32 bits), "mx" (low 32 bits)
+	; rsp -> stack pointer
+	; r8  -> "r0"
+	; r9  -> "r1"
+	; r10 -> "r2"
+	; r11 -> "r3"
+	; r12 -> "r4"
+	; r13 -> "r5"
+	; r14 -> "r6"
+	; r15 -> "r7"
+	; xmm0 -> "f0"
+	; xmm1 -> "f1"
+	; xmm2 -> "f2"
+	; xmm3 -> "f3"
+	; xmm4 -> "e0"
+	; xmm5 -> "e1"
+	; xmm6 -> "e2"
+	; xmm7 -> "e3"
+	; xmm8 -> "a0"
+	; xmm9 -> "a1"
+	; xmm10 -> "a2"
+	; xmm11 -> "a3"
+	; xmm12 -> temporary
+	; xmm13 -> DBL_MIN
+	; xmm14 -> absolute value mask 0x7fffffffffffffff7fffffffffffffff
+	; xmm15 -> sign mask           0x80000000000000008000000000000000
 
 	*/
 
 #include "JitCompilerX86-static.hpp"
 
 	const uint8_t* codePrologue = (uint8_t*)&randomx_program_prologue;
-	const uint8_t* codeProgramBegin = (uint8_t*)&randomx_program_begin;
+	const uint8_t* codeLoopBegin = (uint8_t*)&randomx_program_loop_begin;
+	const uint8_t* codeLoopLoad = (uint8_t*)&randomx_program_loop_load;
+	const uint8_t* codeProgamStart = (uint8_t*)&randomx_program_start;
+	const uint8_t* codeReadDataset = (uint8_t*)&randomx_program_read_dataset;
+	const uint8_t* codeLoopStore = (uint8_t*)&randomx_program_loop_store;
+	const uint8_t* codeLoopEnd = (uint8_t*)&randomx_program_loop_end;
 	const uint8_t* codeEpilogue = (uint8_t*)&randomx_program_epilogue;
-	const uint8_t* codeReadDatasetR = (uint8_t*)&randomx_program_read_r;
-	const uint8_t* codeReadDatasetF = (uint8_t*)&randomx_program_read_f;
 	const uint8_t* codeProgramEnd = (uint8_t*)&randomx_program_end;
 
-	const int32_t prologueSize = codeProgramBegin - codePrologue;
-	const int32_t epilogueSize = codeReadDatasetR - codeEpilogue;
-	const int32_t readDatasetRSize = codeReadDatasetF - codeReadDatasetR;
-	const int32_t readDatasetFSize = codeProgramEnd - codeReadDatasetF;
+	const int32_t prologueSize = codeLoopBegin - codePrologue;
+	const int32_t epilogueSize = codeProgramEnd - codeEpilogue;
 
-	const int32_t readDatasetFOffset = CodeSize - readDatasetFSize;
-	const int32_t readDatasetROffset = readDatasetFOffset - readDatasetRSize;
-	const int32_t epilogueOffset = readDatasetROffset - epilogueSize;
+	const int32_t loopLoadSize = codeProgamStart - codeLoopLoad;
+	const int32_t readDatasetSize = codeLoopStore - codeReadDataset;
+	const int32_t loopStoreSize = codeLoopEnd - codeLoopStore;
+
+	const int32_t epilogueOffset = CodeSize - epilogueSize;
+
+	static const uint8_t REX_ADD_RR[] = { 0x4d, 0x03 };
+	static const uint8_t REX_ADD_RM[] = { 0x4c, 0x03 };
+	static const uint8_t REX_SUB_RR[] = { 0x4d, 0x2b };
+	static const uint8_t REX_SUB_RM[] = { 0x4c, 0x2b };
+	static const uint8_t REX_MOV_RR[] = { 0x41, 0x8b };
+	static const uint8_t REX_MOV_RR64[] = { 0x49, 0x8b };
+	static const uint8_t REX_MOV_R64R[] = { 0x4c, 0x8b };
+	static const uint8_t REX_IMUL_RR[] = { 0x4d, 0x0f, 0xaf };
+	static const uint8_t REX_IMUL_RRI[] = { 0x4d, 0x69 };
+	static const uint8_t REX_IMUL_RM[] = { 0x4c, 0x0f, 0xaf };
+	static const uint8_t REX_MUL_R[] = { 0x49, 0xf7 };
+	static const uint8_t REX_MUL_M[] = { 0x48, 0xf7 };
+	static const uint8_t REX_81[] = { 0x49, 0x81 };
+	static const uint8_t AND_EAX_I = 0x25;
+	static const uint8_t MOV_EAX_I = 0xb8;
+	static const uint8_t MOV_RAX_I[] = { 0x48, 0xb8 };
+	static const uint8_t MOV_RCX_I[] = { 0x48, 0xb9 };
+	static const uint8_t REX_LEA[] = { 0x4f, 0x8d };
+	static const uint8_t REX_MUL_MEM[] = { 0x48, 0xf7, 0x24, 0x0e };
+	static const uint8_t REX_IMUL_MEM[] = { 0x48, 0xf7, 0x2c, 0x0e };
+	static const uint8_t REX_SHR_RAX[] = { 0x48, 0xc1, 0xe8 };
+	static const uint8_t RAX_ADD_SBB_1[] = { 0x48, 0x83, 0xC0, 0x01, 0x48, 0x83, 0xD8, 0x00 };
+	static const uint8_t MUL_RCX[] = { 0x48, 0xf7, 0xe1 };
+	static const uint8_t REX_SHR_RDX[] = { 0x48, 0xc1, 0xea };
+	static const uint8_t REX_SH[] = { 0x49, 0xc1 };
+	static const uint8_t MOV_RCX_RAX_SAR_RCX_63[] = { 0x48, 0x89, 0xc1, 0x48, 0xc1, 0xf9, 0x3f };
+	static const uint8_t AND_ECX_I[] = { 0x81, 0xe1 };
+	static const uint8_t ADD_RAX_RCX[] = { 0x48, 0x01, 0xC8 };
+	static const uint8_t SAR_RAX_I8[] = { 0x48, 0xC1, 0xF8 };
+	static const uint8_t NEG_RAX[] = { 0x48, 0xF7, 0xD8 };
+	static const uint8_t ADD_R_RAX[] = { 0x49, 0x01 };
+	static const uint8_t XOR_EAX_EAX[] = { 0x31, 0xC0 };
+	static const uint8_t ADD_RDX_R[] = { 0x4c, 0x01 };
+	static const uint8_t SUB_RDX_R[] = { 0x4c, 0x29 };
+	static const uint8_t SAR_RDX_I8[] = { 0x48, 0xC1, 0xFA };
+	static const uint8_t TEST_RDX_RDX[] = { 0x48, 0x85, 0xD2 };
+	static const uint8_t SETS_AL_ADD_RDX_RAX[] = { 0x0F, 0x98, 0xC0, 0x48, 0x01, 0xC2 };
+	static const uint8_t REX_NEG[] = { 0x49, 0xF7 };
+	static const uint8_t REX_XOR_RR[] = { 0x4D, 0x33 };
+	static const uint8_t REX_XOR_RI[] = { 0x49, 0x81 };
+	static const uint8_t REX_XOR_RM[] = { 0x4c, 0x33 };
+	static const uint8_t REX_ROT_CL[] = { 0x49, 0xd3 };
+	static const uint8_t REX_ROT_I8[] = { 0x49, 0xc1 };
+	static const uint8_t SHUFPD[] = { 0x66, 0x0f, 0xc6 };
+	static const uint8_t REX_ADDPD[] = { 0x66, 0x41, 0x0f, 0x58 };
+	static const uint8_t REX_CVTDQ2PD_XMM12[] = { 0xf3, 0x44, 0x0f, 0xe6, 0x24, 0x06 };
+	static const uint8_t REX_SUBPD[] = { 0x66, 0x41, 0x0f, 0x5c };
+	static const uint8_t REX_XORPS[] = { 0x41, 0x0f, 0x57 };
+	static const uint8_t REX_MULPD[] = { 0x66, 0x41, 0x0f, 0x59 };
+	static const uint8_t REX_MAXPD[] = { 0x66, 0x41, 0x0f, 0x5f };
+	static const uint8_t REX_DIVPD[] = { 0x66, 0x41, 0x0f, 0x5e };
+	static const uint8_t SQRTPD[] = { 0x66, 0x0f, 0x51 };
+	static const uint8_t AND_OR_MOV_LDMXCSR[] = { 0x25, 0x00, 0x60, 0x00, 0x00, 0x0D, 0xC0, 0x9F, 0x00, 0x00, 0x89, 0x44, 0x24, 0xF8, 0x0F, 0xAE, 0x54, 0x24, 0xF8 };
+	static const uint8_t ROL_RAX[] = { 0x48, 0xc1, 0xc0 };
+	static const uint8_t XOR_ECX_ECX[] = { 0x33, 0xC9 };
+	static const uint8_t REX_CMP_R32I[] = { 0x41, 0x81 };
+	static const uint8_t REX_CMP_M32I[] = { 0x81, 0x3c, 0x06 };
+	static const uint8_t MOVAPD[] = { 0x66, 0x0f, 0x29 };
+	static const uint8_t REX_MOV_MR[] = { 0x4c, 0x89 };
+	static const uint8_t REX_XOR_EAX[] = { 0x41, 0x33 };
+	static const uint8_t SUB_EBX[] = { 0x83, 0xEB, 0x01 };
+	static const uint8_t JNZ[] = { 0x0f, 0x85 };
+	static const uint8_t JMP = 0xe9;
+	static const uint8_t REX_XOR_RAX_R64[] = { 0x49, 0x33 };
+	static const uint8_t REX_XCHG[] = { 0x4d, 0x87 };
+	static const uint8_t REX_ANDPS_XMM12[] = { 0x41, 0x0f, 0x54, 0xe6 };
+	static const uint8_t REX_PADD[] = { 0x66, 0x44, 0x0f };
+	static const uint8_t PADD_OPCODES[] = { 0xfc, 0xfd, 0xfe, 0xd4 };
+
+	size_t JitCompilerX86::getCodeSize() {
+		return codePos - prologueSize;
+	}
 
 	JitCompilerX86::JitCompilerX86() {
 #ifdef _WIN32
@@ -121,600 +196,605 @@ namespace RandomX {
 			throw std::runtime_error("mmap failed");
 #endif
 		memcpy(code, codePrologue, prologueSize);
-		memcpy(code + CodeSize - readDatasetRSize - readDatasetFSize - epilogueSize, codeEpilogue, epilogueSize);
-		memcpy(code + CodeSize - readDatasetRSize - readDatasetFSize, codeReadDatasetR, readDatasetRSize);
-		memcpy(code + CodeSize - readDatasetFSize, codeReadDatasetF, readDatasetFSize);
+		memcpy(code + CodeSize - epilogueSize, codeEpilogue, epilogueSize);
 	}
 
-	void JitCompilerX86::generateProgram(Pcg32& gen) {
-		instructionOffsets.clear();
-		callOffsets.clear();
+	void JitCompilerX86::generateProgram(Program& prog) {
+		auto addressRegisters = prog.getEntropy(12);
+		uint32_t readReg0 = 0 + (addressRegisters & 1);
+		addressRegisters >>= 1;
+		uint32_t readReg1 = 2 + (addressRegisters & 1);
+		addressRegisters >>= 1;
+		uint32_t readReg2 = 4 + (addressRegisters & 1);
+		addressRegisters >>= 1;
+		uint32_t readReg3 = 6 + (addressRegisters & 1);
 		codePos = prologueSize;
-		Instruction instr;
+		emit(REX_XOR_RAX_R64);
+		emitByte(0xc0 + readReg0);
+		emit(REX_XOR_RAX_R64);
+		emitByte(0xc0 + readReg1);
+		memcpy(code + codePos, codeLoopLoad, loopLoadSize);
+		codePos += loopLoadSize;
 		for (unsigned i = 0; i < ProgramLength; ++i) {
-			for (unsigned j = 0; j < sizeof(instr) / sizeof(Pcg32::result_type); ++j) {
-				*(((uint32_t*)&instr) + j) = gen();
-			}
-			generateCode(instr, i);
+			Instruction& instr = prog(i);
+			instr.src %= RegistersCount;
+			instr.dst %= RegistersCount;
+			generateCode(instr);
 		}
-		emitByte(0xe9);
-		emit(instructionOffsets[0] - (codePos + 4));
-		fixCallOffsets();
+		emit(REX_MOV_RR);
+		emitByte(0xc0 + readReg2);
+		emit(REX_XOR_EAX);
+		emitByte(0xc0 + readReg3);
+		memcpy(code + codePos, codeReadDataset, readDatasetSize);
+		codePos += readDatasetSize;
+		memcpy(code + codePos, codeLoopStore, loopStoreSize);
+		codePos += loopStoreSize;
+		emit(SUB_EBX);
+		emit(JNZ);
+		emit32(prologueSize - codePos - 4);
+		emitByte(JMP);
+		emit32(epilogueOffset - codePos - 4);
+		emitByte(0x90);
 	}
 
-	void JitCompilerX86::generateCode(Instruction& instr, int i) {
-		instructionOffsets.push_back(codePos);
-		emit(0x840fcfff); //dec edx; jz <epilogue>
-		emit(epilogueOffset - (codePos + 4)); //jump offset (RIP-relative)
+	void JitCompilerX86::generateCode(Instruction& instr) {
 		auto generator = engine[instr.opcode];
-		(this->*generator)(instr, i);
+		(this->*generator)(instr);
 	}
 
-	void JitCompilerX86::fixCallOffsets() {
-		for (CallOffset& co : callOffsets) {
-			*reinterpret_cast<int32_t*>(code + co.pos) = instructionOffsets[co.index] - (co.pos + 4);
-		}
+	void JitCompilerX86::genAddressReg(Instruction& instr, bool rax = true) {
+		emit(REX_MOV_RR);
+		emitByte((rax ? 0xc0 : 0xc8) + instr.src);
+		if (rax)
+			emitByte(AND_EAX_I);
+		else
+			emit(AND_ECX_I);
+		emit32((instr.mod % 4) ? ScratchpadL1Mask : ScratchpadL2Mask);
 	}
 
-	void JitCompilerX86::genar(Instruction& instr) {
-		emit(uint16_t(0x8149)); //xor
-		emitByte(0xf0 + (instr.rega % RegistersCount));
-		emit(instr.addra);
-		switch (instr.loca & 7)
-		{
-			case 0:
-			case 1:
-			case 2:
-			case 3:
-				emit(uint16_t(0x8b41)); //mov
-				emitByte(0xc8 + (instr.rega % RegistersCount)); //ecx, rega
-				emitByte(0xe8); //call
-				emit(readDatasetROffset - (codePos + 4));
-				return;
-
-			case 4:
-				emit(uint16_t(0x8b41)); //mov
-				emitByte(0xc0 + (instr.rega % RegistersCount)); //eax, rega
-				emitByte(0x25); //and
-				emit(ScratchpadL2 - 1); //whole scratchpad
-				emit(0xc6048b48); // mov rax,QWORD PTR [rsi+rax*8]
-				return;
-
-			default:
-				emit(uint16_t(0x8b41)); //mov
-				emitByte(0xc0 + (instr.rega % RegistersCount)); //eax, rega
-				emitByte(0x25); //and
-				emit(ScratchpadL1 - 1); //first 16 KiB of scratchpad
-				emit(0xc6048b48); // mov rax,QWORD PTR [rsi+rax*8]
-				return;
-		}
+	void JitCompilerX86::genAddressRegDst(Instruction& instr, bool align16 = false) {
+		emit(REX_MOV_RR);
+		emitByte(0xc0 + instr.dst);
+		emitByte(AND_EAX_I);
+		int32_t maskL1 = align16 ? ScratchpadL1Mask16 : ScratchpadL1Mask;
+		int32_t maskL2 = align16 ? ScratchpadL2Mask16 : ScratchpadL2Mask;
+		emit32((instr.mod % 4) ? maskL1 : maskL2);
 	}
 
-	void JitCompilerX86::genaf(Instruction& instr) {
-		emit(uint16_t(0x8149)); //xor
-		emitByte(0xf0 + (instr.rega % RegistersCount));
-		emit(instr.addra);
-		switch (instr.loca & 7)
-		{
-		case 0:
-		case 1:
-		case 2:
-		case 3:
-			emit(uint16_t(0x8b41)); //mov
-			emitByte(0xc8 + (instr.rega % RegistersCount)); //ecx, rega
-			emitByte(0xe8); //call
-			emit(readDatasetFOffset - (codePos + 4));
-			return;
-
-		case 4:
-			emit(uint16_t(0x8b41)); //mov
-			emitByte(0xc0 + (instr.rega % RegistersCount)); //eax, rega
-			emitByte(0x25); //and
-			emit(ScratchpadL2 - 1); //whole scratchpad
-			emitByte(0xf3);
-			emit(0xc604e60f); //cvtdq2pd xmm0,QWORD PTR [rsi+rax*8]
-			return;
-
-		default:
-			emit(uint16_t(0x8b41)); //mov
-			emitByte(0xc0 + (instr.rega % RegistersCount)); //eax, rega
-			emitByte(0x25); //and
-			emit(ScratchpadL1 - 1); //first 16 KiB of scratchpad
-			emitByte(0xf3);
-			emit(0xc604e60f); //cvtdq2pd xmm0,QWORD PTR [rsi+rax*8]
-			return;
-		}
+	void JitCompilerX86::genAddressImm(Instruction& instr) {
+		emit32(instr.imm32 & ScratchpadL3Mask);
 	}
 
-	void JitCompilerX86::genbr0(Instruction& instr, uint16_t opcodeReg, uint16_t opcodeImm) {
-		if ((instr.locb & 7) <= 3) {
-			emit(uint16_t(0x8b49)); //mov
-			emitByte(0xc8 + (instr.regb % RegistersCount)); //rcx, regb
-			emitByte(0x48); //REX.W
-			emit(opcodeReg); //xxx rax, cl
+	void JitCompilerX86::h_IADD_R(Instruction& instr) {
+		if (instr.src != instr.dst) {
+			emit(REX_ADD_RR);
+			emitByte(0xc0 + 8 * instr.dst + instr.src);
 		}
 		else {
-			emitByte(0x48); //REX.W
-			emit(opcodeImm); //xxx rax, imm8
-			emitByte((instr.imm8 & 63));
+			emit(REX_81);
+			emitByte(0xc0 + instr.dst);
+			emit32(instr.imm32);
 		}
 	}
 
-	void JitCompilerX86::genbr1(Instruction& instr, uint16_t opcodeReg, uint16_t opcodeImm) {
-		if ((instr.locb & 7) <= 5) {
-			emit(opcodeReg); // xxx rax, r64
-			emitByte(0xc0 + (instr.regb % RegistersCount));
+	void JitCompilerX86::h_IADD_M(Instruction& instr) {
+		if (instr.src != instr.dst) {
+			genAddressReg(instr);
+			emit(REX_ADD_RM);
+			emitByte(0x04 + 8 * instr.dst);
+			emitByte(0x06);
 		}
 		else {
-			emit(opcodeImm); // xxx rax, imm32
-			emit(instr.imm32);
+			emit(REX_ADD_RM);
+			emitByte(0x86 + 8 * instr.dst);
+			genAddressImm(instr);
 		}
 	}
 
-	void JitCompilerX86::genbr132(Instruction& instr, uint16_t opcodeReg, uint8_t opcodeImm) {
-		if ((instr.locb & 7) <= 5) {
-			emit(opcodeReg); // xxx eax, r32
-			emitByte(0xc0 + (instr.regb % RegistersCount));
+	void JitCompilerX86::genSIB(int scale, int index, int base) {
+		emitByte((scale << 6) | (index << 3) | base); //SIB: scale in bits 7-6, index in bits 5-3, base in bits 2-0
+	}
+
+	void JitCompilerX86::h_IADD_RC(Instruction& instr) {
+		emit(REX_LEA);
+		emitByte(0x84 + 8 * instr.dst);
+		genSIB(0, instr.src, instr.dst);
+		emit32(instr.imm32);
+	}
+
+	void JitCompilerX86::h_ISUB_R(Instruction& instr) {
+		if (instr.src != instr.dst) {
+			emit(REX_SUB_RR);
+			emitByte(0xc0 + 8 * instr.dst + instr.src);
 		}
 		else {
-			emitByte(opcodeImm); // xxx eax, imm32
-			emit(instr.imm32);
+			emit(REX_81);
+			emitByte(0xe8 + instr.dst);
+			emit32(instr.imm32); //immediate operand, not a scratchpad address
 		}
 	}
 
-	void JitCompilerX86::genbf(Instruction& instr, uint8_t opcode) {
-		int regb = (instr.regb % RegistersCount);
-		emitByte(0x66); //xxxpd  xmm0,regb
-		if (regb <= 1) {
-			emitByte(0x41); //REX
+	void JitCompilerX86::h_ISUB_M(Instruction& instr) {
+		if (instr.src != instr.dst) {
+			genAddressReg(instr);
+			emit(REX_SUB_RM);
+			emitByte(0x04 + 8 * instr.dst);
+			emitByte(0x06);
 		}
-		emitByte(0x0f);
-		emitByte(opcode);
-		emitByte(0xc0 + regb);
-	}
-
-
-	void JitCompilerX86::scratchpadStoreR(Instruction& instr, uint32_t scratchpadSize) {
-		emit(0x41c88b48); //mov rcx, rax; REX
-		emitByte(0x8b); // mov
-		emitByte(0xc0 + (instr.regc % RegistersCount)); //eax, regc
-		emitByte(0x35); // xor eax
-		emit(instr.addrc);
-		emitByte(0x25); //and
-		emit(scratchpadSize - 1);
-		emit(0xc60c8948); // mov    QWORD PTR [rsi+rax*8],rcx
-	}
-
-	void JitCompilerX86::gencr(Instruction& instr) {
-		switch (instr.locc & 7)
-		{
-			case 0:
-				scratchpadStoreR(instr, ScratchpadL2);
-				break;
-
-			case 1:
-			case 2:
-			case 3:
-				scratchpadStoreR(instr, ScratchpadL1);
-				break;
-
-			default:
-				emit(uint16_t(0x8b4c)); //mov
-				emitByte(0xc0 + 8 * (instr.regc % RegistersCount)); //regc, rax
-				break;
+		else {
+			emit(REX_SUB_RM);
+			emitByte(0x86 + 8 * instr.dst);
+			genAddressImm(instr);
 		}
 	}
 
-	void JitCompilerX86::scratchpadStoreF(Instruction& instr, int regc, uint32_t scratchpadSize, bool storeHigh) {
-		emit(uint16_t(0x8b41)); //mov
-		emitByte(0xc0 + regc); //eax, regc
-		emitByte(0x35); // xor eax
-		emit(instr.addrc);
-		emitByte(0x25); //and
-		emit(scratchpadSize - 1);
-		emitByte(0x66); //movhpd/movlpd QWORD PTR [rsi+rax*8], regc
-		if (regc <= 1) {
-			emitByte(0x44); //REX
-		}
-		emitByte(0x0f);
-		emitByte(storeHigh ? 0x17 : 0x13);
-		emitByte(4 + 8 * regc);
-		emitByte(0xc6);
+	void JitCompilerX86::h_IMUL_9C(Instruction& instr) {
+		emit(REX_LEA);
+		emitByte(0x84 + 8 * instr.dst);
+		genSIB(3, instr.src, instr.dst);
+		emit32(instr.imm32);
 	}
 
-	void JitCompilerX86::gencf(Instruction& instr, bool alwaysLow = false) {
-		int regc = (instr.regc % RegistersCount);
-		if (!alwaysLow) {
-			if (regc <= 1) {
-				emitByte(0x44); //REX
+	void JitCompilerX86::h_IMUL_R(Instruction& instr) {
+		if (instr.src != instr.dst) {
+			emit(REX_IMUL_RR);
+			emitByte(0xc0 + 8 * instr.dst + instr.src);
+		}
+		else {
+			emit(REX_IMUL_RRI);
+			emitByte(0xc0 + 9 * instr.dst);
+			emit32(instr.imm32); //immediate operand, not a scratchpad address
+		}
+	}
+
+	void JitCompilerX86::h_IMUL_M(Instruction& instr) {
+		if (instr.src != instr.dst) {
+			genAddressReg(instr);
+			emit(REX_IMUL_RM);
+			emitByte(0x04 + 8 * instr.dst);
+			emitByte(0x06);
+		}
+		else {
+			emit(REX_IMUL_RM);
+			emitByte(0x86 + 8 * instr.dst);
+			genAddressImm(instr);
+		}
+	}
+
+	void JitCompilerX86::h_IMULH_R(Instruction& instr) {
+		emit(REX_MOV_RR64);
+		emitByte(0xc0 + instr.dst);
+		emit(REX_MUL_R);
+		emitByte(0xe0 + instr.src);
+		emit(REX_MOV_R64R);
+		emitByte(0xc2 + 8 * instr.dst);
+	}
+
+	void JitCompilerX86::h_IMULH_M(Instruction& instr) {
+		if (instr.src != instr.dst) {
+			genAddressReg(instr, false);
+			emit(REX_MOV_RR64);
+			emitByte(0xc0 + instr.dst);
+			emit(REX_MUL_MEM);
+		}
+		else {
+			emit(REX_MOV_RR64);
+			emitByte(0xc0 + instr.dst);
+			emit(REX_MUL_M);
+			emitByte(0xa6);
+			genAddressImm(instr);
+		}
+		emit(REX_MOV_R64R);
+		emitByte(0xc2 + 8 * instr.dst);
+	}
+
+	void JitCompilerX86::h_ISMULH_R(Instruction& instr) {
+		emit(REX_MOV_RR64);
+		emitByte(0xc0 + instr.dst);
+		emit(REX_MUL_R);
+		emitByte(0xe8 + instr.src);
+		emit(REX_MOV_R64R);
+		emitByte(0xc2 + 8 * instr.dst);
+	}
+
+	void JitCompilerX86::h_ISMULH_M(Instruction& instr) {
+		if (instr.src != instr.dst) {
+			genAddressReg(instr, false);
+			emit(REX_MOV_RR64);
+			emitByte(0xc0 + instr.dst);
+			emit(REX_IMUL_MEM);
+		}
+		else {
+			emit(REX_MOV_RR64);
+			emitByte(0xc0 + instr.dst);
+			emit(REX_MUL_M);
+			emitByte(0xae);
+			genAddressImm(instr);
+		}
+		emit(REX_MOV_R64R);
+		emitByte(0xc2 + 8 * instr.dst);
+	}
+
+	void JitCompilerX86::h_IDIV_C(Instruction& instr) {
+		if (instr.imm32 != 0) {
+			uint32_t divisor = instr.imm32;
+			if (divisor & (divisor - 1)) {
+				magicu_info mi = compute_unsigned_magic_info(divisor, sizeof(uint64_t) * 8);
+				if (mi.pre_shift == 0 && !mi.increment) {
+					emit(MOV_RAX_I);
+					emit64(mi.multiplier);
+					emit(REX_MUL_R);
+					emitByte(0xe0 + instr.dst);
+				}
+				else {
+					emit(REX_MOV_RR64);
+					emitByte(0xc0 + instr.dst);
+					if (mi.pre_shift > 0) {
+						emit(REX_SHR_RAX);
+						emitByte(mi.pre_shift);
+					}
+					if (mi.increment) {
+						emit(RAX_ADD_SBB_1);
+					}
+					emit(MOV_RCX_I);
+					emit64(mi.multiplier);
+					emit(MUL_RCX);
+				}
+				if (mi.post_shift > 0) {
+					emit(REX_SHR_RDX);
+					emitByte(mi.post_shift);
+				}
+				emit(REX_ADD_RR);
+				emitByte(0xc2 + 8 * instr.dst);
+			}
+			else { //divisor is a power of two
+				int shift = 0;
+				while (divisor >>= 1)
+					++shift;
+				if (shift > 0) {
+					emit(REX_SH);
+					emitByte(0xe8 + instr.dst);
+					emitByte(shift); //imm8 shift count
+				}
 			}
-			emit(uint16_t(0x280f)); //movaps
-			emitByte(0xc0 + 8 * regc); // regc, xmm0
-		}
-		switch (instr.locc & 7)
-		{
-			case 4:
-				scratchpadStoreF(instr, regc, ScratchpadL2, !alwaysLow && (instr.locc & 8));
-				break;
-
-			case 5:
-			case 6:
-			case 7:
-				scratchpadStoreF(instr, regc, ScratchpadL1, !alwaysLow && (instr.locc & 8));
-				break;
-
-			default:
-				break;
 		}
 	}
 
-	void JitCompilerX86::h_ADD_64(Instruction& instr, int i) {
-		genar(instr);
-		genbr1(instr, 0x0349, 0x0548);
-		gencr(instr);
+	void JitCompilerX86::h_ISDIV_C(Instruction& instr) {
+		int64_t divisor = instr.imm32;
+		if ((divisor & -divisor) == divisor || (divisor & -divisor) == -divisor) {
+			emit(REX_MOV_RR64);
+			emitByte(0xc0 + instr.dst);
+			// +/- power of two
+			bool negative = divisor < 0;
+			if (negative)
+				divisor = -divisor;
+			int shift = 0;
+			uint64_t unsignedDivisor = divisor;
+			while (unsignedDivisor >>= 1)
+				++shift;
+			if (shift > 0) {
+				emit(MOV_RCX_RAX_SAR_RCX_63);
+				uint32_t mask = (1ULL << shift) - 1;
+				emit(AND_ECX_I);
+				emit32(mask);
+				emit(ADD_RAX_RCX);
+				emit(SAR_RAX_I8);
+				emitByte(shift);
+			}
+			if (negative)
+				emit(NEG_RAX);
+			emit(ADD_R_RAX);
+			emitByte(0xc0 + instr.dst);
+		}
+		else if (divisor != 0) {
+			magics_info mi = compute_signed_magic_info(divisor);
+			emit(MOV_RAX_I);
+			emit64(mi.multiplier);
+			emit(REX_MUL_R);
+			emitByte(0xe8 + instr.dst);
+			emit(XOR_EAX_EAX);
+			bool haveSF = false;
+			if (divisor > 0 && mi.multiplier < 0) {
+				emit(ADD_RDX_R);
+				emitByte(0xc2 + 8 * instr.dst);
+				haveSF = true;
+			}
+			if (divisor < 0 && mi.multiplier > 0) {
+				emit(SUB_RDX_R);
+				emitByte(0xc2 + 8 * instr.dst);
+				haveSF = true;
+			}
+			if (mi.shift > 0) {
+				emit(SAR_RDX_I8);
+				emitByte(mi.shift);
+				haveSF = true;
+			}
+			if (!haveSF)
+				emit(TEST_RDX_RDX);
+			emit(SETS_AL_ADD_RDX_RAX);
+			emit(ADD_R_RAX);
+			emitByte(0xd0 + instr.dst);
+		}
 	}
 
-	void JitCompilerX86::h_ADD_32(Instruction& instr, int i) {
-		genar(instr);
-		genbr132(instr, 0x0341, 0x05);
-		gencr(instr);
+	void JitCompilerX86::h_INEG_R(Instruction& instr) {
+		emit(REX_NEG);
+		emitByte(0xd8 + instr.dst);
 	}
 
-	void JitCompilerX86::h_SUB_64(Instruction& instr, int i) {
-		genar(instr);
-		genbr1(instr, 0x2b49, 0x2d48);
-		gencr(instr);
-	}
-
-	void JitCompilerX86::h_SUB_32(Instruction& instr, int i) {
-		genar(instr);
-		genbr132(instr, 0x2b41, 0x2d);
-		gencr(instr);
-	}
-
-	void JitCompilerX86::h_MUL_64(Instruction& instr, int i) {
-		genar(instr);
-		if ((instr.locb & 7) <= 5) {
-			emitByte(0x49); //REX
-			emit(uint16_t(0xaf0f)); // imul rax, r64
-			emitByte(0xc0 + (instr.regb % RegistersCount));
+	void JitCompilerX86::h_IXOR_R(Instruction& instr) {
+		if (instr.src != instr.dst) {
+			emit(REX_XOR_RR);
+			emitByte(0xc0 + 8 * instr.dst + instr.src);
 		}
 		else {
-			emitByte(0x48); //REX
-			emit(uint16_t(0xc069)); // imul rax, rax, imm32
-			emit(instr.imm32);
+			emit(REX_XOR_RI);
+			emitByte(0xf0 + instr.dst);
+			emit32(instr.imm32);
 		}
-		gencr(instr);
 	}
 
-	void JitCompilerX86::h_MULH_64(Instruction& instr, int i) {
-		genar(instr);
-		if ((instr.locb & 7) <= 5) {
-			emit(uint16_t(0x8b49)); //mov rcx, r64
-			emitByte(0xc8 + (instr.regb % RegistersCount));
+	void JitCompilerX86::h_IXOR_M(Instruction& instr) {
+		if (instr.src != instr.dst) {
+			genAddressReg(instr);
+			emit(REX_XOR_RM);
+			emitByte(0x04 + 8 * instr.dst);
+			emitByte(0x06);
+		}
+		else {
+			emit(REX_XOR_RM);
+			emitByte(0x86 + 8 * instr.dst);
+			genAddressImm(instr);
+		}
+	}
+
+	void JitCompilerX86::h_IROR_R(Instruction& instr) {
+		if (instr.src != instr.dst) {
+			emit(REX_MOV_RR);
+			emitByte(0xc8 + instr.src);
+			emit(REX_ROT_CL);
+			emitByte(0xc8 + instr.dst);
 		}
 		else {
-			emitByte(0x48);
-			emit(uint16_t(0xc1c7)); // mov rcx, imm32
-			emit(instr.imm32);
+			emit(REX_ROT_I8);
+			emitByte(0xc8 + instr.dst);
+			emitByte(instr.imm32 & 63);
 		}
-		emitByte(0x48);
-		emit(uint16_t(0xe1f7)); // mul rcx
-		emitByte(0x48);
-		emit(uint16_t(0xc28b)); //	mov rax,rdx
-		gencr(instr);
 	}
 
-	void JitCompilerX86::h_MUL_32(Instruction& instr, int i) {
-		genar(instr);
-		emit(uint16_t(0xc88b)); //mov ecx, eax
-		if ((instr.locb & 7) <= 5) {
-			emit(uint16_t(0x8b41)); // mov eax, r32
-			emitByte(0xc0 + (instr.regb % RegistersCount));
+	void JitCompilerX86::h_IROL_R(Instruction& instr) {
+		if (instr.src != instr.dst) {
+			emit(REX_MOV_RR);
+			emitByte(0xc8 + instr.src);
+			emit(REX_ROT_CL);
+			emitByte(0xc0 + instr.dst);
 		}
 		else {
-			emitByte(0xb8); // mov eax, imm32
-			emit(instr.imm32);
+			emit(REX_ROT_I8);
+			emitByte(0xc0 + instr.dst);
+			emitByte(instr.imm32 & 63);
 		}
-		emit(0xc1af0f48); //imul rax,rcx
-		gencr(instr);
 	}
 
-	void JitCompilerX86::h_IMUL_32(Instruction& instr, int i) {
-		genar(instr);
-		emitByte(0x48);
-		emit(uint16_t(0xc863)); //movsxd rcx,eax
-		if ((instr.locb & 7) <= 5) {
-			emit(uint16_t(0x6349)); //movsxd rax,r32
-			emitByte(0xc0 + (instr.regb % RegistersCount));
+	void JitCompilerX86::h_ISWAP_R(Instruction& instr) {
+		if (instr.src != instr.dst) {
+			emit(REX_XCHG);
+			emitByte(0xc0 + instr.dst + 8 * instr.src);
 		}
-		else {
-			emitByte(0x48);
-			emit(uint16_t(0xc0c7)); // mov rax, imm32
-			emit(instr.imm32);
+	}
+
+	void JitCompilerX86::h_FSWAP_R(Instruction& instr) {
+		emit(SHUFPD);
+		emitByte(0xc0 + 9 * instr.dst);
+		emitByte(1);
+	}
+
+	void JitCompilerX86::h_FADD_R(Instruction& instr) {
+		instr.dst %= 4;
+		instr.src %= 4;
+		emit(REX_ADDPD);
+		emitByte(0xc0 + instr.src + 8 * instr.dst);
+		//emit(REX_PADD);
+		//emitByte(PADD_OPCODES[instr.mod % 4]);
+		//emitByte(0xf8 + instr.dst);
+	}
+
+	void JitCompilerX86::h_FADD_M(Instruction& instr) {
+		instr.dst %= 4;
+		genAddressReg(instr);
+		emit(REX_CVTDQ2PD_XMM12);
+		emit(REX_ADDPD);
+		emitByte(0xc4 + 8 * instr.dst);
+	}
+
+	void JitCompilerX86::h_FSUB_R(Instruction& instr) {
+		instr.dst %= 4;
+		instr.src %= 4;
+		emit(REX_SUBPD);
+		emitByte(0xc0 + instr.src + 8 * instr.dst);
+		//emit(REX_PADD);
+		//emitByte(PADD_OPCODES[instr.mod % 4]);
+		//emitByte(0xf8 + instr.dst);
+	}
+
+	void JitCompilerX86::h_FSUB_M(Instruction& instr) {
+		instr.dst %= 4;
+		genAddressReg(instr);
+		emit(REX_CVTDQ2PD_XMM12);
+		emit(REX_SUBPD);
+		emitByte(0xc4 + 8 * instr.dst);
+	}
+
+	void JitCompilerX86::h_FNEG_R(Instruction& instr) {
+		instr.dst %= 4;
+		emit(REX_XORPS);
+		emitByte(0xc7 + 8 * instr.dst);
+	}
+
+	void JitCompilerX86::h_FMUL_R(Instruction& instr) {
+		instr.dst %= 4;
+		instr.src %= 4;
+		emit(REX_MULPD);
+		emitByte(0xe0 + instr.src + 8 * instr.dst);
+	}
+
+	void JitCompilerX86::h_FMUL_M(Instruction& instr) {
+		instr.dst %= 4;
+		genAddressReg(instr);
+		emit(REX_CVTDQ2PD_XMM12);
+		emit(REX_ANDPS_XMM12);
+		emit(REX_MULPD);
+		emitByte(0xe4 + 8 * instr.dst);
+		emit(REX_MAXPD);
+		emitByte(0xe5 + 8 * instr.dst);
+	}
+
+	void JitCompilerX86::h_FDIV_R(Instruction& instr) {
+		instr.dst %= 4;
+		instr.src %= 4;
+		emit(REX_DIVPD);
+		emitByte(0xe0 + instr.src + 8 * instr.dst);
+		emit(REX_MAXPD);
+		emitByte(0xe5 + 8 * instr.dst);
+	}
+
+	void JitCompilerX86::h_FDIV_M(Instruction& instr) {
+		instr.dst %= 4;
+		genAddressReg(instr);
+		emit(REX_CVTDQ2PD_XMM12);
+		emit(REX_ANDPS_XMM12);
+		emit(REX_DIVPD);
+		emitByte(0xe4 + 8 * instr.dst);
+		emit(REX_MAXPD);
+		emitByte(0xe5 + 8 * instr.dst);
+	}
+
+	void JitCompilerX86::h_FSQRT_R(Instruction& instr) {
+		instr.dst %= 4;
+		emit(SQRTPD);
+		emitByte(0xe4 + 9 * instr.dst);
+	}
+
+	void JitCompilerX86::h_CFROUND(Instruction& instr) {
+		emit(REX_MOV_RR64);
+		emitByte(0xc0 + instr.src);
+		int rotate = (13 - (instr.imm32 & 63)) & 63;
+		if (rotate != 0) {
+			emit(ROL_RAX);
+			emitByte(rotate);
 		}
-		emit(0xc1af0f48); //imul rax,rcx
-		gencr(instr);
+		emit(AND_OR_MOV_LDMXCSR);
 	}
 
-	void JitCompilerX86::h_IMULH_64(Instruction& instr, int i) {
-		genar(instr);
-		if ((instr.locb & 7) <= 5) {
-			emit(uint16_t(0x8b49)); //mov rcx, r64
-			emitByte(0xc8 + (instr.regb % RegistersCount));
-		}
-		else {
-			emitByte(0x48);
-			emit(uint16_t(0xc1c7)); // mov rcx, imm32
-			emit(instr.imm32);
-		}
-		emitByte(0x48);
-		emit(uint16_t(0xe9f7)); // imul rcx
-		emitByte(0x48);
-		emit(uint16_t(0xc28b)); //	mov rax,rdx
-		gencr(instr);
-	}
-
-	void JitCompilerX86::h_DIV_64(Instruction& instr, int i) {
-		genar(instr);
-		if ((instr.locb & 7) <= 5) {
-			emitByte(0xb9); //mov ecx, 1
-			emit(1);
-			emit(uint16_t(0x8b41)); //mov edx, r32
-			emitByte(0xd0 + (instr.regb % RegistersCount));
-			emit(0x450fd285); //test edx, edx; cmovne ecx,edx
-			emitByte(0xca);
-		}
-		else {
-			emitByte(0xb9); //mov ecx, imm32
-			emit(instr.imm32 != 0 ? instr.imm32 : 1);
-		}
-		emit(0xf748d233); //xor edx,edx; div rcx
-		emitByte(0xf1);
-		gencr(instr);
-	}
-
-	void JitCompilerX86::h_IDIV_64(Instruction& instr, int i) {
-		genar(instr);
-		if ((instr.locb & 7) <= 5) {
-			emit(uint16_t(0x8b41)); //mov edx, r32
-			emitByte(0xd0 + (instr.regb % RegistersCount));
-		}
-		else {
-			emitByte(0xba); // xxx edx, imm32
-			emit(instr.imm32);
-		}
-		emit(0xc88b480b75fffa83);
-		emit(0x1274c9ff48c1d148);
-		emit(0x0fd28500000001b9);
-		emit(0x489948c96348ca45);
-		emit(uint16_t(0xf9f7)); //idiv rcx
-		gencr(instr);
-	}
-
-	void JitCompilerX86::h_AND_64(Instruction& instr, int i) {
-		genar(instr);
-		genbr1(instr, 0x2349, 0x2548);
-		gencr(instr);
-	}
-
-	void JitCompilerX86::h_AND_32(Instruction& instr, int i) {
-		genar(instr);
-		genbr132(instr, 0x2341, 0x25);
-		gencr(instr);
-	}
-
-	void JitCompilerX86::h_OR_64(Instruction& instr, int i) {
-		genar(instr);
-		genbr1(instr, 0x0b49, 0x0d48);
-		gencr(instr);
-	}
-
-	void JitCompilerX86::h_OR_32(Instruction& instr, int i) {
-		genar(instr);
-		genbr132(instr, 0x0b41, 0x0d);
-		gencr(instr);
-	}
-
-	void JitCompilerX86::h_XOR_64(Instruction& instr, int i) {
-		genar(instr);
-		genbr1(instr, 0x3349, 0x3548);
-		gencr(instr);
-	}
-
-	void JitCompilerX86::h_XOR_32(Instruction& instr, int i) {
-		genar(instr);
-		genbr132(instr, 0x3341, 0x35);
-		gencr(instr);
-	}
-
-	void JitCompilerX86::h_SHL_64(Instruction& instr, int i) {
-		genar(instr);
-		genbr0(instr, 0xe0d3, 0xe0c1);
-		gencr(instr);
-	}
-
-	void JitCompilerX86::h_SHR_64(Instruction& instr, int i) {
-		genar(instr);
-		genbr0(instr, 0xe8d3, 0xe8c1);
-		gencr(instr);
-	}
-
-	void JitCompilerX86::h_SAR_64(Instruction& instr, int i) {
-		genar(instr);
-		genbr0(instr, 0xf8d3, 0xf8c1);
-		gencr(instr);
-	}
-
-	void JitCompilerX86::h_ROL_64(Instruction& instr, int i) {
-		genar(instr);
-		genbr0(instr, 0xc0d3, 0xc0c1);
-		gencr(instr);
-	}
-
-	void JitCompilerX86::h_ROR_64(Instruction& instr, int i) {
-		genar(instr);
-		genbr0(instr, 0xc8d3, 0xc8c1);
-		gencr(instr);
-	}
-
-	void JitCompilerX86::h_FPADD(Instruction& instr, int i) {
-		genaf(instr);
-		genbf(instr, 0x58);
-		gencf(instr);
-	}
-
-	void JitCompilerX86::h_FPSUB(Instruction& instr, int i) {
-		genaf(instr);
-		genbf(instr, 0x5c);
-		gencf(instr);
-	}
-
-	void JitCompilerX86::h_FPMUL(Instruction& instr, int i) {
-		genaf(instr);
-		genbf(instr, 0x59);
-		emit(0x00c9c20f66c8280f); //movaps xmm1,xmm0; cmpeqpd xmm1,xmm1
-		emit(uint16_t(0x540f)); //andps  xmm0,xmm1
-		emitByte(0xc1);
-		gencf(instr);
-	}
-
-	void JitCompilerX86::h_FPDIV(Instruction& instr, int i) {
-		genaf(instr);
-		genbf(instr, 0x5e);
-		emit(0x00c9c20f66c8280f); //movaps xmm1,xmm0; cmpeqpd xmm1,xmm1
-		emit(uint16_t(0x540f)); //andps  xmm0,xmm1
-		emitByte(0xc1);
-		gencf(instr);
-	}
-
-	void JitCompilerX86::h_FPSQRT(Instruction& instr, int i) {
-		genaf(instr);
-		emit(0xc0510f66c2540f41); //andps  xmm0,xmm10; sqrtpd xmm0,xmm0
-		gencf(instr);
-	}
-
-	void JitCompilerX86::h_FPROUND(Instruction& instr, int i) {
-		genar(instr);
-		emit(0x81480de0c1c88b48);
-		emit(0x600025fffff800e1);
-		emit(uint16_t(0x0000));
-		emitByte(0xf2);
-		int regc = (instr.regc % RegistersCount);
-		if (regc <= 1) {
-			emitByte(0x4c); //REX
-		}
-		else {
-			emitByte(0x48); //REX
-		}
-		emit(uint16_t(0x2a0f));
-		emitByte(0xc1 + 8 * regc);
-		emitByte(0x0d);
-		emit(0xf824448900009fc0);
-		emit(0x2454ae0f); //ldmxcsr DWORD PTR [rsp-0x8]
-		emitByte(0xf8);
-		gencf(instr, true);
-	}
-
-	static inline uint8_t jumpCondition(Instruction& instr, bool invert = false) {
-		switch ((instr.locb & 7) ^ invert)
+	static inline uint8_t condition(Instruction& instr, bool invert = false) {
+		switch ((instr.mod & 7) ^ invert)
 		{
 			case 0:
-				return 0x76; //jbe
+				return 0x96; //setbe
 			case 1:
-				return 0x77; //ja
+				return 0x97; //seta
 			case 2:
-				return 0x78; //js
+				return 0x98; //sets
 			case 3:
-				return 0x79; //jns
+				return 0x99; //setns
 			case 4:
-				return 0x70; //jo
+				return 0x90; //seto
 			case 5:
-				return 0x71; //jno
+				return 0x91; //setno
 			case 6:
-				return 0x7c; //jl
+				return 0x9c; //setl
 			case 7:
-				return 0x7d; //jge
+				return 0x9d; //setge
+			default:
+				UNREACHABLE;
 		}
 	}
 
-	void JitCompilerX86::h_CALL(Instruction& instr, int i) {
-		genar(instr);
-		emit(uint16_t(0x8141)); //cmp regb, imm32
-		emitByte(0xf8 + (instr.regb % RegistersCount));
-		emit(instr.imm32);
-		emitByte(jumpCondition(instr));
-		if ((instr.locc & 7) <= 3) {
-			emitByte(0x16);
-		}
-		else {
-			emitByte(0x05);
-		}
-		gencr(instr);
-		emit(uint16_t(0x06eb)); //jmp to next
-		emitByte(0x50); //push rax
-		emitByte(0xe8); //call
-		i = wrapInstr(i + (instr.imm8 & 127) + 2);
-		if (i < instructionOffsets.size()) {
-			emit(instructionOffsets[i] - (codePos + 4));
-		}
-		else {
-			callOffsets.push_back(CallOffset(codePos, i));
-			codePos += 4;
-		}
+	void JitCompilerX86::h_COND_R(Instruction& instr) {
+		emit(XOR_ECX_ECX);
+		emit(REX_CMP_R32I);
+		emitByte(0xf8 + instr.src);
+		emit32(instr.imm32);
+		emitByte(0x0f);
+		emitByte(condition(instr));
+		emitByte(0xc1);
+		emit(REX_ADD_RM);
+		emitByte(0xc1 + 8 * instr.dst);
 	}
 
-	void JitCompilerX86::h_RET(Instruction& instr, int i) {
-		genar(instr);
-		int crlen = 0;
-		if ((instr.locc & 7) <= 3) {
-			crlen = 17;
-		}
-		emit(0x74e53b48); //cmp rsp, rbp; je
-		emitByte(11 + crlen);
-		emitByte(0x48);
-		emit(0x08244433); //xor rax,QWORD PTR [rsp+0x8]
-		gencr(instr);
-		emitByte(0xc2); //ret 8
-		emit(uint16_t(0x0008));
-		gencr(instr);
+	void JitCompilerX86::h_COND_M(Instruction& instr) {
+		emit(XOR_ECX_ECX);
+		genAddressReg(instr);
+		emit(REX_CMP_M32I);
+		emit32(instr.imm32);
+		emitByte(0x0f);
+		emitByte(condition(instr));
+		emitByte(0xc1);
+		emit(REX_ADD_RM);
+		emitByte(0xc1 + 8 * instr.dst);
+	}
+
+	void JitCompilerX86::h_ISTORE(Instruction& instr) {
+		genAddressRegDst(instr);
+		emit(REX_MOV_MR);
+		emitByte(0x04 + 8 * instr.src);
+		emitByte(0x06);
+	}
+
+	void JitCompilerX86::h_FSTORE(Instruction& instr) {
+		genAddressRegDst(instr, true);
+		emit(MOVAPD);
+		emitByte(0x04 + 8 * instr.src);
+		emitByte(0x06);
+	}
+
+	void JitCompilerX86::h_NOP(Instruction& instr) {
+		emitByte(0x90);
 	}
 
 #include "instructionWeights.hpp"
 #define INST_HANDLE(x) REPN(&JitCompilerX86::h_##x, WT(x))
 
 	InstructionGeneratorX86 JitCompilerX86::engine[256] = {
-		INST_HANDLE(ADD_64)
-		INST_HANDLE(ADD_32)
-		INST_HANDLE(SUB_64)
-		INST_HANDLE(SUB_32)
-		INST_HANDLE(MUL_64)
-		INST_HANDLE(MULH_64)
-		INST_HANDLE(MUL_32)
-		INST_HANDLE(IMUL_32)
-		INST_HANDLE(IMULH_64)
-		INST_HANDLE(DIV_64)
-		INST_HANDLE(IDIV_64)
-		INST_HANDLE(AND_64)
-		INST_HANDLE(AND_32)
-		INST_HANDLE(OR_64)
-		INST_HANDLE(OR_32)
-		INST_HANDLE(XOR_64)
-		INST_HANDLE(XOR_32)
-		INST_HANDLE(SHL_64)
-		INST_HANDLE(SHR_64)
-		INST_HANDLE(SAR_64)
-		INST_HANDLE(ROL_64)
-		INST_HANDLE(ROR_64)
-		INST_HANDLE(FPADD)
-		INST_HANDLE(FPSUB)
-		INST_HANDLE(FPMUL)
-		INST_HANDLE(FPDIV)
-		INST_HANDLE(FPSQRT)
-		INST_HANDLE(FPROUND)
-		INST_HANDLE(CALL)
-		INST_HANDLE(RET)
+		INST_HANDLE(IADD_R)
+		INST_HANDLE(IADD_M)
+		INST_HANDLE(IADD_RC)
+		INST_HANDLE(ISUB_R)
+		INST_HANDLE(ISUB_M)
+		INST_HANDLE(IMUL_9C)
+		INST_HANDLE(IMUL_R)
+		INST_HANDLE(IMUL_M)
+		INST_HANDLE(IMULH_R)
+		INST_HANDLE(IMULH_M)
+		INST_HANDLE(ISMULH_R)
+		INST_HANDLE(ISMULH_M)
+		INST_HANDLE(IDIV_C)
+		INST_HANDLE(ISDIV_C)
+		INST_HANDLE(INEG_R)
+		INST_HANDLE(IXOR_R)
+		INST_HANDLE(IXOR_M)
+		INST_HANDLE(IROR_R)
+		INST_HANDLE(IROL_R)
+		INST_HANDLE(ISWAP_R)
+		INST_HANDLE(FSWAP_R)
+		INST_HANDLE(FADD_R)
+		INST_HANDLE(FADD_M)
+		INST_HANDLE(FSUB_R)
+		INST_HANDLE(FSUB_M)
+		INST_HANDLE(FNEG_R)
+		INST_HANDLE(FMUL_R)
+		INST_HANDLE(FMUL_M)
+		INST_HANDLE(FDIV_R)
+		INST_HANDLE(FDIV_M)
+		INST_HANDLE(FSQRT_R)
+		INST_HANDLE(COND_R)
+		INST_HANDLE(COND_M)
+		INST_HANDLE(CFROUND)
+		INST_HANDLE(ISTORE)
+		INST_HANDLE(FSTORE)
+		INST_HANDLE(NOP)
 	};
 
+
 #endif
 }
\ No newline at end of file
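Note on the operand encoding used by the handlers above: expressions such as `0xc0 + 8 * instr.dst + instr.src` and `0x84 + 8 * instr.dst` build x86-64 ModRM bytes by hand, and `genSIB` builds the SIB byte that follows a ModRM byte with `rm = 100b`. The sketch below spells out the bit layout. It is a minimal illustration only: `modRM` and `sib` are hypothetical helper names that do not exist in this patch, and REX prefix handling for r8-r15 is deliberately left out.

```cpp
#include <cstdint>

// Hypothetical helpers for illustration; not part of JitCompilerX86.

// ModRM byte: mod in bits 7-6, reg in bits 5-3, rm in bits 2-0.
constexpr uint8_t modRM(unsigned mod, unsigned reg, unsigned rm) {
	return static_cast<uint8_t>(((mod & 3) << 6) | ((reg & 7) << 3) | (rm & 7));
}

// SIB byte: scale in bits 7-6 (log2 of the index multiplier),
// index in bits 5-3, base in bits 2-0.
constexpr uint8_t sib(unsigned scale, unsigned index, unsigned base) {
	return static_cast<uint8_t>(((scale & 3) << 6) | ((index & 7) << 3) | (base & 7));
}

// Register-direct form (mod = 11b), the same value as 0xc0 + 8 * dst + src.
static_assert(modRM(3, 2, 5) == 0xc0 + 8 * 2 + 5, "register-direct ModRM");

// mod = 10b (disp32) with rm = 100b (SIB byte follows), the same value as 0x84 + 8 * dst.
static_assert(modRM(2, 1, 4) == 0x84 + 8 * 1, "ModRM announcing SIB + disp32");

// scale = 3 selects index * 8.
static_assert(sib(3, 1, 2) == 0xca, "SIB with scale 8");
```

Because both helpers are `constexpr`, the checks run at compile time, so a wrong shift width in the bit layout would show up as a build error rather than as mis-encoded machine code at run time.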
diff --git a/src/JitCompilerX86.hpp b/src/JitCompilerX86.hpp
index e2c432c..fedcf20 100644
--- a/src/JitCompilerX86.hpp
+++ b/src/JitCompilerX86.hpp
@@ -24,94 +24,108 @@ along with RandomX.  If not, see<http://www.gnu.org/licenses/>.
 #include <cstring>
 #include <vector>
 
-class Pcg32;
-
 namespace RandomX {
 
+	class Program;
 	class JitCompilerX86;
 
-	typedef void(JitCompilerX86::*InstructionGeneratorX86)(Instruction&, int);
+	typedef void(JitCompilerX86::*InstructionGeneratorX86)(Instruction&);
 
 	constexpr uint32_t CodeSize = 64 * 1024;
-	constexpr uint32_t CacheLineSize = 64;
-
-	struct CallOffset {
-		CallOffset(int32_t p, int32_t i) : pos(p), index(i) {}
-		int32_t pos;
-		int32_t index;
-	};
 
 	class JitCompilerX86 {
 	public:
 		JitCompilerX86();
-		void generateProgram(Pcg32&);
+		void generateProgram(Program&);
 		ProgramFunc getProgramFunc() {
 			return (ProgramFunc)code;
 		}
 		uint8_t* getCode() {
 			return code;
 		}
+		size_t getCodeSize();
 	private:
 		static InstructionGeneratorX86 engine[256];
 		uint8_t* code;
 		int32_t codePos;
-		std::vector<int32_t> instructionOffsets;
-		std::vector<CallOffset> callOffsets;
 
-		void genar(Instruction&);
-		void genaf(Instruction&);
-		void genbr0(Instruction&, uint16_t, uint16_t);
-		void genbr1(Instruction&, uint16_t, uint16_t);
-		void genbr132(Instruction&, uint16_t, uint8_t);
-		void genbf(Instruction&, uint8_t);
-		void scratchpadStoreR(Instruction&, uint32_t);
-		void scratchpadStoreF(Instruction&, int, uint32_t, bool);
-		void gencr(Instruction&);
-		void gencf(Instruction&, bool);
-		void generateCode(Instruction&, int);
-		void fixCallOffsets();
+		void genAddressReg(Instruction&, bool);
+		void genAddressRegDst(Instruction&, bool);
+		void genAddressImm(Instruction&);
+		void genSIB(int scale, int index, int base);
+
+		void generateCode(Instruction&);
 
 		void emitByte(uint8_t val) {
 			code[codePos] = val;
 			codePos++;
 		}
 
-		template<typename T>
-		void emit(T val) {
-			*reinterpret_cast<T*>(code + codePos) = val;
-			codePos += sizeof(T);
+		void emit32(uint32_t val) {
+			code[codePos + 0] = val;
+			code[codePos + 1] = val >> 8;
+			code[codePos + 2] = val >> 16;
+			code[codePos + 3] = val >> 24;
+			codePos += 4;
 		}
 
-		void h_ADD_64(Instruction&, int);
-		void h_ADD_32(Instruction&, int);
-		void h_SUB_64(Instruction&, int);
-		void h_SUB_32(Instruction&, int);
-		void h_MUL_64(Instruction&, int);
-		void h_MULH_64(Instruction&, int);
-		void h_MUL_32(Instruction&, int);
-		void h_IMUL_32(Instruction&, int);
-		void h_IMULH_64(Instruction&, int);
-		void h_DIV_64(Instruction&, int);
-		void h_IDIV_64(Instruction&, int);
-		void h_AND_64(Instruction&, int);
-		void h_AND_32(Instruction&, int);
-		void h_OR_64(Instruction&, int);
-		void h_OR_32(Instruction&, int);
-		void h_XOR_64(Instruction&, int);
-		void h_XOR_32(Instruction&, int);
-		void h_SHL_64(Instruction&, int);
-		void h_SHR_64(Instruction&, int);
-		void h_SAR_64(Instruction&, int);
-		void h_ROL_64(Instruction&, int);
-		void h_ROR_64(Instruction&, int);
-		void h_FPADD(Instruction&, int);
-		void h_FPSUB(Instruction&, int);
-		void h_FPMUL(Instruction&, int);
-		void h_FPDIV(Instruction&, int);
-		void h_FPSQRT(Instruction&, int);
-		void h_FPROUND(Instruction&, int);
-		void h_CALL(Instruction&, int);
-		void h_RET(Instruction&, int);
+		void emit64(uint64_t val) {
+			code[codePos + 0] = val;
+			code[codePos + 1] = val >> 8;
+			code[codePos + 2] = val >> 16;
+			code[codePos + 3] = val >> 24;
+			code[codePos + 4] = val >> 32;
+			code[codePos + 5] = val >> 40;
+			code[codePos + 6] = val >> 48;
+			code[codePos + 7] = val >> 56;
+			codePos += 8;
+		}
+
+		template<size_t N>
+		void emit(const uint8_t (&src)[N]) {
+			for (unsigned i = 0; i < N; ++i) {
+				code[codePos + i] = src[i];
+			}
+			codePos += N;
+		}
+
+		void  h_IADD_R(Instruction&);
+		void  h_IADD_M(Instruction&);
+		void  h_IADD_RC(Instruction&);
+		void  h_ISUB_R(Instruction&);
+		void  h_ISUB_M(Instruction&);
+		void  h_IMUL_9C(Instruction&);
+		void  h_IMUL_R(Instruction&);
+		void  h_IMUL_M(Instruction&);
+		void  h_IMULH_R(Instruction&);
+		void  h_IMULH_M(Instruction&);
+		void  h_ISMULH_R(Instruction&);
+		void  h_ISMULH_M(Instruction&);
+		void  h_IDIV_C(Instruction&);
+		void  h_ISDIV_C(Instruction&);
+		void  h_INEG_R(Instruction&);
+		void  h_IXOR_R(Instruction&);
+		void  h_IXOR_M(Instruction&);
+		void  h_IROR_R(Instruction&);
+		void  h_IROL_R(Instruction&);
+		void  h_ISWAP_R(Instruction&);
+		void  h_FSWAP_R(Instruction&);
+		void  h_FADD_R(Instruction&);
+		void  h_FADD_M(Instruction&);
+		void  h_FSUB_R(Instruction&);
+		void  h_FSUB_M(Instruction&);
+		void  h_FNEG_R(Instruction&);
+		void  h_FMUL_R(Instruction&);
+		void  h_FMUL_M(Instruction&);
+		void  h_FDIV_R(Instruction&);
+		void  h_FDIV_M(Instruction&);
+		void  h_FSQRT_R(Instruction&);
+		void  h_COND_R(Instruction&);
+		void  h_COND_M(Instruction&);
+		void  h_CFROUND(Instruction&);
+		void  h_ISTORE(Instruction&);
+		void  h_FSTORE(Instruction&);
+		void  h_NOP(Instruction&);
 	};
 
 }
\ No newline at end of file
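Note on the new emitters in this header: `emit32` and `emit64` write values one byte at a time, least significant byte first, instead of storing through a `reinterpret_cast` pointer as the removed `emit<T>` template did. This appears intended to avoid unaligned stores and to make the byte order explicit. The sketch below shows the same idea in isolation. `le32` and `CodeBuffer` are hypothetical names, and the `std::vector` storage is an assumption made for brevity; the real class writes into a preallocated `code` buffer.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: split a 32-bit value into little-endian bytes.
constexpr std::array<uint8_t, 4> le32(uint32_t v) {
	return { { static_cast<uint8_t>(v), static_cast<uint8_t>(v >> 8),
	           static_cast<uint8_t>(v >> 16), static_cast<uint8_t>(v >> 24) } };
}

// Hypothetical stand-in for the JIT code buffer; not the JitCompilerX86 class.
class CodeBuffer {
public:
	void emitByte(uint8_t b) { bytes.push_back(b); }

	// Append a 32-bit immediate, least significant byte first.
	void emit32(uint32_t v) {
		for (uint8_t b : le32(v))
			emitByte(b);
	}

	// Append a fixed-size opcode template; N is deduced from the array type.
	template<size_t N>
	void emit(const uint8_t (&src)[N]) {
		bytes.insert(bytes.end(), src, src + N);
	}

	const std::vector<uint8_t>& data() const { return bytes; }

private:
	std::vector<uint8_t> bytes;
};

// The low byte comes first and the high byte last.
static_assert(le32(0x11223344)[0] == 0x44, "low byte first");
static_assert(le32(0x11223344)[3] == 0x11, "high byte last");
```

With this shape, an opcode constant declared as `static const uint8_t X[] = { ... };` can be passed directly to `emit(X)`, and the template deduces its length, which is the pattern the handlers in `JitCompilerX86.cpp` rely on.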
diff --git a/src/LightClientAsyncWorker.cpp b/src/LightClientAsyncWorker.cpp
new file mode 100644
index 0000000..f79d03d
--- /dev/null
+++ b/src/LightClientAsyncWorker.cpp
@@ -0,0 +1,123 @@
+/*
+Copyright (c) 2019 tevador
+
+This file is part of RandomX.
+
+RandomX is free software: you can redistribute it and/or modify
+it under the terms of the GNU General Public License as published by
+the Free Software Foundation, either version 3 of the License, or
+(at your option) any later version.
+
+RandomX is distributed in the hope that it will be useful,
+but WITHOUT ANY WARRANTY; without even the implied warranty of
+MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+GNU General Public License for more details.
+
+You should have received a copy of the GNU General Public License
+along with RandomX.  If not, see <http://www.gnu.org/licenses/>.
+*/
+
+#include "LightClientAsyncWorker.hpp"
+#include "dataset.hpp"
+#include "Cache.hpp"
+
+namespace RandomX {
+
+	template<bool softAes>
+	LightClientAsyncWorker<softAes>::LightClientAsyncWorker(const Cache* c) : ILightClientAsyncWorker(c), output(nullptr), hasWork(false),
+#ifdef TRACE
+		sw(true),
+#endif
+		workerThread(&LightClientAsyncWorker::runWorker, this) {
+
+	}
+
+	template<bool softAes>
+	void LightClientAsyncWorker<softAes>::prepareBlock(addr_t addr) {
+#ifdef TRACE
+		std::cout << sw.getElapsed() << ": prepareBlock-enter " << addr / CacheLineSize << std::endl;
+#endif
+		{
+			std::lock_guard<std::mutex> lk(mutex);
+			startBlock = addr / CacheLineSize;
+			blockCount = 1;
+			output = currentLine.data();
+			hasWork = true;
+		}
+#ifdef TRACE
+		std::cout << sw.getElapsed() << ": prepareBlock-notify " << startBlock << "/" << blockCount << std::endl;
+#endif
+		notifier.notify_one();
+	}
+
+	template<bool softAes>
+	const uint64_t* LightClientAsyncWorker<softAes>::getBlock(addr_t addr) {
+#ifdef TRACE
+		std::cout << sw.getElapsed() << ": getBlock-enter " << addr / CacheLineSize << std::endl;
+#endif
+		uint32_t currentBlock = addr / CacheLineSize;
+		if (currentBlock != startBlock || output != currentLine.data()) {
+			initBlock(cache->getCache(), (uint8_t*)currentLine.data(), currentBlock, cache->getKeys());
+		}
+		else {
+			sync();
+		}
+#ifdef TRACE
+		std::cout << sw.getElapsed() << ": getBlock-return " << addr / CacheLineSize << std::endl;
+#endif
+		return currentLine.data();
+	}
+
+	template<bool softAes>
+	void LightClientAsyncWorker<softAes>::prepareBlocks(void* out, uint32_t startBlock, uint32_t blockCount) {
+#ifdef TRACE
+		std::cout << sw.getElapsed() << ": prepareBlocks-enter " << startBlock << "/" << blockCount << std::endl;
+#endif
+		{
+			std::lock_guard<std::mutex> lk(mutex);
+			this->startBlock = startBlock;
+			this->blockCount = blockCount;
+			output = out;
+			hasWork = true;
+			notifier.notify_one();
+		}
+	}
+
+	template<bool softAes>
+	void LightClientAsyncWorker<softAes>::getBlocks(void* out, uint32_t startBlock, uint32_t blockCount) {
+		for (uint32_t i = 0; i < blockCount; ++i) {
+			initBlock(cache->getCache(), (uint8_t*)out + CacheLineSize * i, startBlock + i, cache->getKeys());
+		}
+	}
+
+	template<bool softAes>
+	void LightClientAsyncWorker<softAes>::sync() {
+		std::unique_lock<std::mutex> lk(mutex);
+		notifier.wait(lk, [this] { return !hasWork; });
+	}
+
+	template<bool softAes>
+	void LightClientAsyncWorker<softAes>::runWorker() {
+#ifdef TRACE
+		std::cout << sw.getElapsed() << ": runWorker-enter " << std::endl;
+#endif
+		for (;;) {
+			std::unique_lock<std::mutex> lk(mutex);
+			notifier.wait(lk, [this] { return hasWork; });
+#ifdef TRACE
+			std::cout << sw.getElapsed() << ": runWorker-getBlocks " << startBlock << "/" << blockCount << std::endl;
+#endif
+			//getBlocks(output, startBlock, blockCount); // disabled: the worker initializes only the block at startBlock; blockCount is ignored
+			initBlock(cache->getCache(), (uint8_t*)output, startBlock, cache->getKeys());
+			hasWork = false;
+#ifdef TRACE
+			std::cout << sw.getElapsed() << ": runWorker-finished " << startBlock << "/" << blockCount << std::endl;
+#endif
+			lk.unlock();
+			notifier.notify_one();
+		}
+	}
+
+	template class LightClientAsyncWorker<true>;
+	template class LightClientAsyncWorker<false>;
+}
\ No newline at end of file
diff --git a/src/LightClientAsyncWorker.hpp b/src/LightClientAsyncWorker.hpp
new file mode 100644
index 0000000..29571e5
--- /dev/null
+++ b/src/LightClientAsyncWorker.hpp
@@ -0,0 +1,60 @@
+/*
+Copyright (c) 2019 tevador
+
+This file is part of RandomX.
+
+RandomX is free software: you can redistribute it and/or modify
+it under the terms of the GNU General Public License as published by
+the Free Software Foundation, either version 3 of the License, or
+(at your option) any later version.
+
+RandomX is distributed in the hope that it will be useful,
+but WITHOUT ANY WARRANTY; without even the implied warranty of
+MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+GNU General Public License for more details.
+
+You should have received a copy of the GNU General Public License
+along with RandomX.  If not, see <http://www.gnu.org/licenses/>.
+*/
+
+//#define TRACE
+#include "common.hpp"
+
+#include <thread>
+#include <mutex>
+#include <condition_variable>
+#include <array>
+#ifdef TRACE
+#include "Stopwatch.hpp"
+#include <iostream>
+#endif
+
+namespace RandomX {
+
+	class Cache;
+
+	using DatasetLine = std::array<uint64_t, CacheLineSize / sizeof(uint64_t)>;
+
+	template<bool softAes>
+	class LightClientAsyncWorker : public ILightClientAsyncWorker {
+	public:
+		LightClientAsyncWorker(const Cache*);
+		void prepareBlock(addr_t) final;
+		void prepareBlocks(void* out, uint32_t startBlock, uint32_t blockCount) final;
+		const uint64_t* getBlock(addr_t) final;
+		void getBlocks(void* out, uint32_t startBlock, uint32_t blockCount) final;
+		void sync() final;
+	private:
+		void runWorker();
+		std::condition_variable notifier;
+		std::mutex mutex;
+		alignas(16) DatasetLine currentLine;
+		void* output;
+		uint32_t startBlock, blockCount;
+		bool hasWork;
+#ifdef TRACE
+		Stopwatch sw;
+#endif
+		std::thread workerThread;
+	};
+}
\ No newline at end of file
diff --git a/src/Pcg32.hpp b/src/Pcg32.hpp
deleted file mode 100644
index 906800f..0000000
--- a/src/Pcg32.hpp
+++ /dev/null
@@ -1,72 +0,0 @@
-/*
-Copyright (c) 2018 tevador
-
-This file is part of RandomX.
-
-RandomX is free software: you can redistribute it and/or modify
-it under the terms of the GNU General Public License as published by
-the Free Software Foundation, either version 3 of the License, or
-(at your option) any later version.
-
-RandomX is distributed in the hope that it will be useful,
-but WITHOUT ANY WARRANTY; without even the implied warranty of
-MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
-GNU General Public License for more details.
-
-You should have received a copy of the GNU General Public License
-along with RandomX.  If not, see<http://www.gnu.org/licenses/>.
-*/
-
-// Based on:
-// *Really* minimal PCG32 code / (c) 2014 M.E. O'Neill / pcg-random.org
-// Licensed under Apache License 2.0 (NO WARRANTY, etc. see website)
-
-#pragma once
-#include <cstdint>
-
-#if defined(_MSC_VER)
-#pragma warning (disable : 4146)
-#endif
-
-class Pcg32 {
-public:
-	typedef uint32_t result_type;
-	static constexpr result_type min() { return 0U; }
-	static constexpr result_type max() { return UINT32_MAX; }
-	Pcg32(const void* seed) {
-		auto* u64seed = (const uint64_t*)seed;
-		state = *(u64seed + 0);
-		inc = *(u64seed + 1) | 1ull;
-	}
-	Pcg32(uint64_t state, uint64_t inc) : state(state), inc(inc | 1ull) {
-	}
-	result_type operator()() {
-		return next();
-	}
-	result_type getUniform(result_type min, result_type max) {
-		const result_type range = max - min;
-		const result_type erange = range + 1;
-		result_type ret;
-
-		for (;;) {
-			ret = next();
-			if (ret / erange < UINT32_MAX / erange || UINT32_MAX % erange == range) {
-				ret %= erange;
-				break;
-			}
-		}
-		return ret + min;
-	}
-private:
-	uint64_t state;
-	uint64_t inc;
-	result_type next() {
-		uint64_t oldstate = state;
-		// Advance internal state
-		state = oldstate * 6364136223846793005ULL + inc;
-		// Calculate output function (XSH RR), uses old state for max ILP
-		uint32_t xorshifted = ((oldstate >> 18u) ^ oldstate) >> 27u;
-		uint32_t rot = oldstate >> 59u;
-		return (xorshifted >> rot) | (xorshifted << (-rot & 31));
-	}
-};
diff --git a/src/Program.cpp b/src/Program.cpp
index 6e94fca..bb4e086 100644
--- a/src/Program.cpp
+++ b/src/Program.cpp
@@ -18,19 +18,12 @@ along with RandomX.  If not, see<http://www.gnu.org/licenses/>.
 */
 
 #include "Program.hpp"
-#include "Pcg32.hpp"
+#include "hashAes1Rx4.hpp"
 
 namespace RandomX {
-	void Program::initialize(Pcg32& gen) {
-		for (unsigned i = 0; i < sizeof(programBuffer) / sizeof(Pcg32::result_type); ++i) {
-			*(((uint32_t*)&programBuffer) + i) = gen();
-		}
-	}
-
 	void Program::print(std::ostream& os) const {
 		for (int i = 0; i < RandomX::ProgramLength; ++i) {
 			auto instr = programBuffer[i];
-			os << std::dec << instr.getName() << " (" << i << "):" << std::endl;
 			os << instr;
 		}
 	}
diff --git a/src/Program.hpp b/src/Program.hpp
index 35b45d2..1f695a0 100644
--- a/src/Program.hpp
+++ b/src/Program.hpp
@@ -24,22 +24,25 @@ along with RandomX.  If not, see<http://www.gnu.org/licenses/>.
 #include "common.hpp"
 #include "Instruction.hpp"
 
-class Pcg32;
-
 namespace RandomX {
 
 	class Program {
 	public:
-		Instruction& operator()(uint64_t pc) {
+		Instruction& operator()(int pc) {
 			return programBuffer[pc];
 		}
-		void initialize(Pcg32& gen);
 		friend std::ostream& operator<<(std::ostream& os, const Program& p) {
 			p.print(os);
 			return os;
 		}
+		uint64_t getEntropy(int i) {
+			return entropyBuffer[i];
+		}
 	private:
 		void print(std::ostream&) const;
+		uint64_t entropyBuffer[16];
 		Instruction programBuffer[ProgramLength];
 	};
+
+	static_assert(sizeof(Program) % 64 == 0, "Invalid size of class Program");
 }
diff --git a/src/Stopwatch.hpp b/src/Stopwatch.hpp
index 4f3a5a1..931bc02 100644
--- a/src/Stopwatch.hpp
+++ b/src/Stopwatch.hpp
@@ -53,7 +53,7 @@ public:
 			isRunning = false;
 		}
 	}
-	double getElapsed() {
+	double getElapsed() const {
 		return getElapsedNanosec() / 1e+9;
 	}
 private:
@@ -63,7 +63,7 @@ private:
 	uint64_t elapsed;
 	bool isRunning;
 
-	uint64_t getElapsedNanosec() {
+	uint64_t getElapsedNanosec() const {
 		uint64_t elns = elapsed;
 		if (isRunning) {
 			chrono_t endMark = std::chrono::high_resolution_clock::now();
diff --git a/src/VirtualMachine.cpp b/src/VirtualMachine.cpp
index 103d245..057026c 100644
--- a/src/VirtualMachine.cpp
+++ b/src/VirtualMachine.cpp
@@ -19,85 +19,82 @@ along with RandomX.  If not, see<http://www.gnu.org/licenses/>.
 
 #include "VirtualMachine.hpp"
 #include "common.hpp"
-#include "dataset.hpp"
-#include "Cache.hpp"
-#include "t1ha/t1ha.h"
+#include "hashAes1Rx4.hpp"
 #include "blake2/blake2.h"
 #include <cstring>
 #include <iomanip>
+#include "intrinPortable.h"
 
 std::ostream& operator<<(std::ostream& os, const RandomX::RegisterFile& rf) {
 	for (int i = 0; i < RandomX::RegistersCount; ++i)
-		os << std::hex << "r" << i << " = " << rf.r[i].u64 << std::endl << std::dec;
-	for (int i = 0; i < RandomX::RegistersCount; ++i)
-		os << std::hex << "f" << i << " = " << rf.f[i].hi.u64 << " (" << rf.f[i].hi.f64 << ")" << std::endl
-		<< "   = " << rf.f[i].lo.u64 << " (" << rf.f[i].lo.f64 << ")" << std::endl << std::dec;
+		os << std::hex << "r" << i << " = " << rf.r[i] << std::endl << std::dec;
+	for (int i = 0; i < 4; ++i)
+		os << std::hex << "f" << i << " = " << *(uint64_t*)&rf.f[i].hi << " (" << rf.f[i].hi << ")" << std::endl
+		<< "   = " << *(uint64_t*)&rf.f[i].lo << " (" << rf.f[i].lo << ")" << std::endl << std::dec;
+	for (int i = 0; i < 4; ++i)
+		os << std::hex << "e" << i << " = " << *(uint64_t*)&rf.e[i].hi << " (" << rf.e[i].hi << ")" << std::endl
+		<< "   = " << *(uint64_t*)&rf.e[i].lo << " (" << rf.e[i].lo << ")" << std::endl << std::dec;
+	for (int i = 0; i < 4; ++i)
+		os << std::hex << "a" << i << " = " << *(uint64_t*)&rf.a[i].hi << " (" << rf.a[i].hi << ")" << std::endl
+		<< "   = " << *(uint64_t*)&rf.a[i].lo << " (" << rf.a[i].lo << ")" << std::endl << std::dec;
 	return os;
 }
 
 namespace RandomX {
 
-	VirtualMachine::VirtualMachine(bool softAes) : softAes(softAes), lightClient(false) {
+	constexpr int mantissaSize = 52;
+	constexpr int exponentSize = 11;
+	constexpr uint64_t mantissaMask = (1ULL << mantissaSize) - 1;
+	constexpr uint64_t exponentMask = (1ULL << exponentSize) - 1;
+	constexpr int exponentBias = 1023;
+
+	static inline uint64_t getSmallPositiveFloatBits(uint64_t entropy) {
+		auto exponent = entropy >> 59; //0..31
+		auto mantissa = entropy & mantissaMask;
+		exponent += exponentBias;
+		exponent &= exponentMask;
+		exponent <<= mantissaSize;
+		return exponent | mantissa;
+	}
+
+	VirtualMachine::VirtualMachine() {
 		mem.ds.dataset = nullptr;
 	}
 
-	VirtualMachine::~VirtualMachine() {
-		if (lightClient) {
-			delete mem.ds.lightDataset->block;
-			delete mem.ds.lightDataset;
-		}
+	void VirtualMachine::resetRoundingMode() {
+		initFpu();
 	}
 
-	void VirtualMachine::setDataset(dataset_t ds, bool light) {
-		if (mem.ds.dataset != nullptr) {
-			throw std::runtime_error("Dataset is already initialized");
-		}
-		lightClient = light;
-		if (light) {
-			auto lds = mem.ds.lightDataset = new LightClientDataset();
-			lds->cache = ds.cache;
-			lds->block = (uint8_t*)_mm_malloc(DatasetBlockSize, sizeof(__m128i));
-			lds->blockNumber = -1;
-			if (lds->block == nullptr) {
-				throw std::bad_alloc();
-			}
-			if (softAes) {
-				readDataset = &datasetReadLight<true>;
-			}
-			else {
-				readDataset = &datasetReadLight<false>;
-			}
-		}
-		else {
-			mem.ds = ds;
-			readDataset = &datasetRead;
-		}
+	void VirtualMachine::initialize() {
+		store64(&reg.a[0].lo, getSmallPositiveFloatBits(program.getEntropy(0)));
+		store64(&reg.a[0].hi, getSmallPositiveFloatBits(program.getEntropy(1)));
+		store64(&reg.a[1].lo, getSmallPositiveFloatBits(program.getEntropy(2)));
+		store64(&reg.a[1].hi, getSmallPositiveFloatBits(program.getEntropy(3)));
+		store64(&reg.a[2].lo, getSmallPositiveFloatBits(program.getEntropy(4)));
+		store64(&reg.a[2].hi, getSmallPositiveFloatBits(program.getEntropy(5)));
+		store64(&reg.a[3].lo, getSmallPositiveFloatBits(program.getEntropy(6)));
+		store64(&reg.a[3].hi, getSmallPositiveFloatBits(program.getEntropy(7)));
+		mem.ma = program.getEntropy(8) & CacheLineAlignMask;
+		mem.mx = program.getEntropy(10);
+		auto addressRegisters = program.getEntropy(12);
+		readReg0 = 0 + (addressRegisters & 1);
+		addressRegisters >>= 1;
+		readReg1 = 2 + (addressRegisters & 1);
+		addressRegisters >>= 1;
+		readReg2 = 4 + (addressRegisters & 1);
+		addressRegisters >>= 1;
+		readReg3 = 6 + (addressRegisters & 1);
 	}
 
-	void VirtualMachine::initializeScratchpad(uint32_t index) {
-		if (lightClient) {
-			auto cache = mem.ds.lightDataset->cache;
-			if (softAes) {
-				for (int i = 0; i < ScratchpadSize / DatasetBlockSize; ++i) {
-					initBlock<true>(cache->getCache(), ((uint8_t*)scratchpad) + DatasetBlockSize * i, (ScratchpadSize / DatasetBlockSize) * index + i, cache->getKeys());
-				}
-			}
-			else {
-				for (int i = 0; i < ScratchpadSize / DatasetBlockSize; ++i) {
-					initBlock<false>(cache->getCache(), ((uint8_t*)scratchpad) + DatasetBlockSize * i, (ScratchpadSize / DatasetBlockSize) * index + i, cache->getKeys());
-				}
-			}
-		}
-		else {
-			memcpy(scratchpad, mem.ds.dataset + ScratchpadSize * index, ScratchpadSize);
+	template<bool softAes>
+	void VirtualMachine::getResult(void* scratchpad, size_t scratchpadSize, void* outHash) {
+		if (scratchpadSize > 0) {
+			hashAes1Rx4<softAes>(scratchpad, scratchpadSize, &reg.a);
 		}
+		blake2b(outHash, ResultSize, &reg, sizeof(RegisterFile), nullptr, 0);
 	}
 
-	void VirtualMachine::getResult(void* out) {
-		constexpr size_t smallStateLength = sizeof(RegisterFile) / sizeof(uint64_t) + 2;
-		uint64_t smallState[smallStateLength];
-		memcpy(smallState, &reg, sizeof(RegisterFile));
-		smallState[smallStateLength - 1] = t1ha2_atonce128(&smallState[smallStateLength - 2], scratchpad, ScratchpadSize, reg.r[0].u64);
-		blake2b(out, ResultSize, smallState, sizeof(smallState), nullptr, 0);
-	}
+	template void VirtualMachine::getResult<false>(void* scratchpad, size_t scratchpadSize, void* outHash);
+	template void VirtualMachine::getResult<true>(void* scratchpad, size_t scratchpadSize, void* outHash);
+
 }
\ No newline at end of file
diff --git a/src/VirtualMachine.hpp b/src/VirtualMachine.hpp
index f7fdcd0..d1dbe26 100644
--- a/src/VirtualMachine.hpp
+++ b/src/VirtualMachine.hpp
@@ -20,26 +20,36 @@ along with RandomX.  If not, see<http://www.gnu.org/licenses/>.
 #pragma once
 #include <cstdint>
 #include "common.hpp"
+#include "Program.hpp"
 
 namespace RandomX {
 
+
+
 	class VirtualMachine {
 	public:
-		VirtualMachine(bool softAes);
-		virtual ~VirtualMachine();
-		virtual void setDataset(dataset_t ds, bool light = false);
-		void initializeScratchpad(uint32_t index);
-		virtual void initializeProgram(const void* seed) = 0;
+		VirtualMachine();
+		virtual ~VirtualMachine() {}
+		virtual void setDataset(dataset_t ds) = 0;
+		void setScratchpad(void* ptr) {
+			scratchpad = (uint8_t*)ptr;
+		}
+		void resetRoundingMode();
+		virtual void initialize();
 		virtual void execute() = 0;
-		void getResult(void*);
+		template<bool softAes>
+		void getResult(void* scratchpad, size_t scratchpadSize, void* outHash);
 		const RegisterFile& getRegisterFile() {
 			return reg;
 		}
+		Program* getProgramBuffer() {
+			return &program;
+		}
 	protected:
-		bool softAes, lightClient;
-		DatasetReadFunc readDataset;
+		alignas(16) Program program;
 		alignas(16) RegisterFile reg;
 		MemoryRegisters mem;
-		alignas(16) convertible_t scratchpad[ScratchpadLength];
+		uint8_t* scratchpad;
+		uint32_t readReg0, readReg1, readReg2, readReg3;
 	};
 }
\ No newline at end of file
diff --git a/src/asm/program_epilogue_store.inc b/src/asm/program_epilogue_store.inc
index b7b779b..b94fa4d 100644
--- a/src/asm/program_epilogue_store.inc
+++ b/src/asm/program_epilogue_store.inc
@@ -1,6 +1,3 @@
-	;# unroll VM stack
-	mov rsp, rbp
-
 	;# save VM register values
 	pop rcx
 	mov qword ptr [rcx+0], r8
@@ -11,8 +8,8 @@
 	mov qword ptr [rcx+40], r13
 	mov qword ptr [rcx+48], r14
 	mov qword ptr [rcx+56], r15
-	movdqa xmmword ptr [rcx+64], xmm8
-	movdqa xmmword ptr [rcx+80], xmm9
+	movdqa xmmword ptr [rcx+64], xmm0
+	movdqa xmmword ptr [rcx+80], xmm1
 	movdqa xmmword ptr [rcx+96], xmm2
 	movdqa xmmword ptr [rcx+112], xmm3
 	lea rcx, [rcx+64]
diff --git a/src/asm/program_epilogue_win64.inc b/src/asm/program_epilogue_win64.inc
index 220bed8..f2e4b44 100644
--- a/src/asm/program_epilogue_win64.inc
+++ b/src/asm/program_epilogue_win64.inc
@@ -1,6 +1,12 @@
 	include program_epilogue_store.inc
 
 	;# restore callee-saved registers - Microsoft x64 calling convention
+	movdqu xmm15, xmmword ptr [rsp]
+	movdqu xmm14, xmmword ptr [rsp+16]
+	movdqu xmm13, xmmword ptr [rsp+32]
+	movdqu xmm12, xmmword ptr [rsp+48]
+	movdqu xmm11, xmmword ptr [rsp+64]
+	add rsp, 80
 	movdqu xmm10, xmmword ptr [rsp]
 	movdqu xmm9, xmmword ptr [rsp+16]
 	movdqu xmm8, xmmword ptr [rsp+32]
@@ -17,4 +23,4 @@
 	pop rbx
 
 	;# program finished
-	ret	0
\ No newline at end of file
+	ret
diff --git a/src/asm/program_loop_load.inc b/src/asm/program_loop_load.inc
new file mode 100644
index 0000000..76b8f3d
--- /dev/null
+++ b/src/asm/program_loop_load.inc
@@ -0,0 +1,28 @@
+	mov rdx, rax
+	and eax, 2097088
+	lea rcx, [rsi+rax]
+	push rcx
+	xor r8,  qword ptr [rcx+0]
+	xor r9,  qword ptr [rcx+8]
+	xor r10, qword ptr [rcx+16]
+	xor r11, qword ptr [rcx+24]
+	xor r12, qword ptr [rcx+32]
+	xor r13, qword ptr [rcx+40]
+	xor r14, qword ptr [rcx+48]
+	xor r15, qword ptr [rcx+56]
+	ror rdx, 32
+	and edx, 2097088
+	lea rcx, [rsi+rdx]
+	push rcx
+	cvtdq2pd xmm0, qword ptr [rcx+0]
+	cvtdq2pd xmm1, qword ptr [rcx+8]
+	cvtdq2pd xmm2, qword ptr [rcx+16]
+	cvtdq2pd xmm3, qword ptr [rcx+24]
+	cvtdq2pd xmm4, qword ptr [rcx+32]
+	cvtdq2pd xmm5, qword ptr [rcx+40]
+	cvtdq2pd xmm6, qword ptr [rcx+48]
+	cvtdq2pd xmm7, qword ptr [rcx+56]
+	andps xmm4, xmm14
+	andps xmm5, xmm14
+	andps xmm6, xmm14
+	andps xmm7, xmm14
diff --git a/src/asm/program_loop_store.inc b/src/asm/program_loop_store.inc
new file mode 100644
index 0000000..a0acebc
--- /dev/null
+++ b/src/asm/program_loop_store.inc
@@ -0,0 +1,18 @@
+	pop rcx
+	mov qword ptr [rcx+0], r8
+	mov qword ptr [rcx+8], r9
+	mov qword ptr [rcx+16], r10
+	mov qword ptr [rcx+24], r11
+	mov qword ptr [rcx+32], r12
+	mov qword ptr [rcx+40], r13
+	mov qword ptr [rcx+48], r14
+	mov qword ptr [rcx+56], r15
+	pop rcx
+	mulpd xmm0, xmm4
+	mulpd xmm1, xmm5
+	mulpd xmm2, xmm6
+	mulpd xmm3, xmm7
+	movapd xmmword ptr [rcx+0], xmm0
+	movapd xmmword ptr [rcx+16], xmm1
+	movapd xmmword ptr [rcx+32], xmm2
+	movapd xmmword ptr [rcx+48], xmm3
diff --git a/src/asm/program_prologue_linux.inc b/src/asm/program_prologue_linux.inc
index 8d09d88..bdde664 100644
--- a/src/asm/program_prologue_linux.inc
+++ b/src/asm/program_prologue_linux.inc
@@ -7,11 +7,13 @@
 	push r15
 
 	;# function arguments
-	push rdi        ;# RegisterFile& registerFile
-	mov rbx, rsi    ;# MemoryRegisters& memory
-	mov rsi, rdx    ;# convertible_t* scratchpad
+	mov rbx, rcx                ;# loop counter
+	push rdi                    ;# RegisterFile& registerFile
 	mov rcx, rdi
+	mov rbp, qword ptr [rsi]    ;# "mx", "ma"
+	mov rdi, qword ptr [rsi+8]  ;# uint8_t* dataset
+	mov rsi, rdx                ;# convertible_t* scratchpad
 
 	#include "program_prologue_load.inc"
 
-	jmp randomx_program_begin
\ No newline at end of file
+	jmp DECL(randomx_program_loop_begin)
\ No newline at end of file
diff --git a/src/asm/program_prologue_load.inc b/src/asm/program_prologue_load.inc
index df44c08..757cf10 100644
--- a/src/asm/program_prologue_load.inc
+++ b/src/asm/program_prologue_load.inc
@@ -1,63 +1,21 @@
-	mov rbp, rsp      ;# beginning of VM stack
-	mov rdi, 1048577  ;# number of VM instructions to execute + 1
+	mov rax, rbp
 
-	xorps xmm10, xmm10
-	cmpeqpd xmm10, xmm10
-	psrlq xmm10, 1    ;# mask for absolute value = 0x7fffffffffffffff7fffffffffffffff
+	;# zero integer registers
+	xor r8, r8
+	xor r9, r9
+	xor r10, r10
+	xor r11, r11
+	xor r12, r12
+	xor r13, r13
+	xor r14, r14
+	xor r15, r15
 
-	;# reset rounding mode
-	mov dword ptr [rsp-8], 40896
-	ldmxcsr dword ptr [rsp-8]
-
-	;# load integer registers
-	mov r8, qword ptr [rcx+0]
-	mov r9, qword ptr [rcx+8]
-	mov r10, qword ptr [rcx+16]
-	mov r11, qword ptr [rcx+24]
-	mov r12, qword ptr [rcx+32]
-	mov r13, qword ptr [rcx+40]
-	mov r14, qword ptr [rcx+48]
-	mov r15, qword ptr [rcx+56]
-
-	;# initialize floating point registers
-	xorps xmm8, xmm8
-	cvtsi2sd xmm8, qword ptr [rcx+72]
-	pslldq xmm8, 8
-	cvtsi2sd xmm8, qword ptr [rcx+64]
-
-	xorps xmm9, xmm9
-	cvtsi2sd xmm9, qword ptr [rcx+88]
-	pslldq xmm9, 8
-	cvtsi2sd xmm9, qword ptr [rcx+80]
-
-	xorps xmm2, xmm2
-	cvtsi2sd xmm2, qword ptr [rcx+104]
-	pslldq xmm2, 8
-	cvtsi2sd xmm2, qword ptr [rcx+96]
-
-	xorps xmm3, xmm3
-	cvtsi2sd xmm3, qword ptr [rcx+120]
-	pslldq xmm3, 8
-	cvtsi2sd xmm3, qword ptr [rcx+112]
-
-	lea rcx, [rcx+64]
-
-	xorps xmm4, xmm4
-	cvtsi2sd xmm4, qword ptr [rcx+72]
-	pslldq xmm4, 8
-	cvtsi2sd xmm4, qword ptr [rcx+64]
-
-	xorps xmm5, xmm5
-	cvtsi2sd xmm5, qword ptr [rcx+88]
-	pslldq xmm5, 8
-	cvtsi2sd xmm5, qword ptr [rcx+80]
-
-	xorps xmm6, xmm6
-	cvtsi2sd xmm6, qword ptr [rcx+104]
-	pslldq xmm6, 8
-	cvtsi2sd xmm6, qword ptr [rcx+96]
-
-	xorps xmm7, xmm7
-	cvtsi2sd xmm7, qword ptr [rcx+120]
-	pslldq xmm7, 8
-	cvtsi2sd xmm7, qword ptr [rcx+112]
\ No newline at end of file
+	;# load constant registers
+	lea rcx, [rcx+120]
+	movapd xmm8, xmmword ptr [rcx+72]
+	movapd xmm9, xmmword ptr [rcx+88]
+	movapd xmm10, xmmword ptr [rcx+104]
+	movapd xmm11, xmmword ptr [rcx+120]
+	movapd xmm13, xmmword ptr [minDbl]
+	movapd xmm14, xmmword ptr [absMask]
+	movapd xmm15, xmmword ptr [signMask]
diff --git a/src/asm/program_prologue_win64.inc b/src/asm/program_prologue_win64.inc
index 6059904..b1da4d7 100644
--- a/src/asm/program_prologue_win64.inc
+++ b/src/asm/program_prologue_win64.inc
@@ -13,12 +13,20 @@
 	movdqu xmmword ptr [rsp+32], xmm8
 	movdqu xmmword ptr [rsp+16], xmm9
 	movdqu xmmword ptr [rsp+0], xmm10
+	sub rsp, 80
+	movdqu xmmword ptr [rsp+64], xmm11
+	movdqu xmmword ptr [rsp+48], xmm12
+	movdqu xmmword ptr [rsp+32], xmm13
+	movdqu xmmword ptr [rsp+16], xmm14
+	movdqu xmmword ptr [rsp+0], xmm15
 
-	;# function arguments
-	push rcx        ;# RegisterFile& registerFile
-	mov rbx, rdx    ;# MemoryRegisters& memory
-	mov rsi, r8     ;# convertible_t* scratchpad
+	; function arguments
+	push rcx                    ; RegisterFile& registerFile
+	mov rbp, qword ptr [rdx]    ; "mx", "ma"
+	mov rdi, qword ptr [rdx+8]  ; uint8_t* dataset
+	mov rsi, r8                 ; convertible_t* scratchpad
+	mov rbx, r9                 ; loop counter
 
 	include program_prologue_load.inc
 
-	jmp randomx_program_begin
\ No newline at end of file
+	jmp randomx_program_loop_begin
\ No newline at end of file
diff --git a/src/asm/program_read_dataset.inc b/src/asm/program_read_dataset.inc
new file mode 100644
index 0000000..061d32c
--- /dev/null
+++ b/src/asm/program_read_dataset.inc
@@ -0,0 +1,17 @@
+	xor rbp, rax                       ;# modify "mx"
+	xor eax, eax
+	and rbp, -64                       ;# align "mx" to the start of a cache line
+	mov edx, ebp                       ;# edx = mx
+	prefetchnta byte ptr [rdi+rdx]
+	ror rbp, 32                        ;# swap "ma" and "mx"
+	mov edx, ebp                       ;# edx = ma
+	lea rcx, [rdi+rdx]                 ;# dataset cache line
+	xor r8,  qword ptr [rcx+0]
+	xor r9,  qword ptr [rcx+8]
+	xor r10, qword ptr [rcx+16]
+	xor r11, qword ptr [rcx+24]
+	xor r12, qword ptr [rcx+32]
+	xor r13, qword ptr [rcx+40]
+	xor r14, qword ptr [rcx+48]
+	xor r15, qword ptr [rcx+56]
+	
\ No newline at end of file
diff --git a/src/asm/program_read_f.inc b/src/asm/program_read_f.inc
deleted file mode 100644
index 1d70dab..0000000
--- a/src/asm/program_read_f.inc
+++ /dev/null
@@ -1,13 +0,0 @@
-	mov edx, dword ptr [rbx]      ;# ma
-	mov rax, qword ptr [rbx+8]    ;# dataset
-	cvtdq2pd xmm0, qword ptr [rax+rdx]
-	add dword ptr [rbx], 8
-	xor ecx, dword ptr [rbx+4]    ;# mx
-	mov dword ptr [rbx+4], ecx
-	test ecx, 65528
-	jne short rx_read_dataset_f_ret
-	and ecx, -8
-	mov dword ptr [rbx], ecx
-	prefetcht0 byte ptr [rax+rcx]
-rx_read_dataset_f_ret:
-	ret 0
\ No newline at end of file
diff --git a/src/asm/program_read_r.inc b/src/asm/program_read_r.inc
deleted file mode 100644
index b3102dc..0000000
--- a/src/asm/program_read_r.inc
+++ /dev/null
@@ -1,13 +0,0 @@
-	mov eax, dword ptr [rbx]      ;# ma
-	mov rdx, qword ptr [rbx+8]    ;# dataset
-	mov rax, qword ptr [rdx+rax]
-	add dword ptr [rbx], 8
-	xor ecx, dword ptr [rbx+4]    ;# mx
-	mov dword ptr [rbx+4], ecx
-	test ecx, 65528
-	jne short rx_read_dataset_r_ret
-	and ecx, -8
-	mov dword ptr [rbx], ecx
-	prefetcht0 byte ptr [rdx+rcx]
-rx_read_dataset_r_ret:
-	ret 0
\ No newline at end of file
diff --git a/src/asm/program_transform_address.inc b/src/asm/program_transform_address.inc
new file mode 100644
index 0000000..0815e29
--- /dev/null
+++ b/src/asm/program_transform_address.inc
@@ -0,0 +1,154 @@
+	;# 90 address transformations
+	;# forced REX prefix is used to make all transformations 4 bytes long
+	lea eax, [rax+rax*8+109]
+	db 64
+	xor eax, 96
+	lea eax, [rax+rax*8-19]
+	db 64
+	add eax, -98
+	db 64
+	add eax, -21
+	db 64
+	xor eax, -80
+	lea eax, [rax+rax*8-92]
+	db 64
+	add eax, 113
+	lea eax, [rax+rax*8+100]
+	db 64
+	add eax, -39
+	db 64
+	xor eax, 120
+	lea eax, [rax+rax*8-119]
+	db 64
+	add eax, -113
+	db 64
+	add eax, 111
+	db 64
+	xor eax, 104
+	lea eax, [rax+rax*8-83]
+	lea eax, [rax+rax*8+127]
+	db 64
+	xor eax, -112
+	db 64
+	add eax, 89
+	db 64
+	add eax, -32
+	db 64
+	add eax, 104
+	db 64
+	xor eax, -120
+	db 64
+	xor eax, 24
+	lea eax, [rax+rax*8+9]
+	db 64
+	add eax, -31
+	db 64
+	xor eax, -16
+	db 64
+	add eax, 68
+	lea eax, [rax+rax*8-110]
+	db 64
+	xor eax, 64
+	db 64
+	xor eax, -40
+	db 64
+	xor eax, -8
+	db 64
+	add eax, -10
+	db 64
+	xor eax, -32
+	db 64
+	add eax, 14
+	lea eax, [rax+rax*8-46]
+	db 64
+	xor eax, -104
+	lea eax, [rax+rax*8+36]
+	db 64
+	add eax, 100
+	lea eax, [rax+rax*8-65]
+	lea eax, [rax+rax*8+27]
+	lea eax, [rax+rax*8+91]
+	db 64
+	add eax, -101
+	db 64
+	add eax, -94
+	lea eax, [rax+rax*8-10]
+	db 64
+	xor eax, 80
+	db 64
+	add eax, -108
+	db 64
+	add eax, -58
+	db 64
+	xor eax, 48
+	lea eax, [rax+rax*8+73]
+	db 64
+	xor eax, -48
+	db 64
+	xor eax, 32
+	db 64
+	xor eax, -96
+	db 64
+	add eax, 118
+	db 64
+	add eax, 91
+	lea eax, [rax+rax*8+18]
+	db 64
+	add eax, -11
+	lea eax, [rax+rax*8+63]
+	db 64
+	add eax, 114
+	lea eax, [rax+rax*8+45]
+	db 64
+	add eax, -67
+	db 64
+	add eax, 53
+	lea eax, [rax+rax*8-101]
+	lea eax, [rax+rax*8-1]
+	db 64
+	xor eax, 16
+	lea eax, [rax+rax*8-37]
+	lea eax, [rax+rax*8-28]
+	lea eax, [rax+rax*8-55]
+	db 64
+	xor eax, -88
+	db 64
+	xor eax, -72
+	db 64
+	add eax, 36
+	db 64
+	xor eax, -56
+	db 64
+	add eax, 116
+	db 64
+	xor eax, 88
+	db 64
+	xor eax, -128
+	db 64
+	add eax, 50
+	db 64
+	add eax, 105
+	db 64
+	add eax, -37
+	db 64
+	xor eax, 112
+	db 64
+	xor eax, 8
+	db 64
+	xor eax, -24
+	lea eax, [rax+rax*8+118]
+	db 64
+	xor eax, 72
+	db 64
+	xor eax, -64
+	db 64
+	add eax, 40
+	lea eax, [rax+rax*8-74]
+	lea eax, [rax+rax*8+82]
+	lea eax, [rax+rax*8+54]
+	db 64
+	xor eax, 56
+	db 64
+	xor eax, 40
+	db 64
+	add eax, 87
\ No newline at end of file
diff --git a/src/asm/program_xmm_constants.inc b/src/asm/program_xmm_constants.inc
new file mode 100644
index 0000000..38c897c
--- /dev/null
+++ b/src/asm/program_xmm_constants.inc
@@ -0,0 +1,6 @@
+minDbl:
+	db 0, 0, 0, 0, 0, 0, 16, 0, 0, 0, 0, 0, 0, 0, 16, 0
+absMask:
+	db 255, 255, 255, 255, 255, 255, 255, 127, 255, 255, 255, 255, 255, 255, 255, 127
+signMask:
+	db 0, 0, 0, 0, 0, 0, 0, 128, 0, 0, 0, 0, 0, 0, 0, 128
\ No newline at end of file
diff --git a/src/asm/squareHash.inc b/src/asm/squareHash.inc
new file mode 100644
index 0000000..b62dc9e
--- /dev/null
+++ b/src/asm/squareHash.inc
@@ -0,0 +1,87 @@
+	mov rax, 1613783669344650115
+	add rax, rcx
+	mul rax
+	sub rax, rdx ;# 1
+	mul rax
+	sub rax, rdx ;# 2
+	mul rax
+	sub rax, rdx ;# 3
+	mul rax
+	sub rax, rdx ;# 4
+	mul rax
+	sub rax, rdx ;# 5
+	mul rax
+	sub rax, rdx ;# 6
+	mul rax
+	sub rax, rdx ;# 7
+	mul rax
+	sub rax, rdx ;# 8
+	mul rax
+	sub rax, rdx ;# 9
+	mul rax
+	sub rax, rdx ;# 10
+	mul rax
+	sub rax, rdx ;# 11
+	mul rax
+	sub rax, rdx ;# 12
+	mul rax
+	sub rax, rdx ;# 13
+	mul rax
+	sub rax, rdx ;# 14
+	mul rax
+	sub rax, rdx ;# 15
+	mul rax
+	sub rax, rdx ;# 16
+	mul rax
+	sub rax, rdx ;# 17
+	mul rax
+	sub rax, rdx ;# 18
+	mul rax
+	sub rax, rdx ;# 19
+	mul rax
+	sub rax, rdx ;# 20
+	mul rax
+	sub rax, rdx ;# 21
+	mul rax
+	sub rax, rdx ;# 22
+	mul rax
+	sub rax, rdx ;# 23
+	mul rax
+	sub rax, rdx ;# 24
+	mul rax
+	sub rax, rdx ;# 25
+	mul rax
+	sub rax, rdx ;# 26
+	mul rax
+	sub rax, rdx ;# 27
+	mul rax
+	sub rax, rdx ;# 28
+	mul rax
+	sub rax, rdx ;# 29
+	mul rax
+	sub rax, rdx ;# 30
+	mul rax
+	sub rax, rdx ;# 31
+	mul rax
+	sub rax, rdx ;# 32
+	mul rax
+	sub rax, rdx ;# 33
+	mul rax
+	sub rax, rdx ;# 34
+	mul rax
+	sub rax, rdx ;# 35
+	mul rax
+	sub rax, rdx ;# 36
+	mul rax
+	sub rax, rdx ;# 37
+	mul rax
+	sub rax, rdx ;# 38
+	mul rax
+	sub rax, rdx ;# 39
+	mul rax
+	sub rax, rdx ;# 40
+	mul rax
+	sub rax, rdx ;# 41
+	mul rax
+	sub rax, rdx ;# 42
+	ret
\ No newline at end of file
diff --git a/src/blake2/blake2-impl.h b/src/blake2/blake2-impl.h
index 60b26fe..f294ba6 100644
--- a/src/blake2/blake2-impl.h
+++ b/src/blake2/blake2-impl.h
@@ -27,105 +27,10 @@ along with RandomX.  If not, see<http://www.gnu.org/licenses/>.
 #define PORTABLE_BLAKE2_IMPL_H
 
 #include <stdint.h>
-#include <string.h>
 
-#if defined(_MSC_VER)
-#define BLAKE2_INLINE __inline
-#elif defined(__GNUC__) || defined(__clang__)
-#define BLAKE2_INLINE __inline__
-#else
-#define BLAKE2_INLINE
-#endif
+#include "endian.h"
 
- /* Argon2 Team - Begin Code */
- /*
-	Not an exhaustive list, but should cover the majority of modern platforms
-	Additionally, the code will always be correct---this is only a performance
-	tweak.
- */
-#if (defined(__BYTE_ORDER__) &&                                                \
-     (__BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__)) ||                           \
-    defined(__LITTLE_ENDIAN__) || defined(__ARMEL__) || defined(__MIPSEL__) || \
-    defined(__AARCH64EL__) || defined(__amd64__) || defined(__i386__) ||       \
-    defined(_M_IX86) || defined(_M_X64) || defined(_M_AMD64) ||                \
-    defined(_M_ARM)
-#define NATIVE_LITTLE_ENDIAN
-#endif
- /* Argon2 Team - End Code */
-
-static BLAKE2_INLINE uint32_t load32(const void *src) {
-#if defined(NATIVE_LITTLE_ENDIAN)
-	uint32_t w;
-	memcpy(&w, src, sizeof w);
-	return w;
-#else
-	const uint8_t *p = (const uint8_t *)src;
-	uint32_t w = *p++;
-	w |= (uint32_t)(*p++) << 8;
-	w |= (uint32_t)(*p++) << 16;
-	w |= (uint32_t)(*p++) << 24;
-	return w;
-#endif
-}
-
-static BLAKE2_INLINE uint64_t load64(const void *src) {
-#if defined(NATIVE_LITTLE_ENDIAN)
-	uint64_t w;
-	memcpy(&w, src, sizeof w);
-	return w;
-#else
-	const uint8_t *p = (const uint8_t *)src;
-	uint64_t w = *p++;
-	w |= (uint64_t)(*p++) << 8;
-	w |= (uint64_t)(*p++) << 16;
-	w |= (uint64_t)(*p++) << 24;
-	w |= (uint64_t)(*p++) << 32;
-	w |= (uint64_t)(*p++) << 40;
-	w |= (uint64_t)(*p++) << 48;
-	w |= (uint64_t)(*p++) << 56;
-	return w;
-#endif
-}
-
-static BLAKE2_INLINE void store32(void *dst, uint32_t w) {
-#if defined(NATIVE_LITTLE_ENDIAN)
-	memcpy(dst, &w, sizeof w);
-#else
-	uint8_t *p = (uint8_t *)dst;
-	*p++ = (uint8_t)w;
-	w >>= 8;
-	*p++ = (uint8_t)w;
-	w >>= 8;
-	*p++ = (uint8_t)w;
-	w >>= 8;
-	*p++ = (uint8_t)w;
-#endif
-}
-
-static BLAKE2_INLINE void store64(void *dst, uint64_t w) {
-#if defined(NATIVE_LITTLE_ENDIAN)
-	memcpy(dst, &w, sizeof w);
-#else
-	uint8_t *p = (uint8_t *)dst;
-	*p++ = (uint8_t)w;
-	w >>= 8;
-	*p++ = (uint8_t)w;
-	w >>= 8;
-	*p++ = (uint8_t)w;
-	w >>= 8;
-	*p++ = (uint8_t)w;
-	w >>= 8;
-	*p++ = (uint8_t)w;
-	w >>= 8;
-	*p++ = (uint8_t)w;
-	w >>= 8;
-	*p++ = (uint8_t)w;
-	w >>= 8;
-	*p++ = (uint8_t)w;
-#endif
-}
-
-static BLAKE2_INLINE uint64_t load48(const void *src) {
+static FORCE_INLINE uint64_t load48(const void *src) {
 	const uint8_t *p = (const uint8_t *)src;
 	uint64_t w = *p++;
 	w |= (uint64_t)(*p++) << 8;
@@ -136,7 +41,7 @@ static BLAKE2_INLINE uint64_t load48(const void *src) {
 	return w;
 }
 
-static BLAKE2_INLINE void store48(void *dst, uint64_t w) {
+static FORCE_INLINE void store48(void *dst, uint64_t w) {
 	uint8_t *p = (uint8_t *)dst;
 	*p++ = (uint8_t)w;
 	w >>= 8;
@@ -151,11 +56,11 @@ static BLAKE2_INLINE void store48(void *dst, uint64_t w) {
 	*p++ = (uint8_t)w;
 }
 
-static BLAKE2_INLINE uint32_t rotr32(const uint32_t w, const unsigned c) {
+static FORCE_INLINE uint32_t rotr32(const uint32_t w, const unsigned c) {
 	return (w >> c) | (w << (32 - c));
 }
 
-static BLAKE2_INLINE uint64_t rotr64(const uint64_t w, const unsigned c) {
+static FORCE_INLINE uint64_t rotr64(const uint64_t w, const unsigned c) {
 	return (w >> c) | (w << (64 - c));
 }
 
diff --git a/src/blake2/blake2b.c b/src/blake2/blake2b.c
index e7569b4..329ed3c 100644
--- a/src/blake2/blake2b.c
+++ b/src/blake2/blake2b.c
@@ -51,29 +51,29 @@ static const unsigned int blake2b_sigma[12][16] = {
 	{14, 10, 4, 8, 9, 15, 13, 6, 1, 12, 0, 2, 11, 7, 5, 3},
 };
 
-static BLAKE2_INLINE void blake2b_set_lastnode(blake2b_state *S) {
+static FORCE_INLINE void blake2b_set_lastnode(blake2b_state *S) {
 	S->f[1] = (uint64_t)-1;
 }
 
-static BLAKE2_INLINE void blake2b_set_lastblock(blake2b_state *S) {
+static FORCE_INLINE void blake2b_set_lastblock(blake2b_state *S) {
 	if (S->last_node) {
 		blake2b_set_lastnode(S);
 	}
 	S->f[0] = (uint64_t)-1;
 }
 
-static BLAKE2_INLINE void blake2b_increment_counter(blake2b_state *S,
+static FORCE_INLINE void blake2b_increment_counter(blake2b_state *S,
 	uint64_t inc) {
 	S->t[0] += inc;
 	S->t[1] += (S->t[0] < inc);
 }
 
-static BLAKE2_INLINE void blake2b_invalidate_state(blake2b_state *S) {
+static FORCE_INLINE void blake2b_invalidate_state(blake2b_state *S) {
 	//clear_internal_memory(S, sizeof(*S));      /* wipe */
 	blake2b_set_lastblock(S); /* invalidate for further use */
 }
 
-static BLAKE2_INLINE void blake2b_init0(blake2b_state *S) {
+static FORCE_INLINE void blake2b_init0(blake2b_state *S) {
 	memset(S, 0, sizeof(*S));
 	memcpy(S->h, blake2b_IV, sizeof(S->h));
 }
diff --git a/src/blake2/blamka-round-ref.h b/src/blake2/blamka-round-ref.h
index d7acd68..d087b72 100644
--- a/src/blake2/blamka-round-ref.h
+++ b/src/blake2/blamka-round-ref.h
@@ -30,7 +30,7 @@ along with RandomX.  If not, see<http://www.gnu.org/licenses/>.
 #include "blake2-impl.h"
 
  /* designed by the Lyra PHC team */
-static BLAKE2_INLINE uint64_t fBlaMka(uint64_t x, uint64_t y) {
+static FORCE_INLINE uint64_t fBlaMka(uint64_t x, uint64_t y) {
 	const uint64_t m = UINT64_C(0xFFFFFFFF);
 	const uint64_t xy = (x & m) * (y & m);
 	return x + y + 2 * xy;
diff --git a/src/blake2/endian.h b/src/blake2/endian.h
new file mode 100644
index 0000000..fab1eed
--- /dev/null
+++ b/src/blake2/endian.h
@@ -0,0 +1,99 @@
+#pragma once
+#include <stdint.h>
+#include <string.h>
+
+#if defined(_MSC_VER)
+#define FORCE_INLINE __inline
+#elif defined(__GNUC__) || defined(__clang__)
+#define FORCE_INLINE __inline__
+#else
+#define FORCE_INLINE
+#endif
+
+ /* Argon2 Team - Begin Code */
+ /*
+	Not an exhaustive list, but should cover the majority of modern platforms
+	Additionally, the code will always be correct---this is only a performance
+	tweak.
+ */
+#if (defined(__BYTE_ORDER__) &&                                                \
+     (__BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__)) ||                           \
+    defined(__LITTLE_ENDIAN__) || defined(__ARMEL__) || defined(__MIPSEL__) || \
+    defined(__AARCH64EL__) || defined(__amd64__) || defined(__i386__) ||       \
+    defined(_M_IX86) || defined(_M_X64) || defined(_M_AMD64) ||                \
+    defined(_M_ARM)
+#define NATIVE_LITTLE_ENDIAN
+#endif
+ /* Argon2 Team - End Code */
+
+static FORCE_INLINE uint32_t load32(const void *src) {
+#if defined(NATIVE_LITTLE_ENDIAN)
+	uint32_t w;
+	memcpy(&w, src, sizeof w);
+	return w;
+#else
+	const uint8_t *p = (const uint8_t *)src;
+	uint32_t w = *p++;
+	w |= (uint32_t)(*p++) << 8;
+	w |= (uint32_t)(*p++) << 16;
+	w |= (uint32_t)(*p++) << 24;
+	return w;
+#endif
+}
+
+static FORCE_INLINE uint64_t load64(const void *src) {
+#if defined(NATIVE_LITTLE_ENDIAN)
+	uint64_t w;
+	memcpy(&w, src, sizeof w);
+	return w;
+#else
+	const uint8_t *p = (const uint8_t *)src;
+	uint64_t w = *p++;
+	w |= (uint64_t)(*p++) << 8;
+	w |= (uint64_t)(*p++) << 16;
+	w |= (uint64_t)(*p++) << 24;
+	w |= (uint64_t)(*p++) << 32;
+	w |= (uint64_t)(*p++) << 40;
+	w |= (uint64_t)(*p++) << 48;
+	w |= (uint64_t)(*p++) << 56;
+	return w;
+#endif
+}
+
+static FORCE_INLINE void store32(void *dst, uint32_t w) {
+#if defined(NATIVE_LITTLE_ENDIAN)
+	memcpy(dst, &w, sizeof w);
+#else
+	uint8_t *p = (uint8_t *)dst;
+	*p++ = (uint8_t)w;
+	w >>= 8;
+	*p++ = (uint8_t)w;
+	w >>= 8;
+	*p++ = (uint8_t)w;
+	w >>= 8;
+	*p++ = (uint8_t)w;
+#endif
+}
+
+static FORCE_INLINE void store64(void *dst, uint64_t w) {
+#if defined(NATIVE_LITTLE_ENDIAN)
+	memcpy(dst, &w, sizeof w);
+#else
+	uint8_t *p = (uint8_t *)dst;
+	*p++ = (uint8_t)w;
+	w >>= 8;
+	*p++ = (uint8_t)w;
+	w >>= 8;
+	*p++ = (uint8_t)w;
+	w >>= 8;
+	*p++ = (uint8_t)w;
+	w >>= 8;
+	*p++ = (uint8_t)w;
+	w >>= 8;
+	*p++ = (uint8_t)w;
+	w >>= 8;
+	*p++ = (uint8_t)w;
+	w >>= 8;
+	*p++ = (uint8_t)w;
+#endif
+}
diff --git a/src/common.hpp b/src/common.hpp
index 0bfc834..1d7f597 100644
--- a/src/common.hpp
+++ b/src/common.hpp
@@ -21,62 +21,68 @@ along with RandomX.  If not, see<http://www.gnu.org/licenses/>.
 
 #include <cstdint>
 #include <iostream>
+#include "blake2/endian.h"
 
 namespace RandomX {
 
 	using addr_t = uint32_t;
 
-	constexpr int RoundToNearest = 0;
-	constexpr int RoundDown = 1;
-	constexpr int RoundUp = 2;
-	constexpr int RoundToZero = 3;
-
 	constexpr int SeedSize = 32;
-	constexpr int ResultSize = 32;
+	constexpr int ResultSize = 64;
 
-	constexpr int CacheBlockSize = 1024;
-	constexpr int CacheShift = CacheBlockSize / 2;
-	constexpr int BlockExpansionRatio = 64;
-	constexpr uint32_t DatasetBlockSize = BlockExpansionRatio * CacheBlockSize;
-	constexpr uint32_t DatasetBlockCount = 65536;
-	constexpr uint32_t CacheSize = DatasetBlockCount * CacheBlockSize;
-	constexpr uint64_t DatasetSize = (uint64_t)DatasetBlockCount * DatasetBlockSize;
-
-	constexpr int ArgonIterations = 12;
-	constexpr uint32_t ArgonMemorySize = 65536; //KiB
+	constexpr int ArgonIterations = 3;
+	constexpr uint32_t ArgonMemorySize = 262144; //KiB
 	constexpr int ArgonLanes = 1;
 	const char ArgonSalt[] = "Monero\x1A$";
 	constexpr int ArgonSaltSize = sizeof(ArgonSalt) - 1;
 
+	constexpr int CacheLineSize = 64;
+	constexpr uint32_t CacheLineAlignMask = 0xFFFFFFFF & ~(CacheLineSize - 1);
+	constexpr uint64_t DatasetSize = 4ULL * 1024 * 1024 * 1024; //4 GiB
+	constexpr uint32_t CacheSize = ArgonMemorySize * 1024;
+	constexpr int CacheBlockCount = CacheSize / CacheLineSize;
+	constexpr int BlockExpansionRatio = DatasetSize / CacheSize;
+	constexpr int DatasetBlockCount = BlockExpansionRatio * CacheBlockCount;
+	constexpr int DatasetIterations = 16;
+
+
 #ifdef TRACE
 	constexpr bool trace = true;
 #else
 	constexpr bool trace = false;
 #endif
 
-	union convertible_t {
-		double f64;
-		int64_t i64;
-		uint64_t u64;
-		int32_t i32;
-		uint32_t u32;
-		struct {
-			int32_t i32lo;
-			int32_t i32hi;
-		};
-	};
+#ifndef UNREACHABLE
+#ifdef __GNUC__
+#define UNREACHABLE __builtin_unreachable()
+#elif defined(_MSC_VER)
+#define UNREACHABLE __assume(false)
+#else
+#define UNREACHABLE
+#endif
+#endif
+
+	using int_reg_t = uint64_t;
 
 	struct fpu_reg_t {
-		convertible_t lo;
-		convertible_t hi;
+		double lo;
+		double hi;
 	};
 
-	constexpr int ProgramLength = 512;
-	constexpr uint32_t InstructionCount = 1024 * 1024;
-	constexpr uint32_t ScratchpadSize = 256 * 1024;
-	constexpr uint32_t ScratchpadLength = ScratchpadSize / sizeof(convertible_t);
-	constexpr uint32_t ScratchpadL1 = ScratchpadSize / 16 / sizeof(convertible_t);
-	constexpr uint32_t ScratchpadL2 = ScratchpadSize / sizeof(convertible_t);
+	constexpr int ProgramLength = 256;
+	constexpr uint32_t InstructionCount = 2048;
+	constexpr uint32_t ScratchpadSize = 2 * 1024 * 1024;
+	constexpr uint32_t ScratchpadLength = ScratchpadSize / sizeof(int_reg_t);
+	constexpr uint32_t ScratchpadL1 = ScratchpadSize / 128 / sizeof(int_reg_t);
+	constexpr uint32_t ScratchpadL2 = ScratchpadSize / 8 / sizeof(int_reg_t);
+	constexpr uint32_t ScratchpadL3 = ScratchpadSize / sizeof(int_reg_t);
+	constexpr int ScratchpadL1Mask = (ScratchpadL1 - 1) * 8;
+	constexpr int ScratchpadL2Mask = (ScratchpadL2 - 1) * 8;
+	constexpr int ScratchpadL1Mask16 = (ScratchpadL1 / 2 - 1) * 16;
+	constexpr int ScratchpadL2Mask16 = (ScratchpadL2 / 2 - 1) * 16;
+	constexpr int ScratchpadL3Mask = (ScratchpadLength - 1) * 8;
+	constexpr int ScratchpadL3Mask64 = (ScratchpadLength / 8 - 1) * 64;
+	constexpr uint32_t TransformationCount = 90;
 	constexpr int RegistersCount = 8;
 
 	class Cache;
@@ -85,38 +91,50 @@ namespace RandomX {
 		return i % RandomX::ProgramLength;
 	}
 
-	struct LightClientDataset {
-		Cache* cache;
-		uint8_t* block;
-		uint32_t blockNumber;
+	class ILightClientAsyncWorker {
+	public:
+		virtual ~ILightClientAsyncWorker() {}
+		virtual void prepareBlock(addr_t) = 0;
+		virtual void prepareBlocks(void* out, uint32_t startBlock, uint32_t blockCount) = 0;
+		virtual const uint64_t* getBlock(addr_t) = 0;
+		virtual void getBlocks(void* out, uint32_t startBlock, uint32_t blockCount) = 0;
+		virtual void sync() = 0;
+		const Cache* getCache() {
+			return cache;
+		}
+	protected:
+		ILightClientAsyncWorker(const Cache* c) : cache(c) {}
+		const Cache* cache;
 	};
 
 	union dataset_t {
 		uint8_t* dataset;
 		Cache* cache;
-		LightClientDataset* lightDataset;
+		ILightClientAsyncWorker* asyncWorker;
 	};
 
 	struct MemoryRegisters {
-		addr_t ma, mx;
+		addr_t mx, ma;
 		dataset_t ds;
 	};
 
 	static_assert(sizeof(MemoryRegisters) == 2 * sizeof(addr_t) + sizeof(uintptr_t), "Invalid alignment of struct RandomX::MemoryRegisters");
 
 	struct RegisterFile {
-		convertible_t r[RegistersCount];
-		fpu_reg_t f[RegistersCount];
+		int_reg_t r[RegistersCount];
+		fpu_reg_t f[RegistersCount / 2];
+		fpu_reg_t e[RegistersCount / 2];
+		fpu_reg_t a[RegistersCount / 2];
 	};
 
-	static_assert(sizeof(RegisterFile) == 3 * RegistersCount * sizeof(convertible_t), "Invalid alignment of struct RandomX::RegisterFile");
+	static_assert(sizeof(RegisterFile) == 256, "Invalid alignment of struct RandomX::RegisterFile");
 
-	typedef convertible_t(*DatasetReadFunc)(addr_t, MemoryRegisters&);
+	typedef void(*DatasetReadFunc)(addr_t, MemoryRegisters&, int_reg_t(&reg)[RegistersCount]);
 
-	typedef void(*ProgramFunc)(RegisterFile&, MemoryRegisters&, convertible_t*);
+	typedef void(*ProgramFunc)(RegisterFile&, MemoryRegisters&, uint8_t* /* scratchpad */, uint64_t);
 
 	extern "C" {
-		void executeProgram(RegisterFile&, MemoryRegisters&, convertible_t*, DatasetReadFunc);
+		void executeProgram(RegisterFile&, MemoryRegisters&, uint8_t* /* scratchpad */, uint64_t);
 	}
 }
 
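Note on the scratchpad constants in the common.hpp hunk above: the following is a minimal sketch of the arithmetic behind the `ScratchpadL*Mask` values, not code from the patch. The constants are copied from the hunk, but they are declared here as `uint32_t` rather than `int`, and `scratchpadOffset` is a hypothetical helper introduced only for illustration. It assumes a mask is applied with a plain bitwise AND to turn an arbitrary 32-bit value into an aligned byte offset.

```cpp
#include <cassert>
#include <cstdint>

// Constants copied from the common.hpp hunk (types simplified to uint32_t).
constexpr uint32_t ScratchpadSize = 2 * 1024 * 1024;                        // 2 MiB
constexpr uint32_t ScratchpadL1 = ScratchpadSize / 128 / sizeof(uint64_t);  // 2048 words
constexpr uint32_t ScratchpadL2 = ScratchpadSize / 8 / sizeof(uint64_t);    // 32768 words
constexpr uint32_t ScratchpadL3 = ScratchpadSize / sizeof(uint64_t);        // 262144 words
constexpr uint32_t ScratchpadL1Mask = (ScratchpadL1 - 1) * 8;
constexpr uint32_t ScratchpadL2Mask = (ScratchpadL2 - 1) * 8;
constexpr uint32_t ScratchpadL3Mask = (ScratchpadL3 - 1) * 8;
constexpr uint32_t ScratchpadL3Mask64 = (ScratchpadL3 / 8 - 1) * 64;

// The word counts correspond to 16 KiB, 256 KiB and 2 MiB of 8-byte words.
static_assert(ScratchpadL1 * 8 == 16 * 1024, "L1 covers 16 KiB");
static_assert(ScratchpadL2 * 8 == 256 * 1024, "L2 covers 256 KiB");
static_assert(ScratchpadL3 * 8 == 2 * 1024 * 1024, "L3 covers 2 MiB");

// Each mask keeps the offset inside its level and clears the low alignment bits.
static_assert(ScratchpadL1Mask == 0x3FF8, "8-byte aligned, below 16 KiB");
static_assert(ScratchpadL2Mask == 0x3FFF8, "8-byte aligned, below 256 KiB");
static_assert(ScratchpadL3Mask == 0x1FFFF8, "8-byte aligned, below 2 MiB");
static_assert(ScratchpadL3Mask64 == 0x1FFFC0, "64-byte aligned, below 2 MiB");

// Hypothetical helper: reduce an arbitrary value to an aligned scratchpad offset.
constexpr uint32_t scratchpadOffset(uint32_t value, uint32_t mask) {
	return value & mask;
}
```

Because every level size is a power of two, a single AND both bounds the offset and aligns it, so no modulo or range check is needed on the access path.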
diff --git a/src/dataset.cpp b/src/dataset.cpp
index dee40c5..5b618f9 100644
--- a/src/dataset.cpp
+++ b/src/dataset.cpp
@@ -24,156 +24,103 @@ along with RandomX.  If not, see<http://www.gnu.org/licenses/>.
 
 #include "common.hpp"
 #include "dataset.hpp"
-#include "Pcg32.hpp"
 #include "Cache.hpp"
+#include "virtualMemory.hpp"
+#include "softAes.h"
+#include "squareHash.h"
+#include "blake2/endian.h"
 
 #if defined(__SSE2__)
 #include <wmmintrin.h>
-#define PREFETCH(memory) _mm_prefetch((const char *)((memory).ds.dataset + (memory).ma), _MM_HINT_T0)
+#define PREFETCHNTA(x) _mm_prefetch((const char *)(x), _MM_HINT_NTA)
 #else
-#define PREFETCH(memory)
+#define PREFETCHNTA(x)
 #endif
 
 namespace RandomX {
 
-	template<typename T>
-	static inline void shuffle(T* buffer, size_t bytes, Pcg32& gen) {
-		auto count = bytes / sizeof(T);
-		for (auto i = count - 1; i >= 1; --i) {
-			int j = gen.getUniform(0, i);
-			std::swap(buffer[j], buffer[i]);
+	void initBlock(const uint8_t* cache, uint8_t* out, uint32_t blockNumber, const KeysContainer& keys) {
+		uint64_t r0, r1, r2, r3, r4, r5, r6, r7;
+
+		r0 = 4ULL * blockNumber;
+		r1 = r2 = r3 = r4 = r5 = r6 = r7 = 0;
+
+		constexpr uint32_t mask = (CacheSize - 1) & CacheLineAlignMask;
+
+		for (auto i = 0; i < DatasetIterations; ++i) {
+			const uint8_t* mixBlock = cache + (r0 & mask);
+			PREFETCHNTA(mixBlock);
+			r0 = squareHash(r0);
+			r0 ^= load64(mixBlock + 0);
+			r1 ^= load64(mixBlock + 8);
+			r2 ^= load64(mixBlock + 16);
+			r3 ^= load64(mixBlock + 24);
+			r4 ^= load64(mixBlock + 32);
+			r5 ^= load64(mixBlock + 40);
+			r6 ^= load64(mixBlock + 48);
+			r7 ^= load64(mixBlock + 56);
 		}
+
+		store64(out + 0, r0);
+		store64(out + 8, r1);
+		store64(out + 16, r2);
+		store64(out + 24, r3);
+		store64(out + 32, r4);
+		store64(out + 40, r5);
+		store64(out + 48, r6);
+		store64(out + 56, r7);
 	}
 
-	template<bool soft>
-	static inline __m128i aesenc(__m128i in, __m128i key) {
-		return soft ? soft_aesenc(in, key) : _mm_aesenc_si128(in, key);
-	}
-
-	template<bool soft>
-	static inline __m128i aesdec(__m128i in, __m128i key) {
-		return soft ? soft_aesdec(in, key) : _mm_aesdec_si128(in, key);
-	}
-
-	template<bool soft, bool enc>
-	void initBlock(const uint8_t* in, uint8_t* out, uint32_t blockNumber, const KeysContainer& keys) {
-		__m128i xin, xout;
-		//Initialization vector = block number extended to 128 bits
-		xout = _mm_cvtsi32_si128(blockNumber);
-		//Expand + AES
-		for (uint32_t i = 0; i < DatasetBlockSize / sizeof(__m128i); ++i) {
-			if ((i % 32) == 0) {
-				xin = _mm_set_epi64x(*(uint64_t*)(in + i / 4), 0);
-				xout = _mm_xor_si128(xin, xout);
-			}
-			if (enc) {
-				xout = aesenc<soft>(xout, keys[0]);
-				xout = aesenc<soft>(xout, keys[1]);
-				xout = aesenc<soft>(xout, keys[2]);
-				xout = aesenc<soft>(xout, keys[3]);
-				xout = aesenc<soft>(xout, keys[4]);
-				xout = aesenc<soft>(xout, keys[5]);
-				xout = aesenc<soft>(xout, keys[6]);
-				xout = aesenc<soft>(xout, keys[7]);
-				xout = aesenc<soft>(xout, keys[8]);
-				xout = aesenc<soft>(xout, keys[9]);
-			}
-			else {
-				xout = aesdec<soft>(xout, keys[0]);
-				xout = aesdec<soft>(xout, keys[1]);
-				xout = aesdec<soft>(xout, keys[2]);
-				xout = aesdec<soft>(xout, keys[3]);
-				xout = aesdec<soft>(xout, keys[4]);
-				xout = aesdec<soft>(xout, keys[5]);
-				xout = aesdec<soft>(xout, keys[6]);
-				xout = aesdec<soft>(xout, keys[7]);
-				xout = aesdec<soft>(xout, keys[8]);
-				xout = aesdec<soft>(xout, keys[9]);
-			}
-			_mm_store_si128((__m128i*)(out + i * sizeof(__m128i)), xout);
-		}
-		//Shuffle
-		Pcg32 gen(&xout);
-		shuffle<uint32_t>((uint32_t*)out, DatasetBlockSize, gen);
-	}
-
-	template
-		void initBlock<true, true>(const uint8_t*, uint8_t*, uint32_t, const KeysContainer&);
-
-	template
-		void initBlock<true, false>(const uint8_t*, uint8_t*, uint32_t, const KeysContainer&);
-
-	template
-		void initBlock<false, true>(const uint8_t*, uint8_t*, uint32_t, const KeysContainer&);
-
-	template
-		void initBlock<false, false>(const uint8_t*, uint8_t*, uint32_t, const KeysContainer&);
-
-	convertible_t datasetRead(addr_t addr, MemoryRegisters& memory) {
-		convertible_t data;
-		data.u64 = *(uint64_t*)(memory.ds.dataset + memory.ma);
-		memory.ma += 8;
+	void datasetRead(addr_t addr, MemoryRegisters& memory, RegisterFile& reg) {
+		uint64_t* datasetLine = (uint64_t*)(memory.ds.dataset + memory.ma);
 		memory.mx ^= addr;
-		if ((memory.mx & 0xFFF8) == 0) {
-			memory.ma = memory.mx & ~7;
-			PREFETCH(memory);
-		}
-		return data;
+		memory.mx &= -64; //align to cache line
+		std::swap(memory.mx, memory.ma);
+		PREFETCHNTA(memory.ds.dataset + memory.ma);
+		for (int i = 0; i < RegistersCount; ++i)
+			reg.r[i] ^= datasetLine[i];
 	}
 
-	template<bool softAes>
-	void initBlock(const uint8_t* cache, uint8_t* block, uint32_t blockNumber, const KeysContainer& keys) {
-		if (blockNumber % 2 == 1) {
-			initBlock<softAes, true>(cache + blockNumber * CacheBlockSize, block, blockNumber, keys);
-		}
-		else {
-			initBlock<softAes, false>(cache + blockNumber * CacheBlockSize, block, blockNumber, keys);
-		}
-	}
-
-	template
-		void initBlock<true>(const uint8_t*, uint8_t*, uint32_t, const KeysContainer&);
-
-	template
-		void initBlock<false>(const uint8_t*, uint8_t*, uint32_t, const KeysContainer&);
-
-	template<bool softAes>
-	convertible_t datasetReadLight(addr_t addr, MemoryRegisters& memory) {
-		convertible_t data;
-		LightClientDataset* lds = memory.ds.lightDataset;
-		auto blockNumber = memory.ma / DatasetBlockSize;
-		if (lds->blockNumber != blockNumber) {
-			initBlock<softAes>(lds->cache->getCache(), (uint8_t*)lds->block, blockNumber, lds->cache->getKeys());
-			lds->blockNumber = blockNumber;
-		}
-		data.u64 = *(uint64_t*)(lds->block + (memory.ma % DatasetBlockSize));
-		memory.ma += 8;
+	void datasetReadLight(addr_t addr, MemoryRegisters& memory, int_reg_t (&reg)[RegistersCount]) {
 		memory.mx ^= addr;
-		if ((memory.mx & 0xFFF8) == 0) {
-			memory.ma = memory.mx & ~7;
-		}
-		return data;
+		memory.mx &= CacheLineAlignMask; //align to cache line
+		Cache* cache = memory.ds.cache;
+		uint64_t datasetLine[CacheLineSize / sizeof(uint64_t)];
+		initBlock(cache->getCache(), (uint8_t*)datasetLine, memory.ma / CacheLineSize, cache->getKeys());
+		for (int i = 0; i < RegistersCount; ++i)
+			reg[i] ^= datasetLine[i];
+		std::swap(memory.mx, memory.ma);
 	}
 
-	template
-		convertible_t datasetReadLight<false>(addr_t addr, MemoryRegisters& memory);
+	void datasetReadLightAsync(addr_t addr, MemoryRegisters& memory, int_reg_t(&reg)[RegistersCount]) {
+		ILightClientAsyncWorker* aw = memory.ds.asyncWorker;
+		const uint64_t* datasetLine = aw->getBlock(memory.ma);
+		for (int i = 0; i < RegistersCount; ++i)
+			reg[i] ^= datasetLine[i];
+		memory.mx ^= addr;
+		memory.mx &= CacheLineAlignMask; //align to cache line
+		std::swap(memory.mx, memory.ma);
+		aw->prepareBlock(memory.ma);
+	}
 
-	template
-		convertible_t datasetReadLight<true>(addr_t addr, MemoryRegisters& memory);
-
-	void datasetAlloc(dataset_t& ds) {
+	void datasetAlloc(dataset_t& ds, bool largePages) {
 		if (sizeof(size_t) <= 4)
 			throw std::runtime_error("Platform doesn't support enough memory for the dataset");
-		ds.dataset = (uint8_t*)_mm_malloc(DatasetSize, /*sizeof(__m128i)*/ 64);
-		if (ds.dataset == nullptr) {
-			throw std::runtime_error("Dataset memory allocation failed. >4 GiB of free virtual memory is needed.");
+		if (largePages) {
+			ds.dataset = (uint8_t*)allocLargePagesMemory(DatasetSize);
+		}
+		else {
+			ds.dataset = (uint8_t*)_mm_malloc(DatasetSize, 64);
+			if (ds.dataset == nullptr) {
+				throw std::runtime_error("Dataset memory allocation failed. >4 GiB of free virtual memory is needed.");
+			}
 		}
 	}
 
 	template<bool softAes>
 	void datasetInit(Cache* cache, dataset_t ds, uint32_t startBlock, uint32_t blockCount) {
 		for (uint32_t i = startBlock; i < startBlock + blockCount; ++i) {
-			initBlock<softAes>(cache->getCache(), ds.dataset + i * DatasetBlockSize, i, cache->getKeys());
+			initBlock(cache->getCache(), ds.dataset + i * CacheLineSize, i, cache->getKeys());
 		}
 	}
 
@@ -184,14 +131,26 @@ namespace RandomX {
 		void datasetInit<true>(Cache*, dataset_t, uint32_t, uint32_t);
 
 	template<bool softAes>
-	void datasetInitCache(const void* seed, dataset_t& ds) {
-		ds.cache = new Cache();
+	void datasetInitCache(const void* seed, dataset_t& ds, bool largePages) {
+		ds.cache = new(Cache::alloc(largePages)) Cache();
 		ds.cache->initialize<softAes>(seed, SeedSize);
 	}
 
 	template
-		void datasetInitCache<false>(const void*, dataset_t&);
+		void datasetInitCache<false>(const void*, dataset_t&, bool);
 
 	template
-		void datasetInitCache<true>(const void*, dataset_t&);
+		void datasetInitCache<true>(const void*, dataset_t&, bool);
+
+	template<bool softAes>
+	void aesBench(uint32_t blockCount) {
+		alignas(16) KeysContainer keys;
+		alignas(16) uint8_t buffer[CacheLineSize];
+		for (uint32_t block = 0; block < blockCount; ++block) {
+			initBlock(buffer, buffer, 0, keys);
+		}
+	}
+
+	template void aesBench<false>(uint32_t blockCount);
+	template void aesBench<true>(uint32_t blockCount);
 }
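Note on the `mx`/`ma` handling in the dataset.cpp hunks above: the sketch below isolates the address-register update that `datasetRead`, `datasetReadLight` and `datasetReadLightAsync` share. It is an illustration under assumptions, not code from the patch: `AddressRegisters` is a hypothetical stand-in for `MemoryRegisters` without the `ds` member, and `advance` is a hypothetical name. The apparent intent is that `ma` always holds the address being read now, while `mx` accumulates the next address, so the next line can be prefetched one step ahead.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>

using addr_t = uint32_t;

constexpr uint32_t CacheLineSize = 64;
constexpr uint32_t CacheLineAlignMask = 0xFFFFFFFF & ~(CacheLineSize - 1);

// Hypothetical stand-in for RandomX::MemoryRegisters (dataset pointer omitted).
struct AddressRegisters {
	addr_t mx, ma;
};

// Returns the address to read in this step and prepares the next one.
addr_t advance(AddressRegisters& m, addr_t addr) {
	const addr_t readAddress = m.ma;  // the line consumed in this step
	m.mx ^= addr;                     // mix the program-supplied value into mx
	m.mx &= CacheLineAlignMask;       // align to a 64-byte cache line
	std::swap(m.mx, m.ma);            // ma now holds the next address to prefetch
	return readAddress;
}
```

After `advance` returns, `m.ma` is the address a caller would pass to a prefetch, and the old `ma` survives in `mx` to be mixed into the address after that.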
diff --git a/src/dataset.hpp b/src/dataset.hpp
index bb29197..77a477d 100644
--- a/src/dataset.hpp
+++ b/src/dataset.hpp
@@ -23,7 +23,6 @@ along with RandomX.  If not, see<http://www.gnu.org/licenses/>.
 #include <array>
 #include "intrinPortable.h"
 #include "common.hpp"
-#include "softAes.h"
 
 namespace RandomX {
 
@@ -32,20 +31,23 @@ namespace RandomX {
 	template<bool soft, bool enc>
 	void initBlock(const uint8_t* in, uint8_t* out, uint32_t blockNumber, const KeysContainer& keys);
 
-	template<bool softAes>
 	void initBlock(const uint8_t* cache, uint8_t* block, uint32_t blockNumber, const KeysContainer& keys);
 
-	void datasetAlloc(dataset_t& ds);
+	void datasetAlloc(dataset_t& ds, bool largePages);
 
 	template<bool softAes>
 	void datasetInit(Cache* cache, dataset_t ds, uint32_t startBlock, uint32_t blockCount);
 
-	convertible_t datasetRead(addr_t addr, MemoryRegisters& memory);
+	void datasetRead(addr_t addr, MemoryRegisters& memory, RegisterFile&);
 
 	template<bool softAes>
-	void datasetInitCache(const void* seed, dataset_t& dataset);
+	void datasetInitCache(const void* seed, dataset_t& dataset, bool largePages);
+
+	void datasetReadLight(addr_t addr, MemoryRegisters& memory, int_reg_t(&reg)[RegistersCount]);
+
+	void datasetReadLightAsync(addr_t addr, MemoryRegisters& memory, int_reg_t(&reg)[RegistersCount]);
 
 	template<bool softAes>
-	convertible_t datasetReadLight(addr_t addr, MemoryRegisters& memory);
+	void aesBench(uint32_t blockCount);
 }
 
diff --git a/src/divideByConstantCodegen.c b/src/divideByConstantCodegen.c
new file mode 100644
index 0000000..255baf4
--- /dev/null
+++ b/src/divideByConstantCodegen.c
@@ -0,0 +1,169 @@
+/*
+  Reference implementations of computing and using the "magic number" approach to dividing
+  by constants, including codegen instructions. The unsigned division incorporates the
+  "round down" optimization per ridiculous_fish.
+
+  This is free and unencumbered software. Any copyright is dedicated to the Public Domain.
+*/
+
+#include <limits.h> //for CHAR_BIT
+#include <assert.h>
+
+#include "divideByConstantCodegen.h"
+
+struct magicu_info compute_unsigned_magic_info(unsigned_type D, unsigned num_bits) {
+
+	// The numerator must fit in an unsigned_type
+	assert(num_bits > 0 && num_bits <= sizeof(unsigned_type) * CHAR_BIT);
+
+	// D must be larger than zero and not a power of 2
+	assert(D & (D - 1));
+
+	// The eventual result
+	struct magicu_info result;
+
+	// Bits in an unsigned_type
+	const unsigned UINT_BITS = sizeof(unsigned_type) * CHAR_BIT;
+
+	// The extra shift implicit in the difference between UINT_BITS and num_bits
+	const unsigned extra_shift = UINT_BITS - num_bits;
+
+	// The initial power of 2 is one less than the first one that can possibly work
+	const unsigned_type initial_power_of_2 = (unsigned_type)1 << (UINT_BITS - 1);
+
+	// The remainder and quotient of our power of 2 divided by D
+	unsigned_type quotient = initial_power_of_2 / D, remainder = initial_power_of_2 % D;
+
+	// ceil(log_2 D)
+	unsigned ceil_log_2_D;
+
+	// The magic info for the variant "round down" algorithm
+	unsigned_type down_multiplier = 0;
+	unsigned down_exponent = 0;
+	int has_magic_down = 0;
+
+	// Compute ceil(log_2 D)
+	ceil_log_2_D = 0;
+	unsigned_type tmp;
+	for (tmp = D; tmp > 0; tmp >>= 1)
+		ceil_log_2_D += 1;
+
+
+	// Begin a loop that increments the exponent, until we find a power of 2 that works.
+	unsigned exponent;
+	for (exponent = 0; ; exponent++) {
+		// Quotient and remainder is from previous exponent; compute it for this exponent.
+		if (remainder >= D - remainder) {
+			// Doubling remainder will wrap around D
+			quotient = quotient * 2 + 1;
+			remainder = remainder * 2 - D;
+		}
+		else {
+			// Remainder will not wrap
+			quotient = quotient * 2;
+			remainder = remainder * 2;
+		}
+
+		// We're done if this exponent works for the round_up algorithm.
+		// Note that exponent may be larger than the maximum shift supported,
+		// so the check for >= ceil_log_2_D is critical.
+		if ((exponent + extra_shift >= ceil_log_2_D) || (D - remainder) <= ((unsigned_type)1 << (exponent + extra_shift)))
+			break;
+
+		// Set magic_down if we have not set it yet and this exponent works for the round_down algorithm
+		if (!has_magic_down && remainder <= ((unsigned_type)1 << (exponent + extra_shift))) {
+			has_magic_down = 1;
+			down_multiplier = quotient;
+			down_exponent = exponent;
+		}
+	}
+
+	if (exponent < ceil_log_2_D) {
+		// magic_up is efficient
+		result.multiplier = quotient + 1;
+		result.pre_shift = 0;
+		result.post_shift = exponent;
+		result.increment = 0;
+	}
+	else if (D & 1) {
+		// Odd divisor, so use magic_down, which must have been set
+		assert(has_magic_down);
+		result.multiplier = down_multiplier;
+		result.pre_shift = 0;
+		result.post_shift = down_exponent;
+		result.increment = 1;
+	}
+	else {
+		// Even divisor, so use a prefix-shifted dividend
+		unsigned pre_shift = 0;
+		unsigned_type shifted_D = D;
+		while ((shifted_D & 1) == 0) {
+			shifted_D >>= 1;
+			pre_shift += 1;
+		}
+		result = compute_unsigned_magic_info(shifted_D, num_bits - pre_shift);
+		assert(result.increment == 0 && result.pre_shift == 0); //expect no increment or pre_shift in this path
+		result.pre_shift = pre_shift;
+	}
+	return result;
+}
+
+struct magics_info compute_signed_magic_info(signed_type D) {
+	// D must not be zero and must not be a power of 2 (or its negative)
+	assert(D != 0 && (D & -D) != D && (D & -D) != -D);
+
+	// Our result
+	struct magics_info result;
+
+	// Bits in a signed_type
+	const unsigned SINT_BITS = sizeof(signed_type) * CHAR_BIT;
+
+	// Absolute value of D (we know D is not the most negative value since that's a power of 2)
+	const unsigned_type abs_d = (D < 0 ? -D : D);
+
+	// The initial power of 2 is one less than the first one that can possibly work
+	// "two31" in Warren
+	unsigned exponent = SINT_BITS - 1;
+	const unsigned_type initial_power_of_2 = (unsigned_type)1 << exponent;
+
+	// Compute the absolute value of our "test numerator,"
+	// which is the largest dividend whose remainder with d is d-1.
+	// This is called anc in Warren.
+	const unsigned_type tmp = initial_power_of_2 + (D < 0);
+	const unsigned_type abs_test_numer = tmp - 1 - tmp % abs_d;
+
+	// Initialize our quotients and remainders (q1, r1, q2, r2 in Warren)
+	unsigned_type quotient1 = initial_power_of_2 / abs_test_numer, remainder1 = initial_power_of_2 % abs_test_numer;
+	unsigned_type quotient2 = initial_power_of_2 / abs_d, remainder2 = initial_power_of_2 % abs_d;
+	unsigned_type delta;
+
+	// Begin our loop
+	do {
+		// Update the exponent
+		exponent++;
+
+		// Update quotient1 and remainder1
+		quotient1 *= 2;
+		remainder1 *= 2;
+		if (remainder1 >= abs_test_numer) {
+			quotient1 += 1;
+			remainder1 -= abs_test_numer;
+		}
+
+		// Update quotient2 and remainder2
+		quotient2 *= 2;
+		remainder2 *= 2;
+		if (remainder2 >= abs_d) {
+			quotient2 += 1;
+			remainder2 -= abs_d;
+		}
+
+		// Keep going as long as (2**exponent) / abs_d <= delta
+		delta = abs_d - remainder2;
+	} while (quotient1 < delta || (quotient1 == delta && remainder1 == 0));
+
+	result.multiplier = quotient2 + 1;
+	if (D < 0) result.multiplier = -result.multiplier;
+	result.shift = exponent - SINT_BITS;
+	return result;
+}
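Note on the "magic number" division in divideByConstantCodegen.c above: the sketch below shows the basic multiply-and-shift idea on 16-bit dividends, where the whole product fits in a `uint64_t` and can be checked exhaustively. It is a simplified "round up" variant, not the routine from the patch: it has no `pre_shift` or `increment` path, because the multiplier is allowed to be wider than the dividend. `Magic16`, `ceilLog2`, `computeMagic16`, `divideByMagic` and `verifyDivisor` are hypothetical names introduced for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical result type: n / d == (n * multiplier) >> shift for 16-bit n.
struct Magic16 {
	uint64_t multiplier;
	unsigned shift;
};

// ceil(log2(d)) for d >= 1
unsigned ceilLog2(uint32_t d) {
	unsigned bits = 0;
	while ((uint64_t(1) << bits) < d)
		++bits;
	return bits;
}

// multiplier = ceil(2^(16 + L) / d) with L = ceil(log2(d)).
// This satisfies 2^(16+L) <= multiplier * d <= 2^(16+L) + 2^L, which is
// sufficient for the shifted product to equal floor(n / d) for all n < 2^16.
Magic16 computeMagic16(uint16_t d) {
	assert(d != 0);
	const unsigned shift = 16 + ceilLog2(d);
	const uint64_t power = uint64_t(1) << shift;
	return { (power + d - 1) / d, shift };
}

uint16_t divideByMagic(uint16_t n, Magic16 m) {
	return uint16_t((uint64_t(n) * m.multiplier) >> m.shift);
}

// Exhaustively compares against the built-in division for one divisor.
bool verifyDivisor(uint16_t d) {
	const Magic16 m = computeMagic16(d);
	for (uint32_t n = 0; n <= 0xFFFF; ++n) {
		if (divideByMagic(uint16_t(n), m) != n / d)
			return false;
	}
	return true;
}
```

With a full-width dividend the multiplier computed this way can need one bit more than the register holds, which is the case the patch's `increment` ("round down") and `pre_shift` paths exist to handle.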
diff --git a/src/divideByConstantCodegen.h b/src/divideByConstantCodegen.h
new file mode 100644
index 0000000..800647c
--- /dev/null
+++ b/src/divideByConstantCodegen.h
@@ -0,0 +1,117 @@
+/*
+Copyright (c) 2018 tevador
+
+This file is part of RandomX.
+
+RandomX is free software: you can redistribute it and/or modify
+it under the terms of the GNU General Public License as published by
+the Free Software Foundation, either version 3 of the License, or
+(at your option) any later version.
+
+RandomX is distributed in the hope that it will be useful,
+but WITHOUT ANY WARRANTY; without even the implied warranty of
+MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+GNU General Public License for more details.
+
+You should have received a copy of the GNU General Public License
+along with RandomX.  If not, see<http://www.gnu.org/licenses/>.
+*/
+
+#pragma once
+#include <stdint.h>
+
+#if defined(__cplusplus)
+extern "C" {
+#endif
+
+	typedef uint64_t unsigned_type;
+	typedef int64_t signed_type;
+
+	/* Computes "magic info" for performing signed division by a fixed integer D.
+	   The type 'signed_type' is assumed to be defined as a signed integer type large enough
+	   to hold both the dividend and the divisor.
+	   Here >> is arithmetic (signed) shift, and >>> is logical shift.
+
+	   To emit code for n/d, rounding towards zero, use the following sequence:
+
+		 m = compute_signed_magic_info(D)
+		 emit("result = (m.multiplier * n) >> SINT_BITS");
+		 if d > 0 and m.multiplier < 0: emit("result += n")
+		 if d < 0 and m.multiplier > 0: emit("result -= n")
+		 if m.shift > 0: emit("result >>= m.shift")
+		 emit("result += (result < 0)")
+
+	  The shifts by SINT_BITS may be "free" if the high half of the full multiply
+	  is put in a separate register.
+
+	  The final add can of course be implemented via the sign bit, e.g.
+		  result += (result >>> (SINT_BITS - 1))
+	   or
+		  result -= (result >> (SINT_BITS - 1))
+
+	   This code is heavily indebted to Hacker's Delight by Henry Warren.
+	   See http://www.hackersdelight.org/HDcode/magic.c.txt
+	   Used with permission from http://www.hackersdelight.org/permissions.htm
+	 */
+
+	struct magics_info {
+		signed_type multiplier; // the "magic number" multiplier
+		unsigned shift; // shift for the dividend after multiplying
+	};
+	struct magics_info compute_signed_magic_info(signed_type D);
+
+
+	/* Computes "magic info" for performing unsigned division by a fixed positive integer D.
+	   The type 'unsigned_type' is assumed to be defined as an unsigned integer type large enough
+	   to hold both the dividend and the divisor. num_bits can be set appropriately if n is
+	   known to be smaller than the largest unsigned_type; if this is not known then pass
+	   (sizeof(unsigned_type) * CHAR_BIT) for num_bits.
+
+	   Assume we have a hardware register of width UINT_BITS, a known constant D which is
+	   not zero and not a power of 2, and a variable n of width num_bits (which may be
+	   up to UINT_BITS). To emit code for n/D, use one of the two following sequences
+	   (here >>> refers to a logical bitshift):
+
+		 m = compute_unsigned_magic_info(D, num_bits)
+		 if m.pre_shift > 0: emit("n >>>= m.pre_shift")
+		 if m.increment: emit("n = saturated_increment(n)")
+		 emit("result = (m.multiplier * n) >>> UINT_BITS")
+		 if m.post_shift > 0: emit("result >>>= m.post_shift")
+
+	   or
+
+		 m = compute_unsigned_magic_info(D, num_bits)
+		 if m.pre_shift > 0: emit("n >>>= m.pre_shift")
+		 emit("result = m.multiplier * n")
+		 if m.increment: emit("result = result + m.multiplier")
+		 emit("result >>>= UINT_BITS")
+		 if m.post_shift > 0: emit("result >>>= m.post_shift")
+
+	  The shifts by UINT_BITS may be "free" if the high half of the full multiply
+	  is put in a separate register.
+
+	  saturated_increment(n) means "increment n unless it would wrap to 0," i.e.
+		if n == (1 << UINT_BITS)-1: result = n
+		else: result = n+1
+	  A common way to implement this is with the carry bit. For example, on x86:
+		 add 1
+		 sbb 0
+
+	  Some invariants:
+	   1: At least one of pre_shift and increment is zero
+	   2: multiplier is never zero
+
+	   This code incorporates the "round down" optimization per ridiculous_fish.
+	 */
+
+	struct magicu_info {
+		unsigned_type multiplier; // the "magic number" multiplier
+		unsigned pre_shift; // shift for the dividend before multiplying
+		unsigned post_shift; // shift for the dividend after multiplying
+		int increment; // 0 or 1; if set then increment the numerator, using one of the two strategies
+	};
+	struct magicu_info compute_unsigned_magic_info(unsigned_type D, unsigned num_bits);
+
+#if defined(__cplusplus)
+}
+#endif
\ No newline at end of file
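The first unsigned emit sequence described in the header comment above can be sketched directly in C++. This is a minimal illustration, not the library's implementation: `MagicU`, `mulHigh`, `saturatedIncrement`, `divideByMagic` and `kDivBy3` are hypothetical names, and the constants for D = 3 are the widely known 64-bit reciprocal, not values captured from `compute_unsigned_magic_info`. It assumes a compiler that supports `unsigned __int128`.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in that mirrors the fields of magicu_info.
struct MagicU {
	uint64_t multiplier; // the "magic number" multiplier
	unsigned pre_shift;  // shift applied to n before multiplying
	unsigned post_shift; // shift applied to the high half after multiplying
	int increment;       // 0 or 1
};

// High 64 bits of the full 128-bit product (the "free" shift by UINT_BITS).
static inline uint64_t mulHigh(uint64_t a, uint64_t b) {
	return (uint64_t)(((unsigned __int128)a * b) >> 64);
}

// "Increment n unless it would wrap to 0."
static inline uint64_t saturatedIncrement(uint64_t n) {
	return n + (n != UINT64_MAX);
}

// Executes the first emit sequence instead of emitting code for it.
static inline uint64_t divideByMagic(uint64_t n, const MagicU& m) {
	assert(m.multiplier != 0);                 // invariant 2
	assert(m.pre_shift == 0 || !m.increment);  // invariant 1
	n >>= m.pre_shift;
	if (m.increment) n = saturatedIncrement(n);
	uint64_t result = mulHigh(m.multiplier, n);
	return result >> m.post_shift;
}

// Assumed constants for D = 3: multiplier = (2^65 + 1) / 3, post_shift = 1.
static const MagicU kDivBy3 = { 0xAAAAAAAAAAAAAAABULL, 0, 1, 0 };
```

With these constants `divideByMagic(n, kDivBy3)` matches `n / 3` for every 64-bit `n`, because the rounding error of the reciprocal stays below 1/3.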
diff --git a/src/executeProgram-win64.asm b/src/executeProgram-win64.asm
index 356428c..ac49e50 100644
--- a/src/executeProgram-win64.asm
+++ b/src/executeProgram-win64.asm
@@ -15,20 +15,22 @@
 ;# You should have received a copy of the GNU General Public License
 ;# along with RandomX.  If not, see<http://www.gnu.org/licenses/>.
 
-PUBLIC executeProgram
+IFDEF RAX
 
-.code
+_RANDOMX_EXECUTE_PROGRAM SEGMENT PAGE READ EXECUTE
+
+PUBLIC executeProgram
 
 executeProgram PROC
 	; REGISTER ALLOCATION:
 	; rax -> temporary
-	; rbx -> MemoryRegisters& memory
+	; rbx -> "ic"
 	; rcx -> temporary
 	; rdx -> temporary
-	; rsi -> convertible_t& scratchpad
-	; rdi -> "ic" (instruction counter)
-	; rbp	-> beginning of VM stack
-	; rsp -> end of VM stack
+	; rsi -> scratchpad pointer
+	; rdi -> dataset pointer
+	; rbp -> "ma", "mx"
+	; rsp -> stack pointer
 	; r8 	-> "r0"
 	; r9 	-> "r1"
 	; r10 -> "r2"
@@ -37,31 +39,22 @@ executeProgram PROC
 	; r13 -> "r5"
 	; r14 -> "r6"
 	; r15 -> "r7"
-	; xmm0 -> temporary
-	; xmm1 -> temporary
+	; xmm0 -> "f0"
+	; xmm1 -> "f1"
 	; xmm2 -> "f2"
 	; xmm3 -> "f3"
-	; xmm4 -> "f4"
-	; xmm5 -> "f5"
-	; xmm6 -> "f6"
-	; xmm7 -> "f7"
-	; xmm8 -> "f0"
-	; xmm9 -> "f1"
-	; xmm10 -> absolute value mask
-
-	; STACK STRUCTURE:
-	;   |
-	;   |
-	;   | saved registers
-	;   |
-	;   v
-	; [rbp] RegisterFile& registerFile
-	;   |
-	;   |
-	;   | VM stack
-	;   |
-	;   v
-	; [rsp] last element of VM stack
+	; xmm4 -> "e0"
+	; xmm5 -> "e1"
+	; xmm6 -> "e2"
+	; xmm7 -> "e3"
+	; xmm8 -> "a0"
+	; xmm9 -> "a1"
+	; xmm10 -> "a2"
+	; xmm11 -> "a3"
+	; xmm12 -> temporary
+	; xmm13 -> DBL_MIN
+	; xmm14 -> absolute value mask
+	; xmm15 -> sign mask
 
 	; store callee-saved registers
 	push rbx
@@ -78,95 +71,131 @@ executeProgram PROC
 	movdqu xmmword ptr [rsp+32], xmm8
 	movdqu xmmword ptr [rsp+16], xmm9
 	movdqu xmmword ptr [rsp+0], xmm10
+	sub rsp, 80
+	movdqu xmmword ptr [rsp+64], xmm11
+	movdqu xmmword ptr [rsp+48], xmm12
+	movdqu xmmword ptr [rsp+32], xmm13
+	movdqu xmmword ptr [rsp+16], xmm14
+	movdqu xmmword ptr [rsp+0], xmm15
 
 	; function arguments
-	push rcx				; RegisterFile& registerFile
-	mov rbx, rdx		; MemoryRegisters& memory
-	mov rsi, r8			; convertible_t& scratchpad
-	push r9
+	push rcx                    ; RegisterFile& registerFile
+	mov rbp, qword ptr [rdx]    ; "mx", "ma"
+	mov eax, ebp                ; "mx"
+	mov rdi, qword ptr [rdx+8]  ; uint8_t* dataset
+	mov rsi, r8                 ; convertible_t* scratchpad
+	mov rbx, r9                 ; loop counter
+	
+	;# zero integer registers
+	xor r8, r8
+	xor r9, r9
+	xor r10, r10
+	xor r11, r11
+	xor r12, r12
+	xor r13, r13
+	xor r14, r14
+	xor r15, r15
+	
+	;# load constant registers
+	lea rcx, [rcx+120]
+	movapd xmm8, xmmword ptr [rcx+72]
+	movapd xmm9, xmmword ptr [rcx+88]
+	movapd xmm10, xmmword ptr [rcx+104]
+	movapd xmm11, xmmword ptr [rcx+120]
+	movapd xmm13, xmmword ptr [minDbl]
+	movapd xmm14, xmmword ptr [absMask]
+	movapd xmm15, xmmword ptr [signMask]
 
-	mov rbp, rsp			; beginning of VM stack
-	mov rdi, 1048577	; number of VM instructions to execute + 1
+	jmp program_begin
 
-	xorps xmm10, xmm10
-	cmpeqpd xmm10, xmm10
-	psrlq xmm10, 1		; mask for absolute value = 0x7fffffffffffffff7fffffffffffffff
+ALIGN 64
+minDbl:
+	db 0, 0, 0, 0, 0, 0, 16, 0, 0, 0, 0, 0, 0, 0, 16, 0
+absMask:
+	db 255, 255, 255, 255, 255, 255, 255, 127, 255, 255, 255, 255, 255, 255, 255, 127
+signMask:
+	db 0, 0, 0, 0, 0, 0, 0, 128, 0, 0, 0, 0, 0, 0, 0, 128
 
-	; reset rounding mode
-	mov dword ptr [rsp-8], 40896
-	ldmxcsr dword ptr [rsp-8]
-
-	; load integer registers
-	mov r8, qword ptr [rcx+0]
-	mov r9, qword ptr [rcx+8]
-	mov r10, qword ptr [rcx+16]
-	mov r11, qword ptr [rcx+24]
-	mov r12, qword ptr [rcx+32]
-	mov r13, qword ptr [rcx+40]
-	mov r14, qword ptr [rcx+48]
-	mov r15, qword ptr [rcx+56]
-
-	; load register f0 hi, lo
-	xorps xmm8, xmm8
-	cvtsi2sd xmm8, qword ptr [rcx+72]
-	pslldq xmm8, 8
-	cvtsi2sd xmm8, qword ptr [rcx+64]
-
-	; load register f1 hi, lo
-	xorps xmm9, xmm9
-	cvtsi2sd xmm9, qword ptr [rcx+88]
-	pslldq xmm9, 8
-	cvtsi2sd xmm9, qword ptr [rcx+80]
-
-	; load register f2 hi, lo
-	xorps xmm2, xmm2
-	cvtsi2sd xmm2, qword ptr [rcx+104]
-	pslldq xmm2, 8
-	cvtsi2sd xmm2, qword ptr [rcx+96]
-
-	; load register f3 hi, lo
-	xorps xmm3, xmm3
-	cvtsi2sd xmm3, qword ptr [rcx+120]
-	pslldq xmm3, 8
-	cvtsi2sd xmm3, qword ptr [rcx+112]
-
-	lea rcx, [rcx+64]
-
-	; load register f4 hi, lo
-	xorps xmm4, xmm4
-	cvtsi2sd xmm4, qword ptr [rcx+72]
-	pslldq xmm4, 8
-	cvtsi2sd xmm4, qword ptr [rcx+64]
-
-	; load register f5 hi, lo
-	xorps xmm5, xmm5
-	cvtsi2sd xmm5, qword ptr [rcx+88]
-	pslldq xmm5, 8
-	cvtsi2sd xmm5, qword ptr [rcx+80]
-
-	; load register f6 hi, lo
-	xorps xmm6, xmm6
-	cvtsi2sd xmm6, qword ptr [rcx+104]
-	pslldq xmm6, 8
-	cvtsi2sd xmm6, qword ptr [rcx+96]
-
-	; load register f7 hi, lo
-	xorps xmm7, xmm7
-	cvtsi2sd xmm7, qword ptr [rcx+120]
-	pslldq xmm7, 8
-	cvtsi2sd xmm7, qword ptr [rcx+112]
-
-	; program body
+ALIGN 64
+program_begin:
+	xor rax, r8                      ;# read address register 1
+	xor rax, r9                      ;# read address register 2
+	mov rdx, rax
+	and eax, 1048512
+	push rax
+	lea rcx, [rsi+rax]
+	xor r8,  qword ptr [rcx+0]
+	xor r9,  qword ptr [rcx+8]
+	xor r10, qword ptr [rcx+16]
+	xor r11, qword ptr [rcx+24]
+	xor r12, qword ptr [rcx+32]
+	xor r13, qword ptr [rcx+40]
+	xor r14, qword ptr [rcx+48]
+	xor r15, qword ptr [rcx+56]
+	ror rdx, 32
+	and edx, 1048512
+	push rdx
+	lea rcx, [rsi+rdx]
+	cvtdq2pd xmm0, qword ptr [rcx+0]
+	cvtdq2pd xmm1, qword ptr [rcx+8]
+	cvtdq2pd xmm2, qword ptr [rcx+16]
+	cvtdq2pd xmm3, qword ptr [rcx+24]
+	cvtdq2pd xmm4, qword ptr [rcx+32]
+	cvtdq2pd xmm5, qword ptr [rcx+40]
+	cvtdq2pd xmm6, qword ptr [rcx+48]
+	cvtdq2pd xmm7, qword ptr [rcx+56]
+	andps xmm4, xmm14
+	andps xmm5, xmm14
+	andps xmm6, xmm14
+	andps xmm7, xmm14
 
+	;# 256 instructions
 	include program.inc
 
+	mov eax, r8d                       ;# read address register 1
+	xor eax, r9d                       ;# read address register 2
+	xor rbp, rax                       ;# modify "mx"
+	and rbp, -64                       ;# align "mx" to the start of a cache line
+	mov edx, ebp                       ;# edx = mx
+	prefetchnta byte ptr [rdi+rdx]
+	ror rbp, 32                        ;# swap "ma" and "mx"
+	mov edx, ebp                       ;# edx = ma
+	lea rcx, [rdi+rdx]                 ;# dataset cache line
+	xor r8,  qword ptr [rcx+0]
+	xor r9,  qword ptr [rcx+8]
+	xor r10, qword ptr [rcx+16]
+	xor r11, qword ptr [rcx+24]
+	xor r12, qword ptr [rcx+32]
+	xor r13, qword ptr [rcx+40]
+	xor r14, qword ptr [rcx+48]
+	xor r15, qword ptr [rcx+56]
+	pop rax
+	lea rcx, [rsi+rax]
+	mov qword ptr [rcx+0], r8
+	mov qword ptr [rcx+8], r9
+	mov qword ptr [rcx+16], r10
+	mov qword ptr [rcx+24], r11
+	mov qword ptr [rcx+32], r12
+	mov qword ptr [rcx+40], r13
+	mov qword ptr [rcx+48], r14
+	mov qword ptr [rcx+56], r15
+	pop rax
+	lea rcx, [rsi+rax]
+	mulpd xmm0, xmm4
+	mulpd xmm1, xmm5
+	mulpd xmm2, xmm6
+	mulpd xmm3, xmm7
+	movapd xmmword ptr [rcx+0], xmm0
+	movapd xmmword ptr [rcx+16], xmm1
+	movapd xmmword ptr [rcx+32], xmm2
+	movapd xmmword ptr [rcx+48], xmm3
+	xor eax, eax
+	dec ebx
+	jnz program_begin
+	
 rx_finish:
-	; unroll the stack
-	mov rsp, rbp
-
 	; save VM register values
 	pop rcx
-	pop rcx
 	mov qword ptr [rcx+0], r8
 	mov qword ptr [rcx+8], r9
 	mov qword ptr [rcx+16], r10
@@ -175,8 +204,8 @@ rx_finish:
 	mov qword ptr [rcx+40], r13
 	mov qword ptr [rcx+48], r14
 	mov qword ptr [rcx+56], r15
-	movdqa xmmword ptr [rcx+64], xmm8
-	movdqa xmmword ptr [rcx+80], xmm9
+	movdqa xmmword ptr [rcx+64], xmm0
+	movdqa xmmword ptr [rcx+80], xmm1
 	movdqa xmmword ptr [rcx+96], xmm2
 	movdqa xmmword ptr [rcx+112], xmm3
 	lea rcx, [rcx+64]
@@ -186,6 +215,12 @@ rx_finish:
 	movdqa xmmword ptr [rcx+112], xmm7
 
 	; load callee-saved registers
+	movdqu xmm15, xmmword ptr [rsp]
+	movdqu xmm14, xmmword ptr [rsp+16]
+	movdqu xmm13, xmmword ptr [rsp+32]
+	movdqu xmm12, xmmword ptr [rsp+48]
+	movdqu xmm11, xmmword ptr [rsp+64]
+	add rsp, 80
 	movdqu xmm10, xmmword ptr [rsp]
 	movdqu xmm9, xmmword ptr [rsp+16]
 	movdqu xmm8, xmmword ptr [rsp+32]
@@ -202,57 +237,50 @@ rx_finish:
 	pop rbx
 
 	; return
-	ret	0
+	ret
+	
+TransformAddress MACRO reg32, reg64
+;# Transforms the address in the register so that the transformed address
+;# lies in a different cache line than the original address (mod 2^N).
+;# This is done to prevent a load-store dependency.
+;# There are 3 different transformations that can be used: x -> 9*x+C, x -> x+C, x -> x^C
+	;lea reg32, [reg64+reg64*8+127]  ;# C = -119 -110 -101 -92 -83 -74 -65 -55 -46 -37 -28 -19 -10 -1 9 18 27 36 45 54 63 73 82 91 100 109 118 127
+	db 64
+	add reg32, -39                   ;# C = all except -7 to +7
+	;xor reg32, -8                   ;# C = all except 0 to 7
+ENDM
 
-rx_read_dataset:
-	push r8
-	push r9
-	push r10
-	push r11
-	mov rdx, rbx
-	movd qword ptr [rsp - 8], xmm1
-	movd qword ptr [rsp - 16], xmm2
-	sub rsp, 48
-	call qword ptr [rbp]
-	add rsp, 48
-	movd xmm2, qword ptr [rsp - 16]
-	movd xmm1, qword ptr [rsp - 8]
-	pop r11
-	pop r10
-	pop r9
-	pop r8
-	ret 0
-
-rx_read_dataset_r:
-	mov edx, dword ptr [rbx]	; ma
-	mov rax, qword ptr [rbx+8]	; dataset
-	mov rax, qword ptr [rax+rdx]
-	add dword ptr [rbx], 8
-	xor ecx, dword ptr [rbx+4]	; mx
-	mov dword ptr [rbx+4], ecx
-	test ecx, 0FFF8h
-	jne short rx_read_dataset_r_ret
-	and ecx, -8
-	mov dword ptr [rbx], ecx
-	mov rdx, qword ptr [rbx+8]
-	prefetcht0 byte ptr [rdx+rcx]
-rx_read_dataset_r_ret:
-	ret 0
-
-rx_read_dataset_f:
-	mov edx, dword ptr [rbx]	; ma
-	mov rax, qword ptr [rbx+8]	; dataset
-	cvtdq2pd xmm0, qword ptr [rax+rdx]
-	add dword ptr [rbx], 8
-	xor ecx, dword ptr [rbx+4]	; mx
-	mov dword ptr [rbx+4], ecx
-	test ecx, 0FFF8h
-	jne short rx_read_dataset_f_ret
-	and ecx, -8
-	mov dword ptr [rbx], ecx
-	prefetcht0 byte ptr [rax+rcx]
-rx_read_dataset_f_ret:
-	ret 0
+ALIGN 64
+rx_read:
+;# IN     eax = random 32-bit address
+;# GLOBAL rdi = address of the dataset address
+;# GLOBAL rsi = address of the scratchpad
+;# GLOBAL rbp = low 32 bits = "mx", high 32 bits = "ma"
+;# MODIFY rcx, rdx
+	TransformAddress eax, rax       ;# TransformAddress function
+	mov rcx, qword ptr [rdi]        ;# load the dataset address
+	xor rbp, rax                    ;# modify "mx"
+	;# prefetch cacheline "mx"
+	and rbp, -64                    ;# align "mx" to the start of a cache line
+	mov edx, ebp                    ;# edx = mx
+	prefetchnta byte ptr [rcx+rdx]
+	;# read cacheline "ma"
+	ror rbp, 32                     ;# swap "ma" and "mx"
+	mov edx, ebp                    ;# edx = ma
+	lea rcx, [rcx+rdx]              ;# dataset cache line
+	xor r8,  qword ptr [rcx+0]
+	xor r9,  qword ptr [rcx+8]
+	xor r10, qword ptr [rcx+16]
+	xor r11, qword ptr [rcx+24]
+	xor r12, qword ptr [rcx+32]
+	xor r13, qword ptr [rcx+40]
+	xor r14, qword ptr [rcx+48]
+	xor r15, qword ptr [rcx+56]
+	ret
 executeProgram ENDP
 
+_RANDOMX_EXECUTE_PROGRAM ENDS
+
+ENDIF
+
 END
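The address arithmetic in the new `program_begin` loop can be restated in C++ as a reading aid. This is a sketch under assumptions, not code from the repository: `scratchpadAddresses`, `advanceDataset`, `seed` and the struct names are hypothetical, and `seed` stands for whatever `rax` holds on loop entry ("mx" on the first pass, zero afterwards).

```cpp
#include <cassert>
#include <cstdint>

// Mask from the assembly: 1048512 == 2^20 - 64, a 64-byte aligned offset below 1 MiB.
constexpr uint32_t kScratchpadMask = 1048512;
static_assert(kScratchpadMask == (1u << 20) - 64, "mask keeps bits 6..19");

struct SpAddresses { uint32_t first; uint32_t second; };

// Mirrors: xor rax, r8 / xor rax, r9 / and eax, mask / ror rdx, 32 / and edx, mask.
// 'first' selects the line XORed into r0-r7, 'second' the line converted into xmm0-xmm7.
inline SpAddresses scratchpadAddresses(uint64_t seed, uint64_t r0, uint64_t r1) {
	uint64_t mix = seed ^ r0 ^ r1;
	return { (uint32_t)mix & kScratchpadMask,
	         (uint32_t)(mix >> 32) & kScratchpadMask };
}

struct DatasetStep {
	uint32_t prefetchOffset; // cache line that gets prefetched ("mx")
	uint32_t readOffset;     // cache line that gets read now ("ma")
	uint64_t packed;         // new rbp value after the swap
};

// 'packed' models rbp: low 32 bits = "mx", high 32 bits = "ma".
inline DatasetStep advanceDataset(uint64_t packed, uint32_t r0lo, uint32_t r1lo) {
	packed ^= (uint64_t)(r0lo ^ r1lo);            // xor rbp, rax (modify "mx")
	packed &= ~UINT64_C(63);                      // and rbp, -64
	uint32_t prefetchOffset = (uint32_t)packed;   // mov edx, ebp (edx = mx)
	packed = (packed >> 32) | (packed << 32);     // ror rbp, 32 (swap "ma" and "mx")
	uint32_t readOffset = (uint32_t)packed;       // mov edx, ebp (edx = ma)
	return { prefetchOffset, readOffset, packed };
}
```

The swap means the line prefetched in one iteration becomes the line read in the next one, which gives the prefetch a full loop iteration to complete.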
diff --git a/src/hashAes1Rx4.cpp b/src/hashAes1Rx4.cpp
new file mode 100644
index 0000000..db1c6a2
--- /dev/null
+++ b/src/hashAes1Rx4.cpp
@@ -0,0 +1,136 @@
+/*
+Copyright (c) 2019 tevador
+
+This file is part of RandomX.
+
+RandomX is free software: you can redistribute it and/or modify
+it under the terms of the GNU General Public License as published by
+the Free Software Foundation, either version 3 of the License, or
+(at your option) any later version.
+
+RandomX is distributed in the hope that it will be useful,
+but WITHOUT ANY WARRANTY; without even the implied warranty of
+MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+GNU General Public License for more details.
+
+You should have received a copy of the GNU General Public License
+along with RandomX.  If not, see<http://www.gnu.org/licenses/>.
+*/
+
+#include "softAes.h"
+
+/*
+	Calculate a 512-bit hash of 'input' using 4 lanes of AES.
+	The input is treated as a set of round keys for the encryption
+	of the initial state.
+
+	'inputSize' must be a multiple of 64.
+
+	For a 2 MiB input, this has the same security as 32768-round
+	AES encryption.
+
+	Hashing throughput: >20 GiB/s per CPU core with hardware AES
+*/
+template<bool softAes>
+void hashAes1Rx4(const void *input, size_t inputSize, void *hash) {
+	const uint8_t* inptr = (uint8_t*)input;
+	const uint8_t* inputEnd = inptr + inputSize;
+
+	__m128i state0, state1, state2, state3;
+	__m128i in0, in1, in2, in3;
+
+	//initial state
+	state0 = _mm_set_epi32(0x9d04b0ae, 0x59943385, 0x30ac8d93, 0x3fe49f5d);
+	state1 = _mm_set_epi32(0x8a39ebf1, 0xddc10935, 0xa724ecd3, 0x7b0c6064);
+	state2 = _mm_set_epi32(0x7ec70420, 0xdf01edda, 0x7c12ecf7, 0xfb5382e3);
+	state3 = _mm_set_epi32(0x94a9d201, 0x5082d1c8, 0xb2e74109, 0x7728b705);
+
+	//process 64 bytes at a time in 4 lanes
+	while (inptr < inputEnd) {
+		in0 = _mm_load_si128((__m128i*)inptr + 0);
+		in1 = _mm_load_si128((__m128i*)inptr + 1);
+		in2 = _mm_load_si128((__m128i*)inptr + 2);
+		in3 = _mm_load_si128((__m128i*)inptr + 3);
+
+		state0 = aesenc<softAes>(state0, in0);
+		state1 = aesdec<softAes>(state1, in1);
+		state2 = aesenc<softAes>(state2, in2);
+		state3 = aesdec<softAes>(state3, in3);
+
+		inptr += 64;
+	}
+
+	//two extra rounds to achieve full diffusion
+	__m128i xkey0 = _mm_set_epi32(0x4ff637c5, 0x053bd705, 0x8231a744, 0xc3767b17);
+	__m128i xkey1 = _mm_set_epi32(0x6594a1a6, 0xa8879d58, 0xb01da200, 0x8a8fae2e);
+
+	state0 = aesenc<softAes>(state0, xkey0);
+	state1 = aesdec<softAes>(state1, xkey0);
+	state2 = aesenc<softAes>(state2, xkey0);
+	state3 = aesdec<softAes>(state3, xkey0);
+
+	state0 = aesenc<softAes>(state0, xkey1);
+	state1 = aesdec<softAes>(state1, xkey1);
+	state2 = aesenc<softAes>(state2, xkey1);
+	state3 = aesdec<softAes>(state3, xkey1);
+
+	//output hash
+	_mm_store_si128((__m128i*)hash + 0, state0);
+	_mm_store_si128((__m128i*)hash + 1, state1);
+	_mm_store_si128((__m128i*)hash + 2, state2);
+	_mm_store_si128((__m128i*)hash + 3, state3);
+}
+
+template void hashAes1Rx4<false>(const void *input, size_t inputSize, void *hash);
+template void hashAes1Rx4<true>(const void *input, size_t inputSize, void *hash);
+
+/*
+	Fill 'buffer' with pseudorandom data based on 512-bit 'state'.
+	The state is encrypted using a single AES round per 16 bytes of output
+	in 4 lanes.
+
+	'outputSize' must be a multiple of 64.
+
+	The modified state is written back to 'state' to allow multiple
+	calls to this function.
+*/
+template<bool softAes>
+void fillAes1Rx4(void *state, size_t outputSize, void *buffer) {
+	uint8_t* outptr = (uint8_t*)buffer;
+	const uint8_t* outputEnd = outptr + outputSize;
+
+	__m128i state0, state1, state2, state3;
+	__m128i key0, key1, key2, key3;
+
+	key0 = _mm_set_epi32(0x9274f206, 0x79498d2f, 0x7d2de6ab, 0x67a04d26);
+	key1 = _mm_set_epi32(0xe1f7af05, 0x2a3a6f1d, 0x86658a15, 0x4f719812);
+	key2 = _mm_set_epi32(0xd1b1f791, 0x9e2ec914, 0x14c77bce, 0xba90750e);
+	key3 = _mm_set_epi32(0x179d0fd9, 0x6e57883c, 0xa53bbe4f, 0xaa07621f);
+
+	state0 = _mm_load_si128((__m128i*)state + 0);
+	state1 = _mm_load_si128((__m128i*)state + 1);
+	state2 = _mm_load_si128((__m128i*)state + 2);
+	state3 = _mm_load_si128((__m128i*)state + 3);
+
+	while (outptr < outputEnd) {
+		state0 = aesdec<softAes>(state0, key0);
+		state1 = aesenc<softAes>(state1, key1);
+		state2 = aesdec<softAes>(state2, key2);
+		state3 = aesenc<softAes>(state3, key3);
+
+		_mm_store_si128((__m128i*)outptr + 0, state0);
+		_mm_store_si128((__m128i*)outptr + 1, state1);
+		_mm_store_si128((__m128i*)outptr + 2, state2);
+		_mm_store_si128((__m128i*)outptr + 3, state3);
+
+		outptr += 64;
+	}
+
+	_mm_store_si128((__m128i*)state + 0, state0);
+	_mm_store_si128((__m128i*)state + 1, state1);
+	_mm_store_si128((__m128i*)state + 2, state2);
+	_mm_store_si128((__m128i*)state + 3, state3);
+}
+
+template void fillAes1Rx4<true>(void *state, size_t outputSize, void *buffer);
+template void fillAes1Rx4<false>(void *state, size_t outputSize, void *buffer);
diff --git a/src/hashAes1Rx4.hpp b/src/hashAes1Rx4.hpp
new file mode 100644
index 0000000..8c0c156
--- /dev/null
+++ b/src/hashAes1Rx4.hpp
@@ -0,0 +1,26 @@
+/*
+Copyright (c) 2019 tevador
+
+This file is part of RandomX.
+
+RandomX is free software: you can redistribute it and/or modify
+it under the terms of the GNU General Public License as published by
+the Free Software Foundation, either version 3 of the License, or
+(at your option) any later version.
+
+RandomX is distributed in the hope that it will be useful,
+but WITHOUT ANY WARRANTY; without even the implied warranty of
+MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+GNU General Public License for more details.
+
+You should have received a copy of the GNU General Public License
+along with RandomX.  If not, see<http://www.gnu.org/licenses/>.
+*/
+
+#include "softAes.h"
+
+template<bool softAes>
+void hashAes1Rx4(const void *input, size_t inputSize, void *hash);
+
+template<bool softAes>
+void fillAes1Rx4(void *state, size_t outputSize, void *buffer);
diff --git a/src/instructionWeights.hpp b/src/instructionWeights.hpp
index bb99ca7..c336b29 100644
--- a/src/instructionWeights.hpp
+++ b/src/instructionWeights.hpp
@@ -19,44 +19,63 @@ along with RandomX.  If not, see<http://www.gnu.org/licenses/>.
 
 #pragma once
 
-#define WT_ADD_64 11
-#define WT_ADD_32 2
-#define WT_SUB_64 11
-#define WT_SUB_32 2
-#define WT_MUL_64 23
-#define WT_MULH_64 10
-#define WT_MUL_32 15
-#define WT_IMUL_32 15
-#define WT_IMULH_64 6
-#define WT_DIV_64 1
-#define WT_IDIV_64 1
-#define WT_AND_64 4
-#define WT_AND_32 2
-#define WT_OR_64 4
-#define WT_OR_32 2
-#define WT_XOR_64 4
-#define WT_XOR_32 2
-#define WT_SHL_64 3
-#define WT_SHR_64 3
-#define WT_SAR_64 3
-#define WT_ROL_64 6
-#define WT_ROR_64 6
-#define WT_FPADD 20
-#define WT_FPSUB 20
-#define WT_FPMUL 22
-#define WT_FPDIV 8
-#define WT_FPSQRT 6
-#define WT_FPROUND 2
-#define WT_CALL 20
-#define WT_RET 22
+//Integer
+#define WT_IADD_R 12
+#define WT_IADD_M 7
+#define WT_IADD_RC 16
+#define WT_ISUB_R 12
+#define WT_ISUB_M 7
+#define WT_IMUL_9C 9
+#define WT_IMUL_R 16
+#define WT_IMUL_M 4
+#define WT_IMULH_R 4
+#define WT_IMULH_M 1
+#define WT_ISMULH_R 4
+#define WT_ISMULH_M 1
+#define WT_IDIV_C 4
+#define WT_ISDIV_C 4
+#define WT_INEG_R 2
+#define WT_IXOR_R 16
+#define WT_IXOR_M 4
+#define WT_IROR_R 10
+#define WT_IROL_R 0
+#define WT_ISWAP_R 4
 
+//Common floating point
+#define WT_FSWAP_R 8
 
-constexpr int wtSum = WT_ADD_64 + WT_ADD_32 + WT_SUB_64 + WT_SUB_32 + \
-WT_MUL_64 + WT_MULH_64 + WT_MUL_32 + WT_IMUL_32 + WT_IMULH_64 + \
-WT_DIV_64 + WT_IDIV_64 + WT_AND_64 + WT_AND_32 + WT_OR_64 + \
-WT_OR_32 + WT_XOR_64 + WT_XOR_32 + WT_SHL_64 + WT_SHR_64 + \
-WT_SAR_64 + WT_ROL_64 + WT_ROR_64 + WT_FPADD + WT_FPSUB + WT_FPMUL \
-+ WT_FPDIV + WT_FPSQRT + WT_FPROUND + WT_CALL + WT_RET;
+//Floating point group F
+#define WT_FADD_R 20
+#define WT_FADD_M 5
+#define WT_FSUB_R 20
+#define WT_FSUB_M 5
+#define WT_FNEG_R 6
+
+//Floating point group E
+#define WT_FMUL_R 20
+#define WT_FMUL_M 0
+#define WT_FDIV_R 0
+#define WT_FDIV_M 4
+#define WT_FSQRT_R 6
+
+//Control
+#define WT_COND_R 7
+#define WT_COND_M 1
+#define WT_CFROUND 1
+
+//Store
+#define WT_ISTORE 16
+#define WT_FSTORE 0
+
+#define WT_NOP 0
+
+constexpr int wtSum = WT_IADD_R + WT_IADD_M + WT_IADD_RC + WT_ISUB_R + \
+WT_ISUB_M + WT_IMUL_9C + WT_IMUL_R + WT_IMUL_M + WT_IMULH_R + \
+WT_IMULH_M + WT_ISMULH_R + WT_ISMULH_M + WT_IDIV_C + WT_ISDIV_C + \
+WT_INEG_R + WT_IXOR_R + WT_IXOR_M + WT_IROR_R + WT_IROL_R + \
+WT_ISWAP_R + WT_FSWAP_R + WT_FADD_R + WT_FADD_M + WT_FSUB_R + WT_FSUB_M + \
+WT_FNEG_R + WT_FMUL_R + WT_FMUL_M + WT_FDIV_R + WT_FDIV_M + \
+WT_FSQRT_R + WT_COND_R + WT_COND_M + WT_CFROUND + WT_ISTORE + WT_FSTORE + WT_NOP;
 
 static_assert(wtSum == 256,
 	"Sum of instruction weights must be 256");
@@ -97,8 +116,46 @@ static_assert(wtSum == 256,
 #define REP33(x) REP32(x) x,
 #define REP40(x) REP32(x) REP8(x)
 #define REP128(x) REP32(x) REP32(x) REP32(x) REP32(x)
+#define REP232(x) REP128(x) REP40(x) REP40(x) REP24(x)
 #define REP256(x) REP128(x) REP128(x)
 #define REPNX(x,N) REP##N(x)
 #define REPN(x,N) REPNX(x,N)
 #define NUM(x) x
 #define WT(x) NUM(WT_##x)
+
+#define REPCASE0(x)
+#define REPCASE1(x) case __COUNTER__:
+#define REPCASE2(x) REPCASE1(x) case __COUNTER__:
+#define REPCASE3(x) REPCASE2(x) case __COUNTER__:
+#define REPCASE4(x) REPCASE3(x) case __COUNTER__:
+#define REPCASE5(x) REPCASE4(x) case __COUNTER__:
+#define REPCASE6(x) REPCASE5(x) case __COUNTER__:
+#define REPCASE7(x) REPCASE6(x) case __COUNTER__:
+#define REPCASE8(x) REPCASE7(x) case __COUNTER__:
+#define REPCASE9(x) REPCASE8(x) case __COUNTER__:
+#define REPCASE10(x) REPCASE9(x) case __COUNTER__:
+#define REPCASE11(x) REPCASE10(x) case __COUNTER__:
+#define REPCASE12(x) REPCASE11(x) case __COUNTER__:
+#define REPCASE13(x) REPCASE12(x) case __COUNTER__:
+#define REPCASE14(x) REPCASE13(x) case __COUNTER__:
+#define REPCASE15(x) REPCASE14(x) case __COUNTER__:
+#define REPCASE16(x) REPCASE15(x) case __COUNTER__:
+#define REPCASE17(x) REPCASE16(x) case __COUNTER__:
+#define REPCASE18(x) REPCASE17(x) case __COUNTER__:
+#define REPCASE19(x) REPCASE18(x) case __COUNTER__:
+#define REPCASE20(x) REPCASE19(x) case __COUNTER__:
+#define REPCASE21(x) REPCASE20(x) case __COUNTER__:
+#define REPCASE22(x) REPCASE21(x) case __COUNTER__:
+#define REPCASE23(x) REPCASE22(x) case __COUNTER__:
+#define REPCASE24(x) REPCASE23(x) case __COUNTER__:
+#define REPCASE25(x) REPCASE24(x) case __COUNTER__:
+#define REPCASE26(x) REPCASE25(x) case __COUNTER__:
+#define REPCASE27(x) REPCASE26(x) case __COUNTER__:
+#define REPCASE28(x) REPCASE27(x) case __COUNTER__:
+#define REPCASE29(x) REPCASE28(x) case __COUNTER__:
+#define REPCASE30(x) REPCASE29(x) case __COUNTER__:
+#define REPCASE31(x) REPCASE30(x) case __COUNTER__:
+#define REPCASE32(x) REPCASE31(x) case __COUNTER__:
+#define REPCASENX(x,N) REPCASE##N(x)
+#define REPCASEN(x,N) REPCASENX(x,N)
+#define CASE_REP(x) REPCASEN(x, WT(x))
\ No newline at end of file
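One plausible reading of the `wtSum == 256` requirement is that each weight is the number of 8-bit opcode values assigned to an instruction, so that a random byte selects an instruction with probability weight/256. The sketch below illustrates that idea with a 256-entry lookup table. Everything in it is hypothetical: the `Op` enum, the three sample weights and `buildOpcodeTable` are made up for the example and do not reproduce the weights or macros of this header.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical, heavily trimmed instruction set.
enum class Op : uint8_t { IADD_R, IXOR_R, FMUL_R };

// Made-up weights; like the real ones, they must add up to 256.
constexpr int kWtIaddR = 128;
constexpr int kWtIxorR = 64;
constexpr int kWtFmulR = 64;
static_assert(kWtIaddR + kWtIxorR + kWtFmulR == 256,
	"Sum of instruction weights must be 256");

struct Weighted { Op op; int weight; };

// Fills consecutive table slots with each instruction, 'weight' times.
inline std::array<Op, 256> buildOpcodeTable() {
	const Weighted weights[] = {
		{ Op::IADD_R, kWtIaddR },
		{ Op::IXOR_R, kWtIxorR },
		{ Op::FMUL_R, kWtFmulR },
	};
	std::array<Op, 256> table{};
	int next = 0;
	for (const Weighted& w : weights)
		for (int i = 0; i < w.weight; ++i)
			table[next++] = w.op;
	assert(next == 256); // every opcode byte is covered exactly once
	return table;
}
```

Decoding is then a single indexed load, for example `Op op = table[opcodeByte];`, and an instruction with weight 0 never appears in the table.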
diff --git a/src/instructions.hpp b/src/instructions.hpp
deleted file mode 100644
index 2321be6..0000000
--- a/src/instructions.hpp
+++ /dev/null
@@ -1,63 +0,0 @@
-/*
-Copyright (c) 2018 tevador
-
-This file is part of RandomX.
-
-RandomX is free software: you can redistribute it and/or modify
-it under the terms of the GNU General Public License as published by
-the Free Software Foundation, either version 3 of the License, or
-(at your option) any later version.
-
-RandomX is distributed in the hope that it will be useful,
-but WITHOUT ANY WARRANTY; without even the implied warranty of
-MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
-GNU General Public License for more details.
-
-You should have received a copy of the GNU General Public License
-along with RandomX.  If not, see<http://www.gnu.org/licenses/>.
-*/
-
-#include <cstdint>
-#include "common.hpp"
-
-namespace RandomX {
-
-	//Clears the 11 least-significant bits before conversion. This is done so the number
-	//fits exactly into the 52-bit mantissa without rounding.
-	inline double convertSigned52(int64_t x) {
-		return (double)(x & -2048L);
-	}
-
-	extern "C" {
-		void ADD_64(convertible_t& a, convertible_t& b, convertible_t& c);
-		void ADD_32(convertible_t& a, convertible_t& b, convertible_t& c);
-		void SUB_64(convertible_t& a, convertible_t& b, convertible_t& c);
-		void SUB_32(convertible_t& a, convertible_t& b, convertible_t& c);
-		void MUL_64(convertible_t& a, convertible_t& b, convertible_t& c);
-		void MULH_64(convertible_t& a, convertible_t& b, convertible_t& c);
-		void MUL_32(convertible_t& a, convertible_t& b, convertible_t& c);
-		void IMUL_32(convertible_t& a, convertible_t& b, convertible_t& c);
-		void IMULH_64(convertible_t& a, convertible_t& b, convertible_t& c);
-		void DIV_64(convertible_t& a, convertible_t& b, convertible_t& c);
-		void IDIV_64(convertible_t& a, convertible_t& b, convertible_t& c);
-		void AND_64(convertible_t& a, convertible_t& b, convertible_t& c);
-		void AND_32(convertible_t& a, convertible_t& b, convertible_t& c);
-		void OR_64(convertible_t& a, convertible_t& b, convertible_t& c);
-		void OR_32(convertible_t& a, convertible_t& b, convertible_t& c);
-		void XOR_64(convertible_t& a, convertible_t& b, convertible_t& c);
-		void XOR_32(convertible_t& a, convertible_t& b, convertible_t& c);
-		void SHL_64(convertible_t& a, convertible_t& b, convertible_t& c);
-		void SHR_64(convertible_t& a, convertible_t& b, convertible_t& c);
-		void SAR_64(convertible_t& a, convertible_t& b, convertible_t& c);
-		void ROL_64(convertible_t& a, convertible_t& b, convertible_t& c);
-		void ROR_64(convertible_t& a, convertible_t& b, convertible_t& c);
-		bool JMP_COND(uint8_t, convertible_t&, int32_t);
-		void FPINIT();
-		void FPADD(convertible_t& a, fpu_reg_t& b, fpu_reg_t& c);
-		void FPSUB(convertible_t& a, fpu_reg_t& b, fpu_reg_t& c);
-		void FPMUL(convertible_t& a, fpu_reg_t& b, fpu_reg_t& c);
-		void FPDIV(convertible_t& a, fpu_reg_t& b, fpu_reg_t& c);
-		void FPSQRT(convertible_t& a, fpu_reg_t& b, fpu_reg_t& c);
-		void FPROUND(convertible_t& a, fpu_reg_t& b, fpu_reg_t& c);
-	}
-}
\ No newline at end of file
diff --git a/src/instructionsPortable.cpp b/src/instructionsPortable.cpp
index 790506b..59d19c5 100644
--- a/src/instructionsPortable.cpp
+++ b/src/instructionsPortable.cpp
@@ -17,26 +17,27 @@ You should have received a copy of the GNU General Public License
 along with RandomX.  If not, see<http://www.gnu.org/licenses/>.
 */
 //#define DEBUG
-#include "instructions.hpp"
 #include "intrinPortable.h"
+#include "blake2/endian.h"
 #pragma STDC FENV_ACCESS on
 #include <cfenv>
 #include <cmath>
 #ifdef DEBUG
 #include <iostream>
 #endif
+#include "common.hpp"
 
 #if defined(__SIZEOF_INT128__)
 	typedef unsigned __int128 uint128_t;
 	typedef __int128 int128_t;
-	static inline uint64_t __umulhi64(uint64_t a, uint64_t b) {
+	uint64_t mulh(uint64_t a, uint64_t b) {
 		return ((uint128_t)a * b) >> 64;
 	}
-	static inline uint64_t __imulhi64(int64_t a, int64_t b) {
+	int64_t smulh(int64_t a, int64_t b) {
 		return ((int128_t)a * b) >> 64;
 	}
-	#define umulhi64 __umulhi64
-	#define imulhi64 __imulhi64
+	#define HAVE_MULH
+	#define HAVE_SMULH
 #endif
 
 #if defined(_MSC_VER)
@@ -44,62 +45,62 @@ along with RandomX.  If not, see<http://www.gnu.org/licenses/>.
 	#define EVAL_DEFINE(X) HAS_VALUE(X)
 	#include <intrin.h>
 	#include <stdlib.h>
-	#define ror64 _rotr64
-	#define rol64 _rotl64
+
+	uint64_t rotl(uint64_t x, int c) {
+		return _rotl64(x, c);
+	}
+	uint64_t rotr(uint64_t x , int c) {
+		return _rotr64(x, c);
+	}
+	#define HAVE_ROTL
+	#define HAVE_ROTR
+
 	#if EVAL_DEFINE(__MACHINEARM64_X64(1))
-		#define umulhi64 __umulh
+		uint64_t mulh(uint64_t a, uint64_t b) {
+			return __umulh(a, b);
+		}
+		#define HAVE_MULH
 	#endif
+
 	#if EVAL_DEFINE(__MACHINEX64(1))
-		static inline uint64_t __imulhi64(int64_t a, int64_t b) {
+		int64_t smulh(int64_t a, int64_t b) {
 			int64_t hi;
 			_mul128(a, b, &hi);
 			return hi;
 		}
-		#define imulhi64 __imulhi64
+		#define HAVE_SMULH
 	#endif
-	static inline uint32_t _setRoundMode(uint32_t mode) {
-		return _controlfp(mode, _MCW_RC);
+
+	static void setRoundMode__(uint32_t mode) {
+		_controlfp(mode, _MCW_RC);
 	}
-	#define setRoundMode _setRoundMode
+	#define HAVE_SETROUNDMODE_IMPL
 #endif
 
-#ifndef setRoundMode
-	#define setRoundMode fesetround
+#ifndef HAVE_SETROUNDMODE_IMPL
+	static void setRoundMode__(uint32_t mode) {
+		fesetround(mode);
+	}
 #endif
 
-#ifndef ror64
-	static inline uint64_t __ror64(uint64_t a, int b) {
+#ifndef HAVE_ROTR
+	uint64_t rotr(uint64_t a, int b) {
 		return (a >> b) | (a << (64 - b));
 	}
-	#define ror64 __ror64
+	#define HAVE_ROTR
 #endif
 
-#ifndef rol64
-	static inline uint64_t __rol64(uint64_t a, int b) {
+#ifndef HAVE_ROTL
+	uint64_t rotl(uint64_t a, int b) {
 		return (a << b) | (a >> (64 - b));
 	}
-	#define rol64 __rol64
+	#define HAVE_ROTL
 #endif
 
-#ifndef sar64
-	#include <type_traits>
-	constexpr int64_t builtintShr64(int64_t value, int shift) noexcept {
-		return value >> shift;
-	}
-
-	struct UsesArithmeticShift : std::integral_constant<bool, builtintShr64(-1LL, 1) == -1LL> {
-	};
-
-	static inline int64_t __sar64(int64_t a, int b) {
-		return UsesArithmeticShift::value ? builtintShr64(a, b) : (a < 0 ? ~(~a >> b) : a >> b);
-	}
-	#define sar64 __sar64
-#endif
-
-#ifndef umulhi64
+#ifndef HAVE_MULH
 	#define LO(x) ((x)&0xffffffff)
 	#define HI(x) ((x)>>32)
-	static inline uint64_t __umulhi64(uint64_t a, uint64_t b) {
+	uint64_t mulh(uint64_t a, uint64_t b) {
 		uint64_t ah = HI(a), al = LO(a);
 		uint64_t bh = HI(b), bl = LO(b);
 		uint64_t x00 = al * bl;
@@ -112,17 +113,17 @@ along with RandomX.  If not, see<http://www.gnu.org/licenses/>.
 
 		return (m3 << 32) + LO(m2);
 	}
-	#define umulhi64 __umulhi64
+	#define HAVE_MULH
 #endif
 
-#ifndef imulhi64
-	static inline int64_t __imulhi64(int64_t a, int64_t b) {
-		int64_t hi = umulhi64(a, b);
+#ifndef HAVE_SMULH
+	int64_t smulh(int64_t a, int64_t b) {
+		int64_t hi = mulh(a, b);
 		if (a < 0LL) hi -= b;
 		if (b < 0LL) hi -= a;
 		return hi;
 	}
-	#define imulhi64 __imulhi64
+	#define HAVE_SMULH
 #endif
 
 // avoid undefined behavior of signed overflow
@@ -137,20 +138,20 @@ static inline int32_t safeSub(int32_t a, int32_t b) {
 
 #if defined(__has_builtin)
 #if __has_builtin(__builtin_sub_overflow)
-	static inline bool __subOverflow(int32_t a, int32_t b) {
+	static inline bool subOverflow__(uint32_t a, uint32_t b) {
 		int32_t temp;
-		return __builtin_sub_overflow(a, b, &temp);
+		return __builtin_sub_overflow(unsigned32ToSigned2sCompl(a), unsigned32ToSigned2sCompl(b), &temp);
 	}
-	#define subOverflow __subOverflow
+	#define HAVE_SUB_OVERFLOW
 #endif
 #endif
 
-#ifndef subOverflow
-	static inline bool __subOverflow(int32_t a, int32_t b) {
-		auto c = safeSub(a, b);
-		return (c < a) != (b > 0);
+#ifndef HAVE_SUB_OVERFLOW
+	static inline bool subOverflow__(uint32_t a, uint32_t b) {
+		auto c = unsigned32ToSigned2sCompl(a - b);
+		return (c < unsigned32ToSigned2sCompl(a)) != (unsigned32ToSigned2sCompl(b) > 0);
 	}
-	#define subOverflow __subOverflow
+	#define HAVE_SUB_OVERFLOW
 #endif
 
 static inline double FlushDenormalNaN(double x) {
@@ -165,251 +166,64 @@ static inline double FlushNaN(double x) {
 	return x != x ? 0.0 : x;
 }
 
-namespace RandomX {
-
-	extern "C" {
-
-		void ADD_64(convertible_t& a, convertible_t& b, convertible_t& c) {
-			c.u64 = a.u64 + b.u64;
-		}
-
-		void ADD_32(convertible_t& a, convertible_t& b, convertible_t& c) {
-			c.u64 = a.u32 + b.u32;
-		}
-
-		void SUB_64(convertible_t& a, convertible_t& b, convertible_t& c) {
-			c.u64 = a.u64 - b.u64;
-		}
-
-		void SUB_32(convertible_t& a, convertible_t& b, convertible_t& c) {
-			c.u64 = a.u32 - b.u32;
-		}
-
-		void MUL_64(convertible_t& a, convertible_t& b, convertible_t& c) {
-			c.u64 = a.u64 * b.u64;
-		}
-
-		void MULH_64(convertible_t& a, convertible_t& b, convertible_t& c) {
-			c.u64 = umulhi64(a.u64, b.u64);
-		}
-
-		void MUL_32(convertible_t& a, convertible_t& b, convertible_t& c) {
-			c.u64 = (uint64_t)a.u32 * b.u32;
-		}
-
-		void IMUL_32(convertible_t& a, convertible_t& b, convertible_t& c) {
-			c.i64 = (int64_t)a.i32 * b.i32;
-		}
-
-		void IMULH_64(convertible_t& a, convertible_t& b, convertible_t& c) {
-			c.i64 = imulhi64(a.i64, b.i64);
-		}
-
-		void DIV_64(convertible_t& a, convertible_t& b, convertible_t& c) {
-			c.u64 = a.u64 / (b.u32 != 0 ? b.u32 : 1U);
-		}
-
-		void IDIV_64(convertible_t& a, convertible_t& b, convertible_t& c) {
-			if (a.i64 == INT64_MIN && b.i32 == -1)
-				c.i64 = INT64_MIN;
-			else
-				c.i64 = a.i64 / (b.i32 != 0 ? b.i32 : 1);
-		}
-
-		void AND_64(convertible_t& a, convertible_t& b, convertible_t& c) {
-			c.u64 = a.u64 & b.u64;
-		}
-
-		void AND_32(convertible_t& a, convertible_t& b, convertible_t& c) {
-			c.u64 = a.u32 & b.u32;
-		}
-
-		void OR_64(convertible_t& a, convertible_t& b, convertible_t& c) {
-			c.u64 = a.u64 | b.u64;
-		}
-
-		void OR_32(convertible_t& a, convertible_t& b, convertible_t& c) {
-			c.u64 = a.u32 | b.u32;
-		}
-
-		void XOR_64(convertible_t& a, convertible_t& b, convertible_t& c) {
-			c.u64 = a.u64 ^ b.u64;
-		}
-
-		void XOR_32(convertible_t& a, convertible_t& b, convertible_t& c) {
-			c.u64 = a.u32 ^ b.u32;
-		}
-
-		void SHL_64(convertible_t& a, convertible_t& b, convertible_t& c) {
-			c.u64 = a.u64 << (b.u64 & 63);
-		}
-
-		void SHR_64(convertible_t& a, convertible_t& b, convertible_t& c) {
-			c.u64 = a.u64 >> (b.u64 & 63);
-		}
-
-		void SAR_64(convertible_t& a, convertible_t& b, convertible_t& c) {
-			c.u64 = sar64(a.i64, b.u64 & 63);
-		}
-
-		void ROL_64(convertible_t& a, convertible_t& b, convertible_t& c) {
-			c.u64 = rol64(a.u64, (b.u64 & 63));
-		}
-
-		void ROR_64(convertible_t& a, convertible_t& b, convertible_t& c) {
-			c.u64 = ror64(a.u64, (b.u64 & 63));
-		}
-
-		bool JMP_COND(uint8_t type, convertible_t& regb, int32_t imm32) {
-			switch (type & 7)
-			{
-				case 0:
-					return regb.u32 <= (uint32_t)imm32;
-				case 1:
-					return regb.u32 > (uint32_t)imm32;
-				case 2:
-					return safeSub(regb.i32, imm32) < 0;
-				case 3:
-					return safeSub(regb.i32, imm32) >= 0;
-				case 4:
-					return subOverflow(regb.i32, imm32);
-				case 5:
-					return !subOverflow(regb.i32, imm32);
-				case 6:
-					return regb.i32 < imm32;
-				case 7:
-					return regb.i32 >= imm32;
-			}
-		}
-
-		void FPINIT() {
-#ifdef __SSE2__
-			_mm_setcsr(0x9FC0); //Flush to zero, denormals are zero, default rounding mode, all exceptions disabled
-#else
-			setRoundMode(FE_TONEAREST);
-#endif
-		}
-
-		void FPADD(convertible_t& a, fpu_reg_t& b, fpu_reg_t& c) {
-#ifdef __SSE2__
-			__m128i ai = _mm_loadl_epi64((const __m128i*)&a);
-			__m128d ad = _mm_cvtepi32_pd(ai);
-			__m128d bd = _mm_load_pd(&b.lo.f64);
-			__m128d cd = _mm_add_pd(ad, bd);
-			_mm_store_pd(&c.lo.f64, cd);
-#else
-			double alo = (double)a.i32lo;
-			double ahi = (double)a.i32hi;
-			c.lo.f64 = alo + b.lo.f64;
-			c.hi.f64 = ahi + b.hi.f64;
-#endif
-		}
-
-		void FPSUB(convertible_t& a, fpu_reg_t& b, fpu_reg_t& c) {
-#ifdef __SSE2__
-			__m128i ai = _mm_loadl_epi64((const __m128i*)&a);
-			__m128d ad = _mm_cvtepi32_pd(ai);
-			__m128d bd = _mm_load_pd(&b.lo.f64);
-			__m128d cd = _mm_sub_pd(ad, bd);
-			_mm_store_pd(&c.lo.f64, cd);
-#else
-			double alo = (double)a.i32lo;
-			double ahi = (double)a.i32hi;
-			c.lo.f64 = alo - b.lo.f64;
-			c.hi.f64 = ahi - b.hi.f64;
-#endif
-		}
-
-		void FPMUL(convertible_t& a, fpu_reg_t& b, fpu_reg_t& c) {
-#ifdef __SSE2__
-			__m128i ai = _mm_loadl_epi64((const __m128i*)&a);
-			__m128d ad = _mm_cvtepi32_pd(ai);
-			__m128d bd = _mm_load_pd(&b.lo.f64);
-			__m128d cd = _mm_mul_pd(ad, bd);
-			__m128d mask = _mm_cmpeq_pd(cd, cd);
-			cd = _mm_and_pd(cd, mask);
-			_mm_store_pd(&c.lo.f64, cd);
-#else
-			double alo = (double)a.i32lo;
-			double ahi = (double)a.i32hi;
-			c.lo.f64 = FlushNaN(alo * b.lo.f64);
-			c.hi.f64 = FlushNaN(ahi * b.hi.f64);
-#endif
-		}
-
-		void FPDIV(convertible_t& a, fpu_reg_t& b, fpu_reg_t& c) {
-#ifdef __SSE2__
-			__m128i ai = _mm_loadl_epi64((const __m128i*)&a);
-			__m128d ad = _mm_cvtepi32_pd(ai);
-			__m128d bd = _mm_load_pd(&b.lo.f64);
-			__m128d cd = _mm_div_pd(ad, bd);
-			__m128d mask = _mm_cmpeq_pd(cd, cd);
-			cd = _mm_and_pd(cd, mask);
-			_mm_store_pd(&c.lo.f64, cd);
-#else
-			double alo = (double)a.i32lo;
-			double ahi = (double)a.i32hi;
-			c.lo.f64 = FlushDenormalNaN(alo / b.lo.f64);
-			c.hi.f64 = FlushDenormalNaN(ahi / b.hi.f64);
-#endif
-		}
-
-		void FPSQRT(convertible_t& a, fpu_reg_t& b, fpu_reg_t& c) {
-#ifdef __SSE2__
-			__m128i ai = _mm_loadl_epi64((const __m128i*)&a);
-			__m128d ad = _mm_cvtepi32_pd(ai);
-			const __m128d absmask = _mm_castsi128_pd(_mm_set1_epi64x(~(1LL << 63)));
-			ad = _mm_and_pd(ad, absmask);
-			__m128d cd = _mm_sqrt_pd(ad);
-			_mm_store_pd(&c.lo.f64, cd);
-#else
-			double alo = (double)a.i32lo;
-			double ahi = (double)a.i32hi;
-			c.lo.f64 = sqrt(std::abs(alo));
-			c.hi.f64 = sqrt(std::abs(ahi));
-#endif
-		}
-
-		void FPROUND(convertible_t& a, fpu_reg_t& b, fpu_reg_t& c) {
-			c.lo.f64 = convertSigned52(a.i64);
-			switch (a.u64 & 3) {
-				case RoundDown:
-#ifdef DEBUG
-					std::cout << "Round FE_DOWNWARD (" << FE_DOWNWARD << ") = " <<
-#endif
-					setRoundMode(FE_DOWNWARD);
-#ifdef DEBUG
-					std::cout << std::endl;
-#endif
-					break;
-				case RoundUp:
-#ifdef DEBUG
-					std::cout << "Round FE_UPWARD (" << FE_UPWARD << ") = " <<
-#endif
-					setRoundMode(FE_UPWARD);
-#ifdef DEBUG
-					std::cout << std::endl;
-#endif
-					break;
-				case RoundToZero:
-#ifdef DEBUG
-					std::cout << "Round FE_TOWARDZERO (" << FE_TOWARDZERO << ") = " <<
-#endif
-					setRoundMode(FE_TOWARDZERO);
-#ifdef DEBUG
-					std::cout << std::endl;
-#endif
-					break;
-				default:
-#ifdef DEBUG
-					std::cout << "Round FE_TONEAREST (" << FE_TONEAREST << ") = " <<
-#endif
-					setRoundMode(FE_TONEAREST);
-#ifdef DEBUG
-					std::cout << std::endl;
-#endif
-					break;
-			}
-		}
+void setRoundMode(uint32_t rcflag) {
+	switch (rcflag & 3) {
+		case RoundDown:
+			setRoundMode__(FE_DOWNWARD);
+			break;
+		case RoundUp:
+			setRoundMode__(FE_UPWARD);
+			break;
+		case RoundToZero:
+			setRoundMode__(FE_TOWARDZERO);
+			break;
+		case RoundToNearest:
+			setRoundMode__(FE_TONEAREST);
+			break;
+		default:
+			UNREACHABLE;
 	}
-}
\ No newline at end of file
+}
+
+bool condition(uint32_t type, uint32_t value, uint32_t imm32) {
+	switch (type & 7)
+	{
+		case 0:
+			return value <= imm32;
+		case 1:
+			return value > imm32;
+		case 2:
+			return unsigned32ToSigned2sCompl(value - imm32) < 0;
+		case 3:
+			return unsigned32ToSigned2sCompl(value - imm32) >= 0;
+		case 4:
+			return subOverflow__(value, imm32);
+		case 5:
+			return !subOverflow__(value, imm32);
+		case 6:
+			return unsigned32ToSigned2sCompl(value) < unsigned32ToSigned2sCompl(imm32);
+		case 7:
+			return unsigned32ToSigned2sCompl(value) >= unsigned32ToSigned2sCompl(imm32);
+		default:
+			UNREACHABLE;
+	}
+}
+
+void initFpu() {
+#ifdef __SSE2__
+	_mm_setcsr(0x9FC0); //Flush to zero, denormals are zero, default rounding mode, all exceptions disabled
+#else
+	setRoundMode(RoundToNearest);
+#endif
+}
+
+union double_ser_t {
+	double f;
+	uint64_t i;
+};
+
+double loadDoublePortable(const void* addr) {
+	double_ser_t ds;
+	ds.i = load64(addr);
+	return ds.f;
+}
diff --git a/src/intrinPortable.h b/src/intrinPortable.h
index 3a473a2..2c2e487 100644
--- a/src/intrinPortable.h
+++ b/src/intrinPortable.h
@@ -19,6 +19,8 @@ along with RandomX.  If not, see<http://www.gnu.org/licenses/>.
 
 #pragma once
 
+#include <cstdint>
+
 #if defined(_MSC_VER)
 #if defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP == 2)
 #define __SSE2__ 1
@@ -31,12 +33,21 @@ along with RandomX.  If not, see<http://www.gnu.org/licenses/>.
 #else
 #include <intrin.h>
 #endif
+
+inline __m128d _mm_abs(__m128d xd) {
+	const __m128d absmask = _mm_castsi128_pd(_mm_set1_epi64x(~(1LL << 63)));
+	return _mm_and_pd(xd, absmask);
+}
+
+#define PREFETCHNTA(x) _mm_prefetch((const char *)(x), _MM_HINT_NTA)
+
 #else
 #include <cstdint>
 #include <stdexcept>
 
 #define _mm_malloc(a,b) malloc(a)
 #define _mm_free(a) free(a)
+#define PREFETCHNTA(x)
 
 typedef union {
 	uint64_t u64[2];
@@ -45,6 +56,18 @@ typedef union {
 	uint8_t u8[16];
 } __m128i;
 
+typedef struct {
+	double lo;
+	double hi;
+} __m128d;
+
+inline __m128d _mm_load_pd(const double* pd) {
+	__m128d x;
+	x.lo = *(pd + 0);
+	x.hi = *(pd + 1);
+	return x;
+}
+
 static const char* platformError = "Platform doesn't support hardware AES";
 
 inline __m128i _mm_aeskeygenassist_si128(__m128i key, uint8_t rcon) {
@@ -131,4 +154,36 @@ inline __m128i _mm_slli_si128(__m128i _A, int _Imm) {
 	return _A;
 }
 
-#endif
\ No newline at end of file
+#endif
+
+constexpr int RoundToNearest = 0;
+constexpr int RoundDown = 1;
+constexpr int RoundUp = 2;
+constexpr int RoundToZero = 3;
+
+constexpr int32_t unsigned32ToSigned2sCompl(uint32_t x) {
+	return (-1 == ~0) ? (int32_t)x : (x > INT32_MAX ? (-(int32_t)(UINT32_MAX - x) - 1) : (int32_t)x);
+}
+
+constexpr int64_t unsigned64ToSigned2sCompl(uint64_t x) {
+	return (-1 == ~0) ? (int64_t)x : (x > INT64_MAX ? (-(int64_t)(UINT64_MAX - x) - 1) : (int64_t)x);
+}
+
+constexpr uint64_t signExtend2sCompl(uint32_t x) {
+	return (-1 == ~0) ? (int64_t)(int32_t)(x) : (x > INT32_MAX ? (x | 0xffffffff00000000ULL) : (uint64_t)x);
+}
+
+inline __m128d load_cvt_i32x2(const void* addr) {
+	__m128i ix = _mm_load_si128((const __m128i*)addr);
+	return _mm_cvtepi32_pd(ix);
+}
+
+double loadDoublePortable(const void* addr);
+
+uint64_t mulh(uint64_t, uint64_t);
+int64_t smulh(int64_t, int64_t);
+uint64_t rotl(uint64_t, int);
+uint64_t rotr(uint64_t, int);
+void initFpu();
+void setRoundMode(uint32_t);
+bool condition(uint32_t, uint32_t, uint32_t);
diff --git a/src/main.cpp b/src/main.cpp
index 8bb5492..1229feb 100644
--- a/src/main.cpp
+++ b/src/main.cpp
@@ -29,11 +29,11 @@ along with RandomX.  If not, see<http://www.gnu.org/licenses/>.
 #include <cstring>
 #include "Program.hpp"
 #include <string>
-#include "instructions.hpp"
 #include <thread>
 #include <atomic>
 #include "dataset.hpp"
 #include "Cache.hpp"
+#include "hashAes1Rx4.hpp"
 
 const uint8_t seed[32] = { 191, 182, 222, 175, 249, 89, 134, 104, 241, 68, 191, 62, 162, 166, 61, 64, 123, 191, 227, 193, 118, 60, 188, 53, 223, 133, 175, 24, 123, 230, 55, 74 };
 
@@ -115,7 +115,7 @@ void printUsage(const char* executable) {
 }
 
 void generateAsm(int nonce) {
-	uint64_t hash[4];
+	uint64_t hash[8];
 	unsigned char blockTemplate[] = {
 		0x07, 0x07, 0xf7, 0xa4, 0xf0, 0xd6, 0x05, 0xb3, 0x03, 0x26, 0x08, 0x16, 0xba, 0x3f, 0x10, 0x90, 0x2e, 0x1a, 0x14,
 		0x5a, 0xc5, 0xfa, 0xd3, 0xaa, 0x3a, 0xf6, 0xea, 0x44, 0xc1, 0x18, 0x69, 0xdc, 0x4f, 0x85, 0x3f, 0x00, 0x2b, 0x2e,
@@ -126,11 +126,13 @@ void generateAsm(int nonce) {
 	*noncePtr = nonce;
 	blake2b(hash, sizeof(hash), blockTemplate, sizeof(blockTemplate), nullptr, 0);
 	RandomX::AssemblyGeneratorX86 asmX86;
-	asmX86.generateProgram(hash);
+	RandomX::Program p;
+	fillAes1Rx4<false>(hash, sizeof(p), &p);
+	asmX86.generateProgram(p);
 	asmX86.printCode(std::cout);
 }
 
-void mine(RandomX::VirtualMachine* vm, std::atomic<int>& atomicNonce, AtomicHash& result, int noncesCount, int thread) {
+void generateNative(int nonce) {
 	uint64_t hash[4];
 	unsigned char blockTemplate[] = {
 		0x07, 0x07, 0xf7, 0xa4, 0xf0, 0xd6, 0x05, 0xb3, 0x03, 0x26, 0x08, 0x16, 0xba, 0x3f, 0x10, 0x90, 0x2e, 0x1a, 0x14,
@@ -139,18 +141,44 @@ void mine(RandomX::VirtualMachine* vm, std::atomic<int>& atomicNonce, AtomicHash
 		0xc3, 0x8b, 0xde, 0xd3, 0x4d, 0x2d, 0xcd, 0xee, 0xf9, 0x5c, 0xd2, 0x0c, 0xef, 0xc1, 0x2f, 0x61, 0xd5, 0x61, 0x09
 	};
 	int* noncePtr = (int*)(blockTemplate + 39);
+	*noncePtr = nonce;
+	blake2b(hash, sizeof(hash), blockTemplate, sizeof(blockTemplate), nullptr, 0);
+	alignas(16) RandomX::Program prog;
+	fillAes1Rx4<false>((void*)hash, sizeof(prog), &prog);
+	for (int i = 0; i < RandomX::ProgramLength; ++i) {
+		prog(i).dst %= 8;
+		prog(i).src %= 8;
+	}
+	std::cout << prog << std::endl;
+}
+
+void mine(RandomX::VirtualMachine* vm, std::atomic<int>& atomicNonce, AtomicHash& result, int noncesCount, int thread, uint8_t* scratchpad) {
+	alignas(16) uint64_t hash[8];
+	unsigned char blockTemplate[] = {
+		0x07, 0x07, 0xf7, 0xa4, 0xf0, 0xd6, 0x05, 0xb3, 0x03, 0x26, 0x08, 0x16, 0xba, 0x3f, 0x10, 0x90, 0x2e, 0x1a, 0x14,
+		0x5a, 0xc5, 0xfa, 0xd3, 0xaa, 0x3a, 0xf6, 0xea, 0x44, 0xc1, 0x18, 0x69, 0xdc, 0x4f, 0x85, 0x3f, 0x00, 0x2b, 0x2e,
+		0xea, 0x00, 0x00, 0x00, 0x00, 0x77, 0xb2, 0x06, 0xa0, 0x2c, 0xa5, 0xb1, 0xd4, 0xce, 0x6b, 0xbf, 0xdf, 0x0a, 0xca,
+		0xc3, 0x8b, 0xde, 0xd3, 0x4d, 0x2d, 0xcd, 0xee, 0xf9, 0x5c, 0xd2, 0x0c, 0xef, 0xc1, 0x2f, 0x61, 0xd5, 0x61, 0x09
+	};
+	int* noncePtr = (int*)(blockTemplate + 39);
 	int nonce = atomicNonce.fetch_add(1);
 
 	while (nonce < noncesCount) {
 		//std::cout << "Thread " << thread << " nonce " << nonce << std::endl;
 		*noncePtr = nonce;
 		blake2b(hash, sizeof(hash), blockTemplate, sizeof(blockTemplate), nullptr, 0);
-		int spIndex = ((uint8_t*)hash)[24] | ((((uint8_t*)hash)[25] & 63) << 8);
-		vm->initializeScratchpad(spIndex);
-		vm->initializeProgram(hash);
+		fillAes1Rx4<false>((void*)hash, RandomX::ScratchpadSize, scratchpad);
+		//vm->initializeScratchpad(scratchpad, spIndex);
+		vm->setScratchpad(scratchpad);
 		//dump((char*)((RandomX::CompiledVirtualMachine*)vm)->getProgram(), RandomX::CodeSize, "code-1337-jmp.txt");
-		vm->execute();
-		vm->getResult(hash);
+		for (int chain = 0; chain < 8; ++chain) {
+			fillAes1Rx4<false>((void*)hash, sizeof(RandomX::Program), vm->getProgramBuffer());
+			vm->initialize();
+			vm->execute();
+			vm->getResult<false>(nullptr, 0, hash);
+		}
+		//vm->initializeProgram(hash);
+		vm->getResult<false>(scratchpad, RandomX::ScratchpadSize, hash);
 		result.xorWith(hash);
 		if (RandomX::trace) {
 			std::cout << "Nonce: " << nonce << " ";
@@ -162,7 +190,7 @@ void mine(RandomX::VirtualMachine* vm, std::atomic<int>& atomicNonce, AtomicHash
 }
 
 int main(int argc, char** argv) {
-	bool softAes, lightClient, genAsm, compiled, help;
+	bool softAes, lightClient, genAsm, compiled, help, largePages, async, aesBench, genNative;
 	int programCount, threadCount;
 	readOption("--help", argc, argv, help);
 
@@ -177,33 +205,56 @@ int main(int argc, char** argv) {
 	readOption("--compiled", argc, argv, compiled);
 	readIntOption("--threads", argc, argv, threadCount, 1);
 	readIntOption("--nonces", argc, argv, programCount, 1000);
+	readOption("--largePages", argc, argv, largePages);
+	readOption("--async", argc, argv, async);
+	readOption("--aesBench", argc, argv, aesBench);
+	readOption("--genNative", argc, argv, genNative);
 
 	if (genAsm) {
 		generateAsm(programCount);
 		return 0;
 	}
 
+	if (genNative) {
+		generateNative(programCount);
+		return 0;
+	}
+
+	if (softAes)
+		std::cout << "Using software AES." << std::endl;
+
+	if(aesBench) {
+		programCount *= 10;
+		Stopwatch sw(true);
+		if (softAes) {
+			RandomX::aesBench<true>(programCount);
+		}
+		else {
+			RandomX::aesBench<false>(programCount);
+		}
+		sw.stop();
+		std::cout << "AES performance: " << programCount / sw.getElapsed() << " blocks/s" << std::endl;
+		return 0;
+	}
+
 	std::atomic<int> atomicNonce(0);
 	AtomicHash result;
 	std::vector<RandomX::VirtualMachine*> vms;
 	std::vector<std::thread> threads;
 	RandomX::dataset_t dataset;
 
-	if (softAes)
-		std::cout << "Using software AES." << std::endl;
 	std::cout << "Initializing..." << std::endl;
-
 	try {
 		Stopwatch sw(true);
 		if (softAes) {
-			RandomX::datasetInitCache<true>(seed, dataset);
+			RandomX::datasetInitCache<true>(seed, dataset, largePages);
 		}
 		else {
-			RandomX::datasetInitCache<false>(seed, dataset);
+			RandomX::datasetInitCache<false>(seed, dataset, largePages);
 		}
 		if (RandomX::trace) {
 			std::cout << "Keys: " << std::endl;
-			for (int i = 0; i < dataset.cache->getKeys().size(); ++i) {
+			for (unsigned i = 0; i < dataset.cache->getKeys().size(); ++i) {
 				outputHex(std::cout, (char*)&dataset.cache->getKeys()[i], sizeof(__m128i));
 			}
 			std::cout << std::endl;
@@ -212,11 +263,11 @@ int main(int argc, char** argv) {
 			std::cout << std::endl;
 		}
 		if (lightClient) {
-			std::cout << "Cache (64 MiB) initialized in " << sw.getElapsed() << " s" << std::endl;
+			std::cout << "Cache (256 MiB) initialized in " << sw.getElapsed() << " s" << std::endl;
 		}
 		else {
 			RandomX::Cache* cache = dataset.cache;
-			RandomX::datasetAlloc(dataset);
+			RandomX::datasetAlloc(dataset, largePages);
 			if (threadCount > 1) {
 				auto perThread = RandomX::DatasetBlockCount / threadCount;
 				auto remainder = RandomX::DatasetBlockCount % threadCount;
@@ -229,7 +280,7 @@ int main(int argc, char** argv) {
 						threads.push_back(std::thread(&RandomX::datasetInit<false>, cache, dataset, i * perThread, count));
 					}
 				}
-				for (int i = 0; i < threads.size(); ++i) {
+				for (unsigned i = 0; i < threads.size(); ++i) {
 					threads[i].join();
 				}
 			}
@@ -241,7 +292,7 @@ int main(int argc, char** argv) {
 					RandomX::datasetInit<false>(cache, dataset, 0, RandomX::DatasetBlockCount);
 				}
 			}
-			delete cache;
+			RandomX::Cache::dealloc(cache, largePages);
 			threads.clear();
 			std::cout << "Dataset (4 GiB) initialized in " << sw.getElapsed() << " s" << std::endl;
 		}
@@ -249,37 +300,47 @@ int main(int argc, char** argv) {
 		for (int i = 0; i < threadCount; ++i) {
 			RandomX::VirtualMachine* vm;
 			if (compiled) {
-				vm = new RandomX::CompiledVirtualMachine(softAes);
+				vm = new RandomX::CompiledVirtualMachine();
 			}
 			else {
-				vm = new RandomX::InterpretedVirtualMachine(softAes);
+				vm = new RandomX::InterpretedVirtualMachine(softAes, async);
 			}
-			vm->setDataset(dataset, lightClient);
+			vm->setDataset(dataset);
 			vms.push_back(vm);
 		}
+		uint8_t* scratchpadMem;
+		if (largePages) {
+			scratchpadMem = (uint8_t*)allocLargePagesMemory(threadCount * RandomX::ScratchpadSize);
+		}
+		else {
+			scratchpadMem = (uint8_t*)_mm_malloc(threadCount * RandomX::ScratchpadSize, RandomX::CacheLineSize);
+		}
 		std::cout << "Running benchmark (" << programCount << " programs) ..." << std::endl;
 		sw.restart();
 		if (threadCount > 1) {
-			for (int i = 0; i < vms.size(); ++i) {
-				threads.push_back(std::thread(&mine, vms[i], std::ref(atomicNonce), std::ref(result), programCount, i));
+			for (unsigned i = 0; i < vms.size(); ++i) {
+				threads.push_back(std::thread(&mine, vms[i], std::ref(atomicNonce), std::ref(result), programCount, i, scratchpadMem + RandomX::ScratchpadSize * i));
 			}
-			for (int i = 0; i < threads.size(); ++i) {
+			for (unsigned i = 0; i < threads.size(); ++i) {
 				threads[i].join();
 			}
 		}
 		else {
-			mine(vms[0], std::ref(atomicNonce), std::ref(result), programCount, 0);
+			mine(vms[0], std::ref(atomicNonce), std::ref(result), programCount, 0, scratchpadMem);
+			if (compiled)
+				std::cout << "Average program size: " << ((RandomX::CompiledVirtualMachine*)vms[0])->getTotalSize() / programCount << std::endl;
 		}
 		double elapsed = sw.getElapsed();
 		std::cout << "Calculated result: ";
 		result.print(std::cout);
 		if(programCount == 1000)
 		std::cout << "Reference result:  3e1c5f9b9d0bf8ffa250f860bf5f7ab76ac823b206ddee6a592660119a3640c6" << std::endl;
-		std::cout << "Performance: " << programCount / elapsed << " programs per second" << std::endl;
-		/*if (threadCount == 1 && !compiled) {
-			auto ivm = (RandomX::InterpretedVirtualMachine*)vms[0];
-			std::cout << ivm->getProgam();
-		}*/
+		if (lightClient) {
+			std::cout << "Performance: " << 1000 * elapsed / programCount << " ms per hash" << std::endl;
+		}
+		else {
+			std::cout << "Performance: " << programCount / elapsed << " hashes per second" << std::endl;
+		}
 	}
 	catch (std::exception& e) {
 		std::cout << "ERROR: " << e.what() << std::endl;
diff --git a/src/program.inc b/src/program.inc
index 081647f..ac8957b 100644
--- a/src/program.inc
+++ b/src/program.inc
@@ -1,6863 +1,740 @@
-rx_i_0: ;RET
-	dec edi
-	jz rx_finish
-	xor r9, 0ca9788ah
-	mov eax, r9d
-	and eax, 32767
-	mov rax, qword ptr [rsi + rax * 8]
-	cmp rsp, rbp
-	je short not_taken_ret_0
-	xor rax, qword ptr [rsp + 8]
-	mov rcx, rax
-	mov eax, r12d
-	xor eax, 01a8e4171h
-	and eax, 2047
-	mov qword ptr [rsi + rax * 8], rcx
-	ret 8
-not_taken_ret_0:
-	mov rcx, rax
-	mov eax, r12d
-	xor eax, 01a8e4171h
-	and eax, 2047
-	mov qword ptr [rsi + rax * 8], rcx
-
-rx_i_1: ;AND_64
-	dec edi
-	jz rx_finish
-	xor r15, 06afc2fa4h
-	mov ecx, r15d
-	call rx_read_dataset_r
-	and rax, r10
-	mov r12, rax
-
-rx_i_2: ;CALL
-	dec edi
-	jz rx_finish
-	xor r15, 097210f7bh
-	mov eax, r15d
-	and eax, 2047
-	mov rax, qword ptr [rsi + rax * 8]
-	cmp r11d, 1348521207
-	jno short taken_call_2
-	mov rcx, rax
-	mov eax, r9d
-	xor eax, 05060ccf7h
-	and eax, 2047
-	mov qword ptr [rsi + rax * 8], rcx
-	jmp rx_i_3
-taken_call_2:
-	push rax
-	call rx_i_47
-
-rx_i_3: ;FPROUND
-	dec edi
-	jz rx_finish
-	xor r13, 082c73195h
-	mov ecx, r13d
-	call rx_read_dataset_r
-	mov rcx, rax
-	shl eax, 13
-	and rcx, -2048
-	and eax, 24576
-	cvtsi2sd xmm8, rcx
-	or eax, 40896
-	mov dword ptr [rsp - 8], eax
-	ldmxcsr dword ptr [rsp - 8]
+	; COND_M r1, sg(L1[r3], -2004237569)
+	xor ecx, ecx
+	mov eax, r11d
+	and eax, 16376
+	cmp dword ptr [rsi+rax], -2004237569
+	sets cl
+	add r9, rcx
+	; IXOR_R r7, -1379425991
+	xor r15, -1379425991
+	; IXOR_R r2, r6
+	xor r10, r14
+	; FSWAP_R f3
+	shufpd xmm3, xmm3, 1
+	; FADD_R f1, a1
+	addpd xmm1, xmm9
+	; IMUL_R r0, r5
+	imul r8, r13
+	; FMUL_R e1, a3
+	mulpd xmm5, xmm11
+	; IADD_R r3, r2
+	add r11, r10
+	; COND_M r1, ab(L2[r6], -724006934)
+	xor ecx, ecx
+	mov eax, r14d
+	and eax, 262136
+	cmp dword ptr [rsi+rax], -724006934
+	seta cl
+	add r9, rcx
+	; IADD_RC r2, r7, -854121467
+	lea r10, [r10+r15-854121467]
+	; IADD_RC r5, r6, 1291744030
+	lea r13, [r13+r14+1291744030]
+	; ISTORE L2[r6], r4
+	mov eax, r14d
+	and eax, 262136
+	mov qword ptr [rsi+rax], r12
+	; IMUL_R r6, r7
+	imul r14, r15
+	; FSUB_R f0, a3
+	subpd xmm0, xmm11
+	; IADD_M r3, L1[r0]
 	mov eax, r8d
-	xor eax, 06bb1a0b2h
-	and eax, 32767
-	movlpd qword ptr [rsi + rax * 8], xmm8
-
-rx_i_4: ;MULH_64
-	dec edi
-	jz rx_finish
-	xor r14, 077daefb4h
-	mov eax, r14d
-	and eax, 32767
-	mov rax, qword ptr [rsi + rax * 8]
-	mov rcx, r14
-	mul rcx
-	mov rax, rdx
-	mov rcx, rax
-	mov eax, r9d
-	xor eax, 06ce10c20h
-	and eax, 2047
-	mov qword ptr [rsi + rax * 8], rcx
-
-rx_i_5: ;IMUL_32
-	dec edi
-	jz rx_finish
-	xor r15, 0379f9ee0h
-	mov ecx, r15d
-	call rx_read_dataset_r
-	movsxd rcx, eax
-	movsxd rax, r12d
-	imul rax, rcx
-	mov r12, rax
-
-rx_i_6: ;MUL_64
-	dec edi
-	jz rx_finish
-	xor r8, 03bae7272h
-	mov ecx, r8d
-	call rx_read_dataset_r
-	imul rax, r15
-	mov rcx, rax
-	mov eax, r9d
-	xor eax, 098a649d1h
-	and eax, 32767
-	mov qword ptr [rsi + rax * 8], rcx
-
-rx_i_7: ;FPADD
-	dec edi
-	jz rx_finish
-	xor r10, 0e264ed81h
-	mov eax, r10d
-	and eax, 32767
-	cvtdq2pd xmm0, qword ptr [rsi + rax * 8]
-	addpd xmm0, xmm6
-	movaps xmm6, xmm0
-	mov eax, r14d
-	xor eax, 057c8c41bh
-	and eax, 32767
-	movhpd qword ptr [rsi + rax * 8], xmm6
-
-rx_i_8: ;SHL_64
-	dec edi
-	jz rx_finish
-	xor r13, 068c1e5d2h
-	mov ecx, r13d
-	call rx_read_dataset_r
-	shl rax, 47
-	mov rcx, rax
-	mov eax, r12d
-	xor eax, 050267ebdh
-	and eax, 2047
-	mov qword ptr [rsi + rax * 8], rcx
-
-rx_i_9: ;AND_64
-	dec edi
-	jz rx_finish
-	xor r14, 085121c54h
-	mov eax, r14d
-	and eax, 32767
-	mov rax, qword ptr [rsi + rax * 8]
-	and rax, 565870810
-	mov r10, rax
-
-rx_i_10: ;OR_64
-	dec edi
-	jz rx_finish
-	xor r8, 052efde3eh
-	mov ecx, r8d
-	call rx_read_dataset_r
-	or rax, -727859809
-	mov r13, rax
-
-rx_i_11: ;FPADD
-	dec edi
-	jz rx_finish
-	xor r10, 0a9bf8aa1h
-	mov ecx, r10d
-	call rx_read_dataset_f
-	addpd xmm0, xmm5
-	movaps xmm4, xmm0
-	mov eax, r12d
-	xor eax, 0852d40d8h
-	and eax, 2047
-	movhpd qword ptr [rsi + rax * 8], xmm4
-
-rx_i_12: ;CALL
-	dec edi
-	jz rx_finish
-	xor r10, 0db2691ch
-	mov ecx, r10d
-	call rx_read_dataset_r
-	cmp r8d, -1763940407
-	jge short taken_call_12
-	mov r8, rax
-	jmp rx_i_13
-taken_call_12:
-	push rax
-	call rx_i_35
-
-rx_i_13: ;FPSUB
-	dec edi
-	jz rx_finish
-	xor r12, 061c0d34dh
-	mov ecx, r12d
-	call rx_read_dataset_f
-	subpd xmm0, xmm3
-	movaps xmm9, xmm0
-
-rx_i_14: ;SHR_64
-	dec edi
-	jz rx_finish
-	xor r10, 0e761d1beh
-	mov ecx, r10d
-	call rx_read_dataset_r
-	shr rax, 4
-	mov rcx, rax
-	mov eax, r10d
-	xor eax, 03c1a72f8h
-	and eax, 2047
-	mov qword ptr [rsi + rax * 8], rcx
-
-rx_i_15: ;RET
-	dec edi
-	jz rx_finish
-	xor r11, 074ddb688h
-	mov ecx, r11d
-	call rx_read_dataset_r
-	cmp rsp, rbp
-	je short not_taken_ret_15
-	xor rax, qword ptr [rsp + 8]
-	mov rcx, rax
-	mov eax, r14d
-	xor eax, 0468b38b8h
-	and eax, 32767
-	mov qword ptr [rsi + rax * 8], rcx
-	ret 8
-not_taken_ret_15:
-	mov rcx, rax
-	mov eax, r14d
-	xor eax, 0468b38b8h
-	and eax, 32767
-	mov qword ptr [rsi + rax * 8], rcx
-
-rx_i_16: ;ADD_64
-	dec edi
-	jz rx_finish
-	xor r14, 06be90627h
-	mov eax, r14d
-	and eax, 2047
-	mov rax, qword ptr [rsi + rax * 8]
-	add rax, r10
-	mov rcx, rax
-	mov eax, r9d
-	xor eax, 0d7e75aeh
-	and eax, 32767
-	mov qword ptr [rsi + rax * 8], rcx
-
-rx_i_17: ;FPMUL
-	dec edi
-	jz rx_finish
-	xor r11, 0fbc6fc35h
-	mov eax, r11d
-	and eax, 32767
-	cvtdq2pd xmm0, qword ptr [rsi + rax * 8]
-	mulpd xmm0, xmm4
-	movaps xmm1, xmm0
-	cmpeqpd xmm1, xmm1
-	andps xmm0, xmm1
-	movaps xmm4, xmm0
-	mov eax, r12d
-	xor eax, 0f77ffe16h
-	and eax, 32767
-	movlpd qword ptr [rsi + rax * 8], xmm4
-
-rx_i_18: ;FPSUB
-	dec edi
-	jz rx_finish
-	xor r14, 0c28ca080h
-	mov eax, r14d
-	and eax, 2047
-	cvtdq2pd xmm0, qword ptr [rsi + rax * 8]
-	subpd xmm0, xmm4
-	movaps xmm3, xmm0
-	mov eax, r11d
-	xor eax, 0869baa81h
-	and eax, 2047
-	movhpd qword ptr [rsi + rax * 8], xmm3
-
-rx_i_19: ;FPSUB
-	dec edi
-	jz rx_finish
-	xor r13, 0ac009c30h
-	mov eax, r13d
-	and eax, 2047
-	cvtdq2pd xmm0, qword ptr [rsi + rax * 8]
-	subpd xmm0, xmm8
-	movaps xmm7, xmm0
-
-rx_i_20: ;FPMUL
-	dec edi
-	jz rx_finish
-	xor r13, 0ecca967dh
-	mov ecx, r13d
-	call rx_read_dataset_f
-	mulpd xmm0, xmm2
-	movaps xmm1, xmm0
-	cmpeqpd xmm1, xmm1
-	andps xmm0, xmm1
-	movaps xmm7, xmm0
-	mov eax, r15d
-	xor eax, 0aad81365h
-	and eax, 2047
-	movhpd qword ptr [rsi + rax * 8], xmm7
-
-rx_i_21: ;FPADD
-	dec edi
-	jz rx_finish
-	xor r8, 0977f0284h
-	mov ecx, r8d
-	call rx_read_dataset_f
-	addpd xmm0, xmm9
-	movaps xmm7, xmm0
-
-rx_i_22: ;ADD_32
-	dec edi
-	jz rx_finish
-	xor r13, 080bdfefah
-	mov ecx, r13d
-	call rx_read_dataset_r
-	add eax, r8d
-	mov rcx, rax
-	mov eax, r10d
-	xor eax, 0cfa09799h
-	and eax, 32767
-	mov qword ptr [rsi + rax * 8], rcx
-
-rx_i_23: ;MUL_64
-	dec edi
-	jz rx_finish
-	xor r15, 0e1e0d3c4h
-	mov ecx, r15d
-	call rx_read_dataset_r
-	imul rax, r11
-	mov r8, rax
-
-rx_i_24: ;IMULH_64
-	dec edi
-	jz rx_finish
-	xor r8, 070d3b8c7h
-	mov ecx, r8d
-	call rx_read_dataset_r
-	mov rcx, r15
-	imul rcx
-	mov rax, rdx
-	mov rcx, rax
-	mov eax, r15d
-	xor eax, 099b77a68h
-	and eax, 2047
-	mov qword ptr [rsi + rax * 8], rcx
-
-rx_i_25: ;FPMUL
-	dec edi
-	jz rx_finish
-	xor r12, 01cf77a04h
-	mov ecx, r12d
-	call rx_read_dataset_f
-	mulpd xmm0, xmm9
-	movaps xmm1, xmm0
-	cmpeqpd xmm1, xmm1
-	andps xmm0, xmm1
-	movaps xmm6, xmm0
-	mov eax, r14d
-	xor eax, 0baf5c2d4h
-	and eax, 2047
-	movhpd qword ptr [rsi + rax * 8], xmm6
-
-rx_i_26: ;IMUL_32
-	dec edi
-	jz rx_finish
-	xor r11, 0e311468ch
-	mov ecx, r11d
-	call rx_read_dataset_r
-	movsxd rcx, eax
-	movsxd rax, r13d
-	imul rax, rcx
-	mov rcx, rax
-	mov eax, r9d
-	xor eax, 0306ff9ech
-	and eax, 32767
-	mov qword ptr [rsi + rax * 8], rcx
-
-rx_i_27: ;FPMUL
-	dec edi
-	jz rx_finish
-	xor r12, 01fd9911ah
-	mov eax, r12d
-	and eax, 2047
-	cvtdq2pd xmm0, qword ptr [rsi + rax * 8]
-	mulpd xmm0, xmm3
-	movaps xmm1, xmm0
-	cmpeqpd xmm1, xmm1
-	andps xmm0, xmm1
-	movaps xmm6, xmm0
-
-rx_i_28: ;XOR_64
-	dec edi
-	jz rx_finish
-	xor r13, 067df757eh
-	mov eax, r13d
-	and eax, 2047
-	mov rax, qword ptr [rsi + rax * 8]
-	xor rax, r13
-	mov r14, rax
-
-rx_i_29: ;SUB_64
-	dec edi
-	jz rx_finish
-	xor r12, 0be2e7c42h
-	mov ecx, r12d
-	call rx_read_dataset_r
-	sub rax, 1944166515
-	mov r14, rax
-
-rx_i_30: ;FPADD
-	dec edi
-	jz rx_finish
-	xor r11, 084d067f7h
-	mov eax, r11d
-	and eax, 2047
-	cvtdq2pd xmm0, qword ptr [rsi + rax * 8]
-	addpd xmm0, xmm3
-	movaps xmm7, xmm0
-
-rx_i_31: ;FPADD
-	dec edi
-	jz rx_finish
-	xor r14, 0d352ce37h
-	mov eax, r14d
-	and eax, 2047
-	cvtdq2pd xmm0, qword ptr [rsi + rax * 8]
-	addpd xmm0, xmm3
-	movaps xmm6, xmm0
-	mov eax, r14d
-	xor eax, 01e2da792h
-	and eax, 2047
-	movhpd qword ptr [rsi + rax * 8], xmm6
-
-rx_i_32: ;XOR_64
-	dec edi
-	jz rx_finish
-	xor r12, 0a1f248dah
-	mov ecx, r12d
-	call rx_read_dataset_r
-	xor rax, -1936869641
-	mov r9, rax
-
-rx_i_33: ;MULH_64
-	dec edi
-	jz rx_finish
-	xor r9, 0554720fch
-	mov eax, r9d
-	and eax, 2047
-	mov rax, qword ptr [rsi + rax * 8]
-	mov rcx, r15
-	mul rcx
-	mov rax, rdx
-	mov r12, rax
-
-rx_i_34: ;CALL
-	dec edi
-	jz rx_finish
-	xor r13, 0665e91f1h
-	mov eax, r13d
-	and eax, 2047
-	mov rax, qword ptr [rsi + rax * 8]
-	cmp r14d, -380224718
-	js short taken_call_34
-	mov r15, rax
-	jmp rx_i_35
-taken_call_34:
-	push rax
-	call rx_i_108
-
-rx_i_35: ;RET
-	dec edi
-	jz rx_finish
-	xor r15, 05ef1be79h
-	mov eax, r15d
-	and eax, 32767
-	mov rax, qword ptr [rsi + rax * 8]
-	cmp rsp, rbp
-	je short not_taken_ret_35
-	xor rax, qword ptr [rsp + 8]
-	mov r8, rax
-	ret 8
-not_taken_ret_35:
-	mov r8, rax
-
-rx_i_36: ;FPMUL
-	dec edi
-	jz rx_finish
-	xor r8, 012ec7e3ah
+	and eax, 16376
+	add r11, qword ptr [rsi+rax]
+	; ISDIV_C r4, -692911499
+	mov rax, -893288710803585809
+	imul r12
+	xor eax, eax
+	sar rdx, 25
+	sets al
+	add rdx, rax
+	add r12, rdx
+	; FMUL_R e0, a0
+	mulpd xmm4, xmm8
+	; FDIV_M e1, L1[r0]
 	mov eax, r8d
-	and eax, 2047
-	cvtdq2pd xmm0, qword ptr [rsi + rax * 8]
-	mulpd xmm0, xmm6
-	movaps xmm1, xmm0
-	cmpeqpd xmm1, xmm1
-	andps xmm0, xmm1
-	movaps xmm7, xmm0
-
-rx_i_37: ;FPMUL
-	dec edi
-	jz rx_finish
-	xor r12, 0d0706601h
-	mov eax, r12d
-	and eax, 32767
-	cvtdq2pd xmm0, qword ptr [rsi + rax * 8]
-	mulpd xmm0, xmm2
-	movaps xmm1, xmm0
-	cmpeqpd xmm1, xmm1
-	andps xmm0, xmm1
-	movaps xmm9, xmm0
+	and eax, 16376
+	cvtdq2pd xmm12, qword ptr [rsi+rax]
+	andps xmm12, xmm14
+	divpd xmm5, xmm12
+	maxpd xmm5, xmm13
+	; FMUL_R e0, a1
+	mulpd xmm4, xmm9
+	; COND_M r0, no(L1[r1], -540292380)
+	xor ecx, ecx
 	mov eax, r9d
-	xor eax, 0bca81c78h
-	and eax, 2047
-	movhpd qword ptr [rsi + rax * 8], xmm9
-
-rx_i_38: ;SUB_64
-	dec edi
-	jz rx_finish
-	xor r9, 064056913h
-	mov eax, r9d
-	and eax, 2047
-	mov rax, qword ptr [rsi + rax * 8]
-	sub rax, r14
-	mov r10, rax
-
-rx_i_39: ;ADD_32
-	dec edi
-	jz rx_finish
-	xor r14, 02c1f1eb0h
-	mov ecx, r14d
-	call rx_read_dataset_r
-	add eax, r14d
-	mov r14, rax
-
-rx_i_40: ;RET
-	dec edi
-	jz rx_finish
-	xor r10, 068fd9009h
-	mov eax, r10d
-	and eax, 2047
-	mov rax, qword ptr [rsi + rax * 8]
-	cmp rsp, rbp
-	je short not_taken_ret_40
-	xor rax, qword ptr [rsp + 8]
-	mov rcx, rax
-	mov eax, r9d
-	xor eax, 0b2a27eceh
-	and eax, 2047
-	mov qword ptr [rsi + rax * 8], rcx
-	ret 8
-not_taken_ret_40:
-	mov rcx, rax
-	mov eax, r9d
-	xor eax, 0b2a27eceh
-	and eax, 2047
-	mov qword ptr [rsi + rax * 8], rcx
-
-rx_i_41: ;CALL
-	dec edi
-	jz rx_finish
-	xor r9, 037a30933h
-	mov eax, r9d
-	and eax, 32767
-	mov rax, qword ptr [rsi + rax * 8]
-	cmp r14d, -1070581824
-	jo short taken_call_41
-	mov r9, rax
-	jmp rx_i_42
-taken_call_41:
-	push rax
-	call rx_i_127
-
-rx_i_42: ;FPSUB
-	dec edi
-	jz rx_finish
-	xor r15, 0bc1de9f6h
+	and eax, 16376
+	cmp dword ptr [rsi+rax], -540292380
+	setno cl
+	add r8, rcx
+	; FSUB_R f1, a1
+	subpd xmm1, xmm9
+	; IADD_RC r0, r2, 310371682
+	lea r8, [r8+r10+310371682]
+	; COND_R r3, lt(r0, -1067603143)
+	xor ecx, ecx
+	cmp r8d, -1067603143
+	setl cl
+	add r11, rcx
+	; FMUL_R e0, a0
+	mulpd xmm4, xmm8
+	; FADD_R f0, a3
+	addpd xmm0, xmm11
+	; COND_R r4, sg(r3, -389806289)
+	xor ecx, ecx
+	cmp r11d, -389806289
+	sets cl
+	add r12, rcx
+	; FMUL_R e0, a3
+	mulpd xmm4, xmm11
+	; ISTORE L2[r7], r4
 	mov eax, r15d
-	and eax, 32767
-	cvtdq2pd xmm0, qword ptr [rsi + rax * 8]
-	subpd xmm0, xmm6
-	movaps xmm6, xmm0
-
-rx_i_43: ;SUB_64
-	dec edi
-	jz rx_finish
-	xor r12, 02b2a2eech
-	mov ecx, r12d
-	call rx_read_dataset_r
-	sub rax, 1693705407
-	mov rcx, rax
-	mov eax, r11d
-	xor eax, 064f3e4bfh
-	and eax, 2047
-	mov qword ptr [rsi + rax * 8], rcx
-
-rx_i_44: ;ROL_64
-	dec edi
-	jz rx_finish
-	xor r11, 0685817abh
-	mov ecx, r11d
-	call rx_read_dataset_r
-	mov rcx, r9
-	rol rax, cl
-	mov r15, rax
-
-rx_i_45: ;FPSUB
-	dec edi
-	jz rx_finish
-	xor r12, 08cd244ebh
-	mov eax, r12d
-	and eax, 2047
-	cvtdq2pd xmm0, qword ptr [rsi + rax * 8]
-	subpd xmm0, xmm2
-	movaps xmm5, xmm0
-
-rx_i_46: ;ADD_64
-	dec edi
-	jz rx_finish
-	xor r8, 06d8f4254h
-	mov ecx, r8d
-	call rx_read_dataset_r
-	add rax, r9
-	mov rcx, rax
+	and eax, 262136
+	mov qword ptr [rsi+rax], r12
+	; IADD_RC r4, r2, 1888908452
+	lea r12, [r12+r10+1888908452]
+	; IADD_R r1, r2
+	add r9, r10
+	; IXOR_R r6, r5
+	xor r14, r13
+	; IADD_M r7, L1[r0]
 	mov eax, r8d
-	xor eax, 0e9f58436h
-	and eax, 2047
-	mov qword ptr [rsi + rax * 8], rcx
-
-rx_i_47: ;CALL
-	dec edi
-	jz rx_finish
-	xor r12, 05ba232c6h
-	mov ecx, r12d
-	call rx_read_dataset_r
-	cmp r10d, 119251505
-	jbe short taken_call_47
-	mov rcx, rax
-	mov eax, r13d
-	xor eax, 071ba231h
-	and eax, 32767
-	mov qword ptr [rsi + rax * 8], rcx
-	jmp rx_i_48
-taken_call_47:
-	push rax
-	call rx_i_131
-
-rx_i_48: ;FPSQRT
-	dec edi
-	jz rx_finish
-	xor r8, 0aaed618fh
-	mov eax, r8d
-	and eax, 2047
-	cvtdq2pd xmm0, qword ptr [rsi + rax * 8]
-	andps xmm0, xmm10
-	sqrtpd xmm0, xmm0
-	movaps xmm9, xmm0
-	mov eax, r9d
-	xor eax, 020e5d9e9h
-	and eax, 2047
-	movhpd qword ptr [rsi + rax * 8], xmm9
-
-rx_i_49: ;FPMUL
-	dec edi
-	jz rx_finish
-	xor r8, 0f96c6a45h
-	mov eax, r8d
-	and eax, 2047
-	cvtdq2pd xmm0, qword ptr [rsi + rax * 8]
-	mulpd xmm0, xmm3
-	movaps xmm1, xmm0
-	cmpeqpd xmm1, xmm1
-	andps xmm0, xmm1
-	movaps xmm5, xmm0
-
-rx_i_50: ;OR_32
-	dec edi
-	jz rx_finish
-	xor r9, 0da3e4842h
-	mov eax, r9d
-	and eax, 32767
-	mov rax, qword ptr [rsi + rax * 8]
-	or eax, r10d
-	mov rcx, rax
-	mov eax, r15d
-	xor eax, 06ac56a2ah
-	and eax, 32767
-	mov qword ptr [rsi + rax * 8], rcx
-
-rx_i_51: ;SUB_64
-	dec edi
-	jz rx_finish
-	xor r10, 0302b676ah
-	mov ecx, r10d
-	call rx_read_dataset_r
-	sub rax, 419241919
-	mov r15, rax
-
-rx_i_52: ;CALL
-	dec edi
-	jz rx_finish
-	xor r11, 0fa88f48bh
-	mov ecx, r11d
-	call rx_read_dataset_r
-	cmp r13d, -534426193
-	js short taken_call_52
-	mov rcx, rax
-	mov eax, r15d
-	xor eax, 0e0254dafh
-	and eax, 2047
-	mov qword ptr [rsi + rax * 8], rcx
-	jmp rx_i_53
-taken_call_52:
-	push rax
-	call rx_i_94
-
-rx_i_53: ;RET
-	dec edi
-	jz rx_finish
-	xor r13, 03dff9b9eh
-	mov eax, r13d
-	and eax, 2047
-	mov rax, qword ptr [rsi + rax * 8]
-	cmp rsp, rbp
-	je short not_taken_ret_53
-	xor rax, qword ptr [rsp + 8]
-	mov r13, rax
-	ret 8
-not_taken_ret_53:
-	mov r13, rax
-
-rx_i_54: ;IMULH_64
-	dec edi
-	jz rx_finish
-	xor r11, 060638de0h
-	mov eax, r11d
-	and eax, 2047
-	mov rax, qword ptr [rsi + rax * 8]
-	mov rcx, 282209221
-	imul rcx
-	mov rax, rdx
-	mov rcx, rax
-	mov eax, r12d
-	xor eax, 010d22bc5h
-	and eax, 2047
-	mov qword ptr [rsi + rax * 8], rcx
-
-rx_i_55: ;FPMUL
-	dec edi
-	jz rx_finish
-	xor r10, 0dda983d4h
-	mov eax, r10d
-	and eax, 2047
-	cvtdq2pd xmm0, qword ptr [rsi + rax * 8]
-	mulpd xmm0, xmm5
-	movaps xmm1, xmm0
-	cmpeqpd xmm1, xmm1
-	andps xmm0, xmm1
-	movaps xmm3, xmm0
-	mov eax, r11d
-	xor eax, 07c79cddh
-	and eax, 2047
-	movlpd qword ptr [rsi + rax * 8], xmm3
-
-rx_i_56: ;AND_64
-	dec edi
-	jz rx_finish
-	xor r14, 0f1456b8eh
-	mov eax, r14d
-	and eax, 32767
-	mov rax, qword ptr [rsi + rax * 8]
-	and rax, r15
-	mov rcx, rax
-	mov eax, r8d
-	xor eax, 0fcf95491h
-	and eax, 32767
-	mov qword ptr [rsi + rax * 8], rcx
-
-rx_i_57: ;MUL_64
-	dec edi
-	jz rx_finish
-	xor r9, 010dc4571h
-	mov eax, r9d
-	and eax, 2047
-	mov rax, qword ptr [rsi + rax * 8]
-	imul rax, r14
-	mov rcx, rax
-	mov eax, r15d
-	xor eax, 0a426387h
-	and eax, 2047
-	mov qword ptr [rsi + rax * 8], rcx
-
-rx_i_58: ;IDIV_64
-	dec edi
-	jz rx_finish
-	xor r14, 0bcec0ebah
-	mov eax, r14d
-	and eax, 2047
-	mov rax, qword ptr [rsi + rax * 8]
-	mov edx, r13d
-	cmp edx, -1
-	jne short safe_idiv_58
-	mov rcx, rax
-	rol rcx, 1
-	dec rcx
-	jz short result_idiv_58
-safe_idiv_58:
-	mov ecx, 1
-	test edx, edx
-	cmovne ecx, edx
-	movsxd rcx, ecx
-	cqo
-	idiv rcx
-result_idiv_58:
-	mov r8, rax
-
-rx_i_59: ;FPSUB
-	dec edi
-	jz rx_finish
-	xor r11, 0980dd402h
-	mov eax, r11d
-	and eax, 2047
-	cvtdq2pd xmm0, qword ptr [rsi + rax * 8]
-	subpd xmm0, xmm8
-	movaps xmm7, xmm0
-
-rx_i_60: ;RET
-	dec edi
-	jz rx_finish
-	xor r15, 03de14d1eh
-	mov ecx, r15d
-	call rx_read_dataset_r
-	cmp rsp, rbp
-	je short not_taken_ret_60
-	xor rax, qword ptr [rsp + 8]
-	mov rcx, rax
-	mov eax, r11d
-	xor eax, 07bb60f45h
-	and eax, 2047
-	mov qword ptr [rsi + rax * 8], rcx
-	ret 8
-not_taken_ret_60:
-	mov rcx, rax
-	mov eax, r11d
-	xor eax, 07bb60f45h
-	and eax, 2047
-	mov qword ptr [rsi + rax * 8], rcx
-
-rx_i_61: ;CALL
-	dec edi
-	jz rx_finish
-	xor r13, 05058ce64h
-	mov eax, r13d
-	and eax, 2047
-	mov rax, qword ptr [rsi + rax * 8]
-	cmp r15d, 1933164545
-	jns short taken_call_61
-	mov r11, rax
-	jmp rx_i_62
-taken_call_61:
-	push rax
-	call rx_i_120
-
-rx_i_62: ;FPMUL
-	dec edi
-	jz rx_finish
-	xor r15, 0c3089414h
-	mov ecx, r15d
-	call rx_read_dataset_f
-	mulpd xmm0, xmm8
-	movaps xmm1, xmm0
-	cmpeqpd xmm1, xmm1
-	andps xmm0, xmm1
-	movaps xmm2, xmm0
-	mov eax, r10d
-	xor eax, 05c4789e3h
-	and eax, 2047
-	movlpd qword ptr [rsi + rax * 8], xmm2
-
-rx_i_63: ;FPMUL
-	dec edi
-	jz rx_finish
-	xor r9, 065cf272eh
-	mov eax, r9d
-	and eax, 2047
-	cvtdq2pd xmm0, qword ptr [rsi + rax * 8]
-	mulpd xmm0, xmm7
-	movaps xmm1, xmm0
-	cmpeqpd xmm1, xmm1
-	andps xmm0, xmm1
-	movaps xmm8, xmm0
-
-rx_i_64: ;SUB_64
-	dec edi
-	jz rx_finish
-	xor r13, 0ae54dfbfh
-	mov ecx, r13d
-	call rx_read_dataset_r
-	sub rax, r15
-	mov r9, rax
-
-rx_i_65: ;CALL
-	dec edi
-	jz rx_finish
-	xor r13, 07b366ce6h
-	mov ecx, r13d
-	call rx_read_dataset_r
-	cmp r8d, 1498056607
-	js short taken_call_65
-	mov r11, rax
-	jmp rx_i_66
-taken_call_65:
-	push rax
-	call rx_i_129
-
-rx_i_66: ;FPSQRT
-	dec edi
-	jz rx_finish
-	xor r15, 015a1b689h
-	mov ecx, r15d
-	call rx_read_dataset_f
-	andps xmm0, xmm10
-	sqrtpd xmm0, xmm0
-	movaps xmm9, xmm0
-	mov eax, r9d
-	xor eax, 07305e78h
-	and eax, 2047
-	movlpd qword ptr [rsi + rax * 8], xmm9
-
-rx_i_67: ;CALL
-	dec edi
-	jz rx_finish
-	xor r14, 088393ba0h
-	mov ecx, r14d
-	call rx_read_dataset_r
-	cmp r13d, 2031541081
-	jns short taken_call_67
-	mov r9, rax
-	jmp rx_i_68
-taken_call_67:
-	push rax
-	call rx_i_79
-
-rx_i_68: ;FPSUB
-	dec edi
-	jz rx_finish
-	xor r13, 03aa5c3a4h
-	mov ecx, r13d
-	call rx_read_dataset_f
-	subpd xmm0, xmm2
-	movaps xmm4, xmm0
-	mov eax, r12d
-	xor eax, 03c51ef39h
-	and eax, 2047
-	movhpd qword ptr [rsi + rax * 8], xmm4
-
-rx_i_69: ;FPADD
-	dec edi
-	jz rx_finish
-	xor r15, 0376c9c27h
-	mov ecx, r15d
-	call rx_read_dataset_f
-	addpd xmm0, xmm5
-	movaps xmm8, xmm0
-
-rx_i_70: ;MULH_64
-	dec edi
-	jz rx_finish
-	xor r8, 0bbbec3fah
-	mov eax, r8d
-	and eax, 2047
-	mov rax, qword ptr [rsi + rax * 8]
-	mov rcx, r9
-	mul rcx
-	mov rax, rdx
-	mov r13, rax
-
-rx_i_71: ;FPMUL
-	dec edi
-	jz rx_finish
-	xor r14, 0e9efb350h
-	mov eax, r14d
-	and eax, 2047
-	cvtdq2pd xmm0, qword ptr [rsi + rax * 8]
-	mulpd xmm0, xmm5
-	movaps xmm1, xmm0
-	cmpeqpd xmm1, xmm1
-	andps xmm0, xmm1
-	movaps xmm7, xmm0
-
-rx_i_72: ;CALL
-	dec edi
-	jz rx_finish
-	xor r13, 0f4e51e28h
-	mov ecx, r13d
-	call rx_read_dataset_r
-	cmp r9d, -631091751
-	jno short taken_call_72
-	mov rcx, rax
-	mov eax, r11d
-	xor eax, 0da624dd9h
-	and eax, 2047
-	mov qword ptr [rsi + rax * 8], rcx
-	jmp rx_i_73
-taken_call_72:
-	push rax
-	call rx_i_191
-
-rx_i_73: ;FPROUND
-	dec edi
-	jz rx_finish
-	xor r12, 0c24ddbd4h
-	mov ecx, r12d
-	call rx_read_dataset_r
-	mov rcx, rax
-	shl eax, 13
-	and rcx, -2048
-	and eax, 24576
-	cvtsi2sd xmm2, rcx
-	or eax, 40896
-	mov dword ptr [rsp - 8], eax
-	ldmxcsr dword ptr [rsp - 8]
-
-rx_i_74: ;MUL_64
-	dec edi
-	jz rx_finish
-	xor r8, 04c4b0c7fh
-	mov ecx, r8d
-	call rx_read_dataset_r
-	imul rax, rax, -1431647438
-	mov rcx, rax
-	mov eax, r9d
-	xor eax, 0aaaacb32h
-	and eax, 2047
-	mov qword ptr [rsi + rax * 8], rcx
-
-rx_i_75: ;RET
-	dec edi
-	jz rx_finish
-	xor r14, 03bcc02e3h
-	mov eax, r14d
-	and eax, 2047
-	mov rax, qword ptr [rsi + rax * 8]
-	cmp rsp, rbp
-	je short not_taken_ret_75
-	xor rax, qword ptr [rsp + 8]
-	mov r13, rax
-	ret 8
-not_taken_ret_75:
-	mov r13, rax
-
-rx_i_76: ;FPADD
-	dec edi
-	jz rx_finish
-	xor r11, 04b0ff63eh
-	mov eax, r11d
-	and eax, 32767
-	cvtdq2pd xmm0, qword ptr [rsi + rax * 8]
-	addpd xmm0, xmm3
-	movaps xmm7, xmm0
-	mov eax, r15d
-	xor eax, 083bc0396h
-	and eax, 32767
-	movlpd qword ptr [rsi + rax * 8], xmm7
-
-rx_i_77: ;RET
-	dec edi
-	jz rx_finish
-	xor r14, 0b956b3e8h
-	mov eax, r14d
-	and eax, 2047
-	mov rax, qword ptr [rsi + rax * 8]
-	cmp rsp, rbp
-	je short not_taken_ret_77
-	xor rax, qword ptr [rsp + 8]
-	mov rcx, rax
-	mov eax, r11d
-	xor eax, 03a92bc7ah
-	and eax, 2047
-	mov qword ptr [rsi + rax * 8], rcx
-	ret 8
-not_taken_ret_77:
-	mov rcx, rax
-	mov eax, r11d
-	xor eax, 03a92bc7ah
-	and eax, 2047
-	mov qword ptr [rsi + rax * 8], rcx
-
-rx_i_78: ;MUL_32
-	dec edi
-	jz rx_finish
-	xor r9, 0edeca680h
-	mov eax, r9d
-	and eax, 2047
-	mov rax, qword ptr [rsi + rax * 8]
-	mov ecx, eax
-	mov eax, r8d
-	imul rax, rcx
-	mov r15, rax
-
-rx_i_79: ;RET
-	dec edi
-	jz rx_finish
-	xor r11, 0fbdddcb5h
-	mov eax, r11d
-	and eax, 32767
-	mov rax, qword ptr [rsi + rax * 8]
-	cmp rsp, rbp
-	je short not_taken_ret_79
-	xor rax, qword ptr [rsp + 8]
-	mov rcx, rax
-	mov eax, r11d
-	xor eax, 06b4a7b43h
-	and eax, 2047
-	mov qword ptr [rsi + rax * 8], rcx
-	ret 8
-not_taken_ret_79:
-	mov rcx, rax
-	mov eax, r11d
-	xor eax, 06b4a7b43h
-	and eax, 2047
-	mov qword ptr [rsi + rax * 8], rcx
-
-rx_i_80: ;FPADD
-	dec edi
-	jz rx_finish
-	xor r13, 09cec97a1h
-	mov eax, r13d
-	and eax, 2047
-	cvtdq2pd xmm0, qword ptr [rsi + rax * 8]
-	addpd xmm0, xmm3
-	movaps xmm3, xmm0
-
-rx_i_81: ;OR_64
-	dec edi
-	jz rx_finish
-	xor r15, 078228167h
-	mov ecx, r15d
-	call rx_read_dataset_r
-	or rax, r13
-	mov r8, rax
-
-rx_i_82: ;CALL
-	dec edi
-	jz rx_finish
-	xor r11, 078cae1ffh
-	mov ecx, r11d
-	call rx_read_dataset_r
-	cmp r12d, -68969733
-	jo short taken_call_82
-	mov rcx, rax
-	mov eax, r10d
-	xor eax, 0fbe39afbh
-	and eax, 32767
-	mov qword ptr [rsi + rax * 8], rcx
-	jmp rx_i_83
-taken_call_82:
-	push rax
-	call rx_i_145
-
-rx_i_83: ;AND_64
-	dec edi
-	jz rx_finish
-	xor r10, 0d9b6a533h
-	mov eax, r10d
-	and eax, 32767
-	mov rax, qword ptr [rsi + rax * 8]
-	and rax, r10
-	mov r12, rax
-
-rx_i_84: ;ROR_64
-	dec edi
-	jz rx_finish
-	xor r15, 0e9e75336h
-	mov ecx, r15d
-	call rx_read_dataset_r
-	mov rcx, r10
-	ror rax, cl
-	mov rcx, rax
-	mov eax, r13d
-	xor eax, 0ec5c52e6h
-	and eax, 32767
-	mov qword ptr [rsi + rax * 8], rcx
-
-rx_i_85: ;MUL_64
-	dec edi
-	jz rx_finish
-	xor r13, 04c0d378ah
-	mov eax, r13d
-	and eax, 2047
-	mov rax, qword ptr [rsi + rax * 8]
-	imul rax, r8
-	mov r10, rax
-
-rx_i_86: ;OR_64
-	dec edi
-	jz rx_finish
-	xor r11, 04386e368h
-	mov ecx, r11d
-	call rx_read_dataset_r
-	or rax, r8
-	mov rcx, rax
-	mov eax, r12d
-	xor eax, 0a90410e4h
-	and eax, 2047
-	mov qword ptr [rsi + rax * 8], rcx
-
-rx_i_87: ;SUB_64
-	dec edi
-	jz rx_finish
-	xor r9, 0d75a0ecfh
-	mov eax, r9d
-	and eax, 2047
-	mov rax, qword ptr [rsi + rax * 8]
-	sub rax, r12
-	mov r8, rax
-
-rx_i_88: ;FPADD
-	dec edi
-	jz rx_finish
-	xor r9, 031bb7f7ah
-	mov eax, r9d
-	and eax, 2047
-	cvtdq2pd xmm0, qword ptr [rsi + rax * 8]
-	addpd xmm0, xmm6
-	movaps xmm9, xmm0
-	mov eax, r9d
-	xor eax, 0c149906eh
-	and eax, 2047
-	movlpd qword ptr [rsi + rax * 8], xmm9
-
-rx_i_89: ;MUL_64
-	dec edi
-	jz rx_finish
-	xor r9, 03b45ecebh
-	mov ecx, r9d
-	call rx_read_dataset_r
-	imul rax, r8
-	mov rcx, rax
-	mov eax, r10d
-	xor eax, 0e67532afh
-	and eax, 32767
-	mov qword ptr [rsi + rax * 8], rcx
-
-rx_i_90: ;FPADD
-	dec edi
-	jz rx_finish
-	xor r12, 0ee08e76bh
-	mov eax, r12d
-	and eax, 32767
-	cvtdq2pd xmm0, qword ptr [rsi + rax * 8]
-	addpd xmm0, xmm3
-	movaps xmm6, xmm0
-
-rx_i_91: ;FPMUL
-	dec edi
-	jz rx_finish
-	xor r9, 042e28e94h
-	mov ecx, r9d
-	call rx_read_dataset_f
-	mulpd xmm0, xmm2
-	movaps xmm1, xmm0
-	cmpeqpd xmm1, xmm1
-	andps xmm0, xmm1
-	movaps xmm4, xmm0
-
-rx_i_92: ;CALL
-	dec edi
-	jz rx_finish
-	xor r8, 0729260e1h
-	mov ecx, r8d
-	call rx_read_dataset_r
-	cmp r14d, 1288893603
-	jge short taken_call_92
-	mov r12, rax
-	jmp rx_i_93
-taken_call_92:
-	push rax
-	call rx_i_170
-
-rx_i_93: ;FPADD
-	dec edi
-	jz rx_finish
-	xor r8, 0bfcebaf4h
-	mov eax, r8d
-	and eax, 2047
-	cvtdq2pd xmm0, qword ptr [rsi + rax * 8]
-	addpd xmm0, xmm2
-	movaps xmm2, xmm0
-	mov eax, r10d
-	xor eax, 07e48a0d8h
-	and eax, 2047
-	movlpd qword ptr [rsi + rax * 8], xmm2
-
-rx_i_94: ;RET
-	dec edi
-	jz rx_finish
-	xor r13, 0ea326630h
-	mov eax, r13d
-	and eax, 32767
-	mov rax, qword ptr [rsi + rax * 8]
-	cmp rsp, rbp
-	je short not_taken_ret_94
-	xor rax, qword ptr [rsp + 8]
-	mov r8, rax
-	ret 8
-not_taken_ret_94:
-	mov r8, rax
-
-rx_i_95: ;MUL_64
-	dec edi
-	jz rx_finish
-	xor r13, 0b5451a2dh
-	mov ecx, r13d
-	call rx_read_dataset_r
-	imul rax, r10
-	mov rcx, rax
-	mov eax, r15d
-	xor eax, 01023aa04h
-	and eax, 32767
-	mov qword ptr [rsi + rax * 8], rcx
-
-rx_i_96: ;IMUL_32
-	dec edi
-	jz rx_finish
-	xor r11, 04f912ef8h
-	mov ecx, r11d
-	call rx_read_dataset_r
-	movsxd rcx, eax
-	mov rax, -1354397081
-	imul rax, rcx
-	mov r11, rax
-
-rx_i_97: ;FPSQRT
-	dec edi
-	jz rx_finish
-	xor r15, 0acc45b3bh
-	mov ecx, r15d
-	call rx_read_dataset_f
-	andps xmm0, xmm10
-	sqrtpd xmm0, xmm0
-	movaps xmm5, xmm0
-	mov eax, r13d
-	xor eax, 0c477e850h
-	and eax, 2047
-	movlpd qword ptr [rsi + rax * 8], xmm5
-
-rx_i_98: ;SUB_64
-	dec edi
-	jz rx_finish
-	xor r14, 09900a4e8h
-	mov ecx, r14d
-	call rx_read_dataset_r
-	sub rax, r15
-	mov r14, rax
-
-rx_i_99: ;FPDIV
-	dec edi
-	jz rx_finish
-	xor r9, 0841b2984h
-	mov ecx, r9d
-	call rx_read_dataset_f
-	divpd xmm0, xmm6
-	movaps xmm1, xmm0
-	cmpeqpd xmm1, xmm1
-	andps xmm0, xmm1
-	movaps xmm4, xmm0
-	mov eax, r12d
-	xor eax, 04c21df83h
-	and eax, 2047
-	movlpd qword ptr [rsi + rax * 8], xmm4
-
-rx_i_100: ;ADD_64
-	dec edi
-	jz rx_finish
-	xor r15, 07ebea48fh
-	mov ecx, r15d
-	call rx_read_dataset_r
-	add rax, r9
-	mov r14, rax
-
-rx_i_101: ;SUB_64
-	dec edi
-	jz rx_finish
-	xor r10, 0631209d3h
-	mov eax, r10d
-	and eax, 2047
-	mov rax, qword ptr [rsi + rax * 8]
-	sub rax, r8
-	mov r11, rax
-
-rx_i_102: ;FPDIV
-	dec edi
-	jz rx_finish
-	xor r10, 0e50bf07ah
-	mov eax, r10d
-	and eax, 2047
-	cvtdq2pd xmm0, qword ptr [rsi + rax * 8]
-	divpd xmm0, xmm3
-	movaps xmm1, xmm0
-	cmpeqpd xmm1, xmm1
-	andps xmm0, xmm1
-	movaps xmm7, xmm0
-
-rx_i_103: ;MUL_64
-	dec edi
-	jz rx_finish
-	xor r10, 02b7096f1h
-	mov eax, r10d
-	and eax, 32767
-	mov rax, qword ptr [rsi + rax * 8]
-	imul rax, r13
-	mov rcx, rax
-	mov eax, r15d
-	xor eax, 0e4dd92b6h
-	and eax, 2047
-	mov qword ptr [rsi + rax * 8], rcx
-
-rx_i_104: ;IMULH_64
-	dec edi
-	jz rx_finish
-	xor r11, 075deaf71h
-	mov eax, r11d
-	and eax, 32767
-	mov rax, qword ptr [rsi + rax * 8]
-	mov rcx, -1913070089
-	imul rcx
-	mov rax, rdx
-	mov rcx, rax
-	mov eax, r15d
-	xor eax, 08df8ddf7h
-	and eax, 2047
-	mov qword ptr [rsi + rax * 8], rcx
-
-rx_i_105: ;MUL_32
-	dec edi
-	jz rx_finish
-	xor r13, 036a51f72h
-	mov ecx, r13d
-	call rx_read_dataset_r
-	mov ecx, eax
-	mov eax, r15d
-	imul rax, rcx
-	mov rcx, rax
-	mov eax, r14d
-	xor eax, 09c8724edh
-	and eax, 32767
-	mov qword ptr [rsi + rax * 8], rcx
-
-rx_i_106: ;FPMUL
-	dec edi
-	jz rx_finish
-	xor r11, 07b512986h
-	mov ecx, r11d
-	call rx_read_dataset_f
-	mulpd xmm0, xmm3
-	movaps xmm1, xmm0
-	cmpeqpd xmm1, xmm1
-	andps xmm0, xmm1
-	movaps xmm4, xmm0
-	mov eax, r12d
-	xor eax, 03cb2505h
-	and eax, 2047
-	movhpd qword ptr [rsi + rax * 8], xmm4
-
-rx_i_107: ;CALL
-	dec edi
-	jz rx_finish
-	xor r12, 0f1d2e50h
-	mov eax, r12d
-	and eax, 2047
-	mov rax, qword ptr [rsi + rax * 8]
-	cmp r11d, 1917037441
-	jl short taken_call_107
-	mov rcx, rax
-	mov eax, r14d
-	xor eax, 07243ab81h
-	and eax, 2047
-	mov qword ptr [rsi + rax * 8], rcx
-	jmp rx_i_108
-taken_call_107:
-	push rax
-	call rx_i_143
-
-rx_i_108: ;FPDIV
-	dec edi
-	jz rx_finish
-	xor r9, 07327ba60h
-	mov ecx, r9d
-	call rx_read_dataset_f
-	divpd xmm0, xmm5
-	movaps xmm1, xmm0
-	cmpeqpd xmm1, xmm1
-	andps xmm0, xmm1
-	movaps xmm9, xmm0
-	mov eax, r9d
-	xor eax, 0678b65beh
-	and eax, 32767
-	movlpd qword ptr [rsi + rax * 8], xmm9
-
-rx_i_109: ;FPADD
-	dec edi
-	jz rx_finish
-	xor r15, 0594e37deh
-	mov eax, r15d
-	and eax, 2047
-	cvtdq2pd xmm0, qword ptr [rsi + rax * 8]
-	addpd xmm0, xmm2
-	movaps xmm3, xmm0
-
-rx_i_110: ;ROL_64
-	dec edi
-	jz rx_finish
-	xor r9, 04cdf5ebah
-	mov ecx, r9d
-	call rx_read_dataset_r
-	mov rcx, r9
-	rol rax, cl
-	mov rcx, rax
-	mov eax, r14d
-	xor eax, 0ec68532fh
-	and eax, 2047
-	mov qword ptr [rsi + rax * 8], rcx
-
-rx_i_111: ;RET
-	dec edi
-	jz rx_finish
-	xor r8, 02e16c97ch
-	mov ecx, r8d
-	call rx_read_dataset_r
-	cmp rsp, rbp
-	je short not_taken_ret_111
-	xor rax, qword ptr [rsp + 8]
-	mov rcx, rax
-	mov eax, r12d
-	xor eax, 05d237d0bh
-	and eax, 32767
-	mov qword ptr [rsi + rax * 8], rcx
-	ret 8
-not_taken_ret_111:
-	mov rcx, rax
-	mov eax, r12d
-	xor eax, 05d237d0bh
-	and eax, 32767
-	mov qword ptr [rsi + rax * 8], rcx
-
-rx_i_112: ;SUB_64
-	dec edi
-	jz rx_finish
-	xor r12, 0d42ddbd4h
-	mov eax, r12d
-	and eax, 2047
-	mov rax, qword ptr [rsi + rax * 8]
-	sub rax, r13
-	mov rcx, rax
-	mov eax, r14d
-	xor eax, 0c2d8d431h
-	and eax, 2047
-	mov qword ptr [rsi + rax * 8], rcx
-
-rx_i_113: ;MULH_64
-	dec edi
-	jz rx_finish
-	xor r10, 07a4f8cbbh
-	mov ecx, r10d
-	call rx_read_dataset_r
-	mov rcx, r9
-	mul rcx
-	mov rax, rdx
-	mov r13, rax
-
-rx_i_114: ;IMULH_64
-	dec edi
-	jz rx_finish
-	xor r13, 06e83e2cdh
-	mov ecx, r13d
-	call rx_read_dataset_r
-	mov rcx, r15
-	imul rcx
-	mov rax, rdx
-	mov r14, rax
-
-rx_i_115: ;OR_64
-	dec edi
-	jz rx_finish
-	xor r14, 0336c980eh
-	mov ecx, r14d
-	call rx_read_dataset_r
-	or rax, r10
-	mov r14, rax
-
-rx_i_116: ;IMULH_64
-	dec edi
-	jz rx_finish
-	xor r10, 0d122702eh
-	mov ecx, r10d
-	call rx_read_dataset_r
-	mov rcx, -1850776691
-	imul rcx
-	mov rax, rdx
-	mov rcx, rax
-	mov eax, r8d
-	xor eax, 091af638dh
-	and eax, 32767
-	mov qword ptr [rsi + rax * 8], rcx
-
-rx_i_117: ;AND_64
-	dec edi
-	jz rx_finish
-	xor r11, 015f2012bh
-	mov eax, r11d
-	and eax, 2047
-	mov rax, qword ptr [rsi + rax * 8]
-	and rax, -1205826972
-	mov rcx, rax
-	mov eax, r15d
-	xor eax, 0b8208a64h
-	and eax, 32767
-	mov qword ptr [rsi + rax * 8], rcx
-
-rx_i_118: ;FPSUB
-	dec edi
-	jz rx_finish
-	xor r9, 037ddf43dh
-	mov eax, r9d
-	and eax, 2047
-	cvtdq2pd xmm0, qword ptr [rsi + rax * 8]
-	subpd xmm0, xmm5
-	movaps xmm6, xmm0
-
-rx_i_119: ;FPSUB
-	dec edi
-	jz rx_finish
-	xor r9, 0bba475f3h
-	mov eax, r9d
-	and eax, 2047
-	cvtdq2pd xmm0, qword ptr [rsi + rax * 8]
-	subpd xmm0, xmm3
-	movaps xmm5, xmm0
-
-rx_i_120: ;FPADD
-	dec edi
-	jz rx_finish
-	xor r12, 0e5561e3eh
-	mov eax, r12d
-	and eax, 32767
-	cvtdq2pd xmm0, qword ptr [rsi + rax * 8]
-	addpd xmm0, xmm4
-	movaps xmm8, xmm0
-
-rx_i_121: ;FPMUL
-	dec edi
-	jz rx_finish
-	xor r9, 03ab8f73h
-	mov ecx, r9d
-	call rx_read_dataset_f
-	mulpd xmm0, xmm5
-	movaps xmm1, xmm0
-	cmpeqpd xmm1, xmm1
-	andps xmm0, xmm1
-	movaps xmm8, xmm0
-
-rx_i_122: ;RET
-	dec edi
-	jz rx_finish
-	xor r10, 04e0dbd40h
-	mov ecx, r10d
-	call rx_read_dataset_r
-	cmp rsp, rbp
-	je short not_taken_ret_122
-	xor rax, qword ptr [rsp + 8]
-	mov rcx, rax
-	mov eax, r14d
-	xor eax, 078f6ec29h
-	and eax, 2047
-	mov qword ptr [rsi + rax * 8], rcx
-	ret 8
-not_taken_ret_122:
-	mov rcx, rax
-	mov eax, r14d
-	xor eax, 078f6ec29h
-	and eax, 2047
-	mov qword ptr [rsi + rax * 8], rcx
-
-rx_i_123: ;ADD_32
-	dec edi
-	jz rx_finish
-	xor r13, 073e9f58ah
-	mov eax, r13d
-	and eax, 2047
-	mov rax, qword ptr [rsi + rax * 8]
-	add eax, r15d
-	mov r13, rax
-
-rx_i_124: ;CALL
-	dec edi
-	jz rx_finish
-	xor r12, 0e3fa3670h
-	mov ecx, r12d
-	call rx_read_dataset_r
-	cmp r11d, 1719505436
-	jns short taken_call_124
-	mov rcx, rax
-	mov eax, r11d
-	xor eax, 0667d921ch
-	and eax, 2047
-	mov qword ptr [rsi + rax * 8], rcx
-	jmp rx_i_125
-taken_call_124:
-	push rax
-	call rx_i_237
-
-rx_i_125: ;MUL_32
-	dec edi
-	jz rx_finish
-	xor r8, 0ebec27cdh
-	mov ecx, r8d
-	call rx_read_dataset_r
-	mov ecx, eax
-	mov eax, r14d
-	imul rax, rcx
-	mov r14, rax
-
-rx_i_126: ;FPDIV
-	dec edi
-	jz rx_finish
-	xor r8, 01feb5264h
-	mov eax, r8d
-	and eax, 32767
-	cvtdq2pd xmm0, qword ptr [rsi + rax * 8]
-	divpd xmm0, xmm6
-	movaps xmm1, xmm0
-	cmpeqpd xmm1, xmm1
-	andps xmm0, xmm1
-	movaps xmm2, xmm0
-
-rx_i_127: ;IMUL_32
-	dec edi
-	jz rx_finish
-	xor r9, 0405f500fh
-	mov ecx, r9d
-	call rx_read_dataset_r
-	movsxd rcx, eax
-	movsxd rax, r10d
-	imul rax, rcx
-	mov r8, rax
-
-rx_i_128: ;MUL_64
-	dec edi
-	jz rx_finish
-	xor r13, 0459f1154h
-	mov ecx, r13d
-	call rx_read_dataset_r
-	imul rax, r9
-	mov r9, rax
-
-rx_i_129: ;CALL
-	dec edi
-	jz rx_finish
-	xor r9, 081918b4ch
-	mov eax, r9d
-	and eax, 2047
-	mov rax, qword ptr [rsi + rax * 8]
-	cmp r13d, -590624856
-	jge short taken_call_129
-	mov r9, rax
-	jmp rx_i_130
-taken_call_129:
-	push rax
-	call rx_i_154
-
-rx_i_130: ;OR_64
-	dec edi
-	jz rx_finish
-	xor r9, 077c3b332h
-	mov eax, r9d
-	and eax, 2047
-	mov rax, qword ptr [rsi + rax * 8]
-	or rax, -281794782
-	mov rcx, rax
-	mov eax, r11d
-	xor eax, 0ef342722h
-	and eax, 2047
-	mov qword ptr [rsi + rax * 8], rcx
-
-rx_i_131: ;RET
-	dec edi
-	jz rx_finish
-	xor r12, 05792310bh
-	mov eax, r12d
-	and eax, 2047
-	mov rax, qword ptr [rsi + rax * 8]
-	cmp rsp, rbp
-	je short not_taken_ret_131
-	xor rax, qword ptr [rsp + 8]
-	mov rcx, rax
-	mov eax, r15d
-	xor eax, 0dff06f75h
-	and eax, 2047
-	mov qword ptr [rsi + rax * 8], rcx
-	ret 8
-not_taken_ret_131:
-	mov rcx, rax
-	mov eax, r15d
-	xor eax, 0dff06f75h
-	and eax, 2047
-	mov qword ptr [rsi + rax * 8], rcx
-
-rx_i_132: ;FPADD
-	dec edi
-	jz rx_finish
-	xor r10, 0ebc6e10h
-	mov ecx, r10d
-	call rx_read_dataset_f
-	addpd xmm0, xmm6
-	movaps xmm7, xmm0
-
-rx_i_133: ;XOR_64
-	dec edi
-	jz rx_finish
-	xor r14, 0822f8b60h
-	mov ecx, r14d
-	call rx_read_dataset_r
-	xor rax, -1000526796
-	mov rcx, rax
-	mov eax, r15d
-	xor eax, 0c45d2c34h
-	and eax, 2047
-	mov qword ptr [rsi + rax * 8], rcx
-
-rx_i_134: ;ADD_64
-	dec edi
-	jz rx_finish
-	xor r10, 0d0f18593h
-	mov ecx, r10d
-	call rx_read_dataset_r
-	add rax, 1516102347
-	mov r13, rax
-
-rx_i_135: ;FPMUL
-	dec edi
-	jz rx_finish
-	xor r11, 088212ef9h
-	mov ecx, r11d
-	call rx_read_dataset_f
-	mulpd xmm0, xmm3
-	movaps xmm1, xmm0
-	cmpeqpd xmm1, xmm1
-	andps xmm0, xmm1
-	movaps xmm8, xmm0
-
-rx_i_136: ;FPSQRT
-	dec edi
-	jz rx_finish
-	xor r8, 01ae56e03h
-	mov ecx, r8d
-	call rx_read_dataset_f
-	andps xmm0, xmm10
-	sqrtpd xmm0, xmm0
-	movaps xmm5, xmm0
-	mov eax, r13d
-	xor eax, 0efd7799dh
-	and eax, 2047
-	movhpd qword ptr [rsi + rax * 8], xmm5
-
-rx_i_137: ;ROL_64
-	dec edi
-	jz rx_finish
-	xor r11, 015a24231h
-	mov eax, r11d
-	and eax, 32767
-	mov rax, qword ptr [rsi + rax * 8]
-	mov rcx, r9
-	rol rax, cl
-	mov r11, rax
-
-rx_i_138: ;RET
-	dec edi
-	jz rx_finish
-	xor r13, 02fd380c5h
-	mov ecx, r13d
-	call rx_read_dataset_r
-	cmp rsp, rbp
-	je short not_taken_ret_138
-	xor rax, qword ptr [rsp + 8]
-	mov r10, rax
-	ret 8
-not_taken_ret_138:
-	mov r10, rax
-
-rx_i_139: ;ADD_64
-	dec edi
-	jz rx_finish
-	xor r9, 093172470h
-	mov eax, r9d
-	and eax, 2047
-	mov rax, qword ptr [rsi + rax * 8]
-	add rax, 515364082
-	mov rcx, rax
-	mov eax, r11d
-	xor eax, 01eb7d4f2h
-	and eax, 2047
-	mov qword ptr [rsi + rax * 8], rcx
-
-rx_i_140: ;IMUL_32
-	dec edi
-	jz rx_finish
-	xor r14, 052543553h
-	mov eax, r14d
-	and eax, 2047
-	mov rax, qword ptr [rsi + rax * 8]
-	movsxd rcx, eax
-	movsxd rax, r11d
-	imul rax, rcx
-	mov r14, rax
-
-rx_i_141: ;FPADD
-	dec edi
-	jz rx_finish
-	xor r8, 02f636da1h
-	mov ecx, r8d
-	call rx_read_dataset_f
-	addpd xmm0, xmm2
-	movaps xmm9, xmm0
-	mov eax, r9d
-	xor eax, 099ff9ffdh
-	and eax, 2047
-	movlpd qword ptr [rsi + rax * 8], xmm9
-
-rx_i_142: ;CALL
-	dec edi
-	jz rx_finish
-	xor r11, 0b11a4f2ch
-	mov ecx, r11d
-	call rx_read_dataset_r
-	cmp r12d, 1365939282
-	js short taken_call_142
-	mov rcx, rax
-	mov eax, r10d
-	xor eax, 0516a9452h
-	and eax, 32767
-	mov qword ptr [rsi + rax * 8], rcx
-	jmp rx_i_143
-taken_call_142:
-	push rax
-	call rx_i_257
-
-rx_i_143: ;IMUL_32
-	dec edi
-	jz rx_finish
-	xor r15, 037f4b5d0h
-	mov eax, r15d
-	and eax, 2047
-	mov rax, qword ptr [rsi + rax * 8]
-	movsxd rcx, eax
-	movsxd rax, r11d
-	imul rax, rcx
-	mov r9, rax
-
-rx_i_144: ;IMULH_64
-	dec edi
-	jz rx_finish
-	xor r10, 02e59e00ah
-	mov eax, r10d
-	and eax, 2047
-	mov rax, qword ptr [rsi + rax * 8]
-	mov rcx, r11
-	imul rcx
-	mov rax, rdx
-	mov r15, rax
-
-rx_i_145: ;IMULH_64
-	dec edi
-	jz rx_finish
-	xor r13, 08d5c798h
-	mov ecx, r13d
-	call rx_read_dataset_r
-	mov rcx, r11
-	imul rcx
-	mov rax, rdx
-	mov rcx, rax
-	mov eax, r10d
-	xor eax, 0dd491985h
-	and eax, 2047
-	mov qword ptr [rsi + rax * 8], rcx
-
-rx_i_146: ;IMUL_32
-	dec edi
-	jz rx_finish
-	xor r13, 02327e6e2h
-	mov eax, r13d
-	and eax, 2047
-	mov rax, qword ptr [rsi + rax * 8]
-	movsxd rcx, eax
-	movsxd rax, r12d
-	imul rax, rcx
-	mov r10, rax
-
-rx_i_147: ;MULH_64
-	dec edi
-	jz rx_finish
-	xor r13, 03a7df043h
-	mov eax, r13d
-	and eax, 2047
-	mov rax, qword ptr [rsi + rax * 8]
-	mov rcx, 1784404616
-	mul rcx
-	mov rax, rdx
-	mov rcx, rax
-	mov eax, r12d
-	xor eax, 06a5bda88h
-	and eax, 2047
-	mov qword ptr [rsi + rax * 8], rcx
-
-rx_i_148: ;SUB_64
-	dec edi
-	jz rx_finish
-	xor r10, 0783e5c4eh
-	mov ecx, r10d
-	call rx_read_dataset_r
-	sub rax, r14
-	mov rcx, rax
-	mov eax, r10d
-	xor eax, 08c783d2ch
-	and eax, 32767
-	mov qword ptr [rsi + rax * 8], rcx
-
-rx_i_149: ;MUL_32
-	dec edi
-	jz rx_finish
-	xor r12, 0aa0f5b2fh
-	mov eax, r12d
-	and eax, 2047
-	mov rax, qword ptr [rsi + rax * 8]
-	mov ecx, eax
-	mov eax, r14d
-	imul rax, rcx
-	mov rcx, rax
-	mov eax, r8d
-	xor eax, 09046b787h
-	and eax, 2047
-	mov qword ptr [rsi + rax * 8], rcx
-
-rx_i_150: ;DIV_64
-	dec edi
-	jz rx_finish
-	xor r9, 01504ca7ah
-	mov eax, r9d
-	and eax, 2047
-	mov rax, qword ptr [rsi + rax * 8]
-	mov ecx, 1
-	mov edx, r8d
-	test edx, edx
-	cmovne ecx, edx
-	xor edx, edx
-	div rcx
-	mov rcx, rax
-	mov eax, r9d
-	xor eax, 0c854a524h
-	and eax, 2047
-	mov qword ptr [rsi + rax * 8], rcx
-
-rx_i_151: ;OR_32
-	dec edi
-	jz rx_finish
-	xor r9, 0ea72a7cfh
-	mov eax, r9d
-	and eax, 2047
-	mov rax, qword ptr [rsi + rax * 8]
-	or eax, r13d
-	mov rcx, rax
-	mov eax, r11d
-	xor eax, 087aed7f2h
-	and eax, 32767
-	mov qword ptr [rsi + rax * 8], rcx
-
-rx_i_152: ;ROR_64
-	dec edi
-	jz rx_finish
-	xor r13, 0ad0e7a88h
-	mov ecx, r13d
-	call rx_read_dataset_r
-	mov rcx, r10
-	ror rax, cl
-	mov r10, rax
-
-rx_i_153: ;FPDIV
-	dec edi
-	jz rx_finish
-	xor r15, 0fd95ab87h
-	mov ecx, r15d
-	call rx_read_dataset_f
-	divpd xmm0, xmm2
-	movaps xmm1, xmm0
-	cmpeqpd xmm1, xmm1
-	andps xmm0, xmm1
-	movaps xmm8, xmm0
-	mov eax, r8d
-	xor eax, 09111c981h
-	and eax, 2047
-	movlpd qword ptr [rsi + rax * 8], xmm8
-
-rx_i_154: ;MUL_32
-	dec edi
-	jz rx_finish
-	xor r10, 0256697b0h
-	mov eax, r10d
-	and eax, 2047
-	mov rax, qword ptr [rsi + rax * 8]
-	mov ecx, eax
-	mov eax, r13d
-	imul rax, rcx
-	mov r10, rax
-
-rx_i_155: ;ROR_64
-	dec edi
-	jz rx_finish
-	xor r11, 0d23f3b78h
-	mov ecx, r11d
-	call rx_read_dataset_r
-	mov rcx, r10
-	ror rax, cl
-	mov rcx, rax
-	mov eax, r13d
-	xor eax, 01c5d3ebeh
-	and eax, 2047
-	mov qword ptr [rsi + rax * 8], rcx
-
-rx_i_156: ;IMUL_32
-	dec edi
-	jz rx_finish
-	xor r10, 098917533h
-	mov ecx, r10d
-	call rx_read_dataset_r
-	movsxd rcx, eax
-	movsxd rax, r15d
-	imul rax, rcx
-	mov r15, rax
-
-rx_i_157: ;ADD_64
-	dec edi
-	jz rx_finish
-	xor r10, 0dfac3efch
-	mov ecx, r10d
-	call rx_read_dataset_r
-	add rax, r12
-	mov r14, rax
-
-rx_i_158: ;ADD_64
-	dec edi
-	jz rx_finish
-	xor r15, 0a64de090h
-	mov eax, r15d
-	and eax, 2047
-	mov rax, qword ptr [rsi + rax * 8]
-	add rax, 1233402159
-	mov r10, rax
-
-rx_i_159: ;RET
-	dec edi
-	jz rx_finish
-	xor r13, 0952a3abbh
-	mov ecx, r13d
-	call rx_read_dataset_r
-	cmp rsp, rbp
-	je short not_taken_ret_159
-	xor rax, qword ptr [rsp + 8]
-	mov rcx, rax
-	mov eax, r13d
-	xor eax, 0ff7d3697h
-	and eax, 2047
-	mov qword ptr [rsi + rax * 8], rcx
-	ret 8
-not_taken_ret_159:
-	mov rcx, rax
-	mov eax, r13d
-	xor eax, 0ff7d3697h
-	and eax, 2047
-	mov qword ptr [rsi + rax * 8], rcx
-
-rx_i_160: ;SUB_64
-	dec edi
-	jz rx_finish
-	xor r14, 0b1685b90h
-	mov ecx, r14d
-	call rx_read_dataset_r
-	sub rax, 1518778665
-	mov rcx, rax
-	mov eax, r10d
-	xor eax, 05a86b929h
-	and eax, 32767
-	mov qword ptr [rsi + rax * 8], rcx
-
-rx_i_161: ;OR_64
-	dec edi
-	jz rx_finish
-	xor r15, 0ea992531h
-	mov ecx, r15d
-	call rx_read_dataset_r
-	or rax, r14
-	mov r8, rax
-
-rx_i_162: ;SAR_64
-	dec edi
-	jz rx_finish
-	xor r9, 01fd57a4ah
-	mov eax, r9d
-	and eax, 2047
-	mov rax, qword ptr [rsi + rax * 8]
-	mov rcx, r10
-	sar rax, cl
-	mov r13, rax
-
-rx_i_163: ;SUB_64
-	dec edi
-	jz rx_finish
-	xor r12, 0e3486c0ah
-	mov ecx, r12d
-	call rx_read_dataset_r
-	sub rax, -2101130488
-	mov rcx, rax
-	mov eax, r14d
-	xor eax, 082c34b08h
-	and eax, 2047
-	mov qword ptr [rsi + rax * 8], rcx
-
-rx_i_164: ;MUL_32
-	dec edi
-	jz rx_finish
-	xor r12, 01f0c2737h
-	mov eax, r12d
-	and eax, 2047
-	mov rax, qword ptr [rsi + rax * 8]
-	mov ecx, eax
-	mov eax, r9d
-	imul rax, rcx
-	mov rcx, rax
-	mov eax, r13d
-	xor eax, 09aa6da19h
-	and eax, 2047
-	mov qword ptr [rsi + rax * 8], rcx
-
-rx_i_165: ;RET
-	dec edi
-	jz rx_finish
-	xor r12, 0debb493eh
-	mov eax, r12d
-	and eax, 2047
-	mov rax, qword ptr [rsi + rax * 8]
-	cmp rsp, rbp
-	je short not_taken_ret_165
-	xor rax, qword ptr [rsp + 8]
-	mov rcx, rax
-	mov eax, r9d
-	xor eax, 06450685ch
-	and eax, 32767
-	mov qword ptr [rsi + rax * 8], rcx
-	ret 8
-not_taken_ret_165:
-	mov rcx, rax
-	mov eax, r9d
-	xor eax, 06450685ch
-	and eax, 32767
-	mov qword ptr [rsi + rax * 8], rcx
-
-rx_i_166: ;ROL_64
-	dec edi
-	jz rx_finish
-	xor r9, 0fe684081h
-	mov eax, r9d
-	and eax, 2047
-	mov rax, qword ptr [rsi + rax * 8]
-	mov rcx, r8
-	rol rax, cl
-	mov rcx, rax
-	mov eax, r13d
-	xor eax, 0bb67f8abh
-	and eax, 32767
-	mov qword ptr [rsi + rax * 8], rcx
-
-rx_i_167: ;FPMUL
-	dec edi
-	jz rx_finish
-	xor r11, 0d10371ch
-	mov eax, r11d
-	and eax, 2047
-	cvtdq2pd xmm0, qword ptr [rsi + rax * 8]
-	mulpd xmm0, xmm4
-	movaps xmm1, xmm0
-	cmpeqpd xmm1, xmm1
-	andps xmm0, xmm1
-	movaps xmm2, xmm0
-	mov eax, r10d
-	xor eax, 02a58510fh
-	and eax, 2047
-	movlpd qword ptr [rsi + rax * 8], xmm2
-
-rx_i_168: ;FPSQRT
-	dec edi
-	jz rx_finish
-	xor r12, 071b15effh
-	mov eax, r12d
-	and eax, 2047
-	cvtdq2pd xmm0, qword ptr [rsi + rax * 8]
-	andps xmm0, xmm10
-	sqrtpd xmm0, xmm0
-	movaps xmm7, xmm0
-
-rx_i_169: ;RET
-	dec edi
-	jz rx_finish
-	xor r11, 072790347h
-	mov ecx, r11d
-	call rx_read_dataset_r
-	cmp rsp, rbp
-	je short not_taken_ret_169
-	xor rax, qword ptr [rsp + 8]
-	mov rcx, rax
-	mov eax, r14d
-	xor eax, 0b353bf8dh
-	and eax, 2047
-	mov qword ptr [rsi + rax * 8], rcx
-	ret 8
-not_taken_ret_169:
-	mov rcx, rax
-	mov eax, r14d
-	xor eax, 0b353bf8dh
-	and eax, 2047
-	mov qword ptr [rsi + rax * 8], rcx
-
-rx_i_170: ;CALL
-	dec edi
-	jz rx_finish
-	xor r8, 04ae8a020h
-	mov eax, r8d
-	and eax, 2047
-	mov rax, qword ptr [rsi + rax * 8]
-	cmp r10d, -1541051751
-	jl short taken_call_170
-	mov r14, rax
-	jmp rx_i_171
-taken_call_170:
-	push rax
-	call rx_i_204
-
-rx_i_171: ;IMULH_64
-	dec edi
-	jz rx_finish
-	xor r15, 09901e05bh
-	mov eax, r15d
-	and eax, 2047
-	mov rax, qword ptr [rsi + rax * 8]
-	mov rcx, r12
-	imul rcx
-	mov rax, rdx
-	mov r12, rax
-
-rx_i_172: ;SUB_64
-	dec edi
-	jz rx_finish
-	xor r13, 050e8c510h
-	mov eax, r13d
-	and eax, 2047
-	mov rax, qword ptr [rsi + rax * 8]
-	sub rax, r11
-	mov r12, rax
-
-rx_i_173: ;MUL_64
-	dec edi
-	jz rx_finish
-	xor r14, 05422cf8fh
-	mov eax, r14d
-	and eax, 2047
-	mov rax, qword ptr [rsi + rax * 8]
-	imul rax, r12
-	mov rcx, rax
-	mov eax, r12d
-	xor eax, 0ad60ae9ch
-	and eax, 32767
-	mov qword ptr [rsi + rax * 8], rcx
-
-rx_i_174: ;FPROUND
-	dec edi
-	jz rx_finish
-	xor r12, 0a025c3dbh
-	mov eax, r12d
-	and eax, 2047
-	mov rax, qword ptr [rsi + rax * 8]
-	mov rcx, rax
-	shl eax, 13
-	and rcx, -2048
-	and eax, 24576
-	cvtsi2sd xmm6, rcx
-	or eax, 40896
-	mov dword ptr [rsp - 8], eax
-	ldmxcsr dword ptr [rsp - 8]
-	mov eax, r14d
-	xor eax, 02be6989fh
-	and eax, 32767
-	movlpd qword ptr [rsi + rax * 8], xmm6
-
-rx_i_175: ;SAR_64
-	dec edi
-	jz rx_finish
-	xor r13, 08f74c11h
-	mov ecx, r13d
-	call rx_read_dataset_r
-	mov rcx, r8
-	sar rax, cl
-	mov r8, rax
-
-rx_i_176: ;SUB_64
-	dec edi
-	jz rx_finish
-	xor r9, 01f2ed5f1h
-	mov ecx, r9d
-	call rx_read_dataset_r
-	sub rax, r14
-	mov r10, rax
-
-rx_i_177: ;ADD_64
-	dec edi
-	jz rx_finish
-	xor r10, 0d2072c79h
-	mov ecx, r10d
-	call rx_read_dataset_r
-	add rax, r10
-	mov rcx, rax
-	mov eax, r13d
-	xor eax, 02f5713b7h
-	and eax, 2047
-	mov qword ptr [rsi + rax * 8], rcx
-
-rx_i_178: ;RET
-	dec edi
-	jz rx_finish
-	xor r15, 0a8e51933h
-	mov ecx, r15d
-	call rx_read_dataset_r
-	cmp rsp, rbp
-	je short not_taken_ret_178
-	xor rax, qword ptr [rsp + 8]
-	mov rcx, rax
-	mov eax, r12d
-	xor eax, 0c366b275h
-	and eax, 2047
-	mov qword ptr [rsi + rax * 8], rcx
-	ret 8
-not_taken_ret_178:
-	mov rcx, rax
-	mov eax, r12d
-	xor eax, 0c366b275h
-	and eax, 2047
-	mov qword ptr [rsi + rax * 8], rcx
-
-rx_i_179: ;FPADD
-	dec edi
-	jz rx_finish
-	xor r12, 0934ad492h
-	mov eax, r12d
-	and eax, 32767
-	cvtdq2pd xmm0, qword ptr [rsi + rax * 8]
-	addpd xmm0, xmm2
-	movaps xmm8, xmm0
-
-rx_i_180: ;XOR_64
-	dec edi
-	jz rx_finish
-	xor r15, 01cb3ce1fh
-	mov eax, r15d
-	and eax, 2047
-	mov rax, qword ptr [rsi + rax * 8]
-	xor rax, 1995308563
-	mov rcx, rax
-	mov eax, r9d
-	xor eax, 076edfe13h
-	and eax, 2047
-	mov qword ptr [rsi + rax * 8], rcx
-
-rx_i_181: ;RET
-	dec edi
-	jz rx_finish
-	xor r10, 023c7845fh
-	mov ecx, r10d
-	call rx_read_dataset_r
-	cmp rsp, rbp
-	je short not_taken_ret_181
-	xor rax, qword ptr [rsp + 8]
-	mov r10, rax
-	ret 8
-not_taken_ret_181:
-	mov r10, rax
-
-rx_i_182: ;FPSUB
-	dec edi
-	jz rx_finish
-	xor r8, 0f8884327h
-	mov ecx, r8d
-	call rx_read_dataset_f
-	subpd xmm0, xmm7
-	movaps xmm6, xmm0
-
-rx_i_183: ;ADD_64
-	dec edi
-	jz rx_finish
-	xor r13, 013070461h
-	mov eax, r13d
-	and eax, 2047
-	mov rax, qword ptr [rsi + rax * 8]
-	add rax, 137260710
-	mov r10, rax
-
-rx_i_184: ;SAR_64
-	dec edi
-	jz rx_finish
-	xor r12, 04764cdf7h
-	mov ecx, r12d
-	call rx_read_dataset_r
-	sar rax, 40
-	mov r12, rax
-
-rx_i_185: ;CALL
-	dec edi
-	jz rx_finish
-	xor r10, 03c41026fh
-	mov eax, r10d
-	and eax, 32767
-	mov rax, qword ptr [rsi + rax * 8]
-	cmp r15d, -1510284125
-	jbe short taken_call_185
-	mov rcx, rax
-	mov eax, r9d
-	xor eax, 0a5fae4a3h
-	and eax, 2047
-	mov qword ptr [rsi + rax * 8], rcx
-	jmp rx_i_186
-taken_call_185:
-	push rax
-	call rx_i_246
-
-rx_i_186: ;XOR_32
-	dec edi
-	jz rx_finish
-	xor r9, 0cded414bh
-	mov ecx, r9d
-	call rx_read_dataset_r
-	xor eax, r15d
-	mov rcx, rax
-	mov eax, r10d
-	xor eax, 0b55bfba0h
-	and eax, 2047
-	mov qword ptr [rsi + rax * 8], rcx
-
-rx_i_187: ;FPDIV
-	dec edi
-	jz rx_finish
-	xor r13, 05c6d64a8h
-	mov ecx, r13d
-	call rx_read_dataset_f
-	divpd xmm0, xmm6
-	movaps xmm1, xmm0
-	cmpeqpd xmm1, xmm1
-	andps xmm0, xmm1
-	movaps xmm5, xmm0
-
-rx_i_188: ;FPMUL
-	dec edi
-	jz rx_finish
-	xor r9, 04659becbh
-	mov eax, r9d
-	and eax, 32767
-	cvtdq2pd xmm0, qword ptr [rsi + rax * 8]
-	mulpd xmm0, xmm3
-	movaps xmm1, xmm0
-	cmpeqpd xmm1, xmm1
-	andps xmm0, xmm1
-	movaps xmm4, xmm0
-
-rx_i_189: ;FPROUND
-	dec edi
-	jz rx_finish
-	xor r11, 0c52741d5h
-	mov ecx, r11d
-	call rx_read_dataset_r
-	mov rcx, rax
-	shl eax, 13
-	and rcx, -2048
-	and eax, 24576
-	cvtsi2sd xmm5, rcx
-	or eax, 40896
-	mov dword ptr [rsp - 8], eax
-	ldmxcsr dword ptr [rsp - 8]
-
-rx_i_190: ;RET
-	dec edi
-	jz rx_finish
-	xor r12, 0217bf5f3h
-	mov ecx, r12d
-	call rx_read_dataset_r
-	cmp rsp, rbp
-	je short not_taken_ret_190
-	xor rax, qword ptr [rsp + 8]
-	mov r13, rax
-	ret 8
-not_taken_ret_190:
-	mov r13, rax
-
-rx_i_191: ;CALL
-	dec edi
-	jz rx_finish
-	xor r15, 0884f3526h
-	mov ecx, r15d
-	call rx_read_dataset_r
-	cmp r11d, 1687119072
-	jno short taken_call_191
-	mov rcx, rax
-	mov eax, r14d
-	xor eax, 0648f64e0h
-	and eax, 2047
-	mov qword ptr [rsi + rax * 8], rcx
-	jmp rx_i_192
-taken_call_191:
-	push rax
-	call rx_i_275
-
-rx_i_192: ;CALL
-	dec edi
-	jz rx_finish
-	xor r8, 0d76edad3h
-	mov eax, r8d
-	and eax, 2047
-	mov rax, qword ptr [rsi + rax * 8]
-	cmp r14d, -117628864
-	jns short taken_call_192
-	mov r8, rax
-	jmp rx_i_193
-taken_call_192:
-	push rax
-	call rx_i_305
-
-rx_i_193: ;MUL_32
-	dec edi
-	jz rx_finish
-	xor r12, 0e9939ach
-	mov ecx, r12d
-	call rx_read_dataset_r
-	mov ecx, eax
-	mov eax, r12d
-	imul rax, rcx
-	mov rcx, rax
-	mov eax, r15d
-	xor eax, 074e097dch
-	and eax, 2047
-	mov qword ptr [rsi + rax * 8], rcx
-
-rx_i_194: ;FPMUL
-	dec edi
-	jz rx_finish
-	xor r12, 0f21ca520h
-	mov eax, r12d
-	and eax, 2047
-	cvtdq2pd xmm0, qword ptr [rsi + rax * 8]
-	mulpd xmm0, xmm2
-	movaps xmm1, xmm0
-	cmpeqpd xmm1, xmm1
-	andps xmm0, xmm1
-	movaps xmm5, xmm0
-	mov eax, r13d
-	xor eax, 040eb9f47h
-	and eax, 2047
-	movlpd qword ptr [rsi + rax * 8], xmm5
-
-rx_i_195: ;ROL_64
-	dec edi
-	jz rx_finish
-	xor r10, 09405152ch
-	mov ecx, r10d
-	call rx_read_dataset_r
-	mov rcx, r8
-	rol rax, cl
-	mov r9, rax
-
-rx_i_196: ;SUB_64
-	dec edi
-	jz rx_finish
-	xor r8, 0c2a9f41bh
-	mov ecx, r8d
-	call rx_read_dataset_r
-	sub rax, -1907903895
-	mov rcx, rax
-	mov eax, r13d
-	xor eax, 08e47b269h
-	and eax, 32767
-	mov qword ptr [rsi + rax * 8], rcx
-
-rx_i_197: ;MUL_64
-	dec edi
-	jz rx_finish
-	xor r12, 0229208efh
-	mov ecx, r12d
-	call rx_read_dataset_r
-	imul rax, r15
-	mov r11, rax
-
-rx_i_198: ;MULH_64
-	dec edi
-	jz rx_finish
-	xor r14, 0c8d95bbbh
-	mov eax, r14d
-	and eax, 32767
-	mov rax, qword ptr [rsi + rax * 8]
-	mov rcx, r14
-	mul rcx
-	mov rax, rdx
-	mov rcx, rax
-	mov eax, r8d
-	xor eax, 01149cba0h
-	and eax, 2047
-	mov qword ptr [rsi + rax * 8], rcx
-
-rx_i_199: ;MULH_64
-	dec edi
-	jz rx_finish
-	xor r13, 050049e2eh
-	mov ecx, r13d
-	call rx_read_dataset_r
-	mov rcx, r10
-	mul rcx
-	mov rax, rdx
-	mov rcx, rax
-	mov eax, r10d
-	xor eax, 0d0e71e9ah
-	and eax, 2047
-	mov qword ptr [rsi + rax * 8], rcx
-
-rx_i_200: ;FPSUB
-	dec edi
-	jz rx_finish
-	xor r10, 0c63b99e8h
-	mov ecx, r10d
-	call rx_read_dataset_f
-	subpd xmm0, xmm2
-	movaps xmm4, xmm0
-	mov eax, r12d
-	xor eax, 0b05ce8abh
-	and eax, 2047
-	movhpd qword ptr [rsi + rax * 8], xmm4
-
-rx_i_201: ;FPADD
-	dec edi
-	jz rx_finish
-	xor r8, 0cdda801dh
-	mov eax, r8d
-	and eax, 2047
-	cvtdq2pd xmm0, qword ptr [rsi + rax * 8]
-	addpd xmm0, xmm9
-	movaps xmm4, xmm0
-	mov eax, r12d
-	xor eax, 040cfe68eh
-	and eax, 2047
-	movlpd qword ptr [rsi + rax * 8], xmm4
-
-rx_i_202: ;FPSUB
-	dec edi
-	jz rx_finish
-	xor r13, 0fa44b04ah
-	mov eax, r13d
-	and eax, 2047
-	cvtdq2pd xmm0, qword ptr [rsi + rax * 8]
+	and eax, 16376
+	add r15, qword ptr [rsi+rax]
+	; IADD_R r5, r6
+	add r13, r14
+	; FSUB_R f0, a1
 	subpd xmm0, xmm9
-	movaps xmm5, xmm0
-
-rx_i_203: ;FPSUB
-	dec edi
-	jz rx_finish
-	xor r10, 0d73e472ch
-	mov ecx, r10d
-	call rx_read_dataset_f
-	subpd xmm0, xmm2
-	movaps xmm7, xmm0
-	mov eax, r15d
-	xor eax, 09bdff355h
-	and eax, 2047
-	movhpd qword ptr [rsi + rax * 8], xmm7
-
-rx_i_204: ;MUL_64
-	dec edi
-	jz rx_finish
-	xor r9, 01af8ab1dh
+	; IMULH_R r5, r4
+	mov rax, r13
+	mul r12
+	mov r13, rdx
+	; IMUL_9C r7, 753606235
+	lea r15, [r15+r15*8+753606235]
+	; FSWAP_R e2
+	shufpd xmm6, xmm6, 1
+	; IMUL_M r7, L1[r1]
 	mov eax, r9d
-	and eax, 32767
-	mov rax, qword ptr [rsi + rax * 8]
-	imul rax, r15
-	mov rcx, rax
-	mov eax, r8d
-	xor eax, 0eb8fc30fh
-	and eax, 32767
-	mov qword ptr [rsi + rax * 8], rcx
-
-rx_i_205: ;FPDIV
-	dec edi
-	jz rx_finish
-	xor r14, 094e997c5h
-	mov eax, r14d
-	and eax, 2047
-	cvtdq2pd xmm0, qword ptr [rsi + rax * 8]
-	divpd xmm0, xmm8
-	movaps xmm1, xmm0
-	cmpeqpd xmm1, xmm1
-	andps xmm0, xmm1
-	movaps xmm5, xmm0
-
-rx_i_206: ;FPMUL
-	dec edi
-	jz rx_finish
-	xor r11, 0e836a177h
-	mov ecx, r11d
-	call rx_read_dataset_f
-	mulpd xmm0, xmm7
-	movaps xmm1, xmm0
-	cmpeqpd xmm1, xmm1
-	andps xmm0, xmm1
-	movaps xmm4, xmm0
-
-rx_i_207: ;AND_32
-	dec edi
-	jz rx_finish
-	xor r9, 039ccdd30h
-	mov ecx, r9d
-	call rx_read_dataset_r
-	and eax, r12d
-	mov rcx, rax
-	mov eax, r9d
-	xor eax, 012bbcc84h
-	and eax, 2047
-	mov qword ptr [rsi + rax * 8], rcx
-
-rx_i_208: ;MUL_64
-	dec edi
-	jz rx_finish
-	xor r9, 0f4f126c5h
-	mov ecx, r9d
-	call rx_read_dataset_r
-	imul rax, r12
-	mov r10, rax
-
-rx_i_209: ;SHR_64
-	dec edi
-	jz rx_finish
-	xor r8, 0b84811f1h
-	mov eax, r8d
-	and eax, 2047
-	mov rax, qword ptr [rsi + rax * 8]
-	shr rax, 30
-	mov rcx, rax
+	and eax, 16376
+	imul r15, qword ptr [rsi+rax]
+	; IMUL_R r5, 1431156245
+	imul r13, 1431156245
+	; IADD_RC r4, r2, 1268508410
+	lea r12, [r12+r10+1268508410]
+	; FSWAP_R f2
+	shufpd xmm2, xmm2, 1
+	; ISDIV_C r0, -845194077
+	mov rax, -5858725577819591251
+	imul r8
+	xor eax, eax
+	sar rdx, 28
+	sets al
+	add rdx, rax
+	add r8, rdx
+	; COND_R r0, ab(r5, 1644043355)
+	xor ecx, ecx
+	cmp r13d, 1644043355
+	seta cl
+	add r8, rcx
+	; COND_R r5, lt(r0, 1216385844)
+	xor ecx, ecx
+	cmp r8d, 1216385844
+	setl cl
+	add r13, rcx
+	; IMUL_R r5, r2
+	imul r13, r10
+	; ISTORE L1[r4], r6
 	mov eax, r12d
-	xor eax, 0c36b836ah
-	and eax, 32767
-	mov qword ptr [rsi + rax * 8], rcx
-
-rx_i_210: ;MUL_32
-	dec edi
-	jz rx_finish
-	xor r12, 0c5efc90ah
-	mov ecx, r12d
-	call rx_read_dataset_r
-	mov ecx, eax
-	mov eax, -1027162400
-	imul rax, rcx
-	mov rcx, rax
-	mov eax, r15d
-	xor eax, 0c2c6bee0h
-	and eax, 32767
-	mov qword ptr [rsi + rax * 8], rcx
-
-rx_i_211: ;FPADD
-	dec edi
-	jz rx_finish
-	xor r12, 0ce533072h
-	mov eax, r12d
-	and eax, 2047
-	cvtdq2pd xmm0, qword ptr [rsi + rax * 8]
-	addpd xmm0, xmm9
-	movaps xmm3, xmm0
-
-rx_i_212: ;MUL_64
-	dec edi
-	jz rx_finish
-	xor r13, 06b465fdbh
-	mov ecx, r13d
-	call rx_read_dataset_r
-	imul rax, r13
-	mov rcx, rax
-	mov eax, r15d
-	xor eax, 067d81043h
-	and eax, 2047
-	mov qword ptr [rsi + rax * 8], rcx
-
-rx_i_213: ;IMUL_32
-	dec edi
-	jz rx_finish
-	xor r13, 02dd1d503h
+	and eax, 16376
+	mov qword ptr [rsi+rax], r14
+	; IXOR_R r4, r3
+	xor r12, r11
+	; IXOR_R r6, r2
+	xor r14, r10
+	; FSQRT_R e1
+	sqrtpd xmm5, xmm5
+	; COND_R r5, be(r1, 1781435695)
+	xor ecx, ecx
+	cmp r9d, 1781435695
+	setbe cl
+	add r13, rcx
+	; ISDIV_C r0, 1367038890
+	mov rax, 1811126293978922977
+	imul r8
+	xor eax, eax
+	sar rdx, 27
+	sets al
+	add rdx, rax
+	add r8, rdx
+	; FDIV_M e1, L1[r3]
+	mov eax, r11d
+	and eax, 16376
+	cvtdq2pd xmm12, qword ptr [rsi+rax]
+	andps xmm12, xmm14
+	divpd xmm5, xmm12
+	maxpd xmm5, xmm13
+	; FMUL_R e2, a0
+	mulpd xmm6, xmm8
+	; ISTORE L1[r5], r4
 	mov eax, r13d
-	and eax, 2047
-	mov rax, qword ptr [rsi + rax * 8]
-	movsxd rcx, eax
-	mov rax, 129993589
-	imul rax, rcx
-	mov r14, rax
-
-rx_i_214: ;ROL_64
-	dec edi
-	jz rx_finish
-	xor r9, 0a159f313h
-	mov eax, r9d
-	and eax, 2047
-	mov rax, qword ptr [rsi + rax * 8]
-	mov rcx, r14
-	rol rax, cl
-	mov r14, rax
-
-rx_i_215: ;SUB_64
-	dec edi
-	jz rx_finish
-	xor r15, 08359265eh
-	mov ecx, r15d
-	call rx_read_dataset_r
-	sub rax, r12
-	mov r10, rax
-
-rx_i_216: ;MUL_64
-	dec edi
-	jz rx_finish
-	xor r12, 080696de3h
-	mov ecx, r12d
-	call rx_read_dataset_r
-	imul rax, r13
-	mov rcx, rax
-	mov eax, r15d
-	xor eax, 03b609d2bh
-	and eax, 2047
-	mov qword ptr [rsi + rax * 8], rcx
-
-rx_i_217: ;IMUL_32
-	dec edi
-	jz rx_finish
-	xor r8, 040d5b526h
-	mov eax, r8d
-	and eax, 2047
-	mov rax, qword ptr [rsi + rax * 8]
-	movsxd rcx, eax
-	movsxd rax, r9d
-	imul rax, rcx
-	mov rcx, rax
+	and eax, 16376
+	mov qword ptr [rsi+rax], r12
+	; IXOR_R r0, r4
+	xor r8, r12
+	; IMUL_R r5, r1
+	imul r13, r9
+	; FDIV_M e0, L1[r2]
 	mov eax, r10d
-	xor eax, 017e667h
-	and eax, 2047
-	mov qword ptr [rsi + rax * 8], rcx
-
-rx_i_218: ;CALL
-	dec edi
-	jz rx_finish
-	xor r11, 083c0bd93h
-	mov ecx, r11d
-	call rx_read_dataset_r
-	cmp r8d, -585552250
-	jge short taken_call_218
-	mov r11, rax
-	jmp rx_i_219
-taken_call_218:
-	push rax
-	call rx_i_240
-
-rx_i_219: ;XOR_64
-	dec edi
-	jz rx_finish
-	xor r8, 0ca37f668h
-	mov ecx, r8d
-	call rx_read_dataset_r
-	xor rax, -740915304
-	mov rcx, rax
-	mov eax, r15d
-	xor eax, 0d3d68798h
-	and eax, 2047
-	mov qword ptr [rsi + rax * 8], rcx
-
-rx_i_220: ;IMUL_32
-	dec edi
-	jz rx_finish
-	xor r9, 0bb44c384h
-	mov ecx, r9d
-	call rx_read_dataset_r
-	movsxd rcx, eax
-	movsxd rax, r11d
-	imul rax, rcx
-	mov rcx, rax
-	mov eax, r11d
-	xor eax, 0903fd173h
-	and eax, 32767
-	mov qword ptr [rsi + rax * 8], rcx
-
-rx_i_221: ;IMULH_64
-	dec edi
-	jz rx_finish
-	xor r9, 0a3deb512h
-	mov eax, r9d
-	and eax, 2047
-	mov rax, qword ptr [rsi + rax * 8]
-	mov rcx, r15
-	imul rcx
-	mov rax, rdx
-	mov rcx, rax
-	mov eax, r11d
-	xor eax, 07feab351h
-	and eax, 2047
-	mov qword ptr [rsi + rax * 8], rcx
-
-rx_i_222: ;FPMUL
-	dec edi
-	jz rx_finish
-	xor r9, 084a02d64h
-	mov eax, r9d
-	and eax, 2047
-	cvtdq2pd xmm0, qword ptr [rsi + rax * 8]
-	mulpd xmm0, xmm5
-	movaps xmm1, xmm0
-	cmpeqpd xmm1, xmm1
-	andps xmm0, xmm1
-	movaps xmm7, xmm0
-	mov eax, r15d
-	xor eax, 0d7601963h
-	and eax, 2047
-	movhpd qword ptr [rsi + rax * 8], xmm7
-
-rx_i_223: ;FPSUB
-	dec edi
-	jz rx_finish
-	xor r8, 01e5cc085h
+	and eax, 16376
+	cvtdq2pd xmm12, qword ptr [rsi+rax]
+	andps xmm12, xmm14
+	divpd xmm4, xmm12
+	maxpd xmm4, xmm13
+	; IMUL_R r6, r1
+	imul r14, r9
+	; FSUB_M f1, L1[r0]
 	mov eax, r8d
-	and eax, 2047
-	cvtdq2pd xmm0, qword ptr [rsi + rax * 8]
-	subpd xmm0, xmm3
-	movaps xmm2, xmm0
-	mov eax, r10d
-	xor eax, 07fca59eeh
-	and eax, 32767
-	movlpd qword ptr [rsi + rax * 8], xmm2
-
-rx_i_224: ;SAR_64
-	dec edi
-	jz rx_finish
-	xor r12, 053982440h
-	mov ecx, r12d
-	call rx_read_dataset_r
-	mov rcx, r14
-	sar rax, cl
-	mov rcx, rax
-	mov eax, r11d
-	xor eax, 0e500c69dh
-	and eax, 2047
-	mov qword ptr [rsi + rax * 8], rcx
-
-rx_i_225: ;DIV_64
-	dec edi
-	jz rx_finish
-	xor r13, 0c558367eh
-	mov ecx, r13d
-	call rx_read_dataset_r
-	mov ecx, 1
-	mov edx, r10d
-	test edx, edx
-	cmovne ecx, edx
-	xor edx, edx
-	div rcx
-	mov rcx, rax
-	mov eax, r12d
-	xor eax, 0fe304a4ah
-	and eax, 32767
-	mov qword ptr [rsi + rax * 8], rcx
-
-rx_i_226: ;CALL
-	dec edi
-	jz rx_finish
-	xor r10, 040139b65h
-	mov ecx, r10d
-	call rx_read_dataset_r
-	cmp r8d, -1752488808
-	jno short taken_call_226
-	mov rcx, rax
+	and eax, 16376
+	cvtdq2pd xmm12, qword ptr [rsi+rax]
+	subpd xmm1, xmm12
+	; COND_R r2, ns(r1, 392878356)
+	xor ecx, ecx
+	cmp r9d, 392878356
+	setns cl
+	add r10, rcx
+	; IADD_R r6, r5
+	add r14, r13
+	; FMUL_R e2, a0
+	mulpd xmm6, xmm8
+	; ISTORE L1[r0], r3
 	mov eax, r8d
-	xor eax, 0978b2498h
-	and eax, 2047
-	mov qword ptr [rsi + rax * 8], rcx
-	jmp rx_i_227
-taken_call_226:
-	push rax
-	call rx_i_328
-
-rx_i_227: ;FPDIV
-	dec edi
-	jz rx_finish
-	xor r11, 0fa312dbdh
-	mov eax, r11d
-	and eax, 32767
-	cvtdq2pd xmm0, qword ptr [rsi + rax * 8]
-	divpd xmm0, xmm7
-	movaps xmm1, xmm0
-	cmpeqpd xmm1, xmm1
-	andps xmm0, xmm1
-	movaps xmm3, xmm0
-	mov eax, r11d
-	xor eax, 0aabe2a0ah
-	and eax, 2047
-	movlpd qword ptr [rsi + rax * 8], xmm3
-
-rx_i_228: ;CALL
-	dec edi
-	jz rx_finish
-	xor r11, 0b64246c0h
-	mov eax, r11d
-	and eax, 2047
-	mov rax, qword ptr [rsi + rax * 8]
-	cmp r10d, -2099304
-	jns short taken_call_228
-	mov rcx, rax
-	mov eax, r15d
-	xor eax, 0ffdff798h
-	and eax, 2047
-	mov qword ptr [rsi + rax * 8], rcx
-	jmp rx_i_229
-taken_call_228:
-	push rax
-	call rx_i_283
-
-rx_i_229: ;IMUL_32
-	dec edi
-	jz rx_finish
-	xor r11, 05c535836h
-	mov ecx, r11d
-	call rx_read_dataset_r
-	movsxd rcx, eax
-	movsxd rax, r12d
-	imul rax, rcx
-	mov rcx, rax
-	mov eax, r13d
-	xor eax, 013e8b2e0h
-	and eax, 32767
-	mov qword ptr [rsi + rax * 8], rcx
-
-rx_i_230: ;FPMUL
-	dec edi
-	jz rx_finish
-	xor r15, 0f394972eh
-	mov eax, r15d
-	and eax, 2047
-	cvtdq2pd xmm0, qword ptr [rsi + rax * 8]
-	mulpd xmm0, xmm6
-	movaps xmm1, xmm0
-	cmpeqpd xmm1, xmm1
-	andps xmm0, xmm1
-	movaps xmm5, xmm0
-	mov eax, r13d
-	xor eax, 01dc2b4f6h
-	and eax, 2047
-	movhpd qword ptr [rsi + rax * 8], xmm5
-
-rx_i_231: ;RET
-	dec edi
-	jz rx_finish
-	xor r9, 0bb56428dh
-	mov ecx, r9d
-	call rx_read_dataset_r
-	cmp rsp, rbp
-	je short not_taken_ret_231
-	xor rax, qword ptr [rsp + 8]
-	mov rcx, rax
-	mov eax, r9d
-	xor eax, 0e6c9edaah
-	and eax, 2047
-	mov qword ptr [rsi + rax * 8], rcx
-	ret 8
-not_taken_ret_231:
-	mov rcx, rax
-	mov eax, r9d
-	xor eax, 0e6c9edaah
-	and eax, 2047
-	mov qword ptr [rsi + rax * 8], rcx
-
-rx_i_232: ;FPDIV
-	dec edi
-	jz rx_finish
-	xor r15, 09ab46ab3h
-	mov eax, r15d
-	and eax, 2047
-	cvtdq2pd xmm0, qword ptr [rsi + rax * 8]
-	divpd xmm0, xmm3
-	movaps xmm1, xmm0
-	cmpeqpd xmm1, xmm1
-	andps xmm0, xmm1
-	movaps xmm7, xmm0
-
-rx_i_233: ;CALL
-	dec edi
-	jz rx_finish
-	xor r13, 08eb2cd76h
-	mov ecx, r13d
-	call rx_read_dataset_r
-	cmp r12d, 392389867
-	jo short taken_call_233
-	mov r14, rax
-	jmp rx_i_234
-taken_call_233:
-	push rax
-	call rx_i_268
-
-rx_i_234: ;FPROUND
-	dec edi
-	jz rx_finish
-	xor r15, 0ba687578h
-	mov eax, r15d
-	and eax, 2047
-	mov rax, qword ptr [rsi + rax * 8]
-	mov rcx, rax
-	shl eax, 13
-	and rcx, -2048
-	and eax, 24576
-	cvtsi2sd xmm4, rcx
-	or eax, 40896
-	mov dword ptr [rsp - 8], eax
-	ldmxcsr dword ptr [rsp - 8]
-
-rx_i_235: ;IMUL_32
-	dec edi
-	jz rx_finish
-	xor r13, 0b6cb9ff2h
-	mov ecx, r13d
-	call rx_read_dataset_r
-	movsxd rcx, eax
-	movsxd rax, r12d
-	imul rax, rcx
-	mov rcx, rax
-	mov eax, r15d
-	xor eax, 0ca73a89h
-	and eax, 2047
-	mov qword ptr [rsi + rax * 8], rcx
-
-rx_i_236: ;FPADD
-	dec edi
-	jz rx_finish
-	xor r15, 03ad196ach
-	mov ecx, r15d
-	call rx_read_dataset_f
-	addpd xmm0, xmm4
-	movaps xmm3, xmm0
-
-rx_i_237: ;CALL
-	dec edi
-	jz rx_finish
-	xor r15, 0fab4600h
-	mov ecx, r15d
-	call rx_read_dataset_r
-	cmp r12d, -121899164
-	jge short taken_call_237
-	mov r11, rax
-	jmp rx_i_238
-taken_call_237:
-	push rax
-	call rx_i_295
-
-rx_i_238: ;FPADD
-	dec edi
-	jz rx_finish
-	xor r8, 0158f119fh
-	mov ecx, r8d
-	call rx_read_dataset_f
-	addpd xmm0, xmm6
-	movaps xmm7, xmm0
-	mov eax, r15d
-	xor eax, 0331bbf8h
-	and eax, 2047
-	movlpd qword ptr [rsi + rax * 8], xmm7
-
-rx_i_239: ;ADD_64
-	dec edi
-	jz rx_finish
-	xor r13, 044f30b3fh
-	mov eax, r13d
-	and eax, 2047
-	mov rax, qword ptr [rsi + rax * 8]
-	add rax, r10
-	mov r10, rax
-
-rx_i_240: ;IMUL_32
-	dec edi
-	jz rx_finish
-	xor r9, 0d65d29f9h
-	mov ecx, r9d
-	call rx_read_dataset_r
-	movsxd rcx, eax
-	mov rax, -423830277
-	imul rax, rcx
-	mov r8, rax
-
-rx_i_241: ;FPADD
-	dec edi
-	jz rx_finish
-	xor r11, 0ce5260adh
-	mov ecx, r11d
-	call rx_read_dataset_f
-	addpd xmm0, xmm3
-	movaps xmm7, xmm0
-	mov eax, r15d
-	xor eax, 0bc2423ebh
-	and eax, 2047
-	movhpd qword ptr [rsi + rax * 8], xmm7
-
-rx_i_242: ;MULH_64
-	dec edi
-	jz rx_finish
-	xor r12, 01119b0f9h
-	mov eax, r12d
-	and eax, 2047
-	mov rax, qword ptr [rsi + rax * 8]
-	mov rcx, 319324914
-	mul rcx
-	mov rax, rdx
-	mov rcx, rax
-	mov eax, r10d
-	xor eax, 0130882f2h
-	and eax, 2047
-	mov qword ptr [rsi + rax * 8], rcx
-
-rx_i_243: ;XOR_64
-	dec edi
-	jz rx_finish
-	xor r12, 0d6c2ce3dh
-	mov eax, r12d
-	and eax, 2047
-	mov rax, qword ptr [rsi + rax * 8]
-	xor rax, 1198180774
-	mov r14, rax
-
-rx_i_244: ;FPADD
-	dec edi
-	jz rx_finish
-	xor r11, 0c6a6248h
-	mov eax, r11d
-	and eax, 2047
-	cvtdq2pd xmm0, qword ptr [rsi + rax * 8]
-	addpd xmm0, xmm6
-	movaps xmm9, xmm0
-
-rx_i_245: ;XOR_64
-	dec edi
-	jz rx_finish
-	xor r13, 084505739h
-	mov ecx, r13d
-	call rx_read_dataset_r
-	xor rax, -1546539637
-	mov rcx, rax
-	mov eax, r12d
-	xor eax, 0a3d1ad8bh
-	and eax, 2047
-	mov qword ptr [rsi + rax * 8], rcx
-
-rx_i_246: ;AND_64
-	dec edi
-	jz rx_finish
-	xor r15, 027eeaa2eh
-	mov eax, r15d
-	and eax, 2047
-	mov rax, qword ptr [rsi + rax * 8]
-	and rax, r9
-	mov r12, rax
-
-rx_i_247: ;IMUL_32
-	dec edi
-	jz rx_finish
-	xor r10, 0c4de0296h
-	mov ecx, r10d
-	call rx_read_dataset_r
-	movsxd rcx, eax
-	movsxd rax, r14d
-	imul rax, rcx
-	mov rcx, rax
-	mov eax, r9d
-	xor eax, 03814cf80h
-	and eax, 2047
-	mov qword ptr [rsi + rax * 8], rcx
-
-rx_i_248: ;MUL_32
-	dec edi
-	jz rx_finish
-	xor r8, 0649df46fh
-	mov eax, r8d
-	and eax, 2047
-	mov rax, qword ptr [rsi + rax * 8]
-	mov ecx, eax
-	mov eax, r15d
-	imul rax, rcx
-	mov rcx, rax
-	mov eax, r9d
-	xor eax, 07b10fc32h
-	and eax, 2047
-	mov qword ptr [rsi + rax * 8], rcx
-
-rx_i_249: ;IMUL_32
-	dec edi
-	jz rx_finish
-	xor r15, 0499552cch
-	mov ecx, r15d
-	call rx_read_dataset_r
-	movsxd rcx, eax
-	movsxd rax, r11d
-	imul rax, rcx
-	mov rcx, rax
-	mov eax, r13d
-	xor eax, 0e1afcff9h
-	and eax, 32767
-	mov qword ptr [rsi + rax * 8], rcx
-
-rx_i_250: ;MUL_64
-	dec edi
-	jz rx_finish
-	xor r13, 083eafe6fh
-	mov eax, r13d
-	and eax, 2047
-	mov rax, qword ptr [rsi + rax * 8]
-	imul rax, r8
-	mov rcx, rax
-	mov eax, r14d
-	xor eax, 031115b87h
-	and eax, 2047
-	mov qword ptr [rsi + rax * 8], rcx
-
-rx_i_251: ;FPMUL
-	dec edi
-	jz rx_finish
-	xor r13, 0a25a4d8ah
-	mov eax, r13d
-	and eax, 2047
-	cvtdq2pd xmm0, qword ptr [rsi + rax * 8]
-	mulpd xmm0, xmm2
-	movaps xmm1, xmm0
-	cmpeqpd xmm1, xmm1
-	andps xmm0, xmm1
-	movaps xmm4, xmm0
-	mov eax, r12d
-	xor eax, 05ed767a3h
-	and eax, 2047
-	movlpd qword ptr [rsi + rax * 8], xmm4
-
-rx_i_252: ;ROL_64
-	dec edi
-	jz rx_finish
-	xor r14, 08a75ad41h
-	mov ecx, r14d
-	call rx_read_dataset_r
-	mov rcx, r8
-	rol rax, cl
-	mov r14, rax
-
-rx_i_253: ;CALL
-	dec edi
-	jz rx_finish
-	xor r14, 057f3f596h
-	mov ecx, r14d
-	call rx_read_dataset_r
-	cmp r15d, 1699431947
-	jns short taken_call_253
-	mov rcx, rax
-	mov eax, r13d
-	xor eax, 0654b460bh
-	and eax, 2047
-	mov qword ptr [rsi + rax * 8], rcx
-	jmp rx_i_254
-taken_call_253:
-	push rax
-	call rx_i_367
-
-rx_i_254: ;FPSUB
-	dec edi
-	jz rx_finish
-	xor r14, 04cfb709eh
-	mov ecx, r14d
-	call rx_read_dataset_f
-	subpd xmm0, xmm4
-	movaps xmm8, xmm0
-	mov eax, r8d
-	xor eax, 0c251872eh
-	and eax, 2047
-	movlpd qword ptr [rsi + rax * 8], xmm8
-
-rx_i_255: ;FPADD
-	dec edi
-	jz rx_finish
-	xor r9, 0b96ec9ech
-	mov ecx, r9d
-	call rx_read_dataset_f
-	addpd xmm0, xmm5
-	movaps xmm6, xmm0
-	mov eax, r14d
-	xor eax, 0ae781d10h
-	and eax, 2047
-	movlpd qword ptr [rsi + rax * 8], xmm6
-
-rx_i_256: ;MULH_64
-	dec edi
-	jz rx_finish
-	xor r8, 08375472ch
-	mov ecx, r8d
-	call rx_read_dataset_r
-	mov rcx, r15
-	mul rcx
-	mov rax, rdx
-	mov rcx, rax
-	mov eax, r9d
-	xor eax, 0f8942c0h
-	and eax, 32767
-	mov qword ptr [rsi + rax * 8], rcx
-
-rx_i_257: ;FPADD
-	dec edi
-	jz rx_finish
-	xor r12, 0d75a8c3fh
-	mov ecx, r12d
-	call rx_read_dataset_f
-	addpd xmm0, xmm5
-	movaps xmm3, xmm0
-	mov eax, r11d
-	xor eax, 0373b1b6fh
-	and eax, 2047
-	movhpd qword ptr [rsi + rax * 8], xmm3
-
-rx_i_258: ;MUL_32
-	dec edi
-	jz rx_finish
-	xor r11, 064fdbda0h
-	mov eax, r11d
-	and eax, 32767
-	mov rax, qword ptr [rsi + rax * 8]
-	mov ecx, eax
-	mov eax, r14d
-	imul rax, rcx
-	mov rcx, rax
-	mov eax, r9d
-	xor eax, 01c58ef2dh
-	and eax, 2047
-	mov qword ptr [rsi + rax * 8], rcx
-
-rx_i_259: ;FPADD
-	dec edi
-	jz rx_finish
-	xor r11, 02e36a073h
-	mov eax, r11d
-	and eax, 2047
-	cvtdq2pd xmm0, qword ptr [rsi + rax * 8]
-	addpd xmm0, xmm9
-	movaps xmm3, xmm0
-
-rx_i_260: ;FPMUL
-	dec edi
-	jz rx_finish
-	xor r13, 0f94e9fa9h
-	mov eax, r13d
-	and eax, 2047
-	cvtdq2pd xmm0, qword ptr [rsi + rax * 8]
-	mulpd xmm0, xmm5
-	movaps xmm1, xmm0
-	cmpeqpd xmm1, xmm1
-	andps xmm0, xmm1
-	movaps xmm9, xmm0
-
-rx_i_261: ;FPSQRT
-	dec edi
-	jz rx_finish
-	xor r14, 02346171ch
-	mov ecx, r14d
-	call rx_read_dataset_f
-	andps xmm0, xmm10
-	sqrtpd xmm0, xmm0
-	movaps xmm3, xmm0
-	mov eax, r11d
-	xor eax, 0745a48e9h
-	and eax, 2047
-	movlpd qword ptr [rsi + rax * 8], xmm3
-
-rx_i_262: ;OR_32
-	dec edi
-	jz rx_finish
-	xor r10, 01c42baa6h
-	mov eax, r10d
-	and eax, 2047
-	mov rax, qword ptr [rsi + rax * 8]
-	or eax, r13d
-	mov rcx, rax
-	mov eax, r11d
-	xor eax, 0a271ff06h
-	and eax, 2047
-	mov qword ptr [rsi + rax * 8], rcx
-
-rx_i_263: ;FPDIV
-	dec edi
-	jz rx_finish
-	xor r11, 0b39b140h
-	mov ecx, r11d
-	call rx_read_dataset_f
-	divpd xmm0, xmm8
-	movaps xmm1, xmm0
-	cmpeqpd xmm1, xmm1
-	andps xmm0, xmm1
-	movaps xmm6, xmm0
-
-rx_i_264: ;FPMUL
-	dec edi
-	jz rx_finish
-	xor r11, 01a07d201h
-	mov eax, r11d
-	and eax, 2047
-	cvtdq2pd xmm0, qword ptr [rsi + rax * 8]
-	mulpd xmm0, xmm3
-	movaps xmm1, xmm0
-	cmpeqpd xmm1, xmm1
-	andps xmm0, xmm1
-	movaps xmm7, xmm0
-
-rx_i_265: ;FPADD
-	dec edi
-	jz rx_finish
-	xor r13, 07a3eb340h
-	mov eax, r13d
-	and eax, 2047
-	cvtdq2pd xmm0, qword ptr [rsi + rax * 8]
+	and eax, 16376
+	mov qword ptr [rsi+rax], r11
+	; IMUL_R r1, r3
+	imul r9, r11
+	; IMUL_R r5, r2
+	imul r13, r10
+	; FADD_R f0, a0
 	addpd xmm0, xmm8
-	movaps xmm2, xmm0
-	mov eax, r10d
-	xor eax, 04c559414h
-	and eax, 2047
-	movhpd qword ptr [rsi + rax * 8], xmm2
-
-rx_i_266: ;RET
-	dec edi
-	jz rx_finish
-	xor r13, 03d0a3a89h
-	mov eax, r13d
-	and eax, 32767
-	mov rax, qword ptr [rsi + rax * 8]
-	cmp rsp, rbp
-	je short not_taken_ret_266
-	xor rax, qword ptr [rsp + 8]
-	mov r10, rax
-	ret 8
-not_taken_ret_266:
-	mov r10, rax
-
-rx_i_267: ;ROR_64
-	dec edi
-	jz rx_finish
-	xor r8, 0c6c7b37h
-	mov eax, r8d
-	and eax, 2047
-	mov rax, qword ptr [rsi + rax * 8]
-	ror rax, 56
-	mov r11, rax
-
-rx_i_268: ;CALL
-	dec edi
-	jz rx_finish
-	xor r12, 0c2510cebh
-	mov eax, r12d
-	and eax, 2047
-	mov rax, qword ptr [rsi + rax * 8]
-	cmp r15d, -2062812966
-	jl short taken_call_268
-	mov r13, rax
-	jmp rx_i_269
-taken_call_268:
-	push rax
-	call rx_i_381
-
-rx_i_269: ;ROR_64
-	dec edi
-	jz rx_finish
-	xor r11, 0c80cc899h
-	mov ecx, r11d
-	call rx_read_dataset_r
-	mov rcx, r8
-	ror rax, cl
-	mov rcx, rax
-	mov eax, r10d
-	xor eax, 01ba81447h
-	and eax, 32767
-	mov qword ptr [rsi + rax * 8], rcx
-
-rx_i_270: ;FPMUL
-	dec edi
-	jz rx_finish
-	xor r11, 0eb355caah
-	mov ecx, r11d
-	call rx_read_dataset_f
-	mulpd xmm0, xmm9
-	movaps xmm1, xmm0
-	cmpeqpd xmm1, xmm1
-	andps xmm0, xmm1
-	movaps xmm7, xmm0
-
-rx_i_271: ;MUL_32
-	dec edi
-	jz rx_finish
-	xor r13, 0c6f12299h
-	mov ecx, r13d
-	call rx_read_dataset_r
-	mov ecx, eax
-	mov eax, -2032281772
-	imul rax, rcx
-	mov rcx, rax
-	mov eax, r9d
-	xor eax, 086ddd754h
-	and eax, 2047
-	mov qword ptr [rsi + rax * 8], rcx
-
-rx_i_272: ;OR_32
-	dec edi
-	jz rx_finish
-	xor r12, 0695a5dd2h
-	mov eax, r12d
-	and eax, 2047
-	mov rax, qword ptr [rsi + rax * 8]
-	or eax, r12d
-	mov r13, rax
-
-rx_i_273: ;CALL
-	dec edi
-	jz rx_finish
-	xor r9, 0d315e4dch
-	mov eax, r9d
-	and eax, 2047
-	mov rax, qword ptr [rsi + rax * 8]
-	cmp r12d, 1670848568
-	jl short taken_call_273
-	mov rcx, rax
-	mov eax, r13d
-	xor eax, 063972038h
-	and eax, 32767
-	mov qword ptr [rsi + rax * 8], rcx
-	jmp rx_i_274
-taken_call_273:
-	push rax
-	call rx_i_372
-
-rx_i_274: ;FPSUB
-	dec edi
-	jz rx_finish
-	xor r15, 0b66ca7e0h
-	mov ecx, r15d
-	call rx_read_dataset_f
-	subpd xmm0, xmm4
-	movaps xmm6, xmm0
-	mov eax, r14d
-	xor eax, 06a2b2b5bh
-	and eax, 2047
-	movlpd qword ptr [rsi + rax * 8], xmm6
-
-rx_i_275: ;OR_64
-	dec edi
-	jz rx_finish
-	xor r10, 0788eceb7h
-	mov ecx, r10d
-	call rx_read_dataset_r
-	or rax, r11
-	mov r13, rax
-
-rx_i_276: ;CALL
-	dec edi
-	jz rx_finish
-	xor r9, 0c6ac5edah
-	mov ecx, r9d
-	call rx_read_dataset_r
-	cmp r11d, -1236180570
-	jns short taken_call_276
-	mov rcx, rax
-	mov eax, r12d
-	xor eax, 0b65161a6h
-	and eax, 2047
-	mov qword ptr [rsi + rax * 8], rcx
-	jmp rx_i_277
-taken_call_276:
-	push rax
-	call rx_i_404
-
-rx_i_277: ;IMUL_32
-	dec edi
-	jz rx_finish
-	xor r11, 0c9549789h
-	mov ecx, r11d
-	call rx_read_dataset_r
-	movsxd rcx, eax
-	movsxd rax, r10d
-	imul rax, rcx
-	mov rcx, rax
-	mov eax, r9d
-	xor eax, 01aca20a3h
-	and eax, 2047
-	mov qword ptr [rsi + rax * 8], rcx
-
-rx_i_278: ;FPSUB
-	dec edi
-	jz rx_finish
-	xor r9, 0a2bc66c9h
-	mov ecx, r9d
-	call rx_read_dataset_f
-	subpd xmm0, xmm7
-	movaps xmm4, xmm0
-	mov eax, r12d
-	xor eax, 02d00ad10h
-	and eax, 2047
-	movlpd qword ptr [rsi + rax * 8], xmm4
-
-rx_i_279: ;FPSUB
-	dec edi
-	jz rx_finish
-	xor r15, 0f1a91458h
-	mov ecx, r15d
-	call rx_read_dataset_f
-	subpd xmm0, xmm5
-	movaps xmm9, xmm0
-	mov eax, r9d
-	xor eax, 0475ade01h
-	and eax, 2047
-	movhpd qword ptr [rsi + rax * 8], xmm9
-
-rx_i_280: ;AND_64
-	dec edi
-	jz rx_finish
-	xor r12, 066246b43h
-	mov eax, r12d
-	and eax, 2047
-	mov rax, qword ptr [rsi + rax * 8]
-	and rax, r11
-	mov rcx, rax
-	mov eax, r13d
-	xor eax, 0211aeb00h
-	and eax, 2047
-	mov qword ptr [rsi + rax * 8], rcx
-
-rx_i_281: ;SUB_64
-	dec edi
-	jz rx_finish
-	xor r10, 05a762727h
-	mov ecx, r10d
-	call rx_read_dataset_r
-	sub rax, r10
-	mov rcx, rax
-	mov eax, r11d
-	xor eax, 0f3e6c946h
-	and eax, 32767
-	mov qword ptr [rsi + rax * 8], rcx
-
-rx_i_282: ;SUB_32
-	dec edi
-	jz rx_finish
-	xor r15, 0de1ab603h
-	mov ecx, r15d
-	call rx_read_dataset_r
-	sub eax, 1367326224
-	mov r11, rax
-
-rx_i_283: ;ADD_32
-	dec edi
-	jz rx_finish
-	xor r9, 0df4d084fh
-	mov eax, r9d
-	and eax, 32767
-	mov rax, qword ptr [rsi + rax * 8]
-	add eax, -1156732976
-	mov rcx, rax
-	mov eax, r12d
-	xor eax, 0bb0da7d0h
-	and eax, 2047
-	mov qword ptr [rsi + rax * 8], rcx
-
-rx_i_284: ;FPSUB
-	dec edi
-	jz rx_finish
-	xor r15, 0e68f36ach
-	mov ecx, r15d
-	call rx_read_dataset_f
-	subpd xmm0, xmm6
-	movaps xmm9, xmm0
-	mov eax, r9d
-	xor eax, 0936f2960h
-	and eax, 32767
-	movlpd qword ptr [rsi + rax * 8], xmm9
-
-rx_i_285: ;IMUL_32
-	dec edi
-	jz rx_finish
-	xor r8, 09adb333bh
-	mov ecx, r8d
-	call rx_read_dataset_r
-	movsxd rcx, eax
-	movsxd rax, r8d
-	imul rax, rcx
-	mov r14, rax
-
-rx_i_286: ;FPADD
-	dec edi
-	jz rx_finish
-	xor r14, 082f5e36ch
-	mov eax, r14d
-	and eax, 2047
-	cvtdq2pd xmm0, qword ptr [rsi + rax * 8]
+	; FADD_R f0, a1
 	addpd xmm0, xmm9
-	movaps xmm7, xmm0
-
-rx_i_287: ;OR_64
-	dec edi
-	jz rx_finish
-	xor r11, 049547c9ch
-	mov eax, r11d
-	and eax, 2047
-	mov rax, qword ptr [rsi + rax * 8]
-	or rax, r15
-	mov rcx, rax
-	mov eax, r8d
-	xor eax, 04926c7fah
-	and eax, 2047
-	mov qword ptr [rsi + rax * 8], rcx
-
-rx_i_288: ;MUL_64
-	dec edi
-	jz rx_finish
-	xor r10, 08716ac8bh
-	mov eax, r10d
-	and eax, 2047
-	mov rax, qword ptr [rsi + rax * 8]
-	imul rax, r8
-	mov rcx, rax
-	mov eax, r9d
-	xor eax, 062eafa1bh
-	and eax, 2047
-	mov qword ptr [rsi + rax * 8], rcx
-
-rx_i_289: ;FPDIV
-	dec edi
-	jz rx_finish
-	xor r14, 0efef52b5h
-	mov eax, r14d
-	and eax, 2047
-	cvtdq2pd xmm0, qword ptr [rsi + rax * 8]
-	divpd xmm0, xmm9
-	movaps xmm1, xmm0
-	cmpeqpd xmm1, xmm1
-	andps xmm0, xmm1
-	movaps xmm8, xmm0
-
-rx_i_290: ;FPMUL
-	dec edi
-	jz rx_finish
-	xor r15, 060665748h
-	mov ecx, r15d
-	call rx_read_dataset_f
-	mulpd xmm0, xmm8
-	movaps xmm1, xmm0
-	cmpeqpd xmm1, xmm1
-	andps xmm0, xmm1
-	movaps xmm9, xmm0
-
-rx_i_291: ;RET
-	dec edi
-	jz rx_finish
-	xor r13, 0ddf4bd1ah
-	mov eax, r13d
-	and eax, 2047
-	mov rax, qword ptr [rsi + rax * 8]
-	cmp rsp, rbp
-	je short not_taken_ret_291
-	xor rax, qword ptr [rsp + 8]
-	mov rcx, rax
-	mov eax, r14d
-	xor eax, 0768a9d75h
-	and eax, 32767
-	mov qword ptr [rsi + rax * 8], rcx
-	ret 8
-not_taken_ret_291:
-	mov rcx, rax
-	mov eax, r14d
-	xor eax, 0768a9d75h
-	and eax, 32767
-	mov qword ptr [rsi + rax * 8], rcx
-
-rx_i_292: ;ROR_64
-	dec edi
-	jz rx_finish
-	xor r13, 05a87cc3dh
-	mov eax, r13d
-	and eax, 32767
-	mov rax, qword ptr [rsi + rax * 8]
-	ror rax, 23
-	mov r10, rax
-
-rx_i_293: ;FPSUB
-	dec edi
-	jz rx_finish
-	xor r9, 0c61f4279h
-	mov ecx, r9d
-	call rx_read_dataset_f
-	subpd xmm0, xmm5
-	movaps xmm8, xmm0
-
-rx_i_294: ;RET
-	dec edi
-	jz rx_finish
-	xor r14, 0f3b9d85h
-	mov ecx, r14d
-	call rx_read_dataset_r
-	cmp rsp, rbp
-	je short not_taken_ret_294
-	xor rax, qword ptr [rsp + 8]
-	mov rcx, rax
-	mov eax, r8d
-	xor eax, 0ef8571b7h
-	and eax, 2047
-	mov qword ptr [rsi + rax * 8], rcx
-	ret 8
-not_taken_ret_294:
-	mov rcx, rax
-	mov eax, r8d
-	xor eax, 0ef8571b7h
-	and eax, 2047
-	mov qword ptr [rsi + rax * 8], rcx
-
-rx_i_295: ;FPSUB
-	dec edi
-	jz rx_finish
-	xor r9, 0f42798fdh
-	mov ecx, r9d
-	call rx_read_dataset_f
+	; FSUB_R f0, a0
 	subpd xmm0, xmm8
-	movaps xmm7, xmm0
-
-rx_i_296: ;CALL
-	dec edi
-	jz rx_finish
-	xor r14, 018738758h
-	mov ecx, r14d
-	call rx_read_dataset_r
-	cmp r9d, -207252278
-	jns short taken_call_296
-	mov rcx, rax
+	; IMUL_R r3, r5
+	imul r11, r13
+	; IADD_R r1, r5
+	add r9, r13
+	; IXOR_M r0, L1[r5]
+	mov eax, r13d
+	and eax, 16376
+	xor r8, qword ptr [rsi+rax]
+	; FNEG_R f2
+	xorps xmm2, xmm15
+	; IDIV_C r5, 2577129788
+	mov rax, 15371395512010654233
+	mul r13
+	shr rdx, 31
+	add r13, rdx
+	; COND_R r5, be(r5, -999219370)
+	xor ecx, ecx
+	cmp r13d, -999219370
+	setbe cl
+	add r13, rcx
+	; ISTORE L2[r0], r2
 	mov eax, r8d
-	xor eax, 0f3a594cah
-	and eax, 2047
-	mov qword ptr [rsi + rax * 8], rcx
-	jmp rx_i_297
-taken_call_296:
-	push rax
-	call rx_i_395
-
-rx_i_297: ;ADD_64
-	dec edi
-	jz rx_finish
-	xor r15, 0de3b9d9bh
-	mov eax, r15d
-	and eax, 2047
-	mov rax, qword ptr [rsi + rax * 8]
-	add rax, r10
-	mov r14, rax
-
-rx_i_298: ;FPSUB
-	dec edi
-	jz rx_finish
-	xor r14, 084f53637h
+	and eax, 262136
+	mov qword ptr [rsi+rax], r10
+	; FSUB_R f3, a3
+	subpd xmm3, xmm11
+	; IROR_R r7, r6
+	mov ecx, r14d
+	ror r15, cl
+	; COND_R r6, ab(r4, 1309137534)
+	xor ecx, ecx
+	cmp r12d, 1309137534
+	seta cl
+	add r14, rcx
+	; FMUL_R e3, a0
+	mulpd xmm7, xmm8
+	; COND_M r3, no(L2[r5], 483660199)
+	xor ecx, ecx
+	mov eax, r13d
+	and eax, 262136
+	cmp dword ptr [rsi+rax], 483660199
+	setno cl
+	add r11, rcx
+	; IMUL_R r1, r6
+	imul r9, r14
+	; IADD_RC r7, r2, -1340630490
+	lea r15, [r15+r10-1340630490]
+	; IADD_M r0, L3[1554088]
+	add r8, qword ptr [rsi+1554088]
+	; FMUL_R e2, a3
+	mulpd xmm6, xmm11
+	; IDIV_C r0, 1566192452
+	mov rax, 12646619898641986559
+	mul r8
+	shr rdx, 30
+	add r8, rdx
+	; FADD_R f0, a1
+	addpd xmm0, xmm9
+	; ISWAP_R r6, r0
+	xchg r14, r8
+	; IMUL_9C r4, 1340891034
+	lea r12, [r12+r12*8+1340891034]
+	; IROR_R r7, r2
+	mov ecx, r10d
+	ror r15, cl
+	; FSQRT_R e2
+	sqrtpd xmm6, xmm6
+	; FADD_R f2, a1
+	addpd xmm2, xmm9
+	; IMUL_R r4, r3
+	imul r12, r11
+	; IADD_RC r6, r3, -1584624397
+	lea r14, [r14+r11-1584624397]
+	; IROR_R r1, r7
+	mov ecx, r15d
+	ror r9, cl
+	; IXOR_R r4, r7
+	xor r12, r15
+	; FSWAP_R f0
+	shufpd xmm0, xmm0, 1
+	; FSWAP_R f3
+	shufpd xmm3, xmm3, 1
+	; IROR_R r5, 3
+	ror r13, 3
+	; FADD_R f3, a0
+	addpd xmm3, xmm8
+	; FMUL_R e0, a0
+	mulpd xmm4, xmm8
+	; IADD_R r4, r1
+	add r12, r9
+	; COND_M r4, ge(L1[r6], -1612023931)
+	xor ecx, ecx
 	mov eax, r14d
-	and eax, 32767
-	cvtdq2pd xmm0, qword ptr [rsi + rax * 8]
-	subpd xmm0, xmm7
-	movaps xmm6, xmm0
-
-rx_i_299: ;ADD_64
-	dec edi
-	jz rx_finish
-	xor r12, 042f4897h
-	mov eax, r12d
-	and eax, 2047
-	mov rax, qword ptr [rsi + rax * 8]
-	add rax, 21400308
-	mov rcx, rax
-	mov eax, r12d
-	xor eax, 01468af4h
-	and eax, 2047
-	mov qword ptr [rsi + rax * 8], rcx
-
-rx_i_300: ;FPSUB
-	dec edi
-	jz rx_finish
-	xor r12, 095765693h
-	mov eax, r12d
-	and eax, 2047
-	cvtdq2pd xmm0, qword ptr [rsi + rax * 8]
+	and eax, 16376
+	cmp dword ptr [rsi+rax], -1612023931
+	setge cl
+	add r12, rcx
+	; FSWAP_R e2
+	shufpd xmm6, xmm6, 1
+	; IADD_R r3, r7
+	add r11, r15
+	; COND_R r5, be(r2, -1083018923)
+	xor ecx, ecx
+	cmp r10d, -1083018923
+	setbe cl
+	add r13, rcx
+	; IADD_R r3, r7
+	add r11, r15
+	; ISTORE L2[r6], r0
+	mov eax, r14d
+	and eax, 262136
+	mov qword ptr [rsi+rax], r8
+	; IXOR_R r2, r3
+	xor r10, r11
+	; FMUL_R e2, a3
+	mulpd xmm6, xmm11
+	; FMUL_R e3, a3
+	mulpd xmm7, xmm11
+	; FADD_R f0, a2
+	addpd xmm0, xmm10
+	; ISTORE L1[r5], r1
+	mov eax, r13d
+	and eax, 16376
+	mov qword ptr [rsi+rax], r9
+	; FMUL_R e3, a3
+	mulpd xmm7, xmm11
+	; ISWAP_R r1, r2
+	xchg r9, r10
+	; FSWAP_R e0
+	shufpd xmm4, xmm4, 1
+	; FSUB_R f1, a2
+	subpd xmm1, xmm10
+	; FSUB_R f0, a0
 	subpd xmm0, xmm8
-	movaps xmm2, xmm0
-
-rx_i_301: ;FPMUL
-	dec edi
-	jz rx_finish
-	xor r8, 0a0ec5eech
+	; IROR_R r7, r0
 	mov ecx, r8d
-	call rx_read_dataset_f
-	mulpd xmm0, xmm5
-	movaps xmm1, xmm0
-	cmpeqpd xmm1, xmm1
-	andps xmm0, xmm1
-	movaps xmm7, xmm0
-	mov eax, r15d
-	xor eax, 0433cf2d6h
-	and eax, 32767
-	movlpd qword ptr [rsi + rax * 8], xmm7
-
-rx_i_302: ;ADD_64
-	dec edi
-	jz rx_finish
-	xor r15, 0f6f8c345h
-	mov ecx, r15d
-	call rx_read_dataset_r
-	add rax, r10
-	mov r11, rax
-
-rx_i_303: ;FPADD
-	dec edi
-	jz rx_finish
-	xor r14, 082a3e965h
-	mov eax, r14d
-	and eax, 32767
-	cvtdq2pd xmm0, qword ptr [rsi + rax * 8]
-	addpd xmm0, xmm3
-	movaps xmm9, xmm0
-	mov eax, r9d
-	xor eax, 0bb9ee490h
-	and eax, 2047
-	movlpd qword ptr [rsi + rax * 8], xmm9
-
-rx_i_304: ;MUL_64
-	dec edi
-	jz rx_finish
-	xor r12, 04940c652h
-	mov eax, r12d
-	and eax, 2047
-	mov rax, qword ptr [rsi + rax * 8]
-	imul rax, r15
-	mov r13, rax
-
-rx_i_305: ;MUL_64
-	dec edi
-	jz rx_finish
-	xor r11, 03c6c62b8h
-	mov ecx, r11d
-	call rx_read_dataset_r
-	imul rax, rax, -65873120
-	mov r10, rax
-
-rx_i_306: ;ADD_64
-	dec edi
-	jz rx_finish
-	xor r15, 08b34cdfch
-	mov ecx, r15d
-	call rx_read_dataset_r
-	add rax, r15
-	mov r13, rax
-
-rx_i_307: ;SAR_64
-	dec edi
-	jz rx_finish
-	xor r15, 04c36adb1h
-	mov ecx, r15d
-	call rx_read_dataset_r
-	mov rcx, r8
-	sar rax, cl
-	mov r10, rax
-
-rx_i_308: ;MUL_64
-	dec edi
-	jz rx_finish
-	xor r11, 0a4213b21h
-	mov eax, r11d
-	and eax, 2047
-	mov rax, qword ptr [rsi + rax * 8]
-	imul rax, r13
-	mov r15, rax
-
-rx_i_309: ;IMULH_64
-	dec edi
-	jz rx_finish
-	xor r9, 090c42304h
-	mov eax, r9d
-	and eax, 32767
-	mov rax, qword ptr [rsi + rax * 8]
-	mov rcx, -1652850028
-	imul rcx
-	mov rax, rdx
-	mov rcx, rax
-	mov eax, r9d
-	xor eax, 09d7b8294h
-	and eax, 2047
-	mov qword ptr [rsi + rax * 8], rcx
-
-rx_i_310: ;FPMUL
-	dec edi
-	jz rx_finish
-	xor r9, 0f78e1c8ch
-	mov eax, r9d
-	and eax, 2047
-	cvtdq2pd xmm0, qword ptr [rsi + rax * 8]
-	mulpd xmm0, xmm6
-	movaps xmm1, xmm0
-	cmpeqpd xmm1, xmm1
-	andps xmm0, xmm1
-	movaps xmm7, xmm0
-	mov eax, r15d
-	xor eax, 07c9816c0h
-	and eax, 32767
-	movhpd qword ptr [rsi + rax * 8], xmm7
-
-rx_i_311: ;FPMUL
-	dec edi
-	jz rx_finish
-	xor r8, 0ff8848cfh
-	mov ecx, r8d
-	call rx_read_dataset_f
-	mulpd xmm0, xmm4
-	movaps xmm1, xmm0
-	cmpeqpd xmm1, xmm1
-	andps xmm0, xmm1
-	movaps xmm4, xmm0
-
-rx_i_312: ;MUL_32
-	dec edi
-	jz rx_finish
-	xor r13, 0b18904cdh
-	mov ecx, r13d
-	call rx_read_dataset_r
-	mov ecx, eax
-	mov eax, -1147928648
-	imul rax, rcx
-	mov r10, rax
-
-rx_i_313: ;FPADD
-	dec edi
-	jz rx_finish
-	xor r8, 0a0d0befh
-	mov eax, r8d
-	and eax, 2047
-	cvtdq2pd xmm0, qword ptr [rsi + rax * 8]
-	addpd xmm0, xmm5
-	movaps xmm6, xmm0
-
-rx_i_314: ;IMUL_32
-	dec edi
-	jz rx_finish
-	xor r15, 01e3c65f7h
-	mov ecx, r15d
-	call rx_read_dataset_r
-	movsxd rcx, eax
-	movsxd rax, r9d
-	imul rax, rcx
-	mov rcx, rax
-	mov eax, r9d
-	xor eax, 07fc7f955h
-	and eax, 2047
-	mov qword ptr [rsi + rax * 8], rcx
-
-rx_i_315: ;SHR_64
-	dec edi
-	jz rx_finish
-	xor r9, 02e36ddafh
-	mov eax, r9d
-	and eax, 32767
-	mov rax, qword ptr [rsi + rax * 8]
-	mov rcx, r15
-	shr rax, cl
-	mov r9, rax
-
-rx_i_316: ;RET
-	dec edi
-	jz rx_finish
-	xor r14, 05b0cb5bbh
-	mov ecx, r14d
-	call rx_read_dataset_r
-	cmp rsp, rbp
-	je short not_taken_ret_316
-	xor rax, qword ptr [rsp + 8]
-	mov rcx, rax
-	mov eax, r8d
-	xor eax, 03602c513h
-	and eax, 2047
-	mov qword ptr [rsi + rax * 8], rcx
-	ret 8
-not_taken_ret_316:
-	mov rcx, rax
-	mov eax, r8d
-	xor eax, 03602c513h
-	and eax, 2047
-	mov qword ptr [rsi + rax * 8], rcx
-
-rx_i_317: ;FPADD
-	dec edi
-	jz rx_finish
-	xor r9, 0c74e7415h
-	mov ecx, r9d
-	call rx_read_dataset_f
-	addpd xmm0, xmm7
-	movaps xmm5, xmm0
-
-rx_i_318: ;FPADD
-	dec edi
-	jz rx_finish
-	xor r9, 057621d9ah
-	mov ecx, r9d
-	call rx_read_dataset_f
-	addpd xmm0, xmm3
-	movaps xmm7, xmm0
-
-rx_i_319: ;ROL_64
-	dec edi
-	jz rx_finish
-	xor r13, 08ee02d99h
-	mov ecx, r13d
-	call rx_read_dataset_r
-	mov rcx, r15
-	rol rax, cl
-	mov rcx, rax
-	mov eax, r11d
-	xor eax, 01f931a08h
-	and eax, 2047
-	mov qword ptr [rsi + rax * 8], rcx
-
-rx_i_320: ;FPADD
-	dec edi
-	jz rx_finish
-	xor r15, 013461188h
-	mov ecx, r15d
-	call rx_read_dataset_f
-	addpd xmm0, xmm4
-	movaps xmm2, xmm0
+	ror r15, cl
+	; IADD_RC r5, r4, 283260945
+	lea r13, [r13+r12+283260945]
+	; ISDIV_C r6, -340125851
+	mov rax, -3639652898025032137
+	imul r14
+	xor eax, eax
+	sar rdx, 26
+	sets al
+	add rdx, rax
+	add r14, rdx
+	; ISTORE L2[r2], r3
 	mov eax, r10d
-	xor eax, 02bdc7349h
-	and eax, 2047
-	movlpd qword ptr [rsi + rax * 8], xmm2
-
-rx_i_321: ;IMUL_32
-	dec edi
-	jz rx_finish
-	xor r11, 0a7bae383h
-	mov eax, r11d
-	and eax, 32767
-	mov rax, qword ptr [rsi + rax * 8]
-	movsxd rcx, eax
-	movsxd rax, r9d
-	imul rax, rcx
-	mov rcx, rax
-	mov eax, r12d
-	xor eax, 0f213dach
-	and eax, 2047
-	mov qword ptr [rsi + rax * 8], rcx
-
-rx_i_322: ;RET
-	dec edi
-	jz rx_finish
-	xor r14, 08215399bh
-	mov ecx, r14d
-	call rx_read_dataset_r
-	cmp rsp, rbp
-	je short not_taken_ret_322
-	xor rax, qword ptr [rsp + 8]
-	mov rcx, rax
-	mov eax, r11d
-	xor eax, 054292224h
-	and eax, 2047
-	mov qword ptr [rsi + rax * 8], rcx
-	ret 8
-not_taken_ret_322:
-	mov rcx, rax
-	mov eax, r11d
-	xor eax, 054292224h
-	and eax, 2047
-	mov qword ptr [rsi + rax * 8], rcx
-
-rx_i_323: ;MULH_64
-	dec edi
-	jz rx_finish
-	xor r14, 07b07664bh
-	mov eax, r14d
-	and eax, 32767
-	mov rax, qword ptr [rsi + rax * 8]
-	mov rcx, -696924877
-	mul rcx
-	mov rax, rdx
-	mov rcx, rax
-	mov eax, r14d
-	xor eax, 0d675c533h
-	and eax, 2047
-	mov qword ptr [rsi + rax * 8], rcx
-
-rx_i_324: ;FPSQRT
-	dec edi
-	jz rx_finish
-	xor r9, 0f956baffh
+	and eax, 262136
+	mov qword ptr [rsi+rax], r11
+	; IADD_RC r6, r6, -935765909
+	lea r14, [r14+r14-935765909]
+	; ISDIV_C r3, -701703430
+	mov rax, -7056770631919985199
+	imul r11
+	xor eax, eax
+	sar rdx, 28
+	sets al
+	add rdx, rax
+	add r11, rdx
+	; IXOR_M r3, L2[r1]
 	mov eax, r9d
-	and eax, 2047
-	cvtdq2pd xmm0, qword ptr [rsi + rax * 8]
-	andps xmm0, xmm10
-	sqrtpd xmm0, xmm0
-	movaps xmm9, xmm0
-	mov eax, r9d
-	xor eax, 0944856d4h
-	and eax, 32767
-	movlpd qword ptr [rsi + rax * 8], xmm9
-
-rx_i_325: ;SHL_64
-	dec edi
-	jz rx_finish
-	xor r11, 0708ab9d1h
-	mov eax, r11d
-	and eax, 2047
-	mov rax, qword ptr [rsi + rax * 8]
-	shl rax, 24
-	mov r13, rax
-
-rx_i_326: ;MULH_64
-	dec edi
-	jz rx_finish
-	xor r11, 0d1b27540h
-	mov ecx, r11d
-	call rx_read_dataset_r
-	mov rcx, r8
-	mul rcx
-	mov rax, rdx
-	mov rcx, rax
-	mov eax, r9d
-	xor eax, 0b67623c3h
-	and eax, 2047
-	mov qword ptr [rsi + rax * 8], rcx
-
-rx_i_327: ;AND_64
-	dec edi
-	jz rx_finish
-	xor r9, 09665f98dh
-	mov eax, r9d
-	and eax, 2047
-	mov rax, qword ptr [rsi + rax * 8]
-	and rax, r15
-	mov r12, rax
-
-rx_i_328: ;ROL_64
-	dec edi
-	jz rx_finish
-	xor r12, 0fb9c32adh
-	mov eax, r12d
-	and eax, 2047
-	mov rax, qword ptr [rsi + rax * 8]
-	mov rcx, r13
-	rol rax, cl
-	mov r9, rax
-
-rx_i_329: ;RET
-	dec edi
-	jz rx_finish
-	xor r11, 0e1110623h
-	mov ecx, r11d
-	call rx_read_dataset_r
-	cmp rsp, rbp
-	je short not_taken_ret_329
-	xor rax, qword ptr [rsp + 8]
-	mov r11, rax
-	ret 8
-not_taken_ret_329:
-	mov r11, rax
-
-rx_i_330: ;MUL_32
-	dec edi
-	jz rx_finish
-	xor r9, 0f6a93f19h
-	mov eax, r9d
-	and eax, 32767
-	mov rax, qword ptr [rsi + rax * 8]
-	mov ecx, eax
+	and eax, 262136
+	xor r11, qword ptr [rsi+rax]
+	; FADD_R f2, a1
+	addpd xmm2, xmm9
+	; ISTORE L1[r5], r7
 	mov eax, r13d
-	imul rax, rcx
-	mov rcx, rax
-	mov eax, r11d
-	xor eax, 0af8b7117h
-	and eax, 2047
-	mov qword ptr [rsi + rax * 8], rcx
-
-rx_i_331: ;FPADD
-	dec edi
-	jz rx_finish
-	xor r9, 0bc9bbe4ah
-	mov eax, r9d
-	and eax, 32767
-	cvtdq2pd xmm0, qword ptr [rsi + rax * 8]
-	addpd xmm0, xmm3
-	movaps xmm9, xmm0
-
-rx_i_332: ;FPADD
-	dec edi
-	jz rx_finish
-	xor r12, 0f253cd4eh
-	mov eax, r12d
-	and eax, 2047
-	cvtdq2pd xmm0, qword ptr [rsi + rax * 8]
-	addpd xmm0, xmm6
-	movaps xmm3, xmm0
-	mov eax, r11d
-	xor eax, 0116c919eh
-	and eax, 2047
-	movhpd qword ptr [rsi + rax * 8], xmm3
-
-rx_i_333: ;XOR_64
-	dec edi
-	jz rx_finish
-	xor r14, 0f009758bh
-	mov eax, r14d
-	and eax, 2047
-	mov rax, qword ptr [rsi + rax * 8]
-	xor rax, -175125848
-	mov r11, rax
-
-rx_i_334: ;ADD_32
-	dec edi
-	jz rx_finish
-	xor r8, 0dda04168h
-	mov ecx, r8d
-	call rx_read_dataset_r
-	add eax, r13d
-	mov r8, rax
-
-rx_i_335: ;SUB_64
-	dec edi
-	jz rx_finish
-	xor r15, 03e6cfb73h
-	mov eax, r15d
-	and eax, 32767
-	mov rax, qword ptr [rsi + rax * 8]
-	sub rax, r8
-	mov rcx, rax
-	mov eax, r12d
-	xor eax, 07ffe4218h
-	and eax, 2047
-	mov qword ptr [rsi + rax * 8], rcx
-
-rx_i_336: ;FPADD
-	dec edi
-	jz rx_finish
-	xor r15, 0aea0a435h
-	mov eax, r15d
-	and eax, 2047
-	cvtdq2pd xmm0, qword ptr [rsi + rax * 8]
-	addpd xmm0, xmm2
-	movaps xmm3, xmm0
-
-rx_i_337: ;ADD_32
-	dec edi
-	jz rx_finish
-	xor r8, 03d6c4ab2h
-	mov ecx, r8d
-	call rx_read_dataset_r
-	add eax, r12d
-	mov rcx, rax
-	mov eax, r13d
-	xor eax, 0dab07c39h
-	and eax, 2047
-	mov qword ptr [rsi + rax * 8], rcx
-
-rx_i_338: ;MUL_64
-	dec edi
-	jz rx_finish
-	xor r12, 0d428a742h
-	mov eax, r12d
-	and eax, 2047
-	mov rax, qword ptr [rsi + rax * 8]
-	imul rax, r12
-	mov r11, rax
-
-rx_i_339: ;FPADD
-	dec edi
-	jz rx_finish
-	xor r9, 04596ef73h
-	mov eax, r9d
-	and eax, 2047
-	cvtdq2pd xmm0, qword ptr [rsi + rax * 8]
-	addpd xmm0, xmm6
-	movaps xmm2, xmm0
-
-rx_i_340: ;FPSUB
-	dec edi
-	jz rx_finish
-	xor r15, 0e51629cch
-	mov ecx, r15d
-	call rx_read_dataset_f
-	subpd xmm0, xmm5
-	movaps xmm5, xmm0
-
-rx_i_341: ;MUL_32
-	dec edi
-	jz rx_finish
-	xor r12, 019eb9ea5h
-	mov eax, r12d
-	and eax, 2047
-	mov rax, qword ptr [rsi + rax * 8]
-	mov ecx, eax
-	mov eax, r15d
-	imul rax, rcx
-	mov rcx, rax
-	mov eax, r8d
-	xor eax, 024736405h
-	and eax, 2047
-	mov qword ptr [rsi + rax * 8], rcx
-
-rx_i_342: ;FPMUL
-	dec edi
-	jz rx_finish
-	xor r9, 09ccc7abah
-	mov ecx, r9d
-	call rx_read_dataset_f
-	mulpd xmm0, xmm2
-	movaps xmm1, xmm0
-	cmpeqpd xmm1, xmm1
-	andps xmm0, xmm1
-	movaps xmm3, xmm0
-
-rx_i_343: ;SHR_64
-	dec edi
-	jz rx_finish
-	xor r14, 056f6cf0bh
-	mov eax, r14d
-	and eax, 2047
-	mov rax, qword ptr [rsi + rax * 8]
-	shr rax, 48
-	mov rcx, rax
-	mov eax, r15d
-	xor eax, 0d9a469a9h
-	and eax, 32767
-	mov qword ptr [rsi + rax * 8], rcx
-
-rx_i_344: ;FPMUL
-	dec edi
-	jz rx_finish
-	xor r10, 03ef9bcc4h
+	and eax, 16376
+	mov qword ptr [rsi+rax], r15
+	; FSUB_R f2, a0
+	subpd xmm2, xmm8
+	; FMUL_R e3, a2
+	mulpd xmm7, xmm10
+	; IADD_R r2, r5
+	add r10, r13
+	; IADD_RC r2, r5, -1056770544
+	lea r10, [r10+r13-1056770544]
+	; ISTORE L2[r2], r3
 	mov eax, r10d
-	and eax, 2047
-	cvtdq2pd xmm0, qword ptr [rsi + rax * 8]
-	mulpd xmm0, xmm6
-	movaps xmm1, xmm0
-	cmpeqpd xmm1, xmm1
-	andps xmm0, xmm1
-	movaps xmm5, xmm0
-
-rx_i_345: ;MULH_64
-	dec edi
-	jz rx_finish
-	xor r12, 0bbbcdbach
-	mov ecx, r12d
-	call rx_read_dataset_r
-	mov rcx, r13
-	mul rcx
-	mov rax, rdx
-	mov rcx, rax
-	mov eax, r9d
-	xor eax, 0ef03b0ddh
-	and eax, 2047
-	mov qword ptr [rsi + rax * 8], rcx
-
-rx_i_346: ;XOR_64
-	dec edi
-	jz rx_finish
-	xor r12, 0ae9d1e96h
-	mov ecx, r12d
-	call rx_read_dataset_r
-	xor rax, r15
-	mov rcx, rax
-	mov eax, r13d
-	xor eax, 0ed2d3987h
-	and eax, 32767
-	mov qword ptr [rsi + rax * 8], rcx
-
-rx_i_347: ;ADD_64
-	dec edi
-	jz rx_finish
-	xor r14, 070c34d69h
-	mov eax, r14d
-	and eax, 2047
-	mov rax, qword ptr [rsi + rax * 8]
-	add rax, r10
-	mov r13, rax
-
-rx_i_348: ;FPSUB
-	dec edi
-	jz rx_finish
-	xor r13, 0523ff904h
-	mov eax, r13d
-	and eax, 2047
-	cvtdq2pd xmm0, qword ptr [rsi + rax * 8]
-	subpd xmm0, xmm3
-	movaps xmm9, xmm0
-	mov eax, r9d
-	xor eax, 039c35461h
-	and eax, 2047
-	movlpd qword ptr [rsi + rax * 8], xmm9
-
-rx_i_349: ;XOR_32
-	dec edi
-	jz rx_finish
-	xor r8, 018e0e5ddh
-	mov eax, r8d
-	and eax, 2047
-	mov rax, qword ptr [rsi + rax * 8]
-	xor eax, r15d
-	mov r13, rax
-
-rx_i_350: ;CALL
-	dec edi
-	jz rx_finish
-	xor r9, 09bd050f0h
-	mov eax, r9d
-	and eax, 32767
-	mov rax, qword ptr [rsi + rax * 8]
-	cmp r9d, -980411581
-	jbe short taken_call_350
-	mov rcx, rax
+	and eax, 262136
+	mov qword ptr [rsi+rax], r11
+	; ISMULH_R r7, r1
+	mov rax, r15
+	imul r9
+	mov r15, rdx
+	; IXOR_R r0, r5
+	xor r8, r13
+	; ISTORE L1[r4], r0
 	mov eax, r12d
-	xor eax, 0c5901b43h
-	and eax, 2047
-	mov qword ptr [rsi + rax * 8], rcx
-	jmp rx_i_351
-taken_call_350:
-	push rax
-	call rx_i_352
-
-rx_i_351: ;MUL_64
-	dec edi
-	jz rx_finish
-	xor r11, 0a3a5906fh
-	mov ecx, r11d
-	call rx_read_dataset_r
-	imul rax, r10
-	mov r13, rax
-
-rx_i_352: ;FPADD
-	dec edi
-	jz rx_finish
-	xor r10, 0afc9af2bh
-	mov ecx, r10d
-	call rx_read_dataset_f
-	addpd xmm0, xmm6
-	movaps xmm2, xmm0
-	mov eax, r10d
-	xor eax, 03bf686f2h
-	and eax, 2047
-	movhpd qword ptr [rsi + rax * 8], xmm2
-
-rx_i_353: ;FPMUL
-	dec edi
-	jz rx_finish
-	xor r13, 02e65278bh
-	mov eax, r13d
-	and eax, 2047
-	cvtdq2pd xmm0, qword ptr [rsi + rax * 8]
-	mulpd xmm0, xmm2
-	movaps xmm1, xmm0
-	cmpeqpd xmm1, xmm1
-	andps xmm0, xmm1
-	movaps xmm7, xmm0
-	mov eax, r15d
-	xor eax, 0b3c9f7aeh
-	and eax, 2047
-	movhpd qword ptr [rsi + rax * 8], xmm7
-
-rx_i_354: ;MULH_64
-	dec edi
-	jz rx_finish
-	xor r13, 02412fc10h
-	mov ecx, r13d
-	call rx_read_dataset_r
-	mov rcx, r13
-	mul rcx
-	mov rax, rdx
-	mov r13, rax
-
-rx_i_355: ;MUL_64
-	dec edi
-	jz rx_finish
-	xor r10, 06bd6e65fh
-	mov ecx, r10d
-	call rx_read_dataset_r
-	imul rax, r14
-	mov rcx, rax
-	mov eax, r8d
-	xor eax, 0c1062b3ch
-	and eax, 2047
-	mov qword ptr [rsi + rax * 8], rcx
-
-rx_i_356: ;MUL_64
-	dec edi
-	jz rx_finish
-	xor r10, 01cd85d80h
-	mov ecx, r10d
-	call rx_read_dataset_r
-	imul rax, r10
-	mov r11, rax
-
-rx_i_357: ;ADD_64
-	dec edi
-	jz rx_finish
-	xor r10, 0f7daed36h
-	mov eax, r10d
-	and eax, 2047
-	mov rax, qword ptr [rsi + rax * 8]
-	add rax, 820073637
-	mov r11, rax
-
-rx_i_358: ;DIV_64
-	dec edi
-	jz rx_finish
-	xor r13, 088fa6e5ah
-	mov eax, r13d
-	and eax, 2047
-	mov rax, qword ptr [rsi + rax * 8]
-	mov ecx, 1
-	mov edx, r11d
-	test edx, edx
-	cmovne ecx, edx
-	xor edx, edx
-	div rcx
-	mov r9, rax
-
-rx_i_359: ;FPSUB
-	dec edi
-	jz rx_finish
-	xor r10, 0714fc2cdh
-	mov ecx, r10d
-	call rx_read_dataset_f
+	and eax, 16376
+	mov qword ptr [rsi+rax], r8
+	; INEG_R r5
+	neg r13
+	; FSUB_R f0, a1
 	subpd xmm0, xmm9
-	movaps xmm4, xmm0
-	mov eax, r12d
-	xor eax, 0f16b9be3h
-	and eax, 32767
-	movhpd qword ptr [rsi + rax * 8], xmm4
-
-rx_i_360: ;FPMUL
-	dec edi
-	jz rx_finish
-	xor r10, 0c2d110b5h
-	mov eax, r10d
-	and eax, 2047
-	cvtdq2pd xmm0, qword ptr [rsi + rax * 8]
-	mulpd xmm0, xmm8
-	movaps xmm1, xmm0
-	cmpeqpd xmm1, xmm1
-	andps xmm0, xmm1
-	movaps xmm8, xmm0
-
-rx_i_361: ;FPSQRT
-	dec edi
-	jz rx_finish
-	xor r15, 01d125a7fh
-	mov ecx, r15d
-	call rx_read_dataset_f
-	andps xmm0, xmm10
-	sqrtpd xmm0, xmm0
-	movaps xmm6, xmm0
-	mov eax, r14d
-	xor eax, 0ad0b81f5h
-	and eax, 2047
-	movhpd qword ptr [rsi + rax * 8], xmm6
-
-rx_i_362: ;SUB_64
-	dec edi
-	jz rx_finish
-	xor r9, 0ed8954bdh
-	mov eax, r9d
-	and eax, 2047
-	mov rax, qword ptr [rsi + rax * 8]
-	sub rax, 1082179469
-	mov rcx, rax
-	mov eax, r15d
-	xor eax, 04080bf8dh
-	and eax, 2047
-	mov qword ptr [rsi + rax * 8], rcx
-
-rx_i_363: ;FPMUL
-	dec edi
-	jz rx_finish
-	xor r12, 09f75887bh
-	mov eax, r12d
-	and eax, 32767
-	cvtdq2pd xmm0, qword ptr [rsi + rax * 8]
-	mulpd xmm0, xmm6
-	movaps xmm1, xmm0
-	cmpeqpd xmm1, xmm1
-	andps xmm0, xmm1
-	movaps xmm3, xmm0
-
-rx_i_364: ;MULH_64
-	dec edi
-	jz rx_finish
-	xor r11, 0badaf867h
-	mov ecx, r11d
-	call rx_read_dataset_r
-	mov rcx, r8
-	mul rcx
-	mov rax, rdx
-	mov r8, rax
-
-rx_i_365: ;IMUL_32
-	dec edi
-	jz rx_finish
-	xor r15, 02db4444ah
-	mov eax, r15d
-	and eax, 2047
-	mov rax, qword ptr [rsi + rax * 8]
-	movsxd rcx, eax
-	movsxd rax, r9d
-	imul rax, rcx
-	mov rcx, rax
-	mov eax, r12d
-	xor eax, 0bfd87d37h
-	and eax, 2047
-	mov qword ptr [rsi + rax * 8], rcx
-
-rx_i_366: ;IMUL_32
-	dec edi
-	jz rx_finish
-	xor r12, 0bff7218fh
-	mov eax, r12d
-	and eax, 2047
-	mov rax, qword ptr [rsi + rax * 8]
-	movsxd rcx, eax
-	movsxd rax, r8d
-	imul rax, rcx
-	mov rcx, rax
-	mov eax, r15d
-	xor eax, 0c3d6bcb7h
-	and eax, 2047
-	mov qword ptr [rsi + rax * 8], rcx
-
-rx_i_367: ;FPADD
-	dec edi
-	jz rx_finish
-	xor r9, 04d14cb3ah
+	; IMUL_R r6, -244261682
+	imul r14, -244261682
+	; IMUL_R r1, r0
+	imul r9, r8
+	; IMUL_9C r3, -985744277
+	lea r11, [r11+r11*8-985744277]
+	; IROR_R r2, r1
 	mov ecx, r9d
-	call rx_read_dataset_f
-	addpd xmm0, xmm9
-	movaps xmm4, xmm0
-	mov eax, r12d
-	xor eax, 0ad9b92e8h
-	and eax, 2047
-	movlpd qword ptr [rsi + rax * 8], xmm4
-
-rx_i_368: ;MUL_64
-	dec edi
-	jz rx_finish
-	xor r10, 0a14836bah
-	mov ecx, r10d
-	call rx_read_dataset_r
-	imul rax, r10
-	mov r8, rax
-
-rx_i_369: ;AND_64
-	dec edi
-	jz rx_finish
-	xor r9, 053fe22e2h
-	mov eax, r9d
-	and eax, 32767
-	mov rax, qword ptr [rsi + rax * 8]
-	and rax, r13
-	mov r9, rax
-
-rx_i_370: ;FPSUB
-	dec edi
-	jz rx_finish
-	xor r15, 010e1fb24h
-	mov eax, r15d
-	and eax, 32767
-	cvtdq2pd xmm0, qword ptr [rsi + rax * 8]
-	subpd xmm0, xmm6
-	movaps xmm6, xmm0
-	mov eax, r14d
-	xor eax, 0a120e0edh
-	and eax, 2047
-	movhpd qword ptr [rsi + rax * 8], xmm6
-
-rx_i_371: ;FPADD
-	dec edi
-	jz rx_finish
-	xor r8, 0ebbd5cc9h
-	mov ecx, r8d
-	call rx_read_dataset_f
-	addpd xmm0, xmm9
-	movaps xmm5, xmm0
-	mov eax, r13d
-	xor eax, 0c40fe413h
-	and eax, 2047
-	movhpd qword ptr [rsi + rax * 8], xmm5
-
-rx_i_372: ;ROL_64
-	dec edi
-	jz rx_finish
-	xor r10, 098ab79d7h
-	mov eax, r10d
-	and eax, 2047
-	mov rax, qword ptr [rsi + rax * 8]
-	mov rcx, r13
-	rol rax, cl
-	mov r9, rax
-
-rx_i_373: ;FPDIV
-	dec edi
-	jz rx_finish
-	xor r15, 056438b3h
-	mov eax, r15d
-	and eax, 2047
-	cvtdq2pd xmm0, qword ptr [rsi + rax * 8]
-	divpd xmm0, xmm8
-	movaps xmm1, xmm0
-	cmpeqpd xmm1, xmm1
-	andps xmm0, xmm1
-	movaps xmm4, xmm0
-
-rx_i_374: ;FPMUL
-	dec edi
-	jz rx_finish
-	xor r11, 0dbcce604h
-	mov ecx, r11d
-	call rx_read_dataset_f
-	mulpd xmm0, xmm2
-	movaps xmm1, xmm0
-	cmpeqpd xmm1, xmm1
-	andps xmm0, xmm1
-	movaps xmm2, xmm0
-
-rx_i_375: ;ADD_64
-	dec edi
-	jz rx_finish
-	xor r9, 0edea6200h
-	mov ecx, r9d
-	call rx_read_dataset_r
-	add rax, r15
-	mov rcx, rax
-	mov eax, r12d
-	xor eax, 0ec359be9h
-	and eax, 2047
-	mov qword ptr [rsi + rax * 8], rcx
-
-rx_i_376: ;ADD_64
-	dec edi
-	jz rx_finish
-	xor r14, 05e61b279h
-	mov eax, r14d
-	and eax, 2047
-	mov rax, qword ptr [rsi + rax * 8]
-	add rax, 476136066
-	mov rcx, rax
-	mov eax, r8d
-	xor eax, 01c614282h
-	and eax, 32767
-	mov qword ptr [rsi + rax * 8], rcx
-
-rx_i_377: ;FPSUB
-	dec edi
-	jz rx_finish
-	xor r14, 0fc1fb433h
-	mov ecx, r14d
-	call rx_read_dataset_f
-	subpd xmm0, xmm3
-	movaps xmm7, xmm0
-
-rx_i_378: ;MUL_32
-	dec edi
-	jz rx_finish
-	xor r12, 082aa21ach
-	mov ecx, r12d
-	call rx_read_dataset_r
-	mov ecx, eax
-	mov eax, 547725353
-	imul rax, rcx
-	mov r15, rax
-
-rx_i_379: ;FPADD
-	dec edi
-	jz rx_finish
-	xor r10, 05dba41fbh
-	mov ecx, r10d
-	call rx_read_dataset_f
-	addpd xmm0, xmm9
-	movaps xmm5, xmm0
-	mov eax, r13d
-	xor eax, 03a2dc429h
-	and eax, 2047
-	movlpd qword ptr [rsi + rax * 8], xmm5
-
-rx_i_380: ;MUL_64
-	dec edi
-	jz rx_finish
-	xor r11, 0229e3d6eh
-	mov eax, r11d
-	and eax, 32767
-	mov rax, qword ptr [rsi + rax * 8]
-	imul rax, rax, -1443002912
-	mov rcx, rax
-	mov eax, r13d
-	xor eax, 0a9fd85e0h
-	and eax, 2047
-	mov qword ptr [rsi + rax * 8], rcx
-
-rx_i_381: ;SAR_64
-	dec edi
-	jz rx_finish
-	xor r8, 019816ff9h
-	mov ecx, r8d
-	call rx_read_dataset_r
-	mov rcx, r14
-	sar rax, cl
-	mov r9, rax
-
-rx_i_382: ;FPADD
-	dec edi
-	jz rx_finish
-	xor r14, 036b5b81fh
-	mov ecx, r14d
-	call rx_read_dataset_f
-	addpd xmm0, xmm3
-	movaps xmm3, xmm0
-	mov eax, r11d
-	xor eax, 0a6a2e0b1h
-	and eax, 2047
-	movlpd qword ptr [rsi + rax * 8], xmm3
-
-rx_i_383: ;FPSUB
-	dec edi
-	jz rx_finish
-	xor r15, 05f798ec3h
-	mov eax, r15d
-	and eax, 2047
-	cvtdq2pd xmm0, qword ptr [rsi + rax * 8]
-	subpd xmm0, xmm4
-	movaps xmm5, xmm0
-	mov eax, r13d
-	xor eax, 0c9f5cc22h
-	and eax, 2047
-	movhpd qword ptr [rsi + rax * 8], xmm5
-
-rx_i_384: ;SHR_64
-	dec edi
-	jz rx_finish
-	xor r10, 05b459fd7h
-	mov ecx, r10d
-	call rx_read_dataset_r
-	mov rcx, r11
-	shr rax, cl
-	mov rcx, rax
-	mov eax, r9d
-	xor eax, 054439464h
-	and eax, 2047
-	mov qword ptr [rsi + rax * 8], rcx
-
-rx_i_385: ;MUL_64
-	dec edi
-	jz rx_finish
-	xor r15, 0c91749bbh
-	mov ecx, r15d
-	call rx_read_dataset_r
-	imul rax, r12
-	mov rcx, rax
-	mov eax, r13d
-	xor eax, 0fb9b50b9h
-	and eax, 32767
-	mov qword ptr [rsi + rax * 8], rcx
-
-rx_i_386: ;FPADD
-	dec edi
-	jz rx_finish
-	xor r9, 0575b4bdch
-	mov ecx, r9d
-	call rx_read_dataset_f
+	ror r10, cl
+	; ISUB_R r4, -1079131550
+	sub r12, -1079131550
+	; FNEG_R f3
+	xorps xmm3, xmm15
+	; COND_R r4, ns(r5, -362284631)
+	xor ecx, ecx
+	cmp r13d, -362284631
+	setns cl
+	add r12, rcx
+	; FSUB_R f2, a0
+	subpd xmm2, xmm8
+	; IXOR_R r4, r5
+	xor r12, r13
+	; FNEG_R f1
+	xorps xmm1, xmm15
+	; FADD_R f0, a0
 	addpd xmm0, xmm8
-	movaps xmm9, xmm0
-
-rx_i_387: ;MUL_64
-	dec edi
-	jz rx_finish
-	xor r9, 0d4f7bc6ah
-	mov ecx, r9d
-	call rx_read_dataset_r
-	imul rax, r15
-	mov r9, rax
-
-rx_i_388: ;RET
-	dec edi
-	jz rx_finish
-	xor r8, 08a949356h
-	mov eax, r8d
-	and eax, 2047
-	mov rax, qword ptr [rsi + rax * 8]
-	cmp rsp, rbp
-	je short not_taken_ret_388
-	xor rax, qword ptr [rsp + 8]
-	mov rcx, rax
-	mov eax, r11d
-	xor eax, 0a0985cc2h
-	and eax, 2047
-	mov qword ptr [rsi + rax * 8], rcx
-	ret 8
-not_taken_ret_388:
-	mov rcx, rax
-	mov eax, r11d
-	xor eax, 0a0985cc2h
-	and eax, 2047
-	mov qword ptr [rsi + rax * 8], rcx
-
-rx_i_389: ;CALL
-	dec edi
-	jz rx_finish
-	xor r11, 06531ad2eh
-	mov eax, r11d
-	and eax, 32767
-	mov rax, qword ptr [rsi + rax * 8]
-	cmp r9d, -350609584
-	jge short taken_call_389
-	mov r14, rax
-	jmp rx_i_390
-taken_call_389:
-	push rax
-	call rx_i_421
-
-rx_i_390: ;FPADD
-	dec edi
-	jz rx_finish
-	xor r15, 02914abeah
-	mov eax, r15d
-	and eax, 2047
-	cvtdq2pd xmm0, qword ptr [rsi + rax * 8]
-	addpd xmm0, xmm4
-	movaps xmm3, xmm0
-
-rx_i_391: ;FPADD
-	dec edi
-	jz rx_finish
-	xor r8, 0473a41f0h
-	mov eax, r8d
-	and eax, 2047
-	cvtdq2pd xmm0, qword ptr [rsi + rax * 8]
-	addpd xmm0, xmm3
-	movaps xmm6, xmm0
-
-rx_i_392: ;ROR_64
-	dec edi
-	jz rx_finish
-	xor r14, 01ebc1f0dh
-	mov eax, r14d
-	and eax, 2047
-	mov rax, qword ptr [rsi + rax * 8]
-	ror rax, 0
-	mov rcx, rax
+	; IADD_RC r3, r3, -173615832
+	lea r11, [r11+r11-173615832]
+	; IMUL_R r0, 928402279
+	imul r8, 928402279
+	; ISUB_R r2, r0
+	sub r10, r8
+	; IXOR_R r6, r3
+	xor r14, r11
+	; ISUB_R r2, 2106401471
+	sub r10, 2106401471
+	; FADD_R f0, a2
+	addpd xmm0, xmm10
+	; IMUL_R r4, r6
+	imul r12, r14
+	; IADD_RC r4, r0, -373491513
+	lea r12, [r12+r8-373491513]
+	; ISDIV_C r0, -1739042721
+	mov rax, 7057121271817449967
+	imul r8
+	xor eax, eax
+	sub rdx, r8
+	sar rdx, 30
+	sets al
+	add rdx, rax
+	add r8, rdx
+	; IADD_R r3, r1
+	add r11, r9
+	; ISUB_M r7, L1[r5]
 	mov eax, r13d
-	xor eax, 08c4a0f0dh
-	and eax, 2047
-	mov qword ptr [rsi + rax * 8], rcx
-
-rx_i_393: ;OR_32
-	dec edi
-	jz rx_finish
-	xor r14, 0742e95b1h
-	mov eax, r14d
-	and eax, 2047
-	mov rax, qword ptr [rsi + rax * 8]
-	or eax, 552339548
-	mov rcx, rax
-	mov eax, r13d
-	xor eax, 020ec085ch
-	and eax, 2047
-	mov qword ptr [rsi + rax * 8], rcx
-
-rx_i_394: ;FPADD
-	dec edi
-	jz rx_finish
-	xor r12, 0db885c2ch
-	mov eax, r12d
-	and eax, 2047
-	cvtdq2pd xmm0, qword ptr [rsi + rax * 8]
-	addpd xmm0, xmm9
-	movaps xmm6, xmm0
-
-rx_i_395: ;IDIV_64
-	dec edi
-	jz rx_finish
-	xor r8, 04ae4fe8ch
-	mov eax, r8d
-	and eax, 2047
-	mov rax, qword ptr [rsi + rax * 8]
-	mov edx, r13d
-	cmp edx, -1
-	jne short safe_idiv_395
-	mov rcx, rax
-	rol rcx, 1
-	dec rcx
-	jz short result_idiv_395
-safe_idiv_395:
-	mov ecx, 1
-	test edx, edx
-	cmovne ecx, edx
-	movsxd rcx, ecx
-	cqo
-	idiv rcx
-result_idiv_395:
-	mov r8, rax
-
-rx_i_396: ;FPADD
-	dec edi
-	jz rx_finish
-	xor r10, 07b41862bh
-	mov ecx, r10d
-	call rx_read_dataset_f
-	addpd xmm0, xmm7
-	movaps xmm4, xmm0
-
-rx_i_397: ;MUL_64
-	dec edi
-	jz rx_finish
-	xor r8, 0916f3819h
-	mov ecx, r8d
-	call rx_read_dataset_r
-	imul rax, r12
-	mov rcx, rax
-	mov eax, r10d
-	xor eax, 0146db5dfh
-	and eax, 2047
-	mov qword ptr [rsi + rax * 8], rcx
-
-rx_i_398: ;ROL_64
-	dec edi
-	jz rx_finish
-	xor r8, 04eb6fd2ah
-	mov eax, r8d
-	and eax, 32767
-	mov rax, qword ptr [rsi + rax * 8]
-	rol rax, 44
-	mov rcx, rax
+	and eax, 16376
+	sub r15, qword ptr [rsi+rax]
+	; IMUL_R r1, r2
+	imul r9, r10
+	; ISUB_R r0, 722465116
+	sub r8, 722465116
+	; IADD_RC r0, r0, -1919541169
+	lea r8, [r8+r8-1919541169]
+	; ISUB_M r2, L1[r3]
 	mov eax, r11d
-	xor eax, 0724e7136h
-	and eax, 2047
-	mov qword ptr [rsi + rax * 8], rcx
-
-rx_i_399: ;FPDIV
-	dec edi
-	jz rx_finish
-	xor r11, 0899a98cfh
-	mov ecx, r11d
-	call rx_read_dataset_f
-	divpd xmm0, xmm2
-	movaps xmm1, xmm0
-	cmpeqpd xmm1, xmm1
-	andps xmm0, xmm1
-	movaps xmm6, xmm0
-
-rx_i_400: ;OR_32
-	dec edi
-	jz rx_finish
-	xor r13, 0aae75db6h
-	mov eax, r13d
-	and eax, 32767
-	mov rax, qword ptr [rsi + rax * 8]
-	or eax, r11d
-	mov rcx, rax
-	mov eax, r14d
-	xor eax, 094ac538ch
-	and eax, 32767
-	mov qword ptr [rsi + rax * 8], rcx
-
-rx_i_401: ;FPMUL
-	dec edi
-	jz rx_finish
-	xor r13, 032e81f25h
-	mov eax, r13d
-	and eax, 32767
-	cvtdq2pd xmm0, qword ptr [rsi + rax * 8]
-	mulpd xmm0, xmm4
-	movaps xmm1, xmm0
-	cmpeqpd xmm1, xmm1
-	andps xmm0, xmm1
-	movaps xmm6, xmm0
-	mov eax, r14d
-	xor eax, 03ea60344h
-	and eax, 32767
-	movhpd qword ptr [rsi + rax * 8], xmm6
-
-rx_i_402: ;RET
-	dec edi
-	jz rx_finish
-	xor r9, 0fa1a07ffh
-	mov ecx, r9d
-	call rx_read_dataset_r
-	cmp rsp, rbp
-	je short not_taken_ret_402
-	xor rax, qword ptr [rsp + 8]
-	mov r14, rax
-	ret 8
-not_taken_ret_402:
-	mov r14, rax
-
-rx_i_403: ;IDIV_64
-	dec edi
-	jz rx_finish
-	xor r9, 0e59500f7h
-	mov eax, r9d
-	and eax, 2047
-	mov rax, qword ptr [rsi + rax * 8]
-	mov edx, r12d
-	cmp edx, -1
-	jne short safe_idiv_403
-	mov rcx, rax
-	rol rcx, 1
-	dec rcx
-	jz short result_idiv_403
-safe_idiv_403:
-	mov ecx, 1
-	test edx, edx
-	cmovne ecx, edx
-	movsxd rcx, ecx
-	cqo
-	idiv rcx
-result_idiv_403:
-	mov rcx, rax
+	and eax, 16376
+	sub r10, qword ptr [rsi+rax]
+	; IADD_R r7, -1183581468
+	add r15, -1183581468
+	; FMUL_R e1, a3
+	mulpd xmm5, xmm11
+	; FSUB_R f0, a0
+	subpd xmm0, xmm8
+	; FADD_R f0, a3
+	addpd xmm0, xmm11
+	; IMUL_9C r6, 1241113238
+	lea r14, [r14+r14*8+1241113238]
+	; FSUB_R f3, a3
+	subpd xmm3, xmm11
+	; IADD_M r0, L1[r3]
 	mov eax, r11d
-	xor eax, 01ff394a0h
-	and eax, 2047
-	mov qword ptr [rsi + rax * 8], rcx
-
-rx_i_404: ;MUL_32
-	dec edi
-	jz rx_finish
-	xor r15, 05b8ceb2fh
+	and eax, 16376
+	add r8, qword ptr [rsi+rax]
+	; IROR_R r3, r7
 	mov ecx, r15d
-	call rx_read_dataset_r
-	mov ecx, eax
-	mov eax, r8d
-	imul rax, rcx
-	mov r15, rax
-
-rx_i_405: ;RET
-	dec edi
-	jz rx_finish
-	xor r8, 0f61082a3h
-	mov ecx, r8d
-	call rx_read_dataset_r
-	cmp rsp, rbp
-	je short not_taken_ret_405
-	xor rax, qword ptr [rsp + 8]
-	mov rcx, rax
-	mov eax, r12d
-	xor eax, 06b0af6c1h
-	and eax, 2047
-	mov qword ptr [rsi + rax * 8], rcx
-	ret 8
-not_taken_ret_405:
-	mov rcx, rax
-	mov eax, r12d
-	xor eax, 06b0af6c1h
-	and eax, 2047
-	mov qword ptr [rsi + rax * 8], rcx
-
-rx_i_406: ;FPROUND
-	dec edi
-	jz rx_finish
-	xor r9, 0af6886b7h
+	ror r11, cl
+	; FADD_R f2, a1
+	addpd xmm2, xmm9
+	; IMUL_M r3, L1[r2]
+	mov eax, r10d
+	and eax, 16376
+	imul r11, qword ptr [rsi+rax]
+	; IMUL_9C r7, -2080412544
+	lea r15, [r15+r15*8-2080412544]
+	; IMUL_R r0, r3
+	imul r8, r11
+	; FADD_R f1, a1
+	addpd xmm1, xmm9
+	; IROR_R r6, 21
+	ror r14, 21
+	; FDIV_M e3, L1[r1]
 	mov eax, r9d
-	and eax, 2047
-	mov rax, qword ptr [rsi + rax * 8]
-	mov rcx, rax
-	shl eax, 13
-	and rcx, -2048
-	and eax, 24576
-	cvtsi2sd xmm9, rcx
-	or eax, 40896
-	mov dword ptr [rsp - 8], eax
-	ldmxcsr dword ptr [rsp - 8]
-	mov eax, r9d
-	xor eax, 09862adefh
-	and eax, 32767
-	movlpd qword ptr [rsi + rax * 8], xmm9
-
-rx_i_407: ;FPMUL
-	dec edi
-	jz rx_finish
-	xor r14, 09699566fh
-	mov ecx, r14d
-	call rx_read_dataset_f
-	mulpd xmm0, xmm9
-	movaps xmm1, xmm0
-	cmpeqpd xmm1, xmm1
-	andps xmm0, xmm1
-	movaps xmm8, xmm0
-
-rx_i_408: ;MUL_64
-	dec edi
-	jz rx_finish
-	xor r15, 066e79fa6h
-	mov ecx, r15d
-	call rx_read_dataset_r
-	imul rax, r9
-	mov rcx, rax
-	mov eax, r10d
-	xor eax, 0295004c9h
-	and eax, 2047
-	mov qword ptr [rsi + rax * 8], rcx
-
-rx_i_409: ;MUL_64
-	dec edi
-	jz rx_finish
-	xor r11, 04b6caa9ah
-	mov ecx, r11d
-	call rx_read_dataset_r
-	imul rax, r15
-	mov r8, rax
-
-rx_i_410: ;RET
-	dec edi
-	jz rx_finish
-	xor r15, 0d17f245eh
-	mov eax, r15d
-	and eax, 32767
-	mov rax, qword ptr [rsi + rax * 8]
-	cmp rsp, rbp
-	je short not_taken_ret_410
-	xor rax, qword ptr [rsp + 8]
-	mov r8, rax
-	ret 8
-not_taken_ret_410:
-	mov r8, rax
-
-rx_i_411: ;RET
-	dec edi
-	jz rx_finish
-	xor r12, 0364f10e7h
-	mov eax, r12d
-	and eax, 2047
-	mov rax, qword ptr [rsi + rax * 8]
-	cmp rsp, rbp
-	je short not_taken_ret_411
-	xor rax, qword ptr [rsp + 8]
-	mov r12, rax
-	ret 8
-not_taken_ret_411:
-	mov r12, rax
-
-rx_i_412: ;FPSQRT
-	dec edi
-	jz rx_finish
-	xor r10, 0ac90e7ah
-	mov eax, r10d
-	and eax, 2047
-	cvtdq2pd xmm0, qword ptr [rsi + rax * 8]
-	andps xmm0, xmm10
-	sqrtpd xmm0, xmm0
-	movaps xmm3, xmm0
-	mov eax, r11d
-	xor eax, 0bbd2640ah
-	and eax, 2047
-	movhpd qword ptr [rsi + rax * 8], xmm3
-
-rx_i_413: ;FPDIV
-	dec edi
-	jz rx_finish
-	xor r11, 04b6037abh
-	mov eax, r11d
-	and eax, 2047
-	cvtdq2pd xmm0, qword ptr [rsi + rax * 8]
-	divpd xmm0, xmm2
-	movaps xmm1, xmm0
-	cmpeqpd xmm1, xmm1
-	andps xmm0, xmm1
-	movaps xmm4, xmm0
-
-rx_i_414: ;OR_64
-	dec edi
-	jz rx_finish
-	xor r14, 06c01554dh
-	mov ecx, r14d
-	call rx_read_dataset_r
-	or rax, r8
-	mov rcx, rax
-	mov eax, r10d
-	xor eax, 0e973b3b1h
-	and eax, 2047
-	mov qword ptr [rsi + rax * 8], rcx
-
-rx_i_415: ;DIV_64
-	dec edi
-	jz rx_finish
-	xor r8, 08c3e59a1h
-	mov ecx, r8d
-	call rx_read_dataset_r
-	mov ecx, -538093385
-	xor edx, edx
-	div rcx
-	mov r9, rax
-
-rx_i_416: ;FPSUB
-	dec edi
-	jz rx_finish
-	xor r12, 0f3fafde9h
-	mov eax, r12d
-	and eax, 2047
-	cvtdq2pd xmm0, qword ptr [rsi + rax * 8]
-	subpd xmm0, xmm3
-	movaps xmm5, xmm0
+	and eax, 16376
+	cvtdq2pd xmm12, qword ptr [rsi+rax]
+	andps xmm12, xmm14
+	divpd xmm7, xmm12
+	maxpd xmm7, xmm13
+	; FSUB_R f0, a1
+	subpd xmm0, xmm9
+	; FSWAP_R e1
+	shufpd xmm5, xmm5, 1
+	; COND_M r0, no(L1[r5], -1627153829)
+	xor ecx, ecx
 	mov eax, r13d
-	xor eax, 0f84b5382h
-	and eax, 2047
-	movlpd qword ptr [rsi + rax * 8], xmm5
-
-rx_i_417: ;SUB_64
-	dec edi
-	jz rx_finish
-	xor r10, 03c6481fah
-	mov ecx, r10d
-	call rx_read_dataset_r
-	sub rax, r12
-	mov r10, rax
-
-rx_i_418: ;MULH_64
-	dec edi
-	jz rx_finish
-	xor r10, 02bd61c5fh
-	mov ecx, r10d
-	call rx_read_dataset_r
-	mov rcx, r11
-	mul rcx
-	mov rax, rdx
-	mov r10, rax
-
-rx_i_419: ;XOR_64
-	dec edi
-	jz rx_finish
-	xor r9, 0b6ab9d32h
-	mov ecx, r9d
-	call rx_read_dataset_r
-	xor rax, r14
-	mov rcx, rax
-	mov eax, r14d
-	xor eax, 0beeca8dbh
-	and eax, 32767
-	mov qword ptr [rsi + rax * 8], rcx
-
-rx_i_420: ;FPADD
-	dec edi
-	jz rx_finish
-	xor r9, 0f9690ceah
-	mov eax, r9d
-	and eax, 2047
-	cvtdq2pd xmm0, qword ptr [rsi + rax * 8]
-	addpd xmm0, xmm3
-	movaps xmm9, xmm0
-	mov eax, r9d
-	xor eax, 08f7bb3ech
-	and eax, 2047
-	movlpd qword ptr [rsi + rax * 8], xmm9
-
-rx_i_421: ;RET
-	dec edi
-	jz rx_finish
-	xor r12, 01ada0f39h
+	and eax, 16376
+	cmp dword ptr [rsi+rax], -1627153829
+	setno cl
+	add r8, rcx
+	; FADD_R f2, a3
+	addpd xmm2, xmm11
+	; FSUB_R f1, a2
+	subpd xmm1, xmm10
+	; FSUB_M f1, L1[r4]
 	mov eax, r12d
-	and eax, 2047
-	mov rax, qword ptr [rsi + rax * 8]
-	cmp rsp, rbp
-	je short not_taken_ret_421
-	xor rax, qword ptr [rsp + 8]
-	mov r10, rax
-	ret 8
-not_taken_ret_421:
-	mov r10, rax
-
-rx_i_422: ;IMUL_32
-	dec edi
-	jz rx_finish
-	xor r11, 04dd16ca4h
-	mov ecx, r11d
-	call rx_read_dataset_r
-	movsxd rcx, eax
-	movsxd rax, r10d
-	imul rax, rcx
-	mov r13, rax
-
-rx_i_423: ;MUL_64
-	dec edi
-	jz rx_finish
-	xor r12, 04df5ce05h
-	mov ecx, r12d
-	call rx_read_dataset_r
-	imul rax, r10
-	mov rcx, rax
+	and eax, 16376
+	cvtdq2pd xmm12, qword ptr [rsi+rax]
+	subpd xmm1, xmm12
+	; ISTORE L1[r5], r1
+	mov eax, r13d
+	and eax, 16376
+	mov qword ptr [rsi+rax], r9
+	; ISUB_M r2, L2[r7]
 	mov eax, r15d
-	xor eax, 0a5d40d0ah
-	and eax, 32767
-	mov qword ptr [rsi + rax * 8], rcx
-
-rx_i_424: ;FPADD
-	dec edi
-	jz rx_finish
-	xor r13, 01ad12ce2h
-	mov ecx, r13d
-	call rx_read_dataset_f
-	addpd xmm0, xmm7
-	movaps xmm9, xmm0
-	mov eax, r9d
-	xor eax, 0565ae8aah
-	and eax, 2047
-	movhpd qword ptr [rsi + rax * 8], xmm9
-
-rx_i_425: ;IMUL_32
-	dec edi
-	jz rx_finish
-	xor r8, 0a3c5391dh
-	mov ecx, r8d
-	call rx_read_dataset_r
-	movsxd rcx, eax
-	movsxd rax, r10d
-	imul rax, rcx
-	mov r14, rax
-
-rx_i_426: ;AND_64
-	dec edi
-	jz rx_finish
-	xor r12, 09dd55ba0h
+	and eax, 262136
+	sub r10, qword ptr [rsi+rax]
+	; ISTORE L1[r2], r3
+	mov eax, r10d
+	and eax, 16376
+	mov qword ptr [rsi+rax], r11
+	; FADD_R f0, a3
+	addpd xmm0, xmm11
+	; ISUB_M r1, L1[r7]
+	mov eax, r15d
+	and eax, 16376
+	sub r9, qword ptr [rsi+rax]
+	; IDIV_C r5, 624165039
+	mov rax, 15866829597104432181
+	mul r13
+	shr rdx, 29
+	add r13, rdx
+	; FMUL_R e3, a0
+	mulpd xmm7, xmm8
+	; IMUL_R r5, r4
+	imul r13, r12
+	; FMUL_R e3, a1
+	mulpd xmm7, xmm9
+	; FMUL_R e3, a3
+	mulpd xmm7, xmm11
+	; IXOR_R r0, -2064879200
+	xor r8, -2064879200
+	; FADD_R f1, a3
+	addpd xmm1, xmm11
+	; IADD_M r0, L1[r3]
+	mov eax, r11d
+	and eax, 16376
+	add r8, qword ptr [rsi+rax]
+	; ISMULH_R r7, r3
+	mov rax, r15
+	imul r11
+	mov r15, rdx
+	; IMUL_R r5, -1645503310
+	imul r13, -1645503310
+	; IMUL_R r7, r3
+	imul r15, r11
+	; FMUL_R e2, a2
+	mulpd xmm6, xmm10
+	; IADD_R r6, 1769041191
+	add r14, 1769041191
+	; FSUB_M f1, L1[r4]
 	mov eax, r12d
-	and eax, 2047
-	mov rax, qword ptr [rsi + rax * 8]
-	and rax, r9
-	mov rcx, rax
-	mov eax, r14d
-	xor eax, 0dcca31efh
-	and eax, 2047
-	mov qword ptr [rsi + rax * 8], rcx
-
-rx_i_427: ;MUL_32
-	dec edi
-	jz rx_finish
-	xor r11, 0d6cae9aeh
-	mov eax, r11d
-	and eax, 2047
-	mov rax, qword ptr [rsi + rax * 8]
-	mov ecx, eax
-	mov eax, r11d
-	imul rax, rcx
-	mov rcx, rax
+	and eax, 16376
+	cvtdq2pd xmm12, qword ptr [rsi+rax]
+	subpd xmm1, xmm12
+	; ISTORE L2[r1], r0
 	mov eax, r9d
-	xor eax, 0801190f4h
-	and eax, 2047
-	mov qword ptr [rsi + rax * 8], rcx
-
-rx_i_428: ;RET
-	dec edi
-	jz rx_finish
-	xor r11, 0f807a961h
-	mov eax, r11d
-	and eax, 2047
-	mov rax, qword ptr [rsi + rax * 8]
-	cmp rsp, rbp
-	je short not_taken_ret_428
-	xor rax, qword ptr [rsp + 8]
-	mov rcx, rax
-	mov eax, r8d
-	xor eax, 0e3b86b2fh
-	and eax, 2047
-	mov qword ptr [rsi + rax * 8], rcx
-	ret 8
-not_taken_ret_428:
-	mov rcx, rax
-	mov eax, r8d
-	xor eax, 0e3b86b2fh
-	and eax, 2047
-	mov qword ptr [rsi + rax * 8], rcx
-
-rx_i_429: ;MUL_64
-	dec edi
-	jz rx_finish
-	xor r12, 0650a4102h
-	mov eax, r12d
-	and eax, 2047
-	mov rax, qword ptr [rsi + rax * 8]
-	imul rax, rax, 1990438276
-	mov r15, rax
-
-rx_i_430: ;FPADD
-	dec edi
-	jz rx_finish
-	xor r14, 019cc0e5h
+	and eax, 262136
+	mov qword ptr [rsi+rax], r8
+	; FNEG_R f0
+	xorps xmm0, xmm15
+	; FMUL_R e0, a3
+	mulpd xmm4, xmm11
+	; IMUL_R r2, r7
+	imul r10, r15
+	; IADD_R r5, r1
+	add r13, r9
+	; IROR_R r3, r6
 	mov ecx, r14d
-	call rx_read_dataset_f
+	ror r11, cl
+	; FADD_R f0, a0
 	addpd xmm0, xmm8
-	movaps xmm5, xmm0
-	mov eax, r13d
-	xor eax, 058891433h
-	and eax, 32767
-	movhpd qword ptr [rsi + rax * 8], xmm5
-
-rx_i_431: ;FPADD
-	dec edi
-	jz rx_finish
-	xor r12, 0ed17ab58h
+	; FMUL_R e1, a2
+	mulpd xmm5, xmm10
+	; FNEG_R f3
+	xorps xmm3, xmm15
+	; FADD_R f1, a1
+	addpd xmm1, xmm9
+	; IMULH_R r2, r5
+	mov rax, r10
+	mul r13
+	mov r10, rdx
+	; ISTORE L1[r4], r0
+	mov eax, r12d
+	and eax, 16376
+	mov qword ptr [rsi+rax], r8
+	; ISWAP_R r7, r0
+	xchg r15, r8
+	; FSWAP_R f0
+	shufpd xmm0, xmm0, 1
+	; ISUB_R r2, r0
+	sub r10, r8
+	; FSUB_R f1, a3
+	subpd xmm1, xmm11
+	; ISUB_M r5, L1[r3]
+	mov eax, r11d
+	and eax, 16376
+	sub r13, qword ptr [rsi+rax]
+	; IXOR_R r7, r0
+	xor r15, r8
+	; IMUL_R r4, r1
+	imul r12, r9
+	; IADD_RC r0, r2, -1102648763
+	lea r8, [r8+r10-1102648763]
+	; FMUL_R e3, a3
+	mulpd xmm7, xmm11
+	; IXOR_R r4, r1
+	xor r12, r9
+	; IXOR_R r6, r0
+	xor r14, r8
+	; FSQRT_R e1
+	sqrtpd xmm5, xmm5
+	; IMUL_M r6, L2[r1]
+	mov eax, r9d
+	and eax, 262136
+	imul r14, qword ptr [rsi+rax]
+	; ISMULH_M r5, L3[353552]
+	mov rax, r13
+	imul qword ptr [rsi+353552]
+	mov r13, rdx
+	; ISUB_M r1, L1[r6]
+	mov eax, r14d
+	and eax, 16376
+	sub r9, qword ptr [rsi+rax]
+	; FADD_R f0, a3
+	addpd xmm0, xmm11
+	; FMUL_R e3, a3
+	mulpd xmm7, xmm11
+	; FSUB_M f3, L2[r7]
+	mov eax, r15d
+	and eax, 262136
+	cvtdq2pd xmm12, qword ptr [rsi+rax]
+	subpd xmm3, xmm12
+	; IMUL_R r0, r2
+	imul r8, r10
+	; FMUL_R e1, a0
+	mulpd xmm5, xmm8
+	; COND_R r5, sg(r3, -1392293091)
+	xor ecx, ecx
+	cmp r11d, -1392293091
+	sets cl
+	add r13, rcx
+	; FSWAP_R e3
+	shufpd xmm7, xmm7, 1
+	; IMUL_R r7, r4
+	imul r15, r12
+	; IXOR_R r7, r5
+	xor r15, r13
+	; FMUL_R e3, a3
+	mulpd xmm7, xmm11
+	; IMUL_R r4, r3
+	imul r12, r11
+	; FADD_M f1, L1[r1]
+	mov eax, r9d
+	and eax, 16376
+	cvtdq2pd xmm12, qword ptr [rsi+rax]
+	addpd xmm1, xmm12
+	; IMUL_R r5, r0
+	imul r13, r8
+	; ISUB_R r7, r0
+	sub r15, r8
+	; IADD_M r5, L1[r4]
+	mov eax, r12d
+	and eax, 16376
+	add r13, qword ptr [rsi+rax]
+	; IADD_R r6, r2
+	add r14, r10
+	; FMUL_R e1, a1
+	mulpd xmm5, xmm9
+	; IADD_M r2, L3[1073640]
+	add r10, qword ptr [rsi+1073640]
+	; IMUL_R r3, r2
+	imul r11, r10
+	; IXOR_R r1, r0
+	xor r9, r8
+	; IROR_R r7, r4
 	mov ecx, r12d
-	call rx_read_dataset_f
-	addpd xmm0, xmm5
-	movaps xmm5, xmm0
-	mov eax, r13d
-	xor eax, 019fe4aadh
-	and eax, 2047
-	movlpd qword ptr [rsi + rax * 8], xmm5
-
-rx_i_432: ;SUB_64
-	dec edi
-	jz rx_finish
-	xor r10, 01c3b321fh
-	mov ecx, r10d
-	call rx_read_dataset_r
-	sub rax, r10
-	mov r8, rax
-
-rx_i_433: ;ADD_32
-	dec edi
-	jz rx_finish
-	xor r13, 0bbb88499h
-	mov ecx, r13d
-	call rx_read_dataset_r
-	add eax, r12d
-	mov rcx, rax
+	ror r15, cl
+	; FSUB_R f1, a1
+	subpd xmm1, xmm9
+	; IMUL_R r7, r5
+	imul r15, r13
+	; ISUB_R r1, 866191482
+	sub r9, 866191482
+	; IMUL_M r7, L1[r4]
 	mov eax, r12d
-	xor eax, 04722b36fh
-	and eax, 2047
-	mov qword ptr [rsi + rax * 8], rcx
-
-rx_i_434: ;FPSQRT
-	dec edi
-	jz rx_finish
-	xor r13, 0167edabdh
-	mov ecx, r13d
-	call rx_read_dataset_f
-	andps xmm0, xmm10
-	sqrtpd xmm0, xmm0
-	movaps xmm9, xmm0
-	mov eax, r9d
-	xor eax, 08c1cfc74h
-	and eax, 2047
-	movlpd qword ptr [rsi + rax * 8], xmm9
-
-rx_i_435: ;MUL_64
-	dec edi
-	jz rx_finish
-	xor r15, 0b940480ah
-	mov eax, r15d
-	and eax, 2047
-	mov rax, qword ptr [rsi + rax * 8]
-	imul rax, r15
-	mov rcx, rax
-	mov eax, r9d
-	xor eax, 0758605ffh
-	and eax, 2047
-	mov qword ptr [rsi + rax * 8], rcx
-
-rx_i_436: ;FPADD
-	dec edi
-	jz rx_finish
-	xor r15, 0bfc3ca8bh
-	mov eax, r15d
-	and eax, 2047
-	cvtdq2pd xmm0, qword ptr [rsi + rax * 8]
-	addpd xmm0, xmm2
-	movaps xmm7, xmm0
-	mov eax, r15d
-	xor eax, 0bfa76c43h
-	and eax, 2047
-	movhpd qword ptr [rsi + rax * 8], xmm7
-
-rx_i_437: ;FPDIV
-	dec edi
-	jz rx_finish
-	xor r8, 098a6bcf7h
-	mov ecx, r8d
-	call rx_read_dataset_f
-	divpd xmm0, xmm3
-	movaps xmm1, xmm0
-	cmpeqpd xmm1, xmm1
-	andps xmm0, xmm1
-	movaps xmm8, xmm0
-
-rx_i_438: ;FPMUL
-	dec edi
-	jz rx_finish
-	xor r10, 0325b38ebh
-	mov ecx, r10d
-	call rx_read_dataset_f
-	mulpd xmm0, xmm9
-	movaps xmm1, xmm0
-	cmpeqpd xmm1, xmm1
-	andps xmm0, xmm1
-	movaps xmm4, xmm0
-
-rx_i_439: ;XOR_32
-	dec edi
-	jz rx_finish
-	xor r13, 05e807e81h
-	mov ecx, r13d
-	call rx_read_dataset_r
-	xor eax, r15d
-	mov rcx, rax
-	mov eax, r10d
-	xor eax, 0b28e6e01h
-	and eax, 2047
-	mov qword ptr [rsi + rax * 8], rcx
-
-rx_i_440: ;RET
-	dec edi
-	jz rx_finish
-	xor r10, 062f83728h
-	mov eax, r10d
-	and eax, 2047
-	mov rax, qword ptr [rsi + rax * 8]
-	cmp rsp, rbp
-	je short not_taken_ret_440
-	xor rax, qword ptr [rsp + 8]
-	mov r9, rax
-	ret 8
-not_taken_ret_440:
-	mov r9, rax
-
-rx_i_441: ;ADD_64
-	dec edi
-	jz rx_finish
-	xor r14, 0d18ec075h
-	mov ecx, r14d
-	call rx_read_dataset_r
-	add rax, 529736748
-	mov rcx, rax
-	mov eax, r9d
-	xor eax, 01f93242ch
-	and eax, 32767
-	mov qword ptr [rsi + rax * 8], rcx
-
-rx_i_442: ;CALL
-	dec edi
-	jz rx_finish
-	xor r14, 0a53dd1bh
-	mov ecx, r14d
-	call rx_read_dataset_r
-	cmp r15d, 799523062
-	jbe short taken_call_442
-	mov rcx, rax
-	mov eax, r11d
-	xor eax, 02fa7c0f6h
-	and eax, 2047
-	mov qword ptr [rsi + rax * 8], rcx
-	jmp rx_i_443
-taken_call_442:
-	push rax
-	call rx_i_9
-
-rx_i_443: ;RET
-	dec edi
-	jz rx_finish
-	xor r14, 0232d1285h
-	mov eax, r14d
-	and eax, 2047
-	mov rax, qword ptr [rsi + rax * 8]
-	cmp rsp, rbp
-	je short not_taken_ret_443
-	xor rax, qword ptr [rsp + 8]
-	mov rcx, rax
-	mov eax, r9d
-	xor eax, 04f71c419h
-	and eax, 2047
-	mov qword ptr [rsi + rax * 8], rcx
-	ret 8
-not_taken_ret_443:
-	mov rcx, rax
-	mov eax, r9d
-	xor eax, 04f71c419h
-	and eax, 2047
-	mov qword ptr [rsi + rax * 8], rcx
-
-rx_i_444: ;FPMUL
-	dec edi
-	jz rx_finish
-	xor r8, 042455dd8h
-	mov eax, r8d
-	and eax, 2047
-	cvtdq2pd xmm0, qword ptr [rsi + rax * 8]
-	mulpd xmm0, xmm7
-	movaps xmm1, xmm0
-	cmpeqpd xmm1, xmm1
-	andps xmm0, xmm1
-	movaps xmm5, xmm0
-	mov eax, r13d
-	xor eax, 0ce416070h
-	and eax, 2047
-	movlpd qword ptr [rsi + rax * 8], xmm5
-
-rx_i_445: ;ADD_64
-	dec edi
-	jz rx_finish
-	xor r13, 09ae009b2h
-	mov ecx, r13d
-	call rx_read_dataset_r
-	add rax, r11
-	mov rcx, rax
-	mov eax, r9d
-	xor eax, 084d1f575h
-	and eax, 2047
-	mov qword ptr [rsi + rax * 8], rcx
-
-rx_i_446: ;MUL_32
-	dec edi
-	jz rx_finish
-	xor r12, 01734708eh
-	mov ecx, r12d
-	call rx_read_dataset_r
-	mov ecx, eax
-	mov eax, r15d
-	imul rax, rcx
-	mov rcx, rax
-	mov eax, r13d
-	xor eax, 03166163h
-	and eax, 2047
-	mov qword ptr [rsi + rax * 8], rcx
-
-rx_i_447: ;FPSUB
-	dec edi
-	jz rx_finish
-	xor r8, 01596d0e8h
-	mov ecx, r8d
-	call rx_read_dataset_f
-	subpd xmm0, xmm7
-	movaps xmm5, xmm0
-	mov eax, r13d
-	xor eax, 0b384d4afh
-	and eax, 2047
-	movhpd qword ptr [rsi + rax * 8], xmm5
-
-rx_i_448: ;FPSUB
-	dec edi
-	jz rx_finish
-	xor r9, 0390cfdb0h
-	mov eax, r9d
-	and eax, 2047
-	cvtdq2pd xmm0, qword ptr [rsi + rax * 8]
-	subpd xmm0, xmm3
-	movaps xmm9, xmm0
-
-rx_i_449: ;ROR_64
-	dec edi
-	jz rx_finish
-	xor r8, 04f27744bh
-	mov eax, r8d
-	and eax, 2047
-	mov rax, qword ptr [rsi + rax * 8]
-	ror rax, 28
-	mov r8, rax
-
-rx_i_450: ;ROL_64
-	dec edi
-	jz rx_finish
-	xor r8, 04e2c76ffh
-	mov ecx, r8d
-	call rx_read_dataset_r
-	mov rcx, r12
-	rol rax, cl
-	mov rcx, rax
-	mov eax, r8d
-	xor eax, 0f6de92ach
-	and eax, 2047
-	mov qword ptr [rsi + rax * 8], rcx
-
-rx_i_451: ;ADD_64
-	dec edi
-	jz rx_finish
-	xor r8, 0c4d99ac9h
-	mov ecx, r8d
-	call rx_read_dataset_r
-	add rax, -287502157
-	mov r8, rax
-
-rx_i_452: ;RET
-	dec edi
-	jz rx_finish
-	xor r13, 040130b88h
-	mov ecx, r13d
-	call rx_read_dataset_r
-	cmp rsp, rbp
-	je short not_taken_ret_452
-	xor rax, qword ptr [rsp + 8]
-	mov rcx, rax
-	mov eax, r11d
-	xor eax, 0e27dea25h
-	and eax, 2047
-	mov qword ptr [rsi + rax * 8], rcx
-	ret 8
-not_taken_ret_452:
-	mov rcx, rax
-	mov eax, r11d
-	xor eax, 0e27dea25h
-	and eax, 2047
-	mov qword ptr [rsi + rax * 8], rcx
-
-rx_i_453: ;IMULH_64
-	dec edi
-	jz rx_finish
-	xor r11, 0a2096aa4h
-	mov ecx, r11d
-	call rx_read_dataset_r
-	mov rcx, r14
-	imul rcx
-	mov rax, rdx
-	mov r8, rax
-
-rx_i_454: ;FPADD
-	dec edi
-	jz rx_finish
-	xor r13, 081314291h
-	mov eax, r13d
-	and eax, 32767
-	cvtdq2pd xmm0, qword ptr [rsi + rax * 8]
-	addpd xmm0, xmm9
-	movaps xmm4, xmm0
-	mov eax, r12d
-	xor eax, 07e41c60fh
-	and eax, 2047
-	movhpd qword ptr [rsi + rax * 8], xmm4
-
-rx_i_455: ;XOR_64
-	dec edi
-	jz rx_finish
-	xor r8, 059263cdbh
-	mov eax, r8d
-	and eax, 2047
-	mov rax, qword ptr [rsi + rax * 8]
-	xor rax, r9
-	mov r8, rax
-
-rx_i_456: ;OR_32
-	dec edi
-	jz rx_finish
-	xor r9, 010e8fe6h
-	mov ecx, r9d
-	call rx_read_dataset_r
-	or eax, r11d
-	mov rcx, rax
-	mov eax, r9d
-	xor eax, 017f52c3fh
-	and eax, 32767
-	mov qword ptr [rsi + rax * 8], rcx
-
-rx_i_457: ;SUB_64
-	dec edi
-	jz rx_finish
-	xor r9, 09de1a3efh
-	mov ecx, r9d
-	call rx_read_dataset_r
-	sub rax, r10
-	mov rcx, rax
-	mov eax, r10d
-	xor eax, 058584136h
-	and eax, 2047
-	mov qword ptr [rsi + rax * 8], rcx
-
-rx_i_458: ;ROL_64
-	dec edi
-	jz rx_finish
-	xor r11, 05c79df6eh
-	mov ecx, r11d
-	call rx_read_dataset_r
-	rol rax, 22
-	mov r14, rax
-
-rx_i_459: ;MUL_64
-	dec edi
-	jz rx_finish
-	xor r9, 0346f46adh
-	mov ecx, r9d
-	call rx_read_dataset_r
-	imul rax, rax, 381354340
-	mov rcx, rax
-	mov eax, r13d
-	xor eax, 016bb0164h
-	and eax, 2047
-	mov qword ptr [rsi + rax * 8], rcx
-
-rx_i_460: ;SUB_64
-	dec edi
-	jz rx_finish
-	xor r11, 098ab71fch
-	mov ecx, r11d
-	call rx_read_dataset_r
-	sub rax, r14
-	mov rcx, rax
-	mov eax, r12d
-	xor eax, 0eb453a97h
-	and eax, 2047
-	mov qword ptr [rsi + rax * 8], rcx
-
-rx_i_461: ;SHR_64
-	dec edi
-	jz rx_finish
-	xor r11, 0c814e926h
-	mov ecx, r11d
-	call rx_read_dataset_r
-	mov rcx, r13
-	shr rax, cl
-	mov rcx, rax
-	mov eax, r12d
-	xor eax, 062ef5b99h
-	and eax, 2047
-	mov qword ptr [rsi + rax * 8], rcx
-
-rx_i_462: ;ADD_64
-	dec edi
-	jz rx_finish
-	xor r10, 0c64b4a9eh
-	mov eax, r10d
-	and eax, 2047
-	mov rax, qword ptr [rsi + rax * 8]
-	add rax, -1734323376
-	mov r15, rax
-
-rx_i_463: ;SUB_64
-	dec edi
-	jz rx_finish
-	xor r9, 08c29341h
-	mov eax, r9d
-	and eax, 2047
-	mov rax, qword ptr [rsi + rax * 8]
-	sub rax, r15
-	mov r10, rax
-
-rx_i_464: ;MUL_64
-	dec edi
-	jz rx_finish
-	xor r12, 06ff587fdh
-	mov ecx, r12d
-	call rx_read_dataset_r
-	imul rax, r15
-	mov rcx, rax
-	mov eax, r13d
-	xor eax, 0d0673df8h
-	and eax, 2047
-	mov qword ptr [rsi + rax * 8], rcx
-
-rx_i_465: ;FPADD
-	dec edi
-	jz rx_finish
-	xor r12, 0b62c0003h
-	mov eax, r12d
-	and eax, 2047
-	cvtdq2pd xmm0, qword ptr [rsi + rax * 8]
-	addpd xmm0, xmm5
-	movaps xmm2, xmm0
-
-rx_i_466: ;IMUL_32
-	dec edi
-	jz rx_finish
-	xor r13, 05c541c42h
-	mov eax, r13d
-	and eax, 2047
-	mov rax, qword ptr [rsi + rax * 8]
-	movsxd rcx, eax
-	mov rax, 282682508
-	imul rax, rcx
-	mov r9, rax
-
-rx_i_467: ;FPADD
-	dec edi
-	jz rx_finish
-	xor r8, 0cbb33f81h
-	mov eax, r8d
-	and eax, 2047
-	cvtdq2pd xmm0, qword ptr [rsi + rax * 8]
-	addpd xmm0, xmm9
-	movaps xmm8, xmm0
-
-rx_i_468: ;IDIV_64
-	dec edi
-	jz rx_finish
-	xor r8, 091044dc3h
-	mov ecx, r8d
-	call rx_read_dataset_r
-	mov edx, -13394825
-	cmp edx, -1
-	jne short safe_idiv_468
-	mov rcx, rax
-	rol rcx, 1
-	dec rcx
-	jz short result_idiv_468
-safe_idiv_468:
-	mov ecx, 1
-	test edx, edx
-	cmovne ecx, edx
-	movsxd rcx, ecx
-	cqo
-	idiv rcx
-result_idiv_468:
-	mov rcx, rax
-	mov eax, r8d
-	xor eax, 0ff339c77h
-	and eax, 32767
-	mov qword ptr [rsi + rax * 8], rcx
-
-rx_i_469: ;MUL_32
-	dec edi
-	jz rx_finish
-	xor r9, 0c0186beh
-	mov ecx, r9d
-	call rx_read_dataset_r
-	mov ecx, eax
-	mov eax, 294019485
-	imul rax, rcx
-	mov rcx, rax
-	mov eax, r9d
-	xor eax, 01186619dh
-	and eax, 2047
-	mov qword ptr [rsi + rax * 8], rcx
-
-rx_i_470: ;XOR_32
-	dec edi
-	jz rx_finish
-	xor r14, 090849e3eh
-	mov ecx, r14d
-	call rx_read_dataset_r
-	xor eax, r11d
-	mov rcx, rax
-	mov eax, r14d
-	xor eax, 090d56b4ch
-	and eax, 32767
-	mov qword ptr [rsi + rax * 8], rcx
-
-rx_i_471: ;IMUL_32
-	dec edi
-	jz rx_finish
-	xor r14, 0cedba9b6h
-	mov eax, r14d
-	and eax, 32767
-	mov rax, qword ptr [rsi + rax * 8]
-	movsxd rcx, eax
-	movsxd rax, r13d
-	imul rax, rcx
-	mov r14, rax
-
-rx_i_472: ;CALL
-	dec edi
-	jz rx_finish
-	xor r9, 038f4b9d6h
-	mov eax, r9d
-	and eax, 2047
-	mov rax, qword ptr [rsi + rax * 8]
-	cmp r10d, 1738497427
-	jl short taken_call_472
-	mov r10, rax
-	jmp rx_i_473
-taken_call_472:
-	push rax
-	call rx_i_8
-
-rx_i_473: ;MUL_64
-	dec edi
-	jz rx_finish
-	xor r14, 01fb7637dh
-	mov eax, r14d
-	and eax, 2047
-	mov rax, qword ptr [rsi + rax * 8]
-	imul rax, rax, -751043211
-	mov r12, rax
-
-rx_i_474: ;CALL
-	dec edi
-	jz rx_finish
-	xor r9, 0b5c0b4d4h
-	mov ecx, r9d
-	call rx_read_dataset_r
-	cmp r15d, -233120543
-	jo short taken_call_474
-	mov r15, rax
-	jmp rx_i_475
-taken_call_474:
-	push rax
-	call rx_i_69
-
-rx_i_475: ;FPSUB
-	dec edi
-	jz rx_finish
-	xor r10, 0910dcdeeh
-	mov eax, r10d
-	and eax, 2047
-	cvtdq2pd xmm0, qword ptr [rsi + rax * 8]
-	subpd xmm0, xmm9
-	movaps xmm7, xmm0
-
-rx_i_476: ;FPSUB
-	dec edi
-	jz rx_finish
-	xor r8, 07ab3b5a4h
-	mov eax, r8d
-	and eax, 2047
-	cvtdq2pd xmm0, qword ptr [rsi + rax * 8]
-	subpd xmm0, xmm2
-	movaps xmm9, xmm0
-
-rx_i_477: ;FPADD
-	dec edi
-	jz rx_finish
-	xor r12, 07a29ec63h
-	mov eax, r12d
-	and eax, 32767
-	cvtdq2pd xmm0, qword ptr [rsi + rax * 8]
-	addpd xmm0, xmm9
-	movaps xmm6, xmm0
-	mov eax, r14d
-	xor eax, 0e81fc7a6h
-	and eax, 2047
-	movlpd qword ptr [rsi + rax * 8], xmm6
-
-rx_i_478: ;MUL_64
-	dec edi
-	jz rx_finish
-	xor r14, 02d3d7e7fh
-	mov ecx, r14d
-	call rx_read_dataset_r
-	imul rax, r10
-	mov r12, rax
-
-rx_i_479: ;MUL_64
-	dec edi
-	jz rx_finish
-	xor r12, 09b49c793h
-	mov eax, r12d
-	and eax, 2047
-	mov rax, qword ptr [rsi + rax * 8]
-	imul rax, r14
-	mov rcx, rax
-	mov eax, r13d
-	xor eax, 0c42735ech
-	and eax, 2047
-	mov qword ptr [rsi + rax * 8], rcx
-
-rx_i_480: ;FPSUB
-	dec edi
-	jz rx_finish
-	xor r9, 0a9cc4f01h
-	mov eax, r9d
-	and eax, 2047
-	cvtdq2pd xmm0, qword ptr [rsi + rax * 8]
-	subpd xmm0, xmm4
-	movaps xmm6, xmm0
-
-rx_i_481: ;DIV_64
-	dec edi
-	jz rx_finish
-	xor r14, 0225ba1f9h
-	mov eax, r14d
-	and eax, 2047
-	mov rax, qword ptr [rsi + rax * 8]
-	mov ecx, 1
-	mov edx, r13d
-	test edx, edx
-	cmovne ecx, edx
-	xor edx, edx
-	div rcx
-	mov r12, rax
-
-rx_i_482: ;XOR_64
-	dec edi
-	jz rx_finish
-	xor r14, 044a0f592h
-	mov ecx, r14d
-	call rx_read_dataset_r
-	xor rax, r12
-	mov r11, rax
-
-rx_i_483: ;FPADD
-	dec edi
-	jz rx_finish
-	xor r11, 07f71f219h
-	mov ecx, r11d
-	call rx_read_dataset_f
-	addpd xmm0, xmm6
-	movaps xmm6, xmm0
-
-rx_i_484: ;ROL_64
-	dec edi
-	jz rx_finish
-	xor r12, 07027bacdh
-	mov ecx, r12d
-	call rx_read_dataset_r
-	rol rax, 37
-	mov r11, rax
-
-rx_i_485: ;CALL
-	dec edi
-	jz rx_finish
-	xor r13, 03a04647h
-	mov ecx, r13d
-	call rx_read_dataset_r
-	cmp r8d, 554879918
-	jno short taken_call_485
-	mov rcx, rax
-	mov eax, r15d
-	xor eax, 02112cbaeh
-	and eax, 32767
-	mov qword ptr [rsi + rax * 8], rcx
-	jmp rx_i_486
-taken_call_485:
-	push rax
-	call rx_i_58
-
-rx_i_486: ;ADD_64
-	dec edi
-	jz rx_finish
-	xor r15, 0ad072937h
-	mov eax, r15d
-	and eax, 2047
-	mov rax, qword ptr [rsi + rax * 8]
-	add rax, 942846898
-	mov rcx, rax
-	mov eax, r8d
-	xor eax, 03832b3b2h
-	and eax, 32767
-	mov qword ptr [rsi + rax * 8], rcx
-
-rx_i_487: ;SUB_64
-	dec edi
-	jz rx_finish
-	xor r11, 07f78ad34h
-	mov ecx, r11d
-	call rx_read_dataset_r
-	sub rax, -333279706
-	mov r11, rax
-
-rx_i_488: ;IMULH_64
-	dec edi
-	jz rx_finish
-	xor r12, 0d8b1788eh
-	mov eax, r12d
-	and eax, 32767
-	mov rax, qword ptr [rsi + rax * 8]
-	mov rcx, 297357073
-	imul rcx
-	mov rax, rdx
-	mov r12, rax
-
-rx_i_489: ;CALL
-	dec edi
-	jz rx_finish
-	xor r10, 0b2ec9f3ah
-	mov eax, r10d
-	and eax, 2047
-	mov rax, qword ptr [rsi + rax * 8]
-	cmp r15d, -1127175870
-	jge short taken_call_489
-	mov rcx, rax
-	mov eax, r8d
-	xor eax, 0bcd0a942h
-	and eax, 2047
-	mov qword ptr [rsi + rax * 8], rcx
-	jmp rx_i_490
-taken_call_489:
-	push rax
-	call rx_i_75
-
-rx_i_490: ;FPADD
-	dec edi
-	jz rx_finish
-	xor r11, 015c7f598h
-	mov ecx, r11d
-	call rx_read_dataset_f
-	addpd xmm0, xmm9
-	movaps xmm7, xmm0
-
-rx_i_491: ;FPADD
-	dec edi
-	jz rx_finish
-	xor r8, 0902da6bdh
-	mov ecx, r8d
-	call rx_read_dataset_f
-	addpd xmm0, xmm9
-	movaps xmm7, xmm0
-	mov eax, r15d
-	xor eax, 0b0f0fca4h
-	and eax, 32767
-	movhpd qword ptr [rsi + rax * 8], xmm7
-
-rx_i_492: ;OR_64
-	dec edi
-	jz rx_finish
-	xor r9, 0491090d9h
-	mov ecx, r9d
-	call rx_read_dataset_r
-	or rax, r9
-	mov r12, rax
-
-rx_i_493: ;FPSUB
-	dec edi
-	jz rx_finish
-	xor r8, 09de81282h
-	mov eax, r8d
-	and eax, 2047
-	cvtdq2pd xmm0, qword ptr [rsi + rax * 8]
-	subpd xmm0, xmm9
-	movaps xmm4, xmm0
-
-rx_i_494: ;MUL_32
-	dec edi
-	jz rx_finish
-	xor r10, 0b0d50e46h
-	mov ecx, r10d
-	call rx_read_dataset_r
-	mov ecx, eax
-	mov eax, r11d
-	imul rax, rcx
-	mov r14, rax
-
-rx_i_495: ;FPMUL
-	dec edi
-	jz rx_finish
-	xor r11, 0e276cad1h
-	mov eax, r11d
-	and eax, 32767
-	cvtdq2pd xmm0, qword ptr [rsi + rax * 8]
-	mulpd xmm0, xmm2
-	movaps xmm1, xmm0
-	cmpeqpd xmm1, xmm1
-	andps xmm0, xmm1
-	movaps xmm8, xmm0
-
-rx_i_496: ;OR_64
-	dec edi
-	jz rx_finish
-	xor r14, 0fe757b73h
-	mov ecx, r14d
-	call rx_read_dataset_r
-	or rax, -359802064
-	mov r9, rax
-
-rx_i_497: ;FPDIV
-	dec edi
-	jz rx_finish
-	xor r8, 08d25742eh
-	mov eax, r8d
-	and eax, 2047
-	cvtdq2pd xmm0, qword ptr [rsi + rax * 8]
-	divpd xmm0, xmm3
-	movaps xmm1, xmm0
-	cmpeqpd xmm1, xmm1
-	andps xmm0, xmm1
-	movaps xmm8, xmm0
-
-rx_i_498: ;FPMUL
-	dec edi
-	jz rx_finish
-	xor r15, 0e066fd15h
-	mov eax, r15d
-	and eax, 2047
-	cvtdq2pd xmm0, qword ptr [rsi + rax * 8]
-	mulpd xmm0, xmm9
-	movaps xmm1, xmm0
-	cmpeqpd xmm1, xmm1
-	andps xmm0, xmm1
-	movaps xmm8, xmm0
-	mov eax, r8d
-	xor eax, 09dc5a1f9h
-	and eax, 2047
-	movlpd qword ptr [rsi + rax * 8], xmm8
-
-rx_i_499: ;IMUL_32
-	dec edi
-	jz rx_finish
-	xor r12, 08925556bh
-	mov eax, r12d
-	and eax, 2047
-	mov rax, qword ptr [rsi + rax * 8]
-	movsxd rcx, eax
-	mov rax, -1795485757
-	imul rax, rcx
-	mov r8, rax
-
-rx_i_500: ;CALL
-	dec edi
-	jz rx_finish
-	xor r10, 04bc870ebh
-	mov eax, r10d
-	and eax, 32767
-	mov rax, qword ptr [rsi + rax * 8]
-	cmp r13d, 1243939650
-	jl short taken_call_500
-	mov rcx, rax
-	mov eax, r10d
-	xor eax, 04a250342h
-	and eax, 32767
-	mov qword ptr [rsi + rax * 8], rcx
-	jmp rx_i_501
-taken_call_500:
-	push rax
-	call rx_i_511
-
-rx_i_501: ;SHR_64
-	dec edi
-	jz rx_finish
-	xor r8, 07d46c503h
-	mov ecx, r8d
-	call rx_read_dataset_r
-	mov rcx, r10
-	shr rax, cl
-	mov rcx, rax
-	mov eax, r12d
-	xor eax, 03e22874bh
-	and eax, 2047
-	mov qword ptr [rsi + rax * 8], rcx
-
-rx_i_502: ;RET
-	dec edi
-	jz rx_finish
-	xor r10, 09e70b20ch
-	mov ecx, r10d
-	call rx_read_dataset_r
-	cmp rsp, rbp
-	je short not_taken_ret_502
-	xor rax, qword ptr [rsp + 8]
-	mov rcx, rax
-	mov eax, r9d
-	xor eax, 08d85312h
-	and eax, 2047
-	mov qword ptr [rsi + rax * 8], rcx
-	ret 8
-not_taken_ret_502:
-	mov rcx, rax
-	mov eax, r9d
-	xor eax, 08d85312h
-	and eax, 2047
-	mov qword ptr [rsi + rax * 8], rcx
-
-rx_i_503: ;FPSUB
-	dec edi
-	jz rx_finish
-	xor r13, 0442e4850h
-	mov eax, r13d
-	and eax, 32767
-	cvtdq2pd xmm0, qword ptr [rsi + rax * 8]
-	subpd xmm0, xmm2
-	movaps xmm9, xmm0
-	mov eax, r9d
-	xor eax, 080465282h
-	and eax, 2047
-	movlpd qword ptr [rsi + rax * 8], xmm9
-
-rx_i_504: ;FPADD
-	dec edi
-	jz rx_finish
-	xor r13, 099d48347h
-	mov eax, r13d
-	and eax, 2047
-	cvtdq2pd xmm0, qword ptr [rsi + rax * 8]
-	addpd xmm0, xmm9
-	movaps xmm4, xmm0
-	mov eax, r12d
-	xor eax, 0be8cbb18h
-	and eax, 32767
-	movhpd qword ptr [rsi + rax * 8], xmm4
-
-rx_i_505: ;FPMUL
-	dec edi
-	jz rx_finish
-	xor r12, 032c0a28ah
-	mov ecx, r12d
-	call rx_read_dataset_f
-	mulpd xmm0, xmm4
-	movaps xmm1, xmm0
-	cmpeqpd xmm1, xmm1
-	andps xmm0, xmm1
-	movaps xmm8, xmm0
-	mov eax, r8d
-	xor eax, 021b54eaeh
-	and eax, 32767
-	movlpd qword ptr [rsi + rax * 8], xmm8
-
-rx_i_506: ;FPMUL
-	dec edi
-	jz rx_finish
-	xor r9, 0a973d58ch
-	mov ecx, r9d
-	call rx_read_dataset_f
-	mulpd xmm0, xmm9
-	movaps xmm1, xmm0
-	cmpeqpd xmm1, xmm1
-	andps xmm0, xmm1
-	movaps xmm3, xmm0
-
-rx_i_507: ;RET
-	dec edi
-	jz rx_finish
-	xor r10, 0d3b7165ch
-	mov eax, r10d
-	and eax, 2047
-	mov rax, qword ptr [rsi + rax * 8]
-	cmp rsp, rbp
-	je short not_taken_ret_507
-	xor rax, qword ptr [rsp + 8]
-	mov r14, rax
-	ret 8
-not_taken_ret_507:
-	mov r14, rax
-
-rx_i_508: ;RET
-	dec edi
-	jz rx_finish
-	xor r13, 0da34d818h
-	mov ecx, r13d
-	call rx_read_dataset_r
-	cmp rsp, rbp
-	je short not_taken_ret_508
-	xor rax, qword ptr [rsp + 8]
-	mov r8, rax
-	ret 8
-not_taken_ret_508:
-	mov r8, rax
-
-rx_i_509: ;CALL
-	dec edi
-	jz rx_finish
-	xor r11, 01b2873f2h
-	mov eax, r11d
-	and eax, 2047
-	mov rax, qword ptr [rsi + rax * 8]
-	cmp r8d, 1826115244
-	jno short taken_call_509
-	mov r10, rax
-	jmp rx_i_510
-taken_call_509:
-	push rax
-	call rx_i_42
-
-rx_i_510: ;FPSUB
-	dec edi
-	jz rx_finish
-	xor r8, 0db65513ch
-	mov eax, r8d
-	and eax, 2047
-	cvtdq2pd xmm0, qword ptr [rsi + rax * 8]
-	subpd xmm0, xmm2
-	movaps xmm9, xmm0
-
-rx_i_511: ;ROL_64
-	dec edi
-	jz rx_finish
-	xor r11, 02bd79286h
-	mov ecx, r11d
-	call rx_read_dataset_r
-	mov rcx, r10
-	rol rax, cl
-	mov r11, rax
-
-	jmp rx_i_0
+	and eax, 16376
+	imul r15, qword ptr [rsi+rax]
+	; FADD_R f2, a0
+	addpd xmm2, xmm8
+	; IADD_R r2, r1
+	add r10, r9
diff --git a/src/softAes.h b/src/softAes.h
index 1f7bd99..e4b675e 100644
--- a/src/softAes.h
+++ b/src/softAes.h
@@ -26,3 +26,13 @@ __m128i soft_aeskeygenassist(__m128i key, uint8_t rcon);
 __m128i soft_aesenc(__m128i in, __m128i key);
 
 __m128i soft_aesdec(__m128i in, __m128i key);
+
+template<bool soft>
+inline __m128i aesenc(__m128i in, __m128i key) {
+	return soft ? soft_aesenc(in, key) : _mm_aesenc_si128(in, key);
+}
+
+template<bool soft>
+inline __m128i aesdec(__m128i in, __m128i key) {
+	return soft ? soft_aesdec(in, key) : _mm_aesdec_si128(in, key);
+}
\ No newline at end of file
diff --git a/src/squareHash.S b/src/squareHash.S
new file mode 100644
index 0000000..4cd3b54
--- /dev/null
+++ b/src/squareHash.S
@@ -0,0 +1,17 @@
+.intel_syntax noprefix
+#if defined(__APPLE__)
+.text
+#else
+.section .text
+#endif
+#if defined(__WIN32__) || defined(__APPLE__)
+#define DECL(x) _##x
+#else
+#define DECL(x) x
+#endif
+
+.global DECL(squareHash)
+
+DECL(squareHash):
+	mov rcx, rdi
+	#include "asm/squareHash.inc"
diff --git a/src/squareHash.asm b/src/squareHash.asm
new file mode 100644
index 0000000..4433719
--- /dev/null
+++ b/src/squareHash.asm
@@ -0,0 +1,9 @@
+PUBLIC squareHash
+
+.code
+
+squareHash PROC
+	include asm/squareHash.inc
+squareHash ENDP
+
+END
\ No newline at end of file
diff --git a/src/squareHash.h b/src/squareHash.h
new file mode 100644
index 0000000..05939d7
--- /dev/null
+++ b/src/squareHash.h
@@ -0,0 +1,76 @@
+/*
+Copyright (c) 2019 tevador
+
+This file is part of RandomX.
+
+RandomX is free software: you can redistribute it and/or modify
+it under the terms of the GNU General Public License as published by
+the Free Software Foundation, either version 3 of the License, or
+(at your option) any later version.
+
+RandomX is distributed in the hope that it will be useful,
+but WITHOUT ANY WARRANTY; without even the implied warranty of
+MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+GNU General Public License for more details.
+
+You should have received a copy of the GNU General Public License
+along with RandomX.  If not, see <http://www.gnu.org/licenses/>.
+*/
+
+/*
+	Based on the original idea by SChernykh:
+	https://github.com/SChernykh/xmr-stak-cpu/issues/1#issuecomment-414336613
+*/
+
+#include <stdint.h>
+
+#if !defined(_M_X64) && !defined(__x86_64__)
+
+typedef struct {
+	uint64_t lo;
+	uint64_t hi;
+} uint128_t;
+
+#define LO(x) ((x)&0xffffffff)
+#define HI(x) ((x)>>32)
+static inline uint128_t square128(uint64_t x) {
+	uint64_t xh = HI(x), xl = LO(x);
+	uint64_t xll = xl * xl;
+	uint64_t xlh = xl * xh;
+	uint64_t xhh = xh * xh;
+	uint64_t m1 = 2 * LO(xlh) + HI(xll);
+	uint64_t m2 = 2 * HI(xlh) + LO(xhh) + HI(m1);
+	uint64_t m3 = HI(xhh) + HI(m2);
+
+	uint128_t x2;
+
+	x2.lo = (m1 << 32) + LO(xll);
+	x2.hi = (m3 << 32) + LO(m2);
+
+	return x2;
+}
+#undef LO
+#undef HI
+
+static inline uint64_t squareHash(uint64_t x) {
+	x += 1613783669344650115;
+	for (int i = 0; i < 42; ++i) {
+		uint128_t x2 = square128(x);
+		x = x2.lo - x2.hi;
+	}
+	return x;
+}
+
+#else
+
+#if defined(__cplusplus)
+extern "C" {
+#endif
+
+uint64_t squareHash(uint64_t);
+
+#if defined(__cplusplus)
+}
+#endif
+
+#endif
\ No newline at end of file
diff --git a/src/t1ha/t1ha.h b/src/t1ha/t1ha.h
deleted file mode 100644
index 6b56e16..0000000
--- a/src/t1ha/t1ha.h
+++ /dev/null
@@ -1,723 +0,0 @@
-/*
- *  Copyright (c) 2016-2018 Positive Technologies, https://www.ptsecurity.com,
- *  Fast Positive Hash.
- *
- *  Portions Copyright (c) 2010-2018 Leonid Yuriev <leo@yuriev.ru>,
- *  The 1Hippeus project (t1h).
- *
- *  This software is provided 'as-is', without any express or implied
- *  warranty. In no event will the authors be held liable for any damages
- *  arising from the use of this software.
- *
- *  Permission is granted to anyone to use this software for any purpose,
- *  including commercial applications, and to alter it and redistribute it
- *  freely, subject to the following restrictions:
- *
- *  1. The origin of this software must not be misrepresented; you must not
- *     claim that you wrote the original software. If you use this software
- *     in a product, an acknowledgement in the product documentation would be
- *     appreciated but is not required.
- *  2. Altered source versions must be plainly marked as such, and must not be
- *     misrepresented as being the original software.
- *  3. This notice may not be removed or altered from any source distribution.
- */
-
-/*
- * t1ha = { Fast Positive Hash, aka "Позитивный Хэш" }
- * by [Positive Technologies](https://www.ptsecurity.ru)
- *
- * Briefly, it is a 64-bit Hash Function:
- *  1. Created for 64-bit little-endian platforms, in predominantly for x86_64,
- *     but portable and without penalties it can run on any 64-bit CPU.
- *  2. In most cases up to 15% faster than City64, xxHash, mum-hash, metro-hash
- *     and all others portable hash-functions (which do not use specific
- *     hardware tricks).
- *  3. Not suitable for cryptography.
- *
- * The Future will Positive. Всё будет хорошо.
- *
- * ACKNOWLEDGEMENT:
- * The t1ha was originally developed by Leonid Yuriev (Леонид Юрьев)
- * for The 1Hippeus project - zerocopy messaging in the spirit of Sparta!
- */
-
-#pragma once
-
-/*****************************************************************************
- *
- * PLEASE PAY ATTENTION TO THE FOLLOWING NOTES
- * about macros definitions which controls t1ha behaviour and/or performance.
- *
- *
- * 1) T1HA_SYS_UNALIGNED_ACCESS = Defines the system/platform/CPU/architecture
- *                                abilities for unaligned data access.
- *
- *    By default, when the T1HA_SYS_UNALIGNED_ACCESS not defined,
- *    it will defined on the basis hardcoded knowledge about of capabilities
- *    of most common CPU architectures. But you could override this
- *    default behavior when build t1ha library itself:
- *
- *      // To disable unaligned access at all.
- *      #define T1HA_SYS_UNALIGNED_ACCESS 0
- *
- *      // To enable unaligned access, but indicate that it significally slow.
- *      #define T1HA_SYS_UNALIGNED_ACCESS 1
- *
- *      // To enable unaligned access, and indicate that it effecient.
- *      #define T1HA_SYS_UNALIGNED_ACCESS 2
- *
- *
- * 2) T1HA_USE_FAST_ONESHOT_READ = Controls the data reads at the end of buffer.
- *
- *    When defined to non-zero, t1ha will use 'one shot' method for reading
- *    up to 8 bytes at the end of data. In this case just the one 64-bit read
- *    will be performed even when the available less than 8 bytes.
- *
- *    This is little bit faster that switching by length of data tail.
- *    Unfortunately this will triggering a false-positive alarms from Valgrind,
- *    AddressSanitizer and other similar tool.
- *
- *    By default, t1ha defines it to 1, but you could override this
- *    default behavior when build t1ha library itself:
- *
- *      // For little bit faster and small code.
- *      #define T1HA_USE_FAST_ONESHOT_READ 1
- *
- *      // For calmness if doubt.
- *      #define T1HA_USE_FAST_ONESHOT_READ 0
- *
- *
- * 3) T1HA0_RUNTIME_SELECT = Controls choice fastest function in runtime.
- *
- *    t1ha library offers the t1ha0() function as the fastest for current CPU.
- *    But actual CPU's features/capabilities and may be significantly different,
- *    especially on x86 platform. Therefore, internally, t1ha0() may require
- *    dynamic dispatching for choice best implementation.
- *
- *    By default, t1ha enables such runtime choice and (may be) corresponding
- *    indirect calls if it reasonable, but you could override this default
- *    behavior when build t1ha library itself:
- *
- *      // To enable runtime choice of fastest implementation.
- *      #define T1HA0_RUNTIME_SELECT 1
- *
- *      // To disable runtime choice of fastest implementation.
- *      #define T1HA0_RUNTIME_SELECT 0
- *
- *    When T1HA0_RUNTIME_SELECT is nonzero the t1ha0_resolve() function could
- *    be used to get actual t1ha0() implementation address at runtime. This is
- *    useful for two cases:
- *      - calling by local pointer-to-function usually is little
- *        bit faster (less overhead) than via a PLT thru the DSO boundary.
- *      - GNU Indirect functions (see below) don't supported by environment
- *        and calling by t1ha0_funcptr is not available and/or expensive.
- *
- * 4) T1HA_USE_INDIRECT_FUNCTIONS = Controls usage of GNU Indirect functions.
- *
- *    In continue of T1HA0_RUNTIME_SELECT the T1HA_USE_INDIRECT_FUNCTIONS
- *    controls usage of ELF indirect functions feature. In general, when
- *    available, this reduces overhead of indirect function's calls though
- *    a DSO-bundary (https://sourceware.org/glibc/wiki/GNU_IFUNC).
- *
- *    By default, t1ha engage GNU Indirect functions when it available
- *    and useful, but you could override this default behavior when build
- *    t1ha library itself:
- *
- *      // To enable use of GNU ELF Indirect functions.
- *      #define T1HA_USE_INDIRECT_FUNCTIONS 1
- *
- *      // To disable use of GNU ELF Indirect functions. This may be useful
- *      // if the actual toolchain or the system's loader don't support ones.
- *      #define T1HA_USE_INDIRECT_FUNCTIONS 0
- *
- * 5) T1HA0_AESNI_AVAILABLE = Controls AES-NI detection and dispatching on x86.
- *
- *    In continue of T1HA0_RUNTIME_SELECT the T1HA0_AESNI_AVAILABLE controls
- *    detection and usage of AES-NI CPU's feature. On the other hand, this
- *    requires compiling parts of t1ha library with certain properly options,
- *    and could be difficult or inconvenient in some cases.
- *
- *    By default, t1ha engade AES-NI for t1ha0() on the x86 platform, but
- *    you could override this default behavior when build t1ha library itself:
- *
- *      // To disable detection and usage of AES-NI instructions for t1ha0().
- *      // This may be useful when you unable to build t1ha library properly
- *      // or known that AES-NI will be unavailable at the deploy.
- *      #define T1HA0_AESNI_AVAILABLE 0
- *
- *      // To force detection and usage of AES-NI instructions for t1ha0(),
- *      // but I don't known reasons to anybody would need this.
- *      #define T1HA0_AESNI_AVAILABLE 1
- *
- * 6) T1HA0_DISABLED, T1HA1_DISABLED, T1HA2_DISABLED = Controls availability of
- *    t1ha functions.
- *
- *    In some cases could be useful to import/use only few of t1ha functions
- *    or just the one. So, this definitions allows disable corresponding parts
- *    of t1ha library.
- *
- *      // To disable t1ha0(), t1ha0_32le(), t1ha0_32be() and all AES-NI.
- *      #define T1HA0_DISABLED
- *
- *      // To disable t1ha1_le() and t1ha1_be().
- *      #define T1HA1_DISABLED
- *
- *      // To disable t1ha2_atonce(), t1ha2_atonce128() and so on.
- *      #define T1HA2_DISABLED
- *
- *****************************************************************************/
-
-#define T1HA_VERSION_MAJOR 2
-#define T1HA_VERSION_MINOR 1
-#define T1HA_VERSION_RELEASE 0
-
-#ifndef __has_attribute
-#define __has_attribute(x) (0)
-#endif
-
-#ifndef __has_include
-#define __has_include(x) (0)
-#endif
-
-#ifndef __GNUC_PREREQ
-#if defined(__GNUC__) && defined(__GNUC_MINOR__)
-#define __GNUC_PREREQ(maj, min)                                                \
-  ((__GNUC__ << 16) + __GNUC_MINOR__ >= ((maj) << 16) + (min))
-#else
-#define __GNUC_PREREQ(maj, min) 0
-#endif
-#endif /* __GNUC_PREREQ */
-
-#ifndef __CLANG_PREREQ
-#ifdef __clang__
-#define __CLANG_PREREQ(maj, min)                                               \
-  ((__clang_major__ << 16) + __clang_minor__ >= ((maj) << 16) + (min))
-#else
-#define __CLANG_PREREQ(maj, min) (0)
-#endif
-#endif /* __CLANG_PREREQ */
-
-#ifndef __LCC_PREREQ
-#ifdef __LCC__
-#define __LCC_PREREQ(maj, min)                                                 \
-  ((__LCC__ << 16) + __LCC_MINOR__ >= ((maj) << 16) + (min))
-#else
-#define __LCC_PREREQ(maj, min) (0)
-#endif
-#endif /* __LCC_PREREQ */
-
-/*****************************************************************************/
-
-#ifdef _MSC_VER
-/* Avoid '16' bytes padding added after data member 't1ha_context::total'
- * and other warnings from std-headers if warning-level > 3. */
-#pragma warning(push, 3)
-#endif
-
-#if defined(__cplusplus) && __cplusplus >= 201103L
-#include <climits>
-#include <cstddef>
-#include <cstdint>
-#else
-#include <limits.h>
-#include <stddef.h>
-#include <stdint.h>
-#endif
-
-/*****************************************************************************/
-
-#if defined(i386) || defined(__386) || defined(__i386) || defined(__i386__) || \
-    defined(i486) || defined(__i486) || defined(__i486__) ||                   \
-    defined(i586) | defined(__i586) || defined(__i586__) || defined(i686) ||   \
-    defined(__i686) || defined(__i686__) || defined(_M_IX86) ||                \
-    defined(_X86_) || defined(__THW_INTEL__) || defined(__I86__) ||            \
-    defined(__INTEL__) || defined(__x86_64) || defined(__x86_64__) ||          \
-    defined(__amd64__) || defined(__amd64) || defined(_M_X64) ||               \
-    defined(_M_AMD64) || defined(__IA32__) || defined(__INTEL__)
-#ifndef __ia32__
-/* LY: define neutral __ia32__ for x86 and x86-64 archs */
-#define __ia32__ 1
-#endif /* __ia32__ */
-#if !defined(__amd64__) && (defined(__x86_64) || defined(__x86_64__) ||        \
-                            defined(__amd64) || defined(_M_X64))
-/* LY: define trusty __amd64__ for all AMD64/x86-64 arch */
-#define __amd64__ 1
-#endif /* __amd64__ */
-#endif /* all x86 */
-
-#if !defined(__BYTE_ORDER__) || !defined(__ORDER_LITTLE_ENDIAN__) ||           \
-    !defined(__ORDER_BIG_ENDIAN__)
-
-/* *INDENT-OFF* */
-/* clang-format off */
-
-#if defined(__GLIBC__) || defined(__GNU_LIBRARY__) || defined(__ANDROID__) ||  \
-    defined(HAVE_ENDIAN_H) || __has_include(<endian.h>)
-#include <endian.h>
-#elif defined(__APPLE__) || defined(__MACH__) || defined(__OpenBSD__) ||       \
-    defined(HAVE_MACHINE_ENDIAN_H) || __has_include(<machine/endian.h>)
-#include <machine/endian.h>
-#elif defined(HAVE_SYS_ISA_DEFS_H) || __has_include(<sys/isa_defs.h>)
-#include <sys/isa_defs.h>
-#elif (defined(HAVE_SYS_TYPES_H) && defined(HAVE_SYS_ENDIAN_H)) ||             \
-    (__has_include(<sys/types.h>) && __has_include(<sys/endian.h>))
-#include <sys/endian.h>
-#include <sys/types.h>
-#elif defined(__bsdi__) || defined(__DragonFly__) || defined(__FreeBSD__) ||   \
-    defined(__NETBSD__) || defined(__NetBSD__) ||                              \
-    defined(HAVE_SYS_PARAM_H) || __has_include(<sys/param.h>)
-#include <sys/param.h>
-#endif /* OS */
-
-/* *INDENT-ON* */
-/* clang-format on */
-
-#if defined(__BYTE_ORDER) && defined(__LITTLE_ENDIAN) && defined(__BIG_ENDIAN)
-#define __ORDER_LITTLE_ENDIAN__ __LITTLE_ENDIAN
-#define __ORDER_BIG_ENDIAN__ __BIG_ENDIAN
-#define __BYTE_ORDER__ __BYTE_ORDER
-#elif defined(_BYTE_ORDER) && defined(_LITTLE_ENDIAN) && defined(_BIG_ENDIAN)
-#define __ORDER_LITTLE_ENDIAN__ _LITTLE_ENDIAN
-#define __ORDER_BIG_ENDIAN__ _BIG_ENDIAN
-#define __BYTE_ORDER__ _BYTE_ORDER
-#else
-#define __ORDER_LITTLE_ENDIAN__ 1234
-#define __ORDER_BIG_ENDIAN__ 4321
-
-#if defined(__LITTLE_ENDIAN__) ||                                              \
-    (defined(_LITTLE_ENDIAN) && !defined(_BIG_ENDIAN)) ||                      \
-    defined(__ARMEL__) || defined(__THUMBEL__) || defined(__AARCH64EL__) ||    \
-    defined(__MIPSEL__) || defined(_MIPSEL) || defined(__MIPSEL) ||            \
-    defined(_M_ARM) || defined(_M_ARM64) || defined(__e2k__) ||                \
-    defined(__elbrus_4c__) || defined(__elbrus_8c__) || defined(__bfin__) ||   \
-    defined(__BFIN__) || defined(__ia64__) || defined(_IA64) ||                \
-    defined(__IA64__) || defined(__ia64) || defined(_M_IA64) ||                \
-    defined(__itanium__) || defined(__ia32__) || defined(__CYGWIN__) ||        \
-    defined(_WIN64) || defined(_WIN32) || defined(__TOS_WIN__) ||              \
-    defined(__WINDOWS__)
-#define __BYTE_ORDER__ __ORDER_LITTLE_ENDIAN__
-
-#elif defined(__BIG_ENDIAN__) ||                                               \
-    (defined(_BIG_ENDIAN) && !defined(_LITTLE_ENDIAN)) ||                      \
-    defined(__ARMEB__) || defined(__THUMBEB__) || defined(__AARCH64EB__) ||    \
-    defined(__MIPSEB__) || defined(_MIPSEB) || defined(__MIPSEB) ||            \
-    defined(__m68k__) || defined(M68000) || defined(__hppa__) ||               \
-    defined(__hppa) || defined(__HPPA__) || defined(__sparc__) ||              \
-    defined(__sparc) || defined(__370__) || defined(__THW_370__) ||            \
-    defined(__s390__) || defined(__s390x__) || defined(__SYSC_ZARCH__)
-#define __BYTE_ORDER__ __ORDER_BIG_ENDIAN__
-
-#else
-#error __BYTE_ORDER__ should be defined.
-#endif /* Arch */
-
-#endif
-#endif /* __BYTE_ORDER__ || __ORDER_LITTLE_ENDIAN__ || __ORDER_BIG_ENDIAN__ */
-
-/*****************************************************************************/
-
-#ifndef __dll_export
-#if defined(_WIN32) || defined(_WIN64) || defined(__CYGWIN__)
-#if defined(__GNUC__) || __has_attribute(dllexport)
-#define __dll_export __attribute__((dllexport))
-#elif defined(_MSC_VER)
-#define __dll_export __declspec(dllexport)
-#else
-#define __dll_export
-#endif
-#elif defined(__GNUC__) || __has_attribute(visibility)
-#define __dll_export __attribute__((visibility("default")))
-#else
-#define __dll_export
-#endif
-#endif /* __dll_export */
-
-#ifndef __dll_import
-#if defined(_WIN32) || defined(_WIN64) || defined(__CYGWIN__)
-#if defined(__GNUC__) || __has_attribute(dllimport)
-#define __dll_import __attribute__((dllimport))
-#elif defined(_MSC_VER)
-#define __dll_import __declspec(dllimport)
-#else
-#define __dll_import
-#endif
-#else
-#define __dll_import
-#endif
-#endif /* __dll_import */
-
-#ifndef __force_inline
-#ifdef _MSC_VER
-#define __force_inline __forceinline
-#elif __GNUC_PREREQ(3, 2) || __has_attribute(always_inline)
-#define __force_inline __inline __attribute__((always_inline))
-#else
-#define __force_inline __inline
-#endif
-#endif /* __force_inline */
-
-#ifndef T1HA_API
-#if defined(t1ha_EXPORTS)
-#define T1HA_API __dll_export
-#elif defined(t1ha_IMPORTS)
-#define T1HA_API __dll_import
-#else
-#define T1HA_API
-#endif
-#endif /* T1HA_API */
-
-#if defined(_MSC_VER) && defined(__ia32__)
-#define T1HA_ALIGN_PREFIX __declspec(align(32)) /* required only for SIMD */
-#else
-#define T1HA_ALIGN_PREFIX
-#endif /* _MSC_VER */
-
-#if defined(__GNUC__) && defined(__ia32__)
-#define T1HA_ALIGN_SUFFIX                                                      \
-  __attribute__((aligned(32))) /* required only for SIMD */
-#else
-#define T1HA_ALIGN_SUFFIX
-#endif /* GCC x86 */
-
-#ifndef T1HA_USE_INDIRECT_FUNCTIONS
-/* GNU ELF indirect functions usage control. For more info please see
- * https://en.wikipedia.org/wiki/Executable_and_Linkable_Format
- * and https://sourceware.org/glibc/wiki/GNU_IFUNC */
-#if __has_attribute(ifunc) &&                                                  \
-    defined(__ELF__) /* ifunc is broken on Darwin/OSX */
-/* Use ifunc/gnu_indirect_function if corresponding attribute is available,
- * Assuming compiler will generate properly code even when
- * the -fstack-protector-all and/or the -fsanitize=address are enabled. */
-#define T1HA_USE_INDIRECT_FUNCTIONS 1
-#elif defined(__ELF__) && !defined(__SANITIZE_ADDRESS__) &&                    \
-    !defined(__SSP_ALL__)
-/* ifunc/gnu_indirect_function will be used on ELF, but only if both
- * -fstack-protector-all and -fsanitize=address are NOT enabled. */
-#define T1HA_USE_INDIRECT_FUNCTIONS 1
-#else
-#define T1HA_USE_INDIRECT_FUNCTIONS 0
-#endif
-#endif /* T1HA_USE_INDIRECT_FUNCTIONS */
-
-#if __GNUC_PREREQ(4, 0)
-#pragma GCC visibility push(hidden)
-#endif /* __GNUC_PREREQ(4,0) */
-
-#ifdef __cplusplus
-extern "C" {
-#endif
-
-typedef union T1HA_ALIGN_PREFIX t1ha_state256 {
-  uint8_t bytes[32];
-  uint32_t u32[8];
-  uint64_t u64[4];
-  struct {
-    uint64_t a, b, c, d;
-  } n;
-} t1ha_state256_t T1HA_ALIGN_SUFFIX;
-
-typedef struct t1ha_context {
-  t1ha_state256_t state;
-  t1ha_state256_t buffer;
-  size_t partial;
-  uint64_t total;
-} t1ha_context_t;
-
-#ifdef _MSC_VER
-#pragma warning(pop)
-#endif
-
-/******************************************************************************
- *
- * Self-testing API.
- *
- * Unfortunately, some compilers (exactly only Microsoft Visual C/C++) has
- * a bugs which leads t1ha-functions to produce wrong results. This API allows
- * check the correctness of the actual code in runtime.
- *
- * All check-functions returns 0 on success, or -1 in case the corresponding
- * hash-function failed verification. PLEASE, always perform such checking at
- * initialization of your code, if you using MSVC or other troubleful compilers.
- */
-
-T1HA_API int t1ha_selfcheck__all_enabled(void);
-
-#ifndef T1HA2_DISABLED
-T1HA_API int t1ha_selfcheck__t1ha2_atonce(void);
-T1HA_API int t1ha_selfcheck__t1ha2_atonce128(void);
-T1HA_API int t1ha_selfcheck__t1ha2_stream(void);
-T1HA_API int t1ha_selfcheck__t1ha2(void);
-#endif /* T1HA2_DISABLED */
-
-#ifndef T1HA1_DISABLED
-T1HA_API int t1ha_selfcheck__t1ha1_le(void);
-T1HA_API int t1ha_selfcheck__t1ha1_be(void);
-T1HA_API int t1ha_selfcheck__t1ha1(void);
-#endif /* T1HA1_DISABLED */
-
-#ifndef T1HA0_DISABLED
-T1HA_API int t1ha_selfcheck__t1ha0_32le(void);
-T1HA_API int t1ha_selfcheck__t1ha0_32be(void);
-T1HA_API int t1ha_selfcheck__t1ha0(void);
-
-/* Define T1HA0_AESNI_AVAILABLE to 0 for disable AES-NI support. */
-#ifndef T1HA0_AESNI_AVAILABLE
-#if defined(__e2k__) ||                                                        \
-    (defined(__ia32__) && (!defined(_M_IX86) || _MSC_VER > 1800))
-#define T1HA0_AESNI_AVAILABLE 1
-#else
-#define T1HA0_AESNI_AVAILABLE 0
-#endif
-#endif /* ifndef T1HA0_AESNI_AVAILABLE */
-
-#if T1HA0_AESNI_AVAILABLE
-T1HA_API int t1ha_selfcheck__t1ha0_ia32aes_noavx(void);
-T1HA_API int t1ha_selfcheck__t1ha0_ia32aes_avx(void);
-#ifndef __e2k__
-T1HA_API int t1ha_selfcheck__t1ha0_ia32aes_avx2(void);
-#endif
-#endif /* if T1HA0_AESNI_AVAILABLE */
-#endif /* T1HA0_DISABLED */
-
-/******************************************************************************
- *
- *  t1ha2 = 64 and 128-bit, SLIGHTLY MORE ATTENTION FOR QUALITY AND STRENGTH.
- *
- *    - The recommended version of "Fast Positive Hash" with good quality
- *      for checksum, hash tables and fingerprinting.
- *    - Portable and extremely efficiency on modern 64-bit CPUs.
- *      Designed for 64-bit little-endian platforms,
- *      in other cases will runs slowly.
- *    - Great quality of hashing and still faster than other non-t1ha hashes.
- *      Provides streaming mode and 128-bit result.
- *
- * Note: Due performance reason 64- and 128-bit results are completely
- *       different each other, i.e. 64-bit result is NOT any part of 128-bit.
- */
-#ifndef T1HA2_DISABLED
-
-/* The at-once variant with 64-bit result */
-T1HA_API uint64_t t1ha2_atonce(const void *data, size_t length, uint64_t seed);
-
-/* The at-once variant with 128-bit result.
- * Argument `extra_result` is NOT optional and MUST be valid.
- * The high 64-bit part of 128-bit hash will be always unconditionally
- * stored to the address given by `extra_result` argument. */
-T1HA_API uint64_t t1ha2_atonce128(uint64_t *__restrict extra_result,
-                                  const void *__restrict data, size_t length,
-                                  uint64_t seed);
-
-/* The init/update/final trinity for streaming.
- * Return 64 or 128-bit result depentently from `extra_result` argument. */
-T1HA_API void t1ha2_init(t1ha_context_t *ctx, uint64_t seed_x, uint64_t seed_y);
-T1HA_API void t1ha2_update(t1ha_context_t *__restrict ctx,
-                           const void *__restrict data, size_t length);
-
-/* Argument `extra_result` is optional and MAY be NULL.
- *  - If `extra_result` is NOT NULL then the 128-bit hash will be calculated,
- *    and high 64-bit part of it will be stored to the address given
- *    by `extra_result` argument.
- *  - Otherwise the 64-bit hash will be calculated
- *    and returned from function directly.
- *
- * Note: Due performance reason 64- and 128-bit results are completely
- *       different each other, i.e. 64-bit result is NOT any part of 128-bit. */
-T1HA_API uint64_t t1ha2_final(t1ha_context_t *__restrict ctx,
-                              uint64_t *__restrict extra_result /* optional */);
-
-#endif /* T1HA2_DISABLED */
-
-/******************************************************************************
- *
- *  t1ha1 = 64-bit, BASELINE FAST PORTABLE HASH:
- *
- *    - Runs faster on 64-bit platforms in other cases may runs slowly.
- *    - Portable and stable, returns same 64-bit result
- *      on all architectures and CPUs.
- *    - Unfortunately it fails the "strict avalanche criteria",
- *      see test results at https://github.com/demerphq/smhasher.
- *
- *      This flaw is insignificant for the t1ha1() purposes and imperceptible
- *      from a practical point of view.
- *      However, nowadays this issue has resolved in the next t1ha2(),
- *      that was initially planned to providing a bit more quality.
- */
-#ifndef T1HA1_DISABLED
-
-/* The little-endian variant. */
-T1HA_API uint64_t t1ha1_le(const void *data, size_t length, uint64_t seed);
-
-/* The big-endian variant. */
-T1HA_API uint64_t t1ha1_be(const void *data, size_t length, uint64_t seed);
-
-#endif /* T1HA1_DISABLED */
-
-/******************************************************************************
- *
- *  t1ha0 = 64-bit, JUST ONLY FASTER:
- *
- *    - Provides fast-as-possible hashing for current CPU, including
- *      32-bit systems and engaging the available hardware acceleration.
- *    - It is a facade that selects most quick-and-dirty hash
- *      for the current processor. For instance, on IA32 (x86) actual function
- *      will be selected in runtime, depending on current CPU capabilities
- *
- * BE CAREFUL!!!  THIS IS MEANS:
- *
- *   1. The quality of hash is a subject for tradeoffs with performance.
- *      So, the quality and strength of t1ha0() may be lower than t1ha1(),
- *      especially on 32-bit targets, but then much faster.
- *      However, guaranteed that it passes all SMHasher tests.
- *
- *   2. No warranty that the hash result will be same for particular
- *      key on another machine or another version of libt1ha.
- *
- *      Briefly, such hash-results and their derivatives, should be
- *      used only in runtime, but should not be persist or transferred
- *      over a network.
- *
- *
- *  When T1HA0_RUNTIME_SELECT is nonzero the t1ha0_resolve() function could
- *  be used to get actual t1ha0() implementation address at runtime. This is
- *  useful for two cases:
- *    - calling by local pointer-to-function usually is little
- *      bit faster (less overhead) than via a PLT thru the DSO boundary.
- *    - GNU Indirect functions (see below) don't supported by environment
- *      and calling by t1ha0_funcptr is not available and/or expensive.
- */
-
-#ifndef T1HA0_DISABLED
-
-/* The little-endian variant for 32-bit CPU. */
-uint64_t t1ha0_32le(const void *data, size_t length, uint64_t seed);
-/* The big-endian variant for 32-bit CPU. */
-uint64_t t1ha0_32be(const void *data, size_t length, uint64_t seed);
-
-/* Define T1HA0_AESNI_AVAILABLE to 0 for disable AES-NI support. */
-#ifndef T1HA0_AESNI_AVAILABLE
-#if defined(__e2k__) ||                                                        \
-    (defined(__ia32__) && (!defined(_M_IX86) || _MSC_VER > 1800))
-#define T1HA0_AESNI_AVAILABLE 1
-#else
-#define T1HA0_AESNI_AVAILABLE 0
-#endif
-#endif /* T1HA0_AESNI_AVAILABLE */
-
-/* Define T1HA0_RUNTIME_SELECT to 0 for disable dispatching t1ha0 at runtime. */
-#ifndef T1HA0_RUNTIME_SELECT
-#if T1HA0_AESNI_AVAILABLE && !defined(__e2k__)
-#define T1HA0_RUNTIME_SELECT 1
-#else
-#define T1HA0_RUNTIME_SELECT 0
-#endif
-#endif /* T1HA0_RUNTIME_SELECT */
-
-#if !T1HA0_RUNTIME_SELECT && !defined(T1HA0_USE_DEFINE)
-#if defined(__LCC__)
-#define T1HA0_USE_DEFINE 1
-#else
-#define T1HA0_USE_DEFINE 0
-#endif
-#endif /* T1HA0_USE_DEFINE */
-
-#if T1HA0_AESNI_AVAILABLE
-uint64_t t1ha0_ia32aes_noavx(const void *data, size_t length, uint64_t seed);
-uint64_t t1ha0_ia32aes_avx(const void *data, size_t length, uint64_t seed);
-#ifndef __e2k__
-uint64_t t1ha0_ia32aes_avx2(const void *data, size_t length, uint64_t seed);
-#endif
-#endif /* T1HA0_AESNI_AVAILABLE */
-
-#if T1HA0_RUNTIME_SELECT
-typedef uint64_t (*t1ha0_function_t)(const void *, size_t, uint64_t);
-T1HA_API t1ha0_function_t t1ha0_resolve(void);
-#if T1HA_USE_INDIRECT_FUNCTIONS
-T1HA_API uint64_t t1ha0(const void *data, size_t length, uint64_t seed);
-#else
-/* Otherwise function pointer will be used.
- * Unfortunately this may cause some overhead calling. */
-T1HA_API extern uint64_t (*t1ha0_funcptr)(const void *data, size_t length,
-                                          uint64_t seed);
-static __force_inline uint64_t t1ha0(const void *data, size_t length,
-                                     uint64_t seed) {
-  return t1ha0_funcptr(data, length, seed);
-}
-#endif /* T1HA_USE_INDIRECT_FUNCTIONS */
-
-#elif __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
-
-#if T1HA0_USE_DEFINE
-
-#if (UINTPTR_MAX > 0xffffFFFFul || ULONG_MAX > 0xffffFFFFul) &&                \
-    (!defined(T1HA1_DISABLED) || !defined(T1HA2_DISABLED))
-#if defined(T1HA1_DISABLED)
-#define t1ha0 t1ha2_atonce
-#else
-#define t1ha0 t1ha1_be
-#endif /* T1HA1_DISABLED */
-#else  /* 32/64 */
-#define t1ha0 t1ha0_32be
-#endif /* 32/64 */
-
-#else /* T1HA0_USE_DEFINE */
-
-static __force_inline uint64_t t1ha0(const void *data, size_t length,
-                                     uint64_t seed) {
-#if (UINTPTR_MAX > 0xffffFFFFul || ULONG_MAX > 0xffffFFFFul) &&                \
-    (!defined(T1HA1_DISABLED) || !defined(T1HA2_DISABLED))
-#if defined(T1HA1_DISABLED)
-  return t1ha2_atonce(data, length, seed);
-#else
-  return t1ha1_be(data, length, seed);
-#endif /* T1HA1_DISABLED */
-#else  /* 32/64 */
-  return t1ha0_32be(data, length, seed);
-#endif /* 32/64 */
-}
-
-#endif /* !T1HA0_USE_DEFINE */
-
-#else /* !T1HA0_RUNTIME_SELECT && __BYTE_ORDER__ != __ORDER_BIG_ENDIAN__ */
-
-#if T1HA0_USE_DEFINE
-
-#if (UINTPTR_MAX > 0xffffFFFFul || ULONG_MAX > 0xffffFFFFul) &&                \
-    (!defined(T1HA1_DISABLED) || !defined(T1HA2_DISABLED))
-#if defined(T1HA1_DISABLED)
-#define t1ha0 t1ha2_atonce
-#else
-#define t1ha0 t1ha1_le
-#endif /* T1HA1_DISABLED */
-#else  /* 32/64 */
-#define t1ha0 t1ha0_32le
-#endif /* 32/64 */
-
-#else
-
-static __force_inline uint64_t t1ha0(const void *data, size_t length,
-                                     uint64_t seed) {
-#if (UINTPTR_MAX > 0xffffFFFFul || ULONG_MAX > 0xffffFFFFul) &&                \
-    (!defined(T1HA1_DISABLED) || !defined(T1HA2_DISABLED))
-#if defined(T1HA1_DISABLED)
-  return t1ha2_atonce(data, length, seed);
-#else
-  return t1ha1_le(data, length, seed);
-#endif /* T1HA1_DISABLED */
-#else  /* 32/64 */
-  return t1ha0_32le(data, length, seed);
-#endif /* 32/64 */
-}
-
-#endif /* !T1HA0_USE_DEFINE */
-
-#endif /* !T1HA0_RUNTIME_SELECT */
-
-#endif /* T1HA0_DISABLED */
-
-#ifdef __cplusplus
-}
-#endif
-
-#if __GNUC_PREREQ(4, 0)
-#pragma GCC visibility pop
-#endif /* __GNUC_PREREQ(4,0) */
diff --git a/src/t1ha/t1ha2.c b/src/t1ha/t1ha2.c
deleted file mode 100644
index b05d64c..0000000
--- a/src/t1ha/t1ha2.c
+++ /dev/null
@@ -1,329 +0,0 @@
-/*
- *  Copyright (c) 2016-2018 Positive Technologies, https://www.ptsecurity.com,
- *  Fast Positive Hash.
- *
- *  Portions Copyright (c) 2010-2018 Leonid Yuriev <leo@yuriev.ru>,
- *  The 1Hippeus project (t1h).
- *
- *  This software is provided 'as-is', without any express or implied
- *  warranty. In no event will the authors be held liable for any damages
- *  arising from the use of this software.
- *
- *  Permission is granted to anyone to use this software for any purpose,
- *  including commercial applications, and to alter it and redistribute it
- *  freely, subject to the following restrictions:
- *
- *  1. The origin of this software must not be misrepresented; you must not
- *     claim that you wrote the original software. If you use this software
- *     in a product, an acknowledgement in the product documentation would be
- *     appreciated but is not required.
- *  2. Altered source versions must be plainly marked as such, and must not be
- *     misrepresented as being the original software.
- *  3. This notice may not be removed or altered from any source distribution.
- */
-
-/*
- * t1ha = { Fast Positive Hash, aka "Позитивный Хэш" }
- * by [Positive Technologies](https://www.ptsecurity.ru)
- *
- * Briefly, it is a 64-bit Hash Function:
- *  1. Created for 64-bit little-endian platforms, in predominantly for x86_64,
- *     but portable and without penalties it can run on any 64-bit CPU.
- *  2. In most cases up to 15% faster than City64, xxHash, mum-hash, metro-hash
- *     and all others portable hash-functions (which do not use specific
- *     hardware tricks).
- *  3. Not suitable for cryptography.
- *
- * The Future will Positive. Всё будет хорошо.
- *
- * ACKNOWLEDGEMENT:
- * The t1ha was originally developed by Leonid Yuriev (Леонид Юрьев)
- * for The 1Hippeus project - zerocopy messaging in the spirit of Sparta!
- */
-
-#ifndef T1HA2_DISABLED
-#include "t1ha_bits.h"
-//#include "t1ha_selfcheck.h"
-
-static __always_inline void init_ab(t1ha_state256_t *s, uint64_t x,
-                                    uint64_t y) {
-  s->n.a = x;
-  s->n.b = y;
-}
-
-static __always_inline void init_cd(t1ha_state256_t *s, uint64_t x,
-                                    uint64_t y) {
-  s->n.c = rot64(y, 23) + ~x;
-  s->n.d = ~y + rot64(x, 19);
-}
-
-/* TODO: C++ template in the next version */
-#define T1HA2_UPDATE(ENDIANNES, ALIGNESS, state, v)                            \
-  do {                                                                         \
-    t1ha_state256_t *const s = state;                                          \
-    const uint64_t w0 = fetch64_##ENDIANNES##_##ALIGNESS(v + 0);               \
-    const uint64_t w1 = fetch64_##ENDIANNES##_##ALIGNESS(v + 1);               \
-    const uint64_t w2 = fetch64_##ENDIANNES##_##ALIGNESS(v + 2);               \
-    const uint64_t w3 = fetch64_##ENDIANNES##_##ALIGNESS(v + 3);               \
-                                                                               \
-    const uint64_t d02 = w0 + rot64(w2 + s->n.d, 56);                          \
-    const uint64_t c13 = w1 + rot64(w3 + s->n.c, 19);                          \
-    s->n.d ^= s->n.b + rot64(w1, 38);                                          \
-    s->n.c ^= s->n.a + rot64(w0, 57);                                          \
-    s->n.b ^= prime_6 * (c13 + w2);                                            \
-    s->n.a ^= prime_5 * (d02 + w3);                                            \
-  } while (0)
-
-static __always_inline void squash(t1ha_state256_t *s) {
-  s->n.a ^= prime_6 * (s->n.c + rot64(s->n.d, 23));
-  s->n.b ^= prime_5 * (rot64(s->n.c, 19) + s->n.d);
-}
-
-/* TODO: C++ template in the next version */
-#define T1HA2_LOOP(ENDIANNES, ALIGNESS, state, data, len)                      \
-  do {                                                                         \
-    const void *detent = (const uint8_t *)data + len - 31;                     \
-    do {                                                                       \
-      const uint64_t *v = (const uint64_t *)data;                              \
-      data = (const uint64_t *)data + 4;                                       \
-      prefetch(data);                                                          \
-      T1HA2_UPDATE(le, ALIGNESS, state, v);                                    \
-    } while (likely(data < detent));                                           \
-  } while (0)
-
-/* TODO: C++ template in the next version */
-#define T1HA2_TAIL_AB(ENDIANNES, ALIGNESS, state, data, len)                   \
-  do {                                                                         \
-    t1ha_state256_t *const s = state;                                          \
-    const uint64_t *v = (const uint64_t *)data;                                \
-    switch (len) {                                                             \
-    default:                                                                   \
-      mixup64(&s->n.a, &s->n.b, fetch64_##ENDIANNES##_##ALIGNESS(v++),         \
-              prime_4);                                                        \
-    /* fall through */                                                         \
-    case 24:                                                                   \
-    case 23:                                                                   \
-    case 22:                                                                   \
-    case 21:                                                                   \
-    case 20:                                                                   \
-    case 19:                                                                   \
-    case 18:                                                                   \
-    case 17:                                                                   \
-      mixup64(&s->n.b, &s->n.a, fetch64_##ENDIANNES##_##ALIGNESS(v++),         \
-              prime_3);                                                        \
-    /* fall through */                                                         \
-    case 16:                                                                   \
-    case 15:                                                                   \
-    case 14:                                                                   \
-    case 13:                                                                   \
-    case 12:                                                                   \
-    case 11:                                                                   \
-    case 10:                                                                   \
-    case 9:                                                                    \
-      mixup64(&s->n.a, &s->n.b, fetch64_##ENDIANNES##_##ALIGNESS(v++),         \
-              prime_2);                                                        \
-    /* fall through */                                                         \
-    case 8:                                                                    \
-    case 7:                                                                    \
-    case 6:                                                                    \
-    case 5:                                                                    \
-    case 4:                                                                    \
-    case 3:                                                                    \
-    case 2:                                                                    \
-    case 1:                                                                    \
-      mixup64(&s->n.b, &s->n.a, tail64_##ENDIANNES##_##ALIGNESS(v, len),       \
-              prime_1);                                                        \
-    /* fall through */                                                         \
-    case 0:                                                                    \
-      return final64(s->n.a, s->n.b);                                          \
-    }                                                                          \
-  } while (0)
-
-/* TODO: C++ template in the next version */
-#define T1HA2_TAIL_ABCD(ENDIANNES, ALIGNESS, state, data, len)                 \
-  do {                                                                         \
-    t1ha_state256_t *const s = state;                                          \
-    const uint64_t *v = (const uint64_t *)data;                                \
-    switch (len) {                                                             \
-    default:                                                                   \
-      mixup64(&s->n.a, &s->n.d, fetch64_##ENDIANNES##_##ALIGNESS(v++),         \
-              prime_4);                                                        \
-    /* fall through */                                                         \
-    case 24:                                                                   \
-    case 23:                                                                   \
-    case 22:                                                                   \
-    case 21:                                                                   \
-    case 20:                                                                   \
-    case 19:                                                                   \
-    case 18:                                                                   \
-    case 17:                                                                   \
-      mixup64(&s->n.b, &s->n.a, fetch64_##ENDIANNES##_##ALIGNESS(v++),         \
-              prime_3);                                                        \
-    /* fall through */                                                         \
-    case 16:                                                                   \
-    case 15:                                                                   \
-    case 14:                                                                   \
-    case 13:                                                                   \
-    case 12:                                                                   \
-    case 11:                                                                   \
-    case 10:                                                                   \
-    case 9:                                                                    \
-      mixup64(&s->n.c, &s->n.b, fetch64_##ENDIANNES##_##ALIGNESS(v++),         \
-              prime_2);                                                        \
-    /* fall through */                                                         \
-    case 8:                                                                    \
-    case 7:                                                                    \
-    case 6:                                                                    \
-    case 5:                                                                    \
-    case 4:                                                                    \
-    case 3:                                                                    \
-    case 2:                                                                    \
-    case 1:                                                                    \
-      mixup64(&s->n.d, &s->n.c, tail64_##ENDIANNES##_##ALIGNESS(v, len),       \
-              prime_1);                                                        \
-    /* fall through */                                                         \
-    case 0:                                                                    \
-      return final128(s->n.a, s->n.b, s->n.c, s->n.d, extra_result);           \
-    }                                                                          \
-  } while (0)
-
-static __always_inline uint64_t final128(uint64_t a, uint64_t b, uint64_t c,
-                                         uint64_t d, uint64_t *h) {
-  mixup64(&a, &b, rot64(c, 41) ^ d, prime_0);
-  mixup64(&b, &c, rot64(d, 23) ^ a, prime_6);
-  mixup64(&c, &d, rot64(a, 19) ^ b, prime_5);
-  mixup64(&d, &a, rot64(b, 31) ^ c, prime_4);
-  *h = c + d;
-  return a ^ b;
-}
-
-//------------------------------------------------------------------------------
-
-uint64_t t1ha2_atonce(const void *data, size_t length, uint64_t seed) {
-  t1ha_state256_t state;
-  init_ab(&state, seed, length);
-
-#if T1HA_SYS_UNALIGNED_ACCESS == T1HA_UNALIGNED_ACCESS__EFFICIENT
-  if (unlikely(length > 32)) {
-    init_cd(&state, seed, length);
-    T1HA2_LOOP(le, unaligned, &state, data, length);
-    squash(&state);
-    length &= 31;
-  }
-  T1HA2_TAIL_AB(le, unaligned, &state, data, length);
-#else
-  const bool misaligned = (((uintptr_t)data) & (ALIGNMENT_64 - 1)) != 0;
-  if (misaligned) {
-    if (unlikely(length > 32)) {
-      init_cd(&state, seed, length);
-      T1HA2_LOOP(le, unaligned, &state, data, length);
-      squash(&state);
-      length &= 31;
-    }
-    T1HA2_TAIL_AB(le, unaligned, &state, data, length);
-  } else {
-    if (unlikely(length > 32)) {
-      init_cd(&state, seed, length);
-      T1HA2_LOOP(le, aligned, &state, data, length);
-      squash(&state);
-      length &= 31;
-    }
-    T1HA2_TAIL_AB(le, aligned, &state, data, length);
-  }
-#endif
-}
-
-uint64_t t1ha2_atonce128(uint64_t *__restrict extra_result,
-                         const void *__restrict data, size_t length,
-                         uint64_t seed) {
-  t1ha_state256_t state;
-  init_ab(&state, seed, length);
-  init_cd(&state, seed, length);
-
-#if T1HA_SYS_UNALIGNED_ACCESS == T1HA_UNALIGNED_ACCESS__EFFICIENT
-  if (unlikely(length > 32)) {
-    T1HA2_LOOP(le, unaligned, &state, data, length);
-    length &= 31;
-  }
-  T1HA2_TAIL_ABCD(le, unaligned, &state, data, length);
-#else
-  const bool misaligned = (((uintptr_t)data) & (ALIGNMENT_64 - 1)) != 0;
-  if (misaligned) {
-    if (unlikely(length > 32)) {
-      T1HA2_LOOP(le, unaligned, &state, data, length);
-      length &= 31;
-    }
-    T1HA2_TAIL_ABCD(le, unaligned, &state, data, length);
-  } else {
-    if (unlikely(length > 32)) {
-      T1HA2_LOOP(le, aligned, &state, data, length);
-      length &= 31;
-    }
-    T1HA2_TAIL_ABCD(le, aligned, &state, data, length);
-  }
-#endif
-}
-
-//------------------------------------------------------------------------------
-
-void t1ha2_init(t1ha_context_t *ctx, uint64_t seed_x, uint64_t seed_y) {
-  init_ab(&ctx->state, seed_x, seed_y);
-  init_cd(&ctx->state, seed_x, seed_y);
-  ctx->partial = 0;
-  ctx->total = 0;
-}
-
-void t1ha2_update(t1ha_context_t *__restrict ctx, const void *__restrict data,
-                  size_t length) {
-  ctx->total += length;
-
-  if (ctx->partial) {
-    const size_t left = 32 - ctx->partial;
-    const size_t chunk = (length >= left) ? left : length;
-    memcpy(ctx->buffer.bytes + ctx->partial, data, chunk);
-    ctx->partial += chunk;
-    if (ctx->partial < 32) {
-      assert(left >= length);
-      return;
-    }
-    ctx->partial = 0;
-    data = (const uint8_t *)data + chunk;
-    length -= chunk;
-    T1HA2_UPDATE(le, aligned, &ctx->state, ctx->buffer.u64);
-  }
-
-  if (length >= 32) {
-#if T1HA_SYS_UNALIGNED_ACCESS == T1HA_UNALIGNED_ACCESS__EFFICIENT
-    T1HA2_LOOP(le, unaligned, &ctx->state, data, length);
-#else
-    const bool misaligned = (((uintptr_t)data) & (ALIGNMENT_64 - 1)) != 0;
-    if (misaligned) {
-      T1HA2_LOOP(le, unaligned, &ctx->state, data, length);
-    } else {
-      T1HA2_LOOP(le, aligned, &ctx->state, data, length);
-    }
-#endif
-    length &= 31;
-  }
-
-  if (length)
-    memcpy(ctx->buffer.bytes, data, ctx->partial = length);
-}
-
-uint64_t t1ha2_final(t1ha_context_t *__restrict ctx,
-                     uint64_t *__restrict extra_result) {
-  uint64_t bits = (ctx->total << 3) ^ (UINT64_C(1) << 63);
-#if __BYTE_ORDER__ != __ORDER_LITTLE_ENDIAN__
-  bits = bswap64(bits);
-#endif
-  t1ha2_update(ctx, &bits, 8);
-
-  if (likely(!extra_result)) {
-    squash(&ctx->state);
-    T1HA2_TAIL_AB(le, aligned, &ctx->state, ctx->buffer.u64, ctx->partial);
-  }
-
-  T1HA2_TAIL_ABCD(le, aligned, &ctx->state, ctx->buffer.u64, ctx->partial);
-}
-
-#endif /* T1HA2_DISABLED */
diff --git a/src/t1ha/t1ha_bits.h b/src/t1ha/t1ha_bits.h
deleted file mode 100644
index 7c47851..0000000
--- a/src/t1ha/t1ha_bits.h
+++ /dev/null
@@ -1,1226 +0,0 @@
-/*
- *  Copyright (c) 2016-2018 Positive Technologies, https://www.ptsecurity.com,
- *  Fast Positive Hash.
- *
- *  Portions Copyright (c) 2010-2018 Leonid Yuriev <leo@yuriev.ru>,
- *  The 1Hippeus project (t1h).
- *
- *  This software is provided 'as-is', without any express or implied
- *  warranty. In no event will the authors be held liable for any damages
- *  arising from the use of this software.
- *
- *  Permission is granted to anyone to use this software for any purpose,
- *  including commercial applications, and to alter it and redistribute it
- *  freely, subject to the following restrictions:
- *
- *  1. The origin of this software must not be misrepresented; you must not
- *     claim that you wrote the original software. If you use this software
- *     in a product, an acknowledgement in the product documentation would be
- *     appreciated but is not required.
- *  2. Altered source versions must be plainly marked as such, and must not be
- *     misrepresented as being the original software.
- *  3. This notice may not be removed or altered from any source distribution.
- */
-
-/*
- * t1ha = { Fast Positive Hash, aka "Позитивный Хэш" }
- * by [Positive Technologies](https://www.ptsecurity.ru)
- *
- * Briefly, it is a 64-bit Hash Function:
- *  1. Created for 64-bit little-endian platforms, in predominantly for x86_64,
- *     but portable and without penalties it can run on any 64-bit CPU.
- *  2. In most cases up to 15% faster than City64, xxHash, mum-hash, metro-hash
- *     and all others portable hash-functions (which do not use specific
- *     hardware tricks).
- *  3. Not suitable for cryptography.
- *
- * The Future will Positive. Всё будет хорошо.
- *
- * ACKNOWLEDGEMENT:
- * The t1ha was originally developed by Leonid Yuriev (Леонид Юрьев)
- * for The 1Hippeus project - zerocopy messaging in the spirit of Sparta!
- */
-
-#pragma once
-
-#if defined(_MSC_VER)
-#pragma warning(disable : 4201) /* nameless struct/union */
-#if _MSC_VER > 1800
-#pragma warning(disable : 4464) /* relative include path contains '..' */
-#endif                          /* 1800 */
-#endif                          /* MSVC */
-#include "t1ha.h"
-
-#ifndef T1HA_USE_FAST_ONESHOT_READ
-/* Define it to 1 for slightly faster code.
- * Unfortunately, this may trigger false-positive alarms from Valgrind,
- * AddressSanitizer and other similar tools.
- * So, define it to 0 if in doubt. */
-#define T1HA_USE_FAST_ONESHOT_READ 1
-#endif /* T1HA_USE_FAST_ONESHOT_READ */
-
-/*****************************************************************************/
-
-#include <assert.h>  /* for assert() */
-#include <stdbool.h> /* for bool */
-#include <string.h>  /* for memcpy() */
-
-#if __BYTE_ORDER__ != __ORDER_LITTLE_ENDIAN__ &&                               \
-    __BYTE_ORDER__ != __ORDER_BIG_ENDIAN__
-#error Unsupported byte order.
-#endif
-
-#define T1HA_UNALIGNED_ACCESS__UNABLE 0
-#define T1HA_UNALIGNED_ACCESS__SLOW 1
-#define T1HA_UNALIGNED_ACCESS__EFFICIENT 2
-
-#ifndef T1HA_SYS_UNALIGNED_ACCESS
-#if defined(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS)
-#define T1HA_SYS_UNALIGNED_ACCESS T1HA_UNALIGNED_ACCESS__EFFICIENT
-#elif defined(__ia32__)
-#define T1HA_SYS_UNALIGNED_ACCESS T1HA_UNALIGNED_ACCESS__EFFICIENT
-#elif defined(__e2k__)
-#define T1HA_SYS_UNALIGNED_ACCESS T1HA_UNALIGNED_ACCESS__SLOW
-#elif defined(__ARM_FEATURE_UNALIGNED)
-#define T1HA_SYS_UNALIGNED_ACCESS T1HA_UNALIGNED_ACCESS__EFFICIENT
-#else
-#define T1HA_SYS_UNALIGNED_ACCESS T1HA_UNALIGNED_ACCESS__UNABLE
-#endif
-#endif /* T1HA_SYS_UNALIGNED_ACCESS */
-
-#define ALIGNMENT_16 2
-#define ALIGNMENT_32 4
-#if UINTPTR_MAX > 0xffffFFFFul || ULONG_MAX > 0xffffFFFFul
-#define ALIGNMENT_64 8
-#else
-#define ALIGNMENT_64 4
-#endif
-
-#ifndef PAGESIZE
-#define PAGESIZE 4096
-#endif /* PAGESIZE */
-
-/***************************************************************************/
-
-#ifndef __has_builtin
-#define __has_builtin(x) (0)
-#endif
-
-#ifndef __has_warning
-#define __has_warning(x) (0)
-#endif
-
-#ifndef __has_feature
-#define __has_feature(x) (0)
-#endif
-
-#ifndef __has_extension
-#define __has_extension(x) (0)
-#endif
-
-#if __has_feature(address_sanitizer)
-#define __SANITIZE_ADDRESS__ 1
-#endif
-
-#ifndef __optimize
-#if defined(__clang__) && !__has_attribute(optimize)
-#define __optimize(ops)
-#elif defined(__GNUC__) || __has_attribute(optimize)
-#define __optimize(ops) __attribute__((optimize(ops)))
-#else
-#define __optimize(ops)
-#endif
-#endif /* __optimize */
-
-#ifndef __cold
-#if defined(__OPTIMIZE__)
-#if defined(__e2k__)
-#define __cold __optimize(1) __attribute__((cold))
-#elif defined(__clang__) && !__has_attribute(cold)
-/* just put infrequently used functions in separate section */
-#define __cold __attribute__((section("text.unlikely"))) __optimize("Os")
-#elif defined(__GNUC__) || __has_attribute(cold)
-#define __cold __attribute__((cold)) __optimize("Os")
-#else
-#define __cold __optimize("Os")
-#endif
-#else
-#define __cold
-#endif
-#endif /* __cold */
-
-#if __GNUC_PREREQ(4, 4) || defined(__clang__)
-
-#if defined(__ia32__) || defined(__e2k__)
-#include <x86intrin.h>
-#endif
-
-#if defined(__ia32__) && !defined(__cpuid_count)
-#include <cpuid.h>
-#endif
-
-#if defined(__e2k__)
-#include <e2kbuiltin.h>
-#endif
-
-#ifndef likely
-#define likely(cond) __builtin_expect(!!(cond), 1)
-#endif
-
-#ifndef unlikely
-#define unlikely(cond) __builtin_expect(!!(cond), 0)
-#endif
-
-#if __GNUC_PREREQ(4, 5) || __has_builtin(__builtin_unreachable)
-#define unreachable() __builtin_unreachable()
-#endif
-
-#define bswap64(v) __builtin_bswap64(v)
-#define bswap32(v) __builtin_bswap32(v)
-#if __GNUC_PREREQ(4, 8) || __has_builtin(__builtin_bswap16)
-#define bswap16(v) __builtin_bswap16(v)
-#endif
-
-#if !defined(__maybe_unused) && (__GNUC_PREREQ(4, 3) || __has_attribute(unused))
-#define __maybe_unused __attribute__((unused))
-#endif
-
-#if !defined(__always_inline) &&                                               \
-    (__GNUC_PREREQ(3, 2) || __has_attribute(always_inline))
-#define __always_inline __inline __attribute__((always_inline))
-#endif
-
-#if defined(__e2k__)
-
-#if __iset__ >= 3
-#define mul_64x64_high(a, b) __builtin_e2k_umulhd(a, b)
-#endif /* __iset__ >= 3 */
-
-#if __iset__ >= 5
-static __maybe_unused __always_inline unsigned
-e2k_add64carry_first(uint64_t base, uint64_t addend, uint64_t *sum) {
-  *sum = base + addend;
-  return (unsigned)__builtin_e2k_addcd_c(base, addend, 0);
-}
-#define add64carry_first(base, addend, sum)                                    \
-  e2k_add64carry_first(base, addend, sum)
-
-static __maybe_unused __always_inline unsigned
-e2k_add64carry_next(unsigned carry, uint64_t base, uint64_t addend,
-                    uint64_t *sum) {
-  *sum = __builtin_e2k_addcd(base, addend, carry);
-  return (unsigned)__builtin_e2k_addcd_c(base, addend, carry);
-}
-#define add64carry_next(carry, base, addend, sum)                              \
-  e2k_add64carry_next(carry, base, addend, sum)
-
-static __maybe_unused __always_inline void e2k_add64carry_last(unsigned carry,
-                                                               uint64_t base,
-                                                               uint64_t addend,
-                                                               uint64_t *sum) {
-  *sum = __builtin_e2k_addcd(base, addend, carry);
-}
-#define add64carry_last(carry, base, addend, sum)                              \
-  e2k_add64carry_last(carry, base, addend, sum)
-#endif /* __iset__ >= 5 */
-
-#define fetch64_be_aligned(ptr) ((uint64_t)__builtin_e2k_ld_64s_be(ptr))
-#define fetch32_be_aligned(ptr) ((uint32_t)__builtin_e2k_ld_32u_be(ptr))
-
-#endif /* __e2k__ Elbrus */
-
-#elif defined(_MSC_VER)
-
-#if _MSC_FULL_VER < 190024234 && defined(_M_IX86)
-#pragma message(                                                               \
-    "For AES-NI at least \"Microsoft C/C++ Compiler\" version 19.00.24234 (Visual Studio 2015 Update 3) is required.")
-#endif
-#if _MSC_FULL_VER < 191526730
-#pragma message(                                                               \
-    "It is recommended to use \"Microsoft C/C++ Compiler\" version 19.15.26730 (Visual Studio 2017 15.8) or newer.")
-#endif
-#if _MSC_FULL_VER < 180040629
-#error At least "Microsoft C/C++ Compiler" version 18.00.40629 (Visual Studio 2013 Update 5) is required.
-#endif
-
-#pragma warning(push, 1)
-
-#include <intrin.h>
-#include <stdlib.h>
-#define likely(cond) (cond)
-#define unlikely(cond) (cond)
-#define unreachable() __assume(0)
-#define bswap64(v) _byteswap_uint64(v)
-#define bswap32(v) _byteswap_ulong(v)
-#define bswap16(v) _byteswap_ushort(v)
-#define rot64(v, s) _rotr64(v, s)
-#define rot32(v, s) _rotr(v, s)
-#define __always_inline __forceinline
-
-#if defined(_M_X64) || defined(_M_IA64)
-#pragma intrinsic(_umul128)
-#define mul_64x64_128(a, b, ph) _umul128(a, b, ph)
-#pragma intrinsic(_addcarry_u64)
-#define add64carry_first(base, addend, sum) _addcarry_u64(0, base, addend, sum)
-#define add64carry_next(carry, base, addend, sum)                              \
-  _addcarry_u64(carry, base, addend, sum)
-#define add64carry_last(carry, base, addend, sum)                              \
-  (void)_addcarry_u64(carry, base, addend, sum)
-#endif
-
-#if defined(_M_ARM64) || defined(_M_X64) || defined(_M_IA64)
-#pragma intrinsic(__umulh)
-#define mul_64x64_high(a, b) __umulh(a, b)
-#endif
-
-#if defined(_M_IX86)
-#pragma intrinsic(__emulu)
-#define mul_32x32_64(a, b) __emulu(a, b)
-
-#if _MSC_VER >= 1915 /* LY: workaround for SSA-optimizer bug */
-#pragma intrinsic(_addcarry_u32)
-#define add32carry_first(base, addend, sum) _addcarry_u32(0, base, addend, sum)
-#define add32carry_next(carry, base, addend, sum)                              \
-  _addcarry_u32(carry, base, addend, sum)
-#define add32carry_last(carry, base, addend, sum)                              \
-  (void)_addcarry_u32(carry, base, addend, sum)
-
-static __forceinline char
-msvc32_add64carry_first(uint64_t base, uint64_t addend, uint64_t *sum) {
-  uint32_t *const sum32 = (uint32_t *)sum;
-  const uint32_t base_32l = (uint32_t)base;
-  const uint32_t base_32h = (uint32_t)(base >> 32);
-  const uint32_t addend_32l = (uint32_t)addend;
-  const uint32_t addend_32h = (uint32_t)(addend >> 32);
-  return add32carry_next(add32carry_first(base_32l, addend_32l, sum32),
-                         base_32h, addend_32h, sum32 + 1);
-}
-#define add64carry_first(base, addend, sum)                                    \
-  msvc32_add64carry_first(base, addend, sum)
-
-static __forceinline char msvc32_add64carry_next(char carry, uint64_t base,
-                                                 uint64_t addend,
-                                                 uint64_t *sum) {
-  uint32_t *const sum32 = (uint32_t *)sum;
-  const uint32_t base_32l = (uint32_t)base;
-  const uint32_t base_32h = (uint32_t)(base >> 32);
-  const uint32_t addend_32l = (uint32_t)addend;
-  const uint32_t addend_32h = (uint32_t)(addend >> 32);
-  return add32carry_next(add32carry_next(carry, base_32l, addend_32l, sum32),
-                         base_32h, addend_32h, sum32 + 1);
-}
-#define add64carry_next(carry, base, addend, sum)                              \
-  msvc32_add64carry_next(carry, base, addend, sum)
-
-static __forceinline void msvc32_add64carry_last(char carry, uint64_t base,
-                                                 uint64_t addend,
-                                                 uint64_t *sum) {
-  uint32_t *const sum32 = (uint32_t *)sum;
-  const uint32_t base_32l = (uint32_t)base;
-  const uint32_t base_32h = (uint32_t)(base >> 32);
-  const uint32_t addend_32l = (uint32_t)addend;
-  const uint32_t addend_32h = (uint32_t)(addend >> 32);
-  add32carry_last(add32carry_next(carry, base_32l, addend_32l, sum32), base_32h,
-                  addend_32h, sum32 + 1);
-}
-#define add64carry_last(carry, base, addend, sum)                              \
-  msvc32_add64carry_last(carry, base, addend, sum)
-#endif /* _MSC_VER >= 1915 */
-
-#elif defined(_M_ARM)
-#define mul_32x32_64(a, b) _arm_umull(a, b)
-#endif
-
-#pragma warning(pop)
-#pragma warning(disable : 4514) /* 'xyz': unreferenced inline function         \
-                                   has been removed */
-#pragma warning(disable : 4710) /* 'xyz': function not inlined */
-#pragma warning(disable : 4711) /* function 'xyz' selected for                 \
-                                   automatic inline expansion */
-#pragma warning(disable : 4127) /* conditional expression is constant */
-#pragma warning(disable : 4702) /* unreachable code */
-#endif                          /* Compiler */
-
-#ifndef likely
-#define likely(cond) (cond)
-#endif
-#ifndef unlikely
-#define unlikely(cond) (cond)
-#endif
-#ifndef __maybe_unused
-#define __maybe_unused
-#endif
-#ifndef __always_inline
-#define __always_inline __inline
-#endif
-#ifndef unreachable
-#define unreachable()                                                          \
-  do {                                                                         \
-  } while (1)
-#endif
-
-#ifndef bswap64
-#if defined(bswap_64)
-#define bswap64 bswap_64
-#elif defined(__bswap_64)
-#define bswap64 __bswap_64
-#else
-static __always_inline uint64_t bswap64(uint64_t v) {
-  return v << 56 | v >> 56 | ((v << 40) & UINT64_C(0x00ff000000000000)) |
-         ((v << 24) & UINT64_C(0x0000ff0000000000)) |
-         ((v << 8) & UINT64_C(0x000000ff00000000)) |
-         ((v >> 8) & UINT64_C(0x00000000ff000000)) |
-         ((v >> 24) & UINT64_C(0x0000000000ff0000)) |
-         ((v >> 40) & UINT64_C(0x000000000000ff00));
-}
-#endif
-#endif /* bswap64 */
-
-#ifndef bswap32
-#if defined(bswap_32)
-#define bswap32 bswap_32
-#elif defined(__bswap_32)
-#define bswap32 __bswap_32
-#else
-static __always_inline uint32_t bswap32(uint32_t v) {
-  return v << 24 | v >> 24 | ((v << 8) & UINT32_C(0x00ff0000)) |
-         ((v >> 8) & UINT32_C(0x0000ff00));
-}
-#endif
-#endif /* bswap32 */
-
-#ifndef bswap16
-#if defined(bswap_16)
-#define bswap16 bswap_16
-#elif defined(__bswap_16)
-#define bswap16 __bswap_16
-#else
-static __always_inline uint16_t bswap16(uint16_t v) { return v << 8 | v >> 8; }
-#endif
-#endif /* bswap16 */
-
-#ifndef read_unaligned
-#if defined(__GNUC__) || __has_attribute(packed)
-typedef struct {
-  uint8_t unaligned_8;
-  uint16_t unaligned_16;
-  uint32_t unaligned_32;
-  uint64_t unaligned_64;
-} __attribute__((packed)) t1ha_unaligned_proxy;
-#define read_unaligned(ptr, bits)                                              \
-  (((const t1ha_unaligned_proxy *)((const uint8_t *)(ptr)-offsetof(            \
-        t1ha_unaligned_proxy, unaligned_##bits)))                              \
-       ->unaligned_##bits)
-#elif defined(_MSC_VER)
-#pragma warning(                                                               \
-    disable : 4235) /* nonstandard extension used: '__unaligned'               \
-                     * keyword not supported on this architecture */
-#define read_unaligned(ptr, bits) (*(const __unaligned uint##bits##_t *)(ptr))
-#else
-#pragma pack(push, 1)
-typedef struct {
-  uint8_t unaligned_8;
-  uint16_t unaligned_16;
-  uint32_t unaligned_32;
-  uint64_t unaligned_64;
-} t1ha_unaligned_proxy;
-#pragma pack(pop)
-#define read_unaligned(ptr, bits)                                              \
-  (((const t1ha_unaligned_proxy *)((const uint8_t *)(ptr)-offsetof(            \
-        t1ha_unaligned_proxy, unaligned_##bits)))                              \
-       ->unaligned_##bits)
-#endif
-#endif /* read_unaligned */
-
-#ifndef read_aligned
-#if __GNUC_PREREQ(4, 8) || __has_builtin(__builtin_assume_aligned)
-#define read_aligned(ptr, bits)                                                \
-  (*(const uint##bits##_t *)__builtin_assume_aligned(ptr, ALIGNMENT_##bits))
-#elif (__GNUC_PREREQ(3, 3) || __has_attribute(aligned)) && !defined(__clang__)
-#define read_aligned(ptr, bits)                                                \
-  (*(const uint##bits##_t __attribute__((aligned(ALIGNMENT_##bits))) *)(ptr))
-#elif __has_attribute(assume_aligned)
-
-static __always_inline const
-    uint16_t *__attribute__((assume_aligned(ALIGNMENT_16)))
-    cast_aligned_16(const void *ptr) {
-  return (const uint16_t *)ptr;
-}
-static __always_inline const
-    uint32_t *__attribute__((assume_aligned(ALIGNMENT_32)))
-    cast_aligned_32(const void *ptr) {
-  return (const uint32_t *)ptr;
-}
-static __always_inline const
-    uint64_t *__attribute__((assume_aligned(ALIGNMENT_64)))
-    cast_aligned_64(const void *ptr) {
-  return (const uint64_t *)ptr;
-}
-
-#define read_aligned(ptr, bits) (*cast_aligned_##bits(ptr))
-
-#elif defined(_MSC_VER)
-#define read_aligned(ptr, bits)                                                \
-  (*(const __declspec(align(ALIGNMENT_##bits)) uint##bits##_t *)(ptr))
-#else
-#define read_aligned(ptr, bits) (*(const uint##bits##_t *)(ptr))
-#endif
-#endif /* read_aligned */
-
-#ifndef prefetch
-#if (__GNUC_PREREQ(4, 0) || __has_builtin(__builtin_prefetch)) &&              \
-    !defined(__ia32__)
-#define prefetch(ptr) __builtin_prefetch(ptr)
-#elif defined(_M_ARM64) || defined(_M_ARM)
-#define prefetch(ptr) __prefetch(ptr)
-#else
-#define prefetch(ptr)                                                          \
-  do {                                                                         \
-    (void)(ptr);                                                               \
-  } while (0)
-#endif
-#endif /* prefetch */
-
-#if __has_warning("-Wconstant-logical-operand")
-#if defined(__clang__)
-#pragma clang diagnostic ignored "-Wconstant-logical-operand"
-#elif defined(__GNUC__)
-#pragma GCC diagnostic ignored "-Wconstant-logical-operand"
-#else
-#pragma warning disable "constant-logical-operand"
-#endif
-#endif /* -Wconstant-logical-operand */
-
-#if __has_warning("-Wtautological-pointer-compare")
-#if defined(__clang__)
-#pragma clang diagnostic ignored "-Wtautological-pointer-compare"
-#elif defined(__GNUC__)
-#pragma GCC diagnostic ignored "-Wtautological-pointer-compare"
-#else
-#pragma warning disable "tautological-pointer-compare"
-#endif
-#endif /* -Wtautological-pointer-compare */
-
-/***************************************************************************/
-
-#if __GNUC_PREREQ(4, 0)
-#pragma GCC visibility push(hidden)
-#endif /* __GNUC_PREREQ(4,0) */
-
-/*---------------------------------------------------------- Little Endian */
-
-#ifndef fetch16_le_aligned
-static __always_inline uint16_t fetch16_le_aligned(const void *v) {
-  assert(((uintptr_t)v) % ALIGNMENT_16 == 0);
-#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
-  return read_aligned(v, 16);
-#else
-  return bswap16(read_aligned(v, 16));
-#endif
-}
-#endif /* fetch16_le_aligned */
-
-#ifndef fetch16_le_unaligned
-static __always_inline uint16_t fetch16_le_unaligned(const void *v) {
-#if T1HA_SYS_UNALIGNED_ACCESS == T1HA_UNALIGNED_ACCESS__UNABLE
-  const uint8_t *p = (const uint8_t *)v;
-  return p[0] | (uint16_t)p[1] << 8;
-#elif __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
-  return read_unaligned(v, 16);
-#else
-  return bswap16(read_unaligned(v, 16));
-#endif
-}
-#endif /* fetch16_le_unaligned */
-
-#ifndef fetch32_le_aligned
-static __always_inline uint32_t fetch32_le_aligned(const void *v) {
-  assert(((uintptr_t)v) % ALIGNMENT_32 == 0);
-#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
-  return read_aligned(v, 32);
-#else
-  return bswap32(read_aligned(v, 32));
-#endif
-}
-#endif /* fetch32_le_aligned */
-
-#ifndef fetch32_le_unaligned
-static __always_inline uint32_t fetch32_le_unaligned(const void *v) {
-#if T1HA_SYS_UNALIGNED_ACCESS == T1HA_UNALIGNED_ACCESS__UNABLE
-  return fetch16_le_unaligned(v) |
-         (uint32_t)fetch16_le_unaligned((const uint8_t *)v + 2) << 16;
-#elif __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
-  return read_unaligned(v, 32);
-#else
-  return bswap32(read_unaligned(v, 32));
-#endif
-}
-#endif /* fetch32_le_unaligned */
-
-#ifndef fetch64_le_aligned
-static __always_inline uint64_t fetch64_le_aligned(const void *v) {
-  assert(((uintptr_t)v) % ALIGNMENT_64 == 0);
-#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
-  return read_aligned(v, 64);
-#else
-  return bswap64(read_aligned(v, 64));
-#endif
-}
-#endif /* fetch64_le_aligned */
-
-#ifndef fetch64_le_unaligned
-static __always_inline uint64_t fetch64_le_unaligned(const void *v) {
-#if T1HA_SYS_UNALIGNED_ACCESS == T1HA_UNALIGNED_ACCESS__UNABLE
-  return fetch32_le_unaligned(v) |
-         (uint64_t)fetch32_le_unaligned((const uint8_t *)v + 4) << 32;
-#elif __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
-  return read_unaligned(v, 64);
-#else
-  return bswap64(read_unaligned(v, 64));
-#endif
-}
-#endif /* fetch64_le_unaligned */
-
-static __always_inline uint64_t tail64_le_aligned(const void *v, size_t tail) {
-  const uint8_t *const p = (const uint8_t *)v;
-#if T1HA_USE_FAST_ONESHOT_READ && !defined(__SANITIZE_ADDRESS__)
-  /* We can perform a 'oneshot' read, which is little bit faster. */
-  const unsigned shift = ((8 - tail) & 7) << 3;
-  return fetch64_le_aligned(p) & ((~UINT64_C(0)) >> shift);
-#else
-  uint64_t r = 0;
-  switch (tail & 7) {
-  default:
-    unreachable();
-/* fall through */
-#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
-  /* For most CPUs this code is better when not needed byte reordering. */
-  case 0:
-    return fetch64_le_aligned(p);
-  case 7:
-    r = (uint64_t)p[6] << 8;
-  /* fall through */
-  case 6:
-    r += p[5];
-    r <<= 8;
-  /* fall through */
-  case 5:
-    r += p[4];
-    r <<= 32;
-  /* fall through */
-  case 4:
-    return r + fetch32_le_aligned(p);
-  case 3:
-    r = (uint64_t)p[2] << 16;
-  /* fall through */
-  case 2:
-    return r + fetch16_le_aligned(p);
-  case 1:
-    return p[0];
-#else
-  case 0:
-    r = p[7] << 8;
-  /* fall through */
-  case 7:
-    r += p[6];
-    r <<= 8;
-  /* fall through */
-  case 6:
-    r += p[5];
-    r <<= 8;
-  /* fall through */
-  case 5:
-    r += p[4];
-    r <<= 8;
-  /* fall through */
-  case 4:
-    r += p[3];
-    r <<= 8;
-  /* fall through */
-  case 3:
-    r += p[2];
-    r <<= 8;
-  /* fall through */
-  case 2:
-    r += p[1];
-    r <<= 8;
-  /* fall through */
-  case 1:
-    return r + p[0];
-#endif
-  }
-#endif /* T1HA_USE_FAST_ONESHOT_READ */
-}
-
-#if T1HA_USE_FAST_ONESHOT_READ &&                                              \
-    T1HA_SYS_UNALIGNED_ACCESS != T1HA_UNALIGNED_ACCESS__UNABLE &&              \
-    defined(PAGESIZE) && PAGESIZE > 42 && !defined(__SANITIZE_ADDRESS__)
-#define can_read_underside(ptr, size)                                          \
-  (((PAGESIZE - (size)) & (uintptr_t)(ptr)) != 0)
-#endif /* T1HA_USE_FAST_ONESHOT_READ */
-
-static __always_inline uint64_t tail64_le_unaligned(const void *v,
-                                                    size_t tail) {
-  const uint8_t *p = (const uint8_t *)v;
-#if defined(can_read_underside) &&                                             \
-    (UINTPTR_MAX > 0xffffFFFFul || ULONG_MAX > 0xffffFFFFul)
-  /* On some systems (e.g. x86_64) we can perform a 'oneshot' read, which
-   * is little bit faster. Thanks Marcin Żukowski <marcin.zukowski@gmail.com>
-   * for the reminder. */
-  const unsigned offset = (8 - tail) & 7;
-  const unsigned shift = offset << 3;
-  if (likely(can_read_underside(p, 8))) {
-    p -= offset;
-    return fetch64_le_unaligned(p) >> shift;
-  }
-  return fetch64_le_unaligned(p) & ((~UINT64_C(0)) >> shift);
-#else
-  uint64_t r = 0;
-  switch (tail & 7) {
-  default:
-    unreachable();
-/* fall through */
-#if T1HA_SYS_UNALIGNED_ACCESS == T1HA_UNALIGNED_ACCESS__EFFICIENT &&           \
-    __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
-  /* For most CPUs this code is better when not needed
-   * copying for alignment or byte reordering. */
-  case 0:
-    return fetch64_le_unaligned(p);
-  case 7:
-    r = (uint64_t)p[6] << 8;
-  /* fall through */
-  case 6:
-    r += p[5];
-    r <<= 8;
-  /* fall through */
-  case 5:
-    r += p[4];
-    r <<= 32;
-  /* fall through */
-  case 4:
-    return r + fetch32_le_unaligned(p);
-  case 3:
-    r = (uint64_t)p[2] << 16;
-  /* fall through */
-  case 2:
-    return r + fetch16_le_unaligned(p);
-  case 1:
-    return p[0];
-#else
-  /* For most CPUs this code is better than a
-   * copying for alignment and/or byte reordering. */
-  case 0:
-    r = p[7] << 8;
-  /* fall through */
-  case 7:
-    r += p[6];
-    r <<= 8;
-  /* fall through */
-  case 6:
-    r += p[5];
-    r <<= 8;
-  /* fall through */
-  case 5:
-    r += p[4];
-    r <<= 8;
-  /* fall through */
-  case 4:
-    r += p[3];
-    r <<= 8;
-  /* fall through */
-  case 3:
-    r += p[2];
-    r <<= 8;
-  /* fall through */
-  case 2:
-    r += p[1];
-    r <<= 8;
-  /* fall through */
-  case 1:
-    return r + p[0];
-#endif
-  }
-#endif /* can_read_underside */
-}
-
-/*------------------------------------------------------------- Big Endian */
-
-#ifndef fetch16_be_aligned
-static __maybe_unused __always_inline uint16_t
-fetch16_be_aligned(const void *v) {
-  assert(((uintptr_t)v) % ALIGNMENT_16 == 0);
-#if __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
-  return read_aligned(v, 16);
-#else
-  return bswap16(read_aligned(v, 16));
-#endif
-}
-#endif /* fetch16_be_aligned */
-
-#ifndef fetch16_be_unaligned
-static __maybe_unused __always_inline uint16_t
-fetch16_be_unaligned(const void *v) {
-#if T1HA_SYS_UNALIGNED_ACCESS == T1HA_UNALIGNED_ACCESS__UNABLE
-  const uint8_t *p = (const uint8_t *)v;
-  return (uint16_t)p[0] << 8 | p[1];
-#elif __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
-  return read_unaligned(v, 16);
-#else
-  return bswap16(read_unaligned(v, 16));
-#endif
-}
-#endif /* fetch16_be_unaligned */
-
-#ifndef fetch32_be_aligned
-static __maybe_unused __always_inline uint32_t
-fetch32_be_aligned(const void *v) {
-  assert(((uintptr_t)v) % ALIGNMENT_32 == 0);
-#if __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
-  return read_aligned(v, 32);
-#else
-  return bswap32(read_aligned(v, 32));
-#endif
-}
-#endif /* fetch32_be_aligned */
-
-#ifndef fetch32_be_unaligned
-static __maybe_unused __always_inline uint32_t
-fetch32_be_unaligned(const void *v) {
-#if T1HA_SYS_UNALIGNED_ACCESS == T1HA_UNALIGNED_ACCESS__UNABLE
-  return (uint32_t)fetch16_be_unaligned(v) << 16 |
-         fetch16_be_unaligned((const uint8_t *)v + 2);
-#elif __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
-  return read_unaligned(v, 32);
-#else
-  return bswap32(read_unaligned(v, 32));
-#endif
-}
-#endif /* fetch32_be_unaligned */
-
-#ifndef fetch64_be_aligned
-static __maybe_unused __always_inline uint64_t
-fetch64_be_aligned(const void *v) {
-  assert(((uintptr_t)v) % ALIGNMENT_64 == 0);
-#if __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
-  return read_aligned(v, 64);
-#else
-  return bswap64(read_aligned(v, 64));
-#endif
-}
-#endif /* fetch64_be_aligned */
-
-#ifndef fetch64_be_unaligned
-static __maybe_unused __always_inline uint64_t
-fetch64_be_unaligned(const void *v) {
-#if T1HA_SYS_UNALIGNED_ACCESS == T1HA_UNALIGNED_ACCESS__UNABLE
-  return (uint64_t)fetch32_be_unaligned(v) << 32 |
-         fetch32_be_unaligned((const uint8_t *)v + 4);
-#elif __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
-  return read_unaligned(v, 64);
-#else
-  return bswap64(read_unaligned(v, 64));
-#endif
-}
-#endif /* fetch64_be_unaligned */
-
-static __maybe_unused __always_inline uint64_t tail64_be_aligned(const void *v,
-                                                                 size_t tail) {
-  const uint8_t *const p = (const uint8_t *)v;
-#if T1HA_USE_FAST_ONESHOT_READ && !defined(__SANITIZE_ADDRESS__)
-  /* We can perform a 'oneshot' read, which is little bit faster. */
-  const unsigned shift = ((8 - tail) & 7) << 3;
-  return fetch64_be_aligned(p) >> shift;
-#else
-  switch (tail & 7) {
-  default:
-    unreachable();
-/* fall through */
-#if __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
-  /* For most CPUs this code is better when not byte reordering. */
-  case 1:
-    return p[0];
-  case 2:
-    return fetch16_be_aligned(p);
-  case 3:
-    return (uint32_t)fetch16_be_aligned(p) << 8 | p[2];
-  case 4:
-    return fetch32_be_aligned(p);
-  case 5:
-    return (uint64_t)fetch32_be_aligned(p) << 8 | p[4];
-  case 6:
-    return (uint64_t)fetch32_be_aligned(p) << 16 | fetch16_be_aligned(p + 4);
-  case 7:
-    return (uint64_t)fetch32_be_aligned(p) << 24 |
-           (uint32_t)fetch16_be_aligned(p + 4) << 8 | p[6];
-  case 0:
-    return fetch64_be_aligned(p);
-#else
-  case 1:
-    return p[0];
-  case 2:
-    return p[1] | (uint32_t)p[0] << 8;
-  case 3:
-    return p[2] | (uint32_t)p[1] << 8 | (uint32_t)p[0] << 16;
-  case 4:
-    return p[3] | (uint32_t)p[2] << 8 | (uint32_t)p[1] << 16 |
-           (uint32_t)p[0] << 24;
-  case 5:
-    return p[4] | (uint32_t)p[3] << 8 | (uint32_t)p[2] << 16 |
-           (uint32_t)p[1] << 24 | (uint64_t)p[0] << 32;
-  case 6:
-    return p[5] | (uint32_t)p[4] << 8 | (uint32_t)p[3] << 16 |
-           (uint32_t)p[2] << 24 | (uint64_t)p[1] << 32 | (uint64_t)p[0] << 40;
-  case 7:
-    return p[6] | (uint32_t)p[5] << 8 | (uint32_t)p[4] << 16 |
-           (uint32_t)p[3] << 24 | (uint64_t)p[2] << 32 | (uint64_t)p[1] << 40 |
-           (uint64_t)p[0] << 48;
-  case 0:
-    return p[7] | (uint32_t)p[6] << 8 | (uint32_t)p[5] << 16 |
-           (uint32_t)p[4] << 24 | (uint64_t)p[3] << 32 | (uint64_t)p[2] << 40 |
-           (uint64_t)p[1] << 48 | (uint64_t)p[0] << 56;
-#endif
-  }
-#endif /* T1HA_USE_FAST_ONESHOT_READ */
-}
-
-static __maybe_unused __always_inline uint64_t
-tail64_be_unaligned(const void *v, size_t tail) {
-  const uint8_t *p = (const uint8_t *)v;
-#if defined(can_read_underside) &&                                             \
-    (UINTPTR_MAX > 0xffffFFFFul || ULONG_MAX > 0xffffFFFFul)
-  /* On some systems (e.g. x86_64) we can perform a 'oneshot' read, which
-   * is little bit faster. Thanks Marcin Żukowski <marcin.zukowski@gmail.com>
-   * for the reminder. */
-  const unsigned offset = (8 - tail) & 7;
-  const unsigned shift = offset << 3;
-  if (likely(can_read_underside(p, 8))) {
-    p -= offset;
-    return fetch64_be_unaligned(p) & ((~UINT64_C(0)) >> shift);
-  }
-  return fetch64_be_unaligned(p) >> shift;
-#else
-  switch (tail & 7) {
-  default:
-    unreachable();
-/* fall through */
-#if T1HA_SYS_UNALIGNED_ACCESS == T1HA_UNALIGNED_ACCESS__EFFICIENT &&           \
-    __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
-  /* For most CPUs this code is better when not needed
-   * copying for alignment or byte reordering. */
-  case 1:
-    return p[0];
-  case 2:
-    return fetch16_be_unaligned(p);
-  case 3:
-    return (uint32_t)fetch16_be_unaligned(p) << 8 | p[2];
-  case 4:
-    return fetch32_be(p);
-  case 5:
-    return (uint64_t)fetch32_be_unaligned(p) << 8 | p[4];
-  case 6:
-    return (uint64_t)fetch32_be_unaligned(p) << 16 |
-           fetch16_be_unaligned(p + 4);
-  case 7:
-    return (uint64_t)fetch32_be_unaligned(p) << 24 |
-           (uint32_t)fetch16_be_unaligned(p + 4) << 8 | p[6];
-  case 0:
-    return fetch64_be_unaligned(p);
-#else
-  /* For most CPUs this code is better than a
-   * copying for alignment and/or byte reordering. */
-  case 1:
-    return p[0];
-  case 2:
-    return p[1] | (uint32_t)p[0] << 8;
-  case 3:
-    return p[2] | (uint32_t)p[1] << 8 | (uint32_t)p[0] << 16;
-  case 4:
-    return p[3] | (uint32_t)p[2] << 8 | (uint32_t)p[1] << 16 |
-           (uint32_t)p[0] << 24;
-  case 5:
-    return p[4] | (uint32_t)p[3] << 8 | (uint32_t)p[2] << 16 |
-           (uint32_t)p[1] << 24 | (uint64_t)p[0] << 32;
-  case 6:
-    return p[5] | (uint32_t)p[4] << 8 | (uint32_t)p[3] << 16 |
-           (uint32_t)p[2] << 24 | (uint64_t)p[1] << 32 | (uint64_t)p[0] << 40;
-  case 7:
-    return p[6] | (uint32_t)p[5] << 8 | (uint32_t)p[4] << 16 |
-           (uint32_t)p[3] << 24 | (uint64_t)p[2] << 32 | (uint64_t)p[1] << 40 |
-           (uint64_t)p[0] << 48;
-  case 0:
-    return p[7] | (uint32_t)p[6] << 8 | (uint32_t)p[5] << 16 |
-           (uint32_t)p[4] << 24 | (uint64_t)p[3] << 32 | (uint64_t)p[2] << 40 |
-           (uint64_t)p[1] << 48 | (uint64_t)p[0] << 56;
-#endif
-  }
-#endif /* can_read_underside */
-}
-
-/***************************************************************************/
-
-#ifndef rot64
-static __always_inline uint64_t rot64(uint64_t v, unsigned s) {
-  return (v >> s) | (v << (64 - s));
-}
-#endif /* rot64 */
-
-#ifndef mul_32x32_64
-static __always_inline uint64_t mul_32x32_64(uint32_t a, uint32_t b) {
-  return a * (uint64_t)b;
-}
-#endif /* mul_32x32_64 */
-
-#ifndef add64carry_first
-static __maybe_unused __always_inline unsigned
-add64carry_first(uint64_t base, uint64_t addend, uint64_t *sum) {
-#if __has_builtin(__builtin_addcll)
-  unsigned long long carryout;
-  *sum = __builtin_addcll(base, addend, 0, &carryout);
-  return (unsigned)carryout;
-#else
-  *sum = base + addend;
-  return *sum < addend;
-#endif /* __has_builtin(__builtin_addcll) */
-}
-#endif /* add64carry_fist */
-
-#ifndef add64carry_next
-static __maybe_unused __always_inline unsigned
-add64carry_next(unsigned carry, uint64_t base, uint64_t addend, uint64_t *sum) {
-#if __has_builtin(__builtin_addcll)
-  unsigned long long carryout;
-  *sum = __builtin_addcll(base, addend, carry, &carryout);
-  return (unsigned)carryout;
-#else
-  *sum = base + addend + carry;
-  return *sum < addend || (carry && *sum == addend);
-#endif /* __has_builtin(__builtin_addcll) */
-}
-#endif /* add64carry_next */
-
-#ifndef add64carry_last
-static __maybe_unused __always_inline void
-add64carry_last(unsigned carry, uint64_t base, uint64_t addend, uint64_t *sum) {
-#if __has_builtin(__builtin_addcll)
-  unsigned long long carryout;
-  *sum = __builtin_addcll(base, addend, carry, &carryout);
-  (void)carryout;
-#else
-  *sum = base + addend + carry;
-#endif /* __has_builtin(__builtin_addcll) */
-}
-#endif /* add64carry_last */
-
-#ifndef mul_64x64_128
-static __maybe_unused __always_inline uint64_t mul_64x64_128(uint64_t a,
-                                                             uint64_t b,
-                                                             uint64_t *h) {
-#if defined(__SIZEOF_INT128__) ||                                              \
-    (defined(_INTEGRAL_MAX_BITS) && _INTEGRAL_MAX_BITS >= 128)
-  __uint128_t r = (__uint128_t)a * (__uint128_t)b;
-  /* modern GCC could nicely optimize this */
-  *h = (uint64_t)(r >> 64);
-  return (uint64_t)r;
-#elif defined(mul_64x64_high)
-  *h = mul_64x64_high(a, b);
-  return a * b;
-#else
-  /* performs 64x64 to 128 bit multiplication */
-  const uint64_t ll = mul_32x32_64((uint32_t)a, (uint32_t)b);
-  const uint64_t lh = mul_32x32_64(a >> 32, (uint32_t)b);
-  const uint64_t hl = mul_32x32_64((uint32_t)a, b >> 32);
-  const uint64_t hh = mul_32x32_64(a >> 32, b >> 32);
-
-  /* Few simplification are possible here for 32-bit architectures,
-   * but thus we would lost compatibility with the original 64-bit
-   * version.  Think is very bad idea, because then 32-bit t1ha will
-   * still (relatively) very slowly and well yet not compatible. */
-  uint64_t l;
-  add64carry_last(add64carry_first(ll, lh << 32, &l), hh, lh >> 32, h);
-  add64carry_last(add64carry_first(l, hl << 32, &l), *h, hl >> 32, h);
-  return l;
-#endif
-}
-#endif /* mul_64x64_128() */
-
-#ifndef mul_64x64_high
-static __maybe_unused __always_inline uint64_t mul_64x64_high(uint64_t a,
-                                                              uint64_t b) {
-  uint64_t h;
-  mul_64x64_128(a, b, &h);
-  return h;
-}
-#endif /* mul_64x64_high */
-
-/***************************************************************************/
-
-/* 'magic' primes */
-static const uint64_t prime_0 = UINT64_C(0xEC99BF0D8372CAAB);
-static const uint64_t prime_1 = UINT64_C(0x82434FE90EDCEF39);
-static const uint64_t prime_2 = UINT64_C(0xD4F06DB99D67BE4B);
-static const uint64_t prime_3 = UINT64_C(0xBD9CACC22C6E9571);
-static const uint64_t prime_4 = UINT64_C(0x9C06FAF4D023E3AB);
-static const uint64_t prime_5 = UINT64_C(0xC060724A8424F345);
-static const uint64_t prime_6 = UINT64_C(0xCB5AF53AE3AAAC31);
-
-/* xor high and low parts of full 128-bit product */
-static __maybe_unused __always_inline uint64_t mux64(uint64_t v,
-                                                     uint64_t prime) {
-  uint64_t l, h;
-  l = mul_64x64_128(v, prime, &h);
-  return l ^ h;
-}
-
-static __always_inline uint64_t final64(uint64_t a, uint64_t b) {
-  uint64_t x = (a + rot64(b, 41)) * prime_0;
-  uint64_t y = (rot64(a, 23) + b) * prime_6;
-  return mux64(x ^ y, prime_5);
-}
-
-static __always_inline void mixup64(uint64_t *__restrict a,
-                                    uint64_t *__restrict b, uint64_t v,
-                                    uint64_t prime) {
-  uint64_t h;
-  *a ^= mul_64x64_128(*b + v, prime, &h);
-  *b += h;
-}
-
-/***************************************************************************/
-
-typedef union t1ha_uint128 {
-#if defined(__SIZEOF_INT128__) ||                                              \
-    (defined(_INTEGRAL_MAX_BITS) && _INTEGRAL_MAX_BITS >= 128)
-  __uint128_t v;
-#endif
-  struct {
-#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
-    uint64_t l, h;
-#else
-    uint64_t h, l;
-#endif
-  };
-} t1ha_uint128_t;
-
-static __always_inline t1ha_uint128_t not128(const t1ha_uint128_t v) {
-  t1ha_uint128_t r;
-#if defined(__SIZEOF_INT128__) ||                                              \
-    (defined(_INTEGRAL_MAX_BITS) && _INTEGRAL_MAX_BITS >= 128)
-  r.v = ~v.v;
-#else
-  r.l = ~v.l;
-  r.h = ~v.h;
-#endif
-  return r;
-}
-
-static __always_inline t1ha_uint128_t left128(const t1ha_uint128_t v,
-                                              unsigned s) {
-  t1ha_uint128_t r;
-  assert(s < 128);
-#if defined(__SIZEOF_INT128__) ||                                              \
-    (defined(_INTEGRAL_MAX_BITS) && _INTEGRAL_MAX_BITS >= 128)
-  r.v = v.v << s;
-#else
-  r.l = (s < 64) ? v.l << s : 0;
-  r.h = (s < 64) ? (v.h << s) | (s ? v.l >> (64 - s) : 0) : v.l << (s - 64);
-#endif
-  return r;
-}
-
-static __always_inline t1ha_uint128_t right128(const t1ha_uint128_t v,
-                                               unsigned s) {
-  t1ha_uint128_t r;
-  assert(s < 128);
-#if defined(__SIZEOF_INT128__) ||                                              \
-    (defined(_INTEGRAL_MAX_BITS) && _INTEGRAL_MAX_BITS >= 128)
-  r.v = v.v >> s;
-#else
-  r.l = (s < 64) ? (s ? v.h << (64 - s) : 0) | (v.l >> s) : v.h >> (s - 64);
-  r.h = (s < 64) ? v.h >> s : 0;
-#endif
-  return r;
-}
-
-static __always_inline t1ha_uint128_t or128(t1ha_uint128_t x,
-                                            t1ha_uint128_t y) {
-  t1ha_uint128_t r;
-#if defined(__SIZEOF_INT128__) ||                                              \
-    (defined(_INTEGRAL_MAX_BITS) && _INTEGRAL_MAX_BITS >= 128)
-  r.v = x.v | y.v;
-#else
-  r.l = x.l | y.l;
-  r.h = x.h | y.h;
-#endif
-  return r;
-}
-
-static __always_inline t1ha_uint128_t xor128(t1ha_uint128_t x,
-                                             t1ha_uint128_t y) {
-  t1ha_uint128_t r;
-#if defined(__SIZEOF_INT128__) ||                                              \
-    (defined(_INTEGRAL_MAX_BITS) && _INTEGRAL_MAX_BITS >= 128)
-  r.v = x.v ^ y.v;
-#else
-  r.l = x.l ^ y.l;
-  r.h = x.h ^ y.h;
-#endif
-  return r;
-}
-
-static __always_inline t1ha_uint128_t rot128(t1ha_uint128_t v, unsigned s) {
-  s &= 127;
-#if defined(__SIZEOF_INT128__) ||                                              \
-    (defined(_INTEGRAL_MAX_BITS) && _INTEGRAL_MAX_BITS >= 128)
-  v.v = (v.v << (128 - s)) | (v.v >> s);
-  return v;
-#else
-  return s ? or128(left128(v, 128 - s), right128(v, s)) : v;
-#endif
-}
-
-static __always_inline t1ha_uint128_t add128(t1ha_uint128_t x,
-                                             t1ha_uint128_t y) {
-  t1ha_uint128_t r;
-#if defined(__SIZEOF_INT128__) ||                                              \
-    (defined(_INTEGRAL_MAX_BITS) && _INTEGRAL_MAX_BITS >= 128)
-  r.v = x.v + y.v;
-#else
-  add64carry_last(add64carry_first(x.l, y.l, &r.l), x.h, y.h, &r.h);
-#endif
-  return r;
-}
-
-static __always_inline t1ha_uint128_t mul128(t1ha_uint128_t x,
-                                             t1ha_uint128_t y) {
-  t1ha_uint128_t r;
-#if defined(__SIZEOF_INT128__) ||                                              \
-    (defined(_INTEGRAL_MAX_BITS) && _INTEGRAL_MAX_BITS >= 128)
-  r.v = x.v * y.v;
-#else
-  r.l = mul_64x64_128(x.l, y.l, &r.h);
-  r.h += x.l * y.h + y.l * x.h;
-#endif
-  return r;
-}
-
-/***************************************************************************/
-
-#if T1HA0_AESNI_AVAILABLE && defined(__ia32__)
-uint64_t t1ha_ia32cpu_features(void);
-
-static __always_inline bool t1ha_ia32_AESNI_avail(uint64_t ia32cpu_features) {
-  /* check for AES-NI */
-  return (ia32cpu_features & UINT32_C(0x02000000)) != 0;
-}
-
-static __always_inline bool t1ha_ia32_AVX_avail(uint64_t ia32cpu_features) {
-  /* check for any AVX */
-  return (ia32cpu_features & UINT32_C(0x1A000000)) == UINT32_C(0x1A000000);
-}
-
-static __always_inline bool t1ha_ia32_AVX2_avail(uint64_t ia32cpu_features) {
-  /* check for 'Advanced Vector Extensions 2' */
-  return ((ia32cpu_features >> 32) & 32) != 0;
-}
-
-#endif /* T1HA0_AESNI_AVAILABLE && __ia32__ */
diff --git a/src/virtualMemory.cpp b/src/virtualMemory.cpp
new file mode 100644
index 0000000..f324e95
--- /dev/null
+++ b/src/virtualMemory.cpp
@@ -0,0 +1,114 @@
+/*
+Copyright (c) 2018 tevador
+
+This file is part of RandomX.
+
+RandomX is free software: you can redistribute it and/or modify
+it under the terms of the GNU General Public License as published by
+the Free Software Foundation, either version 3 of the License, or
+(at your option) any later version.
+
+RandomX is distributed in the hope that it will be useful,
+but WITHOUT ANY WARRANTY; without even the implied warranty of
+MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+GNU General Public License for more details.
+
+You should have received a copy of the GNU General Public License
+along with RandomX.  If not, see <http://www.gnu.org/licenses/>.
+*/
+
+#include "virtualMemory.hpp"
+
+#include <cstdint>
+#include <stdexcept>
+#include <string>
+
+#ifdef _WIN32
+#include <windows.h>
+#else
+#ifdef __APPLE__
+#include <mach/vm_statistics.h>
+#endif
+#include <sys/types.h>
+#include <sys/mman.h>
+#ifndef MAP_ANONYMOUS
+#define MAP_ANONYMOUS MAP_ANON
+#endif
+#endif
+
+#ifdef _WIN32
+std::string getErrorMessage(const char* function) {
+	LPSTR messageBuffer = nullptr;
+	size_t size = FormatMessageA(FORMAT_MESSAGE_ALLOCATE_BUFFER | FORMAT_MESSAGE_FROM_SYSTEM | FORMAT_MESSAGE_IGNORE_INSERTS,
+		NULL, GetLastError(), MAKELANGID(LANG_NEUTRAL, SUBLANG_DEFAULT), (LPSTR)&messageBuffer, 0, NULL);
+	std::string message(messageBuffer, size);
+	LocalFree(messageBuffer);
+	return std::string(function) + std::string(": ") + message;
+}
+
+void setPrivilege(const char* pszPrivilege, BOOL bEnable) {
+	HANDLE           hToken;
+	TOKEN_PRIVILEGES tp;
+	BOOL             status;
+	DWORD            error;
+
+	if (!OpenProcessToken(GetCurrentProcess(), TOKEN_ADJUST_PRIVILEGES | TOKEN_QUERY, &hToken))
+		throw std::runtime_error(getErrorMessage("OpenProcessToken"));
+
+	if (!LookupPrivilegeValue(NULL, pszPrivilege, &tp.Privileges[0].Luid))
+		throw std::runtime_error(getErrorMessage("LookupPrivilegeValue"));
+
+	tp.PrivilegeCount = 1;
+
+	if (bEnable)
+		tp.Privileges[0].Attributes = SE_PRIVILEGE_ENABLED;
+	else
+		tp.Privileges[0].Attributes = 0;
+
+	status = AdjustTokenPrivileges(hToken, FALSE, &tp, 0, (PTOKEN_PRIVILEGES)NULL, 0);
+
+	error = GetLastError();
+	if (!status || (error != ERROR_SUCCESS))
+		throw std::runtime_error(getErrorMessage("AdjustTokenPrivileges"));
+
+	if (!CloseHandle(hToken))
+		throw std::runtime_error(getErrorMessage("CloseHandle"));
+}
+#endif
+
+void* allocExecutableMemory(std::size_t bytes) {
+	void* mem;
+#ifdef _WIN32
+	mem = VirtualAlloc(nullptr, bytes, MEM_COMMIT, PAGE_EXECUTE_READWRITE);
+	if (mem == nullptr)
+		throw std::runtime_error(getErrorMessage("allocExecutableMemory - VirtualAlloc"));
+#else
+	mem = mmap(nullptr, bytes, PROT_READ | PROT_WRITE | PROT_EXEC, MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
+	if (mem == MAP_FAILED)
+		throw std::runtime_error("allocExecutableMemory - mmap failed");
+#endif
+	return mem;
+}
+
+constexpr std::size_t align(std::size_t pos, uint32_t align) {
+	return ((pos - 1) / align + 1) * align;
+}
+
+void* allocLargePagesMemory(std::size_t bytes) {
+	void* mem;
+#ifdef _WIN32
+	setPrivilege("SeLockMemoryPrivilege", 1);
+	mem = VirtualAlloc(NULL, align(bytes, 2 * 1024 * 1024), MEM_COMMIT | MEM_RESERVE | MEM_LARGE_PAGES, PAGE_READWRITE);
+	if (mem == nullptr)
+		throw std::runtime_error(getErrorMessage("allocLargePagesMemory - VirtualAlloc"));
+#else
+#ifdef __APPLE__
+	mem = mmap(nullptr, bytes, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, VM_FLAGS_SUPERPAGE_SIZE_2MB, 0);
+#else
+	mem = mmap(nullptr, bytes, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB | MAP_POPULATE, -1, 0);
+#endif
+	if (mem == MAP_FAILED)
+		throw std::runtime_error("allocLargePagesMemory - mmap failed");
+#endif
+	return mem;
+}
\ No newline at end of file
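
Review note on `align` in `src/virtualMemory.cpp`: the helper rounds the requested size up to a whole number of 2 MiB large pages before calling `VirtualAlloc`. The sketch below restates that rounding as a standalone function so the edge cases are easy to check. The names `roundUpToMultiple` and `kLargePageSize` are hypothetical and do not appear in the diff, and the explicit zero-size guard is an addition made here for illustration, not something the original helper contains.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (not part of the diff): round `pos` up to the next
// multiple of `alignment`. `alignment` must be non-zero.
// The zero check avoids the unsigned wrap-around of `pos - 1` when pos == 0.
constexpr std::size_t roundUpToMultiple(std::size_t pos, std::size_t alignment) {
	return pos == 0 ? 0 : ((pos - 1) / alignment + 1) * alignment;
}

// 2 MiB, the large-page size hard-coded in allocLargePagesMemory.
constexpr std::size_t kLargePageSize = 2 * 1024 * 1024;

// Compile-time checks: sizes already on a page boundary are unchanged,
// anything else moves up to the next boundary.
static_assert(roundUpToMultiple(kLargePageSize, kLargePageSize) == kLargePageSize,
	"an aligned size must not grow");
static_assert(roundUpToMultiple(kLargePageSize + 1, kLargePageSize) == 2 * kLargePageSize,
	"an unaligned size rounds up to the next page");
```

On Windows the actual minimum large-page size can be queried with `GetLargePageMinimum()` rather than assumed to be 2 MiB; whether that is worth doing here is a design choice for the author.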
diff --git a/src/virtualMemory.hpp b/src/virtualMemory.hpp
new file mode 100644
index 0000000..c80d33e
--- /dev/null
+++ b/src/virtualMemory.hpp
@@ -0,0 +1,25 @@
+/*
+Copyright (c) 2018 tevador
+
+This file is part of RandomX.
+
+RandomX is free software: you can redistribute it and/or modify
+it under the terms of the GNU General Public License as published by
+the Free Software Foundation, either version 3 of the License, or
+(at your option) any later version.
+
+RandomX is distributed in the hope that it will be useful,
+but WITHOUT ANY WARRANTY; without even the implied warranty of
+MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+GNU General Public License for more details.
+
+You should have received a copy of the GNU General Public License
+along with RandomX.  If not, see <http://www.gnu.org/licenses/>.
+*/
+
+#pragma once
+
+#include <cstddef>
+
+void* allocExecutableMemory(std::size_t);
+void* allocLargePagesMemory(std::size_t);
\ No newline at end of file