mirror of
https://git.wownero.com/wownero/RandomWOW.git
synced 2024-08-15 00:23:14 +00:00
Added DRAM buffer option to rx2c
parent
2ea440d0f5
commit
8f3b145fe6
2 changed files with 35 additions and 19 deletions
|
@ -18,10 +18,10 @@ The VM has access to 4 GiB of external memory in read-only mode. The DRAM memory
|
||||||
*The DRAM blob can be generated in 0.1-0.3 seconds using 8 threads with hardware-accelerated AES and dual channel DDR3 or DDR4 memory. Dual channel DDR4 memory has enough bandwidth to support up to 16 mining threads.*
|
*The DRAM blob can be generated in 0.1-0.3 seconds using 8 threads with hardware-accelerated AES and dual channel DDR3 or DDR4 memory. Dual channel DDR4 memory has enough bandwidth to support up to 16 mining threads.*
|
||||||
|
|
||||||
#### MMU
|
#### MMU
|
||||||
The memory management unit (MMU) interfaces the CPU with the DRAM blob. The purpose of the MMU is to translate the random memory accesses generated by the random program into a DRAM-friendly access pattern, where memory reads are not bound by access latency. The MMU accepts a 32-bit address `addr` and outputs a 64-bit value from DRAM. The MMU splits the 4 GiB DRAM blob into 256-byte blocks. Data within one block is always read sequentially in 32 reads (32×8 bytes). When the current block has been consumed, reading jumps to a random block. The address of the next block is calculated 8 reads before the current block is exhausted to enable efficient prefetching. The MMU uses three internal registers:
|
The memory management unit (MMU) interfaces the CPU with the DRAM blob. The purpose of the MMU is to translate the random memory accesses generated by the random program into a DRAM-friendly access pattern, where memory reads are not bound by access latency. The MMU accepts a 32-bit address `addr` and outputs a 64-bit value from DRAM. The MMU splits the 4 GiB DRAM blob into 256-byte blocks. Data within one block is always read sequentially in 32 reads (32×8 bytes). When the current block has been consumed, reading jumps to a random block. The address of the next block is calculated 16 reads before the current block is exhausted to enable efficient prefetching. The MMU uses three internal registers:
|
||||||
* **m0** - Address of the next quadword to be read from memory (32-bit, 8-byte aligned).
|
* **m0** - Address of the next quadword to be read from memory (32-bit, 8-byte aligned).
|
||||||
* **m1** - Address of the next block to be read from memory (32-bit, 256-byte aligned).
|
* **m1** - Address of the next block to be read from memory (32-bit, 256-byte aligned).
|
||||||
* **mx** - Random 32-bit counter that determines the address of the next block. After each read, the read address is mixed with the counter: `mx ^= addr`. When the 24th quadword of the current block is read (the value of the `m0` register ends with `0xC0`), the value of the `mx` register is copied into register `m1` and the last 8 bits of `m1` are cleared.
|
* **mx** - Random 32-bit counter that determines the address of the next block. After each read, the read address is mixed with the counter: `mx ^= addr`. When the 16th quadword of the current block is read (the value of the `m0` register ends with `0x80`), the value of the `mx` register is copied into register `m1` and the last 8 bits of `m1` are cleared.
|
||||||
|
|
||||||
*When the value of the `m1` register is changed, the memory location can be preloaded into CPU cache using the x86 `PREFETCH` instruction or ARM `PRFM` instruction. Implicit prefetch should ensure that sequentially accessed memory is already in the cache.*
|
*When the value of the `m1` register is changed, the memory location can be preloaded into CPU cache using the x86 `PREFETCH` instruction or ARM `PRFM` instruction. Implicit prefetch should ensure that sequentially accessed memory is already in the cache.*
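The read sequence described above can be modelled in a few lines of Python. This is only a sketch of the updated (16-read) behaviour, not the reference implementation: the `Mmu` class, the `prefetched` log and the `fake_dram` stand-in are hypothetical names introduced for illustration, and `fake_dram` simply mirrors the arithmetic of the non-`RAM` `DRAM_READ` macro in the generator. It also assumes `readDram` returns the value it read, which the visible part of the diff does not show.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF


def rol64(x, n):
    """Rotate a 64-bit value left by n bits."""
    x &= MASK64
    return ((x << n) | (x >> (64 - n))) & MASK64


def fake_dram(m0):
    """Stand-in for the DRAM blob (assumption: same formula as the non-RAM macro)."""
    return rol64(6364136223846793005 * m0 + 1442695040888963407, 32)


class Mmu:
    """Hypothetical model of the three MMU registers m0, m1 and mx."""

    def __init__(self, m0, read=fake_dram):
        self.m0 = m0 & 0xFFFFFF00   # start of a 256-byte block
        self.m1 = 0
        self.mx = 0
        self.read = read
        self.prefetched = []        # illustration only: addresses passed to PREFETCH

    def read_dram(self, addr):
        data = self.read(self.m0)
        self.m0 = (self.m0 + 8) & MASK32      # next quadword
        self.mx ^= addr & MASK32              # mix the requested address into mx
        if (self.m0 & 255) == 128:            # 16 reads left in this block
            self.m1 = self.mx & 0xFFFFFF00    # choose and align the next block
            self.prefetched.append(self.m1)
        if (self.m0 & 255) == 0:              # block exhausted: jump
            self.m0 = self.m1
        return data
```

Feeding the model 32 reads shows `m1` being fixed after the 16th read and `m0` jumping to it after the 32nd, regardless of the addresses requested in between.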
|
||||||
|
|
||||||
|
@ -169,7 +169,7 @@ A 32-bit address mask that is used to calculate the write address for the C oper
|
||||||
|147-157|ROR_64|no|64|6|A >>> B|64|
|
|147-157|ROR_64|no|64|6|A >>> B|64|
|
||||||
|
|
||||||
##### 32-bit operations
|
##### 32-bit operations
|
||||||
Instructions ADD_32, SUB_32, AND_32, OR_32, XOR_32 only use the low-order 32 bits of the input operands. The result of these operations are 32 bits long and bits 32-63 of C are zero.
|
Instructions ADD_32, SUB_32, AND_32, OR_32, XOR_32 only use the low-order 32 bits of the input operands. The result of these operations is 32 bits long and bits 32-63 of C are zero.
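As a rough illustration of the rule above, the 32-bit operations can be modelled by masking the result to 32 bits, which leaves bits 32-63 of C at zero. The function names below are hypothetical helpers, not part of the generator, and the operands are assumed to be held as unsigned 64-bit integers.

```python
MASK32 = 0xFFFFFFFF


def add_32(a, b):
    """Low 32 bits of a + b; the upper half of the 64-bit result is zero."""
    return (a + b) & MASK32


def sub_32(a, b):
    """Low 32 bits of a - b, wrapping modulo 2**32."""
    return (a - b) & MASK32


def and_32(a, b):
    return (a & b) & MASK32


def or_32(a, b):
    return (a | b) & MASK32


def xor_32(a, b):
    return (a ^ b) & MASK32
```

Masking once at the end is enough here because none of these operations lets the high-order input bits influence the low 32 bits of the result.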
|
||||||
|
|
||||||
##### Multiplication
|
##### Multiplication
|
||||||
There are 5 different multiplication operations. MUL_64 and MULH_64 both take 64-bit unsigned operands, but MUL_64 produces the low 64 bits of the result and MULH_64 produces the high 64 bits. MUL_32 and IMUL_32 use only the low-order 32 bits of the operands and produce a 64-bit result. The signed variant interprets the arguments as signed integers. IMULH_64 takes two 64-bit signed operands and produces the high-order 64 bits of the result.
|
There are 5 different multiplication operations. MUL_64 and MULH_64 both take 64-bit unsigned operands, but MUL_64 produces the low 64 bits of the result and MULH_64 produces the high 64 bits. MUL_32 and IMUL_32 use only the low-order 32 bits of the operands and produce a 64-bit result. The signed variant interprets the arguments as signed integers. IMULH_64 takes two 64-bit signed operands and produces the high-order 64 bits of the result.
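The five variants can be sketched with Python's arbitrary-precision integers. This is an illustrative model under stated assumptions: operands are unsigned 64-bit register values, signed results are stored in two's complement, and the helper `to_signed` is a hypothetical name introduced here.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF


def to_signed(x, bits):
    """Interpret the low `bits` bits of x as a two's complement integer."""
    x &= (1 << bits) - 1
    return x - (1 << bits) if x >> (bits - 1) else x


def mul_64(a, b):
    """Low 64 bits of the unsigned 64x64 product."""
    return (a * b) & MASK64


def mulh_64(a, b):
    """High 64 bits of the unsigned 64x64 product."""
    return ((a & MASK64) * (b & MASK64)) >> 64


def mul_32(a, b):
    """Unsigned 32x32 product; always fits in 64 bits."""
    return (a & MASK32) * (b & MASK32)


def imul_32(a, b):
    """Signed 32x32 product, stored as a 64-bit two's complement value."""
    return (to_signed(a, 32) * to_signed(b, 32)) & MASK64


def imulh_64(a, b):
    """High 64 bits of the signed 64x64 product (arithmetic shift)."""
    return ((to_signed(a, 64) * to_signed(b, 64)) >> 64) & MASK64
```

For example, with `a = 0xFFFFFFFFFFFFFFFF` and `b = 2`, `mulh_64` treats `a` as 2^64 - 1 and returns 1, while `imulh_64` treats it as -1 and returns the all-ones pattern.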
|
||||||
|
@ -246,7 +246,7 @@ The RET instruction behaves like "not taken" when the stack is empty. Taken RET
|
||||||
The program is initialized from a 256-bit seed value `S`.
|
The program is initialized from a 256-bit seed value `S`.
|
||||||
1. A [pcg32](http://www.pcg-random.org/) random number generator is initialized with state `S[63:0]`.
|
1. A [pcg32](http://www.pcg-random.org/) random number generator is initialized with state `S[63:0]`.
|
||||||
2. The generator is used to generate random 128 bytes `R1`.
|
2. The generator is used to generate random 128 bytes `R1`.
|
||||||
3. Integer registers `r0`-`r7` are initialized using bytes 0-63 bytes of `R1`.
|
3. Integer registers `r0`-`r7` are initialized using bytes 0-63 of `R1`.
|
||||||
4. Floating point registers `f0`-`f7` are initialized using bytes 64-127 of `R1` interpreted as 8 64-bit signed integers converted to a double precision floating point format.
|
4. Floating point registers `f0`-`f7` are initialized using bytes 64-127 of `R1` interpreted as 8 64-bit signed integers converted to a double precision floating point format.
|
||||||
5. The initial value of the `m0` register is set to `S[95:64]` and the last 8 bits are cleared (256-byte aligned).
|
5. The initial value of the `m0` register is set to `S[95:64]` and the last 8 bits are cleared (256-byte aligned).
|
||||||
6. `S` is expanded into 10 AES round keys `K0`-`K9`.
|
6. `S` is expanded into 10 AES round keys `K0`-`K9`.
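Steps 3 to 5 can be sketched as follows, taking the 128 random bytes `R1` as given rather than reproducing the pcg32 generator or the AES key expansion. The function name `init_registers` is hypothetical, the seed `S` is assumed to be held as a 256-bit Python integer, and little-endian byte order is an assumption that the text above does not state.

```python
MASK32 = 0xFFFFFFFF


def init_registers(r1, seed):
    """Derive r0-r7, f0-f7 and m0 from R1 (128 bytes) and the seed S (int)."""
    if len(r1) != 128:
        raise ValueError("R1 must be exactly 128 bytes")

    # Step 3: r0-r7 come from bytes 0-63, one unsigned quadword each.
    r = [int.from_bytes(r1[8 * i:8 * i + 8], "little") for i in range(8)]

    # Step 4: f0-f7 come from bytes 64-127, read as signed 64-bit integers
    # and then converted to double precision.
    f = [
        float(int.from_bytes(r1[64 + 8 * i:72 + 8 * i], "little", signed=True))
        for i in range(8)
    ]

    # Step 5: m0 is S[95:64] with the last 8 bits cleared (256-byte aligned).
    m0 = (seed >> 64) & MASK32 & 0xFFFFFF00

    return r, f, m0
```

Because every floating point register starts from an integer value, none of them can begin as NaN or infinity.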
|
||||||
|
|
|
@ -100,6 +100,14 @@ def getRegister(num, type):
|
||||||
return registers.get(type).format(num)
|
return registers.get(type).format(num)
|
||||||
|
|
||||||
def writeInitialValues(file):
|
def writeInitialValues(file):
|
||||||
|
file.write("#ifdef RAM\n")
|
||||||
|
file.write("\tmmu.buffer = (char*)malloc(DRAM_SIZE);\n")
|
||||||
|
file.write("\tif(!mmu.buffer) {\n")
|
||||||
|
file.write('\t\tprintf("DRAM buffer allocation failed\\n");\n')
|
||||||
|
file.write("\t\treturn 1; }\n")
|
||||||
|
file.write("\t\taesInitialize((__m128i*)aesKey, (__m128i*)aesSeed, (__m128i*)mmu.buffer, DRAM_SIZE);\n")
|
||||||
|
file.write('\t\tprintf("DRAM buffer initialized successfully\\n");\n')
|
||||||
|
file.write("#endif\n")
|
||||||
file.write("\tclock_t clockStart = clock(), clockEnd;\n")
|
file.write("\tclock_t clockStart = clock(), clockEnd;\n")
|
||||||
for i in range(8):
|
for i in range(8):
|
||||||
file.write("\tr{0} = *(uint64_t*)(aesSeed + {1});\n".format(i, i * 8))
|
file.write("\tr{0} = *(uint64_t*)(aesSeed + {1});\n".format(i, i * 8))
|
||||||
|
@ -108,7 +116,7 @@ def writeInitialValues(file):
|
||||||
file.write("\tmmu.m0 = (aesKey[9] << 8) | (aesKey[10] << 16) | (aesKey[11] << 24);\n")
|
file.write("\tmmu.m0 = (aesKey[9] << 8) | (aesKey[10] << 16) | (aesKey[11] << 24);\n")
|
||||||
file.write("\taesInitialize((__m128i*)aesKey, (__m128i*)aesSeed, (__m128i*)scratchpad, SCRATCHPAD_SIZE);\n")
|
file.write("\taesInitialize((__m128i*)aesKey, (__m128i*)aesSeed, (__m128i*)scratchpad, SCRATCHPAD_SIZE);\n")
|
||||||
file.write("\tmmu.mx = 0;\n")
|
file.write("\tmmu.mx = 0;\n")
|
||||||
file.write("\tmmu.sp = 0;\n")
|
file.write("\tsp = 0;\n")
|
||||||
file.write("\tic = {0};\n".format(INSTRUCTION_COUNT))
|
file.write("\tic = {0};\n".format(INSTRUCTION_COUNT))
|
||||||
file.write("\tmxcsr = (_mm_getcsr() | _MM_FLUSH_ZERO_ON) & ~_MM_ROUND_MASK; //flush denormals to zero, round to nearest\n")
|
file.write("\tmxcsr = (_mm_getcsr() | _MM_FLUSH_ZERO_ON) & ~_MM_ROUND_MASK; //flush denormals to zero, round to nearest\n")
|
||||||
file.write("\t_mm_setcsr(mxcsr);\n")
|
file.write("\t_mm_setcsr(mxcsr);\n")
|
||||||
|
@ -619,7 +627,7 @@ def writeMain(file):
|
||||||
file.write(('__attribute__((optimize("Os"))) int main() {\n'
|
file.write(('__attribute__((optimize("Os"))) int main() {\n'
|
||||||
" register uint64_t r0, r1, r2, r3, r4, r5, r6, r7;\n"
|
" register uint64_t r0, r1, r2, r3, r4, r5, r6, r7;\n"
|
||||||
" register double f0, f1, f2, f3, f4, f5, f6, f7;\n"
|
" register double f0, f1, f2, f3, f4, f5, f6, f7;\n"
|
||||||
" register uint64_t ic;\n"
|
" register uint64_t ic, sp;\n"
|
||||||
" convertible_t scratchpad[SCRATCHPAD_LENGTH];\n"
|
" convertible_t scratchpad[SCRATCHPAD_LENGTH];\n"
|
||||||
" stack_t stack[STACK_LENGTH];\n"
|
" stack_t stack[STACK_LENGTH];\n"
|
||||||
" mmu_t mmu;\n"
|
" mmu_t mmu;\n"
|
||||||
|
@ -637,6 +645,7 @@ def writeProlog(file):
|
||||||
"typedef uint32_t addr_t;\n"
|
"typedef uint32_t addr_t;\n"
|
||||||
"typedef unsigned __int128 uint128_t;\n"
|
"typedef unsigned __int128 uint128_t;\n"
|
||||||
"typedef __int128 int128_t;\n"
|
"typedef __int128 int128_t;\n"
|
||||||
|
"typedef unsigned char byte;\n"
|
||||||
"typedef union {\n"
|
"typedef union {\n"
|
||||||
" double f64;\n"
|
" double f64;\n"
|
||||||
" int64_t i64;\n"
|
" int64_t i64;\n"
|
||||||
|
@ -652,8 +661,11 @@ def writeProlog(file):
|
||||||
" addr_t m0;\n"
|
" addr_t m0;\n"
|
||||||
" addr_t m1;\n"
|
" addr_t m1;\n"
|
||||||
" addr_t mx;\n"
|
" addr_t mx;\n"
|
||||||
" uint32_t sp;\n"
|
"#ifdef RAM\n"
|
||||||
|
" const char* buffer;\n"
|
||||||
|
"#endif\n"
|
||||||
"} mmu_t;\n"
|
"} mmu_t;\n"
|
||||||
|
"#define DRAM_SIZE (1UL << 32)\n"
|
||||||
"#define SCRATCHPAD_SIZE (256 * 1024)\n"
|
"#define SCRATCHPAD_SIZE (256 * 1024)\n"
|
||||||
"#define SCRATCHPAD_LENGTH (SCRATCHPAD_SIZE / sizeof(convertible_t))\n"
|
"#define SCRATCHPAD_LENGTH (SCRATCHPAD_SIZE / sizeof(convertible_t))\n"
|
||||||
"#define SCRATCHPAD_MASK14 (16 * 1024 / sizeof(convertible_t) - 1)\n"
|
"#define SCRATCHPAD_MASK14 (16 * 1024 / sizeof(convertible_t) - 1)\n"
|
||||||
|
@ -661,22 +673,26 @@ def writeProlog(file):
|
||||||
"#define SCRATCHPAD_16K(x) scratchpad[(x) & SCRATCHPAD_MASK14]\n"
|
"#define SCRATCHPAD_16K(x) scratchpad[(x) & SCRATCHPAD_MASK14]\n"
|
||||||
"#define SCRATCHPAD_256K(x) scratchpad[(x) & SCRATCHPAD_MASK18]\n"
|
"#define SCRATCHPAD_256K(x) scratchpad[(x) & SCRATCHPAD_MASK18]\n"
|
||||||
"#define STACK_LENGTH (32 * 1024)\n"
|
"#define STACK_LENGTH (32 * 1024)\n"
|
||||||
"#define DRAM(x) __rolq(6364136223846793005ULL*(x)+1442695040888963407ULL,32)\n"
|
"#ifdef RAM\n"
|
||||||
"//#define PREFETCH(x) _mm_prefetch(x, _MM_HINT_T0)\n"
|
"#define DRAM_READ(mmu) (convertible_t)*(uint64_t*)((mmu)->buffer + (mmu)->m0)\n"
|
||||||
|
"#define PREFETCH(mmu) _mm_prefetch(((mmu)->buffer + (mmu)->m1), _MM_HINT_T0)\n"
|
||||||
|
"#else\n"
|
||||||
|
"#define DRAM_READ(mmu) (convertible_t)(uint64_t)__rolq(6364136223846793005ULL*((mmu)->m0)+1442695040888963407ULL,32)\n"
|
||||||
"#define PREFETCH(x)\n"
|
"#define PREFETCH(x)\n"
|
||||||
"#define PUSH_VALUE(x) stack[mmu.sp++].value = x\n"
|
"#endif\n"
|
||||||
"#define PUSH_ADDRESS(x) stack[mmu.sp++].address = x\n"
|
"#define PUSH_VALUE(x) stack[sp++].value = x\n"
|
||||||
"#define STACK_IS_EMPTY() (mmu.sp == 0)\n"
|
"#define PUSH_ADDRESS(x) stack[sp++].address = x\n"
|
||||||
"#define POP_VALUE() stack[--mmu.sp].value\n"
|
"#define STACK_IS_EMPTY() (sp == 0)\n"
|
||||||
"#define POP_ADDRESS() stack[--mmu.sp].address\n"
|
"#define POP_VALUE() stack[--sp].value\n"
|
||||||
|
"#define POP_ADDRESS() stack[--sp].address\n"
|
||||||
"static convertible_t readDram(mmu_t* mmu, addr_t addr) {\n"
|
"static convertible_t readDram(mmu_t* mmu, addr_t addr) {\n"
|
||||||
" convertible_t data;\n"
|
" convertible_t data;\n"
|
||||||
" data.u64 = DRAM(mmu->m0); //TODO\n"
|
" data = DRAM_READ(mmu);\n"
|
||||||
" mmu->m0 += 8;\n"
|
" mmu->m0 += 8;\n"
|
||||||
" mmu->mx ^= addr;\n"
|
" mmu->mx ^= addr;\n"
|
||||||
" if((mmu->m0 & 255) == 192) {\n"
|
" if((mmu->m0 & 255) == 128) {\n"
|
||||||
" mmu->m1 = mmu->mx & 0xFFFFFF00;\n"
|
" mmu->m1 = mmu->mx & 0xFFFFFF00;\n"
|
||||||
" PREFETCH(mmu->m1); //TODO\n"
|
" PREFETCH(mmu);\n"
|
||||||
" }\n"
|
" }\n"
|
||||||
" if((mmu->m0 & 255) == 0)\n"
|
" if((mmu->m0 & 255) == 0)\n"
|
||||||
" mmu->m0 = mmu->m1;\n"
|
" mmu->m0 = mmu->m1;\n"
|
||||||
|
@ -772,8 +788,8 @@ def writeProlog(file):
|
||||||
|
|
||||||
with sys.stdout as file:
|
with sys.stdout as file:
|
||||||
writeProlog(file)
|
writeProlog(file)
|
||||||
file.write("const unsigned char aesKey[32] = {{ {0} }};\n".format(genBytes(32)))
|
file.write("const byte aesKey[32] = {{ {0} }};\n".format(genBytes(32)))
|
||||||
file.write("const unsigned char aesSeed[128] = {{ {0} }};\n".format(genBytes(128)))
|
file.write("const byte aesSeed[128] = {{ {0} }};\n".format(genBytes(128)))
|
||||||
writeMain(file)
|
writeMain(file)
|
||||||
writeInitialValues(file)
|
writeInitialValues(file)
|
||||||
for i in range(PROGRAM_SIZE):
|
for i in range(PROGRAM_SIZE):
|
||||||
|
|