Technical Design: VM
The content of this section is largely copied from the Valida Working ISA Spec. This is a work in progress.
Architecture
A Valida zkVM consists of a CPU and several coprocessors, which are connected with communication buses. A basic example of a machine layout, omitting some standard chips for simplicity, would be
Communication buses are iplemented usinpermutation arguments (either grand product or multi-set checks), and may be multiplexed for efficiency when only one of a subset of buses will be used in a given cycle.
There are multiple VM configurations. The "Core" configuration is always present, and provides instructions for basic control flow and memory access. Additional configurations, such as "Field Arithmetic" or "Additional Jump" build upon the core configuration and offer additional instructions.
Instruction format
Program ROM
Memory
Memory is comprised of word-addressable cells. A given cell contains 4 field elements, each of which are typically used to store a single byte (arbitrary field elements can also be stored). All core and ALU-related instructions operate on cells (i.e. any operand address is word aligned -- a multiple of 4). In the VM compiler, the address of newly added local variables in the stack is word aligned.
For example, a U32 is represented in memory by its byte decomposition (4 elements). To initialize a U32 from an immediate value, we use the SETL16
instruction (see the complete instruction list below), which sets the first two bytes in memory. To initialize a U32 value greater than 16 bits, we can also call the SETH16
instruction to set the upper two bytes.
Immediate Values
Registers
Our zkVM does not operate on general purpose registers. Instead, instructions refer to variables local to the call frame, i.e. relative to the current frame pointer fp
.
Notation
The following notation is used throughout this document:
Operand values: opa
, opb
, opc
denote the value encoded in the operand a, b, or c of the current instruction.
CPU registers: fp
, pc
denote the value of the current frame pointer and program counter, respectively.
Relative addressing: [a]
denote the cell value at address a
offset from fp
, i.e. fp + a
. Variables local to the call frame are denoted in this form. Note that we are omitting fp
in the expression here, but that the first dereference of an operand is always relative to the frame pointer.
Absolute addressing: [[a]]
denotes the cell value at absolute address [a]
. Heap-allocated values are denoted in this form.
Instruction list
Listed below are the instructions offered in each configuration.
Core
Field arithmetic
Note that field arithmetic instructions only operate on the first element in a cell, which represents a field element instead of a single byte.
U32 Arithmetic
Bitwise
Output
Byte Manipulation
Note: These will not be supported in the initial version.
Heap allocation
Notes:
Fixed configurable stack size (e.g. 8MB), growing in opposite direction of the heap.
Allocate-only malloc (no de-allocation using
free
)
Assembly
Instructions
Calling convention / stack frame
We follow the RISC-V convention and grows the stack downwards. For a function call, the arguments are pushed onto the stack in reverse order. We only allow statically sized allocation on the stack, unlike traditional architectures where alloca
can be used to allocate dynamically. All dynamic allocation will be compiled to heap allocations. We have a special-purpose register dedicated to holding the frame pointer, but not a special-purpose register dedicated to holding the stack pointer.
Note that:
Functions arguments are stored at fp+16, fp+20, ...
Return FP offset (the value to add to FP to get to the value of FP before the call) is stored at fp+12
Return value is stored at fp+8
Return address is stored at fp+4
Local variables are stored at fp, fp-4, fp-8, ...
Pseudo instructions
Implementing MEMCPY/SET/MOVE
Memcpy will require roughly 2 cycles per word. We can follow this memcpy implementation on RISC-V.
Heap allocations
Example programs
Multiply stored immediates and return
Multiply arguments and return
The stack at the time of executing mul
inside mul
(line 11) looks like:
Trace
Main CPU
Memory
U32 Arithmetic
TODO: Replace this trace table and associated constraints with more efficient nondeterministic methods
Instruction Trace
Each instruction is encoded as 6 field elements
Core
Field arithmetic
U32 Arithmetic
Bitwise
Instruction decoding
Trace cells are also allocated for each selector. In each cycle, the opcode is decoded into selector flags, including but not limited to the following, which are grouped by type (not configuration) below for convenience. All flags are binary values, except for the instruction code.
Instruction code
Operand modifiers
Field arithmetic
Memory
Call
Jumps
U32 arithmetic
Binary operations
Design notes
Frontend target
We are writing a compiler from LLVM IR to our ISA.
ZK stack
This is a STARK-based zkVM. We are using Plonky3 to implement the polynomial IOP and PCS.
Field choice
We plan to use the 32-bit field defined by p = 2^31 - 1
, which should give very good performance on GPUs or with most vector instruction sets.
Registers
Our VM has no general purpose registers, since memory is cheap.
Memory
We will use a conventional R/W memory.
Tables
The CPU can do up to three memory operations per cycle, to support binary operations involving two reads and one write.
If we used a single-trace model, we could support this by adding columns for 6 memory operations in each row of our trace: 3 for the chronological memory log and 3 for the (address, timestamp)
ordered memory log.
Instead, we make the memory a separate table (i.e. a separate STARK which gets connected with a permutation argument). We also use multi-table support to implement other coprocessors that are wasteful to include in the main CPU, as their operations may not be used during most cycles (e.g. Keccak).
Continuations
TODO: Explain the permutation-based continuation implementation.
Lookups
Initially, we will support lookups only against prover-supplied tables. The main use case is range checks. To perform a 16-bit range check, for example, we would have the prover send a table containing [0 .. 2^16 - 1]
in order. (If the trace was not already 2^16, we would pad it. If it was longer than 2^16, the prover would include some duplicates.) We would then use constraints to enforce that this table starts at 0, ends at 2^16 - 1, and increments by 0 or 1.
Preprocessed tables can also be useful, particularly for bitwise operations like xor. However, we will not support them initially because they require non-succinct preprocessing.
Floating point arithmetic
Fast floating point arithmetic doesn't seem important for our anticipated use cases, so we do not plan to support compiling source programs which use floating point primitives.
Last updated