Fibonacci (vs. Jolt)

As of May 2024

Results

The raw data for this study is available in an external Google spreadsheet.

Iterative Fibonacci

The following tables record the means, medians, and standard deviations of the various sample groups. All measurements are in seconds.

The following table displays the ratios of Jolt's mean measurements to Valida's, for both user time and wall clock time. A ratio greater than 1 indicates that Valida was faster; a ratio less than 1 indicates that Jolt was faster.

Analysis

Serial Valida was more efficient (user time) than Jolt and parallel Valida, under all conditions. Parallel Valida was faster (wall clock time) than Jolt and serial Valida, under all conditions.

The wall clock and user time advantages of parallel Valida over Jolt declined as n increased, but remained significant at all of the problem sizes studied. The same can be said of the user time advantage of serial Valida over Jolt. The reason is that increasing n increases proving time only slightly for both VMs, but because Valida's total proving time is so small, the slight increase represents a larger percentage of it. Jolt takes longer to prove the program, so the same slight increase grows its proving time by a smaller proportion. For example, an extra 0.1 seconds is 10% of a 1-second proof but only 1% of a 10-second proof.

The most dramatic advantages observed were in the serial user time. This indicates that serial Valida is a particularly efficient prover, in terms of energy consumption / computing resource usage.

Limitations

This study only looked at executions of one example program. Looking at a wider variety of algorithms would provide much better insight into the relative performance of these two systems, because the ISAs and compilers used by Valida and Jolt are substantially different from one another. Hypothetically, the cost of proving is roughly a function of the size of the trace, and exactly which instructions are executed does not make a big difference. To the extent that this hypothesis is accurate, testing a single program may provide adequate insight into the cost of proving, as long as traces of a wide enough variety of lengths are considered.

This study only looked at proof generation on a single machine. It does not show how these proof systems would perform on machines with different processors, amounts of RAM, and so on. It also does not provide insight into how these proof systems might perform with different implementations on specialized vector processing hardware such as GPUs or Fabric Cryptography's VPUs.

The Valida implementation is incomplete in that the STARK is under-constrained. It is possible that the changes required to make the STARK fully constrained will substantially affect the performance characteristics of the prover. We at the Lita Foundation do not expect these changes to dramatically change the performance of the Valida prover, but the risk exists.

This study did not compare to Jolt in a single-threaded proving mode, but only to Jolt in a multi-threaded proving mode. It would be interesting to see whether serial Jolt proving is more efficient (user time) than parallel Jolt proving, and if so, by how much.

So far, minimal attention has been given to the efficiency of the code generated by the Valida compiler backend, and considerable effort will be needed to optimize it. Even so, Valida already proves the compiled program significantly faster than Jolt in this study.

This study used a small sample size and ran only the most rudimentary of statistics over the raw data. This type of study cannot be expected to give a full statistical picture of the underlying population being sampled from, especially with regard to outliers.

The method of estimating the Jolt prover user time, subtracting the sum of the build wall clock times from the total user time, could end up overestimating it if the build process is not entirely single-threaded. It would be exactly correct if the builds were single-threaded and the amount of work done by cargo other than building and proving were negligible in comparison to the costs of building and proving.

This method of estimating Jolt prover user time also does not account for system time (i.e., time spent in system calls running in kernel mode), which could result in overestimating or underestimating the true Jolt prover user time. There were 1–2 seconds of system time in runs of the Jolt prover, some of which could fall in the build time and some in the prover time. There was more system time when running larger proofs, suggesting that a non-negligible amount of system time is spent in the prover.

See Section 1.5 ("Accuracy") of the GNU time manual for limitations on the accuracy of GNU time. If GNU time is installed on your system, this manual may be accessible by running info time in the shell.

Because Valida currently only supports 32-bit integers, n cannot go higher than 47 in this study, although Jolt supports integers up to 128 bits. The programs compared, besides being in different languages, also differ notably in that the Jolt example uses 128-bit integers, whereas the Valida example uses 32-bit integers. This is a possible confounder which could help to explain the different results for Valida vs. Jolt in this study. However, in the small amount of testing that Lita did on this possible confounder, using 32-bit integers in place of 128-bit integers in the Jolt example did not produce significantly different results for Jolt.
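
The bound of 47 follows from the size of the Fibonacci numbers themselves: fib(47) = 2,971,215,073 fits in an unsigned 32-bit integer (maximum 4,294,967,295), while fib(48) = 4,807,526,976 does not. A quick check in bash, whose 64-bit arithmetic does not overflow here:

a=0; b=1
for ((i = 2; i <= 48; i++)); do
  sum=$((a + b)); a=$b; b=$sum    # after iteration i: a = fib(i-1), b = fib(i)
done
echo "fib(47) = $a"               # 2971215073: fits in 32 unsigned bits
echo "fib(48) = $b"               # 4807526976: does not
echo "2^32 - 1 = $(( (1 << 32) - 1 ))"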


Methodology

Scope of benchmarks: proving systems & test program

These benchmarks compare the proving performance of Valida to that of Jolt and itself. They do a three-way comparison between (a) the Valida prover running in single-threaded mode, (b) the Valida prover running in multi-threaded mode, and (c) Jolt running in multi-threaded mode.

These benchmarks compare iterative implementations of the Fibonacci sequence.

This is the C source code for the iterative benchmark:

int main () {
  int n = 25;
  int a = 0;
  int b = 1;
  int sum;
  unsigned i;
  for (i = 1; i < n; i++) {
    sum = a + b;
    a = b;
    b = sum;
  }
  return 0;
}

The values in the first line of the main function body are replaced with different values in different instances in order to test the proving performance on problems of different sizes. This is the Rust source code for the same example:

In guest/src/lib.rs:

#![cfg_attr(feature = "guest", no_std)]
#![no_main]

#[jolt::provable]
fn fib(n: u32) -> u128 {
    let mut a: u128 = 0;
    let mut b: u128 = 1;
    let mut sum: u128;
    for _ in 1..n {
        sum = a + b;
        a = b;
        b = sum;
    }

    b
}

In src/main.rs:

pub fn main() {
    let (prove_fib, verify_fib) = guest::build_fib();

    let (output, proof) = prove_fib(25); // change to 46 for n = 46
    let is_valid = verify_fib(proof);

    // println!("output: {}", output);
    // println!("valid: {}", is_valid);
}

This Rust code is specialized for compilation to run as a Jolt program; it is a slight modification of Jolt's quick start program. Ideally, this study would have used the same source program on both Valida and Jolt, in order to form a more direct comparison. However, although Jolt supports C/C++, it was not clear to us how to run a C/C++ program on it. Since the Valida compiler does not yet support Rust, the most direct readily available comparison was to run two programs implementing the same algorithm, one in Rust and one in C.

The two programs perform the same computation. At the time of writing, Valida does not support writing output to stdout, so for fairness the two println! statements at the end of Jolt's program, which write to stdout, were commented out.

Metrics

There are two key timing metrics this study pays attention to: user time and wall clock time. User time is the amount of CPU time spent executing the program's code, summed across all cores and threads. Wall clock time is the amount of real time that elapses from the start to the finish of the computation. If, for example, a program fully utilizes 5 cores for 5 seconds, that is 25 seconds of user time and 5 seconds of wall clock time.
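
Both metrics can be read directly from GNU time, the measurement tool used in this study (see Procedure below). A minimal illustration, with the verbose output abridged to the relevant fields:

$ /usr/bin/time -v sleep 1
        User time (seconds): 0.00
        System time (seconds): 0.00
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.00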

It’s clear that wall clock time is interesting as a metric because it’s related to how long a user needs to wait for their proof. It’s assumed that some users demand fast delivery of proofs.

User time is interesting as a proxy for the energy consumed by running a computation. In a situation where all cores are fully utilized almost all the time, the energy consumption of a CPU should remain relatively constant. In this situation, the user time spent by a CPU-bound, predominantly user-mode code execution should be roughly linearly proportional to the number of watt-hours (i.e., energy) consumed by that code execution.

Economizing on energy consumption, or the usage of computing resources, should be interesting from the point of view of any proving operations which have enough scale that their energy bill (or their cloud compute bill) for proving is a considerable operational cost. For these reasons, user time is an interesting metric. Lower costs of proving allow service providers to pass on lower proving prices to cost-sensitive customers.

Procedure

The steps for generating a proof of execution are specific to the zk-VM and were taken directly from the official documentation for each respective zk-VM. The following zk-VM versions were used:

See the official documentation for each zk-VM for exact details, and below for a brief summary of the steps to generate a proof of execution and measure its user time and wall clock time.

Build the Valida compiler toolchain from source. The process is described in the section titled Building and running the Valida backend, and includes building clang (the C compiler) and lld (the linker).

#!/usr/bin/env bash
# excluded from the benchmarking time
./llvm-valida/build/bin/clang -c -target delendum ./llvm-valida/DelendumEntryPoint.c -o ./llvm-valida/build/DelendumEntryPoint.o
./llvm-valida/build/bin/clang -c -target delendum ./llvm-test-suite/SingleSource/UnitTests/fib_benchmark.c -o ./buildValidaTests/fib_benchmark.o
./llvm-valida/build/bin/ld.lld --script=./llvm-valida/valida.ld -o ./buildValidaTests/fib_benchmark.out ./buildValidaTests/fib_benchmark.o

# included in the benchmarking time; run as a bash script
RAYON_NUM_THREADS=32 ./valida/target/release/valida run ./buildValidaTests/fib_benchmark.out ./buildValidaTests/fib_benchmark.log
RAYON_NUM_THREADS=32 ./valida/target/release/valida prove ./buildValidaTests/fib_benchmark.out ./buildValidaTests/fib_benchmark.proof
RAYON_NUM_THREADS=32 ./valida/target/release/valida verify ./buildValidaTests/fib_benchmark.out ./buildValidaTests/fib_benchmark.proof
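
The exact timing invocation is not listed here; a plausible sketch, assuming the three commands above are saved as prove_fib.sh (a hypothetical file name), with RAYON_NUM_THREADS inside it set to 1 or 32 depending on the condition:

# hypothetical wrapper: GNU time reports the user time and wall clock time
# of the whole run / prove / verify sequence
/usr/bin/time -v bash ./prove_fib.sh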

To prove the execution of the Rust Fibonacci program in Jolt, follow the quick start instructions, then run the prover using cargo run --release.
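
As with Valida, the precise timed invocation is a sketch under the assumption that GNU time wraps cargo from the Jolt project directory:

# hypothetical invocation; the build time reported in cargo's output is
# subtracted afterwards (see below)
/usr/bin/time -v cargo run --release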

In order to measure the run time of a program, this study used GNU time. The test system has an AMD Ryzen 9 7950X 16-core processor with 32 threads and 124.9 GB of DDR5 RAM. During the tests, no programs other than the tests themselves were running on the system. Tests were performed sequentially, one after another.

For some unknown reason, running cargo run --release on the Rust Fibonacci prover caused the host program to be recompiled every time on the test system, which is not supposed to happen. This made it a little harder to measure the execution time of the Jolt prover. The build time listed in cargo's output was therefore subtracted from the recorded time, so that only the execution, proving, and verifying times are counted.
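
A sketch of that correction, under the assumptions stated in the Limitations section (single-threaded builds, negligible cargo overhead besides building and proving); the numbers below are placeholders, not measurements:

total_user=100.0   # placeholder: "User time (seconds)" reported by GNU time for the whole run
build_wall=20.0    # placeholder: build time reported in cargo's output
echo "estimated Jolt prover user time: $(echo "$total_user - $build_wall" | bc) s"   # 80.0 s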

Tests performed & statistics run

The tests include n = 25 and n = 46. Ten samples were taken for each of the following groups: Valida (single-threaded, set with RAYON_NUM_THREADS=1), Valida (multi-threaded, set with RAYON_NUM_THREADS=32), and Rust on Jolt. The first few samples were discarded to make sure compilation time was not included. Each sample included measurements of the user time and wall clock time. For each group of ten samples, the means, medians, and standard deviations of the user time and wall clock time were computed. The ratio of the mean user time for Jolt over the mean user time for Valida was computed for each of the two conditions in which Valida was run (single-threaded and multi-threaded), and likewise for the wall clock time.
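
A sketch of those statistics in bash/awk, assuming the ten measurements for each group are stored one value per line in plain text files (the file names here are hypothetical):

# mean, median, and sample standard deviation of one file of numbers
stats() {
  sort -n "$1" | awk '
    { x[NR] = $1; s += $1; ss += $1 * $1 }
    END {
      mean = s / NR
      med = (NR % 2) ? x[(NR + 1) / 2] : (x[NR / 2] + x[NR / 2 + 1]) / 2
      sd = sqrt((ss - NR * mean * mean) / (NR - 1))
      printf "mean=%.3f median=%.3f sd=%.3f\n", mean, med, sd
    }'
}
stats jolt_user.txt
stats valida_serial_user.txt

# ratio of mean user times (Jolt / Valida); a value > 1 means Valida was faster
awk 'NR == FNR { s1 += $1; n1++; next } { s2 += $1; n2++ }
     END { printf "ratio = %.2f\n", (s1 / n1) / (s2 / n2) }' jolt_user.txt valida_serial_user.txt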
