Fibonacci (vs. RISC Zero)
As of May 2024
Results
This note gives a benchmark comparison of proving execution of a iterative fibonacci program, on Valida and on RISC Zero. In this benchmark, single-core Valida proving was estimated to be about 709-1024 times more efficient than multi-core RISC Zero proving, in terms of the time and energy expended on computing the proof. Multi-core Valida proving was estimated to be about 46-55 times faster than multi-core RISC Zero proving on this example. These are not the best results that could be obtained on this problem with Valida, since much work remains to be done on optimizing the prover and the code generated by the compiler.
Iterative Fibonacci
The following tables records the means, medians, and standard deviations of the various sample groups. All measurements are denoted in seconds. The raw data for this study is available in an external Google spreadsheet.
n
Prover
Measure
Mean
Median
Standard deviation
25
Valida serial
User t.
0.0722
0.0715
0.00312
25
Valida parallel
User t.
0.3603
0.274
0.0518
25
Risc0
User t.
73.951
73.953
0.0931
25
Valida serial
Wall clock t.
0.0772
0.0775
0.0014
25
Valida parallel
Wall clock t.
0.051
0.051
0.00141
25
Risc0
Wall clock t.
2.831
2.827
0.0325
46
Valida serial
User t.
0.1042
0.1035
0.0038
46
Valida parallel
User t.
0.4476
0.449
0.0552
46
Risc0
User t.
73.933
73.938
0.1088
46
Valida serial
Wall clock t.
0.1119
0.1115
0.00137
46
Valida parallel
Wall clock t.
0.0594
0.059
0.00117
46
Risc0
Wall clock t.
2.828
2.824
0.0153
The following table displays the ratios between the mean measurements for Valida and Risc0 of the user time and the wall clock time. A number greater than 1 indicates that Valida is faster; a number less than 1 indicates that Risc0 is faster.
n
Measure
Valida condition
Valida advantage
25
User time
Serial
1024.26
25
User time
Parallel
205.25
25
Wall clock time
Serial
36.67
25
Wall clock time
Parallel
55.51
46
User time
Serial
709.53
46
User time
Parallel
165.18
46
Wall clock time
Serial
25.27
46
Wall clock time
Parallel
46.61
Analysis
Serial Valida was more efficient (user time) than Risc0 and parallel Valida, under all conditions. Parallel Valida was faster (wall clock time) than Risc0 and serial Valida, under all conditions.
The wall clock and user time advantages of parallel Valida over Risc0 declined as n increased, but remained significant in all of the problem sizes studied. The same can be said of the user time advantage of serial Valida over Risc0. The reason for this is because the increase in n increases proving time only slightly for both VMs. However, because Valida is so fast, the slight increase is of a higher percentage than the rest of the proving time. For Risc0, it takes longer to prove the program and thus a slight increase increases the proving time in a lesser proportion.
The most dramatic advantages observed were in the serial user time. This indicates that serial Valida is a particularly efficient prover, in terms of energy consumption / computing resource usage.
Limitations
This study only looked at executions of one example. Looking at a wider variety of algorithms would provide much better insight into the relative performance of these two systems. This is true because the ISAs and compilers used in the three study groups (namely Valida and Risc0) are substantially different from one another. Hypothetically, the cost of proving is roughly a function of the size of the trace, and which specific instructions are executed does not make a big difference. To the extent that hypothesis is true and accurate, then testing a single program may provide adequate insight into the cost of proving as long as traces of a wide enough variety of lengths are considered.
This study only looked at proof generation on a single machine. This does not tell how these proof systems would perform on different machines with different processors, amounts of RAM, etc. It also does not provide insight into how these proof systems might perform with different implementations on specialized vector processing hardware such as GPUs or Fabric Cryptography’s VPUs.
The Valida implementation is incomplete in that the STARK is under-constrained. It is possible that the changes required to make the STARK fully constrained will substantially affect the performance characteristics of the prover. We as Lita Foundation do not expect a dramatic change in the performance of the Valida prover to result from the changes required to make the STARK fully constrained, but this risk exists.
This study did not compare to Risc0 in a single threaded proving mode, but only to Risc0 in a multi threaded proving mode. It would be interesting to see if serial Risc0 proving is more efficient (user time) than parallel Risc0 proving, and if so by how much.
So far, minimal attention has been given to the efficiency of the code generated by the Valida compiler backend. A lot of effort will need to be made to optimize the efficiency of the code generated by the Valida compiler. However, even given this the compiled program still performs significantly better than Risc0.
This study used a small sample size and ran only the most rudimentary of statistics over the raw data. This type of study cannot be expected to give a full statistical picture of the underlying population being sampled from, especially with regard to outliers.
The method of estimating the Risc0 prover user time could end up overestimating it if the build process is not entirely single-threaded. The method, of subtracting the sum of the build wall clock times from the user time, would be about exactly correct if we assume that the builds are single-threaded and the amount of work done by cargo other than building and proving is negligible in comparison to the costs of building and proving.
This method of estimating Risc0 prover user time also does not account for the system time (i.e., time spent in system calls running in kernel mode), which could result in overestimating or underestimating the true Risc0 prover user time. There was 1–2 seconds of system time in runs of the Risc0 prover, and some of this could be in the build time and some in the prover time. There was more system time when running larger proofs, suggesting that a non-negligible amount of system time is spent in the prover.
See Section 1.5 (“Accuracy”) of the GNU time manual for limitations on the accuracy of GNU time. This manual may be accessible on your system by running info time in the shell (if you have GNU time installed on your system).
Because Valida currently only supports 32 bits integers, n cannot go higher than 47 in this study, although Risc0 supports up to 128 bits integers. The programs compared, besides being in different languages, are also notably different in that the Risc0 example uses 128-bit integers, whereas the Valida example uses 32 bit integers. This is a possible confounder which could help to explain the different results for Valida vs Risc0 in this study. However, using 32-bit integers in place of 128-bit integers in the Risc0 example did not produce significantly different results for Risc0, in the small amount of testing that Lita did on this possible confounder.
Methodology
Scope of benchmarks: proving systems & test program
These benchmarks compare the proving performance of Valida to that of Risc0 and itself. They do a three-way comparison between (a) the Valida prover running in single-threaded mode, (b) the Valida prover running in multi-threaded mode, and (c) Risc0 running in multi-threaded mode.
These benchmarks compare iterative implementations of the Fibonacci sequence.
This is the C source code for the iterative benchmark:
The values in the first line of the main function body are replaced with different values in different instances in order to test the proving performance on problems of different sizes. The Rust implementation that was compiled to run on Risc0 is available in this repository. Ideally, this study would have used the same source program to run on both Valida and Risc0, in order to form a more direct comparison. However, Risc0 supports but the Valida compiler only supports C at the moment, the most direct readily available comparison was to run two programs implementing the same algorithm, one in Rust and one in C.
The two programs have exactly the same executions, except that the C program does not write out the results but the Rust program does. This is because at the time of writing, Valida does not support writing the outputs to stdout. To ensure fairness, the time used to print two lines was estimated from the mean of 10 recorded time differences with the printing included and not, using Jolt's equivalent Rust program. The time used to print was found to be negligible.
Metrics
There are two key timing metrics this study pays attention to: user time and wall clock time. User time is equal to the amount of computer time expended by the program code. Wall clock time refers to the amount of time elapsed from the start to the finish of the computation. If, for example, a program fully utilizes 5 cores or threads for 5 seconds, that would be 25 seconds of user time and 5 seconds of wall clock time.
It’s clear that wall clock time is interesting as a metric because it’s related to how long a user needs to wait for their proof. It’s assumed that some users demand fast delivery of proofs.
User time is interesting as a proxy for the energy consumed by running a computation. In a situation where all cores are fully utilized almost all the time, the energy consumption of a CPU should remain relatively constant. In this situation, the user time spent by a CPU-bound, predominantly user-mode code execution should be roughly linearly proportional to the number of watt-hours (i.e., energy) consumed by that code execution.
Economizing on energy consumption, or the usage of computing resources, should be interesting from the point of view of any proving operations which have enough scale that their energy bill (or their cloud compute bill) for proving is a considerable operational cost. For these reasons, user time is an interesting metric. Lower costs of proving allow service providers to pass on lower proving prices to cost-sensitive customers.
Procedure
The steps for generating a proof of execution are specific to the zk-VM and were taken directly from the official documentation for each respective zk-VM. The following zk-VM versions were used:
Risc0: git commit 701b3d97bb3a387326004e866bb9d3fbd4c1649c
See the official documentation for each zk-VM for exact details, and below for a brief summary of the steps to generate a proof of execution and measure its user time and wall clock time.
Build the Valida compiler toolchain from source. The process of doing this is described in the section titled Building and running the Valida backend. This includes building clang (the C compiler), lld (the linker).
The Rust implementation that was compiled to run on Risc0 is available in this repository.
To prove the execution of the Rust Fibonacci program in Risc0, run the following:
In order to measure the run time of a program, this study used GNU time. The test system has a AMD Ryzen 9 7950X 16-Core Processor, with 32 threads, and 124.9 GB DDR5 RAM. During the tests there was no other programs running on the system other than the tests themselves. Tests were performed sequentially, one after another.
For some unknown reason, running cargo run --release on the test system on the Rust Fibonacci prover caused the host program to be recompiled every time, which is not supposed to happen. This made it a little harder to measure the execution time of the Risc0 prover. The build time listed in the output is therefore subtracted from the recorded time such that only the execution, proving, and verifying times are counted.
Tests performed & statistics ran
The tests include n = 25 and 46. Ten samples were taken for each of the following groups: Valida (single-threaded, set with RAYON_NUM_THREADS=1
), Valida (multi-threaded, set with RAYON_NUM_THREADS=32
), and Rust on Risc0. Each sample included measurements of the user time and wall clock time. The first few samples are discarded to make sure the compilation time is not included. For each group of ten samples, the means, medians, and standard deviations of the user time and wall clock time were computed. The ratio of the mean user time for Risc0 over the mean user time for Valida was computed for each of the two conditions in which Valida was run (single-threaded and multi-threaded). Same for the wall clock time.
Last updated