Fibonacci (Rust vs. C)

Other benchmarks in these docs compare the performance of two different zk-VMs. Different from those, this one compares the performance of an iterative algorithm for computing Fibonacci numbers, implemented in Rust and in C, both running on Valida.

Results

The following tables records the means, medians, and standard deviations of the various sample groups. All measurements are denoted in seconds.

Prover

Language

Measure

Mean

Median

Standard Deviation

Valida serial

User t.

0.035

0.002

Valida serial

Rust

User t.

0.046

0.047

0.002

Valida serial

Wall clock t.

0.039

0.0002

Valida serial

Rust

Wall clock t.

0.05

0.051

0.0003

Valida parallel

User t.

0.171

0.173

0.01

Valida parallel

Rust

User t.

0.22

0.009

Valida parallel

Wall clock t.

0.023

0.001

Valida parallel

Rust

Wall clock t.

0.065

0.0003

Valida serial

User t.

0.035

0.034

0.002

Valida serial

Rust

User t.

0.062

0.002

Valida serial

Wall clock t.

0.039

0.0002

Valida serial

Rust

Wall clock t.

0.065

0.0003

Valida parallel

User t.

0.169

0.171

0.012

Valida parallel

Rust

User t.

0.246

0.025

0.012

Valida parallel

Wall clock t.

0.023

0.001

Valida parallel

Rust

Wall clock t.

0.029

0.001

The following table displays the ratios between the mean measurements for Rust and C of the user time and wall clock time. A number greater than 1 indicates that C is faster. The closer a number is to 1, the closer the performance of Rust on Valida approaches the performance of C on Valida, for that measurement.

Measure

Valida prover

C advantage

User time

Serial

1.3

Wall clock time

Serial

1.3

User time

Parallel

1.3

Wall clock time

Parallel

2.8

User time

Serial

1.8

Wall clock time

Serial

1.7

User time

Parallel

1.4

Wall clock time

Parallel

1.3

Analysis

This benchmark looks at two implementations of the same algorithm, one in C and one in Rust. Both implementations are compiled using the LLVM-Valida code generation backend, and the maximum optimization level. This benchmark runs both compiled implementations on Valida. It measures the proving time (both user time and wall clock time) for the given inputs. This includes the time taken to execute the programs in the VM.

It is Lita's understanding that the complexity of proving a program execution is mainly related to the number of instructions executed. In this case, the Rust example compiles to a program which has to execute more instructions to solve the same problems. This is generally to be expected. Compiled C programs tend to run faster, in fewer clock cycles, than compiled Rust programs executing the same algorithms. This is a phenomenon that is observed across a wide variety of benchmarks, most of which are unrelated to Valida or zk-VM proving.

In general, the performance of compiled Rust code can be expected to at best approach the performance of compiled C code executing the same algorithm. In some cases, the performance of Rust approaches the performance of C, while in other cases, C may outperform by a wider margin. It can be expected that only rarely will a Rust implementation of an algorithm outperform an efficient C implementation of the same algorithm. The results of this benchmark are in line with this general expectation. They exhibit a C advantage factor of 1.3–2.8.

Limitations

This study only looked at executions of one example algorithm on two different input values. Looking at a wider variety of algorithms would provide much better insight into the relative performance that can be expected of these two systems.

This study only looked at proof generation on a single machine. This does not tell how these proof systems would perform on different machines with different processors, amounts of RAM, etc.

The Valida implementation is incomplete in that the STARK is under-constrained. It is possible that the changes required to make the STARK fully constrained will substantially affect the performance characteristics of the prover. We as Lita Foundation do not expect a dramatic change in the performance of the Valida prover to result from the changes required to make the STARK fully constrained, but this risk exists.

So far, minimal attention has been given to the efficiency of the code generated by the Valida compiler backend. A lot of effort will need to be made to optimize the efficiency of the code generated by the Valida compiler.

This study used a small sample size and ran only the most rudimentary of statistics over the raw data. This type of study cannot be expected to give a full statistical picture of the underlying population being sampled from, especially with regard to outliers.

See Section 1.5 (“Accuracy”) of the GNU time manual for limitations on the accuracy of GNU time. This manual may be accessible on your system by running info time in the shell (if you have GNU time installed on your system).

Methodology

Programs being measured

These benchmarks compare the proving performance of Valida on two example programs, one in C and one in Rust, both implementing the same algorithm. This makes for a four-way comparison between the two programs, each running on the Valida prover in single-threaded mode, and the Valida prover in multi-threaded mode. These benchmarks compare iterative implementations of the Fibonacci sequence.

This is the C program used in this benchmark:

int main () {
  int n = 25;
  int a = 0;
  int b = 1;
  int sum;
  unsigned i;
  for (i = 1; i < n; i++) {
    sum = a + b;
    a = b;
    b = sum;
    };
  __builtin_delendum_write(a);
  __builtin_delendum_write(b);
  return 0;
}

The values in the first line of the main function body are replaced with different values in different instances in order to test the proving performance on problems of different sizes.

This is the Rust program used in this benchmark:

#![no_std]
#![feature(start)]

use core::panic::PanicInfo;

#[panic_handler]
fn panic(_info: &PanicInfo) -> ! {
    loop {}
}

#[start]
fn main(_argc: isize, _argv: *const *const u8) -> isize {

    extern { fn read_stdin() -> u8;}
    extern { fn write_stdout(n: u8);}
    fn transform_u32_to_array_of_u8(x:u32) -> [u8;4] {
        let b1 : u8 = ((x >> 24) & 0xff) as u8;
        let b2 : u8 = ((x >> 16) & 0xff) as u8;
        let b3 : u8 = ((x >> 8) & 0xff) as u8;
        let b4 : u8 = (x & 0xff) as u8;
        return [b4, b3, b2, b1]
    }
    unsafe {
    let n = read_stdin();
    let mut a: u32 = 0;
    let mut b: u32 = 1;
    let mut sum: u32;
    for _ in 1..n {
        sum = a + b;
        a = b;
        b = sum;
        }
    let b_as_bytes: [u8; 4] = transform_u32_to_array_of_u8(b);
    for i in 0..4 {
        write_stdout(b_as_bytes[i]);
        }
    }
    return 0;
}

These two programs are algorithmically different in that the Rust program reads n from the first byte in the input tape, whereas the C program has n hard-coded. The reason that the Rust program needs to read n from the input tape is that if n is known at compile time, the Rust compiler will optimize away the computation of the nth Fibonacci number. This difference is assumed not to make much of a difference to the results.

Metrics

There are two key timing metrics this study pays attention to: user time and wall clock time. User time is equal to the amount of computer time expended by the program code. Wall clock time refers to the amount of time elapsed from the start to the finish of the computation. If, for example, a program fully utilizes 5 cores or threads for 5 seconds, that would be 25 seconds of user time and 5 seconds of wall clock time.

It’s clear that wall clock time is interesting as a metric because it’s related to how long a user needs to wait for their proof. It’s assumed that some users demand fast delivery of proofs.

User time is interesting as a proxy for the energy consumed by running a computation. In a situation where all cores are fully utilized almost all the time, the energy consumption of a CPU should remain relatively constant. In this situation, the user time spent by a CPU-bound, predominantly user-mode code execution should be roughly linearly proportional to the number of watt-hours (i.e., energy) consumed by that code execution.

Economizing on energy consumption, or the usage of computing resources, should be interesting from the point of view of any proving operations which have enough scale that their energy bill (or their cloud compute bill) for proving is a considerable operational cost. For these reasons, user time is an interesting metric. Lower costs of proving allow service providers to pass on lower proving prices to cost-sensitive customers.

Procedure

The steps for compiling C and Rust programs and generating proofs of execution are documented in Getting Started with LLVM-Valida. To run Valida in single-core mode, this study set the environment variable RAYON_NUM_THREADS=1 . To run Valida in multi-core mode, this study set the environment variable RAYON_NUM_THREADS=48.

In order to measure the run time of a program, this study used GNU time. The test system was a z1d.metal AWS instance with 48 vCPUs and 384GB of RAM. During the tests there were no other programs running on the system other than the tests themselves, and the AWS instance was bare-metal to avoid interference from other VMs running on the same hypervisor. Tests were performed sequentially, one after another.

Tests performed & statistics ran

The tests include n = 25 and 46. Thirty samples were taken for each of the following groups: C on Valida (single-threaded), Rust on Valida (single-threaded), C on Valida (multi-threaded), and Rust on Valida (multi-threaded). Each sample included measurements of the user time and wall clock time. For each group of samples, the means, medians, and standard deviations of the user time and wall clock time were computed. The ratio of the mean user time for Rust over the mean user time for C was computed for each of the two conditions in which Valida was run (single-threaded and multi-threaded). Same for the wall clock time.

PreviousFibonacci (vs. Jolt)NextSHA-256 (vs. RISC Zero)

Last updated 11 months ago