Guidelines on Benchmarking and Rust

This post covers:

- why libraries that advertise performance need reproducible benchmarks
- writing benchmarks with Criterion and reading its output
- turning Criterion's reports into visualizations fit for a wider audience
- profiling benchmarks with valgrind and kcachegrind
- making everything reproducible

Lots of libraries advertise how performant they are with phrases like “blazingly fast”, “lightning fast”, or “10x faster than y” – oftentimes written in the project’s main description. If performance is a library’s main selling point, then I expect there to be instructions for reproducible benchmarks and lucid visualizations. Nothing less. Otherwise it’s all talk and no action, especially because great benchmark frameworks exist in nearly all languages.

I find performance-touting libraries without a benchmark foundation analogous to GUI libraries without screenshots.

This post mainly focuses on creating satisfactory benchmarks in Rust, but the main points can be extrapolated to other languages.

Use Criterion

If there is one thing to take away from this post: benchmark with Criterion.

Never written a Rust benchmark? Use Criterion.

Only written benchmarks against Rust’s built-in bench harness? Switch to Criterion:

- it works on stable Rust (the built-in harness requires nightly)
- it reports whether performance changed between runs, with statistical rigor
- it generates reports and plots automatically

Get started with Criterion
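
For the uninitiated, a Criterion benchmark is a file under benches/ that supplies its own main through Criterion’s macros. Here is a minimal sketch to show the shape – the fibonacci function is just a stand-in workload, and it assumes criterion has been added as a dev-dependency with a matching [[bench]] entry (harness = false) in Cargo.toml:

// benches/bench_bits.rs (name it to match your [[bench]] entry)
use criterion::{black_box, criterion_group, criterion_main, Criterion};

// Stand-in workload -- swap in the code you actually care about measuring
fn fibonacci(n: u64) -> u64 {
    match n {
        0 | 1 => 1,
        n => fibonacci(n - 1) + fibonacci(n - 2),
    }
}

fn bench_fib(c: &mut Criterion) {
    // black_box keeps the optimizer from constant-folding the input away
    c.bench_function("fib 20", |b| b.iter(|| fibonacci(black_box(20))));
}

criterion_group!(benches, bench_fib);
criterion_main!(benches);

Running cargo bench will compile and execute it.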

When running benchmarks, the command line output will look something like:

sixtyfour_bits/bitter_byte_checked
                        time:   [1.1052 us 1.1075 us 1.1107 us]
                        thrpt:  [6.7083 GiB/s 6.7274 GiB/s 6.7416 GiB/s]
                 change:
                        time:   [-1.0757% -0.0366% +0.8695%] (p = 0.94 > 0.05)
                        thrpt:  [-0.8621% +0.0367% +1.0874%]
                        No change in performance detected.
Found 10 outliers among 100 measurements (10.00%)
  2 (2.00%) low mild
  2 (2.00%) high mild
  6 (6.00%) high severe

This output is good for contributors in pull requests or issues, but I’d better not see it in a project’s readme! Criterion automatically generates reports that are 100x better than console output.
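
As an aside, throughput (thrpt) lines like the GiB/s figures above only appear when a benchmark group is told how many bytes each iteration processes. A rough sketch of how that might look – the group and function names mirror the output above, but the byte-summing closure is a placeholder, not bitter’s actual API:

use criterion::{black_box, criterion_group, criterion_main, Criterion, Throughput};

fn bench_throughput(c: &mut Criterion) {
    let data = vec![0u8; 8 * 1024];
    let mut group = c.benchmark_group("sixtyfour_bits");
    // Tell Criterion how many bytes one iteration consumes so it can
    // report GiB/s alongside the raw timings
    group.throughput(Throughput::Bytes(data.len() as u64));
    group.bench_function("bitter_byte_checked", |b| {
        // Placeholder workload standing in for the real bit-parsing code
        b.iter(|| black_box(&data).iter().map(|&x| x as u64).sum::<u64>())
    });
    group.finish();
}

criterion_group!(benches, bench_throughput);
criterion_main!(benches);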

Criterion Reports

Below is a Criterion-generated plot from one of my projects: bitter. I’m only including one of the nearly 1800 graphics generated by Criterion; the one chosen captures the heart of a single benchmark measuring Rust bit-parsing libraries across read sizes (in bits).

[Figure: gnuplot chart “bit-reading: Comparison”, plotting average time (us) against input size (bytes) for bitter-checked, bitter-unchecked, bitterv1, bitreader, bitstream-io, and nom]

This chart shows the mean measured time for each function as the input (or the size of the input) increases.
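
A comparison like this typically comes from a parameterized benchmark group, where each library is measured at every input size. The sketch below is not bitter’s actual benchmark suite; the closure is a dummy workload included only to show the shape such a benchmark might take:

use criterion::{black_box, criterion_group, criterion_main, BenchmarkId, Criterion};

fn bench_read_sizes(c: &mut Criterion) {
    let data = vec![0u8; 4096];
    let mut group = c.benchmark_group("bit-reading");
    for &bits in &[1u64, 2, 4, 8, 16, 32] {
        group.bench_with_input(BenchmarkId::new("bitter-checked", bits), &bits, |b, &bits| {
            // Dummy workload parameterized by the read size
            b.iter(|| {
                black_box(&data)
                    .iter()
                    .map(|&x| (x as u64) & ((1 << bits) - 1))
                    .sum::<u64>()
            })
        });
        // Additional libraries would get their own BenchmarkId at the same sizes
    }
    group.finish();
}

criterion_group!(benches, bench_read_sizes);
criterion_main!(benches);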

Out of all the auto-generated graphics, I would consider this the only visualization that could be displayed to a more general audience, but I still wouldn’t use it this way. The chart lacks context, and it’s not clear what it is trying to convey. I’d even be worried about readers drawing inappropriate conclusions (pop quiz time: there is a superior library for all parameters, which one is it?).

It’s my opinion that the graphics that criterion generates are perfect for contributors of the project as there is no dearth of info. Criterion generates graphics that break down mean, median, standard deviation, MAD, etc, which are invaluable when trying to pinpoint areas of improvement.

As a comparison, here is the graphic I created using the same data:

Creating our own visualization for better understanding

It may be hard to believe that this is the same data, but the hand-rolled chart adds context that Criterion shouldn’t be expected to know. I recommend spending the time to dress reports up before presenting them to a wider audience.

Profiling and Criterion

Criterion does a great job comparing the performance of implementations, but we have to rely on profiling tools to show us why one is faster than another. We’ll be using the venerable valgrind, which doesn’t have a great cross-platform story, so I’ll be sticking to Linux for this.

# Create the benchmark executable with debugging symbols, but do not run it. We
# don't want valgrind to profile the compiler, so we have the "--no-run" flag. We
# also need debugging symbols so valgrind can track down source code
# appropriately. It blows my mind to this day that compiling with optimizations +
# debugging symbols is a thing. For so long I thought they were mutually
# exclusive.
RUSTFLAGS="-g" cargo bench  --no-run

# Now find the created benchmark executable. I tend to prefix my benchmark
# names with 'bench' to easily identify them
ls -lhtr ./target/release

# Let's say this was the executable
BENCH="./target/release/bench_bits-430e4e0f5d361f1f"

# Now identify a single test that you want profiled. Test identifiers are
# printed in the console output, so I'll use the one that I posted earlier
T_ID="sixtyfour_bits/bitter_byte_checked"

# Have valgrind profile criterion running our benchmark for 10 seconds
valgrind --tool=callgrind \
         --dump-instr=yes \
         --collect-jumps=yes \
         --simulate-cache=yes \
         $BENCH --bench --profile-time 10 $T_ID

# valgrind outputs a callgrind.out.<pid>. We can analyze this with kcachegrind
kcachegrind

We can then navigate in kcachegrind to the lines of code with the most instructions executed; execution time typically scales with the number of instructions executed.

Profiling benchmark run in KCachegrind

Don’t worry if nothing stands out. I just wanted to take a screenshot of what a profiling result looks like (with the assembly of the highlighted line shown below). The goal of profiling is to gain a better intuition for the code base. Hopefully you’ll find hidden hot spots, fix them, and then see the improvement on the next Criterion run.

While I’ve only focused on Criterion, valgrind, and kcachegrind, your needs may be better served by flame graphs and flamer.

Make everything reproducible

Creating benchmarks and reports means nothing if they are ephemeral, as no one else can reproduce what you did – yourself included, once your memory fades. Here are the commands I use to capture the raw data from a benchmark run:

# Start from a clean slate so stale measurements don't sneak in
cargo clean

# Run only the benchmark of interest, optimized for the local CPU
RUSTFLAGS="-C target-cpu=native" cargo bench -- bit-reading

# Criterion writes raw measurements to */new/raw.csv; concatenate them
# (using xsv) into a single CSV that can be committed and re-plotted later
find ./target -wholename "*/new/raw.csv" -print0 | \
  xargs -0 xsv cat rows > assets/benchmark-data.csv

General Tips

Want to compare current code against a previously released version of your crate? Cargo can pull in the old version under an alias:

[dev-dependencies.bitterv1]
package = "bitter"
version = "=0.1.0"

and reference it like:

extern crate bitterv1;
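
With both versions available under different names, a single benchmark group can put them side by side. A minimal sketch – the read_all functions are hypothetical placeholders for whichever entry points the two versions of the crate actually expose:

use criterion::{black_box, criterion_group, criterion_main, Criterion};

// Hypothetical wrappers: in a real benchmark these would call into
// `bitter` and `bitterv1` respectively
fn read_all_current(data: &[u8]) -> u64 {
    data.iter().map(|&b| b as u64).sum()
}

fn read_all_v1(data: &[u8]) -> u64 {
    data.iter().map(|&b| b as u64).sum()
}

fn bench_versions(c: &mut Criterion) {
    let data = vec![0u8; 4096];
    let mut group = c.benchmark_group("bitter-versions");
    group.bench_function("current", |b| b.iter(|| read_all_current(black_box(&data))));
    group.bench_function("v1", |b| b.iter(|| read_all_v1(black_box(&data))));
    group.finish();
}

criterion_group!(benches, bench_versions);
criterion_main!(benches);
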
The graph contains data that is too cramped to make any meaningful interpretations

We can fix that with a tasteful table

A table can help clarify the data

Now users can quickly quantify performance at all sizes (well… to the closest power of 2). Being able to see a trend with shading is a bonus here.

Conclusion

In summary:

- Use Criterion for your Rust benchmarks.
- Criterion’s console output and auto-generated reports are great for contributors; create your own context-rich visualizations for a wider audience.
- Profile with valgrind and kcachegrind to find out why one implementation beats another.
- Make everything reproducible: record the commands, flags, and data used to generate the results.

Comments

If you'd like to leave a comment, please email [email protected]

2019-11-19 - Derek Rhodes

Thanks for the introduction to these tools, it’s been a great help.