Welcome to NBSoftSolutions, home of the software development company and writings of its main developer: Nick Babcock. If you would like to contact NBSoftSolutions, please see the Contact section of the about page.

Parsing Performance Improvement with Tapes and Spatial Locality

There’s a format that I need to parse in Rust. It’s analogous to JSON but with a few twists:

  {
    "core": "core1",
    "nums": [1, 2, 3, 4, 5],
    "core": "core2"
  }

  • The core field appears multiple times and not sequentially
  • The documents can be largish (100MB)
  • The document should be able to be deserialized by serde into something like:

struct MyDocument {
  core: Vec<String>,
  nums: Vec<u8>,
}

Unfortunately we have to choose one or the other:

  • Buffer the entire document into memory so that we can group fields logically for serde, as serde doesn’t support aggregating multiple occurrences of a field
  • Not support serde deserialization but parse the document iteratively

Since I consider serde deserialization a must-have feature, I resigned myself to buffering the entire file in memory. To the end user, nothing is as ergonomic and time-saving as serde deserialization. My initial designs had a nice iterative approach, combining the best things about quick-xml, rust-csv, and pulldown-cmark. I was even going to write a post about designing a nice pull-based parser in Rust, showing it can be a great foundation for higher-level parsers, but stopped when the performance fell short.

Some serde deserializers like quick-xml / serde-xml-rs hack around the serde limitation by only supporting deserializing multiple fields that appear consecutively. This allows them to still iteratively read the file, though it remains an open issue to support non-consecutive fields.

Representing a JSON Document

Since we have the whole document in memory, we need a format that we can write a deserializer for. Here’s what a simplified JSON document can look like:

enum JsonValue<'a> {
  Text(&'a [u8]),
  Array(Vec<JsonValue<'a>>),
  Object(Vec<(&'a [u8], JsonValue<'a>)>),
}

This structure is hierarchical. It’s intuitive. You can see how it easily maps to JSON.

The issue is the structure is a horrible performance trap. When deserializing documents with many objects and arrays, the memory access and reallocation of the Array and Object vectors is significant. Profiling with valgrind showed that the code spent 35% of the time inside jemalloc’s realloc. The CPU has to jump all over RAM when constructing a JsonValue as these vectors could live anywhere on the heap. From Latency Numbers Every Programmer Should Know, accessing main memory takes 100ns, which is 200x over something in L1 cache.

So that’s the performance issue, but before we solve that we’ll need an algorithm for deserializing out of order fields.

Serde Deserializing out of order Fields

Below are some guidelines for those working with data formats that allow duplicate keys in an object where the keys are not consecutive / siblings:

  • Buffer the entire file in memory
  • For each object deserializing
    • Start at the first key in the object
    • Scan the rest of the keys to see if there are any others that are the same
    • Add all the indices of which these keys occurred into a bag
    • Send the bag for deserialization
      • Most of the time we only need to look at the first element in the bag
      • But if a sequence is desired, we use all the indices in the bag in a sequence deserializer
    • Mark all the keys from the bag as seen so they aren’t processed again
    • Empty the bag

This method requires two additional vectors per object deserialization: one to keep track of the indices of values that belong to a given field key, and another to keep track of which keys have already been deserialized. The alternative would be to create an intermediate “group by” structure, but this was not needed, as the above method never showed up in profiling results despite how inefficient it may seem (it’s O(n^2), where n is the number of fields in an object, as deserializing the k’th field requires (n - k) comparisons).
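The steps above can be sketched roughly as follows. This is a minimal illustration rather than the parser’s actual code: `group_fields` is a hypothetical helper, keys are plain `&str` instead of byte slices, and where a real deserializer would hand the bag to serde, the sketch simply collects it.

```rust
/// Groups the positions of duplicate keys within one object, preserving
/// the order in which each key first appears.
fn group_fields<'a>(keys: &[&'a str]) -> Vec<(&'a str, Vec<usize>)> {
    let mut seen = vec![false; keys.len()]; // keys already handed off
    let mut bag: Vec<usize> = Vec::new(); // indices sharing the current key
    let mut groups = Vec::new();

    for i in 0..keys.len() {
        if seen[i] {
            continue;
        }
        bag.clear();
        bag.push(i);
        seen[i] = true;

        // Scan the remaining keys for repeats of keys[i]: the O(n^2) part.
        for j in (i + 1)..keys.len() {
            if !seen[j] && keys[j] == keys[i] {
                bag.push(j);
                seen[j] = true;
            }
        }

        // A real deserializer would pass the bag to serde at this point.
        groups.push((keys[i], bag.clone()));
    }
    groups
}

fn main() {
    let keys = ["core", "nums", "core"];
    for (key, indices) in group_fields(&keys) {
        println!("{key}: {indices:?}");
    }
}
```

Both `seen` and `bag` are allocated once per object and reused across keys, which matches the two extra vectors described above.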

Parsing to a Tape

A tape is a flattened document hierarchy that fits in a single vector of tokens that aid future traversals. This is not a novel technique, but I’ve seen so little literature on the subject that I figured I’d add my voice. There is a nice description of how the tape works in Daniel Lemire’s simdjson. To be concrete, this is what a tape looks like in Rust:

enum JsonValue {
  Text { data_idx: usize },
  Number(f64), // payload type assumed; the tape below stores numbers inline
  Array { tape_end_idx: usize },
  Object { tape_end_idx: usize },
  End { tape_start_idx: usize }
}

struct Tapes<'a> {
  tokens: Vec<JsonValue>,
  data: Vec<&'a [u8]>,
}

So the earlier example of

  {
    "core": "core1",
    "nums": [1, 2, 3, 4, 5],
    "core": "core2"
  }

Would turn into a tape like so: (numbers are the tape / data index)


Token tape:

0 : Text { data_idx: 0 }
1 : Text { data_idx: 1 }
2 : Text { data_idx: 2 }
3 : Array { tape_end_idx: 9 }
4 : Number(1)
5 : Number(2)
6 : Number(3)
7 : Number(4)
8 : Number(5)
9 : End { tape_start_idx: 3 }
10: Text { data_idx: 3 }
11: Text { data_idx: 4 }

Data tape:

0: core
1: core1
2: nums
3: core
4: core2


  • One can skip a large child value by looking at the tape_end_idx and jumping to that section of the tape to continue processing the next key in an object
  • Delimiters like colons and semi-colons are not included in the tape as they provide no structure
  • The parser that writes to the tape must ensure that the data format is valid, as the consumer of the tape assumes certain invariants that would normally have been captured by a hierarchical structure. For instance, the tape technically allows for a number key.
  • The tape is not for end user consumption. It is too hard to work with. Instead it’s an intermediate format for more ergonomic APIs.
  • The tape design allows for future improvement of storing only unique data segments (see how index 0 and 3 on the data tape are equivalent) – this could be useful for interning strings in documents with many repeated fields / values.
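To make the skipping behaviour concrete, here is a small sketch that walks the example tape by hand. It is an illustration under assumptions, not the real parser: the `Token` enum is trimmed (no `Object` variant), the data tape holds `&str` instead of `&[u8]`, and `example_tape`, `next_value`, and `top_level_fields` are hypothetical names.

```rust
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq)]
enum Token {
    Text { data_idx: usize },
    Number(f64),
    Array { tape_end_idx: usize },
    End { tape_start_idx: usize },
}

/// The example document written out by hand as a token tape and a data tape.
fn example_tape() -> (Vec<Token>, Vec<&'static str>) {
    let tokens = vec![
        Token::Text { data_idx: 0 },      // 0: key "core"
        Token::Text { data_idx: 1 },      // 1: value "core1"
        Token::Text { data_idx: 2 },      // 2: key "nums"
        Token::Array { tape_end_idx: 9 }, // 3: array start
        Token::Number(1.0),               // 4
        Token::Number(2.0),               // 5
        Token::Number(3.0),               // 6
        Token::Number(4.0),               // 7
        Token::Number(5.0),               // 8
        Token::End { tape_start_idx: 3 }, // 9: array end
        Token::Text { data_idx: 3 },      // 10: key "core"
        Token::Text { data_idx: 4 },      // 11: value "core2"
    ];
    let data = vec!["core", "core1", "nums", "core", "core2"];
    (tokens, data)
}

/// Returns the tape index of the token that follows the value at `idx`.
fn next_value(tokens: &[Token], idx: usize) -> usize {
    match tokens[idx] {
        // Containers record where they end, so skipping one is a single jump.
        Token::Array { tape_end_idx } => tape_end_idx + 1,
        _ => idx + 1,
    }
}

/// Walks the top-level key/value pairs and returns (key, tape index of value).
/// Assumes the parser already validated the tape (every key has a value).
fn top_level_fields<'a>(tokens: &[Token], data: &[&'a str]) -> Vec<(&'a str, usize)> {
    let mut fields = Vec::new();
    let mut idx = 0;
    while idx < tokens.len() {
        if let Token::Text { data_idx } = tokens[idx] {
            fields.push((data[data_idx], idx + 1));
        }
        idx = next_value(tokens, idx + 1);
    }
    fields
}

fn main() {
    let (tokens, data) = example_tape();
    for (key, value_idx) in top_level_fields(&tokens, &data) {
        println!("{key} -> value at tape index {value_idx}");
    }
}
```

The walk never touches the five `Number` tokens: after reading the `nums` key it jumps straight from index 3 to index 10.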

Why create the data tape instead of embedding the slice directly into JsonValue::Text?

Yes, an alternative design could be:

enum JsonValue2<'a> {
  Text(&'a [u8]),
  Array { tape_end_idx: usize },
  Object { tape_end_idx: usize },
  End { tape_start_idx: usize }
}

and then there would be no need for a separate data tape. The main reason for keeping the data tape is space.

println!("{}", std::mem::size_of::<JsonValue>()); // 16
println!("{}", std::mem::size_of::<JsonValue2>()); // 24

JsonValue2 is 50% larger. This becomes apparent when deserializing documents that don’t contain many strings (ie mainly composed of numbers). The performance hit from the extra indirection has been negligible (probably because the data tape is often accessed sequentially and can spend a lot of time in cache), but I’m always on the lookout for better methods.
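For anyone who wants to reproduce the comparison, a self-contained sketch follows. The enums mirror the two designs in simplified form (no number variant), `token_sizes` is a hypothetical helper, and the exact byte counts depend on the target and compiler, so treat the printed values as something to check rather than a guarantee.

```rust
use std::mem::size_of;

// Token that stores an index into a separate data tape.
#[allow(dead_code)]
enum JsonValue {
    Text { data_idx: usize },
    Array { tape_end_idx: usize },
    Object { tape_end_idx: usize },
    End { tape_start_idx: usize },
}

// Token that embeds the slice (a pointer plus a length) directly.
#[allow(dead_code)]
enum JsonValue2<'a> {
    Text(&'a [u8]),
    Array { tape_end_idx: usize },
    Object { tape_end_idx: usize },
    End { tape_start_idx: usize },
}

/// Returns (size of the index-based token, size of the slice-based token).
fn token_sizes() -> (usize, usize) {
    (size_of::<JsonValue>(), size_of::<JsonValue2<'static>>())
}

fn main() {
    let (indexed, embedded) = token_sizes();
    println!("index token: {indexed} bytes");
    println!("slice token: {embedded} bytes");

    // Hypothetical tape of one million tokens, to put the gap in perspective.
    let tokens = 1_000_000;
    println!("extra bytes per 1M tokens: {}", (embedded - indexed) * tokens);
}
```

The embedded slice needs room for both a pointer and a length, so the token has to grow past the two machine words that the index-based design fits into.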

Performance Improvements

  • Deserializing in criterion benchmark: 4x
  • Parsing to intermediate format (tape vs hierarchical document): 5x (up to 800 MiB/s!)
  • Deserialization in browser (wasm): 6x

To me this is somewhat mind blowing – that I saw a 6x decrease in latency when deserializing in the browser. It turned an idea that seemed untenable into something possible. I was so blown away that I felt compelled to write about it so that I wouldn’t forget it, and to add a tip to the internet in case someone comes looking.

Pushing the power envelope with undervolting

This article has been superseded at its new home: https://sff.life/how-to-undervolt-gpu/

This article depicts how to use standard undervolting tools on a standard CPU and GPU to achieve a significant gain in power efficiency while preserving performance.


I’m migrating away from a 4.5L sandwich style case that used a 300w flex atx PSU to power:

  • Ryzen 2700
  • MSI GTX 1070 Aero ITX
  • 32GB ram at 1.35v
  • 1TB Nvme

For the curious, entering these components into various power supply calculators yielded:

So it already seems like we’re pushing our PSU, but I’m going to go with an even smaller rated PSU soon, so I’ll need to get creative through undervolting. Undervolting will:

  • Decrease energy consumption by lowering voltage (and thus watts)
  • Decrease temperature of components as there is less heat to shed
  • Potentially have no effect on performance, as we can undervolt at given frequencies
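As a rough illustration of why lower voltage means fewer watts: to a first approximation, the dynamic power of a chip scales with voltage squared times frequency. The sketch below applies that rule of thumb at a fixed clock. The voltages are hypothetical, `relative_dynamic_power` is a made-up helper, and leakage power is ignored, so the measurements later on matter more than this formula.

```rust
/// Relative dynamic power after a voltage change at a fixed clock,
/// using the first-order approximation P ∝ V² · f.
fn relative_dynamic_power(old_mv: f64, new_mv: f64) -> f64 {
    (new_mv / old_mv).powi(2)
}

fn main() {
    // Hypothetical voltages chosen for illustration only.
    let ratio = relative_dynamic_power(1000.0, 900.0);
    println!(
        "dynamic power falls to about {:.0}% of the original",
        ratio * 100.0
    );
}
```

Because voltage enters squared, a 10% drop in voltage gives closer to a 19% drop in dynamic power under this model.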

Could this be too good to be true?

How to Undervolt

Components we’ll be using:

  • P4460 Kill a Watt electricity usage monitor: to measure output from the wall. This is the only thing that costs money on the list – you may be able to rent it from a local library or utility company. A wattmeter is not critical, but it’ll give us a sense of total component draw that the PSU has to supply
  • Cinebench R20: Free CPU benchmark
  • 3dmark Timespy: Free GPU benchmark
  • HWiNFO64: to measure our sensor readings (temperature, wattage, etc)
  • MSI Afterburner: to tweak the voltage / frequency curve of the gpu
  • Motherboard bios: to tweak cpu voltage
  • Google sheets: A spreadsheet for tracking power usage, benchmarks, any modifications, etc

a Kill a Watt device to measure wall draw

Feel free to swap components out for alternatives, but I do want to stress the importance of benchmarks, as it’s desirable for any potential performance loss to become apparent when running benchmarks at each stage of the undervolt.

Attentive readers will note the absence of any stress tests (eg: Prime95 With AVX & Small FFTs + MSI Kombustor). It is my opinion that drawing obscene amounts of power to run these stress tests is just too unrealistic. Benchmarks should be stressful enough, else what are they benchmarking? But I understand the drive for those who want ultimate stability, so I’d suggest after we work our way down the voltage ladder to start climbing up as stress tests fail.

Initial Benchmarks and Measurements

Let’s break down what it really means when we’re measuring the number of watts flowing through the wattmeter:

  • Kill A Watt shows 200 watts
  • The PSU (SSP-300-SUG) is 300W 80 Plus Gold certified
  • PSU is able to convert at least 87%-90% of inbound power to the components with the rest dispersed as heat
  • The components are asking between 174-180 watts (else the PSU would be rated silver or platinum)
  • If the components are asking for max power (300 watts), the Kill A Watt should be reading 333-345 watts (300 / [.9, .87]) before shutdown
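The arithmetic in the list above is easy to get backwards, so here is a small sketch of it. The 87%-90% efficiency band is taken from the text; `component_draw` and `wall_draw` are hypothetical helper names, and real PSU efficiency varies with load, so the results are estimates.

```rust
/// Estimated power delivered to the components for a given wall reading,
/// across an assumed PSU efficiency band.
fn component_draw(wall_watts: f64, eff_low: f64, eff_high: f64) -> (f64, f64) {
    (wall_watts * eff_low, wall_watts * eff_high)
}

/// Estimated wall reading when the components pull `component_watts`.
/// The highest efficiency gives the lowest wall reading, and vice versa.
fn wall_draw(component_watts: f64, eff_low: f64, eff_high: f64) -> (f64, f64) {
    (component_watts / eff_high, component_watts / eff_low)
}

fn main() {
    let (low, high) = component_draw(200.0, 0.87, 0.90);
    println!("200W at the wall -> components draw {low:.0}-{high:.0}W");

    let (low, high) = wall_draw(300.0, 0.87, 0.90);
    println!("300W to components -> wall reads {low:.0}-{high:.0}W");
}
```

Going from the wall to the components multiplies by the efficiency, while going from the components back to the wall divides by it.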

First we’ll measure idle:

  • Close all programs including those in the background using any cpu cycles
  • Open HWiNFO
  • Wait for the system to settle down (ie, the wattmeter converges to a reading)
  • Record what’s being pulled from the wall, cpu / gpu watts, and cpu / gpu temps

Then benchmark the cpu:

  • Keep everything closed
  • Run Cinebench R20
  • For each run record score, max wattmeter reading, max watts / temp from the cpu
  • Reset HWiNFO recorded max values
  • Repeat three times

Then benchmark the gpu:

  • Keep everything closed
  • Run Timespy
  • For each run record score, max wattmeter reading, max watts / temp from the gpu
  • Reset HWiNFO recorded max values
  • Repeat three times

For timespy, the score you are interested in is the graphics score (highlighted below)

The reason why we’re interested in measuring max values is that we need any future power supply to handle any and all spikes.

Here’s an example of what I recorded for my initial measurements and how I interpreted them. Idle:

  • Wattmeter reading: 52.5W
  • CPU watts: 22
  • GPU watts: 15


GPU benchmark:

  • Average Timespy score: 5976
  • Max wattmeter reading: 241W
  • Max GPU watts: 159W
  • Max GPU temperature: 82c
  • Our GPU exceeds its TDP rating by 10W and total component draw is around 210-217W (241 * [.87, .9]). Through some algebra we can approximate other components consuming 50-60W

If you’re recording CPU watts during the GPU benchmark you may be able to say a bit more about motherboard + ram + etc usage, but be careful – max wattmeter, max gpu watts, and max cpu watts are unlikely to happen at the same time, so unless one has the ability to correlate all three of them, better to stick with the more likely event: max wattmeter occurs at the same time as max gpu watts during a gpu benchmark.


CPU benchmark:

  • Average Cinebench score: 3294
  • Max wattmeter reading: 131W
  • Max CPU watts: 77W
  • Max CPU temperature: 81c
  • Our CPU exceeds its TDP rating by 12W and total component draw is around 114-118W (131 * [.87, .9]). After subtracting the CPU’s 77W and roughly 15W for the idling GPU, other components are consuming around 20-25W.

GPU Undervolt

Determine Target Frequency

First, determine what target frequency you’d like your GPU to boost to, ideally a number between the GPU’s boost clock and max clock. The GPU’s boost clock will be listed in the manufacturer’s specification and also in GPU-Z. The GPU’s max clock is determined at runtime through GPU boost and will be reported as the “GPU Clock” in HWiNFO64 (make sure it’s running while the benchmarks are in progress). Below are screenshots from HWiNFO64 and GPU-Z showing the differences between these two numbers.

  • Boost clock: 1721mhz
  • Max clock: 1886mhz

GPU boost (different from boost clock) increases the GPU’s core clock while thermal and TDP headroom remain. This is somewhat counterproductive for us as any voltage increases to reach higher frequencies will cause an increase in power usage and blow our budget.

I chose my target frequency to be 1860mhz – 139mhz over boost and 26mhz under max, as during benchmarking the gpu clock was pinned at 1860mhz most of the time. The exact number doesn’t matter, as one can adjust the target frequency depending on the undervolting results. I change my target frequency later on. We’ll also choose our starting voltage to be 950mv, which is a pretty middling undervolt, and we’ll work our way down.

MSI Afterburner

The tool for undervolting! Powerful, but can be unintuitive at first.

Ctrl + F to bring up the voltage frequency curve. Find our target voltage of 950mv and take note of the frequency (1809mhz in the screenshot)

Our target frequency (1860mhz) is greater than 1809mhz by 51mhz, so we increase the core clock by 51mhz.

This will shift the voltage frequency graph up by 51mhz ensuring a nice smooth ride up the voltage / frequency curve until we hit 1860mhz at 950mv. Then to guarantee we don’t exceed 950mv:

  • Select the point at 950mv
  • Hit “l” to lock the voltage
  • For all voltage points greater than 950mv, drag them to or below 1860mhz
  • Hit “✔” to apply
  • Afterburner will adjust the points to be the same frequency as our locked voltage
  • You may have to comb over >950mv points to ensure that afterburner didn’t re-adjust any voltage points to be greater than 1860mhz. It happens
  • Hit “✔” to apply
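What Afterburner ends up doing to the curve can be modelled in a few lines. This is only a toy model under assumptions: the curve is a list of (millivolt, megahertz) points, `undervolt_curve` is a hypothetical function, and apart from the 950mv / 1809mhz point the sample values are invented for illustration.

```rust
/// Shifts a voltage/frequency curve so the point at `lock_mv` reaches
/// `target_mhz`, then caps every higher-voltage point at `target_mhz`.
/// Returns None when the curve has no point at `lock_mv`.
fn undervolt_curve(
    curve: &[(u32, u32)],
    lock_mv: u32,
    target_mhz: u32,
) -> Option<Vec<(u32, u32)>> {
    // Frequency the stock curve assigns to the voltage we want to lock.
    let &(_, stock_mhz) = curve.iter().find(|&&(mv, _)| mv == lock_mv)?;
    // Core clock offset, e.g. 1860 - 1809 = 51.
    let offset = i64::from(target_mhz) - i64::from(stock_mhz);

    let adjusted = curve
        .iter()
        .map(|&(mv, mhz)| {
            let shifted = (i64::from(mhz) + offset).max(0) as u32;
            // Points above the locked voltage must not exceed the target.
            let capped = if mv > lock_mv { shifted.min(target_mhz) } else { shifted };
            (mv, capped)
        })
        .collect();
    Some(adjusted)
}

fn main() {
    // Only the 950mv point comes from the article; the rest are made up.
    let stock = [(900, 1771), (950, 1809), (1000, 1860), (1050, 1886)];
    if let Some(curve) = undervolt_curve(&stock, 950, 1860) {
        for (mv, mhz) in curve {
            println!("{mv}mv -> {mhz}mhz");
        }
    }
}
```

Points below the locked voltage keep the shifted curve’s shape, while everything above it flattens out at the target frequency, so the card has no reason to request more than 950mv.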

End result should look like:

After our hard work, we’ll want to save this as a profile so that we can refer back to it after we inevitably undervolt too far. Also after determining what is the best undervolt, we’ll want to have that profile loaded on boot, so ensure the windows icon is selected.

Rinse and Repeat

Do the benchmark routine:

  • Clear / reset sensor data for max power usage and temperature
  • Start Timespy
  • Keep eyes glued on the Kill A Watt and record max draw
  • After Timespy completes, record the score, gpu max power usage, max temperature, and max wall draw.
  • Repeat 3 times

After three successful benchmarks:

  • Open Afterburner
  • Reset the voltage / frequency graph by hitting the “↺” icon
  • Decrement the target voltage by 25mv and keep the same target frequency
  • Calculate new core clock offset from new target voltage
  • Proceed to adjust all greater voltages to our frequency
  • Re-benchmark
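The loop above amounts to a simple downward search. A sketch of its shape, with the caveat that `lowest_stable_mv` is a hypothetical helper and the `is_stable` closure stands in for the manual work of running three benchmarks at each voltage:

```rust
/// Steps the voltage down from `start_mv` in `step_mv` decrements and
/// returns the lowest voltage that passed, or None if the start failed.
fn lowest_stable_mv(
    start_mv: u32,
    step_mv: u32,
    floor_mv: u32,
    mut is_stable: impl FnMut(u32) -> bool,
) -> Option<u32> {
    assert!(step_mv > 0, "step must be positive");
    let mut best = None;
    let mut mv = start_mv;
    while mv >= floor_mv && is_stable(mv) {
        best = Some(mv);
        if mv < step_mv {
            break; // avoid underflow on the next decrement
        }
        mv -= step_mv;
    }
    best
}

fn main() {
    // Stand-in for real benchmark runs: pretend anything below 875mv crashes.
    // The 700mv floor is an arbitrary lower bound for the search.
    let result = lowest_stable_mv(950, 25, 700, |mv| mv >= 875);
    println!("lowest stable voltage: {result:?}");
}
```

The search stops at the first failure and reports the last voltage that passed, which is the profile worth saving in Afterburner.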


Results:

  • Undervolted to 875mv at 1860mhz
  • 850mv was not stable
  • No effect on Timespy score (<1% difference)
  • GPU max wattage decreased from 159W to 124W (20-23% reduction)
  • GPU max temp decreased from 82 to 75 (9-10% reduction)
  • Attempting an undervolt of 875mv at 1886mhz (the original max clock) was not stable

These are good results that demonstrate that there is no performance loss from undervolting, yet one can mitigate heat and power usage. I decided to take undervolting one step further and set my target frequency to the GPU’s boost frequency (1721mhz) and record the results:

  • Undervolted to 800mv at 1721mhz
  • Slight decrease to Timespy score (6-7%)
  • GPU max wattage decreased from 159W to 105W (33% reduction)
  • GPU max temp decreased from 82 to 68 (15-17% reduction)

To me, the loss in performance in this chase to undervolt is greatly outweighed by the gain in efficiency.

CPU Undervolting

Now onto CPU undervolting. This step will differ greatly depending on the motherboard, but for me it was as easy as adjusting the VCore Voltage Offset in 20mv increments.

After booting, run cinebench, record values – the works. Repeat a few times. On success, decrement the offset more.


Results:

  • Undervolted to -100mv offset
  • Small decrease in Cinebench score (<3% difference)
  • CPU max wattage decreased from 77W to 65W (15% reduction)
  • CPU max temp decreased from 81 to 76 (6% reduction)


We’ve decreased power usage for the CPU and GPU considerably (a combined 66 watts), lowered temperatures, and opted into additional undervolts for a near negligible performance loss. Yet we’re unable to say anything about power options we can slim down to, so I ran an informal stress test by running cinebench and timespy at the same time and recorded the max watts pulled from the wall: 205 watts. That means the components are really asking for about 178-185 watts (205 * [.87, .9]). Incredible – less than 200 watts in a stress test.

Monitoring Remote Sites with Traefik and Prometheus

Partial screenshot of a traefik metric dashboard

I have several sites deployed on VPSs like DigitalOcean that have been dockerized and are reverse proxied by traefik so they don’t have to worry about Let’s Encrypt, https redirection, etc. Until recently I had very little insight into these sites and infrastructure. I couldn’t answer basic questions like:

  • How many requests is each site handling
  • What are the response times for each site
  • Is the box over / underprovisioned

For someone who has repeatedly blogged about metrics and observability (here, here, here, here, and here) – this gap was definitely a sore spot for me. I sewed the gap shut with traefik static routes, prometheus metrics, basic authentication, and Let’s Encrypt.

Do note that this article assumes you already have a working setup with traefik and Let’s Encrypt.


Exposing Traefik Metrics

Traefik keeps count of how many requests each backend has handled and the duration of these requests. Traefik offers several options for exporting these metrics, but only one of them, prometheus, is a pull model. A pull model is ideal here as it lets one track metrics from a laptop without any errors on the server side, and one can more easily tell if a service has malfunctioned. So we’ll allow metrics to be exposed by enabling the prometheus metrics provider in our traefik.toml.


If traefik is running inside a docker container (in my case, docker compose) the default api port needs to be exposed.

    # ... snip
    ports:
      - 80:80
      - 443:443
      - 8080:8080

Now once traefik starts, we can retrieve metrics from http://<server-ip>:8080/metrics. Three things are wrong:

  • Metrics are broadcast to the entire internet. We need to lock this down to only authenticated individuals.
  • Metrics are served over http, so others can snoop on them.
  • Another port is open, and I typically find it preferable to lock traffic down to as few ports as possible.

We fix these issues by binding the listening address to localhost, and reverse proxying through a traefik frontend that forces basic authentication and TLS.

To bind port 8080 so it only listens locally, update the docker compose file

       - 80:80
       - 443:443
-      - 8080:8080
+      - 127.0.0.1:8080:8080

Now let’s proxy traffic through a basic auth TLS frontend.

First, create a prometheus user with a password of “mypassword” encoded with bcrypt using htpasswd (installed through apache2-utils on Ubuntu):

$ htpasswd -nbB prometheus mypassword

Potential performance issues with constantly authenticating using bcrypt can be mitigated with SHA1. Though in practice CPU usage is less than 1%.

Next, we configure traefik using the file directive. This basically configures traefik with static routes – think nginx or apache. While not explicitly mentioned anywhere, one can configure traefik with as many route providers as necessary (in this case, docker and file). A nice feature is that the file provider can delegate to a separate file that can be watched so traefik doesn’t need to be restarted on config change. For the sake of this article, I’m keeping everything in one file.

Add the below to traefik.toml. It’ll listen for metrics.myapp.example.com, only allow those who authenticate as prometheus, and then forward the request to our traefik metrics.


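# server1 and metric-route are arbitrary labels
[file]

[backends]
  [backends.internal-traefik]
    [backends.internal-traefik.servers.server1]
       url = "http://127.0.0.1:8080"

[frontends]
  [frontends.internal-traefik]
      backend = "internal-traefik"
      basicAuth = ["prometheus:$2y$05$JMP9BgFp6rtzDpAMatnrDeuj78UG7W05Zr4eyjtq2i7.gk0KZfcIC"]

      [frontends.internal-traefik.routes.metric-route]
          rule = "Host:metrics.myapp.example.com"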

Note that this relies on Let’s Encrypt working, as metrics.myapp.example.com will automatically be assigned a cert. Pretty neat!

Exposing System Metrics

We now have a slew of metrics giving insights into the number of requests, response times, etc, but we’re missing system metrics to know if we’ve over or under provisioned the box. That’s where node_exporter comes into play. It’s analogous to collectd, telegraf, and diamond, but it is geared for prometheus’s pull model.

While one can use the prometheus-node-exporter apt package, I opted to install node_exporter’s latest version right from GitHub Releases. I avoided installing via docker as the readme states “It’s not recommended to deploy it as a Docker container because it requires access to the host system”. No major harm done, as the node_exporter package is a single executable.

Executing it reveals some work to do.

$ ./node_exporter
INFO[0000] Listening on :9100
  • Plain http
  • Internet accessible port 9100

The official standpoint is that anything requiring auth or security should use a reverse proxy. That’s fine, good to know.

Downloading just an executable isn’t installing it, so let’s do that now before we forget. This is for systemd, so steps may vary.

cp ./node_exporter /usr/sbin/.
useradd --no-create-home --shell /usr/sbin/nologin --system node_exporter
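cat > /etc/systemd/system/node_exporter.service <<EOF
[Unit]
Description=Node Exporter

[Service]
User=node_exporter
ExecStart=/usr/sbin/node_exporter

[Install]
WantedBy=multi-user.target
EOF

systemctl enable node_exporter
systemctl start node_exporter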

Now we’ll add authentication and security.

Node Exporter through Traefik

Now that we have multiple metric exporters on the box, it may seem tempting to look for a solution that aggregates both exporters so we only need to configure prometheus to scrape one endpoint. One agent to rule them all goes into depth as to why that’d be a bad idea, but the gist is operational bottlenecks, lack of isolation, and bottom up configuration (instead of top down like prometheus prefers).

Ok, two endpoints are needed for our two metric exporters. We could split our metrics.myapp.example.com into traefik.metrics.myapp.example.com and node.metrics.myapp.example.com (phew!). This is a fine approach, but I’m going to let the Let’s Encrypt servers have a breather and only work with metrics.myapp.example.com. We’ll have /traefik route to /metrics on our traefik server and /node-exporter route to /metrics on node_exporter.

I’ll post traefik.toml config with commentary following:


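# server1 and metric-route are arbitrary labels; <bridge-ip> is the docker bridge address
[file]

[backends]
  [backends.internal-traefik]
    [backends.internal-traefik.servers.server1]
       url = "http://127.0.0.1:8080"

  [backends.internal-node]
    [backends.internal-node.servers.server1]
       url = "http://<bridge-ip>:9100"

[frontends]
  [frontends.internal-traefik]
      backend = "internal-traefik"
      basicAuth = ["prometheus:$2y$05$JMP9BgFp6rtzDpAMatnrDeuj78UG7W05Zr4eyjtq2i7.gk0KZfcIC"]

      [frontends.internal-traefik.routes.metric-route]
          rule = "Host:metrics.myapp.example.com;Path:/traefik;ReplacePath:/metrics"

  [frontends.internal-node]
      backend = "internal-node"
      basicAuth = ["prometheus:$2y$05$JMP9BgFp6rtzDpAMatnrDeuj78UG7W05Zr4eyjtq2i7.gk0KZfcIC"]

      [frontends.internal-node.routes.metric-route]
          rule = "Host:metrics.myapp.example.com;Path:/node-exporter;ReplacePath:/metrics"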

The backend for internal-node is “http://<bridge-ip>:9100” and not “http://127.0.0.1:9100”. Here <bridge-ip> is the ip address of the docker bridge (commonly 172.17.0.1 on a default install). This interface links containers with each other and the outside world. If we had used 127.0.0.1, that would be traffic local to traefik inside its container (which node-exporter is not inside, as it resides on the host system). The bridge ip address can be found by executing:

docker network inspect bridge --format='{{(index .IPAM.Config 0).Gateway}}'

We can now update node-exporter to listen to internal traffic only on the bridge address:

node_exporter --web.listen-address="<bridge-ip>:9100"

Other notes:

  • One can’t override the backend in a frontend route based on path, so two frontends are created. This means that the config turned out a bit more verbose.
  • This necessitates another entry for basic auth. In this example I copied and pasted, but one could generate a new bcrypt hash with the same password or use a different password
  • Each frontend rule uses the ReplacePath modifier to change the path to /metrics, so something like /node-exporter gets translated to /metrics
  • I prefix each frontend and backend endpoint with “internal” so that these can be excluded in prometheus queries with a simple regex.
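As a mental model of what the two frontends do, here is a tiny sketch. It is not how traefik is implemented; `route` is a hypothetical function that only mirrors the Host, Path, and ReplacePath rules from the config above, and it leaves out basic auth and TLS entirely.

```rust
/// Toy model of the two frontends: for a matching host and path, returns
/// the backend name and the path the request is rewritten to.
fn route(host: &str, path: &str) -> Option<(&'static str, &'static str)> {
    if host != "metrics.myapp.example.com" {
        return None;
    }
    match path {
        // Path:/traefik;ReplacePath:/metrics
        "/traefik" => Some(("internal-traefik", "/metrics")),
        // Path:/node-exporter;ReplacePath:/metrics
        "/node-exporter" => Some(("internal-node", "/metrics")),
        _ => None,
    }
}

fn main() {
    for path in ["/traefik", "/node-exporter", "/other"] {
        match route("metrics.myapp.example.com", path) {
            Some((backend, rewritten)) => println!("{path} -> {backend}{rewritten}"),
            None => println!("{path} -> no matching frontend"),
        }
    }
}
```

Both paths end up at /metrics, just on different backends, which is why each exporter needs its own frontend.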

Now it’s time for prometheus to actually scrape these metrics.

Enter Prometheus

The prometheus config can remain pretty bland, if only a bit redundant.

scrape_configs:
  - job_name: 'traefik'
    scheme: https
    basic_auth:
      username: prometheus
      password: mypassword

    metrics_path: "/traefik"
    static_configs:
    - targets: ['metrics.myapp.example.com']

  - job_name: 'node-exporter'
    scheme: https
    basic_auth:
      username: prometheus
      password: mypassword

    metrics_path: "/node-exporter"
    static_configs:
    - targets: ['metrics.myapp.example.com']

And finally Grafana

I haven’t quite got the hang of promQL, so I’d recommend importing these two dashboards:

Play with and tweak as needed.


Definitely a lot of pros with this experience:

  • Traefik already collects metrics and only needed configuration to expose them
  • Traefik can reverse proxy through static routes in a config file. I was worried I’d have to set up basic auth and tls through nginx, apache, or something like ghost tunnel, which I’m unfamiliar with. I love nginx, but I love managing fewer services more.
  • Installing node_exporter was a 5 minute ordeal
  • Pre-existing Grafana dashboards for node_exporter and traefik
  • A metric pull model sits more comfortably with me, as the alternative would be to push the metrics to my home server where I house databases such as graphite and timescale. Pushing data to my home network is a lovely thought, but one I don’t want to depend on right now.

In fact, it was such a pleasant experience that even for boxes where I don’t host web sites, I’ll be installing this traefik setup for node_exporter.