
Opinionated Guide for Web Development with Rust Web Workers

Webpack configuration: look – don’t touch

I have a toy site that parses Rocket League replays using the rust crate boxcars. Parsing replays is done synchronously and shouldn’t block the browser’s UI thread, so parsing is offloaded to a web worker. Getting this side project to a point where it works + is maintainable (minimum configuration) has been an exercise in doggedness as I spent weeks exploring options.

I believe I’ve found a happy medium and here’s the recipe:

  • Wasm-pack for converting rust to js + wasm
  • Typescript. Refrain from anything special that can’t be compiled (eg: no css modules or importing images). jsx / tsx is fine. No need for babel
  • Use a highly trimmed webpack config to output fingerprinted files
  • Hugo to glue everything together and preprocess sass files and other assets

Here’s the philosophy: by minimizing configuration and dependencies while maintaining production quality, the project becomes easier to maintain.

To give a sneak peek, the entire pipeline can be executed as

wasm-pack build crate && tsc && mkdir -p data && webpack --mode production && hugo

or an npm run build.
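
For reference, the npm script is nothing more than that same command dropped into package.json (a sketch; trim to taste):

{
  "scripts": {
    "build": "wasm-pack build crate && tsc && mkdir -p data && webpack --mode production && hugo"
  }
}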

Other than wasm-pack, everything else has basically been derived from a process of elimination, so let’s go through this recipe in detail.

Sample Code

Here is some sample code to demonstrate the core interaction between our UI and web assembly. First we need a web worker so that one can pass messages between the UI and worker thread.

src/index.tsx

// unused, but an example node_modules import
import { render, h } from "preact";

const replayWorker = new Worker("worker.js", { type: "module" });
replayWorker.postMessage("LOAD");

While { type: "module" } is a valid option for a worker, no browser supports it yet, so typescript leaves it in without complaint, and webpack (via worker-plugin) splits the worker into a separate bundle.

“worker.js” is not a typo even though the worker is written in typescript: typescript converts the code to js before it is handed to the webpack phase.

src/worker.ts

In our worker code we load the web assembly.

import { ReplayParser } from "./ReplayParser";

let parser: ReplayParser | null = null;
onmessage = async (e) => {
  switch (e.data) {
    case "LOAD": {
      const module = await import("../crate/pkg/rl_wasm");
      parser = new ReplayParser(module);
      break;
    }
  }
};
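
To round out the example, the worker could reply once the wasm module is loaded and the UI could react to it. The "LOADED_OK" message name below is made up for illustration; the real project’s protocol may differ.

// src/worker.ts (sketch): after constructing ReplayParser, tell the UI we're ready
postMessage("LOADED_OK");

// src/index.tsx (sketch): listen for replies coming back from the worker
replayWorker.onmessage = (e: MessageEvent) => {
  if (e.data === "LOADED_OK") {
    console.log("replay parser ready");
  }
};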

src/ReplayParser.ts

And we can use a neat typescript trick in ReplayParser to take advantage of the typescript declarations written by wasm-pack to ensure that our javascript is adhering to the wasm return types.

import * as rl_mod from "../crate/pkg/rl_wasm";
type RLMod = typeof rl_mod;
export class ReplayParser {
  mod: RLMod;
  constructor(mod: RLMod) {
    this.mod = mod;
  }
}

Configuration Files

I hate configuration files. I’ve tried to whittle them down, but some are necessary, so let’s get them out of the way.

tsconfig.json (remove comments if copying + pasting)

{
  "compilerOptions": {
    "strict": true,              // strict code quality
    "module": "esnext",          // allow for dynamic import of wasm bundle
    "moduleResolution": "node",  // allow imports of npm dependnecies
    "target": "es2018",          // Allow modern features as we're using wasm anyways
    "jsx": "react",              // I like preact so these options are for writing tsx
    "jsxFactory": "h",
    "outDir": "./dist/"          // output directory for webpack to consume
  },
  "include": ["src/**/*"]
}

The typescript phase can be run with npx tsc.

webpack.config.js

const path = require("path");
const WorkerPlugin = require("worker-plugin");
const AssetsPlugin = require("assets-webpack-plugin");

module.exports = {
  entry: "./dist/index.js", // the output of our typescript phase
  devtool: "source-map",
  plugins: [
    new WorkerPlugin(),
    new AssetsPlugin({ filename: "data/webpack.json" }),
  ],
  output: {
    filename: "[name].[contenthash].js",
    path: path.join(__dirname, "static", "js"),
    publicPath: "/js/",
  },
};

There are only two webpack plugins needed to accomplish this:

  • worker-plugin so that webpack correctly splits the worker script into a fingerprinted file that also can correctly import a fingerprinted web assembly module
  • assets-webpack-plugin which will output a file that contains a map of original js filenames to the fingerprinted ones for the downstream hugo process.

The only dev dependencies in a package.json needed to run this webpack config (swap versions for latest):

{
  "devDependencies": {
    "assets-webpack-plugin": "^3.9.12",
    "typescript": "^3.8.3",
    "webpack": "^4.42.1",
    "webpack-cli": "^3.3.11",
    "worker-plugin": "^4.0.2"
  }
}

That’s it for webpack. This is one of the shortest webpack configs I’ve ever seen, with the fewest dependencies.

A couple of things to note:

  • The source maps are distributed in production. I have no problem with this and neither should you.
  • The source maps will be of the compiled javascript (not original typescript source). Since typescript generates modern javascript that is quite close to the original typescript, one shouldn’t have a problem stepping through the code.

The webpack phase can be run with npx webpack --mode production.

Directory structure

The project directory should be set up with these folders:

  • assets: owned by hugo. Place files that will be processed by hugo here (images, sass, etc)
  • crate: our rust crate that wasm-pack compiles to wasm
  • cypress: (optional) used for end to end integration testing.
  • data: owned by hugo but generated by the build process. We configure webpack to output a webpack.json here which contains a mapping of original source file names to fingerprinted ones.
  • dev: (optional) miscellaneous files (eg: nginx.conf used for static hosting) or files used in multiple places to be symlinked.
  • layouts: owned by hugo. Place an index.html in this directory.
  • src: typescript code
  • static: owned by hugo. Where one sticks static content, so this is where we’ll put the resulting js bundles written by webpack as well as any static images
  • tests: unit tests written with jest

Hugo

Let’s take a look at the most unorthodox piece of this puzzle: using the static site generator hugo (specifically hugo extended). Over time I’ve grown more confident in the decision to use hugo for a few reasons:

  • Hugo’s philosophy of convention over configuration means one can introduce it with very little configuration.
  • Hugo is distributed as a single binary for all platforms
  • Hugo (extended) has a built in Sass preprocessor, so no additional dependencies or configuration for css processing
  • Hugo can fingerprint assets (which ruled out the Rust alternative, zola)
  • Hugo can work hand in hand with webpack through the data/webpack.json file generated by assets-webpack-plugin, so hugo can insert the fingerprinted js links
  • A static site generator is purpose built to glue everything together into a cohesive site through a built in templating language, something webpack needs plugin after plugin and tons of configuration to try to replicate.

Aside: the reason why we need webpack to fingerprint the js (instead of having hugo do it) is that webpack also spits out multiple fingerprinted js files even with a single entry point (eg: web worker, wasm, extracted vendor dependencies)

Those familiar with the hugo ecosystem may recognize this as a heavily simplified victor-hugo setup. In my opinion this guide is better due to its simplicity: being opinionated can drastically decrease the cognitive overhead, and that is the goal here. Supporting everything under the sun leads to too much magic or configuration.

Yes it seems like overkill to introduce a static site generator for this project, but I believe it is the right approach. There is no magic glue between any of the components. Nothing really to go wrong.

I see people complain about the complexity of hugo and all the ceremony it takes to set it up but I only had to move around some files (see directory structure section described above) and add a single config.toml file. Brace yourself, it’s long:

disableKinds = ["taxonomy", "taxonomyTerm", "RSS", "sitemap"]

Even this isn’t required, but for a simple project I don’t want all those extra files generated.

To import our main javascript into the app:

<script type="text/javascript" src="{{ .Site.Data.webpack.main.js | relURL }}"></script>
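
For reference, the data/webpack.json written by assets-webpack-plugin is a small map keyed by entry name, roughly like the following (the hash is made up):

{
  "main": {
    "js": "/js/main.3b1ab22ee02d4ffd6864.js"
  }
}

Hugo reads anything under data/ into .Site.Data, which is how .Site.Data.webpack.main.js resolves to the fingerprinted path above.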

Hugo fits nicely into the equation. While hearing that hugo is a static site generator may scare off SPA enthusiasts, it shouldn’t: this setup works well for a SPA.

JS Bundlers

I’ve tried to love the parcel bundler. Its zero configuration setup works well most of the time, and while it still has a place for standard projects that need to hit the ground running, I’ve grown disillusioned with it for Rust WebAssembly purposes:

  • Bizarre, baroque, and hard to reproduce errors
  • Incorrect cache busting causing stale files to break the production web app (I’ve had to resort to a poor person’s implementation of an asset pipeline: sha256 + sed)
  • And a whole host of other problems revolving around rust.
  • Specifying a typescript compilation target means nothing without a browserslist.

Thus began the search for other bundlers. I dabbled in the “no bundler” snowpack, but its reliance on etags instead of fingerprinted assets reflects poorly in lighthouse performance audits (though this may be changing). The potential for loading each source file as a separate request is also frightening. I need fingerprinting.

Webpack is a natural replacement for parcel, except I have a big gripe with webpack: its configuration. I’m sure if you polled developers on how they feel about webpack configuration, the answer would be one word: magic. Configs tend to be written once, fingers crossed that they stay perfect for eternity. I’m not ok with trading one type of magic (parcel’s one size fits all) for another (webpack’s unintelligible config).

It seems every day there is a new article about how to configure webpack (the irony of me writing an article on the same topic isn’t lost on me). Here are some top results: “How to configure Webpack 4 from scratch”, “A tale of Webpack 4 and how finally configure it”, “An annotated webpack 4 config”. These articles contain bloated dependencies and configs; the last example in particular tops out at 60 dependencies and a 130 line config just to build the project. It can feel like webpack is a hammer and everything is a nail. No, you don’t need hot module replacement (HMR), css modules, mdx, babel, or a separate dev / prod config. Keep it simple.

There are a ton of webpack intermediaries that claim a zero or simplified config: create-react-app, nwb, neutrino, poi. They all fell short and left a sour taste in my mouth, as these intermediaries only service their own use cases and rust + webassembly isn’t among them. I spent way more time trying to simplify a webpack setup through these tools than if I had just written the config myself.

Ts-loader

One can consolidate the separate typescript and webpack invocations into the webpack config by adding a dash of complexity and the ts-loader dependency. Personally, I don’t see the need for this consolidation, as the “simplest setup” is still far too long to be a replacement for a single tsc command. Here is what our config would look like if we added in ts-loader:

Not recommended

const path = require("path");
const WorkerPlugin = require("worker-plugin");
const AssetsPlugin = require("assets-webpack-plugin");

module.exports = {
  entry: "./src/index.tsx",
  devtool: "source-map",
  module: {
    rules: [
      {
        test: /\.tsx?$/,
        loader: "ts-loader",
      },
    ],
  },
  resolve: {
    extensions: [".ts", ".tsx", ".js"],
  },
  plugins: [
    new WorkerPlugin(),
    new AssetsPlugin({ filename: "data/webpack.json" }),
  ],
  output: {
    filename: "[name].[contenthash].js",
    path: path.join(__dirname, "static", "js"),
    publicPath: "/js/",
  },
};

The benefits of ts-loader:

  • Sourcemaps can now contain the original typescript code (may need to update tsconfig.json to enable source maps)

If adding ts-loader had been as simple as new TsLoaderPlugin(), it could maybe be worth considering, but the extra configuration seems a bit much.

Typescript

I write my projects in Typescript because I like static types.

SASS

Sass is not my first choice for a css preprocessor. I had gotten quite fond of using postcss in other projects, but I was not fond of adding the necessary configuration + multiple dependencies. If Hugo had an out of the box experience with postcss, I would have gone with that, but it doesn’t: it requires configuration and the postcss dependency to be installed out of band.

So sass it is. At first I was grimacing at the thought of converting the chosen postcss dialect to scss, but it turns out that browsers have really good support for modern css features, so only a minimal amount of conversion was actually necessary.

Sass usage in hugo is quite pleasant. I define a variable to hold a dictionary of options dictating that includes can come from “node_modules” and that we want the resulting css compressed:

{{ $opts := (dict "outputStyle" "compressed" "includePaths" (slice "node_modules")) }}

Critical css can have its own stylesheet and be inlined:

{{ $critical := resources.Get "critical.scss" | toCSS $opts }}
<style>{{$critical.Content | safeCSS }}</style>

With the rest of the app’s styles coming later.

{{ $app_css := resources.Get "app.scss" | toCSS $opts | fingerprint }}
<link rel="stylesheet" type="text/css" href="{{ $app_css.RelPermalink }}" />

One can also create a stylesheet that vendors all 3rd party style sheets so that small app css updates don’t trigger a re-download of a potentially heavy 3rd party stylesheet.

Testing

Cypress is used for end to end tests against the actual site. These tests confirm the ability of the WebAssembly and js to function against real world input. Integration tests allow the project to skip component testing in favor of real life interactions. With no component tests, a project is free to swap out the internals (ie: move to a different framework or eschew all of them) and not invalidate any tests (easy maintenance). There are still unit tests, but only those that don’t use the DOM.

I’m not too opinionated with unit tests. Ideally js would have a built in test runner kinda like rust, but for now jest with ts-jest is fine (ts-jest is necessary until jest natively supports es6 modules). Removing jest from the dependencies once shrunk a package-lock.json from 12k to 6k lines, which is incredible, so there is still a lot of room left for improvement in unit testing.
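
For a flavor of what lands in tests/, here is a hypothetical DOM-free unit test (formatBytes is an assumed pure helper, not from the real project):

// tests/format.test.ts -- run by jest + ts-jest, no DOM required
import { formatBytes } from "../src/format";

test("formats byte counts for display", () => {
  expect(formatBytes(1024)).toBe("1 KiB");
});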

FAQ

How to handle cross platform contributors

The command posted earlier:

wasm-pack build crate && tsc && mkdir -p data && webpack --mode production && hugo

is not cross platform and leaves Windows users out in the cold. This can be ok if the project is just for yourself and you don’t use windows. Don’t sweat the details. With WSL so popular these days, it’s not even much of a problem.

If native windows development is absolutely necessary, one can write cross platform node js scripts.
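
A sketch of such a script, relying only on node built-ins (scripts/build.js is a hypothetical path):

// scripts/build.js -- the same pipeline, runnable on any OS
const { execSync } = require("child_process");
const { mkdirSync } = require("fs");

const run = (cmd) => execSync(cmd, { stdio: "inherit" });

run("wasm-pack build crate");
run("npx tsc");
mkdirSync("data", { recursive: true }); // replaces `mkdir -p data`
run("npx webpack --mode production");
run("hugo");

Wire it up as the build script in package.json ("build": "node scripts/build.js") and Windows users are no longer left out.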

Development & Livereload

Rust and typescript aren’t known for their quick compilation times, and the lag is felt when even the tiniest change necessitates an entire rebuild with npm run build. It would be nice if a typescript-only change didn’t need to invoke the rust toolchain. When I edit frontend code, I like to see the browser refresh quickly with my changes without effort on my part.

Fortunately, 3 out of the 4 tools used so far natively support a watch mode and they meld together nicely. Below is a script I have that is aliased to npm start:

#!/bin/bash

npm run build
npx tsc --watch --preserveWatchOutput &
npx webpack --mode production --progress --watch &
hugo serve --renderToDisk

Modifying a typescript file causes a quick site reload. Perfect for quick iteration. Even though I exclusively use webpack in production mode, I’ve not found watch recompilation times to be an issue.

The one tool that doesn’t support a watch mode is wasm-pack, but this is a known issue with workarounds using fswatch, cargo-watch, entr, and the like:

while true; do
  ls -R -d crate/Cargo.toml crate/Cargo.lock crate/src/ \
    | entr -d wasm-pack build crate;
done

In practice, I have not needed to use the above workaround as it’s been intuitive enough for me to execute wasm-pack build after any rust modifications. After wasm-pack is done executing, webpack sees the wasm was modified and automatically picks it up and a new site is generated.

How to deploy

I prefer deploying sites by packaging them up into docker images. No exception here; just ensure that no generated files are copied into the image, to aid reproducibility.

If instead you are uploading the public/ folder directly someplace, it’s probably a good idea to clean out the webpack generated static/js on every build with the clean-webpack-plugin. Since I don’t consider this essential, I’ve excluded it from the posted webpack config; when the additional configuration is approximately the same size as the equivalent command (rm -rf ./static/js), I lean towards the command. Anyways, here is the config if one wants to clean the directory on every build:

const path = require("path");
const WorkerPlugin = require("worker-plugin");
const AssetsPlugin = require("assets-webpack-plugin");
const { CleanWebpackPlugin } = require('clean-webpack-plugin');

module.exports = {
  entry: "./dist/index.js", // the output of our typescript phase
  devtool: "source-map",
  plugins: [
    new CleanWebpackPlugin(),
    new WorkerPlugin(),
    new AssetsPlugin({ filename: "data/webpack.json" }),
  ],
  output: {
    filename: "[name].[contenthash].js",
    path: path.join(__dirname, "static", "js"),
    publicPath: "/js/",
  },
};

Splitting Vendored JS

In the event that you are embedding heavy dependencies that don’t change at nearly the same rate as your application, one can split the dependencies out to reduce the amount of data users need to pull down after each deploy. One can do this with an optimization.splitChunks setting:

  optimization: { splitChunks: { chunks: "all" } },

That’s it. Don’t get carried away by all the options. Full config shown below:

const path = require("path");
const WorkerPlugin = require("worker-plugin");
const AssetsPlugin = require("assets-webpack-plugin");

module.exports = {
  entry: "./dist/index.js",
  devtool: "source-map",
  plugins: [
    new WorkerPlugin(),
    new AssetsPlugin({ filename: "data/webpack.json" }),
  ],
  output: {
    filename: "[name].[contenthash].js",
    path: path.join(__dirname, "static", "js"),
    publicPath: "/js/",
  },
  optimization: { splitChunks: { chunks: "all" } },
};

Again, only use this optimization if dependencies significantly outweigh the app and one deploys often. I have a preact app where I don’t use this trick as there really isn’t any dependency baggage so there isn’t a reason to split dependencies off. Other times it’s a fat react + antd where I don’t want my users to download these again if I simply fix a typo in the app.

Proxy backend requests

When developing a frontend, it is beneficial to interact with the backend. While this can vary greatly from project to project, for my projects I’m able to reuse everything that is used for production. I have a docker container for housing the API and containers for S3 (minio), postgres, and redis. All of these are orchestrated by docker compose.

And since I’m already using nginx to host the frontend assets, I can simply bind mount /usr/share/nginx/html to the ./public directory to serve all the latest assets as they change. This is why the start.sh script shown earlier has hugo use --renderToDisk.

The last thing I do is have a traefik container act as the entrypoint for all requests, routing /api requests to the api server and everything else to the nginx server.

I’m quite satisfied with the result:

  • Purpose built tools watching their domain for any changes
  • Re-using production containers for development

Linting / Formatting / Other tools

For formatting, run prettier without installing it as a dev dependency. Pinning a formatter to a specific version seems a tad asinine, as one should always be using the latest version. Just like how I don’t customize rustfmt through rustfmt.toml, don’t customize prettier, as it can be confusing where the configuration is located (eg package.json, .prettierrc, prettier.config.js) and it serves only to pollute the project directory.
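
For example, a one-off format of the source tree is a single command (glob pattern assumed):

npx prettier --write "src/**/*.{ts,tsx}"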

I eschew linting in frontend projects. It’s a bit of a surprise considering how much I love clippy for rust projects, but I don’t have any love for eslint. Every guide on how to get started with eslint, typescript, prettier, react, etc has one installing a dozen linting dependencies and crafting a configuration that can reach hundreds of lines. This is not ok. My preference would be for an opinionated, self-contained tool that does not need to be added as a dev dependency to run lints. Rome can’t come fast enough. Until then, typescript type checking + prettier get me most of the way there.

As for other tools – when in doubt don’t use them:

  • You don’t need css modules when you can write regular sass files or attach styles directly to the component
  • You don’t need to Extract & Inline Critical-path CSS in HTML pages: split up your sass stylesheets and have hugo inline the style in the html.
  • You don’t need to Remove unused CSS when you import stylesheets of just the components you need
  • You don’t need dynamically loaded images fingerprinted (this is where etags do come in handy)
  • You don’t need it. It’s not worth maintaining an extra dependency + configuration

CSS in JS

Thus far, I’ve been mainly advocating for a separate sass stylesheet, and I believe one should continue using a separate stylesheet. However, I recognize the arguments about the shortcomings of css. So with that said, I have some thoughts if one finds css in js a necessity:

  • Don’t use styled-components, as one “will have to do a little bit of configuration” to use with Typescript.
  • Consider avoiding TypeStyle even though it is a css in js library that has a deep integration with Typescript, as developer mind share will be less due to the library targeting a specific language and the slow pace of development.
  • styled-jsx requires configuration and doesn’t play nice with typescript.

The one css in js library that checked most of the boxes is the framework agnostic version of emotion, mainly due to “no additional setup, babel plugin, or other config changes [required].” I can drop it into any project and expect it to work. The same can’t be said for the @emotion/core package, which needs the magic comment /** @jsx jsx */ at the head of the file in order to work (and I was unable to witness the css typechecking promised by the @emotion/core package).
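
As a taste of the framework agnostic package, css returns a plain class name string that works anywhere (the element id below is assumed for illustration):

import { css } from "emotion";

// css`` injects the styles and hands back a generated class name
const warning = css`
  color: darkorange;
  font-weight: bold;
`;

document.getElementById("status")!.className = warning;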

There is no pressure to jump on the css in js bandwagon. I had a pure preact project, migrated to css in js with emotion, and then migrated back to a standalone stylesheet as the styles in that project hadn’t grown unruly enough to need a css in js solution. Emotion is relatively lightweight, but for very lightweight sites its effect on the bundle size can be felt.

In Conclusion

It may seem like I’m bending over backwards to find pieces that fit the web development puzzle, but I’m left quite satisfied. Every tool chosen (wasm-pack, typescript, webpack, hugo, emotion (optional)) is industry standard, and no one can say any of them is an obscure or bad choice. I like to lay out the investigative process, which means weeding out the tools that don’t work.

Some parting words of advice:

  • Keep the project simple
  • You don’t need that configuration
  • You don’t need that dependency
  • Use purpose built tools for the job

Yes, there will be those who clearly shouldn’t follow this recipe, but I needed to reduce my overhead as I switched between projects, and reducing magic through fewer dependencies and less configuration has proven valuable.


Parsing Performance Improvement with Tapes and Spatial Locality

There’s a format that I need to parse in Rust. It’s analogous to JSON but with a few twists:

{
    "core": "core1",
    "nums": [1, 2, 3, 4, 5],
    "core": "core2"
}
  • The core field appears multiple times and not sequentially
  • The documents can be largish (100MB)
  • The document should be able to be deserialized by serde into something like
struct MyDocument {
  core: Vec<String>,
  nums: Vec<u8>,
}

Unfortunately we have to choose one or the other:

  • Buffer the entire document into memory so that we can group fields logically for serde, as serde doesn’t support aggregating multiple occurrences of a field
  • Not support serde deserialization but parse the document iteratively

Since I consider serde deserialization a must have feature, I resigned myself to buffering the entire file in memory. To the end user, nothing is as ergonomic and time saving as serde deserialization. My initial designs had a nice iterative approach, combining the best things about quick-xml, rust-csv, and pulldown-cmark. I was even going to write a post about designing a nice pull based parser in rust, showing it can be a great foundational start for higher level parsers, but I stopped when the performance fell short.

Some serde deserializers like quick-xml / serde-xml-rs hack around the serde limitation by only supporting multiple fields that appear consecutively. This allows them to still read the file iteratively, though it remains an open issue to support non-consecutive fields.

Representing a JSON Document

Since we have the whole document in memory, we need a format that we can write a deserializer for. Here’s what a simplified JSON document can look like:

enum JsonValue<'a> {
  Text(&'a [u8]),
  Number(u64),
  Array(Vec<JsonValue<'a>>),
  Object(Vec<(&'a [u8], JsonValue<'a>)>),
}

This structure is hierarchical. It’s intuitive. You can see how it easily maps to JSON.

The issue is the structure is a horrible performance trap. When deserializing documents with many objects and arrays, the memory access and reallocation of the Array and Object vectors is significant. Profiling with valgrind showed that the code spent 35% of the time inside jemalloc’s realloc. The CPU has to jump all over RAM when constructing a JsonValue as these vectors could live anywhere on the heap. From Latency Numbers Every Programmer Should Know, accessing main memory takes 100ns, which is 200x over something in L1 cache.

So that’s the performance issue, but before we solve it we’ll need an algorithm for deserializing out of order fields.

Serde Deserializing out of order Fields

Below are some guidelines for those working with data formats that allow duplicate keys in an object where the keys are not consecutive siblings:

  • Buffer the entire file in memory
  • For each object being deserialized:
    • Start at the first key in the object
    • Scan the rest of the keys to see if there are any others that are the same
    • Add all the indices of which these keys occurred into a bag
    • Send the bag for deserialization
      • Most of the time we only need to look at the first element in the bag
      • But if a sequence is desired, we use all the indices in the bag in a sequence deserializer
    • Mark all the keys from the bag as seen so they aren’t processed again
    • Empty the bag

This method requires two additional vectors per object deserialization: one to keep track of the indices of values that belong to a given field key and another to keep track of which keys have already been deserialized. The alternative would be to create an intermediate “group by” structure, but that wasn’t needed, as the above method never showed up in profiling results despite how inefficient it may seem (it’s O(n^2), where n is the number of fields in an object, as deserializing the k’th field requires (n - k) comparisons).
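
To make the scan concrete, here is a minimal sketch of the bag-of-indices grouping (not the actual deserializer code; fields stands in for an object’s already-parsed key / value pairs):

fn group_fields<'a, V>(fields: &[(&'a [u8], V)]) -> Vec<(&'a [u8], Vec<usize>)> {
    let mut seen = vec![false; fields.len()];
    let mut groups = Vec::new();

    for start in 0..fields.len() {
        if seen[start] {
            continue; // this key was already swept into an earlier bag
        }
        let key = fields[start].0;
        let mut bag = Vec::new();
        // scan the remaining keys for any occurrence of the same key,
        // which is the O(n^2) behavior mentioned above
        for idx in start..fields.len() {
            if fields[idx].0 == key {
                bag.push(idx);
                seen[idx] = true;
            }
        }
        groups.push((key, bag));
    }
    groups
}

For the example document above, the groups come out as ("core", [0, 2]) and ("nums", [1]): scalar fields read only the first index in their bag, while a Vec<String> target consumes every index.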

Parsing to a Tape

A tape is a flattened document hierarchy that fits in a single vector of tokens that aid future traversals. This is not a novel technique, but I’ve seen so little literature on the subject that I figured I’d add my voice. There is a nice description of how the tape works in Daniel Lemire’s simdjson. To be concrete, this is what a tape looks like in Rust:

enum JsonValue {
  Text { data_idx: usize },
  Number(u64),
  Array { tape_end_idx: usize },
  Object { tape_end_idx: usize },
  End { tape_start_idx: usize }
}

struct Tapes<'a> {
  tokens: Vec<JsonValue>,
  data: Vec<&'a [u8]>,
}

So the earlier example of

{
    "core": "core1",
    "nums": [1, 2, 3, 4, 5],
    "core": "core2"
}

would turn into a tape like so (numbers are the tape / data indices):

Tokens:

0 : Text { data_idx: 0 }
1 : Text { data_idx: 1 }
2 : Text { data_idx: 2 }
3 : Array { tape_end_idx: 9}
4 : Number(1)
5 : Number(2)
6 : Number(3)
7 : Number(4)
8 : Number(5)
9 : End { tape_start_idx: 3}
10: Text { data_idx: 3 }
11: Text { data_idx: 4 }

Data:

0: core
1: core1
2: nums
3: core
4: core2

Notes:

  • One can skip a large child value by looking at the tape_end_idx and jumping to that section of the tape to continue processing the next key in an object (see the sketch after this list)
  • Delimiters like colons and semi-colons are not included in the tape as they provide no structure
  • The parser that writes to the tape must ensure that the data format is valid, as the consumer of the tape assumes certain invariants that would normally be captured by a hierarchical structure. For instance, the tape technically allows for a number key.
  • The tape is not for end user consumption. It is too hard to work with. Instead it’s an intermediate format for more ergonomic APIs.
  • The tape design allows for future improvement of storing only unique data segments (see how index 0 and 3 on the data tape are equivalent) – this could be useful for interning strings in documents with many repeated fields / values.
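
As a sketch of the first note, here is how a consumer could walk only the top-level keys of an object whose key / value tokens occupy start..end on the token tape (for the example above, start = 0 and end = 12):

// Collect the token indices of an object's keys, skipping nested
// containers wholesale by jumping past their End token.
fn object_key_indices(tokens: &[JsonValue], start: usize, end: usize) -> Vec<usize> {
    let mut keys = Vec::new();
    let mut idx = start;
    while idx < end {
        keys.push(idx); // the key token
        // the value token immediately follows its key
        idx = match &tokens[idx + 1] {
            JsonValue::Array { tape_end_idx } | JsonValue::Object { tape_end_idx } => *tape_end_idx + 1,
            _ => idx + 2, // scalar value: step past key + value
        };
    }
    keys
}

Running this over the example tape yields [0, 2, 10], i.e. the two “core” keys and “nums”, without ever touching the five numbers inside the array.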

Why create the data tape instead of embedding the slice directly into JsonValue::Text?

Yes, an alternative design could be:

enum JsonValue2<'a> {
  Text(&'a [u8]),
  Number(u64),
  Array { tape_end_idx: usize },
  Object { tape_end_idx: usize },
  End { tape_start_idx: usize }
}

and then there would be no need for a separate data tape. The main reason to keep it is space.

println!("{}", std::mem::size_of::<JsonValue>()); // 16
println!("{}", std::mem::size_of::<JsonValue2>()); // 24

JsonValue2 is 50% larger. This becomes apparent when deserializing documents that don’t contain many strings (ie mainly composed of numbers). The performance hit from the extra indirection has been negligible (probably because the data tape is often accessed sequentially and can spend a lot of time in cache), but I’m always on the lookout for better methods.

Performance Improvements

  • Deserializing in criterion benchmark: 4x
  • Parsing to intermediate format (tape vs hierarchical document): 5x (up to 800 MiB/s!)
  • Deserialization in browser (wasm): 6x

To me this is somewhat mind blowing: a 6x decrease in latency when deserializing in the browser. It took an idea that seemed untenable and made it possible. I was so blown away that I felt compelled to write about it so that I couldn’t forget it, and to leave a tip on the internet in case someone comes looking.


Pushing the power envelope with undervolting

This article has been superseded at its new home: https://sff.life/how-to-undervolt-gpu/

This article depicts how to use standard undervolting tools on a standard CPU and GPU to achieve a significant gain in power efficiency while preserving performance.

Backstory

I’m migrating away from a 4.5L sandwich style case that used a 300w flex atx PSU to power:

  • Ryzen 2700
  • MSI GTX 1070 Aero ITX
  • 32GB ram at 1.35v
  • 1TB Nvme

For the curious, entering these components into various power supply calculators yielded estimates near the PSU’s 300W rating.

So it already seems like we’re pushing our PSU, but I’m going to go with an even smaller rated PSU soon, so I’ll need to get creative through undervolting. Undervolting will:

  • Decrease energy consumption by lowering voltage (and thus watts)
  • Decrease temperature of components as there is less heat to shed
  • Can have no effect on performance, as we can undervolt at given frequencies

Could this be too good to be true?

How to Undervolt

Components we’ll be using:

  • P4460 Kill a Watt electricity usage monitor: to measure output from the wall. This is the only thing that costs money on the list – you may be able to rent it from a local library or utility company. A wattmeter is not critical, but it’ll give us a sense of total component draw that the PSU has to supply
  • Cinebench R20: Free CPU benchmark
  • 3dmark Timespy: Free GPU benchmark
  • HWiNFO64: to measure our sensor readings (temperature, wattage, etc)
  • MSI Afterburner: to tweak the voltage / frequency curve of the gpu
  • Motherboard bios: to tweak cpu voltage
  • Google sheets: A spreadsheet for tracking power usage, benchmarks, any modifications, etc

a Kill a Watt device to measure wall draw

Feel free to swap components out for alternatives, but I do want to stress the importance of benchmarks, as it’s desirable for any potential performance loss to become apparent when running benchmarks at each stage of the undervolt.

Attentive readers will note the absence of any stress tests (eg: Prime95 With AVX & Small FFTs + MSI Kombustor). It is my opinion that drawing obscene amounts of power to run these stress tests is just too unrealistic. Benchmarks should be stressful enough, else what are they benchmarking? But I understand the drive for those who want ultimate stability, so I’d suggest working down the voltage ladder first and then climbing back up if stress tests fail.

Initial Benchmarks and Measurements

Let’s break down what it really means when we’re measuring the number of watts flowing through the wattmeter:

  • Kill A Watt shows 200 watts
  • The PSU (SSP-300-SUG) is 300W 80 Plus Gold certified
  • PSU is able to convert at least 87%-90% of inbound power to the components with the rest dispersed as heat
  • The components are asking for between 174-180 watts (200 × [0.87, 0.90]); otherwise the PSU would be rated silver or platinum
  • If the components are asking for max power (300 watts), the Kill A Watt should read roughly 333-345 (300 / [0.90, 0.87]) before shutdown

First we’ll measure idle:

  • Close all programs including those in the background using any cpu cycles
  • Open HWiNFO
  • Wait for the system to settle down (ie, the wattmeter converges to a reading)
  • Record what’s being pulled from the wall, cpu / gpu watts, and cpu / gpu temps

Then benchmark the cpu:

  • Keep everything closed
  • Run Cinebench R20
  • For each run record score, max wattmeter reading, max watts / temp from the cpu
  • Reset HWiNFO recorded max values
  • Repeat three times

Then benchmark the gpu:

  • Keep everything closed
  • Run Timespy
  • For each run record score, max wattmeter reading, max watts / temp from the gpu
  • Reset HWiNFO recorded max values
  • Repeat three times

For timespy, the score you are interested in is the graphics score (highlighted below)

The reason why we’re interested in measuring max values is that we need any future power supply to handle any and all spikes.

Here’s an example of what I recorded for my initial measurements and how I interpreted them. Idle:

  • Wattmeter reading: 52.5W
  • CPU watts: 22
  • GPU watts: 15

Timespy:

  • Average Timespy score: 5976
  • Max wattmeter reading: 241W
  • Max GPU watts: 159W
  • Max GPU temperature: 82c
  • Our GPU exceeds its TDP rating by 10W, and total component draw is around 210-217W (241 × [0.87, 0.90]). Through some algebra we can approximate the other components as consuming 50-60W

If you’re recording CPU watts during the GPU benchmark you may be able to say a bit more about motherboard + ram + etc usage, but be careful – max wattmeter, max gpu watts, and max cpu watts are unlikely to happen at the same time, so unless one has the ability to correlate all three of them, better to stick with the more likely event: max wattmeter occurs at the same time as max gpu watts during a gpu benchmark.

Cinebench:

  • Average Cinebench score: 3294
  • Max wattmeter reading: 131W
  • Max CPU watts: 77W
  • Max CPU temperature: 81c
  • Our CPU exceeds its TDP rating by 12W, and total component draw is around 114-118W (131 × [0.87, 0.90]). Other components are consuming around 20W.

GPU Undervolt

Determine Target Frequency

First, determine what target frequency you’d like your GPU to boost to, ideally a number between the GPU’s boost clock and max clock. The GPU’s boost clock will be listed in the manufacturer’s specification and also in GPU-Z. The GPU’s max clock is determined at runtime through GPU boost and will be reported as the “GPU Clock” in HWiNFO64 (make sure it’s running while the benchmarks are in progress). Below are screenshots from HWiNFO64 and GPU-Z showing the differences between these two numbers.

  • Boost clock: 1721mhz
  • Max clock: 1886mhz

GPU boost (different from boost clock) increases the GPU’s core clock while thermal and TDP headroom remain. This is somewhat counterproductive for us as any voltage increases to reach higher frequencies will cause an increase in power usage and blow our budget.

I chose my target frequency to be 1860mhz (139mhz over boost and 26mhz under max), as during benchmarking the gpu clock was pinned at 1860mhz most of the time. The exact number doesn’t matter, as one can adjust their target frequency depending on their undervolting results. I change my target frequency later on. We’ll also choose our starting voltage to be 950mv, which is a pretty middling undervolt, and we’ll work our way down.

MSI Afterburner

The tool for undervolting! Powerful, but can be unintuitive at first.

Ctrl + F to bring up the voltage frequency curve. Find our target voltage of 950mv and take note of the frequency (1809mhz in the screenshot)

Our target frequency (1860mhz) is greater than 1809mhz by 51mhz, so we increase the core clock by 51mhz.

This will shift the voltage frequency graph up by 51mhz ensuring a nice smooth ride up the voltage / frequency curve until we hit 1860mhz at 950mv. Then to guarantee we don’t exceed 950mv:

  • Select the point at 950mv
  • Hit “l” to lock the voltage
  • For all voltage points greater than 950mv, drag them to or below 1860mhz
  • Hit “✔” to apply
  • Afterburner will adjust the points to be the same frequency as our locked voltage
  • You may have to comb over >950mv points to ensure that afterburner didn’t re-adjust any voltage points to be greater than 1860mhz. It happens
  • Hit “✔” to apply

End result should look like:

After our hard work, we’ll want to save this as a profile so that we can refer back to it after we inevitably undervolt too far. Also after determining what is the best undervolt, we’ll want to have that profile loaded on boot, so ensure the windows icon is selected.

Rinse and Repeat

Do the benchmark routine:

  • Clear / reset sensor data for max power usage and temperature
  • Start Timespy
  • Keep eyes glued on the kill a watt and record max draw
  • After Timespy completes, record the score, gpu max power usage, max temperature, and max wall draw.
  • Repeat 3 times

After three successful benchmarks:

  • Open Afterburner
  • Reset the voltage / frequency graph by hitting the “↺” icon
  • Decrement the target voltage by 25mv and keep the same target frequency
  • Calculate new core clock offset from new target voltage
  • Proceed to adjust all greater voltages to our frequency
  • Re-benchmark

Results

  • Undervolted to 875mv at 1860mhz
  • 850mv was not stable
  • No effect on Timespy score (<1% difference)
  • GPU max wattage decreased from 159W to 124W (20-23% reduction)
  • GPU max temp decreased from 82 to 75 (9-10% reduction)
  • Attempting an undervolt of 875mv at 1886mhz (the original max clock) was not stable

These are good results that demonstrate that there is no performance loss from undervolting, yet one can mitigate heat and power usage. I decided to take undervolting one step further and set my target frequency to the GPU’s boost frequency (1721mhz) and record the results:

  • Undervolted to 800mv at 1721mhz
  • Slight decrease to Timespy score (6-7%)
  • GPU max wattage decreased from 159W to 105W (33% reduction)
  • GPU max temp decreased from 82 to 68 (15-17% reduction)

To me, the loss in performance in this chase to undervolt is greatly outweighed by the gain in efficiency.

CPU Undervolting

Now onto CPU undervolting. This step will differ greatly depending on the motherboard, but for me it was as easy as adjusting the VCore Voltage Offset in 20mv increments.

After booting, run cinebench, record values – the works. Repeat a few times. On success, decrement the offset more.

Results

  • Undervolted to -100mv offset
  • Small decrease in Cinebench score (<3% difference)
  • CPU max wattage decreased from 77W to 65W (15% reduction)
  • CPU max temp decreased from 81 to 76 (6% reduction)

Conclusion

We’ve decreased power usage for the CPU and GPU considerably (a combined 66 watts), lowered temperatures, and opted into additional undervolts for a nearly negligible performance loss. Yet we still haven’t said how small a power supply we can slim down to, so I ran an informal stress test by running cinebench and timespy at the same time and recorded the max watts pulled from the wall: 205 watts. That means the components are really asking for roughly 178-185 watts (205 × [0.87, 0.90]). Incredible: less than 200 watts in a stress test.