Parsing Performance Improvement with Tapes and Spatial Locality

There’s a format that I need to parse in Rust. It’s analogous to JSON but with a few twists:

{
    "core": "core1",
    "nums": [1, 2, 3, 4, 5],
    "core": "core2"
}
  • The core field appears multiple times and not sequentially
  • The documents can be largish (100MB)
  • The document should be able to be deserialized by serde into something like
struct MyDocument {
  core: Vec<String>,
  nums: Vec<u8>,
}

Unfortunately we have to choose one or the other:

  • Buffer the entire document into memory so that we can group fields logically for serde, as serde doesn’t support aggregating multiple occurrences of a field
  • Not support serde deserialization but parse the document iteratively

Since I consider serde deserialization a must have feature, I resigned myself to buffering the entire file in memory. To the end user, nothing is as ergonomic and time saving as serde deserialization. My initial designs had a nice iterative approach, combining the best things about quick-xml, rust-csv, and pulldown-cmark. I was even going to write a post about designing a nice pull based parser in rust showing it can be a great foundational start for higher level parsers, but stopped short when the performance fell short.

Some serde deserializer like quick-xml / serde-xml-rs hack around the serde limitation by only supporting deserializing multiple fields that appear consecutively. This allows them to still iteratively read the file. Though it remains an open issue to support non-consecutive fields

Representing a JSON Document

Since we have the whole document in memory, we need a format that we can write a deserializer for. Here’s what a simplified JSON document can look like

enum JsonValue<'a> {
  Text(&'a [u8]),
  Number(u64),
  Array(Vec<JsonValue<'a>>),
  Object(Vec<(&'a [u8], JsonValue<'a>)>),
}

This structure is hierarchical. It’s intuitive. You can see how it easily maps to JSON.

The issue is the structure is a horrible performance trap. When deserializing documents with many objects and arrays, the memory access and reallocation of the Array and Object vectors is significant. Profiling with valgrind showed that the code spent 35% of the time inside jemalloc’s realloc. The CPU has to jump all over RAM when constructing a JsonValue as these vectors could live anywhere on the heap. From Latency Numbers Every Programmer Should Know, accessing main memory takes 100ns, which is 200x over something in L1 cache.

So that’s the performance issue, but before we solve that we’ll need a algorithm to solve deserializing out of order fields.

Serde Deserializing out of order Fields

Below is some guidelines for those working with data formats that allow duplicate keys in an object but the keys are not consecutive / siblings:

  • Buffer the entire file in memory
  • For each object deserializing
    • Start at the first key in the object
    • Scan the rest of the keys to see if there are any others that are the same
    • Add all the indices of which these keys occurred into a bag
    • Send the bag for deserialization
      • Most of the time we only need to look at the first element in the bag
      • But if a sequence is desired, we use all the indices in the bag in a sequence deserializer
    • Mark all the keys from the bag as seen so they aren’t processed again
    • Empty the bag

This method requires that we have two additional vectors needed per deserialization of an object. One to keep track of the indices of values that are part of a given field key and another to keep track of which keys have been deserialized already. The alternative solution would be to create an intermediate “group by” structure but this was not needed as the above method never showed up in profiling results, despite how inefficient it may seem (it’s O(n^2), where n is the number of fields in an object, as deserializing the k’th field requires (n - k) comparisons).

Parsing to a Tape

A tape is a flattened document hierarchy that fits in a single vector of tokens that aid future traversals. This is not a novel technique, but I’ve seen so little literature on the subject that I figured I’d add my voice. There is a nice description of how the tape works in Daniel Lemire’s simdjson. To be concrete, this is what a tape looks like in Rust:

enum JsonValue {
  Text { data_idx: usize },
  Number(u64),
  Array { tape_end_idx: usize },
  Object { tape_end_idx: usize },
  End { tape_start_idx: usize }
}

struct Tapes<'a> {
  tokens: Vec<JsonValue>,
  data: Vec<&'a [u8]>,
}

So the earlier example of

{
    "core": "core1",
    "nums": [1, 2, 3, 4, 5],
    "core": "core2"
}

Would turn into a tape like so: (numbers are the tape / data index)

Tokens:

0 : Text { data_idx: 0 }
1 : Text { data_idx: 1 }
2 : Text { data_idx: 2 }
3 : Array { tape_end_idx: 9}
4 : Number(1)
5 : Number(2)
6 : Number(3)
7 : Number(4)
8 : Number(5)
9 : End { tape_start_idx: 3}
10: Text { data_idx: 3 }
11: Text { data_idx: 4 }

Data:

0: core
1: core1
2: nums
3: core
4: core2

Notes:

  • One can skip large child value by looking at the tape_end_idx and skipping to that section of the tape to continue processing the next key in an object
  • Delimiters like colons and semi-colons are not included in the tape as they provide no structure
  • The parser that writes to the tape must ensure that the data format is valid as the consumer of the tape assumes certain invariants that would have normally been captured by a hierachial structure. For instance, the tape technically allows for a number key.
  • The tape is not for end user consumption. It is too hard to work with. Instead it’s an intermediate format for more ergonomic APIs.
  • The tape design allows for future improvement of storing only unique data segments (see how index 0 and 3 on the data tape are equivalent) – this could be useful for interning strings in documents with many repeated fields / values.

Why create the data tape instead of embedding the slice directly into JsonValue::Text?

Yes, an alternative design could be:

enum JsonValue2<'a> {
  Text(&'a [u8]),
  Number(u64),
  Array { tape_end_idx: usize },
  Object { tape_end_idx: usize },
  End { tape_start_idx: usize }
}

and then there would be no need for a separate data tape. The main reason why is space.

println!("{}", std::mem::size_of::<JsonValue>()); // 16
println!("{}", std::mem::size_of::<JsonValue2>()); // 24

JsonValue2 is 50% larger. This becomes apparent when deserializing documents that don’t contain many strings (ie mainly composed of numbers). The performance hit from the extra indirection has been negligible (probably because the data tape is often accessed sequentially and can spend a lot of time in cache), but I’m always on the lookout for better methods.

Performance Improvements

  • Deserializing in criterion benchmark: 4x
  • Parsing to intermediate format (tape vs hierarchical document): 5x (up to 800 MiB/s!)
  • Deserialization in browser (wasm): 6x

To me this is somewhat mind blowing – that I saw a 6x decrease in latency when deserializing in the browser. It took an idea that seemed untenable to possible. And I was so blown away that I felt compelled to write about it so that I couldn’t forget it and add tip on the internet in case someone comes looking.

Comments: