Welcome to NBSoftSolutions, home of the software development company and writings of its main developer: Nick Babcock. If you would like to contact NBSoftSolutions, please see the Contact section of the about page.

Writing a high performance TLS terminating proxy in Rust

Moving: I’ve been unpacking boxes like a TLS proxy unboxes TCP packets of TLS data

I feel like I don’t write about the Rust language that often, but it turns out today is my 2nd anniversary of dabbling in Rust. Exactly two years ago, I wrote a post about processing an arbitrary amount of data in Python where I benchmarked a Python csv solution against Rust (spoiler: Rust is faster). Since then, Rust has only been mentioned in passing on this blog (except for my post on creating a Rocket League Replay Parser in Rust, where I dive into not only how to create a parser with nom, but what the replay format looks like). Well, today will be all about creating a TLS termination proxy. Keep in mind, while I have a decent intuition, I still consider myself new to Rust (I still fight with the borrow checker and have to keep reminding myself what move lambdas do).

TLS Termination Proxy

Some languages have notoriously bad TLS implementations, which can be off-putting to those looking to secure an app or site. One can put their app behind one of the big juggernauts (Apache, Nginx, HAProxy), which can do TLS termination along with hundreds of other features. I recommend them wholeheartedly and personally use them, but when only TLS termination is desired, they can be excessive. Hence the perfect opportunity to write a lightweight and speedy TLS termination proxy.

When implementing any kind of proxy, one has to determine how much of the request's content is needed to proxy correctly. A layer 7 proxy would, for instance, read the HTTP URL and headers (and maybe even the body) before proxying to the app. These proxies can even modify requests, like adding the X-Forwarded-For header, so the client IP address isn't lost at the proxy. As one can imagine, this is more computationally intense than blindly handing requests from A directly to B. Since this project is more of an experiment than anything, I went with the blind approach! The advantage of a blind proxy (sometimes known as a dumb proxy) is that it is protocol agnostic and can work with any type of application data.

Note that even dumb proxies can forward client addresses to downstream services using the PROXY protocol initially championed by HAProxy. This is something that I could look to implement in the future, as it is simply a few additional bytes sent initially to the server.
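Those "few additional bytes" are human readable in PROXY protocol version 1: a single text line sent before any application data. A quick sketch of building one (the addresses and ports below are made up for illustration):

```javascript
// Build a PROXY protocol v1 preamble. Version 1 is a single
// human-readable line terminated by CRLF.
function proxyV1Header(srcIp, dstIp, srcPort, dstPort) {
  return `PROXY TCP4 ${srcIp} ${dstIp} ${srcPort} ${dstPort}\r\n`;
}

// The proxy would write this line to the backend before relaying any
// bytes, so the backend learns the real client address.
const header = proxyV1Header('203.0.113.7', '10.0.0.1', 56324, 443);
console.log(JSON.stringify(header));
```

The backend has to know to expect (and strip) this line, which is why PROXY protocol support is opt-in on both ends.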

Besides performance, a TLS proxy can bring other benefits. Since the use case is so focused, it would be relatively easy to add features such as structured logging or metrics. One wouldn't have to reimplement the solution to "what's the throughput for each client" for each new application. Just stick the TLS proxy in front of any app that needs those answers.

The Design

Here’s the design; one could say it’s pretty simple:

tls termination design

  • The TLS proxy listens on a given port
  • Clients communicate via TLS with the proxy
  • Proxy decrypts and forwards the request to the application via Unix sockets
  • Application responds
  • Proxy re-applies encryption and sends it back to the client.

Rust Implementation

Since the Rust ecosystem has only started to dip its toes into asynchronous IO with tokio, I figured I should see what all the fuss was about.

In the end, while I was satisfied with the result, getting the code correct took a significant amount of time. I don't think I have ever written code so slowly. This makes sense, though; I don't ever write networking code this low-level, so it should take me longer as I figure out what is needed to satisfy the compiler. The meat of my implementation was taken from the official tokio repo. TLS was achieved through a combination of tokio-tls and native-tls. Since I only care about the Linux use case, I needed to extend the functionality of native-tls to allow users of the library additional customizations for openssl.

Proxy App Code

I needed a server that would satisfy the following benchmark requirements:

  • Have a mode to serve only HTTP requests
  • Have a mode to serve only HTTPS requests
  • Have a mode where the server listens on a Unix socket for HTTP requests
  • Be able to saturate multiple CPUs

I went with nodejs. Go probably would have been a good choice too, but I'm still not comfortable with Go build tools. An interesting note is that nodejs also uses openssl for crypto tasks, so whatever speed differences exist will most likely be due to the overhead of interacting with openssl.

The server will simply echo back the request data, which will give us a good sense of what overhead TLS constitutes.

Below is the code for the server. There are zero dependencies, which I reckon is quite an achievement in javascript land!

const fs = require('fs');
const http = require('http');
const https = require('https');
const cluster = require('cluster');
const sock = '/tmp/node.sock';

if (cluster.isMaster) {
  // Since unix sockets are communicated over files, we must make sure the
  // file is deleted if it exists so we can open the socket
  if (fs.existsSync(sock)) {
    fs.unlinkSync(sock);
  }

  // Create as many processes as provided on the commandline
  const cpus = parseInt(process.argv[3]);
  for (let i = 0; i < cpus; i++) {
    cluster.fork();
  }
} else {
  // We're going to respond to every request by echoing back the body
  const handler = (req, res) => {
    req.pipe(res);
  };

  // Callback to let us know when the server is up and listening. To ease
  // debugging potential race conditions in scripts
  const listenCb = () => console.log('Example app is listening');

  const arg = process.argv[2];
  switch (arg) {
    case 'sock':
      http.createServer(handler).listen(sock, listenCb);
      break;
    case 'http':
      http.createServer(handler).listen(3001, listenCb);
      break;
    case 'https':
      const sslOptions = {
        key: fs.readFileSync('key.pem'),
        cert: fs.readFileSync('cert.pem')
      };

      https.createServer(sslOptions, handler).listen(3001, listenCb);
      break;
    default:
      throw Error(`I do not recognize ${arg}`);
  }
}

The script is invoked with

node index.js <http|https|sock> <#cpus>

One thing I noticed is that when I simply had the server respond with "Hello World", it would be at least ten times faster (1 - 5ms) than the equivalent echo of the same request body (50ms). I'm not sure if the act of piping http streams causes a lot more work (eg. in the hello-world case the request body doesn't even need to be parsed). A brief internet search turned up nothing, so I'm not going to worry about it.

CPU Affinity

I tried something that I have never done before when it comes to benchmarking. I ran the benchmarking software (wrk) on the same box as the server. Before anyone gets up in arms, let me introduce you to CPU affinity. In Linux, one can set which CPUs a process can run on. By setting the proxy and the server to the same CPUs, it would be representative of them running on the same box, competing for the same resources. Wrk is set to a mutually exclusive set of CPUs.

Invoking taskset will control what CPUs are used:

taskset --cpu-list 0-2 node index.js http 3

Caveat: the kernel can still schedule other processes on the mentioned CPUs. To force the kernel to ignore certain CPUs (so that only your process uses them), look into booting your kernel with the isolcpus option.
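As a sketch of what that looks like on a GRUB-based distro (the CPU number here is arbitrary; run update-grub and reboot for it to take effect):

```shell
# /etc/default/grub -- keep CPU 3 away from the general scheduler so only
# explicitly pinned processes (e.g. via taskset) run there
GRUB_CMDLINE_LINUX_DEFAULT="quiet isolcpus=3"
```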

Benchmarking Script

Our script will have three different configurations:

  • http: A nodejs server listening for HTTP requests on 3001
  • https: A nodejs server listening for HTTPS requests on 3001
  • tls-off: A nodejs server listening for HTTP requests on /tmp/node.sock with our proxy listening on 8080.

Each configuration will be run five times for each body size.

#!/bin/bash -e

# This script will use all the cores on the box
CPUS=$(nproc --all)

# Directories containing the node server and the TLS proxy (adjust to
# wherever the projects live on your box)
NODE_DIR=node-server
RUSTTLS_DIR=rustls-off

# This is the command for the TLS proxy. The code is not currently open
# sourced but is included in the script for completeness. The command is
# run on the same CPUs as the application
RUSTTLS_CMD="taskset --cpu-list 0-$((CPUS - 2)) target/release/rustls-off config.toml"

# Node server created from the javascript posted earlier
NODE_CMD="taskset --cpu-list 0-$((CPUS - 2)) node index.js"

# Benchmarking command. It uses only a single thread, as a single thread was
# able to saturate three server CPUs. Strangely enough, anything lower than
# 3000 connections wouldn't saturate anything (node is bottlenecked on something,
# but I'm not sure what)
WRK_CMD="taskset --cpu-list $((CPUS - 1)) wrk -t1 -c3000 -d30s"

# We'll be creating a Lua file to customize benchmarking. The Lua file will
# be deleted on completion
LUA_FILE=$(mktemp --suffix=.lua)
trap "rm -f $LUA_FILE" EXIT

# We're going to test request bodies that are of varying sizes
for BODY_SIZE in 10 100 1000 10000; do

    cat > $LUA_FILE <<EOF
wrk.method = "POST"
wrk.body   = string.rep("a", tonumber(os.getenv("BODY_SIZE")))

-- Append a CSV line of results when the benchmark completes
done = function(summary, latency, requests)
    io.write(string.format("%s,%s,%d,%.2f,%.2f,%.2f,%.2f,%.2f\n",
        os.getenv("CONFIG"), os.getenv("BODY_SIZE"), summary.requests,
        latency.mean, latency.stdev, latency:percentile(50),
        latency:percentile(90), latency:percentile(99)))
end
EOF

    # Each body size benchmark will be repeated five times to ensure a good
    # representation of the data
    for i in {1..5}; do
        echo "Body size: $BODY_SIZE, iteration: $i"
        pushd $NODE_DIR
        $NODE_CMD http $((CPUS - 1)) &
        NODE_PID=$!
        popd

        # Must sleep for a second to wait for the node server to start
        sleep 1
        CONFIG=http BODY_SIZE=$BODY_SIZE $WRK_CMD -s $LUA_FILE http://localhost:3001 | \
            tee >(tail -n 1 >> tls-bench.csv);
        kill $NODE_PID

        pushd $NODE_DIR
        $NODE_CMD https $((CPUS - 1)) &
        NODE_PID=$!
        popd
        sleep 1
        CONFIG=https BODY_SIZE=$BODY_SIZE $WRK_CMD -s $LUA_FILE https://localhost:3001 | \
            tee >(tail -n 1 >> tls-bench.csv);
        kill $NODE_PID

        pushd $NODE_DIR
        $NODE_CMD sock $((CPUS - 1)) &
        NODE_PID=$!
        popd

        pushd $RUSTTLS_DIR
        $RUSTTLS_CMD &
        RUSTTLS_PID=$!
        popd

        sleep 1
        CONFIG="tls-off" BODY_SIZE=$BODY_SIZE $WRK_CMD -s $LUA_FILE https://localhost:8080 | \
            tee >(tail -n 1 >> tls-bench.csv);
        kill $NODE_PID
        kill $RUSTTLS_PID
    done
done

Here are the median response times graphed for our three methods across multiple body sizes.

results for tls termination


The graph was generated with the following R snippet (wrk reports latencies in microseconds, hence the division by 1000 to get milliseconds):

library(readr)
library(dplyr)
library(ggplot2)

df <- read_csv("tls-bench.csv",
               col_names = c("config", "size", "requests", "mean", "stdev", "p50", "p90", "p99"))

df <- mutate(df,
             mean = mean / 1000,
             stdev = stdev / 1000,
             p50 = p50 / 1000,
             p90 = p90 / 1000,
             p99 = p99 / 1000)

ggplot(df, aes(factor(size), p50, fill=config)) +
  geom_jitter(size=4, width=0.15, shape=21) +
  xlab("Body size (bytes)") + ylab("Median response time (ms)") +
  ggtitle("Median response times to complete requests")

At low body sizes, the TLS proxy beats the HTTPS-configured nodejs server handily, and even manages to beat the HTTP version (does this mean that Rust has a better TCP stack than node!?). As request size increases, our proxy does worse and worse. Why is that? Well, I've been keeping a secret. The proxy (so far) is only single threaded! As request size increases, the amount of time needed to encrypt and decrypt requests grows, and this computationally expensive task is bottlenecked on one core. I reached out to the tokio community and they've given me a few ideas on how to incorporate multiple cores.

Open Source?

I’m normally a proponent of open sourcing a lot of my material, but this is still very experimental and I wouldn’t want anyone to get any other impression. This is just a small exercise in Rust and Tokio! There are a bunch of features I want to play with (I mentioned metrics and logging in the beginning), and if it starts to look good, I’ll let the world see it.

Creating a parallel AMP site with Jekyll


Google AMP (Accelerated Mobile Pages) is a strict subset of HTML that essentially only allows for content and a subset of supported inline CSS. Nearly all of the case studies are news sites. Google will give preference to valid AMP sites for mobile searches, if only because Google can ensure the page is lightweight and fast. Though I attribute the project to Google, as Google is leading the evangelizing effort, Bing has joined in too, with Cloudflare hosting AMP pages.

While I’ve known about AMP since last year, I’ve largely ignored it (I prefer to think I was waiting for the project to mature). Last weekend, I was chatting with a friend who works at a large travel corporation. She mentioned how they were working on creating an AMP enabled site because they’ve noticed their site falling in search rankings. Their site is a highly dynamic react app – as one might imagine – but how could they use AMP when first party javascript is disabled? She told me the homepage will be server side rendered, so it’s the only page that will be AMP enabled. It’ll have plenty of links to their main site where users can actually do searching and bookings. It’s a little cheeky, but I hear SEO is cutthroat, and if a large part of your business comes through search engines, I think it’s reasonable to try new tech and see if it increases the rank.

On a side note, I visited the travel site to find they have 250 elements in their head element. As an unfair comparison, this page has 5 at the time of this writing.

But this discussion got me thinking about how easy it would be to “AMPlify” this site, which is built with Jekyll. The history of this site has basically been one of a testing ground for new technologies.

AMPlifying with amp-jekyll

Jekyll is one of the most popular static site generators. As a result, there is a rich list of plugins and amp-jekyll is one of them.

Amp-jekyll is still maturing, and the currently released version (1.0.1) does not support jekyll permalinks, so I had to reference the repo directly in the Gemfile.

gem 'amp-jekyll', :git => 'https://github.com/juusaw/amp-jekyll'

One of the first steps is to copy their amp.html for your own use. Since I have a stylesheet that is preprocessed and minified, I needed to include it in my amp.html. Typically, built pages use a link to reference the stylesheet, but since AMP requires only inline css, I had to go with an include directive.

<style amp-custom>
  {% include styles.css %}
</style>

This did not work out of the box for me, so I created the following script, executed on every build, to make sure the included css fits AMP requirements.

# Copy the built css into the includes directory so that it can be used in the
# amp pages
cp _assets/build/styles.css _includes/.

# Amp doesn't like the !important css attribute that pure css uses, so we
# remove it!
sed -i -e 's/!important//g' _includes/styles.css

# pure contains IE7 specific hacks to get a style to reset using a syntax error.
# This is not needed for AMP, so we'll remove it from the css. See stackoverflow
# for more information https://stackoverflow.com/q/1690642/433785
sed -i -e 's/*display:inline;//g' _includes/styles.css

For regular posts, I updated their layout to include an amphtml link so that webcrawlers know that I support AMP and where the AMP pages are located.

{% if page.path contains '_posts' %}
  <link rel="amphtml" href="{{ page.id | prepend: '/amp' | prepend: site.baseurl | prepend: site.url }}">
{% endif %}

Now when I build the site, all blog posts have a sibling, AMP-enabled page that lives at the same url except it’s prefixed with /amp. So a post at /blog/some-post also lives at /amp/blog/some-post.

Unanticipated Benefits

I had three posts where I forgot the /blog prefix in the url. Amp-jekyll made these obvious, so I fixed them. Yes, I realize changing urls after the fact breaks the internet, but these weren’t popular articles – not popular enough for me to set up redirects.

I also had broken image links in older posts because I was using raw figure html elements instead of the img directive.

Criticisms for AMP

The top articles on Hacker News for AMP are all criticisms, and I’d be remiss if this section were omitted. The articles are relatively succinct, so I recommend reading them – they’re decent, but I believe they’re overreacting:

  • Google doesn’t actually own AMP. AMP has an open governance model. Only when there is no one working on the project does Google give itself the power to appoint a leader; in which case, since no one is working on the project, does it even matter?
  • Content is not just loaded from Google’s servers. Cloudflare hosts AMP content as well. Speaking of Cloudflare, I’m not sure why having external sites cache content on edge servers rubs people the wrong way. It alleviates concerns of DDOS attacks and site maintenance.
  • You can create your own AMP cache
  • Links work as normal. If a user clicks on an amp link, they’re taken to a feather-light page which has all the content they want, else if a user clicks on a non-amp link they get all the features of the full app.
  • The subset of CSS allowed is annoying if only for the fact that one has to do small tweaks to get the css compliant. Without this subset, AMP would not be able to assume the layout was fast and sensible. It’s a tough sell, as no one wants to create custom css for AMP, but I understand the reasoning. Thankfully, this blog does not use a lot of CSS – only a few sed statements were needed to become AMP compliant.
  • I’ll agree that removing AMP content from google’s AMP cache isn’t the most user friendly. Sending an “update-ping” to get immediate removal is tedious at best, but “cached content that no longer exists will eventually get removed from the cache” so one can just wait for the expiration. Since cache busting is a tough problem, this is a fine solution, though an optimal solution would be like Cloudflare’s where they have a nice UI for cache busting.
  • One isn’t giving up anything when having a parallel AMP site. AMP content is still served from the origin, and if there happens to be an AMP cache server present, the content is served from there. It’s not like whoever hosts the AMP server owns the content. It’s simply an optimization for those who seek the content in a nice layout.
  • Some lament that AMP would be good for fake news sites. I’m not sure where this idea sprouted from. I believe that satirical newspapers like The Onion should be able to participate in AMP and reap any potential benefits.

One thing that I find very interesting is the topic of Subresource Integrity, where a client can specify the hash of the incoming file. If the hashes do not match, the file is rejected. The problem is that AMP defines itself as an evergreen library, so a single URL will always point to the latest version of the library. This means that someone could compromise the javascript payload and clients would have no way of knowing. See the following Github issue for more information. I do have to chuckle though, as there has been a recent trend towards digesting all assets, and now Google is pushing a platform that is fundamentally opposed to identifying payloads based on hash (which one would normally think could increase security and speed).

With all the back and forth, why did I use AMP? It was a combination of two reasons. First, this site is a testbed and I wanted to answer the question of what effort is required. Since I only spent a couple hours adding this feature, the effort was low. Second, will tons of traffic pour in from the increase in search rank? I won’t know unless I try! Before we part, I have to embarrassingly admit that, given my stance on the importance of metrics, this site employs zero analytics, and I will have a hard time determining improvements unless there is an order of magnitude increase. AMP does have a whole section on analytics, so this may be an area worth exploring.

Introduction to Journald and Structured Logging

Grilled cheese: the mildly interesting depictions one finds in their journal

To start, Journald’s minimalist webpage describes it as:

[a] service that collects and stores logging data. It creates and maintains structured, indexed journals.

With one of the sources of logging data being:

Structured log messages

Before diving into journald, what’s structured logging?

Structured Logging

Imagine the following hypothetical log outputs

"INFO: On May 22nd 2017 5:30pm UTC, Nick Foobar accessed gooogle.com/#q=what is love in 52ms"

{
  "datetime": "2017-05-22T17:30:00Z",
  "level": "INFO",
  "user": "Nick Foobar",
  "url": "gooogle.com/#q=what is love",
  "responseTime": 52
}

The first example is unstructured, or simple textual, logging. The date format makes it a tad contrived, but the log line contains everything in a rather human readable format. From the line, I know when someone accessed what and how fast, in a sentence-like structure. If I were tasked with reading log lines all day, I would prefer the first output.

The second output is an example of structured logging using JSON. Notice that it conveys the same information, but instead of a sentence, the output is key-value pairs. Considering JSON support is ubiquitous, querying and retrieving values would be trivial in any programming language. In contrast, one would need to meticulously parse the first output to avoid ambiguity. For instance, not everyone has a given and last name, response time units need to be parsed, the url is arbitrary, timezones need converting, etc. There are too many pitfalls to storing logs in the first format – it would be too hard to analyze consistently.
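To make "trivial to query" concrete, here is how the structured line could be filtered in a few lines of JavaScript (the log line is the hypothetical one from above):

```javascript
// A structured log is just data: parse, then filter -- no regexes, no
// ambiguity about units or field boundaries.
const logs = [
  '{"datetime": "2017-05-22T17:30:00Z", "level": "INFO", "user": "Nick Foobar", "url": "gooogle.com/#q=what is love", "responseTime": 52}',
];

// Find requests slower than 50ms
const slow = logs
  .map((line) => JSON.parse(line))
  .filter((entry) => entry.responseTime >= 50);

console.log(slow.map((e) => e.user)); // prints [ 'Nick Foobar' ]
```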

One could massage their textual log format into a semi-structured output using colons as field delimiters.

"INFO: 2017-05-22T17:30:00Z: Nick Foobar: gooogle.com/#q=what is love: 52"

This may be a happy medium for those unable or unwilling to adopt structured logging, but there are still pitfalls. To give an example: if I wanted to find all the log statements with a WARN level, I would have to remember to match against only the beginning of the log line, or I'd run the risk of matching WARN in the user name or in the url. What if I wanted to find all of the searches by Ben Stiller? I'd need to be careful to exclude the lines where people are searching for "Who is Ben Stiller". These examples are not artificial either, as yours truly has fallen victim to several of these mistakes.
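The WARN pitfall is easy to demonstrate: a naive substring match over the semi-structured line fires on a user name, while anchoring to the level field does not (the sample lines here are invented):

```javascript
const lines = [
  'INFO: 2017-05-22T17:30:00Z: Ben WARNer: gooogle.com/#q=cats: 12',
  'WARN: 2017-05-22T17:31:00Z: Nick Foobar: gooogle.com/#q=dogs: 34',
];

// Naive: matches both lines, because "WARN" appears in a user name
const naive = lines.filter((l) => l.includes('WARN'));

// Anchored: only match when the first colon-delimited field is the level
const anchored = lines.filter((l) => l.split(':')[0] === 'WARN');

console.log(naive.length, anchored.length); // prints 2 1
```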

Let’s say that one does manage to gain insight from the textual format using text manipulation. If the log format were to ever change (eg. transposing response time and url, logging more data, etc), the log parsing code would break. So if you’re planning on gaining insight from text logs, make sure you define a rigorous standard first!

Structured logging also brings the possibility of working with types. Instead of working with only strings, JSON also has a numeric type, so one doesn’t need a conversion when analyzing.

Structured logging doesn’t need to be JSON, but JSON is a common format in log management suites like graylog, logstash, fluentd, etc.

The only downsides that I’ve seen for structured logging (and specifically JSON structured logging) are the increase in log file size due to the added keys, and that the format won’t be a grammatically correct English sentence! These seem like minor downsides for the benefit of easier log analysis.


Now that we’ve established the case for structured logging, on to journald. Be warned, this is a much more controversial topic.

Journald is the logging component of systemd, which was a rethinking of Linux’s boot and process management. A lot of feathers were ruffled, and are still ruffled, because of the movement towards systemd (1, 2, 3, 4, 5). Wow, a multitude of complaints. There must be several redeeming qualities to systemd, because most distros are converging on it. I won’t be talking about systemd, but rather the logging component.

To put it simply, journald is a structured, binary log that is indexed and rotated. It was introduced in 2011.

Here’s how we would query the log for all messages written by sshd:

$  journalctl _COMM=sshd
-- Logs begin at Thu 2017-05-18 23:43:18 EDT, end at Mon 2017-05-22 16:05:29 EDT. --
May 19 16:57:31 vm-ubuntu sshd[19494]: syslogin_perform_logout: logout() returned an error
May 19 16:57:31 vm-ubuntu sshd[19494]: pam_unix(sshd:session): session closed for user nick
May 22 09:03:40 vm-ubuntu sshd[5311]: Accepted password for nick from port 56618 ssh2
May 22 09:03:40 vm-ubuntu sshd[5311]: pam_unix(sshd:session): session opened for user nick by (uid=0)

For all sshd messages since yesterday:

$  journalctl -S yesterday  _COMM=sshd
-- Logs begin at Thu 2017-05-18 23:43:18 EDT, end at Mon 2017-05-22 16:10:59 EDT. --
May 22 09:03:40 vm-ubuntu sshd[5311]: Accepted password for nick from port 56618 ssh2
May 22 09:03:40 vm-ubuntu sshd[5311]: pam_unix(sshd:session): session opened for user nick by (uid=0)

To view properties for autossh and sshd messages since yesterday (output truncated to first event)

$  journalctl -o verbose -S yesterday  _COMM=sshd + _COMM=autossh
-- Logs begin at Thu 2017-05-18 23:43:18 EDT, end at Mon 2017-05-22 16:12:45 EDT. --
Mon 2017-05-22 07:01:20.894720 EDT
    MESSAGE=timeout polling to accept read connection
    _CMDLINE=/usr/lib/autossh/autossh <IP>

To find all autosshd events that arrived via the syslog transport. If a + is included in the command, it means “OR”; otherwise entries need to match both expressions:

$ journalctl _TRANSPORT=syslog _COMM=autosshd
-- No entries --

Find all possible values written for a given field:

$  journalctl --field=_TRANSPORT

What I think about journald

I want journald to be the next big thing. Having one place on your server where all logs are sent* sounds like a pipe dream. No longer do I have to look up where logfiles are stored.

Journald has nice size-based log rotation, meaning I no longer have to be woken up at night because a rogue log grew unbounded and degraded other services.

Gone are the days of arguing about what format logs should be in – those arguments would be replaced with discussions about what metadata to expose.

With journald I can cut down on the number of external services that each service talks to. Instead of having every service write metrics to carbon, metrics would be written to journald. This way applications don’t need to jump through the hoops of proper connection management: re-connect on every metric sent, a single persistent connection, or some sort of hybrid? By logging to journald, carbon or the log forwarder can be down, but metrics will still be written to the local filesystem. There is very little that would cause absolute data loss.

People can use the tools they are most familiar with: some can use journalctl with the indexes on the local box, and others will want to see the bigger picture once the same logs are aggregated into another system.

* Technically the data may not be sent to a single file location as journald can be configured such that each user has their own journal – but journalctl abstracts that away such that users won’t know or care.

Complaints Against journald

  • Journald can’t be used outside of systemd, which limits it to only newer distros that have adopted systemd. I have CentOS 6 servers, so it’s a hard no to use journald on those systems.
  • Journald writes to a binary file that one can’t use standard unix tools to dissect, resulting in difficulty if the log becomes corrupt. If the log is not corrupt, one can pipe the output of journalctl to the standard tools.
  • There’s not a great story for centralizing journald files. The introduction mentioned copying the files to another server. People have found a way using journalctl -o json and sending the output to their favorite log aggregation service.
  • A lot of third party plugins for journald ingestion for log management suites don’t appear well maintained.
  • It invented another logging service instead of working with pre-existing tools. Considering Syslog can work with structured data – that’s one less reason to switch to journald.
  • The data format is not standardized or documented well
  • Will not support encryption other than file-system encryption. If a user has access to the file system and has permission to read the log file, all logs will be available.
  • No way to exclude sensitive information from the log (like passwords on the commandline) – though you’re probably doing something wrong if this is an issue.
  • The best way to communicate with journald programmatically seems to be either through the C API or journalctl.

With all these complaints, it may be a wonder why I lean towards advocating journald. By advocating structured data first, journald is setting the tone for the logging ecosystem. Yes, I know that journald is far from the first, but the simplicity of having a single, queryable, structured log baked into the machine is admirable.