Writing a high performance TLS terminating proxy in Rust
I’ve been unpacking boxes like a TLS proxy unboxes TCP packets of TLS data
I feel like I don’t write about the Rust language that often, but it turns out today is my 2nd anniversary of dabbling in Rust. Exactly two years ago, I wrote a post about processing arbitrary amount of data in Python where I benchmarked a Python csv solution against Rust (spoiler, Rust is faster). Since then, Rust has only been mentioned in passing on this blog (except for my post on creating a Rocket League Replay Parser in Rust, where I dive into not only how to create a parser with
nom, but what the replay format looks like). Well, today will be all about creating a TLS termination proxy. Keep in mind, while I have a decent intuition, I still consider myself new to Rust (I still fight with the borrow checker and have to keep reminding myself what
move lambdas do)
TLS Termination Proxy
Some languages have notoriously bad TLS implementations, which can be offputting to those that are looking to secure an app or site. One can put their app behind one of the big juggernauts (Apache, Nginx, HAProxy), which can do TLS termination along with hundreds of other features. I recommend using them wholeheartedly and I personally use them, but when only TLS termination is desired, these can be excessive. Hence the perfect opportunity to write a lightweight and speedy TLS implementation.
When implementing any kind of proxy, one has to determine how much content of the request is needed to proxy correctly. A layer 7 proxy would, for instance, read the HTTP url and headers (and maybe even the body) before proxying to the app. These proxies can even modify the requests, like adding the X-Forwarded-For, so the client IP address isn’t lost at the proxy. As one can imagine, this is more computationally intense than blindly handing requests from A directly to B. Since this project is more of an experiement than anything, I went with the blind approach! The advantage of a blind proxy (sometimes known as a dumb proxy) is that they are protocol agnostic and can work with any type of application data.
Note that even dumb proxies can forward client addresses to downstream services using the PROXY protocol initially championed by HAProxy. This is something that I could look to implement in the future, as it is simply a few additional bytes sent initially to the server.
Beside performance, a TLS proxy can bring other benefits. Since the use case is so focussed, it would be relatively easy to bring features such as structured logging or metrics. One wouldn’t have to reimplement the solution for “what’s the throughput for each client” for each new application. Just stick the the TLS proxy in front of any app that needs those answers.
Here’s the design, one could say it’s pretty simple:
- The TLS proxy listens on a given port
- Clients communicate via TLS with the proxy
- Proxy decrypts and forwards the request to the application via Unix sockets
- Application responds
- Proxy re-applies encryption and sends it back to the client.
Since Rust ecosystem has only started to dip its toes in asynchronous IO with tokio I figured I should see what all the fuss was about.
In the end, while I was satisified with the result, the labor involved in getting code correct took a significant amount of time. I don’t think I have ever written code slowly. This makes sense though, I don’t ever write networking code this low, so it should take me longer as I figure what is needed to satifiy the compiler. The meat of my implementation was taken from the official tokio repo. TLS was achieved through a combination of tokio-tls and native-tls. Since I only care about the Linux use case, I needed to extend the functionality of native-tls to allow users of the library additional customizations for openssl.
Proxy App Code
I needed a server that would satisfy the following benchmark requirements:
- Have a mode to serve only HTTP requests
- Have a mode to server only HTTPS requests
- Have a mode where the server listens on a Unix socket for HTTP requests
- Be able to saturate multiple CPUs
I went with nodejs. Go probably would have been a good choice too, but I’m still not comfortable with Go build tools. An interesting note is that nodejs also uses openssl for crypto tasks, so whatever speed differences will most likely be due to the overhead of interacting with openssl.
The server will simply echo back the request data, which will give us a good sense of what overhead TLS constitutes.
The script is invoked with
One thing I noticed is that when I simply had the server
res.send("Hello World") it would be about at least ten times faster (1 - 5ms) than the equivalent echo of the same request body (50ms). I’m not sure if the act piping http streams causes a lot more work (eg. the request body doesn’t even need to be parsed, etc). A brief internet search turned up nothing, so I’m not going to worry about it. EDIT: Turns out I ended up writing a post about this!
I tried something that I have never done before when it comes to benchmarking. I ran the benchmarking software (wrk) on the same box as server. Before anyone gets up in arms, let me introduce you to CPU affinity. In linux, one can set which CPUs a process can run on. By setting the proxy and the server to the same CPUs, it would be representative of them running on the same box – competing for the same resources. Wrk is set to a mutually exclusive set of CPUs.
taskset will control what CPUs are used:
Caveat, the kernel can still schedule other processes on the mentioned CPUs. To force the kernel to ignore certain CPUs(so that only your process is using that CPU), look into booting your kernel with the
Our script will have three different configurations:
- http: A nodejs server listening for HTTP requests on 3001
- https: A nodejs server listening for HTTPs requests on 3001
- tls-off: A nodejs server listening for HTTP requests on
/tmp/node.sockwith our proxy listening on 8080.
Each configuration will be ran five times for each body size.
Here are the median response times graphed for our three methods across multiple body sizes.
At low body sizes, the TLS proxy beats the HTTPs configured nodejs server handily, and even manages to beat the HTTP version (does this mean that Rust has a better TCP stack than node!?). As request size increase, our proxy does worse and worse. Why is that? Well, I’ve been keeping a secret. The proxy (so far) is only single threaded! As request size increases the amount of time needed to encrypt and decrypt requests grows and this computationally expensive task is bottlenecked one core. I reached out to the tokio community and they’ve given men a few ideas on how to incorporate multiple cores.
I’m normally a proponent of open sourcing a lot of my material, but this is still very experimental and I wouldn’t want anyone to get any other impression. This is just a small excercise in Rust and Tokio! There are a bunch of features I want to play with (I mentioned metrics and logging in the beginning), and if it starts to look good, I’ll let the world see it.