
Building a Home Server

RAID-0 of oysters

I wrote this article not because I’ve built a home server but because I’m on the verge of doing it and I’d like to justify (to myself) why building one is reasonable and why I chose the parts I did.

I want to build a home server because I realize that I have content (pictures, home movies, etc). This content used to be stored on an external hard drive, but when that hard drive died, I lost a good chunk of it. Since then, I’ve moved the rest to a Windows Storage Pool. But then I thought about accessing the content remotely, and I didn’t want my work / gaming PC to be on 24/7 (power efficiency) or exposed to the internet (security). OneDrive is fine – great for editing online documents – but space is limited and I want a better sharing story (e.g. loved ones back up their pictures here too). Probably the most important reason, though, is that setting up a home system sounds fun to me and it’s a good learning opportunity.

First up, the parts list:

Type | Item | Price
CPU | Intel - Pentium G4600 3.6GHz Dual-Core Processor | $86.99 @ Amazon
CPU Cooler | Noctua - NH-L9i 33.8 CFM CPU Cooler | $39.15 @ Newegg
Motherboard | ASRock - E3C236D2I Mini ITX LGA1151 Motherboard | $239.99 @ Newegg
Memory | Kingston - ValueRAM 16GB (1 x 16GB) DDR4-2133 Memory | $159.79 @ Amazon
Storage | 6x Seagate - Desktop HDD 4TB 3.5" 5900RPM |
Case | Fractal Design - Node 304 Mini ITX Tower Case | $74.99 @ Newegg
Power Supply | Silverstone - 300W 80+ Bronze Certified SFX Power Supply | $49.99 @ Amazon
Prices include shipping, taxes, rebates, and discounts
Total | | $650.90
Generated by PCPartPicker 2017-08-08 19:54 EDT-0400

The Case

The case really defines the rest of the build, so I’m starting here. A small form factor (SFF) case limits you to more expensive components, while a larger case takes up more room. I waffled between many cases – I was trying to get a small case that would fit on a shelf in the utility closet, but wouldn’t compromise on the number of 3.5” drives. The height restriction meant a lot of decent mini tower cases were excluded because even those were too tall. Here were the contenders:

SilverStone DS380B


An SFF case with 8 hot-swappable 3.5” drive bays (plus more) is quite an achievement, and it’s the only option when SFF is needed with an absolute maximum number of drives. The downsides are that at $150 it was on the pricey side, and many reviews stated that thermal management was a challenge, so aftermarket fans and case modding are a necessity. This guy wrote an article solely to convince people not to buy the DS380B. Anyway, one goal of this build is to keep cost and effort to a minimum, so this case was eliminated.

SilverStone GD06


An HTPC case that has a lot going for it. The horizontal design makes it alluring, as it can be placed on one of my cabinets. But with only four 3.5” bays, it would be limited as far as a storage server is concerned. Double-parity RAID would mean that half of the drives are redundant. The worst case would be running out of room and being forced to decide whether to buy bigger drives or get a dedicated NAS case.

Fractal Design Node 304


An SFF cube case that has six 3.5” bays, goes on sale for $60, has great reviews, and is touted for its silent fans!? Sold.

Lian Li PC-Q25


Special mention must be made of Lian Li’s case, which houses 7 drive bays but costs more, and some (not many) have reported thermal issues.

The CPU

I’ve decided on the Pentium G4600.

  • With Kaby Lake, Pentium processors are blessed with hyper-threading, so their 2 physical cores become 4 logical cores.
  • Kaby Lake also improved power efficiency, with a G4560 using only 24W under stress testing.
  • None of the “Core” chips support ECC memory (thus excluded).
  • Paying 15% more for a 100 MHz boost made me exclude the G4620.
  • I actually wanted the top of the line integrated graphics (Intel HD Graphics 630) because there won’t be a dedicated GPU in this box and I’d cry if I were GPU limited anywhere.
  • Cheap! I’m going to grab it when the price hits $80.
  • The server will sit idle most of its life, so there’s no need for a powerful CPU. In the future, if it turns out I need more horsepower, there should be a nice array of secondhand Kaby Lake Xeons out there by then.

The Motherboard

A Mini-ITX motherboard that supports ECC memory, socket 1151, and Kaby Lake basically makes the decision for us!

There were a couple of ASRock boards, and I went with the E3C236D2I, the one with six SATA ports (matching the case’s six drive bays) and the added bonus of IPMI.

Unfortunately, a $240 price tag is a bit hard to swallow. There is definitely a price to pay for keeping the size down while still supporting enterprise (ECC) RAM!

The RAM

Speaking of the RAM, I went with a single stick of 16GB ECC RAM. This may seem odd, but I’ll try to explain. I’m using ECC memory because I want to be safe rather than sorry, and I’m not scrounging around looking for pennies, so I can afford it. I’m only interested in a single stick because buying 32GB upfront seems like overkill – I’m not made of money. Since the motherboard only has two DIMM slots, I wanted a single stick significant enough to last in the meantime.

On a side note, RAM is expensive right now: this 16GB stick is retailing for $150, whereas it debuted at $75. Don’t worry, I have price triggers.

CPU Cooler

Even though the case supports tower CPU coolers and the Pentium G4600 comes with a stock cooler, I’ve opted for a slim aftermarket cooler: the Noctua NH-L9i. The Noctua promises to be much quieter than the stock cooler. And since it’s so slim, if I decide to get an even tinier case in the future, the cooler will still fit!

Since I won’t be overclocking the CPU, I’ll be able to use the low noise adaptor to make the cooler even quieter.

Power Supply

I went with the SilverStone 300W power supply.

  • I couldn’t find anything below 300W (I was shooting for something around 200W). The reason is that power supplies are made to operate between 20% and 100% of their rated wattage. If I had gone with a 450W power supply (the next one up in SilverStone’s lineup), I’d need an idle usage of at least 90W instead of 60W to stay in that guaranteed efficiency range. Basically, this is me being environmentally conscious.
  • An 80+ Bronze rating is distinguishing in this low a power range.
  • The SFX form factor will allow me to get an even smaller case in the future if needed.
  • It’s semi-fanless (quiet). People report that the fan only turns on under extreme duress.

Storage

I already have a couple of 4TB Seagate 3.5” drives, so getting more of them is a logical choice. Ideally, I wouldn’t have to buy all of them up front, but that is the cost of ZFS. Here’s to hoping I get a good deal on them!

One of the things I’m still pondering is what to do about a boot drive. I could drop down to a RAID of 5 drives and get a different drive for the OS. Brian Moses uses a flash drive. I’m actually thinking of using my one PCIe slot to host an M.2 PCIe adapter and grabbing a Samsung 960 EVO or something similar. PCPartPicker doesn’t list the motherboard as capable of using M.2, but we’ll see about that, as the motherboard manual specifically calls out instructions for M.2 NVMe drives.
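
If the NVMe route pans out, a quick sanity check would be to list the block devices and confirm the board actually sees the drive (a sketch; assumes a Linux environment):

# Show whole disks only, with their size and model
lsblk -d -o NAME,SIZE,MODEL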

Software

After trying for a week to get FreeBSD and Plex working together, I gave up and have decided that Ubuntu 16.04 with docker is the way forward. Let me explain:

The first task was to determine whether to use a hardware RAID controller or a software-based one. Searching around, it became clear that software RAID was better due to cost and the features offered by file systems like ZFS. Speaking of ZFS, it’s the best file system for a home server, as it is built for turning several disks into one and features compression, encryption, etc.
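
To make that concrete, here is a minimal sketch of the pool I have in mind: six drives in a double-parity raidz2 vdev with lz4 compression enabled. The pool name and device names are placeholders.

# Six drives, any two of which can fail without data loss
zpool create tank raidz2 /dev/sda /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf

# Cheap, fast compression for everything in the pool
zfs set compression=lz4 tank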

Having chosen ZFS, it made sense for the OS to be FreeBSD. ZFS and FreeBSD go together like bread and butter – they are tried and tested together. Since I was (and still am) unfamiliar with FreeBSD, I spent a week learning about jails and other administrative functions. The concept of jails (application isolation without a performance cost) sounded amazing. Not to mention FreeBSD seemed like a lightweight OS – running top would only show a dozen or so processes. I quickly got to work setting up a FreeBSD playground inside a virtual machine.

First I tried setting up an NFS server, but ran into problems: I needed NFS v4 to serve nested ZFS filesystems, and NFS v4 isn’t baked into Windows, so it was a no go. Then, after a couple hours of fighting with SMB, I finally got it working. I’m just going to squirrel away the config here for a rainy day:

[global]
workgroup = WORKGROUP
server string = Samba Server Version %v
netbios name = vm-freebsd
wins support = No
security = user
passdb backend = tdbsam
domain master = yes
local master = yes
preferred master = yes
os level = 65

# Share the pool's data directory with the users listed below
[pool]
path = /pool/data
valid users = nick, guest
writable  = yes
browsable = yes
read only = no
guest ok = yes
public = no
create mask = 0666
directory mask = 0755

I think the trick was that I wanted SMB users to be users on the VM, so the Samba server should act as the master.
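
For completeness, here is a sketch of the user setup that I believe made this work: the SMB account maps to a regular FreeBSD account, which then gets its own Samba password (the username is just an example).

# Create the local FreeBSD user (with a home directory)
pw useradd -n nick -m

# Give that user a Samba password so SMB clients can authenticate as it
smbpasswd -a nick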

So as you can see, everything was going smoothly – that is, until I tried setting up Plex. I thought that since plexmediaserver was on FreshPorts, everything should work. It didn’t, and since I didn’t know FreeBSD, ZFS, or Plex, I went on a wild goose chase of frustration. The internet even failed me, as the errors I was searching for came back with zero results.

In a fit, I created an Ubuntu VM, ran the Plex docker container, and everything just worked. I gave up on FreeBSD right then and there; I wasn’t going to force something. I later found out that since FreeBSD represented less than 1% of Plex’s user base, the team didn’t want to spend the resources on updates. Oh well. Ideally I wouldn’t have to use docker (downloading all those images seems … bloated), but given its rise to ubiquity and promise of compatibility, I’ll hop on the bandwagon.

With that, let’s take a look at some of the applications I’m looking to run (a sketch of how a couple of them might be launched follows the list):

  • ddclient: A dynamic DNS client. It keeps my DNS records updated whenever my ISP decides to give me a new IP.
  • nginx: A webserver that will serve as a reverse proxy for all downstream applications. It will be able to use certificates from Let’s Encrypt without configuring each application.
  • collectd: A system metric gatherer (CPU, memory, disk, networking, thermals, IPMI, etc). This will send the data to:
  • graphite: The official graphite docker image will store various metrics about the system and other applications. These metrics will be visualized using:
  • grafana: The official grafana docker image creates graphs and dashboards that are second to none. Just look at what I did for my home PC.
  • plex: The official plex docker image will be used to host the few movies and shows that I have lying around.
  • nextcloud: The official nextcloud docker image will be essential for creating my own “cloud”. I can even use extensions to access my KeePass or enable two factor authentication.
  • gitea: Using the official gitea docker image, I’ll be hosting my private code here.
  • jenkins: The official jenkins docker image will build all the private code.
  • rstudio: The rocker docker image will let me access my RStudio sessions when I’m away. Currently, I have a DigitalOcean machine with RStudio, but it’s been a pain to create and destroy the machine every time I need it.
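
As promised above, here is a minimal sketch of how a couple of these might be launched. The image names are the official ones, but the ports and volume paths are placeholders I haven’t settled on yet.

# Grafana on its default port
docker run -d --name grafana -p 3000:3000 grafana/grafana

# Nextcloud, with its data kept on the ZFS pool
docker run -d --name nextcloud -p 8080:80 \
  -v /pool/data/nextcloud:/var/www/html nextcloud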

You’d be wrong if you thought I’d abandon my current cloud storage providers (Onedrive, Google Drive, etc). In fact, I pay them for increased storage because stuff happens and I need to have backups of pictures, home videos, code, and important documents. I’m planning on keeping all the clouds in sync with everything encrypted using rclone. That way if a backup is compromised, it is no big deal.
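
Here is roughly what I have in mind with rclone, assuming a crypt remote named onedrive-crypt has already been created with rclone config to wrap the OneDrive remote and encrypt everything client-side (the remote and path names are hypothetical):

# Push the pool's data to OneDrive, encrypted before it ever leaves the box
rclone sync /pool/data onedrive-crypt:backups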

I’m also not going to abandon DigitalOcean, as those machines easily have more uptime and uplink than Comcast here. My philosophy is that if I want to show people my creation, I’ll host it externally, else I’ll self-host it. Plus it is a lot easier to tear down and recreate machines with an IAAS rather than bare metal.

The only question now is … when will I jump head first?


Investigation into the Inefficiencies of Node.js Http Streams


Previously, I wrote a Rust TLS termination proxy, which I tested with a Node.js echo server, and I noted the terrible performance. This post will be a deeper dive into that problem. See this post on stackoverflow.

The weird thing is I don’t know what the problem is, but I know the solution. Let me explain.

Echo Server Implementations

On node v8.1.4 or v6.11.1:

I started out with the following echo server implementation, which I will refer to as pipe.js or pipe.

const http = require('http');

const handler = (req, res) => req.pipe(res);
http.createServer(handler).listen(3001);

And I benchmarked it with wrk and the following lua script (shortened for brevity) that will send a small body as a payload.

wrk.method = "POST"
wrk.body   = string.rep("a", 10)

At 2k requests per second and 44ms of average latency, performance is not great.

So I wrote another implementation that uses intermediate buffers until the request is finished and then writes those buffers out. I will refer to this as buffer.js or buffer.

const http = require('http');

const handler = (req, res) => {
  let buffs = [];
  req.on('data', (chunk) => {
    buffs.push(chunk);
  });
  req.on('end', () => {
    res.write(Buffer.concat(buffs));
    res.end();
  });
};
http.createServer(handler).listen(3001);

Performance drastically changed with buffer.js servicing 20k requests per second at 4ms of average latency.

For those that are visual learners, the graph below depicts the average number of requests serviced over 5 runs and various latency percentiles (p50 is median).

comparison between buffer and pipe implementation

So, buffer is an order of magnitude better in all categories. My question is why?

What follows are my investigation notes; hopefully they are at least educational.

Response Behavior

Both implementations have been crafted so that they will give the same exact response as returned by curl -D - --raw. If given a body of 10 d’s, both will return the exact same response (with modified time, of course):

HTTP/1.1 200 OK
Date: Thu, 20 Jul 2017 18:33:47 GMT
Connection: keep-alive
Transfer-Encoding: chunked

a
dddddddddd
0

Both output 128 bytes (remember this).

Interesting tidbit: we can modify buffer.js to remove the res.write and write the buffer in res.end() instead. The response will then no longer be chunked encoded.
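
To see the chunked framing (or its absence) for yourself, the raw response can be inspected with the curl invocation mentioned earlier; a sketch, assuming the server is running locally:

curl -D - --raw -d 'dddddddddd' http://localhost:3001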

The Mere Fact of Buffering

Semantically, the only difference between the two implementations is that pipe.js writes data while the request hasn’t ended. This might make one suspect that there could be multiple data events in buffer.js. This is not true.

req.on('data', (chunk) => {
  console.log(`chunk length: ${chunk.length}`);
  buffs.push(chunk);
});
req.on('end', () => {
  console.log(`buffs length: ${buffs.length}`);
  res.write(Buffer.concat(buffs));
  res.end();
});

Empirically:

  • Chunk length will always be 10
  • Buffers length will always be 1

Since there will only ever be one chunk, what happens if we remove buffering and implement a poor man’s pipe:

const http = require('http');

const handler = (req, res) => {
  req.on('data', (chunk) => res.write(chunk));
  req.on('end', () => res.end());
};
http.createServer(handler).listen(3001);

Turns out, this has performance just as abysmal as pipe.js. I find this interesting because the same number of res.write and res.end calls are made with the same parameters. My best guess so far is that the performance differences are due to sending response data after the request data has ended.

Profiling

I profiled both applications using the simple profiling guide (--prof).
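
For reference, the guide boils down to running node with --prof while the benchmark is going, then post-processing the generated isolate log (a rough sketch; the exact log file name is whatever V8 emits):

# Run the server with the V8 profiler enabled, then hit it with wrk as before
node --prof pipe.js

# Afterwards, turn the tick log into a human-readable summary
node --prof-process isolate-*.log > pipe-processed.txt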

I’ve included only the relevant lines:

pipe.js

 [Summary]:
   ticks  total  nonlib   name
   2043   11.3%   14.1%  JavaScript
  11656   64.7%   80.7%  C++
     77    0.4%    0.5%  GC
   3568   19.8%          Shared libraries
    740    4.1%          Unaccounted

 [C++]:
   ticks  total  nonlib   name
   6374   35.4%   44.1%  syscall
   2589   14.4%   17.9%  writev

buffer.js

 [Summary]:
   ticks  total  nonlib   name
   2512    9.0%   16.0%  JavaScript
  11989   42.7%   76.2%  C++
    419    1.5%    2.7%  GC
  12319   43.9%          Shared libraries
   1228    4.4%          Unaccounted

 [C++]:
   ticks  total  nonlib   name
   8293   29.6%   52.7%  writev
    253    0.9%    1.6%  syscall

We see that in both implementations, C++ dominates time; however, the functions that dominate are swapped. Syscalls account for nearly half the time for pipe, yet only 1% for buffer (forgive my rounding). Next step, which syscalls are the culprit?

Strace Here We Come

Invoking strace like strace -c node pipe.js will give us a summary of the syscalls. Here are the top syscalls:

pipe.js

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 43.91    0.014974           2      9492           epoll_wait
 25.57    0.008720           0    405693           clock_gettime
 20.09    0.006851           0     61748           writev
  6.11    0.002082           0     61803       106 write

buffer.js

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 42.56    0.007379           0    121374           writev
 32.73    0.005674           0    617056           clock_gettime
 12.26    0.002125           0    121579           epoll_ctl
 11.72    0.002032           0    121492           read
  0.62    0.000108           0      1217           epoll_wait

The top syscall for pipe (epoll_wait), at 44% of the time, accounts for only 0.6% of the time for buffer (a 140x difference). While there is a large time discrepancy, the number of times epoll_wait is invoked is less lopsided, with pipe calling epoll_wait ~8x more often. We can derive a couple bits of useful information from that: pipe calls epoll_wait constantly, and on average these calls are heavier than the epoll_wait calls for buffer.

For buffer, the top syscall is writev, which is expected considering most of the time should be spent writing data to a socket.

Logically, the next step is to take a look at these epoll_wait calls with regular strace, which showed that buffer’s epoll_wait calls always contained 100 events (representing the hundred connections used with wrk) while pipe’s had fewer than 100 most of the time. Like so:

pipe.js

epoll_wait(5, [.16 snip.], 1024, 0) = 16

buffer.js

epoll_wait(5, [.100 snip.], 1024, 0) = 100

Graphically:

comparison between buffer and pipe implementation

This explains why there are more epoll_wait calls in pipe, as epoll_wait doesn’t service all the connections in one pass of the event loop. The epoll_wait calls that return zero events make it look like the event loop is idle! All this doesn’t explain why epoll_wait takes up more time for pipe, as the man page states that epoll_wait should return immediately:

specifying a timeout equal to zero cause epoll_wait() to return immediately, even if no events are available.

While the man page says the function returns immediately, can we confirm this? strace -T to the rescue:
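
Roughly, the invocation looks something like this, filtering down to just epoll_wait so the timings are easy to pull out (the output file name is arbitrary):

strace -f -T -e trace=epoll_wait -o pipe-epoll.trace node pipe.js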

comparison between buffer and pipe in terms of how long epoll_wait took to execute

Besides confirming that buffer has fewer calls, we can also see that nearly all of its calls took less than 100ns. Pipe has a much more interesting distribution, showing that while most calls take under 100ns, a non-negligible amount take longer and land in microsecond territory.

Strace did find another oddity, and that’s with writev. The return value is the number of bytes written.

pipe.js

writev(11, [{"HTTP/1.1 200 OK\r\nDate: Thu, 20 J"..., 109},
	{"\r\n", 2}, {"dddddddddd", 10}, {"\r\n", 2}], 4) = 123

buffer.js

writev(11, [{"HTTP/1.1 200 OK\r\nDate: Thu, 20 J"..., 109},
  {"\r\n", 2}, {"dddddddddd", 10}, {"\r\n", 2}, {"0\r\n\r\n", 5}], 5) = 128

Remember when I said that both output 128 bytes? Well, writev returned 123 bytes for pipe and 128 for buffer. The five-byte difference for pipe is reconciled in a subsequent write call for each writev.

write(44, "0\r\n\r\n", 5)

And if I’m not mistaken, write syscalls are blocking.

Conclusion

If I had to make an educated guess, I would say that piping while the request is not finished causes write calls. These blocking calls significantly reduce the throughput, partially through more frequent epoll_wait calls. Why write is called instead of the single writev seen in buffer is beyond me.

The kicker? In the official Node.js guide you can see how the guide starts with the buffer implementation and then moves to pipe! If the pipe implementation is in the official guide there shouldn’t be such a performance hit, right?

July 31st 2017 EDIT

My initial hypothesis that writing the echoed body after the request stream has finished increases performance has been disproved by @robertklep with his readable.js (or readable) implementation:

const http   = require('http');
const BUFSIZ = 2048;

const handler = (req, res) => {
	req.on('readable', _ => {
		let chunk;
		while (null !== (chunk = req.read(BUFSIZ))) {
			res.write(chunk);
		}
	});
	req.on('end', () => {
		res.end();
	});
};
http.createServer(handler).listen(3001);

Readable performed at the same level as buffer while writing data before the end event. If anything, this makes me more confused, because the only difference between readable and my initial poor man’s pipe implementation is the difference between the data and readable events, and yet that caused a 10x performance increase. But we know that the data event isn’t inherently slow, because we used it in our buffer code.

For the curious, strace on readable reported that writev outputs all 128 bytes, just like buffer.

This is perplexing!


Writing a high performance TLS terminating proxy in Rust

I’ve been unpacking boxes like a TLS proxy unboxes TCP packets of TLS data

I feel like I don’t write about the Rust language that often, but it turns out today is my 2nd anniversary of dabbling in Rust. Exactly two years ago, I wrote a post about processing an arbitrary amount of data in Python, where I benchmarked a Python csv solution against Rust (spoiler: Rust is faster). Since then, Rust has only been mentioned in passing on this blog (except for my post on creating a Rocket League Replay Parser in Rust, where I dive into not only how to create a parser with nom, but also what the replay format looks like). Well, today will be all about creating a TLS termination proxy. Keep in mind, while I have a decent intuition, I still consider myself new to Rust (I still fight with the borrow checker and have to keep reminding myself what move lambdas do).

TLS Termination Proxy

Some languages have notoriously bad TLS implementations, which can be off-putting to those looking to secure an app or site. One can put their app behind one of the big juggernauts (Apache, Nginx, HAProxy), which can do TLS termination along with hundreds of other features. I recommend them wholeheartedly and personally use them, but when only TLS termination is desired, they can be excessive. Hence the perfect opportunity to write a lightweight and speedy TLS termination proxy.

When implementing any kind of proxy, one has to determine how much of the request content is needed to proxy correctly. A layer 7 proxy would, for instance, read the HTTP URL and headers (and maybe even the body) before proxying to the app. These proxies can even modify the requests, like adding the X-Forwarded-For header, so the client IP address isn’t lost at the proxy. As one can imagine, this is more computationally intense than blindly handing requests from A directly to B. Since this project is more of an experiment than anything, I went with the blind approach! The advantage of a blind proxy (sometimes known as a dumb proxy) is that it is protocol agnostic and can work with any type of application data.

Note that even dumb proxies can forward client addresses to downstream services using the PROXY protocol initially championed by HAProxy. This is something that I could look to implement in the future, as it is simply a few additional bytes sent initially to the server.
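
To make “a few additional bytes” concrete: in the human-readable version 1 of the PROXY protocol, the proxy would prepend a single line like the following to the stream before any application data (the addresses and ports below are made up):

printf 'PROXY TCP4 203.0.113.7 10.0.0.2 51234 443\r\n'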

Besides performance, a TLS proxy can bring other benefits. Since the use case is so focused, it would be relatively easy to add features such as structured logging or metrics. One wouldn’t have to reimplement the solution to “what’s the throughput for each client” for each new application. Just stick the TLS proxy in front of any app that needs those answers.

The Design

Here’s the design, one could say it’s pretty simple:

tls termination design

  • The TLS proxy listens on a given port
  • Clients communicate via TLS with the proxy
  • Proxy decrypts and forwards the request to the application via Unix sockets
  • Application responds
  • Proxy re-applies encryption and sends it back to the client.

Rust Implementation

Since the Rust ecosystem has only started to dip its toes into asynchronous IO with tokio, I figured I should see what all the fuss was about.

In the end, while I was satisfied with the result, the labor involved in getting the code correct took a significant amount of time. I don’t think I have ever written code so slowly. This makes sense, though; I don’t ever write networking code at this low a level, so it should take me longer as I figure out what is needed to satisfy the compiler. The meat of my implementation was taken from the official tokio repo. TLS was achieved through a combination of tokio-tls and native-tls. Since I only care about the Linux use case, I needed to extend the functionality of native-tls to allow users of the library additional customizations for openssl.

Proxy App Code

I needed a server that would satisfy the following benchmark requirements:

  • Have a mode to serve only HTTP requests
  • Have a mode to serve only HTTPS requests
  • Have a mode where the server listens on a Unix socket for HTTP requests
  • Be able to saturate multiple CPUs

I went with nodejs. Go probably would have been a good choice too, but I’m still not comfortable with Go’s build tools. An interesting note is that nodejs also uses openssl for crypto tasks, so any speed differences will most likely be due to the overhead of interacting with openssl.

The server will simply echo back the request data, which will give us a good sense of what overhead TLS constitutes.

Below is the code for the server. There are zero dependencies, which I reckon is quite an achievement in javascript land!

const fs = require('fs');
const http = require('http');
const https = require('https');
const cluster = require('cluster');
const sock = '/tmp/node.sock';

if (cluster.isMaster) {
  // Since unix sockets are communicated over files, we must make sure the
  // file is deleted if it exists so we can open the socket
  if (fs.existsSync(sock)) {
    fs.unlinkSync(sock);
  }

  // Create as many processes as provided on the commandline
  const cpus = parseInt(process.argv[3]);
  for (let i = 0; i < cpus; i++) {
    cluster.fork();
  }
} else {
  // We're going to respond with every request by echoing back the body
  const handler = (req, res) => {
    req.pipe(res);
  };

  // Callback to let us know when the server is up and listening. To ease
  // debugging potential race conditions in scripts
  const listenCb = () => console.log('Example app is listening');

  const arg = process.argv[2];
  switch (arg) {
    case 'sock':
      http.createServer(handler).listen(sock, listenCb);
      break;
    case 'http':
      http.createServer(handler).listen(3001, listenCb);
      break;
    case 'https':
      const sslOptions = {
        key: fs.readFileSync('key.pem'),
        cert: fs.readFileSync('cert.pem')
      };

      https.createServer(sslOptions, handler).listen(3001, listenCb);
      break;
    default:
      throw Error(`I do not recognize ${arg}`);
  }
}

The script is invoked with

node index.js <http|https|sock> <#cpus>

One thing I noticed is that when I simply had the server res.send("Hello World"), it would be at least ten times faster (1 - 5ms) than the equivalent echo of the same request body (50ms). I’m not sure if the act of piping http streams causes a lot more work (e.g. the request body doesn’t even need to be parsed, etc). A brief internet search turned up nothing, so I’m not going to worry about it. EDIT: Turns out I ended up writing a post about this!

CPU Affinity

I tried something that I have never done before when it comes to benchmarking: I ran the benchmarking software (wrk) on the same box as the server. Before anyone gets up in arms, let me introduce you to CPU affinity. In Linux, one can set which CPUs a process can run on. By setting the proxy and the server to the same CPUs, it would be representative of them running on the same box – competing for the same resources. Wrk is set to a mutually exclusive set of CPUs.

Invoking taskset will control what CPUs are used:

taskset --cpu-list 0-2 node index.js http 3

Caveat: the kernel can still schedule other processes on the mentioned CPUs. To force the kernel to ignore certain CPUs (so that only your process is using that CPU), look into booting your kernel with the isolcpus option.
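
As a sketch (assuming a GRUB-based system), that means adding the reserved CPUs to the kernel command line and regenerating the bootloader config:

# /etc/default/grub: keep CPUs 0-2 away from the general scheduler
GRUB_CMDLINE_LINUX_DEFAULT="quiet isolcpus=0,1,2"

# Regenerate the GRUB config and reboot for it to take effect
sudo update-grub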

Benchmarking Script

Our script will have three different configurations:

  • http: A nodejs server listening for HTTP requests on 3001
  • https: A nodejs server listening for HTTPs requests on 3001
  • tls-off: A nodejs server listening for HTTP requests on /tmp/node.sock with our proxy listening on 8080.

Each configuration will be run five times for each body size.

#!/bin/bash -e

# This script will use all the cores on the box
CPUS=$(nproc --all)

# This is the directory where the TLS proxy is stored, the code is not
# currently open sourced but is included in the script for completeness.
# The command is run on the same CPUs as the node application
RUSTTLS_DIR=../projects/rustls-off
RUSTTLS_CMD="taskset --cpu-list 0-$((CPUS - 2)) target/release/rustls-off config.toml"

# Node server created from the javascript posted earlier
NODE_DIR="../projects/test"
NODE_CMD="taskset --cpu-list 0-$((CPUS - 2)) node index.js"

# Benchmarking command. It uses only a single thread, as a single thread was
# able to saturate three server CPUs. Strangely enough, anything lower than
# 3000 connections wouldn't saturate anything (node is bottlenecked on something,
# but I'm not sure what)
WRK_CMD="taskset --cpu-list $((CPUS - 1)) wrk -t1 -c3000 -d30s"

# We'll be creating a Lua file to customize benchmarking. The Lua file will
# be deleted on completion
LUA_FILE=$(mktemp)
trap "rm -f $LUA_FILE" EXIT

# We're going to test request bodies of varying sizes
for BODY_SIZE in 10 100 1000 10000; do

    cat > $LUA_FILE <<EOF
wrk.method = "POST"
wrk.body   = string.rep("a", tonumber(os.getenv("BODY_SIZE")))

done = function(summary, latency, requests)
    io.write(string.format("%s,%s,%d,%d,%d,%d,%d,%d\n",
        os.getenv("CONFIG"),
        os.getenv("BODY_SIZE"),
        summary.requests,
        latency.mean,
        latency.stdev,
        latency:percentile(50),
        latency:percentile(90),
        latency:percentile(99)))
end
EOF

    # Each body size benchmark will be repeated five times to ensure a good
    # representation of the data
    for i in {1..5}; do
        echo "Body size: $BODY_SIZE, iteration: $i"
        pushd $NODE_DIR
        $NODE_CMD http $((CPUS - 1)) &
        NODE_PID=$!
        popd

        # Must sleep for a second to wait for the node server to start
        sleep 1
        CONFIG=http BODY_SIZE=$BODY_SIZE $WRK_CMD -s $LUA_FILE http://localhost:3001 | \
			tee >(tail -n 1 >> tls-bench.csv);
        kill $NODE_PID

        pushd $NODE_DIR
        $NODE_CMD https $((CPUS - 1)) &
        NODE_PID=$!
        popd
        sleep 1
        CONFIG=https BODY_SIZE=$BODY_SIZE $WRK_CMD -s $LUA_FILE https://localhost:3001 | \
			tee >(tail -n 1 >> tls-bench.csv);
        kill $NODE_PID

        pushd $NODE_DIR
        $NODE_CMD sock $((CPUS - 1)) &
        NODE_PID=$!
        popd

        pushd $RUSTTLS_DIR
        $RUSTTLS_CMD &
        RUSTTLS_PID=$!
        popd

        sleep 1
        CONFIG="tls-off" BODY_SIZE=$BODY_SIZE $WRK_CMD -s $LUA_FILE https://localhost:8080 | \
			tee >(tail -n 1 >> tls-bench.csv);
        kill $NODE_PID
        kill $RUSTTLS_PID
    done
done

Results

Here are the median response times graphed for our three methods across multiple body sizes.

results for tls termination

library(tidyverse)
library(readr)

df <- read_csv("tls-bench.csv",
               c("config", "size", "requests", "mean", "stdev", "p50", "p90", "p99"))

df <- mutate(df,
             mean = mean / 1000,
             stdev = stdev / 1000,
             p50 = p50 / 1000,
             p90 = p90 / 1000,
             p99 = p99 / 1000)

ggplot(df, aes(factor(df$size), p50, fill=config)) +
  geom_jitter(size=4, width=0.15, shape=21) +
  xlab("Body size (bytes)") + ylab("Median response time (ms)") +
  ggtitle("Median response times to complete requests")

At low body sizes, the TLS proxy beats the HTTPS-configured nodejs server handily, and even manages to beat the HTTP version (does this mean that Rust has a better TCP stack than node!?). As request size increases, our proxy does worse and worse. Why is that? Well, I’ve been keeping a secret: the proxy (so far) is only single threaded! As request size increases, the amount of time needed to encrypt and decrypt requests grows, and this computationally expensive task is bottlenecked on one core. I reached out to the tokio community and they’ve given me a few ideas on how to incorporate multiple cores.

Open Source?

I’m normally a proponent of open sourcing a lot of my material, but this is still very experimental and I wouldn’t want anyone to get any other impression. This is just a small exercise in Rust and Tokio! There are a bunch of features I want to play with (I mentioned metrics and logging in the beginning), and if it starts to look good, I’ll let the world see it.