
Back to the Register: Distilling the Federal Register for All

Making something simple from complicated

Introduction

A year and a half ago, I wrote Federal Register Data Exploration with R, and I had written it mainly for myself, as a way of exhibiting a technique I called “flexml” (a portmanteau of flexible and xml). The post is light on details about how to transform the Federal Register data (natively stored in xml) into a more queryable form (in that case, a database). I didn’t think much of it until an academic mentioned they wanted to replicate my findings and were having difficulty, so I decided to rethink the process and open the results for all. The latest code can be found on github.

Step 1: Grabbing the xml data

The data is stored yearly on the Federal Register website. Each year is a zip file in which the days are individual xml files. These xml files are nontrivial and contain a structure designed for formatting.

To grab the data from years 2005 to 2016:

# Store all the register data under the 'data' directory
mkdir -p data
for i in {2005..2016}; do
    # Since the download takes a while, detect if we have already downloaded it
    if [[ ! -e FR-$i.zip ]]; then
        curl -O -L "https://www.gpo.gov/fdsys/bulkdata/FR/$i/FR-$i.zip"
    fi
    unzip -d data FR-$i.zip
done;

Step 2: Transform the XML with XQuery

There are many ways to query and manipulate XML:

  • XPath
  • XSLT
  • XQuery
  • Language specific parsers (DOM, Sax, Stax)

I decided to go with XQuery, as XQuery 3.1 allows us to concisely select the important tidbits and return JSON output. Any other method would be a verbose eyesore (looking at you, XSLT) or language specific. Short, sweet, and language agnostic will help potential contributors – especially contributors that may not be spending 40+ hours a week programming like myself.

To make XQuery accessible, we’ll need a command line program to run our transforms. For that, there is really only one option: Saxon HE, written by the man and company behind much of the advancement of the XML query languages.
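
As a rough sketch of what running a transform looks like (the jar name and the input file name below are assumptions on my part; adjust them to your Saxon download and data layout):

java -cp saxon9he.jar net.sf.saxon.Query \
    -q:transform.xql \
    -s:data/FR-2016-01-04.xml \
    -o:FR-2016-01-04.json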

Here is a taste of what XQuery looks like:

declare function local:names($content as node()*) {
    for $raw in $content

        (: A name element can have an emphasis, so all children text is needed
           to be joined. I'm looking at you <E T="04">Marlene H. Dortch,</E> :)
        let $trim := functx:trim(fn:string-join($raw//text(), ''))

        (: And all names end with a comma or semi-colon, which we dutifully
           strip. The one semicolon seen is for a George Aiken for the 2006-09-14
           publication :)
        let $name := functx:substring-before-last-match($trim, '[\.;,]')
        return $name
};

One question people may have is why I want to transform XML to JSON. The lingua franca of data communication is JSON and CSV, so we transform the XML into JSON so that it’s trivial to then transform it into CSV. It may seem like a lot of indirection, but each transformation is isolated and can be quickly understood. This project could have been constructed as a monolithic script using (insert favorite language here), but that would have resulted in a black box instead of a couple of short scripts.
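
Concretely, the per-day pipeline might chain together like the sketch below (the Python script name, the output directory, and the exact Saxon invocation are my own placeholders, not part of the project; the Python script itself is covered in step 3):

# XML -> JSON (XQuery via Saxon) -> CSV (Python), one day at a time
TMP_DIR=csv
mkdir -p "$TMP_DIR"
for f in data/*.xml; do
    java -cp saxon9he.jar net.sf.saxon.Query -q:transform.xql -s:"$f" |
        python3 transform.py > "$TMP_DIR/$(basename "$f" .xml).csv"
done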

Step 3: Transform the JSON with Python

Since the federal register contains unicode text, we’ll need Python 3 for its unicode-enabled csv module. We could have still used Python 2, but that would have required external dependencies. I normally have no problem using dependencies, but to keep with the theme of simplicity, we’ll limit ourselves to Python 3 (that’s ubiquitous enough, right?)

The script itself is extremely straightforward, ingesting JSON from stdin and outputting csv to stdout.

#!/usr/bin/env python3
# This script ingests json formatted by `transform.xql` and outputs a csv.
# Execute this script under python3 as the federal register contains unicode
# data and python 3's csv writer supports unicode data.
import json
import csv
import sys

# Turn presidents json into csv
def presidents(pres):
    return ['presidential', 'presidential', pres['title'], 'the president', None, pres['docket']]

# Turn the other document types from json into csv
def rules(r, name):
    if any(map(lambda s: ';' in s, r['names'])):
        raise Exception("Can't semicolon list")

    # Some rins are written as RIN 1625-AA08; AA00 and need to be split apart
    rin = ';'.join([i for sl in r['rin'] for i in sl.split('; ')])
    return [name, r['agency'], r['subject'], ';'.join(r['names']), rin, r['docket']]

data = json.load(sys.stdin)
dt = data['date']
writer = csv.writer(sys.stdout)
writer.writerow(['date', 'type', 'agency', 'subject', 'names', 'rin', 'docket'])
writer.writerows(map(lambda p: [dt] + presidents(p), data['presidentials']))
writer.writerows(map(lambda r: [dt] + rules(r, "rule"), data['rules']))
writer.writerows(map(lambda r: [dt] + rules(r, "proposed-rule"), data['proposed-rules']))
writer.writerows(map(lambda r: [dt] + rules(r, "notice"), data['notices']))

The only clever thing in the script is how it embeds a list of values in a single CSV column as a semicolon-delimited list. After the script is done, each day has been turned into a csv file.

The column definitions:

  • date: The date the document appeared in the registry
  • type: Presidential / rule / proposed-rule / notice
  • agency: What agency issued this document (e.g. Department of Transportation)
  • subject: What is the subject / title of this document
  • names: List of names associated with the document (semi-colon delimited)
  • rin: List of Regulation Identifier Numbers associated with the document (semi-colon delimited)
  • docket: Docket number(s) associated with the document

Now to combine all the files.

Step 4: Combining CSV with XSV

It would be possible to combine all the files using another python script, but I decided to take the easy route and use XSV, as combining all files correctly is literally a one liner:

xsv cat rows $TMP_DIR/* > $OUT

XSV is not available by default in any distro, but the saving grace is that one can simply download the executable from the github page and have it just work.
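
For reference, fetching a prebuilt binary might look roughly like this (the version and archive name are assumptions; check the xsv releases page for the current ones):

# Grab a prebuilt xsv release from GitHub (version/asset name are assumptions)
XSV_VERSION=0.13.0
curl -L -O "https://github.com/BurntSushi/xsv/releases/download/$XSV_VERSION/xsv-$XSV_VERSION-x86_64-unknown-linux-musl.tar.gz"
tar xzf "xsv-$XSV_VERSION-x86_64-unknown-linux-musl.tar.gz"
./xsv --version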

Step 5: Analysis with R

Now we have a CSV and analyzing this with R should be a breeze. Before we get too far into creating visualizations, we’ll massage the data a bit.

Department names are long, so we’re going to create abbreviations. Some abbreviations are probably better known than the actual department name (SEC), so we use those. Then, if we’re dealing with a “Department of X”, we shorten it to “D. x”. Abbreviating these names makes the scales much more reasonable.

abbrev <- function(x) {
  ifelse(as.character(x) == "Department of health and human services", "HHS", 
  ifelse(as.character(x) == "Securities and exchange commission", "SEC", 
  ifelse(as.character(x) == "Department of transportation", "DoT", 
         gsub("Department of (.*)", "D. \\1", x))))
}

df <- read_csv("output-2017.10.19.csv")
df <- df %>% mutate(
  agency = abbrev(agency),
  names = strsplit(names, ";"),
  rin = strsplit(rin, ";"),
  month = as.Date(cut(date, breaks = "month")))

First, let’s take a look at how each document type has fluctuated throughout the years.

pics/register/register-1.png

All document types appear stable throughout time. Notices are, by far, the most common. Presidential documents are the fewest, and it may be hard to discern a pattern from this graph, so let’s take a closer look.

pics/register/register-2.png

Wow, the number of presidential documents in the register fluctuates a lot! Instead of looking at time linearly, what if there is a pattern when grouped by month?

pics/register/register-3.png

So the president takes it easy in the winter and summer, but kicks it into high gear in the spring and fall.

What are the top five agencies by number of documents in the federal register and how do they compare to the rest?

pics/register/register-4.png

While the top 5 agencies are somewhat close to each other, they are all dwarfed when other agencies are included. Let’s break down the document types for the top 5 agencies.

pics/register/register-5.png

All agencies appear to follow the trend of notices » proposed-rules and rules, except for the Department of Transportation, which contains the most rules and the fewest notices. Why this might be is beyond me!

Let’s see if we can’t depict how ephemeral people are (eg. people are promoted, administrations change, etc). For each month, we grab the top names and graph their growth.

pics/register/register-6.png

Maybe I’ll figure out how to discern individuals better, but we can see that those who were once prominent are replaced by someone new.

Remaining Questions

Running the XQuery transformation on 3,000 files means the JVM is started and stopped 3,000 times. This is time consuming and I wish there was a better way. I have stumbled across drip and the ahead of time compilation of JDK 9, but I haven’t had much success with them.


WireGuard VPN Walkthrough


This article was last updated: October 15th 2017

With the rise of privacy consciousness, people are looking to solutions like a hosted VPN (I hear one should never use a free service), or self hosted like streisand and algo. How does a VPN (a remote access VPN – not a site-to-site VPN for the pedantic) help maintain privacy?

In the scenario of maintaining privacy or getting around geographic content blocking, the VPN connects you to a server, oftentimes in a different country, which forwards all your traffic to the intended recipient. The recipient responds to the server, which dutifully forwards it back to you. So, if you live in the US but are VPNed into a German server and request content from India, India will think you’re in Germany (this is assuming countries have thoughts </joke>).

I’m going to show how to self host WireGuard, which bills itself as easier to configure than IPSec and OpenVPN, while being faster and more powerful. WireGuard is a component of streisand, but since we’re dealing with only a linux client and server setup, we cut out the streisand middleman and use WireGuard directly. Theoretically, this cuts down on the bloat and attack surfaces that are inherent with the wide array of software that streisand installs (streisand is planning on supporting modular installs in the future).

It should be noted:

WireGuard is not yet complete. You should not rely on this code. It has not undergone proper degrees of security auditing and the protocol is still subject to change.

This demonstration will be on a DigitalOcean Ubuntu 16.04 box, but it should be easily adaptable for other platforms (as long as they are linux based).

Script

The following script is to be executed on one’s server. This script will be subsequently broken down.

#!/bin/bash

# The client's public key (generated in subsequent section client side)
CLIENT_PUBLIC="<ENTER>"

sysctl -w net.ipv4.ip_forward=1
add-apt-repository --yes ppa:wireguard/wireguard
apt-get update
apt-get install --yes wireguard-dkms wireguard-tools

wg genkey | tee privatekey | wg pubkey > publickey

PRIVATE=$(cat privatekey)
echo "public: $(cat publickey)"

cat > /etc/wireguard/wgnet0.conf <<EOF
[Interface]
Address = 10.192.122.1/24
SaveConfig = true
ListenPort = 51820
PostUp = iptables -A FORWARD -i wgnet0 -j ACCEPT; iptables -t nat -A POSTROUTING -o eth0 -j MASQUERADE
PostDown = iptables -D FORWARD -i wgnet0 -j ACCEPT; iptables -t nat -D POSTROUTING -o eth0 -j MASQUERADE
PrivateKey = $PRIVATE

[Peer]
PublicKey = $CLIENT_PUBLIC
AllowedIPs = 10.192.122.2/32
EOF

Simple enough, how does this work?

sysctl -w net.ipv4.ip_forward=1

Sysctl allows modifying kernel parameters at runtime, so here we’re allowing the kernel to forward packets from one network interface to another. We need this, as wireguard works by creating the VPN on another network interface (commonly called wg0 or wgnet0). This interface, by itself, does not have internet access, but with ip forwarding we can forward traffic from the VPN to the interface that can communicate with the internet.

Forwarding is only important for the server because once connected to the VPN the default client interface won’t be used anymore.

sysctl -w only changes kernel parameters at runtime. To persist these settings, edit the relevant line in /etc/sysctl.conf.
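
A minimal sketch of persisting it (appending works too, since the stock Ubuntu sysctl.conf ships the line commented out; this is one way, not the only way):

# Persist IP forwarding across reboots and apply the file immediately
echo 'net.ipv4.ip_forward=1' >> /etc/sysctl.conf
sysctl -p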

add-apt-repository --yes ppa:wireguard/wireguard
apt-get update
apt-get install --yes wireguard-dkms wireguard-tools

These commands fetch the latest wireguard version and install it. Since WireGuard hooks into the kernel, it attempts to automatically detect the correct kernel to build against. This should work flawlessly.
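
A quick sanity check (not part of the original flow) that the module actually built and loads against the running kernel:

# Confirm the wireguard kernel module loads
modprobe wireguard
lsmod | grep wireguard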

The one problem I’ve had is that DigitalOcean controls the kernel through their web interface (one can use a custom kernel, but that is outside the scope of this post). Anyways, if you had tried to install a custom kernel on top of the one in DigitalOcean, wireguard will skip the correct kernel as it believes it’s chrooted. Sorry for the tangent, but since I experienced this problem, I figured I should document it for others.

wg genkey | tee privatekey | wg pubkey > publickey

Both the client and server need to generate a pair of keys. The server does not need to know the client’s private key and vice versa; however they do need to know each other’s public key to permit only authorized use of the VPN (else anyone who knew your VPN server’s address could use your VPN).
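
The client generates its own pair with the exact same command; only the public half gets pasted into the server script as CLIENT_PUBLIC:

# On the client: generate a key pair and share only the public key
wg genkey | tee privatekey | wg pubkey > publickey
cat publickey   # paste this value into CLIENT_PUBLIC on the server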

[Interface]
Address = 10.192.122.1/24
ListenPort = 51820

When clients connect to the server, they can communicate directly with it using 10.192.122.1. We know that 10.192.122.1 can’t possibly be an internet facing box because it falls in a private network range. The /24 is a CIDR subnet mask that states this VPN subnet has room for 254 hosts. WireGuard then listens on port 51820 for interested clients.

PostUp = iptables -A FORWARD -i wgnet0 -j ACCEPT; iptables -t nat -A POSTROUTING -o eth0 -j MASQUERADE
PostDown = iptables -D FORWARD -i wgnet0 -j ACCEPT; iptables -t nat -D POSTROUTING -o eth0 -j MASQUERADE

Probably the most convoluted section of the config, yet this step must not be skipped. Here is how those lines configure the firewall.

When the VPN is created:

  • We accept packets from our VPN interface for packets being routed through the box
  • Then whenever a new connection is created (eg. our client wants to access google.com, so the server needs to connect to google.com now), the outgoing packets are altered to have the server IP address so google.com responds to the server, which relays it back to the client.

Then, when the VPN is destroyed, those firewall rules are deleted.

If you forget those lines, your requests will blackhole when you connect as a client, and it may appear as if you’ve lost your internet connection.
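
If you want to double check that the PostUp rules actually landed while the VPN is up, something like the following will show them (a spot check, not part of the original setup):

# Spot check the forwarding and NAT rules added by PostUp
iptables -L FORWARD -v -n | grep wgnet0
iptables -t nat -L POSTROUTING -v -n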

[Peer]
PublicKey = $CLIENT_PUBLIC
AllowedIPs = 10.192.122.2/32

The peer section is for client information. The client that connects with the given client public key is assigned 10.192.122.2 for their IP address.

Client Config

The server is all setup so what does the client configuration look like?

[Interface]
Address = 10.192.122.2/32
PrivateKey = CLIENT_PRIVATE_KEY
DNS = 10.192.122.1

[Peer]
PublicKey = SERVER_PUBLIC_KEY
AllowedIPs = 0.0.0.0/0
Endpoint = SERVER_IP:51820

Breaking down the config:

[Interface]
Address = 10.192.122.2/32
DNS = 10.192.122.1

The 10.192.122.2/32 matches the address the server assigned to this client in its [Peer] section.

We set the client’s DNS server to that of the VPN server. This is not needed, but I recommend it, as you want to communicate with servers (eg. google.com) that are closest to your VPN server to minimize latency. For instance, if you live in the US, VPNed into Singapore, and wanted google.com, you’d want to talk to a Singaporean Google server (and not the ones in the US) so that packets travel the least distance.
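
With the placeholders filled in, save the client config (I’m assuming the same /etc/wireguard/wgnet0.conf path as on the server) and bring the tunnel up with wg-quick:

# On the client: bring the tunnel up, check the handshake, tear it down
wg-quick up wgnet0
wg                     # shows the latest handshake and transfer counters
wg-quick down wgnet0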

Automation

It is actually possible to create and destroy VPN boxes on demand for next level privacy. Here is how one would do this:

  • Create DigitalOcean box by hand using the previous instructions.
  • Verify that the VPN works (wg-quick up wgnet0 on both client and server).
  • Take snapshot.
  • Using the DigitalOcean cli one can create a new server from our snapshot (so the server will have the same public and private key, it’ll just have a new ip address)
  • Update your client config to reference server’s new IP address

And just like that, you can create and destroy VPNs all around the world in under a minute.
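
As a sketch of the snapshot-to-server step with the DigitalOcean CLI (doctl), where the snapshot ID, region, and size are placeholders you’d fill in yourself:

# Create a fresh droplet from the VPN snapshot and grab its new IP
# (snapshot ID, region, and size are placeholders)
doctl compute droplet create vpn-$(date +%s) \
    --image <snapshot-id> --region fra1 --size s-1vcpu-1gb --wait
doctl compute droplet list --format Name,PublicIPv4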


Robust systemd Script to Store Pi-hole Metrics in Graphite

Ah, a serene lake of … metrics

I’m building a home server and I’m using Pi-hole to blackhole ad server domains in a test environment. It’s not perfect, but it works well enough. The Pi-hole admin interface shows a dashboard like the following to everyone (admin users get an even greater breakdown, but we’ll focus on just the depicted dashboard).

pi-hole admin

It’s a nice looking dashboard; however, it’s a standalone dashboard that doesn’t fit in with my grafana dashboards. I don’t want to go to multiple webpages to know the health of the system, especially since I only want to know two things: is pi-hole servicing DNS requests, and are ads being blocked? This post will walk through exporting this data from pi-hole into Graphite, which is my time series database of choice.

There is a blog post that contains a python script that will do this process. Not to pick on the author (they inspired me) but there are a couple of things that can be improved:

  • The python script is not standalone. It requires the requests library. While ubiquitous, the library is not in the standard library and will require either a virtual environment or installation into the default python
  • Any error will crash the program (eg connection refused, different json response, etc) and it’ll have to be manually restarted
  • No logs to know if the program is running or why it would have crashed
  • Does not start on computer reboot

An Alternative

Let’s start with making a network request:

curl http://$HOST:$PORT/admin/api.php?summaryRaw

returns

{"domains_being_blocked":116855,"dns_queries_today":8882 ... }

Ah, let’s turn to jq to massage this data, which will, by default, prettify the output

{
  "domains_being_blocked": 116855,
  "dns_queries_today": 8882,
  "ads_blocked_today": 380,
  "ads_percentage_today": 4.278316,
  "unique_domains": 957,
  "queries_forwarded": 5969,
  "queries_cached": 2533,
  "unique_clients": 2
}

We somehow need to get the previous data into the <path> <value> <timestamp> format for carbon, with lines separated by newlines.

Since jq prefers working with arrays, we’ll transform the object into an array: jq 'to_entries'

[
  {
    "key": "domains_being_blocked",
    "value": 116855
  },
  {
    "key": "dns_queries_today",
    "value": 8954
  },
  // ...
]

Now we’re going to transform each element of the array into a string of $key $value with jq 'to_entries | map(.key + " " + (.value | tostring))'. The value is numeric and has to be converted into a string.

[
  "domains_being_blocked 116855",
  "dns_queries_today 8966",
  "ads_blocked_today 385",
  "ads_percentage_today 4.294",
  "unique_domains 961",
  "queries_forwarded 6021",
  "queries_cached 2560",
  "unique_clients 2"
]

Finally, unwrap the array and string with a jq -r '... | .[]' to get:

domains_being_blocked 116855
dns_queries_today 9005
ads_blocked_today 386
ads_percentage_today 4.286508
unique_domains 962
queries_forwarded 6046
queries_cached 2573
unique_clients 2

We’re close to our desired format. All that is left is an awk oneliner:

awk -v date=$(date +%s) '{print "pihole." $1 " " $2 " " date}' >>/dev/tcp/localhost/2003

So what does our command look like?

curl --silent --show-error --retry 5 --fail \
       http://$HOST:$PORT/admin/api.php?summaryRaw | \
    jq -r 'to_entries |
           map(.key + " " + (.value | tostring)) |
           .[]' | \
    awk -v date=$(date +%s) '{print "pihole." $1 " " $2 " " date}' \
    >>/dev/tcp/localhost/2003

Is this still considered a one-liner at this point?

I’ve added some command line options to curl so that all non-200 status codes are errors and so that curl will retry 5 times, up to about half a minute, to let the applications finish booting.

We could just stick this in cron and call it a day, but we can do better.

systemd

systemd allows for some nice controls over our script that will solve the rest of the pain points with the python script. One of those pain points is logging. It would be nice to log the response sent back from the API so we’ll know what fields were added or modified. Since our script doesn’t output anything, we’ll capture the curl output and log that (see final script to see modification, but it’s minimal).

With that prepped, let’s create /etc/systemd/system/pihole-export.service

[Unit]
Description=Exports data from pihole to graphite

[Service]
Type=oneshot
ExecStart=/usr/local/bin/pi-hole-export
StandardOutput=journal+console
Environment=PORT=32768

  • Type=oneshot: great for scripts that exit after finishing their job
  • StandardOutput=: Has the stdout go to journald (which is indexed, log-rotated, the works). Standard error inherits from standard out. Read a previous article that I’ve written about journald
  • Environment=PORT=32768: sets the environment for the script (allows a bit of configuration)

After reloading the daemon to find our new service, we can run it with the following:

# Pick up the new unit file
systemctl daemon-reload
systemctl start pihole-export

# And look at the output
journalctl -u pihole-export

If we included an exit 1 in the script, the status of the service would be failed even though it is oneshot, and the log file will let us know what data failed it or if there was a connection refused (printed to standard error). This allows systemd to answer the question “what services are currently in the failed state”, and I’d imagine that one could create generic alerts off that data.
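
That failed-state question is itself a one-liner (not in the original post):

# List units currently in the failed state
systemctl list-units --state=failed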

One of the last things we need to do is create a timer to be triggered every minute.

[Unit]
Description=Run pihole-export every minute

[Timer]
OnCalendar=*-*-* *:*:00
AccuracySec=1min
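
To get the “starts on computer boot” behavior, the timer also needs to be enabled; a minimal sketch, assuming the unit above is saved as /etc/systemd/system/pihole-export.timer and given an [Install] section with WantedBy=timers.target:

# Enable the timer so it starts on boot and begin running it now
systemctl daemon-reload
systemctl enable --now pihole-export.timer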

It might annoy some people that the amount of configuration is about the same number of lines as our script, but we gained a lot. In a previous version of the script, I was preserving standard out by using tee with process substitution to keep the script concise. This resulted in logs showing the script running every minute, but graphite only captured approximately every other data point. Since I knew from the logs that the command successfully exited, I realized that process substitution happens asynchronously, so there was a race condition between tee finishing and sending the request. Simply replacing tee with a temporary buffer proved effective enough for me, though there reportedly are ways of working around the race condition.

Final script:

#!/bin/bash
# Script to stop immediately on non-zero exit code, carry exit code through
# pipe, and disallow unset variables
set -euo pipefail

OUTPUT=$(curl --silent --show-error --retry 10 --fail \
       http://localhost:$PORT/admin/api.php?summaryRaw)
echo "$OUTPUT"
echo "$OUTPUT" | \
    jq -r 'to_entries |
           map(.key + " " + (.value | tostring)) |
           .[]' | \
    awk -v date=$(date +%s) '{print "pihole." $1 " " $2 " " date}' \
    >>/dev/tcp/localhost/2003

Let’s review:

  • New script relies on system utilities and jq, which is found in the default ubuntu repo.
  • Logging output into journald provides a cost free debugging tool if things go astray
  • Anything other than what’s expected will cause the service to fail and notify systemd, which will try it again in a minute
  • Starts on computer boot

Sounds like an improvement!

Grafana

Now that we have our data robustly inserted into graphite, it’s time to graph it! The two data points we’re interested in are dns_queries_today and ads_blocked_today. Since they are counts that reset every 24 hours, we’ll calculate the derivative so we can get a hit count.

grafana dashboard

alias(summarize(nonNegativeDerivative(keepLastValue(pihole.dns_queries_today)),
	'$interval', 'sum'), 'queries')

alias(summarize(nonNegativeDerivative(keepLastValue(pihole.ads_blocked_today)),
	'$interval', 'sum'), 'blocked')

The best part might just be that I can add a link in the graph that directs me to the pi-hole admin for those situations when I need to see the full dashboard.