Robust systemd Script to Store Pi-hole Metrics in GraphitePublished on
Ah, a serene lake of … metrics
I’m building a home server and I’m using Pi-hole to blackhole ad server domains in a test environment. It’s not perfect, but works well enough. The Pi-hole admin interface shows a dashboard like the following to everyone (admin users get an even greater breakdown, but we’ll focus on just the depicted dashboard)
It’s a nice looking dashboard; however, it’s a standalone dashboard that doesn’t fit in with my grafana dashboards. I don’t want to go to multiple webpages to know the health of the system especially since I want to know two things: is pi-hole servicing DNS requests and are ads being blocked. This post will walk through exporting this data from pi-hole into Graphite, which is my time series database of choice.
There is a blog post that contains a python script that will do this process. Not to pick on the author (they inspired me) but there are a couple of things that can be improved:
- The python script is not standalone. It requires the requests library. While ubiquitous, the library is not in the standard library and will require either a virtual environment or installation into the default python
- Any error will crash the program (eg connection refused, different json response, etc) and it’ll have to be manually restarted
- No logs to know if the program is running or why it would have crashed
- Does not start on computer reboot
Let’s start with making a network request:
Ah, let’s turn to jq to massage this data, which will, by default, prettyify the output
We somehow need to get the previous data into the
<path> <value> <timestamp> format for carbon with lines seperated by newlines.
Since jq prefers working with arrays, we’ll transform the object into an array:
Now we’re going to transform each element of the array into a string of
$key $value with
jq 'to_entries | map(.key + " " + (.value | tostring))'. Value is numeric and had to converted into a string.
Finally, unwrap the array and string with a
jq -r '... | .' to get:
We’re close to our desired format. All that is left is an awk oneliner:
So what does our command look like?
Is this still considered a one-liner at this point?
I’ve add some commandline options to curl so that all non-200 status codes are errors and that curl will retry 5 times up to about a half a minute to let the applications finish booting.
We could just stick this in cron and call it a day, but we can do better.
systemd allows for some nice controls over our script that will solve the rest of the pain points with the python script. One of those pain points is logging. It would be nice to log the response sent back from the API so we’ll know what fields were added or modified. Since our script doesn’t output anything, we’ll capture the curl output and log that (see final script to see modification, but it’s minimal).
With that prepped, let’s create
Type=oneshot: great for scripts that exit after finishing their job
StandardOutput=: Has the stdout go to journald (which is indexed, log-rotated, the works). Standard error inherits from standard out. Read a previous article that I’ve written about journald
Environment=PORT=32768: sets the environment for the script (allows a bit of configuration)
After reloading the daemon to find our new service, we can run it with the following:
If we included an
exit 1 in the script, the
status of the service would be failed even though it is
oneshot and the log file will let us know the data that failed it or if there was a connection refused (printed to standard error). This allows systemd to answer the question “what services are currently in the failed state” and I’d imagine that one could create generic alerts off that data.
One of the last things we need to do is create a timer to be triggered every minute.
It might annoy some people that the amount of configuration is about the same number of lines as our script, but we gained a lot. In a previous version of the script, I was preserving standard out by using
tee with process substitution to keep the script concise. This resulted in logs showing the script running every minute, but the data in graphite only captured approximately every other point. Since I knew from the logs that the command successfully exited, I realized process substitution happens asynchronously, so there was a race condition between
tee finishing and sending the request. Simply removing
tee for a temporary buffer proved effective enough for me, though there reportedly are ways of working around the race condition.
- New script relies on system utilities and jq, which is found in the default ubuntu repo.
- Logging output into journald provides a cost free debugging tool if things go astray
- Anything other than what’s expected will cause the service to fail and notify systemd, which will try it again in a minute
- Starts on computer boot
Sounds like an improvement!
Now that we have our data robustly inserted into graphite, now to time to graph it! The two data points we’re interested in are
ads_blocked_today. Since they are counts that are reset after 24 hours, we’ll calculate the derivative so we can get a hitcount.
The best part might just be that I can add in a link in the graph that will direct me to the pi-hole admin in the situations when I need to see the full dashboard.