Creating a parallel AMP site with Jekyll

Google AMP (Accelerated Mobile Pages) is a strict subset of HTML that essentially allows only content and a subset of supported inline CSS. Nearly all of the case studies are news sites. Google will give preference to valid AMP sites for mobile searches, if only because Google can ensure the page is lightweight and fast. Though I attribute the project to Google, since Google is leading the evangelizing effort, Bing has joined in and Cloudflare hosts AMP pages too.

While I’ve known about AMP since last year, I’ve largely ignored it (I prefer to think I was waiting for the project to mature). Last weekend, I was chatting with a friend who works at a large travel corporation. She mentioned how they were working on creating an AMP-enabled site because they’ve noticed their site falling in search rankings. Their site is a highly dynamic React app – as one might imagine – but how could they use AMP when first-party javascript is disabled? She told me the homepage will be server-side rendered, so it’s the only page that will be AMP enabled. It’ll have plenty of links to their main site where users can actually do searching and bookings. It’s a little cheeky, but I hear SEO is cutthroat, and if a large part of your business comes through search engines, I think it’s reasonable to try new tech and see if it increases the rank.

On a side note, I visited the travel site to find they have 250 elements in their head element. As an unfair comparison, this page has 5 at the time of this writing.

But this discussion got me thinking about how easy it would be for me to “AMPlify” this site, which is a Jekyll site. The history of this site has basically been that of a testing ground for new technologies.

AMPlifying with amp-jekyll

Jekyll is one of the most popular static site generators. As a result, there is a rich list of plugins and amp-jekyll is one of them.

Amp-jekyll is still maturing and the currently released version (1.0.1) does not support Jekyll permalinks, so I had to reference the repo directly in the Gemfile.

gem 'amp-jekyll', :git => 'https://github.com/juusaw/amp-jekyll'
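
If you’re following along, pulling the gem straight from git is just a bundle away. A minimal sketch, assuming amp-jekyll is also listed under the plugins/gems key in _config.yml:

# fetch the git-referenced gem and regenerate the site with the plugin active
bundle install
bundle exec jekyll build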

One of the first steps is to copy their amp.html for your own use. Since I have a stylesheet that is preprocessed and minified, I needed to include it in my amp.html. Typically, built pages use a link to reference the stylesheet, but since AMP requires all CSS to be inlined, I had to go with an include directive.

<style amp-custom>
  {% include styles.css %}
</style>

This did not work out of the box for me, so I created the following script to execute on every build to make sure the included css fit AMP requirements.

# Copy the built css into the includes directory so that it can be used in the
# amp pages
cp _assets/build/styles.css _includes/.

# AMP doesn't allow the !important flag that Pure's CSS uses, so we
# remove it!
sed -i -e 's/!important//g' _includes/styles.css

# Pure contains an IE7-specific hack that resets a style using an intentional
# syntax error. This is not needed for AMP, so we'll remove it from the css.
# See stackoverflow for more information https://stackoverflow.com/q/1690642/433785
sed -i -e 's/*display:inline;//g' _includes/styles.css
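
To double-check the result, the official validator ships a CLI through npm. A sketch, where the output path is just an assumption about where Jekyll writes a post’s AMP sibling:

# install the validator CLI and point it at a generated AMP page
npm install -g amphtml-validator
amphtml-validator _site/amp/blog/some-post/index.html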

For regular posts, I updated their layout to include an amphtml link so that web crawlers know that I support AMP and where the AMP pages live.

{% if page.path contains '_posts' %}
  <link rel="amphtml" href="{{ page.id | prepend: '/amp' | prepend: site.baseurl | prepend: site.url }}">
{% endif %}

Now when I build the site, all blog posts have a sibling, AMP-enabled page that lives at the same URL except it’s prefixed with /amp. So a post at, say, /blog/some-post/ gets an AMP twin at /amp/blog/some-post/.

Unanticipated Benefits

I had three posts where I forgot the /blog prefix in the URL. Amp-jekyll made these obvious, so I fixed the URLs. Yes, I realize changing URLs after the fact breaks the internet, but these weren’t popular articles – not popular enough for me to set up redirects.

I also had broken image links in older posts because I was using raw figure html elements instead of the img directive.

Criticisms for AMP

The top articles on Hacker News about AMP are all criticisms, and I’d be remiss if this section were omitted.

The articles are relatively succinct, so I recommend reading them – they’re decent, but I believe they’re overreacting:

  • Google doesn’t actually own AMP. AMP has an open governance model. Only when no one is working on the project does Google give itself the power to appoint a leader – and if no one is working on the project, does it even matter?
  • Content is not loaded solely from Google’s servers. Cloudflare hosts AMP content as well. Speaking of Cloudflare, I’m not sure why having external sites cache content on edge servers rubs people the wrong way. It alleviates concerns about DDOS attacks and site maintenance.
  • You can create your own AMP cache.
  • Links work as normal. If a user clicks on an AMP link, they’re taken to a feather-light page which has all the content they want; if a user clicks on a non-AMP link, they get all the features of the full app.
  • The subset of CSS allowed is annoying, if only for the fact that one has to make small tweaks to get the CSS compliant. Without this subset, AMP would not be able to assume the layout was fast and sensible. It’s a tough sell, as no one wants to create custom CSS for AMP, but I understand the reasoning. Thankfully, this blog does not use a lot of CSS – only a few sed statements were needed to become AMP compliant.
  • I’ll agree that removing AMP content from Google’s AMP cache isn’t the most user-friendly process. Sending an “update-ping” to get immediate removal is tedious at best, but “cached content that no longer exists will eventually get removed from the cache,” so one can just wait for the expiration. Since cache busting is a tough problem, this is a fine solution, though an optimal solution would be like Cloudflare’s, where they have a nice UI for cache busting.
  • One isn’t giving up anything by having a parallel AMP site. AMP content is still served from the origin, and if there happens to be an AMP cache server present, the content is served from there. It’s not like whoever hosts the AMP server owns the content. It’s simply an optimization for those who seek the content in a nice layout.
  • Some lament that AMP would be good for fake news sites. I’m not sure where this idea sprouted from. I believe that satirical newspapers like The Onion should be able to participate in AMP and reap any potential benefits.

One thing that I find very interesting is the topic of Subresource Integrity, where a client can specify the hash of an incoming file; if the hashes do not match, the file is rejected. The problem is that AMP defines itself as an evergreen library, so a single URL will always point to the latest version of the library. This means that someone could compromise the javascript payload and clients would have no way of knowing. See the following Github issue for more information. I do have to chuckle, though, as there has been a recent trend towards digesting all assets, and now Google is pushing a platform that is fundamentally opposed to identifying payloads based on hash (which one would normally think increases security and speed).

With all the back and forth, why did I use AMP? It was a combination of two reasons. First, this site is a testbed and I wanted to answer the question of what effort is required. Since I only spent a couple hours adding this feature, the effort was low. Second, will tons of traffic pour in from the increase in search rank? I won’t know unless I try! Before we part, I have to embarrassingly admit, given my stance on the importance of metrics, that this site employs zero analytics, so I will have a hard time determining improvements unless there is an order-of-magnitude increase. AMP does have a whole section on analytics, so this may be an area worth exploring.


Introduction to Journald and Structured Logging

(Image: grilled cheese – the mildly interesting depictions one finds in their journal)

To start, Journald’s minimalist webpage describes it as a:

service that collects and stores logging data. It creates and maintains structured, indexed journals.

With one of the sources of logging data being:

Structured log messages

Before diving into journald, what’s structured logging?

Structured Logging

Imagine the following hypothetical log outputs

"INFO: On May 22nd 2017 5:30pm UTC, Nick Foobar accessed gooogle.com/#q=what is love in 52ms"

And

{
  "datetime": "2017-05-22T17:30:00Z",
  "level": "INFO",
  "user": "Nick Foobar",
  "url": "gooogle.com/#q=what is love",
  "responseTime": 52
}

The first example is unstructured, or simple textual, logging. The date format makes it a tad contrived, but the log line contains everything in a rather human-readable format. From the line, I know when someone accessed what and how fast – in a sentence-like structure. If I were tasked with reading log lines all day, I would prefer the first output.

The second output is an example of structured logging using JSON. Notice that it conveys the same information, but instead of a sentence, the output is a set of key-value pairs. Considering JSON support is ubiquitous, querying and retrieving values would be trivial in any programming language, whereas one would need to parse the first output meticulously to avoid ambiguity. For instance, not everyone has a given name and a last name, response-time units need to be parsed, the URL is arbitrary, time zones need converting, etc. There are way too many pitfalls if one stored their logs in the first format – it would be too hard to analyze consistently.

One could massage their textual log format into a semi-structured output using colons as field delimiters.

"INFO: 2017-05-22T17:30:00Z: Nick Foobar: gooogle.com/#q=what is love: 52"

This may be a happy medium for those unable or unwilling to adopt structured logging, but there are still pitfalls. To give an example: if I wanted to find all the log statements with a WARN level, I would have to remember to match only against the beginning of the log line, or I run the risk of matching WARN in the user name or in the URL. What if I wanted to find all of the searches by Ben Stiller? I’d need to be careful to exclude the lines where people are searching for “Who is Ben Stiller”. These examples are not artificial either, as yours truly has fallen victim to several of these mistakes.
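
For comparison, both of those questions become unambiguous one-liners against the JSON output shown earlier. A sketch with jq, assuming the entries are written one JSON object per line to a hypothetical app.log:

# every WARN entry, matched on the level field and nothing else
jq -c 'select(.level == "WARN")' app.log

# every search performed by Ben Stiller, without catching "Who is Ben Stiller" queries
jq -c 'select(.user == "Ben Stiller")' app.log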

Let’s say that one does manage to gain some insight from the textual format using text manipulation. If the log format were to ever change (e.g. transposing response time and URL, logging more data, etc.), the log-parsing code would break. So if you’re planning on gaining insight from text logs, make sure you define a rigorous standard first!

Structured logging also opens up the possibility of working with types. Instead of working with only strings, JSON has a numeric type, so one doesn’t need a conversion step when analyzing.

Structured logging doesn’t need to be JSON, but JSON is a common format in log management suites like Graylog, Logstash, Fluentd, etc.

The only downsides that I’ve seen for structured logging (and specifically JSON structured logging) are log file size increases due to the added keys for disambiguation, and the format won’t be in a grammatically correct English sentence! These seem like minor downsides for the benefit of easier log analysis.

Journald

Now that we’ve established the case for structured logging, on to journald. Be warned, this is a much more controversial topic.

Journald is the logging component of systemd, which was a rethinking of Linux’s boot and process management. A lot of feathers were ruffled, and are still ruffled, by the movement towards systemd (1, 2, 3, 4, 5). Wow, a multitude of complaints. There must be several redeeming qualities to systemd, because most distros are converging on it. I won’t be talking about systemd, but rather the logging component.

To put it simply, journald is a structured, binary log that is indexed and rotated. It was introduced in 2011.

Here’s how we would query the log for all messages written by sshd

$  journalctl _COMM=sshd
-- Logs begin at Thu 2017-05-18 23:43:18 EDT, end at Mon 2017-05-22 16:05:29 EDT. --
May 19 16:57:31 vm-ubuntu sshd[19494]: syslogin_perform_logout: logout() returned an error
May 19 16:57:31 vm-ubuntu sshd[19494]: pam_unix(sshd:session): session closed for user nick
May 22 09:03:40 vm-ubuntu sshd[5311]: Accepted password for nick from 192.168.137.1 port 56618 ssh2
May 22 09:03:40 vm-ubuntu sshd[5311]: pam_unix(sshd:session): session opened for user nick by (uid=0)

For all sshd messages since yesterday

$  journalctl -S yesterday  _COMM=sshd
-- Logs begin at Thu 2017-05-18 23:43:18 EDT, end at Mon 2017-05-22 16:10:59 EDT. --
May 22 09:03:40 vm-ubuntu sshd[5311]: Accepted password for nick from 192.168.137.1 port 56618 ssh2
May 22 09:03:40 vm-ubuntu sshd[5311]: pam_unix(sshd:session): session opened for user nick by (uid=0)

To view properties for autossh and sshd messages since yesterday (output truncated to first event)

$  journalctl -o verbose -S yesterday  _COMM=sshd + _COMM=autossh
-- Logs begin at Thu 2017-05-18 23:43:18 EDT, end at Mon 2017-05-22 16:12:45 EDT. --
Mon 2017-05-22 07:01:20.894720 EDT
    PRIORITY=6
    _UID=1000
    _GID=1000
    _CAP_EFFECTIVE=0
    _SYSTEMD_OWNER_UID=1000
    _SYSTEMD_SLICE=user-1000.slice
    _BOOT_ID=...
    _MACHINE_ID=...
    _HOSTNAME=vm-ubuntu
    _TRANSPORT=syslog
    _AUDIT_LOGINUID=1000
    SYSLOG_FACILITY=1
    SYSLOG_IDENTIFIER=autossh
    SYSLOG_PID=46342
    MESSAGE=timeout polling to accept read connection
    _PID=46342
    _COMM=autossh
    _EXE=/usr/lib/autossh/autossh
    _CMDLINE=/usr/lib/autossh/autossh <IP>
    _AUDIT_SESSION=2
    _SYSTEMD_CGROUP=/user.slice/user-1000.slice/session-2.scope
    _SYSTEMD_SESSION=2
    _SYSTEMD_UNIT=session-2.scope
    _SOURCE_REALTIME_TIMESTAMP=...
...

To find all events logged through the journal API for autossh (if a + is included in the command, it means “OR”; otherwise entries need to match both expressions):

$ journalctl _TRANSPORT=syslog _COMM=autosshd
-- No entries --

Find all possible values written for a given field:

$  journalctl --field=_TRANSPORT
syslog
journal
stdout

What I think about journald

I want journald to be the next big thing. Having one place on your server where all logs are sent* sounds like a pipe dream. No longer do I have to look up where logfiles are stored.

Journald has nice size based log rotation, meaning I no longer have to be woken up at night because a rogue log grew unbounded, which could degrade other services.
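
For the curious, here’s a sketch of inspecting and capping the journal’s footprint; SystemMaxUse is the relevant journald.conf option and the 500M figure is arbitrary:

# how much disk space the journal currently occupies
journalctl --disk-usage

# cap the persistent journal via /etc/systemd/journald.conf:
#   [Journal]
#   SystemMaxUse=500M

# or trim an oversized journal immediately
sudo journalctl --vacuum-size=500M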

Gone are the days of arguing about what format logs should be in – these arguments would be replaced with discussions about what metadata to expose.

With journald, I can cut down on the number of external services that each service talks to. Instead of having every service write metrics to carbon, metrics would be written to journald. This way applications don’t need to jump through the hoops of proper connection management: re-connect on every metric sent, keep a single persistent connection, or use some sort of hybrid? By logging to journald, carbon or the log forwarder can be down, but metrics will still be written to the local filesystem. There is very little that would cause an absolute data loss.
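
As a sketch of what that could look like from a shell – the METRIC_* field names are made up for illustration, and logger --journald comes from util-linux:

# write one structured entry with custom fields to the local journal
printf 'MESSAGE=request served\nMETRIC_NAME=response_time_ms\nMETRIC_VALUE=52\n' \
  | logger --journald

# a forwarder (or a human) can later pull just those entries back out
journalctl METRIC_NAME=response_time_ms -o json-pretty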

People can use the tools that they are most familiar with: some can use journalctl with the indexes on the local box, and others will want to see the bigger picture once the same logs are aggregated into another system.

* Technically the data may not be sent to a single file location as journald can be configured such that each user has their own journal – but journalctl abstracts that away such that users won’t know or care.

Complaints Against journald

  • Journald can’t be used outside of systemd, which limits it to only newer distros that have adopted systemd. I have CentOS 6 servers, so it’s a hard no to use journald on those systems.
  • Journald writes to a binary file that one can’t use standard unix tools to dissect, resulting in difficulty if the log becomes corrupt. If the log is not corrupt, one can pipe the output of journalctl to the standard tools.
  • There’s not a great story for centralizing journald files. The introduction mentioned copying the files to another server. People have found a way using journalctl -o json and sending the output to their favorite log aggregation service (see the sketch after this list).
  • A lot of third party plugins for journald ingestion for log management suites don’t appear well maintained.
  • It invented another logging service instead of working with pre-existing tools. Considering Syslog can work with structured data – that’s one less reason to switch to journald.
  • The data format is not standardized or documented well
  • Will not support encryption other than file-system encryption. If a user has access to the file system and has permission to read the log file, all logs will be available.
  • No way to exclude sensitive information from the log (like passwords on the commandline) – though you’re probably doing something wrong if this is an issue.
  • The best way to communicate with journald programmatically seems to be either through the C API or journalctl.
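
As a rough sketch of that journalctl -o json approach (the collector endpoint is hypothetical):

# follow the journal as newline-delimited JSON and ship each entry to an aggregator
journalctl -o json -f --no-pager | while read -r entry; do
  curl -sS -X POST -H 'Content-Type: application/json' \
       --data "$entry" http://logs.example.com:9000/ingest > /dev/null
done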

With all these complaints, it may be a wonder why I lean towards advocating journald. By advocating structured data first, journald is setting the tone for the logging ecosystem. Yes, I know that journald is far from the first, but the simplicity of having a single, queryable, structured log baked into the machine is admirable.


Migrating to the new NET SDK MSBuild Project Files

(Image: a mountain road – the mountain road of migration that isn’t all that scary)

Farmhash.Sharp is a pet project of mine where I port Google’s Farmhash algorithm to the .NET platform, and I recently migrated it to use the new MSBuild architecture. Previously the project was structured based on ProjectScaffold, but I found that structure to be deficient. While the migration was not without some difficulties, in the end I’ve found it to be worthwhile, and figured I’d document the process for my future self and others.

There was not a single keystone issue that caused the migration, but rather a series of small, annoying issues.

F#

The testing project for Farmhash.Sharp was an F# project using Nunit. Under .NET Core, a MarshalDirectiveException would be thrown:

System.Runtime.InteropServices.MarshalDirectiveException: Cannot marshal ‘parameter #2’: Unknown error.

The only internet help I found was an xunit issue with no resolution. I am guessing the root cause is that F# support for .NET Core is bad, even though the .NET Core train has been on the horizon for at least a couple of years. I have a love-hate relationship with F#. F# could be good, but one loses hope after years of trying to run a simple test project and receiving cryptic errors.

The fix was to convert the F# Nunit project to C# Xunit. With a little editor skill and InlineData, the tests were quickly migrated and can now be executed on a multitude of platforms.

Github Pages Site

ProjectScaffold sets up a basic documentation site. It does have some nice features: it’ll include release notes, the license, and an API reference page where you can see the code comments converted to Markdown. Also, if F# code is used in the code samples, users can hover over lines for IDE-like tooltips.

The downside is that I find the style outdated. The bootstrap version is 2.2.1, which was released five years ago. I could go through all the work to update the site to a newer bootstrap, jquery, and jquery UI, but I thought it prudent to just use a Github Pages supported Jekyll theme, which would be a lot quicker to update, as I only have to focus on the content. Losing the code tooltips is only a minor loss, as they were only for F#, and C# won the language war.

The only thing I really miss: since I have the release notes inlined in Github Pages, whenever I make a release I have to copy and paste them from the main repo into the gh-pages branch. That’s annoying, but I’ve only done three Farmhash.Sharp releases, so the annoyance is definitely tolerable.

Since moving to a Github Pages theme, I find the site to be much more mobile-friendly and stylish!

Paket and Fake vs .NET Cli and Nuget

The ProjectScaffold sets up a project to use the Paket dependency manager and the FAKE build tool. The problem with these tools is that they are quickly being outclassed. Nuget is now tightly integrated with MSBuild and has solved many of the problems that Paket originally touted solutions for. Now, instead of having way too many paket.* files, all dependencies are specified in the project file.

FAKE would build projects, build docs, create nuget packages, etc. While nice, a lot of that functionality is now built into the .NET cli, and since we’re using Jekyll and Github Pages, we no longer need our documentation built. The only features I miss are having the release notes and version propagated when they change, and FAKE creating Github releases. These losses are tolerable, as Farmhash.Sharp releases are few and far between, and updating release notes is a task that mainly involves copying and pasting.
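
For a sense of scale, here’s roughly what replaces the old FAKE targets day to day; the project paths are assumptions:

dotnet restore                                # resolve PackageReference dependencies
dotnet build -c Release                       # compile every target framework
dotnet test src/Farmhash.Test                 # run the xunit tests
dotnet pack -c Release src/Farmhash.Sharp     # produce the nuget package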

The bottom line is: why should I continue using these tools if 95% of the functionality I want is built into the new SDK, which will be much more approachable to new users?

The Migration

Hopefully the decision to move to C# everywhere with the new .NET SDK is reasonable. I’m tired of using non-first-class tools. Whether it’s Microsoft’s or someone else’s fault, I don’t care. I just want my project to be built, tested, and run on all relevant platforms. Trying to be clever with build tools is just as bad as trying to be clever with code.

While the .NET SDK does provide dotnet-migrate, I elected to start fresh, because the Farmhash.Sharp code is relatively small and starting over gave me an excellent chance to learn the new build system.
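
Starting fresh mostly amounted to re-scaffolding with the built-in templates. Roughly, with project names that are assumptions:

dotnet new sln -n Farmhash.Sharp
dotnet new classlib -n Farmhash.Sharp -o src/Farmhash.Sharp
dotnet new xunit -n Farmhash.Test -o src/Farmhash.Test
dotnet sln add src/Farmhash.Sharp/Farmhash.Sharp.csproj src/Farmhash.Test/Farmhash.Test.csproj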

How did the resulting project file for the main hashing algorithm turn out? You tell me – here it is, with the assembly information omitted (which, by the way, is awesome, as assembly info used to always be a source of conflicts or contention with git).

<Project Sdk="Microsoft.NET.Sdk">

  <PropertyGroup>
    <TargetFramework>netstandard1.0</TargetFramework>
    <AllowUnsafeBlocks>True</AllowUnsafeBlocks>
    <PlatformTarget>AnyCPU</PlatformTarget>
  </PropertyGroup>

</Project>

Let’s step it up a notch and take the benchmarking code, which runs benchmarks across .NET Core, Mono, and the full .NET framework. Some hash functions that will be benchmarked can only be run on the full .NET framework, so they need to be conditionally referenced. Below is the entirety of the benchmarking project file, which I believe is so self-explanatory that it can be left uncommented.

<Project Sdk="Microsoft.NET.Sdk">

  <PropertyGroup>
    <OutputType>Exe</OutputType>
    <TargetFrameworks>netcoreapp1.1;net46</TargetFrameworks>
    <PlatformTarget>AnyCPU</PlatformTarget>
    <AllowUnsafeBlocks>True</AllowUnsafeBlocks>
  </PropertyGroup>

  <PropertyGroup Condition=" '$(TargetFramework)' == 'net46' ">
    <DefineConstants>$(DefineConstants);CLASSIC</DefineConstants>
  </PropertyGroup>
  <PropertyGroup Condition=" '$(TargetFramework)' == 'netcoreapp1.1' ">
    <DefineConstants>$(DefineConstants);CORE</DefineConstants>
  </PropertyGroup>

  <ItemGroup>
    <PackageReference Include="BenchmarkDotNet" Version="0.10.3" />
  </ItemGroup>

  <ItemGroup>
    <ProjectReference Include="..\Farmhash.Sharp\Farmhash.Sharp.csproj" />
  </ItemGroup>

  <ItemGroup Condition="'$(TargetFramework)' == 'net46'">
    <PackageReference Include="CityHash.Net" Version="1.0.1"/>
    <PackageReference Include="SpookilySharp" Version="1.1.5128"/>
    <PackageReference Include="System.Data.HashFunction.CityHash" Version="1.8.2.2"/>
    <PackageReference Include="System.Data.HashFunction.SpookyHash" Version="1.8.2.2"/>
    <PackageReference Include="xxHashSharp" Version="1.0.0"/>
  </ItemGroup>

</Project>
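
With the multi-targeting in place, picking which runtime to benchmark is just a matter of the -f flag (the project directory name is an assumption):

cd src/Farmhash.Benchmarks
dotnet run -c Release -f netcoreapp1.1   # benchmark on .NET Core
dotnet run -c Release -f net46           # benchmark on the full framework (Windows)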

After all was said and done, the migration left the repo with 50% fewer lines. For a project where all the functionality is packed into 580 lines of code, that’s incredible.

Ensuring Farmhash.Sharp worked on Classic .NET and Mono

As part of the build process, I wanted to ensure that the built Farmhash.Sharp DLL could be consumed by a classic .NET MSBuild project as well as a Mono project. This was the hard part, though in retrospect it isn’t that hard as long as the following steps are taken.

  1. Create a regular .NET 4.5 project used for testing with Xunit, Nunit, etc. I am calling mine Classic.
  2. Exclude this project from the regular build process in the solution file.
  3. Add in a reference to the testing code and the nuget testing packages.
  4. In the travis and appveyor files, one will need to explicitly invoke nuget to restore these packages (remember, nuget isn’t integrated into classic projects the way it is with the new SDK!).
  5. Then invoke xbuild or msbuild (depending on platform) manually to build the test dll.
  6. Invoke the xunit console app with the output dll.
  7. Once the tests pass, you have confirmed that the library works as expected on Mono or the full .NET framework.

If anything is confusing, please see .travis.yml or appveyor.yml in the repo.
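
For reference, steps 4 through 6 boil down to something like the following on a Mono box; the project names, paths, and runner version are assumptions, and the authoritative commands live in the CI files above:

# step 4: classic projects need an explicit package restore
nuget restore src/Farmhash.Test.Classic/packages.config -PackagesDirectory packages

# step 5: build the test assembly (msbuild on Windows, xbuild on Mono)
xbuild /p:Configuration=Release src/Farmhash.Test.Classic/Farmhash.Test.Classic.csproj

# step 6: run the compiled tests with the xunit console runner
mono packages/xunit.runner.console.*/tools/xunit.console.exe \
     src/Farmhash.Test.Classic/bin/Release/Farmhash.Test.Classic.dll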

Conclusion

The new .NET SDK is a breeze to work with. Once MSBuild becomes cross-platform for Mono projects, there will be very little reason to ever want to use the old project files.