F# and Data go Hand in hand

The F# logo juxtaposed with some F# code

As a programmer, I’m constantly on the lookout for ways to do less work, both in the initial writing of code and in its maintenance. A great example of this is my quest to write the perfect library for dealing with Paradox Interactive’s custom data format (briefly highlighted here). I’ve been trying to work with this format for eight years. As an aside: I just had a mini heart attack; eight years is a long time for a 22 year old! Anyway, it’s safe to say that I have a decent idea of what works and what doesn’t.

Here’s the crux of the problem, as I wrote in an earlier post:

Between expansions and patches, Paradox will change the [data] (add/remove/change fields), which breaks compatibility with the parser.

The question now becomes: how can I code against an ever-changing spec, preserve backward compatibility, and not lose my sanity in the process? I believe F# fills this role. Before I talk about F#, I’ll briefly highlight past attempts, some successful, some not.

C#

I’m proud of the parser I wrote in C#. It’s called Pdoxcl2Sharp, and I’ve been working on it over the years, refining the API and making it fast. The design ended up as a SAX-style parser with a single event (callback) per token. I’ve used it in various projects. The only open-sourced one is EU4.Savegame, but others have found the library useful too.

While Pdoxcl2Sharp gets the job done, it can be a pain. There can be several dozen types in the file you’re parsing, and each of your plain old data classes needs a custom parsing method. This quickly becomes tedious and error prone due to the frequent changes in the data. To mitigate the pain, I created a T4 template that automatically constructs the classes and their parsing and saving methods. That worked for a while, but there was too much churn when the data changed, even when the change involved a property I didn’t care about. The library was designed such that unexpected data could cause drastic side effects. Constantly updating the template became taxing enough to warrant looking at alternative implementations and languages.

JavaScript

I was really hoping that JavaScript would be the way forward for easy data parsing. It’s not, though. I’ve developed jomini and it works pretty well. Behind the scenes it uses jison, which has been nice (see how short the implementation is). However, performance is terrible. When parsing a 30MB file, memory usage spikes to 1.5GB (and it wants more, but Node limits the heap size) and then everything becomes unresponsive. The parser never finishes the file. Smaller files are fine, though.

I’m a determined person, so I decided to look into whether I could improve the performance. The first step was to stream the file while parsing it. This turns out to be incredibly difficult, due to JavaScript’s emphasis on non-blocking I/O. There is no way in JavaScript to request additional data from a file in small increments and continue parsing. If you want a streaming library, you must build a gigantic state machine, because you never know when or how much data the runtime will hand you. For instance, here is the code for a streaming JSON parser and here is the code for zip streams. The implementation is hard to get right. I have a branch where I try mimicking their strategies, but it has been too tedious to build out the state machine and cover all the possibilities.

The nail in the coffin for the JavaScript implementation came while mulling over what advantages JavaScript in the browser or on the server would actually bring. The only tangible benefits would be that everyone could use it (everyone has a browser) and access to the vast ecosystem. These are not small benefits, but the question I have a hard time answering is: what would people use the library for? All the tools developed for games built with the Clausewitz engine, the game engine that generates files in this format, are desktop tools. To me, JavaScript on the desktop is still unproven. I realize there are projects such as NW.js that make it possible, but I’d rather not venture into unexplored territory with it, though I’ll keep an eye on it.

So there may be a JavaScript parser one day, but I’ve become too disheartened during development to continue it for now.

F#

Enter F# and Pfarah the library (it’s pronounced Farrah).

As the developer, the implementation is sane. We’re looking at 200 lines of a recursive descent parser that produces a DOM-like structure. It’s short, easy to read, and uses a combination of functional and imperative programming. Writing tests is a breeze for exactly the same reasons. To put it simply, it was a joy to write and hardly took any time at all. After only a couple of hours I went from idea to a fully fledged prototype that would provide a solid foundation.
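To give a flavor of the approach, here is a highly simplified sketch of a recursive descent parser for this key=value, nested-brace format. This is not Pfarah’s actual implementation; the ParaValue shape and the tokenizer are illustrative only (quoted strings keep their quotes, for instance):

```fsharp
// Illustrative sketch only, not Pfarah's real parser.
type ParaValue =
    | Scalar of string
    | Record of (string * ParaValue) list

// Split the input on braces, '=', and whitespace. A real tokenizer would
// handle quoted strings properly; this one leaves the quotes in place.
let tokenize (text: string) =
    text.Replace("{", " { ").Replace("}", " } ").Replace("=", " = ")
        .Split([| ' '; '\n'; '\r'; '\t' |], System.StringSplitOptions.RemoveEmptyEntries)
    |> Array.toList

// Consume key=value pairs until a closing brace or the end of input,
// returning the accumulated pairs plus the unconsumed tokens.
let rec parsePairs tokens acc =
    match tokens with
    | [] -> List.rev acc, []
    | "}" :: rest -> List.rev acc, rest
    | key :: "=" :: "{" :: rest ->
        let nested, rest' = parsePairs rest []
        parsePairs rest' ((key, Record nested) :: acc)
    | key :: "=" :: value :: rest ->
        parsePairs rest ((key, Scalar value) :: acc)
    | _ -> failwith "malformed input"

let parse text = Record (fst (parsePairs (tokenize text) []))
```

Run against the trade-node data shown later in this post, `parse` yields a nested Record tree that can be walked just like the DOM-like structure described above.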

Users of the library should enjoy it. The API surface area is small (three typical static parsing methods plus several extensions), which facilitates productivity. There is a great tutorial that walks users through typical data scenarios. It’s short and easy to understand.

Performance

The library, and by extension F#, is fast. Since the most likely use of the library is reading files, it will usually be limited by disk speed. If the file happens to be in cache or in RAM, the parser can still chew through 30MB/s, which should be plenty fast. I profiled Pfarah in Visual Studio on a file that was in cache. Below is a screenshot of where the CPU spends most of its time.

80% of the CPU time is spent in the underlying runtime! There are still performance improvements available, though they may come at the cost of readability, because they involve working around the standard .NET library. For instance, the line where I check whether every character in a string is numeric costs about 12% of CPU time. If I know the data, and I know that the check will most likely be false, I can short-circuit the potential overhead. This means rewriting from

// Only strings made up entirely of digits and dots can be dates
if str |> Seq.forall (fun c -> isNum c || c = '.') then
  match str.Split('.') with
  | [|y;m;d|] -> Some(new DateTime(int y, int m, int d))
  | [|y;m;d;h|] -> Some(new DateTime(int y, int m, int d, int h, 0, 0))
  | _ -> None
else
  None

to

// Check to see if the string exists and the first character is numeric before
// processing
if (String.IsNullOrEmpty(str) || isNum (str.[0]) = false) then
  None
else if str |> Seq.forall (fun c -> isNum c || c = '.') then
  match str.Split('.') with
  | [|y;m;d|] -> Some(new DateTime(int y, int m, int d))
  | [|y;m;d;h|] -> Some(new DateTime(int y, int m, int d, int h, 0, 0))
  | _ -> None
else
  None

Adding those couple of lines saved a few percent of CPU time. Worth it? Debatable. This is probably the most painless optimization, but unless an optimization has large, perceivable benefits, I would rather not sacrifice readability. As an example, I could replace str.Split('.') (7% of CPU time after the previous optimization) with a tailored function that doesn’t create an intermediate array, but that would probably mean an additional 40 lines of code, and in my opinion that optimization isn’t worth it. Before you think I’m anti-optimization, I’ll have you know that I replaced the .NET number parser with my own, which saved an incredible 40% of CPU time. Know when to optimize and what to optimize. A corollary: no one knows your data like you do (i.e., the .NET team sacrificed performance to handle most use cases, which is a good tradeoff).
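To illustrate what a specialized number parser can look like, here is a hedged sketch. It is not the parser Pfarah actually ships; it assumes non-negative decimals with ASCII digits and at most one dot, which covers values like 1.000 in this data, and it skips the culture handling and edge cases that make Double.Parse general-purpose (and slower):

```fsharp
// Hedged sketch of a specialized decimal parser. Assumes non-negative
// values, ASCII digits, and at most one '.', unlike the general-purpose
// System.Double.Parse.
let tryParseNumber (s: string) =
    let mutable acc = 0.0      // accumulated value
    let mutable scale = 0.0    // 0.0 until we see '.', then the place value
    let mutable ok = s.Length > 0
    for c in s do
        if c >= '0' && c <= '9' then
            let d = float (int c - int '0')
            if scale = 0.0 then
                acc <- acc * 10.0 + d          // still in the integer part
            else
                acc <- acc + d * scale         // fractional digit
                scale <- scale / 10.0
        elif c = '.' && scale = 0.0 then
            scale <- 0.1                       // first (and only) dot
        else
            ok <- false                        // anything else is a failure
    if ok then Some acc else None
```

Because the loop touches each character exactly once and allocates nothing, a parser in this spirit is how a hand-rolled replacement can beat the framework routine on known data.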

To drive my point home about performance: if you have the data but not the performance, you might as well not have the data. F# has great performance out of the box, and the tooling turns it up a notch (all the way to 11). With F#’s REPL, both online and off, you can iterate on algorithms as quickly as if you were using Python or R. This sets F# apart from many other statically typed languages, which lack a REPL.

The Community

I much prefer the community of F# to that of C#. fsharp.org is a beautiful site that has all the tools a beginner needs. They also have scaffolding that will create a project with a cross-platform build system, documentation, and a release process; I’m using it for all my .NET projects now. Pfarah is the only library of mine that has its own website, and that’s simply because the scaffolding plus GitHub Pages makes setting one up so easy. Using the scaffolding is clearly the superior way to start anything, whether a library or a full-fledged program. Another aspect of the community I like is the willingness to help others; there are many projects on GitHub to look at and learn from.

Disadvantages

There are disadvantages to F#, such as adoption. F# is not popular. I’m not sure why (maybe C# is good enough that there is little pressure to switch, whereas Java’s shortcomings drove adoption of Scala). But just as Java and Scala share the JVM, F# and C# share the .NET runtime. F# code can be invoked from C#, so I don’t see choosing F# as a loss of reach. Some people also shy away from C# and F# because they fear being limited to Windows, which is not true: Mono, which I’ve talked about before, has always been cross-platform, and now .NET itself is coming to Linux.

I once met a statistician who, upon hearing that I was generating statistics by running data through F#, was mortified; he couldn’t understand why I had chosen F# over traditional choices such as Python, R, Excel, Matlab, and Mathematica. Well, F# plays nice and interoperates with all of those.

So after putting a positive spin on each of the previously mentioned disadvantages, there aren’t really any disadvantages! That’s not entirely true; there are some places where F# is a bit out of place. For instance, you’d still want to write the MVVM ViewModels of your WPF app in C#, because all the tooling is geared toward that use case.

An Example

Given the following data:

trade={
  node={
    name="foo"
    powers={
      USA=1.000
      FRA=2.000
    }
  }
  node={
    name="bar"
    powers={
      USA=3.000
    }
  }
}

We can find the top ten nations by power across trade nodes with the following (note that the code uses a couple of Pfarah specifics, but nothing too upsetting):

let data = ParaValue.Parse (* string/file *)

data?trade
|> collect "node"
|> Array.collect (fun x -> x?powers |> asRecord)
|> Seq.groupBy fst
|> Seq.map (fun (key, grp) ->
    (key, grp |> Seq.map (snd >> asFloat) |> Seq.sum))
|> Seq.sortBy (snd >> (~-))
|> Seq.truncate 10
|> Seq.iter (fun (country, sum) -> printfn "%s: %f" country sum)

To me the code is definitely readable thanks to the |> (forward pipe) operator, which takes the output of one expression and pipes it into the next function. The previous snippet effectively means:

From the “trade” property, collect all the “node” properties into an array. From all the “powers” properties in all the nodes, group by the nation’s name. Take each group and sum the floating point numbers. Sort the nations by their total power in descending order. Take up to ten and print each name and sum.

F# allows the reader to follow the flow of manipulation, and if at any time they want to double-check a result, they can paste the code into the aforementioned REPL.

What if the data changes? How does that affect our code? The optimal case is that the changed properties don’t affect us because we never relied on them; nothing changes. That should be the majority of cases, but if the code assumed a certain structure of the data, there are easy ways to adapt it so that the assumed property becomes optional, maintaining backward compatibility.
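One way to sketch that adaptation: wrap the property lookup in an option so a removed field degrades to a default instead of an error. The ParaValue shape, tryProp, and asFloat below are simplified stand-ins, not Pfarah’s exact API, and "inflation" is a hypothetical field:

```fsharp
// Simplified stand-ins for illustration; not Pfarah's exact API.
type ParaValue =
    | Scalar of string
    | Record of (string * ParaValue) list

// Look up a property by name, returning None instead of failing when a
// patch has removed the field.
let tryProp name = function
    | Record pairs ->
        pairs |> List.tryPick (fun (k, v) -> if k = name then Some v else None)
    | _ -> None

let asFloat = function
    | Scalar s -> float s
    | _ -> failwith "not a number"

// Old code assumed the (hypothetical) "inflation" field always exists;
// this version falls back to 0.0 when an expansion or patch drops it.
let inflation country =
    country
    |> tryProp "inflation"
    |> Option.map asFloat
    |> Option.defaultValue 0.0
```

Code written this way keeps compiling and running across patches, at the small cost of deciding on a sensible default for each optional field.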

What’s Next

A Type Provider

It’d be awesome to create a type provider for the data format. For the unfamiliar, there is a great explanatory video from a .NET conference on F# and data that covers type providers; essentially, a type provider creates class structures on the fly based on sample data. Instead of querying data with strings (as shown in the example) and checking types by hand, you interact with the data as if it were a normal object, with types inferred from the sample. The one downside for clients that no one seems to talk about is that the type provider is only as good as the sample! If your sample is not representative of the actual data (someone removes or adds a property), what is the expected behavior? Part of me wants it to fail hard and fast with a helpful error message so the sample can be fixed for everyone; the consequences may involve a version bump in the library and redistribution of code. On the other hand, carrying on may mean a degraded experience for the user (back to string lookups and type checking, or ignoring parts of the data), and the original sample may never be amended.

To get a better understanding of the problem and a potential solution, here’s some JSON. Let’s say that our sample document is the following.

{
  "countries": {
    "HAB": {
      "small": true,
      "treasury": 10.000
    },
    "FRA": {
      "small": false
    }
  }
}

and this is the real document

{
  "countries": {
    "HAB": {
      "small": true
    },
    "GBR": {
      "small": true
    },
    "FRA": {
      "small": false
    }
  }
}

How does the F# Data JSON type provider stand up to this? If you try it out, you’ll notice that each generated object carries a JsonValue property, so you can always reach the underlying data structure if needed. However, accessing a property that appears in the sample document but not in the real document throws an exception. So while the type provider provides the convenience of dot notation and type inference, we don’t lose any flexibility.
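Here is a hedged sketch of that behavior using the F# Data JsonProvider against the two documents above. It assumes the FSharp.Data package is referenced, and the inferred property names and casing are illustrative, since the provider derives them from the sample:

```fsharp
// Requires the FSharp.Data package (e.g. #r "nuget: FSharp.Data" in a script).
open FSharp.Data

// The sample document from above, inlined as a literal. The provider
// infers types and property names from it.
type Save = JsonProvider<"""
{ "countries": { "HAB": { "small": true, "treasury": 10.0 },
                 "FRA": { "small": false } } }
""">

// Parse the "real" document, which drops treasury and adds GBR.
let real = Save.Parse("""
{ "countries": { "HAB": { "small": true },
                 "GBR": { "small": true },
                 "FRA": { "small": false } } }
""")

// Present in both sample and data: typed access works.
let habIsSmall = real.Countries.Hab.Small

// "treasury" was inferred from the sample but is absent in the real
// document, so accessing it raises a runtime exception:
// let habTreasury = real.Countries.Hab.Treasury

// GBR was never in the sample, so it has no generated property, but the
// raw structure is still reachable through the JsonValue escape hatch.
let raw = real.Countries.JsonValue
```

The JsonValue escape hatch is exactly the flexibility referred to above: the typed surface degrades gracefully to the untyped one rather than hiding data.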

Still, from a functionality standpoint, it could use some work. The type provider can’t do much with the countries object precisely because it is an object! We want countries to be an array of similar objects. To achieve this, we’d want the type provider to act as if it were generating a type from the following JSON.

{
  "countries": [{
    "abbreviation": "HAB",
    "small": true,
    "treasury": 10.000
  }, {
    "abbreviation": "FRA",
    "small": false
  }]
}

This is not possible right now in F# Data, so I have opened an issue: data transformation before type generation. Sadly, it doesn’t look like it will be possible without intermediate steps, but I should forge on, as there will be benefits either with the intermediate steps or with partial type inference.

The next issue is implementation. If you look at the F# Data type provider implementation or the F# Configuration implementation, these files are over 2,500 lines of code, which is 10x the total number of lines in my project! Thankfully, there is a starter kit, used by both the R provider and the Apiary provider, which greatly simplifies implementation.

After everything I’ve said, I believe implementing a type provider is worth it. It allows interaction with the data in a type-safe and convenient way in just a few lines of code. Previously, I went so far as to write a massive T4 template in my own domain language to achieve similar goals, and from experience, maintaining it has not been easy. Type providers are an excellent return on investment.

A GUI

Not everyone is a command-line genius. In fact, I’d go so far as to say most people aren’t. The problem then becomes how to put our F# library to use in a GUI. I’m currently working on a side project that hooks Pfarah up to the F# backend of a C# WPF application made with Caliburn.Micro, MahApps, and Material Design. The hope is to one day release it to the community, but until then I’ll keep working on it in my garage (so to speak).
