Processing arbitrary amount of data in PythonPublished on
A JSON stream in Python is surprisingly difficult
I’ve recently written about data processing in F#, and I thought I’d keep up the trend, but this time showcase a bit of python. A naive Pythonista can easily exhaust all available memory when working with large datasets, as the default Python data structure, which are easy to use, eagerly evaluate elements causing a million elements to be processed in memory instead of processing one element at a time in memory.
First we need data. A few minutes of searching the internet last night led me to a csv dump of Reddit’s voting habits. Let’s download the data to votes.csv.
Discover a bit of what the data looks like by taking the top 10 lines from the files with
There are no headers and we can see the format looks like
username,link,vote where the vote is either -1 or 1. This made me curious if there were votes other than -1 or 1. Maybe there is a super user or admin that can have five votes. Luckily, standard linux tools come to our rescue.
As an aside, csv mongers are probably upset that
cut was used instead of some robust and accurate csv tool that can handle escaping characters, but by the output we can see that it is not a problem. No user has a
, in their username.
If you are following along, you probably noticed that the last step took a relatively significant amount of time (ie. it wasn’t instant). Executing
time on the previous command (bash shortcut
!!) gives about nine seconds:
We could do further analysis on the file, but in an effort to get to the meat of this post, we’ll just find out the number of lines in the file by executing
wc --lines votes.csv. 7.5 million lines of data! Puny, but it will work for this example.
The goal of this post is to convert this csv file into json without using excessive memory. I’ll post the code below explain sections subsequently below. Feel free to skip the explanations if you already know Python inside and out.
Here is a sample of the output
Section 1: Parsing
The lines of python code packs a punch and relies heavily on the following modules (other than the csv module).
- namedtuple: takes away all the boilerplate in creating a class that is immutable that you can treat as a tuple or a dictionary
- fileinput: probably the most simple way to read from standard input with the option of reading multiple files in the future (in case we had multiple data files)
- generators: easily the hardest concept to grok in this list. All the uses of
imapare saying that only as elements are requested in an iteration is the function in
imapinvoked. The variable
votesis only iterated once (and that’s done on the line
What remains left after this section is serializing a sequence of dictionaries to JSON.
Section 2: Inheriting from List
Generator class is the most “hacky” part of the code, but is needed because Python struggles serializing arbitrary iterators in a memory efficient way. The
Generator class is adapted from an answer on StackOverflow.
Generator class has to derive from
list because when serializing an object to json, Python will check if it
isinstance(list). Not deriving from
list will cause an exception because the json module can’t serialize generators. I don’t like this implementation because by deriving from
Generator is stating that it is a list, but an iterator is not a list! An iterator that is masquarading as an list gives me nightmares because the user may assume certain truths about lists that are violated when given an iterator.
The three functions defined are known as special methods, and are easily identifiable by the signature double underscore wrapping:
__init__defines how a
Generatoris customized when someone invokes
__iter__defines the behavior of how elements are iterated (eg.
for x in Generator(foobar):). Our implementation simply forwards the request onto the actual generator
__len__defines how to calculate the length of the class (eg.
length = len(Generator(foobar))Defining
__len__is needed in
Generatorbecause the JSON encoder detects if a list is “falsy, and a python list will evaluate to
falseif it has a zeroth length (eg.
if not : print 'a'will print
a), so only
will be outputted. Defining the function gets around it.
What’s frustrating is that in the json module documentation for encoders, there is an example “to support arbitrary iterators”, but it ends up allocating a whole list for the iterator, which is what we’re trying to avoid!
There is a way around this hack using simplejson, which is “the externally maintained development version of the json library included with Python 2.6 and Python 3.0.” An issue was raised about four years ago, lamenting the same problem. The author created a branch that fixed the problem and asked for testing volunteers. Unfortunately, no one responded so the author never merged the branch into simplejson. Luckily for us, we can still pip install the branch.
Then we can remove the
Generator class and just use:
Performance difference between the two implementation (deriving from list vs. custom simplejson) is nearly neglible with using the simplejson approach being about 5% faster.
Doing my open source duty, I made sure that I let the author know that their implementation worked for our purposes. Here’s to hoping he hears and merges it into the next version!
Update: The author responded and asked me to merge the branch back into master so it is updated. I accepted and made a pull request the next day.
Update: I’m happy to say that pull request has been accepted and as of simplejson 3.8.0,
iterable_as_array can be used, so there is no need to reference a specific (and outdated) branch on Github for the functionality.
pip install simplejson will include the option. Now, combining all the tricks of the trade, we end with:
Section 3: Writing
The shortest and (hopefully) the easiest section:
We’re interested in the amount of time and the max memory usage to show that the votes file, is in fact, not loaded entirely in memory at once. To measure max memory usage, there is a nice utility called
memusg online, which we can grab.
Let’s run it!
Ok, 3 minutes and 36 seconds to run. Not terrible, but not great. Memory usage is about 6KB, so the whole file was processed without exhausting memory.
You may have noticed that I’m piping the output to
/dev/null. I do this as a way to measure CPU performance not IO performance. Though, interestingly enough writing to the file didn’t effect CPU performance, which leads me to think that the algorithm is strangely CPU bound.
Comparison with Rust
A program that transforms an input shouldn’t be CPU bound. This made my mind wander. What if I coded it up in a more naturally performance language? Most people would drift towards C/C++, but I chose Rust because the state of package management in C/C++ is sad. I don’t want to re-code a robust csv and json parser, or spend an inordinate amount of time trying to integrate packages like rapidjson so that everything is redistributible.
Anyways, below is the first cut of my first program I have ever written in Rust that produces an identical result as the Python code. Don’t judge too harshly!
The code is technically longer, but I feel like it hasn’t lost any readibility. It has different syntax than Python, but that shouldn’t be a big turnoff.
Since this was the first Rust code I have ever written, I can’t tell if it is idiomatic, but I’m satisfied. The hardest part was looking up function signatures for serde and lamenting the fact that the rust csv reader decodes records natively using rustc-serialize, but the community is moving towards serde because it has more features and is faster. There is an issue open for the rust csv parser to move to
serde, so the posted code should only become more concise as time passes.
At the time of writing this,
Cargo.toml looked like:
Running the code (after compiling with
cargo build --release:
That’s about a 6x speedup compared to even the Python version that drops down to C for speed. Thus, if speed is of no concern, use Python, but if you want speed and C/C++ might not the right fit – use Rust.
I’ve avoided talking about the
sys timing because until now it hasn’t constituted a significant part to the timing, but now that
sys is more than a quarter of the
real time, it is time to talk about it. In our example,
sys measures reading the file and memory allocation, two jobs that are relegated to the kernal to perform. A hypothesis would be that the Rust code is so fast that the program is becoming IO bound, and this statement is backed up by watching
top display our Rust program consistently using less CPU (about 10%).
If you take a look at the resulting json file, it is quite large. Converting it back to csv might be a good idea because we don’t really gain anything from the json and a csv is much smaller in comparison. We use the same tools as before, except we need to use ijson, as that allows for to stream in JSON.
There are more lines of code dedicated to importing modules than to the actual conversion process. If that is not a powerful snippet of code, I’m not sure what is!
I chose the ijson backend as
yajl2 because the pure Python implementation is twice as slow. The downside to this approach is that it may not be cross platform as
yajl2 requires a compilation step.
We can ensure that roundtripping produces identical output without creating any additional files:
How fast can we go?
From here on out, the content is trivial, but let’s say that you were tasked with converting csv to JSON as fast as possible, and it didn’t matter how re-useable the code was.
For this task we are going to dip our toes into C.
Compile the following code with:
gcc -O3 -o to-json to-json.c.
gcc is needed because we use a couple gnu-isms, such as
getline to read a line at a time and
__builtin_expect for branch prediction.
and the results:
That’s 114.73x speedup compared to our python solution and 18.75x speedup compared to our rust solution. Also note the low memory usage.
Despite the speedup, please don’t code like this. There are many inputs that could wreck our toy example (multi-line records, quotations, long usernames, etc). We get all of these features in the Rust and Python versions because we used libraries that handle all the corner cases.
As a side note the C version is the longest line count and took the longest to code.
How about CSVKit?
csvjson determines the JSON keys from column headers, but in our dataset we don’t have headers. Also
csvjson doesn’t natively stream data and using the
--stream option, it won’t ouput valid JSON! The former problem is easily fixed, but the latter renders this test almost useless. Still, we’ll execute it and record the results.
Wow, slow as molasses (over 250 times slower than our C version) and the final result is still incorrect, but I figured I should this example to be complete, as for small CSV files it should be the quickest because there is no code you have to write, just execute a command!
PyPy is a fast, compliant alternative implementation of the Python language (2.7.9 and 3.2.5). […] Thanks to its Just-in-Time compiler, Python programs often run faster on PyPy
Nice. Dropping in PyPy yielded about a 3x speedup without any additional work. Memory usage is significantly higher due (most likely) to PyPy’s JIT compiler.