Dissecting the Postgres Bulk Insert and Binary Format

Introduction

This post delves deep into the world of bulk loading data with PostgreSQL. I’ve written about this subject before in Optimizing Innodb for Bulk Insert for MySQL-esque databases. These posts differ not merely in which database they target, but also in their intended audience. The Innodb post covers a write-heavy setup where many writes arrive in batches. In contrast, this post deals with inserting a data set as fast as possible into the database using COPY, and with evaluating pg_bulkload.

For those interested in optimizing a more frequent but smaller stream of writes, see:

Scenario

The data is a set of Reddit voting data, which is the same data used in Processing arbitrary amount of data in Python. This is what the data looks like (username, link, and score):

00ash00,t3_as2t0,1
00ash00,t3_ascto,-1
00ash00,t3_asll7,1
00ash00,t3_atawm,1
00ash00,t3_avosd,1
0-0,t3_6vlrz,1
0-0,t3_6vmwa,1
0-0,t3_6wdiv,1
0-0,t3_6wegp,1
0-0,t3_6wegz,1

There are 7.5 million rows.

The test machine specs:

Our table will be simple, but I’ll talk through some of the business logic.

CREATE TABLE votes (
  voteId serial PRIMARY KEY,
  username text NOT NULL,
  link text NOT NULL,
  score int NOT NULL
);

-- We're going to have a couple indexes on the data but we are going to add them
-- only after we have loaded the data for performance reasons. So don't actually
-- execute these statements now.
CREATE INDEX ON votes (username);
CREATE INDEX ON votes (link);

For those of you screaming that the schema is not normalized, relax. Read Maybe Normalizing Isn’t Normal and When Not to Normalize your SQL Database, and understand that our data is immutable. Normalizing the schema, so that there would be a users table and a links table, would only complicate the loading process and provide no benefit. I don’t want to split a three column csv into a three table database just so I can check off the box that says “Normalized”.

For performance, the schema will start out defining zero indexes:

If you are loading a freshly created table, the fastest method is to create the table, bulk load the table’s data using COPY, then create any indexes needed for the table. Creating an index on pre-existing data is quicker than updating it incrementally as each row is loaded.

We’ll take the recommendations in Populating a Database and increase maintenance_work_mem and max_wal_size (keep reading for explanations). However, double check each variable you set, because some require a restart and affect all queries, and we’re only interested in the bulk loading queries.
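
To make the scoping concrete, here’s a minimal sketch (assuming the psycopg2 driver and the same database and file paths used throughout this post) that bumps maintenance_work_mem for a single session only, so other queries and the server-wide default are left alone:

import psycopg2

conn = psycopg2.connect(dbname='reddit')
cur = conn.cursor()

# Session-scoped: only affects work done over this connection, such as the
# CREATE INDEX statements below, not the server-wide default
cur.execute("SET maintenance_work_mem = '1GB'")

# max_wal_size, by contrast, is a server-wide setting: change it in
# postgresql.conf (or via ALTER SYSTEM) and reload -- it cannot be SET
# per session

cur.execute("""
    COPY votes (username, link, score)
    FROM '/tmp/votes.csv'
    WITH (FORMAT 'csv', HEADER false)
""")
cur.execute("CREATE INDEX ON votes (username)")
cur.execute("CREATE INDEX ON votes (link)")
conn.commit()
conn.close()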

For each test we’ll be clearing the data and all indexes. The process will be repeated several times and the best time will be reported.

Copy Csv

The easiest case:

COPY votes (username, link, score)
  FROM '/tmp/votes.csv'
  WITH (FORMAT 'csv', HEADER false);

CREATE INDEX ON votes (username);
CREATE INDEX ON votes (link);

Results: 1m7.799s

That’s about 110,000 rows inserted per second – and we’re just getting started.

Copy Text

Luckily the data we are working with follows Postgres’s text format, so we’ll see if there is any performance difference, as there should be less parsing necessary.

COPY votes (username, link, score)
  FROM '/tmp/votes.csv'
  WITH (FORMAT 'text', DELIMITER ',');

CREATE INDEX ON votes (username);
CREATE INDEX ON votes (link);

Results: 1m5.189s

An improvement, but if you have csv data, it is safer to use Postgres’ csv format.

Copy Binary

The last format available for COPY is binary, and the documentation contains a half-hearted statement about the performance of the binary format:

It is somewhat faster than the text and CSV formats.

Being curious programmers, we want numbers, so let’s convert our csv into binary. The next problem: how do we convert it? The documentation is sparse and contains various jargon that took me several read-throughs to understand. The format isn’t difficult to implement, though; on my first try I cracked the format for even the more complicated types. I thought I should document a bit of the format to share the knowledge.

The Format

First things first, the header. While the format allows for header extensions, you’d be fine ignoring them and hard-coding the 19-byte header as shown in the example code: an 11-byte signature, a 4-byte flags field, and a 4-byte header extension length of zero.

Next comes the data, which can be described with a very simple formula (I’ll use pseudocode here). I’ve parenthesized the sizes of the data types.

for every row of data:
  write the number of columns to be written (short)
  for every column in row:
    write the amount of data in bytes about to be inserted (integer)
    write data (length varies by type and contents)

For instance if we had a row with three integer columns:

for every row of data:
  # Each row has three integer columns
  writeShort(3)

  for every column in row:
    # Each integer is four bytes long
    writeInt(4)

    # Write the integer
    writeInt(value)

The Code

Let’s take a look at what it would take to convert our csv into Postgres' binary format in Python. The decision to use Python was arbitrary; any language should be able to accomplish this task.

#!/usr/bin/env python2.7

import csv
import fileinput
import sys
import struct
from collections import namedtuple
from itertools import imap

# Converts each line of the csv into a vote
RedditVote = namedtuple('RedditVote', ['username', 'link', 'score'])
votes = imap(RedditVote._make, csv.reader(fileinput.input()))
votes = imap(lambda x: x._replace(score=int(x.score)), votes)

# Postgres' binary format is big-endian, so we declare a couple helper objects
# that will convert integers to the binary format of specified length.
# 'shorter' is for writing 16 bit fields
# 'inter' is for writing 32 bit fields
shorter = struct.Struct('>h')
inter = struct.Struct('>i')

# The binary format starts with a fixed 19 byte header: an 11 byte signature,
# a 32 bit flags field (all zeros for us), and a 32 bit header extension
# length, which we set to 0 since we have no extension.
sys.stdout.write(b'PGCOPY\n')
sys.stdout.write(b'\xFF')
sys.stdout.write(b'\r\n\x00')
sys.stdout.write(inter.pack(0))  # flags field
sys.stdout.write(inter.pack(0))  # header extension length

def writeStr(s):
    """Writes a string in Postgres' binary format"""
    # Prefix string content with the length of the string
    sys.stdout.write(inter.pack(len(s)))
    sys.stdout.write(s)

for vote in votes:
    # Each binary row is prefixed with how many fields (columns) it contains.
    # Since we are inserting username, link, and score (voteId is auto generated),
    # we are inserting three fields
    sys.stdout.write(shorter.pack(3))

    # Write out username and link
    writeStr(vote.username)
    writeStr(vote.link)

    # Prefix score content with the length of the score data type (integer),
    # which is 4 bytes
    sys.stdout.write(inter.pack(4))
    sys.stdout.write(inter.pack(vote.score))

# The file trailer: a 16-bit integer containing -1
sys.stdout.write(shorter.pack(-1))

To generate the binary file (we could pipe the output straight into Postgres, but then we’d also be timing the Python code, which we’re not interested in):

./to-bin.py < votes.csv > votes.bin

How fast was Postgres able to load a format that it natively knew?

Results: 1m6.406s

I don’t know if I can accurately describe the disappointment when I first saw that number. The binary format, in this case, is not faster. I emphasize “in this case” because the binary format had to work with a significantly larger file (50% bigger!). While in csv the values 1 and -1 take up 1 and 2 bytes respectively, in the binary format they consume 8 bytes. We could change the schema to use smallint, but that would only have saved 2 bytes per row. Additionally, each string needs a four byte prefix to denote its length, which for our short usernames and links is a large fraction of the data. Clearly the binary format isn’t an obvious win, and depending on the situation it could even lead to worse performance.
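
To see where the bloat comes from, take the first sample row, 00ash00,t3_as2t0,1: as csv it occupies 19 bytes (including the newline), while in the binary format it takes 2 bytes for the field count, 4 + 7 bytes for the username, 4 + 8 bytes for the link, and 4 + 4 bytes for the score, for a total of 33 bytes.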

Advanced Format

Since serializing text and integers was easy, I’ll walk through how I figured out how to serialize Postgres arrays into the binary format. A dive into the Postgres source code is required, so a basic understanding of C helps (from the documentation):

To determine the appropriate binary format for the actual tuple data you should consult the PostgreSQL source, in particular the *send and *recv functions for each column’s data type

Let’s pretend that instead of holding a single vote, the score column contains an array of votes (maybe someone can now vote multiple times on a link?). Now comes the hard part: interpreting array_recv.

Here is some of the info we glean from the function. The array data begins with a header consisting of:

  • the number of dimensions in the array (integer)
  • a flag denoting whether the array contains any NULLs (integer)
  • the OID of the element type (integer)
  • for each dimension: the number of elements (integer), followed by the lower bound, which is normally 1 (integer)

Then for each element in the array, we use the previous formula of writing the size of the content followed by the content.
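
To make that concrete, here is a sketch of how a one-dimensional int[] column could be written, reusing the struct helpers from the script above. This is based on my reading of array_recv and isn’t battle-tested; the element type OID for int4 (23) comes from the pg_type catalog.

def writeIntArray(values):
    """Writes a one-dimensional int[] column in Postgres' binary format"""
    # Build the array payload first so we know its total length up front
    payload = inter.pack(1)             # number of dimensions
    payload += inter.pack(0)            # flag: no NULLs in this array
    payload += inter.pack(23)           # OID of the element type (int4)
    payload += inter.pack(len(values))  # dimension 1: number of elements
    payload += inter.pack(1)            # dimension 1: lower bound (arrays are 1-based)

    # Then the usual formula for every element: content length, then content
    for value in values:
        payload += inter.pack(4)        # an int4 is four bytes long
        payload += inter.pack(value)

    # The column itself is prefixed with the payload's length, like any other column
    sys.stdout.write(inter.pack(len(payload)))
    sys.stdout.write(payload)

A row whose only column is the array {1, -1, 1} would then be written as shorter.pack(1) followed by writeIntArray([1, -1, 1]).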

Unlogged Copy

A new feature in Postgres 9.5 is the ability to alter whether a table writes to the WAL. Previous versions (down to 9.1) allowed creating unlogged tables, but the tables were permanently unlogged. Considering that pg_bulkload achieves superior write performance by bypassing the WAL, we should attain similar results. A blog post from 2011 gives a 2x performance bump for unlogged tables, so we’ll try it out.

What are unlogged tables good for? A reputable Postgres blog explains the uses for the (at the time) upcoming 9.1 release:

  • you don’t need it to stay alive over Pg crash
  • you don’t need it replicated to slaves (well - at least using warm/hot standby mechanisms)
  • you do need as fast writes as possible

What happens when you set a previously UNLOGGED table to LOGGED? The wiki explains:

Note that setting an unlogged table to logged will generate WAL which will contain all data in the table, so this would cause a spike in replication traffic for large tables

CREATE UNLOGGED TABLE votes (
  voteId serial PRIMARY KEY,
  username text NOT NULL,
  link text NOT NULL,
  score int NOT NULL
);

COPY votes (username, link, score)
  FROM '/tmp/votes.csv'
  WITH (FORMAT 'csv', HEADER false);

CREATE INDEX ON votes (username);
CREATE INDEX ON votes (link);
ALTER TABLE votes SET LOGGED;

Results: 1m28.054s

This number was higher than I expected, so I cleared the database and executed the same statements without the final ALTER TABLE votes SET LOGGED.

New Results: 0m34.219s

This is twice as fast as any of the logged tables! 220,000 rows a second. Turns out the mentioned blog post was accurate with its 2x performance claim. Writing to an UNLOGGED table would be even faster if I had a faster processor because Postgres was taking up 100% CPU for the entire duration. Think about that. A database that is CPU-bound is incredible.

We can conclude from the numbers that, even with no replication, there is a large overhead to writing to the WAL. Later in the article, we’ll discuss whether one should make the speed vs. safety trade-off.

Time for Fun?

Let’s take a break and have a little bit of fun. Using CTEs (Common Table Expressions), we can elegantly query for the top n voting users in a way that, unlike a LIMIT query, won’t cut off those at the end of the results who are tied with other users.

WITH scored_users AS (
    SELECT username AS username,
           COUNT(CASE WHEN score = -1 THEN 1 END) AS downvotes,
           COUNT(CASE WHEN score = 1 THEN 1 END) AS upvotes,
           COUNT(*) AS votes
    FROM votes
    GROUP BY username
), results AS (SELECT username, votes,
       (upvotes - downvotes) AS score,
       (upvotes::float / votes) as positive,
       dense_rank() OVER(ORDER BY votes DESC) AS vote_rank,
       dense_rank() OVER(ORDER BY (upvotes - downvotes) DESC) score_rank
  FROM scored_users)
SELECT username, votes, score, positive, vote_rank, score_rank
FROM results
WHERE vote_rank < 6
ORDER BY vote_rank,positive DESC,username;
    username    | votes | score | positive  | vote_rank | score_rank
----------------+-------+-------+-----------+-----------+------------
 32bites        |  2000 |     0 |       0.5 |         1 |       1000
 42omle         |  2000 |     0 |       0.5 |         1 |       1000
 48VDC          |  2000 |     0 |       0.5 |         1 |       1000
 aa001m         |  2000 |     0 |       0.5 |         1 |       1000
 aberant        |  2000 |     0 |       0.5 |         1 |       1000
 ADAP2K         |  2000 |     0 |       0.5 |         1 |       1000
...

Interesting, there seems to be a cap at 2000 votes (a maximum of 1000 upvotes and 1000 downvotes). The data, when grouped by link, is much more interesting.

WITH scored_links AS (
    SELECT link AS link,
           COUNT(CASE WHEN score = -1 THEN 1 END) AS downvotes,
           COUNT(CASE WHEN score = 1 THEN 1 END) AS upvotes,
           COUNT(*) AS votes
    FROM votes
    GROUP BY link
), results AS (SELECT link, votes,
       (upvotes - downvotes) AS score,
       (upvotes::float / votes) as positive,
       dense_rank() OVER(ORDER BY votes DESC) AS vote_rank,
       dense_rank() OVER(ORDER BY (upvotes - downvotes) DESC) score_rank
  FROM scored_links)
SELECT link, votes, score, positive, vote_rank, score_rank
FROM results
WHERE vote_rank < 6
ORDER BY vote_rank;
   link   | votes | score |     positive      | vote_rank | score_rank
----------+-------+-------+-------------------+-----------+------------
 t3_beic5 |  1660 |  1102 | 0.831927710843373 |         1 |          2
 t3_92dd8 |  1502 |  1186 | 0.894806924101198 |         2 |          1
 t3_9mvs6 |  1162 |   650 | 0.779690189328744 |         3 |         15
 t3_bge1p |  1116 |   988 | 0.942652329749104 |         4 |          3
 t3_9wdhq |  1050 |   204 | 0.597142857142857 |         5 |        290

Here we see that links that get a lot of votes don’t necessarily get a lot of upvotes. You can’t see it here, as I truncated the data, but the link with the 36th most votes has a score of -55. Still, the trend is that stories that receive many votes tend to have a positive score.

pg_bulkload

Back to business.

pg_bulkload is a high speed data loading utility for PostgreSQL.

pg_bulkload achieves its performance by skipping the shared buffers and WAL logging (similar to UNLOGGED tables).

Installation

Installation is somewhat non-trivial.

Limitation

pg_bulkload can’t handle default or missing columns. I find this a huge limitation that immediately eliminates many use cases. All is not lost – we can hack our way around it. The fix is to generate our own primary keys and add them to the csv.

Schema Change

Change the schema to pull the sequence out of the table so that we can call it ourselves. Index creation is moved into the initial schema because pg_bulkload is optimized to work with pre-built indexes.

CREATE SEQUENCE voteId_seq;

CREATE TABLE votes (
  voteId integer PRIMARY KEY DEFAULT nextval('voteId_seq'),
  username text NOT NULL,
  link text NOT NULL,
  score int NOT NULL
);

CREATE INDEX ON votes (username);
CREATE INDEX ON votes (link);

Now calculate the 7.5 million sequence ids and join them with the original dataset.

psql reddit -c "
  COPY (SELECT nextval('voteId_seq')
        FROM generate_series(1, $(wc -l < /tmp/votes.csv)))
  TO STDOUT" >/tmp/vote_ids

paste /tmp/vote_ids /tmp/votes.csv > /tmp/votes2.csv

First Results

Calculating the 7.5 million sequence ids took 8 seconds. That feels slow, but I have nothing to compare it against. At least pg_bulkload will now load the file. Now for the real load:

Results: 0m42.569s

It’s a 35% improvement, which is significant, but once you account for generating the ids out of band it drops to roughly a 20% improvement – not enough to warrant the trouble with installation and the lack of documentation.

Parallel

pg_bulkload has one more trick up its sleeve: there is a mode where pg_bulkload will read the input file in one process and write the rows in another. One thing to note, and this is documented in pg_bulkload, is that when specifying WRITER = PARALLEL I had to edit /etc/postgresql/9.5/main/pg_hba.conf and use the “trust” method to authenticate. Be careful: trust authentication can easily be used inappropriately and leave the database open to security vulnerabilities.

So how did it do?

Results: 0m36.634s

A small improvement did occur. It would be interesting to investigate whether adding more cores would help (I’m guessing not).

Advantages Compared With COPY

COPY cannot handle constraint violations:

Duplicate rows are not permitted if they violate a unique constraint. MySQL, on the other hand, can handle them with its REPLACE or IGNORE options.

On the other hand, pg_bulkload contains options such as DUPLICATE_ERRORS and ON_DUPLICATE_KEEP = NEW | OLD to customize the behavior on duplicate constraint violations. The only constraints that pg_bulkload won’t check are foreign key constraints.

Investigation into maintenance_work_mem

pg_bulkload contains an interesting statement about maintenance_work_mem, which is a server configuration variable:

The maintenance_work_mem, PostgreSQL parameter, affects the performance of pg_bulkload. The duration becomes almost 15% shorter if this parameter is changed from 64 MB to 1 GB.

To cross reference maintenance_work_mem, let’s check the official documentation:

Specifies the maximum amount of memory to be used by maintenance operations, such as VACUUM, CREATE INDEX, and ALTER TABLE ADD FOREIGN KEY. It defaults to 64 megabytes (64MB). Since only one of these operations can be executed at a time by a database session, and an installation normally doesn’t have many of them running concurrently, it’s safe to set this value significantly larger than work_mem. Larger settings might improve performance for vacuuming and for restoring database dumps.

At first, this seems like it shouldn’t have anything to do with bulk loading data. Why would bumping this value up from 64MB to 1GB cause a 15% speedup?

Examining the pg_bulkload source code and the Postgres source code is illuminating:

/*
 * We size the sort area as maintenance_work_mem rather than work_mem to
 * speed index creation.  This should be OK since a single backend can't
 * run multiple index creations in parallel.  Note that creation of a
 * unique index actually requires two BTSpool objects.  We expect that the
 * second one (for dead tuples) won't get very full, so we give it only
 * work_mem.
 */

Increasing maintenance_work_mem allows indexes to be built faster, since index creation sorts its tuples in a buffer sized by maintenance_work_mem, and because a single backend can’t build two indexes at once, it is safe to set the parameter to a larger value.

WAL Analysis

Databases are interesting in how they record changes. I’ve selected a couple of quotes from the book Database Management Systems. ‘Any change to a database object is first recorded in the log; the record in the log must be written to stable storage before the change to the database object is written to disk’. The database appends the work to be done to the log and at some undetermined time in the future it actually does the work. The alternative would require the database to write the data immediately to provide the same crash recovery guarantees, which is very slow.

The danger of disabling the log can’t be overstated. ‘If a transaction made a change and committed, [and WAL is disabled], some of these changes may not have been written to disk at the time of a subsequent crash’. This could mean that a client or user sees a successful action (e.g. a new user registers), only for the database to crash a few seconds (or minutes!) later. Without the WAL there is no guarantee that the registration reached disk, so the user would end up terribly confused and need to re-register.

Our UNLOGGED table doesn’t write to the WAL. Let’s say that once the table is loaded, it is immutable (there are no more subsequent inserts) and won’t be replicated to another server – there is still one very large downside. The original commit message for unlogged tables sets the stage:

[Unlogged tables] are not available on standby servers and are truncated whenever the database system enters recovery

If we load the voting data into the database and the following commands are executed:

service postgresql restart
pkill -9 -f postgres
service postgresql start
psql reddit -c "SELECT COUNT(*) FROM votes"

The result will always be 0, no matter how long you wait. If anything at any time causes the database to shut down irregularly (power loss, kill -9, etc.), the table will be truncated because the database doesn’t know whether it is in a good state. That is why, unless it is trivial to restore the data from some other source, you should not use UNLOGGED for long term data.

We can recover a bit of the UNLOGGED speed using a trick with transactions. When you execute COPY against a table that was created in the same transaction, the WAL writes can be optimized away (note that this optimization only applies when wal_level is minimal, i.e. no WAL archiving or streaming replication). From the book PostgreSQL 9.0 High Performance:

The purpose of the write-ahead log is to protect you from partially committed data being left behind after a crash. If you create a new table in a transaction, add some data to it, and then commit at the end, at no point during that process is the WAL really necessary. (pg. 402)

We can see that optimization in action with the following:

BEGIN;
CREATE TABLE votes (
  voteId serial PRIMARY KEY,
  username text NOT NULL,
  link text NOT NULL,
  score int NOT NULL
);

COPY votes (username, link, score)
  FROM '/tmp/votes.csv'
  WITH (FORMAT 'csv', HEADER false);

CREATE INDEX ON votes (username);
CREATE INDEX ON votes (link);
COMMIT;

Results: 0m51.810s

Notice that this is a 25% improvement over the previous COPY with the csv. I find this a fine compromise.

For more information on some of the use cases for UNLOGGED tables, there’s a fine article, Reducing writes with Unlogged Tables

In case you are curious, synchronous_commit will not provide a performance benefit here because it only comes into play when dealing with many small transactions.

Parallel Copy

The COPY insertions we’ve been doing have been sequential, and they have been pretty CPU intensive. If we have multiple cores, can we split the copy into multiple chunks and load them separately, but in parallel? Note that this will take several transactions, because a transaction is inherently single threaded and errors will arise if copies are intermixed in a single transaction.
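
As a rough illustration of the approach (not the exact harness I benchmarked with), here is a sketch that assumes psycopg2 and that the csv has already been split into chunk files, for example with split -n l/4 votes.csv /tmp/chunk_. Each worker opens its own connection, so each chunk is loaded in its own transaction:

import glob
from multiprocessing import Pool

import psycopg2

def load_chunk(path):
    """Loads a single csv chunk over its own connection (and transaction)"""
    conn = psycopg2.connect(dbname='reddit')
    cur = conn.cursor()
    with open(path) as f:
        cur.copy_expert(
            "COPY votes (username, link, score) FROM STDIN WITH (FORMAT 'csv')", f)
    conn.commit()
    conn.close()

if __name__ == '__main__':
    pool = Pool(4)  # one worker per core
    pool.map(load_chunk, sorted(glob.glob('/tmp/chunk_*')))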

In my tests I did not notice an improvement for logged tables. The work was divided evenly between the cores, but each ran at half usage, so total insertion time was approximately the same. However, others have found contrary evidence, and running COPY commands in parallel is listed in PostgreSQL 9.0 High Performance as the fastest way to insert data.

Where I did see a major improvement was with UNLOGGED tables. Insertion time decreased linearly with the number of cores (i.e. all cores were saturated during insertion). At that point, building an index on the data takes longer than the actual data insertion.

In general, I would not recommend separate COPY commands for the same data set, because it breaks the all-or-nothing semantics of a single COPY, which can be confusing if some of the data is loaded and some isn’t.

pgloader

pgloader is a project dedicated solely to loading data into Postgres. It has many things going for it.

Unfortunately, tucked away in the section about performance, we find this limitation:

When given a file that the PostgreSQL COPY command knows how to parse, and if the file contains no erroneous data, then pgloader will never be as fast as just using the PostgreSQL COPY command.

pgloader looks to be an amazing tool, but not for our purposes.

Conclusion

Method                         Loading time
Copy csv                       1m7.799s
Copy text                      1m5.189s
Copy binary                    1m6.406s
Copy unlogged + SET LOGGED     1m28.054s
Copy unlogged                  0m34.219s
Copy unlogged parallel         0m25.314s
Copy csv in one txn            0m51.810s
pg_bulkload                    0m42.569s
pg_bulkload parallel           0m36.634s

Comments

If you'd like to leave a comment, please email [email protected]

2017-10-04 - Marcos

Amazing post! Thanks for sharing all this information with such a great detailing .

2018-07-08 - DitN

An excellent article, with excellent level of detail and clarity.