Dissecting the Postgres Bulk Insert and Binary Format
Bulk loading rows faster than this bulky turkey can eat
This post delves deep into the world of bulk loading data with PostgreSQL. I’ve written about this subject before in Optimizing Innodb for Bulk Insert for MySQL-esque databases. These posts differ not merely in which database they target, but also in their target audience. The Innodb post describes a write-oriented setup, where many writes arrive in batches. This post, on the contrary, deals with inserting a data set as fast as possible into the database using COPY, and with evaluating pg_bulkload.
For those interested in optimizing more frequent, but smaller number of writes, see:
- PostgreSQL Insert Strategies – Performance Test
- How to Insert Data [into PostgreSQL] - as Fast as Possible
The data is a set of Reddit voting data, which is the same data used in Processing arbitrary amount of data in Python. This is what the data looks like (username, link, and score).
There are 7.5 million rows.
The test machine specs:
- Ubuntu 14.04 Server
- 2x Intel(R) Core(TM) i7-4700MQ CPU @ 2.40GHz
- 4GB DDR3 RAM at 1600MHz
- 5400 RPM 8MB cache HDD
- Postgres 9.5 Alpha2
- Relevant Postgres Variables
- maintenance_work_mem: 1GB
- shared_buffers: 1GB
- checkpoint_segments: 16MB
Our table will be simple, but I’ll talk through some of the business logic.
- A surrogate key is defined to make referring to a vote easier. It is declared as serial, so it will auto increment on insertion.
- Since the data doesn’t contain any nulls, we’ll include NOT NULL on the columns as a sanity check
- Once the table is loaded there’ll be an index on link, which will simulate the need to do queries by link. I say “once the table is loaded” because the table will start out with zero indexes (discussed later)
For those of you screaming that the schema is not normalized, relax. Read Maybe Normalizing Isn’t Normal, When Not to Normalize your SQL Database, and understand that our data is immutable. Normalizing the schema, such that there would be a users and a link table, would only complicate the loading process and provide no benefits. I don’t want to split a three column csv into a three table database just so I can check off the box that says “Normalized”.
For performance, the schema will start out defining zero indexes:
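A sketch of what such a schema could look like (the column names are my placeholders, not necessarily the original’s; note there is deliberately no PRIMARY KEY constraint, since that would create an index):

```sql
CREATE TABLE votes (
    id       serial,          -- surrogate key, auto increments on insert
    username text NOT NULL,
    link     text NOT NULL,
    score    int  NOT NULL
    -- no PRIMARY KEY or indexes yet: they come after the bulk load
);
```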
If you are loading a freshly created table, the fastest method is to create the table, bulk load the table’s data using COPY, then create any indexes needed for the table. Creating an index on pre-existing data is quicker than updating it incrementally as each row is loaded.
We’ll take the recommendations in Populating a Database and increase the size of max_wal_size (keep reading for explanations). However, double check all variables you set, because some variables require a restart and affect all queries, and we’re only interested in the bulk loading queries.
For each test we’ll be clearing the data and all indexes. The process will be repeated several times and the best time will be reported.
The easiest case:
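Something along these lines (a sketch; the path and column names are illustrative):

```sql
COPY votes (username, link, score)
FROM '/tmp/votes.csv' WITH (FORMAT csv);
```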
That’s about 110,000 rows inserted per second – and we’re just getting started.
Luckily the data we are working with follows Postgres’s text format, so we’ll see if there is any performance difference, as there should be less parsing necessary.
An improvement, but if you have csv data, it would be safer to use the default Postgres csv format.
The last format available for COPY is binary, and the documentation contains a half-hearted statement about its performance:
It is somewhat faster than the text and CSV formats.
Being curious programmers, we want numbers, so let’s convert our csv into binary. The next problem: how do we convert it? The documentation is sparse and contains jargon that took me several read-throughs to understand. Still, the format isn’t difficult to implement; I cracked it on my first try, even for complicated types. I thought I should document a bit of the format to share the knowledge.
First things first, the header. While it supports various extensions, you’d be fine ignoring any extensions and hard-coding the first 19 bytes as is shown in the example code.
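A minimal sketch in Python: per the COPY binary documentation, the header is an 11-byte signature, a 32-bit flags field (0 when not using OIDs), and a 32-bit header extension length (0 when there are no extensions):

```python
import struct

# PGCOPY binary header: 11-byte signature, 4-byte flags field (0),
# 4-byte header extension area length (0) -- 19 bytes in total.
PGCOPY_HEADER = b"PGCOPY\n\xff\r\n\0" + struct.pack("!ii", 0, 0)
```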
Next the data, which can be described with a very simple formula (in pseudocode). I’ve parenthesized the size of the data types.
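The per-tuple layout, paraphrased from the COPY binary documentation:

```
for each row:
    write the number of fields (int16)
    for each field:
        write the length of the field's data in bytes (int32, or -1 for NULL)
        write the field's data
```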
For instance if we had a row with three integer columns:
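A sketch of that row in Python (struct’s `!` gives the required network byte order; the function name is mine):

```python
import struct

def encode_int_row(a, b, c):
    """Encode a row of three int4 columns in PGCOPY binary form."""
    out = struct.pack("!h", 3)               # field count (int16)
    for value in (a, b, c):
        out += struct.pack("!ii", 4, value)  # length (int32), then the int4
    return out

# 2 bytes for the field count + 3 * (4-byte length + 4-byte int) = 26 bytes
row = encode_int_row(1, -1, 42)
```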
Let’s take a look at what it would take to convert our csv into Postgres’ binary format in Python. The decision to use Python was arbitrary; any language should be able to accomplish this task.
To generate the binary file (optionally we could pipe straight to Postgres, but that would also be measuring the Python code, and we’re not interested in that):
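A minimal sketch of such a converter (file paths, function name, and column order are illustrative; the text columns are written as raw UTF-8 bytes, the score as an int4):

```python
import csv
import struct

def csv_to_pgbinary(src, dst):
    """Convert a (username, link, score) csv into PGCOPY binary format."""
    with open(src, newline="") as fin, open(dst, "wb") as fout:
        fout.write(b"PGCOPY\n\xff\r\n\0")        # 11-byte signature
        fout.write(struct.pack("!ii", 0, 0))     # flags, extension length
        for username, link, score in csv.reader(fin):
            user_b, link_b = username.encode(), link.encode()
            fout.write(struct.pack("!h", 3))     # field count per row
            fout.write(struct.pack("!i", len(user_b)) + user_b)
            fout.write(struct.pack("!i", len(link_b)) + link_b)
            fout.write(struct.pack("!ii", 4, int(score)))  # int4 score
        fout.write(struct.pack("!h", -1))        # file trailer
```

The resulting file can then be loaded with COPY ... WITH (FORMAT binary).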
How fast was Postgres able to load a format that it natively knew?
I don’t know if I can accurately describe the disappointment when I first saw that number. The binary format, in this case, is not faster. I emphasize “in this case” because the binary format had to work with a significantly larger file (50% bigger!). While in csv format 1 and -1 take up 1 and 2 bytes respectively, in the binary format they consume 8 bytes. We could change the schema to use smallint, but that would only save 2 bytes per row. Additionally, strings need a four byte prefix to denote their length, which would constitute a large fraction of the string length itself. Clearly the binary format isn’t an obvious win and, depending on the situation, could lead to worse performance.
Since serializing text and integers was easy, I’ll walk through how I figured out how to serialize Postgres arrays into the binary format. A dive into the Postgres source code is required, so a basic understanding of C helps. From the documentation:
To determine the appropriate binary format for the actual tuple data you should consult the PostgreSQL source, in particular the *send and *recv functions for each column’s data type
Let’s pretend that instead of containing a single vote, the column contains an array of votes (maybe someone can now vote multiple times on a link?). Now comes the hard part: interpreting array_send, the function that serializes arrays.
Here is some info we can glean from the function:
- An array’s content is prefixed by a header
- First integer is the number of dimensions in the array. For a single dimension array, put 1.
- Second integer represents the flags. I’ll be marking 0 for flags.
- Third integer is the element type OID, as arrays in Postgres are homogeneous. Since the element type is int4, we find the OID with a simple select: SELECT oid FROM pg_type WHERE typname = 'int4'
- Fourth integer is the number of elements in the array to be inserted
- Fifth integer is the index at which to start inserting elements. Since Postgres arrays are one-indexed, this value should not be smaller than one.
Then for each element in the array, we use the previous formula of writing the size of the content followed by the content.
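Putting that together in Python for an int4[] (a sketch; 23 is the OID that the pg_type select above returns for int4, and the function name is mine). Note that when the array is a field in a row, the whole encoded array still gets the usual int32 field-length prefix:

```python
import struct

def encode_int4_array(values):
    """Encode a one-dimensional int4[] in PGCOPY binary form."""
    body = struct.pack("!iii", 1, 0, 23)        # ndim, flags, element OID (int4)
    body += struct.pack("!ii", len(values), 1)  # element count, lower bound
    for v in values:
        body += struct.pack("!ii", 4, v)        # element length, element
    return struct.pack("!i", len(body)) + body  # int32 field-length prefix

encoded = encode_int4_array([1, 2, 3])
```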
A new feature in Postgres 9.5 is the ability to alter whether a table writes to the WAL. Previous versions (down to 9.1) allowed creating unlogged tables, but the tables were permanently unlogged. Considering that pg_bulkload achieves superior write performance by bypassing the WAL, we should attain similar results. A blog post from 2011 gives a 2x performance bump for unlogged tables, so we’ll try it out.
What are unlogged tables good for? A reputable Postgres blog explains the uses for the (at the time) upcoming 9.1 release:
- you don’t need it to stay alive over Pg crash
- you don’t need it replicated to slaves (well - at least using warm/hot standby mechanisms)
- you do need as fast writes as possible
What happens when you set a previously UNLOGGED table to LOGGED? The wiki explains:
Note that setting an unlogged table to logged will generate WAL which will contain all data in the table, so this would cause a spike in replication traffic for large tables
This number is higher than I thought it would be, so I cleared the database and executed the same statement without the ALTER TABLE votes SET LOGGED.

New results: 0m34.219s
This is twice as fast as any of the logged tables! 220,000 rows a second. Turns out the mentioned blog post was accurate with its 2x performance claim. Writing to an UNLOGGED table would be even faster if I had a faster processor, because Postgres was at 100% CPU for the entire duration. Think about that: a database that is CPU-bound is incredible.
We can conclude from the numbers that, even without replication, there is large overhead in writing to the WAL. Later in the article, we’ll discuss whether one should make the compromise of speed vs safety.
Time for Fun?
We’ll take a break and have a little bit of fun. Using CTEs (Common Table Expressions), we can elegantly query for the top n voting users in a way that, unlike a LIMIT query, won’t cut off those at the end of the results who are tied with other users.
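One way to express such a query (a sketch, not necessarily the original’s exact CTE): rank(), unlike row_number(), assigns tied users the same rank, so everyone tied at the cutoff survives the filter.

```sql
WITH vote_counts AS (
    SELECT username,
           count(*) AS votes,
           rank() OVER (ORDER BY count(*) DESC) AS rnk
    FROM votes
    GROUP BY username
)
SELECT username, votes
FROM vote_counts
WHERE rnk <= 10;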
Interesting, there seems to be a cap at 2000 votes (max 1000 upvotes and max 1000 downvotes). The data, when grouped by link name, is much more interesting.
Here we see that links that get a lot of votes don’t necessarily get a lot of upvotes. You don’t see it here, as I truncated the data, but the 36th most-voted-on link has a score of -55. Still, the trend is that stories that receive many votes tend to have a positive score.
Back to business.
pg_bulkload is a high speed data loading utility for PostgreSQL.
pg_bulkload achieves its performance by skipping the shared buffers and WAL logging (similar to UNLOGGED tables).
Installation is somewhat non-trivial.
- Download the source from github
- Install all the dependencies
- make; make install
- Connect to the wanted database and execute CREATE EXTENSION pg_bulkload
pg_bulkload can’t handle default or missing columns. I find this a huge limitation that immediately eliminates many use cases. All is not lost – we can hack our way around it. The fix is to generate our own primary keys and add them to our csv.
Change the schema to pull the sequence out of the table definition so that we can call the sequence ourselves. Index creation is moved into the initial schema because pg_bulkload is optimized to work with pre-built indexes.
Now calculate the 7.5 million sequence ids and join them with the original dataset.
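One way to sketch this (my reconstruction; the sequence name and output path are illustrative): pull the ids from the sequence and export them, then join them column-wise onto the csv out of band (e.g. with paste):

```sql
COPY (SELECT nextval('votes_id_seq') FROM generate_series(1, 7500000))
TO '/tmp/ids.csv';
```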
Calculating the 7.5 million sequence ids took 8 seconds. That feels slow, but I have nothing to compare it against. At least pg_bulkload will now load the data. Now for the real load:
It’s a 35% improvement, which is significant, but not significant enough to warrant the trouble with installation and the lack of documentation. Having to generate ids out of band knocks it down to about a 20% improvement.
pg_bulkload has one more trick up its sleeve: a mode where it reads the input file in one process and writes the rows in another. One thing to note, and this is documented in pg_bulkload: when specifying WRITER = PARALLEL, I had to edit /etc/postgresql/9.5/main/pg_hba.conf and use the “trust” method to authenticate. Be careful, trust authentication can be used inappropriately and leave the database open to security vulnerabilities.
So how did it do?
A small improvement did occur. It would be interesting to investigate whether adding more cores would help (I’m guessing not).
Advantages Compared With COPY
With COPY, duplicate rows are not permitted if they violate a unique constraint. MySQL, conversely, can permit them with the option to REPLACE or IGNORE.
On the other hand, pg_bulkload contains options such as ON_DUPLICATE_KEEP = NEW | OLD to customize the behavior on duplicate constraint violations. The only constraints that pg_bulkload won’t check are foreign key constraints.
The pg_bulkload documentation contains an interesting statement about maintenance_work_mem, which is a server configuration variable:
The maintenance_work_mem, PostgreSQL parameter, affects the performance of pg_bulkload. The duration becomes almost 15% shorter if this parameter is changed from 64 MB to 1 GB.
To cross reference maintenance_work_mem, let’s check the official documentation:
Specifies the maximum amount of memory to be used by maintenance operations, such as VACUUM, CREATE INDEX, and ALTER TABLE ADD FOREIGN KEY. It defaults to 64 megabytes (64MB). Since only one of these operations can be executed at a time by a database session, and an installation normally doesn’t have many of them running concurrently, it’s safe to set this value significantly larger than work_mem. Larger settings might improve performance for vacuuming and for restoring database dumps.
At first, this seems like it shouldn’t have anything to do with bulk loading data, so why would bumping this value up from 64MB to 1GB cause a 15% speedup? A larger maintenance_work_mem allows indexes to be created faster, as they are sorted using that memory, and since no two indexes can be created at the same time, it is safe to set maintenance_work_mem to a larger value.
Databases are interesting in how they record changes. ‘Any change to a database object is first recorded in the log; the record in the log must be written to stable storage before the change to the database object is written to disk’ (Gehrke, 2003, p. 581). The database appends work to be done to the log and at an undetermined time in the future it will actually do the work. The alternative would require the database to write the data immediately to provide the same crash recovery guarantees, which is very slow.
The danger of disabling the log can’t be overstated. ‘If a transaction made a change and committed, [and WAL is disabled], some of these changes may not have been written to disk at the time of a subsequent crash’ (Gehrke, 2003, p. 586). This could mean that a client or user sees a successful action (e.g. a new user registers), only for the database to crash a few seconds (or minutes!) later. Without WAL there is no guarantee that the registration was propagated to the database, so the user would end up terribly confused and need to re-register.
An UNLOGGED table doesn’t write to the WAL. Let’s say that once the table is loaded it is immutable (there are no more subsequent inserts) and won’t be replicated to another server – there is still one very large downside. The original commit message for unlogged tables sets the stage:
[Unlogged tables] are not available on standby servers and are truncated whenever the database system enters recovery
If we load the voting data into the database and the following commands are executed:
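Something along these lines (a sketch of the scenario; the crash can be any unclean shutdown):

```sql
SELECT count(*) FROM votes;   -- 7.5 million rows while the server is up
-- the server is killed uncleanly and restarted here
SELECT count(*) FROM votes;   -- the count after recovery
```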
The result will always be 0, even if you wait a significantly long amount of time. If anything at any time causes the database to shut down irregularly (power loss, kill -9, etc.), the table will be truncated, because the database doesn’t know whether it is in a good state. That is why, unless it is trivial to restore the table from some other source of data, do not use UNLOGGED for long term data.
We can recover a bit of the UNLOGGED speed using a trick with transactions. When you execute COPY against a table that was created in the same transaction, the WAL will be optimized away:
The purpose of the write-ahead log is to protect you from partially committed data being left behind after a crash. If you create a new table in a transaction, add some data to it, and then commit at the end, at no point during that process is the WAL really necessary. (Smith, 2010, p. 402)
We can see that optimization in action with the following:
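A sketch of the pattern (column names are my placeholders; note this optimization only applies when wal_level is minimal, the default in 9.5):

```sql
BEGIN;
CREATE TABLE votes (
    id       serial,
    username text NOT NULL,
    link     text NOT NULL,
    score    int  NOT NULL
);
COPY votes (username, link, score)
FROM '/tmp/votes.csv' WITH (FORMAT csv);
COMMIT;
```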
Notice that this is a 25% improvement over the previous COPY with the csv. I find this a fine compromise.
For more information on some of the use cases for UNLOGGED tables, there’s a fine article, Reducing writes with Unlogged Tables.
In case you are curious, synchronous_commit will not provide a performance benefit here, because it only comes into play when dealing with many transactions.
The COPY insertions we’ve been doing have been sequential, and they have been pretty CPU intensive. If we have multiple cores, can we split the input into multiple chunks and load them separately, but in parallel? Note that this will take several transactions, because a transaction is inherently single threaded and errors will arise if copies are intermixed in a single transaction.
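A sketch of the chunking half in Python (the function name and file names are mine; the actual loading step, shown only as a comment, would hand each chunk to a COPY on its own connection, e.g. via psycopg2’s copy_expert, one process per core):

```python
def chunk_file(path, n):
    """Split a csv into n chunk files, one per core; each chunk is then
    loaded by a separate COPY in its own connection and transaction."""
    with open(path) as f:
        lines = f.readlines()
    size = -(-len(lines) // n)  # ceiling division
    chunks = []
    for i in range(n):
        out = f"{path}.chunk{i}"
        with open(out, "w") as f:
            f.writelines(lines[i * size:(i + 1) * size])
        chunks.append(out)
    return chunks

# Each chunk would then be loaded in parallel, e.g. with psycopg2:
#   cur.copy_expert("COPY votes (username, link, score) FROM STDIN CSV",
#                   open(chunk))
```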
In my tests I have not noticed an improvement for logged tables. The work was divided evenly between the cores, but at half usage each, so total insertion took approximately the same time. However, others have found contrary evidence, and parallel COPY commands are listed in PostgreSQL 9.0 High Performance as the fastest way to insert data.
Where I did see major improvement was with UNLOGGED tables. Insertion time decreased linearly with the number of cores (ie, all cores were saturated during insertion). At this point, building an index on the data takes longer than the actual data insertion.
In general, I would not recommend separate COPY commands for the same data, because it breaks some of the semantics of COPY’s all-or-nothing behavior, which can be confusing if some of the data is loaded and some isn’t.
pgloader is a project dedicated solely to loading data into Postgres. It has many things going for it:
- Well documented
- Well maintained
- Flexible (multiple data formats to import: MySQL/SQLite/etc)
- Resilient (loads only the data that conforms to the schema)
- Its performance is lauded on the homepage
Unfortunately, tucked away in the section about performance, we find the limitation:
When given a file that the PostgreSQL COPY command knows how to parse, and if the file contains no erroneous data, then pgloader will never be as fast as just using the PostgreSQL COPY command.
pgloader looks to be an amazing tool, but not for our purposes.
| Method | Time |
| --- | --- |
| Copy unlogged no WAL | 0m34.219s |
| Copy unlogged parallel | 0m25.314s |
| Copy csv in one txn | 0m51.810s |
- If you need a short lived table that can be reconstructed easily, use an UNLOGGED table
- Else if you need the fastest insertion, use multiple parallel COPY commands into an UNLOGGED table and then alter the table to logged at some point in the future
- Else if you need duplicate row behavior, but need the speed of COPY, use pg_bulkload
- Else COPY should be good enough
- Ramakrishnan, R. & Gehrke, J. (2003). Database Management Systems (3rd ed.). McGraw-Hill Higher Education.
- Smith, G. (2010). PostgreSQL 9.0 High Performance. Packt Publishing.