XSLT 2.0 Processing in PostgresPublished on
XML is like china, it has its uses, but for the everyday it should be kept in the cabinet out of reach
Through the xml2 module, Postgres can apply XSL stylesheets to arbitrary texts. There are two big caveats with this approach:
- The xslt functionality is through libxslt, which conforms to XSLT 1.0
- The xml2 module is deprecated, slated for removal, and Postgres doesn’t implement the functionality in the core server, either.
All is not lost for those who want to use XSLT 2.0+ in Postgres. We can embed the gold standard XML Library, saxon, in Postgres using PL/Java. In the end, we will have a regular Postgres function that passes in a stylesheet and data file path, and returns the text result. Since this is a non-trivial, I’ve decided to document the process of setting up, installing, and running an example stylesheet.
Setting up Postgres
We’re going to be using Postgres 9.4 on Ubuntu, and setting up Postgres 9.4 is as simple as following the tutorial outlined on the Postgres site, but for convenience, I’ll copy it down below. One thing to note is that we installing a development dependencies as PL/Java tie into those.
Now you can
sudo -u postgres psql to log into the database!
Setting up PL/Java
PL/Java allows Postgres to call out to the JVM for additional functionality. The default JDK on Ubuntu is OpenJDK, and I happen to prefer the Oracle JDK, so we’ll install that. Additionally there isn’t a maven package for PL/Java so we’ll have to build it on the machine with Postgres on it because of the use of JNI in PL/Java. For those interested in the rationale, there is an interesting article on the PL/Java wiki that explains the choice of JNI.
Now that we have PL/Java built, we are going to add it as an extension to Postgres.
After the extension had been added, I needed to add the install extension jar file to the class path in the module configuration.
Before we commit to writing our XSLT 2.0+ processor, let’s make sure that our extension we installed works. Log into the database and execute.
If you see an output and no errors, we know the installation was a success!
Only have one pljava extension for the entire server! I made the mistake of creating the extension on the “postgres” database and on “mydb” the result was very uninformative “ERROR: Failed to create Java VM”
This part assumes a Maven project.
Add the Saxon library to the pom file.
Since PL/Java works with static functions, we create a class of
Helpers in the
com.example.pljava.test package with a static
The following implementation takes in a file path to the XSLT and a file path to the XML.
Build the project (I prefer to combine everything into an uber jar using Maven Shade). To load it into Postgres, log into the database (no need to
CREATE EXTENSION again), and execute.
Next step. Create a Postgres function to call out to our function.
The language is
javaU because we are accessing the file system, which is an inheritantly unsafe operation. Only a super user can create this function.
Just to illustrate that our solution works, we’ll use the generic identity transform. Since our function works with file paths, we’ll store the following in
/home/nick/data.xml we have
And now the clincher.
<a>Hello</a>. Hurray! We just sucessfully embeded an XSLT 2.0 processor in Postgres.
Looking at the code, we see that each time the
xslt function is called we are parsing the xslt file each time, but intepreting the same xslt transformation is redundant for each data file. We can implement a cache such that xslt files only need to be parsed and interpreted once. The code below solves the issue with Guava’s Cache, which evicts the templates after an hour of no accesses.
We can build our project and replace the exist jar
\timing and 15 values running through the function yielded about 1.25 milliseconds per row, which is on a single core, 2GB RAM, hard drive, and a VM. Obviously your mileage will vary, but I’m expecting this method to scale out beautifully because our
xslt function is logically immutable given the files don’t change. If the contents of the file are subject to change, create the function with
STABLE volatility, else use
For more information, see a great article Introduction to PostgreSQL PL/java, which will go over additional features, such as deployment descriptors (making deployments easier) in more details.