MySQL text ID collation: tread carefully

Published on: August 18, 2023

I was reading an article by PlanetScale, a hosted MySQL platform, called Why we chose NanoIDs for PlanetScale’s API. A decent enough article about Nano IDs and storing them in databases.

While there are no mistakes in the article, I think there is enough nuance omitted to warrant an addendum, otherwise others run the risk of misapplying the solution.

Here is the schema they used in the post (extraneous columns removed for brevity):

CREATE TABLE `user` (
  `id` bigint NOT NULL AUTO_INCREMENT,
  `public_id` varchar(12) DEFAULT NULL,
  PRIMARY KEY (`id`),
  UNIQUE KEY `idx_public_id` (`public_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci;

To demonstrate a potential blunder, let’s imagine our application generates the Nano ID and inserts one row:

INSERT INTO `user`(`public_id`) VALUES('aGQIKrNKWC6s');

At a later point in time, our application generates a new Nano ID, but when it’s attempted to be inserted MySQL will error:

INSERT INTO `user`(`public_id`) VALUES('AgqikRNKWC6s');
--- ERROR 1062 (23000): Duplicate entry 'AgqikRNKWC6s' for key 'user.idx_public_id'

Our second insert fails as the two values are equal when disregarding letter casing. Hopefully the app has a builtin retry that will regenerate the nanoid if the unique constraint fails.

There’s nothing special about the table schema, the default for MySQL text columns is case insensitive collation.

mysql> SELECT COLLATION(`public_id`) FROM user;
+------------------------+
| COLLATION(`public_id`) |
+------------------------+
| utf8mb4_0900_ai_ci     |
+------------------------+

The _ci part of the collation id stands for “case insensitive”.

The PlanetScale article side steps this problem by using a custom Nano ID alphabet that excludes upper case letters and a couple symbols:

0123456789abcdefghijklmnopqrstuvwxyz

I think the article could have been improved if this was emphasized and justified. Shrinking the alphabet from 64 to 36 characters could change an application to need longer Nano IDs to better avoid collisions.

I wonder what the business use case was for only a lowercase alphabet. Are these IDs read aloud over the phone? Or do they point to a potentially case insensitive file server somewhere?

Moreover, the Nano IDs generated are part of a URL, and URLs are case sensitive. So now /id/aa returns the same page as /id/AA, and one must decide if AA is to be redirected or served with a canonical link pointing to the lowercase variant. The exceptionally risky may trust Google to deduplicate pages that differ only in casing of their URLs.

If there was a business reason for preferring case insensitive URLs derived from randomly generated text, it seems prudent to be made explicit in the column schema.

CREATE TABLE `user` (
  `id` bigint NOT NULL AUTO_INCREMENT,
  `public_id` varchar(12)
      DEFAULT NULL COLLATE utf8mb4_0900_ai_ci,
  PRIMARY KEY (`id`),
  UNIQUE KEY `idx_public_id` (`public_id`)
);

Or maybe that is just me coming from a Postgres background where text is not case sensitive by default. I get a bit uneasy when PlanetScale mentions what they’ll do if they need to adjust their Nano ID:

If we ever need to increase this, the change would be as simple as increasing the length in our ID generator and updating our database schema to accept the new size.

But I can see an overly clever, if somewhat lazy, developer that doesn’t want to deploy a schema change, and instead add back capital letters in an attempt to increase the alphabet size. And in the moment of frustration after their attempt fails, switch to a case sensitive collation. Now, the behavior of /id/AA has changed!

This is why it’s so important to tread carefully. It’s too easy to use the default Nano ID alphabet and stick the result in a MySQL text column (as the ID is text after all) with the default case insensitive collation. And in return receive an unintentional increase in collision probability and case insensitive ID queries.

The thought of case insensitive ID queries make me queasy, but if our application alphabet excludes uppercase, it seems like it would be a good idea if our database enforced this, so that we don’t ever accidentally have instances where the canonical representation of an ID is mixed case. Let’s add a CHECK constraint to ensure our DB and application are in sync in terms of alphabet:

CREATE TABLE `user` (
  `id` bigint NOT NULL AUTO_INCREMENT,
  `public_id` varchar(12)
      DEFAULT NULL COLLATE utf8mb4_0900_ai_ci
      CHECK (REGEXP_LIKE(`public_id`, '[0-9a-z]{12}', 'c')),
  PRIMARY KEY (`id`),
  UNIQUE KEY `idx_public_id` (`public_id`)
);

The 'c' flag is important, as it denotes a case sensitive regex. Without it, mixed case IDs would be permitted. I’ve always operated in an environment where one specifies a flag for case insensitive matching, not the other way around. This might take some getting used to.

I’m an exact match kinda person when it comes to IDs, so I’d rather annotate as such on our text ID columns and swap to a case sensitive collation.

CREATE TABLE `user` (
  `id` bigint NOT NULL AUTO_INCREMENT,
  `public_id` varchar(12)
      DEFAULT NULL COLLATE utf8mb4_0900_as_cs,
  PRIMARY KEY (`id`),
  UNIQUE KEY `idx_public_id` (`public_id`)
);

Some of the other binary collations would have been fine too.

Note that if you are storing a hash or a checksum; something like the output from SHA-256, you’d want to use a BINARY to store the raw data.

Comments

If you'd like to leave a comment, please email [email protected]