Custom UUID as primary key

The main argument against using UUIDs as a primary key in a relational database seems to be that if the UUIDs aren't serialized sequentially tables can become fragmented. At first blush, version 1 UUIDs seem like they would work (they are sequential after all) but because their timestamps are stored in reverse order (29, 30, 31 becomes 92, 03, 13) when the UUIDs are converted into binary values they aren't consecutive.

The Evil Empire

Recognizing this problem, Microsoft introduced a built in function which allows developers to generate database friendly UUIDs; unfortunately, the function is only available to SQL Server users. Ideally we would like to abstract Microsoft's logic out of the database and move it into our application where we would be free to leverage it anywhere; to this end, I decided to analyze SQL Server's output by creating a loop of 1000 inserts and carefully inspecting the results.

CREATE TABLE UUID_TEST(UUID UNIQUEIDENTIFIER DEFAULT NEWSEQUENTIALID() UNIQUE NOT NULL, CREATE_ORDER INTEGER);
INSERT INTO UUID_TEST(SEQUENCE) VALUES(?); --where SEQUENCE is set to an incremented value in the for loop
SELECT * FROM UUID_TEST ORDER BY UUID;
DROP TABLE UUID_TEST;

  UUID SEQUENCE
 1  B8586BD7-2062-E111-B36C-CC52AFC9F2ED  0
 2  B9586BD7-2062-E111-B36C-CC52AFC9F2ED  1
 3  BA586BD7-2062-E111-B36C-CC52AFC9F2ED  2
 4  BB586BD7-2062-E111-B36C-CC52AFC9F2ED  3


Upon looking at the sample set of data I noticed a few very obvious things. The first was that the format of the generated ids was consistent with that of a traditional UUID: 16 bytes divided into 5 segments of 4, 2, 2, 2 and 6 bytes. If parsed like a traditional UUID the variant field was well formed (it resolved to Leach-Salz) but the version and timestamp were not.  I also noticed that the first byte where time_high was normally stored changed with greater frequency than the last byte (where a traditional version 1 UUID most often varied). It looked as if a version 1 UUID had been generated in the database and its byte order had been reversed. Indeed, flipping the byte order of the time_high, time_mid, and time_low fields of any SQL Server generated id allowed me to convert it into a valid version 1 UUID; I could verify this by looking at the timestamp:

UUID uuid = UUID.fromString("d76b58b9-6220-11e1-b36c-cc52afc9f2ed"); // sql server generated uuid: B9586BD7-2062-E111-B36C-CC52AFC9F2ED

System.out.println(uuid.version); // outputs "1"
System.out.println(uuid.timestamp()); // outputs "135497356990437561" which equates to the correct date of Tuesday, February 28, 2012 3:28:19 PM GMT

Manually inserting a version 1 UUID into the database produced similar results. The byte order of the time_high, time_mid, and time_low fields of the UUID were stored in reverse. This was the algorithm!

CREATE TABLE UUID_TEST(UUID UNIQUEIDENTIFIER DEFAULT NEWSEQUENTIALID() UNIQUE NOT NULL, SEQUENCE INTEGER)
INSERT INTO UUID_TEST(UUID, SEQUENCE) VALUES(0x626215fe622611e1837f1b41e62f9422, 0); -- insert 626215fe-6226-11e1-837f-1b41e62f9422
SELECT * FROM UUID_TEST;
DROP TABLE UUID_TEST;

  UUID SEQUENCE
 1 FE156262-2662-E111-837F-1B41E62F9422  0


Abstracting out the logic was simply a matter of identifying which segments of the UUID needed to have their byte order changed and encapsulating the necessary transformations in a builder class. To test the class, I used it to generate one million globally unique identifiers and checked to see if SQL Server stored them sequentially. It did! My code worked and seemed portable. I was generating database friendly sequential ids outside of SQL Server!

SQL Server Test

CREATE TABLE UUID_TEST(UUID UNIQUEIDENTIFIER DEFAULT NEWSEQUENTIALID() UNIQUE NOT NULL, CREATE_ORDER INTEGER);

INSERT INTO UUID_TEST(UUID, SEQUENCE) VALUES('0001abd9-3f62-e111-b305-0fce8c655371', 0); --inserting uuid as a string forces ms to respect the uuid input without mangling it
INSERT INTO UUID_TEST(UUID, SEQUENCE) VALUES('0001a3db-3f62-e111-b305-0fce8c655371', 1);

SELECT * FROM UUID_TEST ORDER BY UUID;
DROP TABLE UUID_TEST;

SQL Server Test

  UUID SEQUENCE
 1 0001ABD9-3F62-E111-B305-0FCE8C655371 0
 2 0001A3DB-3F62-E111-B305-0FCE8C655371 1
CORRECTLY ORDERED!

Almost. The algorithm worked perfectly under MS SQL Server but operated inconsistently when used with any other database. Note that the two UUIDs above represent the respective timestamps of "Tuesday, February 28, 2012 7:10:17 PM GMT" and "Tuesday, February 28, 2012 7:10:20 PM GMT".  SQL Server inserted these ids in the right order but if you look at their actual numeric values they are NOT sequential. The hex value 0x0001a3db (107483) should PRECEDE 0x0001abd9 (12896).  SQL Server was obviously sorting its ids using something other than their binary value. As I suspected it may be using hints supplied by the proprietary UNIQEIDENTIFIER type, I stopped investigating. This wasn't going to be a portable solution.

MySQL Test

CREATE TABLE UUID_TEST(UUID BINARY(16) UNIQUE NOT NULL, SEQUENCE INTEGER); --use ansii standard types

INSERT INTO UUID_TEST(UUID, SEQUENCE) VALUES(0x0001abd93f62e111b3050fce8c655371, 0);
INSERT INTO UUID_TEST(UUID, SEQUENCE) VALUES(0x0001a3db3f62e111b3050fce8c655371, 1);

SELECT * FROM UUID_TEST ORDER BY UUID;
DROP TABLE UUID_TEST;

MySQL Test

   UUID  SEQUENCE
 1  0x0001A3DB3F62E111B3050FCE8C655371  1
 2  0x0001ABD93F62E111B3050FCE8C655371  0
INCORRECTLY ORDERED!!

A New Hope

Despite its failings, my investigation helped to validate some prior inclinations. I had originally intended to pervert the format of a version 1 UUID in order to get it to persist sequentially. I now realized it wasn't a half crocked idea. Those familiar with the version 1 format will know that the first 64 bits are dedicated to identifying the version of the UUID and its timestamp. The timestamp however is subdivided into three parts and stored exactly backward (as briefly touched upon in a proceeding paragraph). In order to coerce the UUID into something sequential, all that I really needed to do was rearrange the timestamp parts so they were stored in the proper order (so that most significant parts were stored before the least significant parts if you care about that sort of thing).


public class CombsBuilder {
    ...
    private long getVersion13MostSignificantBits() {
        //
        // 0xAAAAA00000000000L
        // 0x0000000AAAAA0000L
        //
        long timeLowPartA = (uuid.getMostSignificantBits() & 0xFFFFF00000000000L) >>> 28;
        //
        // 0x00000BBB00000000L
        // 0x0000000000000BBBL
        //
        long timeLowPartB = (uuid.getMostSignificantBits() & 0x00000FFF00000000L) >>> 32;
        //
        // 0x00000000MMMM0000L
        // 0x000MMMM000000000L
        //
        long timeMid = (uuid.getMostSignificantBits() &  0x00000000FFFF0000L) << 20;
        //
        // 0x0000000000000HHHL
        // 0xHHH0000000000000L
        //
        long timeHigh = (uuid.getMostSignificantBits() & 0x0000000000000FFFL) << 52;
        //
        // 0x0000000AAAAA0000L
        // 0x0000000000000BBBL
        // 0x000MMMM000000000L
        // 0xHHH0000000000000L
        // 0x000000000000V000L <-- we don't change where the version is stored because we want to respect that part of the spec
        // ____________________
        // 0xHHHMMMMAAAAAVBBBL
        //
        return timeLowPartA | timeLowPartB | timeMid | timeHigh | 0x000000000000D000L; // insert custom version 13 so we can identify later
    }
    ...
}

Persisting UUIDs generated using this algorithm resulted in sequential writes to SQL Server, MySQL and HqSQL databases. The calls were ANSI SQL compliant as I was now storing the UUID as a BINARY(16). In addition, because I left the node bits in the least significant spot, UUIDs could be generated in parallel on an arbitrary number of hosts and still be written in the order they were created (the format of the UUID was roughly a concatenation of 'create date' and 'host id').

CREATE TABLE UUID_TEST(UUID BINARY(16) UNIQUE NOT NULL, SEQUENCE INTEGER);

INSERT INTO UUID_TEST(UUID, SEQUENCE) VALUES(0x1e1624e75e1fd27fa11ded65de8970f6, 0);
INSERT INTO UUID_TEST(UUID, SEQUENCE) VALUES(0x1e1624e75e1fd280a11ded65de8970f6, 1);
INSERT INTO UUID_TEST(UUID, SEQUENCE) VALUES(0x1e1624e75e1fd281a11ded65de8970f6, 2);
INSERT INTO UUID_TEST(UUID, SEQUENCE) VALUES(0x1e1624e75e1fd282a11ded65de8970f6, 3);
INSERT INTO UUID_TEST(UUID, SEQUENCE) VALUES(0x1e1624e75e1fd283a11ded65de8970f6, 4);
INSERT INTO UUID_TEST(UUID, SEQUENCE) VALUES(0x1e1624e75e1fd284a11ded65de8970f6, 5);
INSERT INTO UUID_TEST(UUID, SEQUENCE) VALUES(0x1e1624e75e1fd285a11ded65de8970f6, 6);
INSERT INTO UUID_TEST(UUID, SEQUENCE) VALUES(0x1e1624e75e1fd286a11ded65de8970f6, 7);

SELECT * FROM UUID_TEST ORDER BY UUID;
DROP TABLE UUID_TEST;


   UUID SEQUENCE
 1  0x1E1624E75E1FD27FA11DED65DE8970F6  0
 2  0x1E1624E75E1FD280A11DED65DE8970F6  1
 3  0x1E1624E75E1FD281A11DED65DE8970F6  2
 4  0x1E1624E75E1FD282A11DED65DE8970F6  3
 5  0x1E1624E75E1FD283A11DED65DE8970F6  4
 6  0x1E1624E75E1FD284A11DED65DE8970F6  5
 7  0x1E1624E75E1FD285A11DED65DE8970F6  6
 8  0x1E1624E75E1FD286A11DED65DE8970F6  7


I also decided to change the version number of the UUID from 1 to 13 (D). This would prevent accidental collisions with version 1 UUIDs. I am presently implementing this strategy in our back end and welcome any feedback.


ċ
CombsBuilder.java
(4k)
Michael Lambert,
Feb 29, 2012, 6:25 PM
ċ
MicrosoftCombsBuilder.java
(3k)
Michael Lambert,
Feb 29, 2012, 6:25 PM
Comments