Maybe We Don’t Need UUIDv7 After All

In the last weeks, the topic of interest seemed to be UUIDv7 identifiers and how they solve DB index problems caused by random identifiers such as UUIDv4 IDs.

UUIDv4 for Public IDs

UUIDv4 IDs have properties that make them suitable as public identifiers, unlike traditional auto-incremented integers.

What's wrong with UUIDv4?

UUIDv4 identifiers are random, so there is no increasing order after each generation. But, here is some bad news: inserting non increasing keys hurts performance, because of the way DBs work with B+ trees, the structure representing the index of the data.

In a B+ tree, the leaf level (the last level opposed to the root) must be kept ordered and leaves linked together.

When a new key is inserted, it's appended at the end of the leaf level if it's greater than the previously greatest leaf.
If not, we have to make space "in the middle" to insert the new key into the adequate location: that's a more expensive operation. This is the worst case scenario. It may be okay from a computational perspective (it's still a O(log(n)) operation), but don't forget the disk perspective: moving more data is more I/O work and more latency. Using SSD may alleviate this problem, but I/O operations are never cheap anyway.

Using a clustered index can make things worse. Data is organized according to the leaves of the B+ tree index, consequence: moving leaves means moving entire pages of data as well!

For INSERT-heavy workloads, UUIDv4 is a bad candidate: due to its randomness, inserting "in the middle" will happen a lot. UUIDv7 avoids this problem: all new keys will be appended at the end of the leaf level, which the best case scenario.

Enter UUIDv7

Like UUIDv4 identifiers, UUIDv7 identifiers are 16-bytes (128-bit) values generated by an algorithm in such manner as to provide fair assurance that each of them is (and will stay) globally unique. But instead of purely random data, UUIDv7 IDs embed a timestamp in the first bits, thus keeping them always ordered and globally unique (with a fair chance).

Here is the structure: First 48 bits are the timestamp, followed by purely random bits (except for version and variant bits).

   0                   1                   2                   3
   0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
  |                           unix_ts_ms                          |
  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
  |          unix_ts_ms           |  ver  |       rand_a          |
  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
  |var|                        rand_b                             |
  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
  |                            rand_b                             |
  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

So, the use of UUIDv7 identifiers should be B tree-friendly without compromising nice properties of UUIDv4.

So, let's ditch UUIDv4 and use UUIDv7?

Leaking information

UUIDv4 does not leak information assuming a proper implementation. But, UUIDv7 in fact does: the timestamp of the server is embeded into the ID. From a business point of view it discloses information about resource creation time. It may not be a problem depending on the context. Current RFC draft allows implementation to tweak timestamps a little to enforce a strict increasing order between two generations and to alleviate some security concerns.

Risk of collisions

In the context of distributed system, it's important to be sure different machines generates IDs without risk of collision. It's not clear if UUIDv7 is more risky than UUIDv4 in that regard.

Hash Index does not care

All UUIDv4 problems suddenly become null with a Hash Index instead of a B tree-based structure.

B+ trees (and B trees in general) allows efficient queries on range (expressions that use the =, >, >=, <, <=, BETWEEN operators.) It appears that resources with a public exposed ID rarely need to be queried in range according to their primary key. A strict equality comparison is enough: why would we want to have products with IDs between a712ba94-6e52-4e14-acfe-2995e50393d5 and bf5e3f49-7a09-46e9-985c-6ecd7abe9e2f, right? Another type of index supported at least by PostgreSQL is the Hash Index that will do the job just fine.

Closing remarks