Maybe We Don’t Need UUIDv7 After All
In the last weeks, the topic of interest seemed to be UUIDv7 identifiers and how they solve DB index problems caused by random identifiers such as UUIDv4 IDs.
UUIDv4 for Public IDs
UUIDv4 IDs have properties that make them suitable as public identifiers, unlike traditional auto-incremented integers.
- They don't disclose business information, e.g., number of orders so far.
- They are not guessable: one cannot enumerate all identifiers to discover resources.
- They can be used as the index/key in databases. (with native
UUID
type or bytes.)
What's wrong with UUIDv4?
UUIDv4 identifiers are random, so there is no increasing order after each generation. But, here is some bad news: inserting non increasing keys hurts performance, because of the way DBs work with B+ trees, the structure representing the index of the data.
In a B+ tree, the leaf level (the last level opposed to the root) must be kept ordered and leaves linked together.
When a new key is inserted, it's appended at the end of the leaf level if it's greater than the
previously greatest leaf.
If not, we have to make space "in the middle" to insert the
new key into the adequate location: that's a more expensive operation. This is the worst case scenario.
It may be okay from a computational perspective (it's still a O(log(n))
operation), but
don't forget the disk perspective: moving more data is more I/O work and more latency.
Using SSD may alleviate this problem, but I/O operations are never cheap anyway.
Using a clustered index can make things worse. Data is organized according to the leaves of the B+ tree index, consequence: moving leaves means moving entire pages of data as well!
For INSERT
-heavy workloads, UUIDv4 is a bad candidate: due to its randomness,
inserting "in the middle" will happen a lot. UUIDv7 avoids this problem: all new keys
will be appended at the end of the leaf level, which the best case scenario.
Enter UUIDv7
Like UUIDv4 identifiers, UUIDv7 identifiers are 16-bytes (128-bit) values generated by an algorithm in such manner as to provide fair assurance that each of them is (and will stay) globally unique. But instead of purely random data, UUIDv7 IDs embed a timestamp in the first bits, thus keeping them always ordered and globally unique (with a fair chance).
Here is the structure: First 48 bits are the timestamp, followed by purely random bits (except for version and variant bits).
0 1 2 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| unix_ts_ms |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| unix_ts_ms | ver | rand_a |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|var| rand_b |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| rand_b |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
So, the use of UUIDv7 identifiers should be B tree-friendly without compromising nice properties of UUIDv4.
So, let's ditch UUIDv4 and use UUIDv7?
Leaking information
UUIDv4 does not leak information assuming a proper implementation. But, UUIDv7 in fact does: the timestamp of the server is embeded into the ID. From a business point of view it discloses information about resource creation time. It may not be a problem depending on the context. Current RFC draft allows implementation to tweak timestamps a little to enforce a strict increasing order between two generations and to alleviate some security concerns.
Risk of collisions
In the context of distributed system, it's important to be sure different machines generates IDs without risk of collision. It's not clear if UUIDv7 is more risky than UUIDv4 in that regard.
Hash Index does not care
All UUIDv4 problems suddenly become null with a Hash Index instead of a B tree-based structure.
B+ trees (and B trees in general) allows efficient queries on range (expressions that use the
=
,
>
,
>=
,
<
,
<=
,
BETWEEN
operators.)
It appears that resources with a public exposed ID rarely need to be queried in range
according to their primary key.
A strict equality comparison is enough: why would we want to have
products with IDs between a712ba94-6e52-4e14-acfe-2995e50393d5
and
bf5e3f49-7a09-46e9-985c-6ecd7abe9e2f
, right? Another type of index
supported at least by PostgreSQL is the Hash Index that will do the job just fine.
Closing remarks
- UUIDv7 has evolved and it's now pretty stable, but it's still subject to changes. This note talks about the last draft at this date: Draft version 11 for RFC4122 revision, published on 5 Sept. 2023 . Bear that in mind when you search for implementations. Be sure the library you found implement the last draft version.
- PostgreSQL has a native
UUID
type, efficiently represented as 128-bit integers. It works for any UUID version, and you can use that as a primary key. -
MySQL does not provide such type, but using
BINARY(16)
type is efficient. Using built-in functionsUUID_TO_BIN
andBIN_TO_UUID
make UUID manipulation more ergonomic. - Do not rely on UUIDv7 identifiers to infer a precise datetime attached to the resource. UUIDv7 uses timestamps to provide the globally increasing order, not to give time per se. Prefer a dedicated column and its secondary index instead.
- Using Hash Index in PostgreSQL is perilous before v10, beacuse they are not WAL-logged and not replicated. From version 10 and after, it's perfectly fine.
- Current versions of InnoDB (MySQL prefered engine) don't support Hash Index. Since PKs are clustered, use a secondary index if you need UUIDv4 identifiers and let PKs auto-incremented.
- This is close to Twitter's Snowflake IDs and ULIDs. That's no accident, in fact, UUIDv7 has been designed after having observed other solutions to this problem (including others).