Selecting a Random Object in S3 (and in SQL, and others)
TL;DR
By using UUIDv4 identifiers as names or hash hex digests, there is a nice trick to efficiently select a random item as long as the database support querying with a range condition (S3, SQL, DynamoDB, etc.)
The use case
I have a collection of pictures I took (more than 5000) stored as objects in a S3 bucket, and I want to select one randomly (the "picture of the day").
Two solutions:
- Name each object with a random name within a defined range, such as
[UUIDv4].jpg
- Name each object with its hash digest, such as
[sha256(pic)].jpg
Both work well as long as you can generate a random name following
the same nomenclature, and then query a range with a StartAfter
parameter. The "next" item returned is your randomly chosen object.
This method works because:
- The names are uniformly distributed values over a given range (e.g., the UUIDv4 allowed range).
- The names are lexicographically sortable.
- The number of possible values is so big that we assume we never generate the same name more than once.
This is the real example for S3 with the Python SDK:
from uuid import uuid4
import boto3
s3_client = boto3.client(...)
random_key = str(uuid4())
resp = s3_client.list_objects_v2(
Bucket=MY_BUCKET,
MaxKeys=1,
StartAfter=random_key,
)
key = resp["Contents"][0]["Key"]
# Now get your object
resp = s3_client.get_object(
Key=key,
Bucket=MY_BUCKET,
)
Beware of some edge cases though.
Edge case 1: the lowest object of the bucket is impossible to choose.
Solution: define a blank object named
00000000-0000-0000-0000-000000000000
. The real
lowest object is now in the second position and can be chosen randomly.
Edge case 2: if the random key is higher than the highest object in the bucket, no object can be chosen. Solution: just retry. (Taking the highest object in this case would be biased.)
Those edge cases are pretty rare with a certain volume (with +5000 objects I've never encountred them), but they must be handled gracefully.
Same principle for SQL, DynamoDB, etc.
The same principle works as long as your database or storage engine supports indexing that works with range conditions, e.g., B trees, skip lists. It wont't work with a hash index, for instance, because it only supports the strict equality comparison.
In SQL, selecting a random row is surprisingly a not-so-trivial problem.
A StackOverflow search will give a handful of results, more or less hacky
depending on the RDBMS you use. But the method desribed here will work
without hack or vendor-specific wizardry.
If you already use UUIDv4 identifiers or indexed hash digests, then you
can apply the same solution.