Most search applications using Apache Lucene
assign a unique id, or primary key, to each indexed document. While Lucene itself does not require this (it could care less!), the application usually needs it to later replace, delete or retrieve that one document by its external id. Most servers built on top of Lucene, such as Elasticsearch and Solr, require a unique id and can auto-generate one if you do not provide it.
Sometimes your id values are already pre-defined, for example if an external database or content management system assigned one, or if you must use a URI, but if you are free to assign your own ids then what works best for Lucene?
One obvious choice is Java's UUID class, which generates version 4 universally unique identifiers, but it turns out this is the worst choice for performance: it is 4X slower than the fastest. To understand why requires some understanding of how Lucene finds terms.
BlockTree terms dictionary
The purpose of the terms dictionary is to store all unique terms seen during indexing, and map each term to its metadata (de>docFreqde>, de>totalTermFreqde>, etc.), as well as the postings (documents, offsets, postings and payloads). When a term is requested, the terms dictionary must locate it in the on-disk index and return its metadata.
The default codec uses the BlockTree terms dictionary, which stores all terms for each field in sorted binary order, and assigns the terms into blocks sharing a common prefix. Each block contains between 25 and 48 terms by default. It uses an in-memory prefix-trie index structure (an FST) to quickly map each prefix to the corresponding on-disk block, and on lookup it first checks the index based on the requested term's prefix, and then seeks to the appropriate on-disk block and scans to find the term.
In certain cases, when the terms in a segment have a predictable pattern, the terms index can know that the requested term cannot exist on-disk. This fast-match test can be a sizable performance gain especially when the index is cold (the pages are not cached by the the OS's IO cache) since it avoids a costly disk-seek. As Lucene is segment-based, a single id lookup must visit each segment until it finds a match, so quickly ruling out one or more segments can be a big win. It is also vital to keep your segment counts as low as possible!
Given this, fully random ids (like UUID V4) should perform worst, because they defeat the terms index fast-match test and require a disk seek for every segment. Ids with a predictable per-segment pattern, such as sequentially assigned values, or a timestamp, should perform best as they will maximize the gains from the terms index fast-match test.
I created a simple performance tester to verify this; the full source code is here. The test first indexes 100 million ids into an index with 7/7/8 segment structure (7 big segments, 7 medium segments, 8 small segments), and then searches for a random subset of 2 million of the IDs, recording the best time of 5 runs. I used Java 1.7.0_55, on Ubuntu 14.04, with a 3.5 GHz Ivy Bridge Core i7 3770K.
Since Lucene's terms are now fully binary as of 4.0, the most compact way to store any value is in binary form where all 256 values of every byte are used. A 128-bit id value then requires 16 bytes.
I tested the following identifier sources:
For the UUIDs and Flake IDs I also tested binary encoding in addition to their standard (base 16 or 36) encoding. Note that I only tested lookup speed using one thread, but the results should scale linearly (on sufficiently concurrent hardware) as you add threads.
- Sequential IDs (0, 1, 2, ...), binary encoded.
- Zero-padded sequential IDs (00000000, 00000001, ...), binary encoded.
- Nanotime, binary encoded. But remember that nanotime is tricky.
- UUID V1, derived from a timestamp, nodeID and sequence counter, using this implementation.
- UUID V4, randomly generated using Java's de>UUID.randomUUID()de>.
- Flake IDs, using this implementation.