The store
We saw in "Search, Step by step" that the store is a segment-local key/value store in which the keys are doc ids and the values are full documents.
Note that the documents returned by the index are not necessarily complete. Only the fields marked as STORED
in the schema will be retrievable during search.
The document store is very simple and consists of two parts:
- a sequence of compressed blocks, each containing one or more documents
- an index that helps identify which block contains a given document.
A wide spectrum of general-purpose compression algorithms is available off the shelf. Picking one is a matter of making a trade-off between compression ratio and compression/decompression speed.
tantivy uses LZ4, a compression algorithm that decompresses extremely fast but has a weak compression ratio. This is the relevant choice under the assumption that most of the index fits in RAM.
Blocks
Blocks are built as follows. Documents are serialized and appended to the block currently being written. Once the size of that block exceeds the block size, it is closed, compressed, and written to disk. The address (as an offset in the file) at which the block was written, as well as the corresponding doc id, are also serialized.
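The block-building loop described above can be sketched as follows. This is a minimal illustration, not tantivy's actual code: `StoreWriter`, `store`, `flush_block` and `compress` are hypothetical names, `compress` is an identity stand-in for LZ4, documents are length-prefixed inside a block, and the index is assumed to record the first doc id of each block together with the block's start offset.

```rust
/// Block size threshold (16 KB, as stated in the text).
const BLOCK_SIZE: usize = 16 * 1024;

/// Stand-in for a real compressor such as LZ4 (assumption: identity copy,
/// so that the sketch needs no external crate).
fn compress(raw: &[u8]) -> Vec<u8> {
    raw.to_vec()
}

/// Hypothetical writer; an in-memory `Vec<u8>` plays the role of the file.
pub struct StoreWriter {
    /// Concatenated compressed blocks.
    pub file: Vec<u8>,
    /// One entry per block: (first doc id in the block, offset of the block).
    pub index: Vec<(u32, u64)>,
    current_block: Vec<u8>,
    first_doc_in_block: u32,
    next_doc_id: u32,
}

impl StoreWriter {
    pub fn new() -> Self {
        StoreWriter {
            file: Vec::new(),
            index: Vec::new(),
            current_block: Vec::new(),
            first_doc_in_block: 0,
            next_doc_id: 0,
        }
    }

    /// Appends one serialized document and returns its doc id.
    pub fn store(&mut self, serialized_doc: &[u8]) -> u32 {
        let doc_id = self.next_doc_id;
        // Length prefix so that documents can be told apart after decompression.
        let len = serialized_doc.len() as u32;
        self.current_block.extend_from_slice(&len.to_le_bytes());
        self.current_block.extend_from_slice(serialized_doc);
        self.next_doc_id += 1;
        // Close the block once it exceeds the block size.
        if self.current_block.len() > BLOCK_SIZE {
            self.flush_block();
        }
        doc_id
    }

    /// Compresses the current block, appends it to the file and records
    /// its (first doc id, offset) pair. Does nothing on an empty block.
    pub fn flush_block(&mut self) {
        if self.current_block.is_empty() {
            return;
        }
        let offset = self.file.len() as u64;
        let compressed = compress(&self.current_block);
        self.file.extend_from_slice(&compressed);
        self.index.push((self.first_doc_in_block, offset));
        self.current_block.clear();
        self.first_doc_in_block = self.next_doc_id;
    }
}
```

In this sketch a caller would invoke `store` once per document and call `flush_block` one last time when the segment is finalized, so that a partially filled block is not lost.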
The choice of the block size, too, is a trade-off between speed and compression ratio. When fetching a document, the entire block containing it is decompressed, so larger blocks make each fetch slower. On the other hand, blocks are compressed individually, so smaller blocks imply a weaker compression ratio.
Our block size is 16 KB.
Index
The index is simply a serialized list of (doc_id, offset) pairs. It is one of the few data structures that are actually loaded into anonymous memory when a segment is opened.
Because the doc ids are incremental, tantivy just binary searches through the index to find the block containing the requested document.
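As an illustration of that lookup, here is a minimal sketch. It assumes each index entry holds the first doc id of a block and the block's offset, sorted by doc id; the function name `find_block` is hypothetical, and tantivy's actual layout may differ.

```rust
/// Returns the offset of the block that should contain `doc_id`.
///
/// Assumption: `index` is sorted by doc id and each entry is
/// (first doc id in the block, offset of the block in the file).
/// The caller is expected to check that `doc_id` is below the number of
/// documents in the segment; a doc id past the end maps to the last block.
pub fn find_block(index: &[(u32, u64)], doc_id: u32) -> Option<u64> {
    // Binary search: number of blocks whose first doc id is <= doc_id.
    let pos = index.partition_point(|&(first_doc, _)| first_doc <= doc_id);
    // The last such block is the one holding the document, if there is one.
    pos.checked_sub(1).map(|i| index[i].1)
}
```

Once the offset is known, the reader would decompress that single block and scan it for the requested document.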