Inverted Index
In "Search, step-by-step" we explained what inverted lists were and what they were useful for.
An inverted list encodes a sorted list of document ids. Also -if enabled in the schema- it stored the term frequency for each document.
In order to save RAM, it is crucial to compress these long lists of integers.
Integer compression
For the sake of good locality, doc ids and term frequencies are encoded together. Because they are compressed using different schemes, they are not interlaced but rather encoded in blocks of 128 documents :
A block of 128 doc ids is followed by a block of 128 term frequencies.
Term frequency block
The block of 128 term frequencies, are simply encoded using bit packing. Assuming the largest value in the block is 10, 4 bits are sufficient to encode each of the term frequency, as 2^4 - 1 = 16 >= 10
. The block therefore starts by the byte 4
, followed by the 128 * 4 bits = 512 bits
required to encode the 128
term frequencies.
DocId
s can take great values but they are sorted. We therefore start by delta-encoding them. We replace the list of doc ids, by the consecutive intervals between them.
For instance, assuming the document id list goes
7, 12, 15, 17, 25
The delta-encoded document id list is
7, 5, 3, 2, 8
The resulting deltas can then be bitpacked as we did for term frequencies.
Tantivy actually delegates integer compression to a state-of-the-art C++ library called simdcomp . This library takes advantage of SIMD instructions.
This block encoding scheme can only encode a multiple of 128 documents. The remaining documents' DocId
and term frequencies compressed using variable byte encoding.
In variable byte encoding, integers are encoded over a variable number of bytes.