Term dictionary
As discussed in Search, step-by-step, our term dictionary is stored in a .fst
file. Its role is to associates
terms (&[u8]
) to some small TermInfo
struct.
There are different ways to store a dictionary of term on disc. Trie
and Finite state transducer
(or Fst
) are rather popular solutions in search engines, for various reasons. They make it possible to enumerate a range of terms, and they can be efficiently intersected with an automaton.
FST tend to be more compact, and therefore use less RAM
and have better locality, while Tries are easier to implement, and may use a little less CPU.
Tantivy's term dictionary is using the excellent BurntSushi's fst crate.
Because the fst crate only associate &[u8]
to longs, our .fst
files actually split into two parts.
The first part is our finite state transducer. Its values are u64
that serve as an address in the second part of the file, at which the TermInfo
is encoded.
Finally, the file end by a u32 encoding the size in bytes of the second part of our file.
This structure is implemented as a simple FstMap<V>
that is a simple wrapper over the actual Fst
.