Term dictionary

As discussed in Search, step-by-step, our term dictionary is stored in a .fst file. Its role is to associates terms (&[u8]) to some small TermInfo struct.

There are different ways to store a dictionary of term on disc. Trie and Finite state transducer (or Fst) are rather popular solutions in search engines, for various reasons. They make it possible to enumerate a range of terms, and they can be efficiently intersected with an automaton.

FST tend to be more compact, and therefore use less RAM and have better locality, while Tries are easier to implement, and may use a little less CPU.

Tantivy's term dictionary is using the excellent BurntSushi's fst crate.

Because the fst crate only associate &[u8] to longs, our .fst files actually split into two parts.

The first part is our finite state transducer. Its values are u64 that serve as an address in the second part of the file, at which the TermInfo is encoded.

Finally, the file end by a u32 encoding the size in bytes of the second part of our file.

This structure is implemented as a simple FstMap<V> that is a simple wrapper over the actual Fst.

results matching ""

    No results matching ""