Search, step by step
Let's go through what happens when you search for `President Obama` in tantivy.
We will assume the index has a schema similar to the one in the `tantivy-cli` tutorial. This will help build a common vocabulary.
Our fields will be
- body
- title
Query Parser
Let's assume our user queries for `President Obama`. The query first goes through the query parser, which will understand the query as the clause `(title:president OR title:obama OR body:president OR body:obama)`.¹ Note that the words have been lowercased.
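The expansion step can be sketched as follows. This is a simplified illustration, not tantivy's actual query parser: each whitespace-separated word is lowercased and paired with every default field, and the resulting terms are OR-ed together.

```rust
// Hypothetical sketch of the query parser's expansion step (not tantivy's
// real code): lowercase each token and pair it with every default field.
fn expand_query(query: &str, default_fields: &[&str]) -> Vec<String> {
    query
        .split_whitespace()
        .flat_map(|word| {
            let token = word.to_lowercase();
            default_fields
                .iter()
                .map(move |field| format!("{}:{}", field, token))
                .collect::<Vec<_>>()
        })
        .collect()
}

fn main() {
    let clauses = expand_query("President Obama", &["title", "body"]);
    assert_eq!(
        clauses,
        vec!["title:president", "body:president", "title:obama", "body:obama"]
    );
    // The clauses are then combined into one big OR.
    println!("({})", clauses.join(" OR "));
}
```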
Term
The couple `title:obama` is called a `Term`. In tantivy, the `Term` struct is just a thin wrapper around a `&[u8]` in which:
- the first byte is an id representing the field;²
- the remaining bytes represent the value taken by the field.
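The layout described above can be sketched in a few lines. This is an assumed, simplified encoding matching the description (one field-id byte followed by the UTF-8 bytes of the value), not tantivy's actual `Term` implementation:

```rust
// Simplified sketch of the Term byte layout described above:
// [field_id: 1 byte][field value bytes ...]
fn make_term(field_id: u8, text: &str) -> Vec<u8> {
    let mut bytes = vec![field_id];
    bytes.extend_from_slice(text.as_bytes());
    bytes
}

fn main() {
    let term = make_term(1, "obama");
    assert_eq!(term[0], 1);           // field id
    assert_eq!(&term[1..], b"obama"); // field value
}
```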
Query
The result of the query parser is a `Query` trait object. A `Query` is in charge of defining the set of documents matching the query, as well as how these documents should be scored.
The query will access the segments' data via a `Searcher` object that is created every time a query is run.
Searcher
The searcher is a wrapper around a consistent set of segments or, more accurately, `SegmentReader`s.
As explained in Incremental indexing, segments are continuously being merged. Therefore, every segment eventually gets deleted.
Many requests run in more than one step. For instance, generating a SERP typically requires to
- identify the hits to be displayed;
- retrieve the documents' content from the `Store` and generate snippets.
Using the same `Searcher` for the two steps ensures that a consistent set of `Segment`s is used throughout.
Term dictionary
The terms are then looked up in the term dictionary. The term dictionary associates each term with a `TermInfo` object, which contains:
- the number of documents containing the term;
- the address of the so-called inverted list associated with the term;
- the address of the positions associated with the term.
The addresses above are simply unsigned integers. They express an offset within the inverted index file (`.idx`) and the positions file (`.pos`), respectively.
The position file will not be useful to compute this query, so we can forget it for a moment.
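A toy term dictionary can make this concrete. The `TermInfo` fields below mirror the list above but their names and types are assumptions for illustration; a sorted map stands in for tantivy's real term dictionary:

```rust
use std::collections::BTreeMap;

// Hypothetical TermInfo mirroring the fields listed above.
#[derive(Clone, Copy, Debug, PartialEq)]
struct TermInfo {
    doc_freq: u32,         // number of documents containing the term
    postings_offset: u64,  // offset within the .idx file
    positions_offset: u64, // offset within the .pos file
}

fn main() {
    // A sorted map stands in for the real term dictionary.
    let mut dict: BTreeMap<Vec<u8>, TermInfo> = BTreeMap::new();
    dict.insert(
        b"obama".to_vec(),
        TermInfo { doc_freq: 2, postings_offset: 128, positions_offset: 64 },
    );

    let info = dict.get(b"obama".as_slice()).copied();
    assert_eq!(info.map(|i| i.doc_freq), Some(2));
    assert_eq!(info.map(|i| i.postings_offset), Some(128));
}
```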
Inverted lists, or postings
Inverted lists are key to full-text search. The inverted list associated with a `Term` is the sorted list of the `DocId`s of the documents containing the term. It is also commonly called a posting list, or postings.
Optionally, tantivy makes it possible to store term frequencies along with the `DocId`s. The term frequency is the number of occurrences of the term within the document.
The term dictionary gave us the offset within the `.idx` file at which we should start reading our inverted list.
Details about how this inverted index is laid out on disk can be found in Inverted Index.
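As a rough mental model (the actual on-disk encoding is compressed and more involved), one can picture all posting lists concatenated in one buffer of `(doc_id, term_freq)` pairs, with the term dictionary handing out the offset where each list starts:

```rust
// Assumed, simplified model: posting lists are sorted runs of
// (doc_id, term_freq) pairs concatenated into one buffer, and the term
// dictionary gives us the offset and length of each run.
fn read_postings(buffer: &[(u32, u32)], offset: usize, len: usize) -> &[(u32, u32)] {
    &buffer[offset..offset + len]
}

fn main() {
    // Postings for two terms, concatenated: "president" then "obama".
    let buffer = [(0, 1), (2, 3), (5, 1), (1, 2), (2, 1)];
    let obama = read_postings(&buffer, 3, 2);
    assert_eq!(obama, &[(1, 2), (2, 1)]); // doc 1 (tf=2) and doc 2 (tf=1)
}
```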
DocId
The document id identifies a document within a segment. Upon insertion, each of our documents received a document id. This document id is
- Local. It only identifies a document within a segment. There can be many documents within an `Index` bearing the id 3, but only one document in a specific segment.
- Internal. As explained in Incremental indexing, the doc id associated with a document may change. While you may have to manipulate doc ids within the context of a specific `Searcher`, they should never leak outside of tantivy.
- Auto-incremented. After creating a new segment, the first document added will bear the document id `0`, the next one the document id `1`, and so on.
Joining our inverted lists and scoring
Because our inverted lists are sorted by `doc_id`, they can be merged in linear time. Our query will consider all of the documents that contain at least one of our keywords. Logically, we are computing an `OR` clause. In terms of sets, we are computing the union of our inverted lists.
tantivy simply performs a k-way merge using a `BinaryHeap`.
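The k-way merge can be sketched with the standard library alone. A min-heap holds the head of each posting list; popping the heap yields doc ids in ascending order, and duplicates (documents containing several of the keywords) are emitted once:

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

// Union of sorted posting lists via a k-way merge, as described above.
// The heap entries are (doc_id, list_id, position) so that each pop can
// push the next element of the same list.
fn union(postings: &[Vec<u32>]) -> Vec<u32> {
    let mut heap = BinaryHeap::new();
    for (list_id, list) in postings.iter().enumerate() {
        if let Some(&doc) = list.first() {
            heap.push(Reverse((doc, list_id, 0usize)));
        }
    }
    let mut result = Vec::new();
    while let Some(Reverse((doc, list_id, pos))) = heap.pop() {
        if result.last() != Some(&doc) {
            result.push(doc); // deduplicate: several lists may contain doc
        }
        if let Some(&next) = postings[list_id].get(pos + 1) {
            heap.push(Reverse((next, list_id, pos + 1)));
        }
    }
    result
}

fn main() {
    let president = vec![0, 2, 5];
    let obama = vec![1, 2, 7];
    assert_eq!(union(&[president, obama]), vec![0, 1, 2, 5, 7]);
}
```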
Scoring
As we go through the documents, tantivy also updates a `TfIdf` score accumulator. This accumulator computes a similarity score by keeping track of which terms were in the document, as well as their frequencies.
Computing the TfIdf also requires access to the so-called fieldnorm, which measures how long a document field was.
Intuitively, `TfIdf` considers shorter fields as more important. As a result, a document with a term in its title will get a better score than a document with the same term in its body.
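A rough TfIdf sketch shows how the fieldnorm plays into the score. The exact formula below is illustrative, not tantivy's: the idf factor rewards rare terms, and dividing by the fieldnorm rewards shorter fields.

```rust
// Illustrative TfIdf formula (not tantivy's exact one): sqrt(tf) damps
// repeated occurrences, idf rewards rare terms, and dividing by
// sqrt(fieldnorm) rewards shorter fields.
fn tf_idf(term_freq: u32, doc_freq: u32, num_docs: u32, fieldnorm: u32) -> f32 {
    let idf = ((num_docs as f32 + 1.0) / (doc_freq as f32 + 1.0)).ln() + 1.0;
    let tf = (term_freq as f32).sqrt();
    tf * idf / (fieldnorm as f32).sqrt()
}

fn main() {
    // Same term frequency: the short title field outscores the long body.
    let title_score = tf_idf(1, 10, 1000, 3);
    let body_score = tf_idf(1, 10, 1000, 300);
    assert!(title_score > body_score);
}
```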
Collectors
Collection consists of iterating through the scored documents and pushing each of them to a `Collector`.
The collector is then in charge of deciding what to do with this information.
Assuming we are trying to display a search engine result page (or SERP), we will want to keep track of the 10 most relevant documents. This is precisely the role of the top-K collector.
If we want to display the total number of hits at the top of the SERP, we will also need a `CountCollector`. It just increments a counter each time a document is collected.
After the query has finished running, the user can harvest the results directly from the collectors.
The top-K collector returns the `DocAddress`es of the 10 best documents. A `DocAddress` is really just a `(segment_id, doc_id)` pair.
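The top-K collector idea can be sketched with a bounded min-heap. This is a simplified standalone version, not tantivy's collector API; scores are plain integers here to keep the example ordering-friendly:

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

// Sketch of a top-K collector: a min-heap of at most K (score, doc_id)
// entries. The weakest hit sits at the top and is evicted by better ones.
struct TopK {
    k: usize,
    heap: BinaryHeap<Reverse<(u64, u32)>>,
}

impl TopK {
    fn new(k: usize) -> Self {
        TopK { k, heap: BinaryHeap::new() }
    }
    fn collect(&mut self, doc_id: u32, score: u64) {
        self.heap.push(Reverse((score, doc_id)));
        if self.heap.len() > self.k {
            self.heap.pop(); // drop the current weakest hit
        }
    }
    fn harvest(mut self) -> Vec<(u64, u32)> {
        let mut hits = Vec::new();
        while let Some(Reverse(hit)) = self.heap.pop() {
            hits.push(hit);
        }
        hits.reverse(); // best first
        hits
    }
}

fn main() {
    let mut top2 = TopK::new(2);
    for (doc, score) in [(0u32, 10u64), (1, 30), (2, 20), (3, 5)] {
        top2.collect(doc, score);
    }
    assert_eq!(top2.harvest(), vec![(30, 1), (20, 2)]);
}
```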
Store
The documents' content can optionally be stored in a key-value store simply called the `Store` (or `DocumentStore`) in tantivy. The key here is the `DocId`.
Our `Searcher` object then makes it possible to retrieve a document's content by calling `searcher.doc(doc_address)`.
As the store is local to a segment, the `Searcher` is in charge of extracting the `segment_id` from the `DocAddress` and dispatching the get request to the right `Segment`.
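The dispatch can be sketched as follows. Everything here is hypothetical scaffolding for illustration (the `DocAddress` tuple layout and the per-segment maps are assumptions, not tantivy's types):

```rust
use std::collections::HashMap;

// Assumed layout for illustration: (segment_id, doc_id).
type DocAddress = (u32, u32);

// Toy searcher: each segment has its own store, and a lookup is routed
// to the segment named by the DocAddress.
struct Searcher {
    stores: HashMap<u32, HashMap<u32, String>>, // segment_id -> doc_id -> content
}

impl Searcher {
    fn doc(&self, addr: DocAddress) -> Option<&String> {
        let (segment_id, doc_id) = addr;
        self.stores.get(&segment_id)?.get(&doc_id)
    }
}

fn main() {
    let mut stores = HashMap::new();
    stores.insert(7u32, HashMap::from([(0u32, "President Obama ...".to_string())]));
    let searcher = Searcher { stores };
    assert_eq!(
        searcher.doc((7, 0)).map(String::as_str),
        Some("President Obama ...")
    );
    assert_eq!(searcher.doc((8, 0)), None); // unknown segment
}
```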
Details about the store are available here.
² As a result, the number of fields is limited to 256.
Arguably, this is probably not a sensible choice, and it may change in the future. Lucene uses full field names and a separator for this, as the fst compresses them anyway.