Search, step by step

Let's go through what happens when you search for President Obama in tantivy. We will assume the index has a schema similar to the one used in the tantivy-cli tutorial.

This will help build a common vocabulary.

Our fields will be

  • body
  • title

Query Parser

Let's assume our user queries for (President Obama). The query first goes through the query parser, which will understand the query as the clause (title:president OR title:obama OR body:president OR body:obama). Note that the words have been lowercased.
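
A rough sketch of what this step looks like in code, assuming a recent tantivy API (exact signatures have changed over time):

```rust
use tantivy::query::{Query, QueryParser};
use tantivy::Index;

// Parse the user query against our two default fields. Since the query
// does not name a field, the parser expands it over both of them.
fn parse_user_query(index: &Index) -> Box<dyn Query> {
    let schema = index.schema();
    let title = schema.get_field("title").expect("field exists");
    let body = schema.get_field("body").expect("field exists");
    let query_parser = QueryParser::for_index(index, vec![title, body]);
    query_parser
        .parse_query("President Obama")
        .expect("well-formed query")
}
```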

Term

The couple title:obama is called a Term. In tantivy, the Term struct is just a thin wrapper around a &[u8] in which

  • The first byte is an id representing the field.1
  • The remaining bytes represent the value taken by the field.
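
For illustration, here is a minimal sketch building such a Term with the Term::from_field_text constructor (the exact byte layout has varied across tantivy versions):

```rust
use tantivy::schema::{Schema, TEXT};
use tantivy::Term;

fn main() {
    let mut schema_builder = Schema::builder();
    let title = schema_builder.add_text_field("title", TEXT);
    let _body = schema_builder.add_text_field("body", TEXT);
    let _schema = schema_builder.build();

    // Conceptually: an encoding of the field, followed by the bytes of
    // the value ("obama" in UTF-8).
    let term = Term::from_field_text(title, "obama");
    println!("{:?}", term);
}
```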

Query

The result of the query parser is a Query trait object. A Query is in charge of defining the set of documents matching the query, as well as how these documents should be scored.

The query will access the segments' data via a Searcher object that is created every time a query is run.

Searcher

The searcher is a wrapper around a consistent set of segments or, to be more accurate, SegmentReaders.

As explained in Incremental indexing, segments are continuously being merged. As a result, every segment eventually gets deleted.

Many requests run in more than one step. For instance, generating a SERP (search engine result page) typically requires us to

  • identify the hits to be displayed
  • retrieve the documents' content from the Store and generate snippets.

Using the same Searcher for both steps ensures that a consistent set of segments is used throughout.
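
A sketch of this two-step flow, assuming a recent tantivy API (the reader/searcher entry points have moved around between versions):

```rust
use tantivy::collector::TopDocs;
use tantivy::query::Query;
use tantivy::{Index, TantivyError};

fn serp(index: &Index, query: &dyn Query) -> Result<(), TantivyError> {
    let reader = index.reader()?;
    let searcher = reader.searcher();
    // Step 1: identify the hits to be displayed.
    let top_docs = searcher.search(query, &TopDocs::with_limit(10))?;
    // Step 2: retrieve their content from the Store, through the *same*
    // searcher, so that both steps see the same set of segments.
    for (_score, doc_address) in top_docs {
        let _doc = searcher.doc(doc_address)?;
    }
    Ok(())
}
```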

Term dictionary

The terms are then looked up in the term dictionary. The term dictionary associates each term with a TermInfo object, which contains

  • The number of documents containing our term.

  • The address of the so-called inverted list associated with the term.

  • The address of the positions associated with the term.

The addresses above are simply unsigned integers. They express offsets within the inverted index file (.idx) and the positions file (.pos), respectively.

The positions file is not needed to compute this query, so we can set it aside for now.
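
Conceptually, a TermInfo looks something like the following sketch (field names are illustrative; the actual struct differs in its details):

```rust
/// What the term dictionary stores for each term (simplified sketch).
struct TermInfo {
    /// Number of documents containing the term.
    doc_freq: u32,
    /// Byte offset of the term's inverted list within the .idx file.
    postings_offset: u64,
    /// Byte offset of the term's positions within the .pos file.
    positions_offset: u64,
}
```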

Inverted lists, or postings

Inverted lists are key to full-text search. The inverted list associated with a Term is the sorted list of the DocIds of the documents containing the term. It is also commonly called a posting list, or postings.

Optionally, tantivy makes it possible to store term frequencies along with the DocIds. The term frequency is the number of occurrences of the term within the document.
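
Conceptually, an inverted list can be pictured as follows (a simplified sketch; the actual on-disk representation is compressed):

```rust
type DocId = u32;

/// One entry of an inverted list (simplified sketch).
struct Posting {
    doc_id: DocId,  // entries are sorted by increasing DocId
    term_freq: u32, // occurrences of the term in this document
}

fn main() {
    // e.g. "obama" appears in documents 1, 5 and 8 of this segment.
    let _obama_postings = vec![
        Posting { doc_id: 1, term_freq: 2 },
        Posting { doc_id: 5, term_freq: 1 },
        Posting { doc_id: 8, term_freq: 4 },
    ];
}
```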

The term dictionary gave us the offset within the .idx file at which we should start reading our inverted list.

Details about how this inverted index is laid out on disk can be found in Inverted Index.

DocId

The document id identifies a document within a segment. Upon insertion, all of our documents received a document id. This document id is

  • Local. It only identifies a document within a segment. There can be many documents within an Index bearing the id 3, but only one document in a specific segment.
  • Internal. As explained in Incremental indexing, the doc id associated with a document may change. While you may have to manipulate doc ids within the context of a specific Searcher, they should never leak outside of tantivy.
  • Auto incremented. After creating a new segment, the first document added will bear the document id 0, the next one the document id 1, and so on.
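
The last point can be pictured as follows (a conceptual sketch, not tantivy's actual internals):

```rust
type DocId = u32;

struct SegmentWriter {
    next_doc_id: DocId,
}

impl SegmentWriter {
    /// Assigns doc ids in insertion order: 0, 1, 2, ...
    fn add_document(&mut self /* , doc: Document */) -> DocId {
        let doc_id = self.next_doc_id;
        self.next_doc_id += 1;
        doc_id
    }
}
```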

Joining our inverted lists and scoring

Because our inverted lists are sorted by DocId, they can be merged in linear time. Our query will consider all of the documents that contain at least one of our keywords. Logically, we are computing an OR clause. In terms of sets, we are computing the union of our inverted lists.

tantivy simply performs a k-way merge using a BinaryHeap.
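
Here is a minimal sketch of the idea, computing the union of sorted posting lists with a BinaryHeap (tantivy's actual implementation is more involved, and operates on compressed blocks):

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

type DocId = u32;

/// Union of several sorted posting lists via a k-way merge.
fn union(postings: &[&[DocId]]) -> Vec<DocId> {
    // The heap holds (next_doc_id, list_index, position_in_list);
    // Reverse turns Rust's max-heap into the min-heap we need.
    let mut heap = BinaryHeap::new();
    for (i, list) in postings.iter().enumerate() {
        if let Some(&doc) = list.first() {
            heap.push(Reverse((doc, i, 0usize)));
        }
    }
    let mut result = Vec::new();
    while let Some(Reverse((doc, i, pos))) = heap.pop() {
        // A document containing several of our terms shows up in
        // several lists: keep it only once.
        if result.last() != Some(&doc) {
            result.push(doc);
        }
        if let Some(&next) = postings[i].get(pos + 1) {
            heap.push(Reverse((next, i, pos + 1)));
        }
    }
    result
}

fn main() {
    let president: &[DocId] = &[1, 5, 8];
    let obama: &[DocId] = &[1, 8, 13];
    assert_eq!(union(&[president, obama]), vec![1, 5, 8, 13]);
}
```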

Scoring

As we go through the documents, we also update a TfIdf score accumulator. This accumulator computes a similarity score by keeping track of which terms were found in the document, as well as their frequencies.

Computing the TfIdf score also requires access to the so-called fieldnorm, which measures how long a document's field is.

Intuitively, TfIdf considers shorter fields as more important. As a result, a document with a term in its title will get a better score than a document with the same term in its body.
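
A simplified version of such a score, for a single matching term (an illustrative formula, not tantivy's exact scoring code):

```rust
/// TfIdf contribution of one term matching in one field (simplified).
fn tf_idf(term_freq: u32, doc_freq: u32, num_docs: u32, fieldnorm: u32) -> f32 {
    // Term frequency, normalized by the field length: a match in a short
    // field (e.g. title) weighs more than one in a long field (e.g. body).
    let tf = term_freq as f32 / fieldnorm as f32;
    // Inverse document frequency: rare terms are more discriminative.
    let idf = ((num_docs as f32 + 1.0) / (doc_freq as f32 + 1.0)).ln() + 1.0;
    tf * idf
}
```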

Collectors

Collection consists of iterating through the scored documents and pushing each of them to a Collector.

The collector is then in charge of deciding what to do with the information.

Assuming we are trying to display a SERP, we will want to keep track of the 10 most relevant documents. This is precisely the role of the top-K collector.

If we want to display the total number of hits at the top of the SERP, we will need a CountCollector. It simply increments a counter each time a document is collected.
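
Stripped down to its essence, a collector looks like the following sketch (illustrative only; tantivy's actual Collector trait has more methods and a per-segment design):

```rust
type DocId = u32;
type Score = f32;

trait Collector {
    /// Called once for every document matching the query.
    fn collect(&mut self, doc: DocId, score: Score);
}

/// Counts the hits, in the spirit of tantivy's CountCollector.
#[derive(Default)]
struct CountCollector {
    count: usize,
}

impl Collector for CountCollector {
    fn collect(&mut self, _doc: DocId, _score: Score) {
        self.count += 1;
    }
}
```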

After the query has finished running, the user can harvest the results directly from the collectors.

The top-K collector returns the DocAddresses of the 10 best documents. These DocAddresses are really just (segment_id, doc_id) pairs.

Store

The documents' content can optionally be stored in a key-value store, simply called the Store or DocumentStore in tantivy. The key here is the DocId.

Our Searcher object then makes it possible to retrieve our document contents by calling searcher.doc(doc_address).

As the store is local to a segment, the Searcher is in charge of extracting the segment_id from the DocAddress and dispatching the get request to the right segment.
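
Conceptually, the dispatch looks like this (field names are illustrative, not tantivy's actual internals):

```rust
type DocId = u32;

struct DocAddress {
    segment_id: u32, // which segment, within the searcher's list
    doc_id: DocId,   // which document, within that segment
}

struct SegmentReader; // placeholder for the per-segment reader

struct Searcher {
    segment_readers: Vec<SegmentReader>,
}

impl Searcher {
    fn doc(&self, addr: DocAddress) {
        let segment = &self.segment_readers[addr.segment_id as usize];
        // ... ask this segment's Store for the document `addr.doc_id` ...
        let _ = (segment, addr.doc_id);
    }
}
```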

Details about the store are available here.


1 As a result, the number of fields is limited to 256. Arguably, this is not the most sensible choice, and it may change in the future. Lucene uses full field names and a separator instead, as the fst compresses them well anyway.
