Tf-Idf

The current default pertinence score in tantivy is called called tf-idf.

Tf-idf is coming from the information retrieval theory. For more detail about its justification, you may want to read its dedicated wikipedia page.

The overall tfidf formula used in tantivy is as follows :

$coord(doc, query) \sum_{term~in~query} \sqrt{tf(doc, term) \over {fieldnorm(doc, field)}} idf(term)$

In which,

$coord(...)$ is just boosting document which have a lot of terms in the query. Its value is simply ${num~terms~in~query~and~document} \over {total~num~terms~in~query}$ . For instance, if the query has two terms body:a and body:b and our document only contains body:a, the coord factor is equal to $1 / 2$ .
$tf(doc, term)$ is the number of occurences of the term in the document being scored.
$fieldnorm(doc, field)$ is the length of considered term's field for the document. If our query term is body:a, then the associated field is body, and the fieldnorm is the number of tokens in the body field of our document.
$idf(term)$ is defined as $1 + ln \left(total~num~docs \over num~docs~containing~the~term \right)$ . If no document contain the term, its value is artificially set to 1, but this does not really matter.

Note that tf-idf has been designed with large documents in minds and suffers from many downsides in practise. Most notably, it derives from a bag-of-word model and does not take in account term proximity.

TfIdf

Tf-Idf

results matching ""

No results matching ""