TF-IDF: The Definitive Guide


TF-IDF stands for term frequency-inverse document frequency. The TF-IDF weighting scheme is often used by search engines to score and rank a document’s relevance to a query. The weight is a statistical measure of how important a word is to a document in a collection or corpus.

1. Term frequency:

The number of times a term occurs in a document is called its term frequency.
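As a minimal sketch of this definition in Python, the snippet below counts raw term occurrences in a single document. The helper name `term_frequency` and the example sentence are assumptions made for illustration, and splitting on whitespace is a deliberately simple tokenizer.

```python
from collections import Counter

def term_frequency(text):
    """Count how many times each lowercase word occurs in text."""
    return Counter(text.lower().split())

# Hypothetical example sentence, chosen only for illustration.
tf = term_frequency("the quick fox jumps over the lazy dog the end")
print(tf["the"])  # 3
print(tf["fox"])  # 1
```

A `Counter` returns 0 for words that never occur, which is convenient when looking up query terms that are missing from a document.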

2. Inverse document frequency:

An inverse document frequency factor is incorporated which diminishes the weight of terms that occur very frequently in the document set and increases the weight of terms that occur rarely.
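One common way to write this factor is idf(t) = log(N / df(t)), where N is the number of documents and df(t) is the number of documents containing term t. The sketch below assumes that variant (others add smoothing or use a different log base); the function name and the three short documents are hypothetical.

```python
import math

def inverse_document_frequency(term, documents):
    """Return log(N / df) for term, or 0.0 if no document contains it."""
    n_docs = len(documents)
    df = sum(1 for doc in documents if term in doc.lower().split())
    return math.log(n_docs / df) if df else 0.0

# Hypothetical mini collection, chosen only for illustration.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the bird sang",
]
idf_the = inverse_document_frequency("the", docs)    # occurs in all 3 documents
idf_bird = inverse_document_frequency("bird", docs)  # occurs in 1 document
print(idf_the, idf_bird)
```

Here "the" appears in every document, so its weight is log(3 / 3) = 0, while the rarer "bird" gets log(3 / 1), which is the diminishing effect described above.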

Test documents:

Listing documents:
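A minimal Python sketch of these two steps, assuming three made-up documents held in a dictionary keyed by a document name. The texts and the names `doc1` to `doc3` are placeholders, not the post's actual test data.

```python
# Hypothetical test documents (assumed for illustration).
docs = {
    "doc1": "The cat sat on the mat.",
    "doc2": "Dogs and cats are popular pets.",
    "doc3": "Search engines rank documents by relevance.",
}

# List every document together with its name.
for name, text in docs.items():
    print(f"{name}: {text}")
```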

Test Query:

Combining the documents and the query:
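The query can be treated as one more document so that it passes through the same cleaning and weighting steps. The sketch below assumes hypothetical documents and a hypothetical query string, and stores the query under the key `"query"`.

```python
# Hypothetical documents and query (assumed for illustration).
docs = {
    "doc1": "The cat sat on the mat.",
    "doc2": "Dogs and cats are popular pets.",
    "doc3": "Search engines rank documents by relevance.",
}
query = "cat on the mat"

# Copy the documents and append the query as the last entry.
combined = {**docs, "query": query}
print(list(combined))
```

Building a new dictionary leaves `docs` untouched, so the original collection can still be used on its own later.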

Creating the corpus and cleaning it:
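Cleaning usually means lowercasing, removing punctuation and digits, and dropping stop words. The following is a sketch under those assumptions; the tiny stop-word list and the `clean` helper are hypothetical, and a real pipeline would likely use a fuller list and possibly stemming.

```python
import re

# Small assumed stop-word list; real lists are much longer.
STOPWORDS = {"a", "an", "and", "are", "by", "on", "the"}

def clean(text):
    """Lowercase, keep alphabetic tokens only, and drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]

# Hypothetical raw texts (assumed for illustration).
raw = ["The cat sat on the mat.", "Dogs and cats are popular pets."]
corpus = [clean(text) for text in raw]
print(corpus)
```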

Creating a term-document matrix:
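A term-document matrix has one row per term and one column per document, with each cell holding the term's count in that document. The sketch below builds one from already-cleaned token lists; the corpus contents and the dictionary-of-lists layout are assumptions made for illustration.

```python
from collections import Counter

# Hypothetical cleaned corpus: name -> list of tokens.
corpus = {
    "doc1": ["cat", "sat", "mat"],
    "doc2": ["dogs", "cats", "popular", "pets"],
    "query": ["cat", "mat"],
}

# Sorted vocabulary gives a stable row order.
vocabulary = sorted({t for tokens in corpus.values() for t in tokens})
counts = {name: Counter(tokens) for name, tokens in corpus.items()}

# Rows are terms; columns follow the insertion order of corpus.
tdm = {term: [counts[name][term] for name in corpus] for term in vocabulary}

for term, row in tdm.items():
    print(f"{term:<8}{row}")
```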

Creating a function for calculating tf-idf:
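One possible shape for such a function is sketched below. It assumes raw counts for the term frequency and log(N / df) for the inverse document frequency; other variants (log-scaled counts, smoothing, a different log base) are equally common, so this should not be read as the only definition. The name `tf_idf` and the sample corpus are hypothetical.

```python
import math
from collections import Counter

def tf_idf(corpus):
    """Map each document name to {term: tf * log(N / df)}.

    corpus is {name: list of tokens}; tf is the raw count in that document
    and df is the number of documents that contain the term.
    """
    n_docs = len(corpus)
    df = Counter(t for tokens in corpus.values() for t in set(tokens))
    return {
        name: {t: c * math.log(n_docs / df[t]) for t, c in Counter(tokens).items()}
        for name, tokens in corpus.items()
    }

# Hypothetical cleaned corpus with the query included as an extra entry.
corpus = {
    "doc1": ["cat", "sat", "mat"],
    "doc2": ["dogs", "cats"],
    "query": ["cat", "mat"],
}
weights = tf_idf(corpus)
print(weights["doc1"])
```

In this sample, "sat" occurs in one of three entries and so receives the largest weight in `doc1`, while "cat" and "mat" occur in two entries and are weighted lower.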

Calling the tf-idf function:

Getting the tf-idf value of query and docs separately:

Getting the score of each document:
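The text does not say which scoring rule is used, so the sketch below assumes cosine similarity between the query's tf-idf vector and each document's vector, which is a common choice. It first separates the query weights from the document weights and then scores each document. The weights shown are made-up illustrative values, not results from the post.

```python
import math

def cosine(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical tf-idf weights keyed by name, with the query included.
weights = {
    "doc1": {"cat": 0.4, "mat": 0.4, "sat": 1.1},
    "doc2": {"dogs": 1.1, "pets": 1.1},
    "query": {"cat": 0.4, "mat": 0.4},
}

# Separate the query vector from the document vectors.
query_vec = weights["query"]
doc_vecs = {name: vec for name, vec in weights.items() if name != "query"}

# Score every document against the query.
scores = {name: cosine(query_vec, vec) for name, vec in doc_vecs.items()}
print(scores)
```

A document that shares no terms with the query scores 0, and a document identical to the query would score 1.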

Converting the results into a data frame:

Printing the results in decreasing order of score:
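As a dependency-free sketch, the rows of such a table can be held as (document, score) pairs and sorted by score. A data frame library would do the same job; plain Python is used here only to keep the example self-contained. The scores are made-up illustrative values.

```python
# Hypothetical scores (illustrative values only).
scores = {"doc1": 0.46, "doc2": 0.0, "doc3": 0.12}

# One (name, score) row per document, highest score first.
ranked = sorted(scores.items(), key=lambda row: row[1], reverse=True)

for name, score in ranked:
    print(f"{name}\t{score:.2f}")
```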


Printing only the first and second rows:
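With the rows already sorted by score, keeping the top two is a simple slice. The ranked rows below are hypothetical values used only for illustration.

```python
# Hypothetical rows, already sorted by decreasing score.
ranked = [("doc1", 0.46), ("doc3", 0.12), ("doc2", 0.0)]

# Keep only the first and second rows.
top_two = ranked[:2]

for name, score in top_two:
    print(f"{name}\t{score:.2f}")
```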


The results give a relevance score for each document with respect to the query.

TF-IDF is a useful method for checking how relevant a keyword is to a particular document. Using keywords that are more relevant to the actual document can increase its visibility in search engine results pages (SERPs) and give it a better chance of ranking well.

