In Part I of the series, I explained how the Boolean Model and Term Frequency techniques work. In this post, we will look at the TF-IDF approach, which in some form underpins most modern search engines.
Let us have the same corpus and search query that we used in the previous post.
Doc 1: Software Engineer
Doc 2: Systems Engineer
Doc 3: Java Developer
The following query is fired:
Query: Java Software Engineer
TF-IDF stands for Term Frequency-Inverse Document Frequency. We were introduced to TF (Term Frequency) in the previous post. Let us start with understanding IDF (Inverse Document Frequency).
IDF theory states that if a term appears in fewer documents, it is an important term. Consequently, the documents in which these terms appear will be more relevant to the user.
This theory is based on the observation that terms like “the”, “is”, “for” etc. are common and hence recur many times across documents. These terms, when included in the search query, could skew the search results by surfacing irrelevant documents. Actual technical or domain-related terms are less common across documents, so they are more important than “the”, “is” etc. and should be given more weightage. Consequently, the documents that contain these “rare” terms would be more relevant.
Inverse Document Frequency is calculated for each term.
The formula for the IDF of a term is: IDF(term) = log2(total number of documents / number of documents that contain the term)
Using the above formula, let us calculate the IDF for all five terms in our example.
The total number of documents in our corpus is 3.
1. IDF of Software = log2(3/1) = 1.584
2. IDF of Engineer = log2(3/2) = 0.584
3. IDF of Systems = log2(3/1) = 1.584
4. IDF of Java = log2(3/1) = 1.584
5. IDF of Developer = log2(3/1) = 1.584
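The five calculations above can be sketched in Python. This is a minimal illustration with hypothetical names (`docs`, `idf`), not a production implementation; note that Python rounds log2(3) to 1.585, where the post truncates it to 1.584.

```python
import math

# Toy corpus from the post, tokenized and lowercased
docs = {
    "Doc 1": ["software", "engineer"],
    "Doc 2": ["systems", "engineer"],
    "Doc 3": ["java", "developer"],
}

def idf(term, docs):
    """IDF = log2(total number of documents / number of documents containing the term)."""
    containing = sum(1 for words in docs.values() if term in words)
    return math.log2(len(docs) / containing)

for term in ["software", "engineer", "systems", "java", "developer"]:
    print(f"IDF of {term} = {idf(term, docs):.3f}")
```

Every term except “Engineer” appears in exactly one of the three documents, which is why “Engineer” is the only term with a lower IDF.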
Multiplying TF (Term Frequency) with IDF gives you the TF-IDF value for each term in each document.
Below is the Term Frequency table from our earlier post.
The table below shows the TF-IDF values.
| | Software | Engineer | Systems | Java | Developer |
| --- | --- | --- | --- | --- | --- |
| Doc 1 | (1*1.584) = 1.584 | (1*0.584) = 0.584 | (0*1.584) = 0 | (0*1.584) = 0 | (0*1.584) = 0 |
| Doc 2 | (0*1.584) = 0 | (1*0.584) = 0.584 | (1*1.584) = 1.584 | (0*1.584) = 0 | (0*1.584) = 0 |
| Doc 3 | (0*1.584) = 0 | (0*0.584) = 0 | (0*1.584) = 0 | (1*1.584) = 1.584 | (1*1.584) = 1.584 |
| Query | (1*1.584) = 1.584 | (1*0.584) = 0.584 | (0*1.584) = 0 | (1*1.584) = 1.584 | (0*1.584) = 0 |
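The TF-IDF table above can be reproduced with a short sketch. The variable names (`tf`, `doc_freq`, `tfidf`) are hypothetical, and the term-frequency vectors are taken from the previous post's table; the query itself is not counted as a document.

```python
import math

terms = ["software", "engineer", "systems", "java", "developer"]

# Term-frequency vectors from the previous post (one entry per term above)
tf = {
    "Doc 1": [1, 1, 0, 0, 0],
    "Doc 2": [0, 1, 1, 0, 0],
    "Doc 3": [0, 0, 0, 1, 1],
    "Query": [1, 1, 0, 1, 0],
}

n_docs = 3                  # the query is not part of the corpus
doc_freq = [1, 2, 1, 1, 1]  # number of documents containing each term
idf = [math.log2(n_docs / df) for df in doc_freq]

# TF-IDF = TF * IDF, per term, for each document and for the query
tfidf = {name: [t * w for t, w in zip(vec, idf)] for name, vec in tf.items()}
```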
Refer back to the vector diagrams in the previous post. The documents and terms are again plotted in the multi-dimensional vector space by the search engine. For example, the coordinates of the term Software would be (1.584,0,0), the coordinates of Engineer would be (0.584,0.584,0) and so on.
If you notice, the coordinates of the term Software have changed from (1,0,0) to (1.584,0,0) when we moved from plain Term Frequency to TF-IDF. The representation is more fine-tuned. TF-IDF values remove some of the wrinkles of the Term Frequency approach and hence are a better representation of the weightage of the terms within the documents in the corpus.
Vector Products again come to our help in determining how “similar” the documents are to the search query.
The table below shows the Vector Products of the documents and the search query, based on the TF-IDF values above.
| | Software | Engineer | Systems | Java | Developer | Total |
| --- | --- | --- | --- | --- | --- | --- |
| Doc 1 & Query | (1.584*1.584) = 2.509 | (0.584*0.584) = 0.341 | (0*0) = 0 | (0*1.584) = 0 | (0*0) = 0 | 2.850 |
| Doc 2 & Query | (0*1.584) = 0 | (0.584*0.584) = 0.341 | (1.584*0) = 0 | (0*1.584) = 0 | (0*0) = 0 | 0.341 |
| Doc 3 & Query | (0*1.584) = 0 | (0*0.584) = 0 | (0*0) = 0 | (1.584*1.584) = 2.509 | (1.584*0) = 0 | 2.509 |
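The ranking in the table above can be sketched as a dot product over the TF-IDF vectors. Again, the names (`tfidf`, `query`, `dot`, `ranking`) are hypothetical illustration, not a search engine's actual API.

```python
import math

# TF-IDF vectors in term order: software, engineer, systems, java, developer
idf = [math.log2(3 / df) for df in [1, 2, 1, 1, 1]]
tfidf = {
    "Doc 1": [w * t for w, t in zip(idf, [1, 1, 0, 0, 0])],
    "Doc 2": [w * t for w, t in zip(idf, [0, 1, 1, 0, 0])],
    "Doc 3": [w * t for w, t in zip(idf, [0, 0, 0, 1, 1])],
}
query = [w * t for w, t in zip(idf, [1, 1, 0, 1, 0])]

def dot(a, b):
    """Vector (dot) product: the similarity score used in the table above."""
    return sum(x * y for x, y in zip(a, b))

# Rank documents by similarity to the query, highest score first
ranking = sorted(tfidf, key=lambda name: dot(tfidf[name], query), reverse=True)
```

Sorting by the dot product reproduces the ordering discussed below: Doc 1 first, then Doc 3, then Doc 2.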
As the table above shows, Doc 1, with a total of 2.850, is the document in the corpus most “similar” to the search query. It is followed by Doc 3, which has a score of 2.509. So, on the search results page, Doc 1 would be displayed right at the top, followed by Doc 3 and Doc 2.
If you refer back to the previous post, with only Term Frequency, Doc 1 was the most relevant document, but Doc 2 and Doc 3 had the same score, so there was no way to determine which one was more relevant. Here, with TF-IDF, Doc 3 has emerged as more relevant than Doc 2.
Doc 3 being considered more relevant has to do with the way IDF works. As per the definition, if a term is less frequent, it is given more weightage. In our example, the term “Java” occurs only in Doc 3 and nowhere else in the corpus, whereas the term “Engineer” in Doc 2 also appears in Doc 1, so its weightage is lowered.
The TF-IDF-based approach is a simple yet powerful model that works great for most cases, especially full-text searches, but it has some limitations:
- Misspelled words may occur rarely in documents and hence end up being given a very high weightage by IDF, which is not necessarily correct.
- The TF-IDF approach looks at each term in the document individually and assumes the terms to be independent (there is no relation between the terms). In practice, however, many terms are related: in phrases such as “carbon copy” or “credit card”, the relationship between the terms is lost in this model.
- TF-IDF emerged as a useful technique primarily because of early user behavior. Early searchers used full-text queries like “what is the population of China”. In those situations, it was best to give low weightage to words like “is” and “the” so that the results are more relevant. But modern searchers are more likely to type “China population”. With such queries, where all the words are important, too much IDF weighting can actually produce poor results.
Despite its limitations, TF-IDF remains one of the most popular approaches for determining relevance. What is explained above is a basic version of TF-IDF. Search engines implement TF-IDF with more complex weighting schemes to ensure a high degree of relevance in search results.
In the next post, we will see a slightly more involved implementation of TF-IDF.