Regarding the TF-IDF requests we have been receiving on our suggestion pages, here’s our full answer on why we don’t use it for our content analysis feature.



What’s TF-IDF?


TF-IDF (short for Term Frequency-Inverse Document Frequency) was Google’s method for judging how relevant the pages in its index are to a given query. It was used in, and updated alongside, Google’s search algorithm mostly around 2013, after the Hummingbird update (see the article Google posted in 2014). Its main purpose is to measure how important a given keyword is to a given page by comparing how often the keyword appears on that page (TF) with how often it is expected to appear on an average web page, based on a larger set of documents (IDF).


In short, it tries to use math to solve a human problem (extracting meaning).
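
To make the math concrete, here is a minimal sketch of the textbook TF-IDF calculation in Python. It is only an illustration, not Google's implementation: the function name `tf_idf`, the three-page sample corpus, and the plain `log(N / df)` form of IDF are all assumptions chosen for this example.

```python
import math
from collections import Counter


def tf_idf(term, doc_tokens, corpus):
    """Score `term` for one tokenized page against a list of tokenized pages.

    Hypothetical helper for illustration; uses the plain log(N / df) IDF variant.
    """
    if not doc_tokens:
        return 0.0
    # TF: share of the page's words that are `term`.
    tf = Counter(doc_tokens)[term] / len(doc_tokens)
    # IDF: log of (total pages / pages that contain the term).
    docs_with_term = sum(1 for doc in corpus if term in doc)
    if docs_with_term == 0:
        return 0.0
    idf = math.log(len(corpus) / docs_with_term)
    return tf * idf


# Made-up sample pages, already split into words.
corpus = [
    "seo tools for keyword research".split(),
    "the best coffee beans for espresso".split(),
    "keyword research tips for seo beginners".split(),
]
page = corpus[0]

score_seo = tf_idf("seo", page, corpus)  # appears on 2 of 3 pages
score_for = tf_idf("for", page, corpus)  # appears on every page
print(score_seo, score_for)
```

Here "for" scores exactly 0 because it appears on every page, while "seo" scores roughly 0.081. The score only counts words; it knows nothing about what the page actually means.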


Though popular back in the day, this method never truly picked up once Google started applying AI to its indexing algorithm. What once seemed impossible to figure out (the meaning of a website’s content in context) is now easily detected, classified, and extracted by Google’s advanced NLP algorithms (which you now run into quite frequently in your SEO work).


TF-IDF is an eight-year-old technology and is no longer used by Google for its indexing algorithm (see the discussion by Google’s John Mueller about TF-IDF). You won’t be able to easily extract keywords that rank using this method in 2021.


2021 is all about Natural Language Processing (NLP), and that’s what we will be introducing in our next couple of updates: direct integration with Google NLP to suggest context sentiment, entities to include, and more, so that your content best fits Googlebot’s understanding.


For reference

Google article on TF-IDF in 2014 - https://ai.googleblog.com/.../teaching-machines-to-read...

John Mueller's discussion on Search Engine Journal in 2019 - https://www.searchenginejournal.com/google-tf-idf/304361/