TF/IDF. WTF? Part II

In our previous post on this topic, we started exploring some basic text analysis techniques. We looked at term counts, or term frequencies, as a measure to help automatically generate tags. A list of words ranked by term frequency is sometimes visualized as a “word cloud”, with the words arranged in a roughly circular shape and the more frequent ones displayed in larger text.
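The term-counting idea is just a tally of word occurrences. A minimal sketch in Python (the sample text here is made up for illustration; any tokenization scheme would do in place of the crude lowercase-and-split used below):

```python
from collections import Counter

# Hypothetical sample text; any document would do.
text = "the cat sat on the mat and the cat slept"

# Lowercase and split on whitespace: a crude but serviceable tokenizer.
terms = text.lower().split()

# Count occurrences of each term.
counts = Counter(terms)

# The most frequent terms are the first candidates for tags.
print(counts.most_common(3))
```

Sorting the tally with `most_common` gives exactly the ranked list a word cloud visualizes.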

In this first cut at a tagging algorithm, a few good candidates probably emerge, but the first problem we run into is that the most common words are invariably ones like ‘a’, ‘an’, ‘the’, ‘of’, and ‘that’: words so frequent in English that they tell us nothing about this particular document.
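One common fix is to drop such “stop words” before ranking. A minimal sketch, assuming a tiny hand-picked stop-word list (real applications use much larger curated lists):

```python
from collections import Counter

# A tiny, hand-picked stop-word set for illustration only;
# real systems use larger curated lists.
STOP_WORDS = {"a", "an", "the", "of", "that", "and", "on", "in", "is"}

text = "the cat sat on the mat and the cat slept"

# Tokenize, then discard stop words before counting.
terms = [t for t in text.lower().split() if t not in STOP_WORDS]

counts = Counter(terms)
print(counts.most_common(2))
```

With the stop words removed, a content word like ‘cat’ rises to the top of the ranking instead of ‘the’.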


TF/IDF. WTF? Part I

In this post, we will go over some basic theory behind statistical analysis of text. This theory underlies many ideas in modern big-data analytics and remains surprisingly relevant and useful today, more than forty years after it was first proposed.

Let’s start with a simple question: what is any given document about? Put another way, if we had a document that needed to be tagged, how could we do that automatically, that is, write a program to do it?