Sameer Siruguri

My Blog

Archive for the category “text analysis”


In our previous post on this topic, we started exploring some basic text analysis techniques. We looked at term counts, or term frequencies, as a measure for automatically generating tags. A list of words ranked by term frequency is sometimes called a “word cloud”, especially when it is visualized by arranging the words in a roughly circular shape and displaying more frequent words in larger text.
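As a reminder of what counting terms looks like in practice, here is a minimal sketch in Python. The function names `term_counts` and `candidate_tags`, the simple regular-expression tokenizer, and the sample sentence are all assumptions made for illustration; they are not taken from the earlier post.

```python
import re
from collections import Counter


def term_counts(text):
    """Count how often each lowercased word appears in the text."""
    # Crude tokenizer (an assumption): runs of letters and apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)


def candidate_tags(text, k=3):
    """Return the k most frequent terms as candidate tags."""
    return [term for term, _ in term_counts(text).most_common(k)]


sample = (
    "Python makes text analysis easy. "
    "Text analysis in Python starts with counting text."
)
print(term_counts(sample).most_common(3))
print(candidate_tags(sample))
```

The same counts could feed a word cloud: the count for each term would simply decide how large that term is drawn.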

In a first cut at a tagging algorithm, we probably started to see a few good candidates, but the first problem we would have run into is that the most common word is almost always one of these, or something similar: ‘a’, ‘an’, ‘the’, ‘of’, ‘that’, etc.
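The problem is easy to reproduce. The sketch below counts terms in a made-up sentence and shows a function word coming out on top. It then filters against a small stop list, which is one common remedy and is assumed here rather than taken from this excerpt. The `STOP_WORDS` set and the `top_terms` helper are hypothetical and deliberately tiny.

```python
import re
from collections import Counter

# Hypothetical, deliberately small stop list; real lists are much longer.
STOP_WORDS = frozenset({"a", "an", "the", "of", "that", "and", "in", "to", "is"})


def top_terms(text, k=3, stop_words=frozenset()):
    """Return the k most frequent terms, skipping any word in stop_words."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w not in stop_words)
    return [term for term, _ in counts.most_common(k)]


sample = "The cat sat on the mat, and the dog chased the cat out of the house."

# Without filtering, a function word dominates the ranking.
print(top_terms(sample, k=1))
# With the stop list applied, a content word rises to the top.
print(top_terms(sample, k=1, stop_words=STOP_WORDS))
```

Passing the stop list as a parameter keeps the counting logic unchanged, so the effect of filtering can be compared directly against the raw ranking.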



In this post, we will go over some basic theory behind the statistical analysis of text. This theory encompasses some of the ideas used in much of modern big data analytics, and it remains surprisingly relevant and useful today, over forty years after it was invented.

Let’s start with a simple question: what is any given document about? Put another way, if we had a document that needed to be tagged, how could we do that automatically, that is, write a program to do it?
