In this post, we will go over some basic theory behind statistical analysis of text. This theory encompasses some of the ideas used in a lot of modern big data analytics and is surprisingly relevant and useful even today, over forty years after it was invented.
Let’s start with a simple question – what is any document about? Alternatively, we can pose this question as, if we had a document that needed to be tagged, how can we do that automatically, that is, write a program to do it?
There are many ways of answering this question but let’s start with the easiest way – we will make a very simplifying assumption which is always the best way to start any scientific or technical endeavor. In this case, the very simplifying assumption is:
The tags for a document are in the document itself.
This doesn’t seem like a bad assumption to start with, because what a document is about, after all, is what it talks about. However, with any assumption, it’s worth thinking about when the assumption might be proved false.
It’s not hard to find counter-examples. If we wanted to generates tags for the Bible, would we know that it was a document about Christianity, under the simplifying assumption we made? No, because the word “Christianity” or “Christian” does not appear in the New Testament. The Declaration of Independence in fact only calls itself “the declaration of the thirteen United States of America,” and not “The declaration of independence.” The phrase “coming of age” does not occur in The Catcher In The Rye. So we go into our solution knowing that our assumption would leave some documents incompletely (and perhaps incorrectly) tagged. We will just postpone that problem for later (and blog about it at some other point!)
If the words are indeed in the document itself, then which ones are they? We’ll now have to figure out what we mean by a “word” in the document. For the purposes of this post, we will make yet another simplifying assumption (they tend to multiply very quickly) – that a word is any string of non-space characters. So in this paragraph, ‘multiply’ is a word and ‘”word”‘ (with the double quotes) is a word.
Immediately, that feels like it’s not going to work – we know that the periods and commas in a sentence don’t add to the “words” in it so let’s make the definition of “word” a bit tighter – it’s a string of non-space and non-punctuation characters 1)Well, we’ll have to handle single quotes differently, so that “isn’t” is treated as a single word, rather than as two words, “isn” and “t”.
In Ruby, we can do something like this:
def words(body) # return all the words in the text that's in the argument body body.split(/[^a-zA-Z0-9']+/) end
Great. So now we can generate a list of all words, how do we know which one is the tag?
We need some way to compare them and decide which ones are better. In other words, we need a ranking function. In fact, while we are starting to define technical-sounding terms, let’s add one more – term to mean “word.”2)We say ‘term’ instead of ‘word’ because we want to refer to things like years (2011) as a “word” but that’s confusing given how we use the word “word” in English. So what we need is a term ranking function.
Now the assumption we make to decide on a term ranking functions is not a particularly intuitive one – we’ll just skip a few steps and say that a good place to start is to rank terms by their counts of occurrences. Technically, we call this number the term frequency of a term (even though we are counting sums, rather than rates or frequencies, but we’ll see in a subsequent post why we use the word frequency.)
The basis of this assumption is the idea that the more frequently a term appears in a document, the more likely it is that the document is talking about whatever is described by that term.
For example, the word “God” (or one of its variants) is indeed among the most common in the Bible; and “whale” is probably among the most common words in Moby Dick.
Let’s add a simple counting function to our code.3)A less efficient but more easy to read version of this counting function is written up in this gist.
def compute_counts(word_list) word_list.each do |word| word_counts[word] ||= 0 word_counts[word] += 1 end word_counts end
We return a hash where the keys are the words and the values are the counts.
We can now run the most basic “auto tagging” algorithm – parse a document into its “term” parts, and sort in descending order by counts.
In the next post on this topic, we will look at some basic cleanup we can do to make this initial list better.
References [ + ]
|1.||↑||Well, we’ll have to handle single quotes differently, so that “isn’t” is treated as a single word, rather than as two words, “isn” and “t”.|
|2.||↑||We say ‘term’ instead of ‘word’ because we want to refer to things like years (2011) as a “word” but that’s confusing given how we use the word “word” in English.|
|3.||↑||A less efficient but more easy to read version of this counting function is written up in this gist.|