In our previous post on this topic, we started exploring some basic text analysis techniques. We looked at *term counts* or *term frequencies* as a measure to help automatically generate tags. The list of words ranked by term frequency is also sometimes referred to as a “word cloud,” especially when the ‘cloud’ is visualized by arranging the words in a circular shape, with more frequent words displayed in larger text.

In the first cut at a tagging algorithm, we probably started to see a few good candidates, but the first problem we would have run into is that the most common words are ones like ‘a’, ‘an’, ‘the’, ‘of’, and ‘that’.

Just to ground this in an actual document, let’s take the text of this article in The Guardian about the Trans Pacific Agreement, and attempt to use our auto tagger to tell us what the document is about. The first 5 terms then will be *the*, *to*, *in*, *of*, and *and*.

What is wrong with including these words as tags? They are so common that they rarely mean anything useful in describing a document’s topic. The technical term for these is *stop words* (or *stopwords*, all one word) because they act like punctuation in a document (hence “stop,” as in the use of “stop” to mean a period in old-timey telegram messages).

So our first cleanup step is to remove stop words from the input before counting the term frequencies. Lists of stop words for most languages are easily obtained online.
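As a sketch of that cleanup in Ruby (the stop word list here is a tiny illustrative sample – real lists run to hundreds of entries):

```ruby
# A tiny illustrative stop word list; real lists are much longer.
STOP_WORDS = %w[a an the of that to in and].freeze

# Remove stop words from a tokenized document before counting terms.
def remove_stop_words(word_list)
  word_list.reject { |word| STOP_WORDS.include?(word.downcase) }
end

remove_stop_words(%w[the senate vote on the trade deal])
# => ["senate", "vote", "on", "trade", "deal"]
```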

The next step will be to “normalize” all the terms by downcasing them so that “Trade” and “trade” count as the same term. With these two cleanups done, we now get the following terms:

- trade
- senate
- jobs
- tpa
- vote

That indeed looks very much like what the document is about! Let’s run this on a few more documents to see what patterns emerge:

- An article in the New Yorker on Marc Andreessen: andreessen, v, company, told, a16z, people
- Reaction to shooting by officer in Wisconsin: robinson, kenny, police, decision, shot
- Prescribing video games for ADHD: game, brain, games, gazzaley, chicken
- Frank Gehry, architect, and the software he uses: gehry, construction, design, software, project

From each of these top 5 lists, we can notice a few problems:

- The stop word list is not perfect, and words like “v” and “told” get through.
- Variants on the same basic word get counted differently – for example, *game* vs. *games*.
- Some words are too generic within the context of the article to be meaningful differentiators – *project* in the Gehry article, or *company* in the Andreessen article.

# The Document Universe

Solving some of these problems requires more advanced text analysis techniques. For example, to count all variants of a word the same, we’ll have to implement a process called *word stemming*, which is beyond the scope of this post.

To address the other issues, we turn now to the next concept – the IDF portion of the text analysis world. Let’s motivate this step by asking a different question than the one we started with – when are two documents about the same thing?

The simple way to answer this question is to say that two documents are about the same thing if the tags are the same for both. But we will quickly find that this is too restrictive a definition. Most documents that are about the same general topic will have slightly different sets of tags.

For example, let’s look at three articles, 1, 2, and 3, that are all about craft beer making. Go ahead, click on those links and take a look at the articles first.

When we run these through our text analysis code, we get the following top 10 tags:

- brewing, brewery, beer, brewer, hess, business, open, founder, edmunds, job
- beer, grain, step, wort, water, sugar, mash, image, sparge, brew
- beer, craft, brewery, start, working, industry, dream, partners, help, homebrew

The lists are very close but not quite the same, and yet if we compare them to an article on wine making, we can see the obvious differences:

wine, grapes, fermentation, skins, tank, red, process, barrel, hand, wines

So how do we quantify this difference?

## Document Similarity and Term Vectors

To quantitatively measure the similarity between documents, or rather between their auto-tag lists, we have to introduce a new concept, perhaps the most mathematical one so far – *term vectors*.

To understand how we use term vectors, let’s start by visualizing them in two dimensions. Suppose that all documents are described by just two tags (rather than the full list of all terms, i.e. words, in the document). And when I say “two tags,” I mean the *same* two tags. This is a bit of a weird assumption, but hang with me here.

The advantage of this assumption is that we can now equally compare all documents to each other, because they are all described using the same terms. The only difference is the term frequency (count) of those terms in each document. This means that we can now draw each document as a “point” on a two-dimensional graph.

Two documents, then, are the same (or similar) if they are close enough to each other in this graph. The further apart they are, the less similar they are to each other.

## Normalization

At this point, we can understand better why we call the scores *term frequencies* rather than *term counts*. If two documents mention the two tags in *relatively equal* quantities, but in different *absolute* quantities, that doesn’t mean they are dissimilar. For example, suppose the two terms we are tagging documents with are “social” and “media”. A tweet might mention them 4 times each, and a lengthy essay about Facebook and Twitter that’s 100 times the length of the tweet might mention them 400 times each. These two documents should still appear close together on the graph.

To make that happen, we *normalize* the counts – that is, we divide the counts by the length of the document. In our code so far, we might make the following change, for example:

```ruby
def compute_frequencies(word_list)
  word_counts = {}
  word_list.each do |word|
    word_counts[word] ||= 0
    word_counts[word] += 1
  end
  word_counts.each do |w, c|
    word_counts[w] = c.to_f / word_list.size
  end
end
```

Here, we replace `compute_counts` with `compute_frequencies` and use the length of the word list as the *normalizing factor*. Now the term counts are normalized to the length of the document, and one way to think about that is that the counts track the number of times each term occurs *per length of document*; hence, term frequency rather than term count.
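As a quick check of the normalization (restating `compute_frequencies` compactly so the snippet runs on its own):

```ruby
# Compact restatement of compute_frequencies: count terms,
# then divide each count by the document length.
def compute_frequencies(word_list)
  counts = Hash.new(0)
  word_list.each { |w| counts[w] += 1 }
  counts.each { |w, c| counts[w] = c.to_f / word_list.size }
end

compute_frequencies(%w[beer beer brewing grain])
# => {"beer"=>0.5, "brewing"=>0.25, "grain"=>0.25}
```

A document 100 times longer with the same relative mix of terms would produce the same frequencies, which is exactly the property we wanted.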

## Cosine Similarity

To actually compare two documents, then, we need to visualize what makes two documents similar and what makes them dissimilar, if we look only at these two terms. Note that because we are computing frequencies, the scores for the two terms cannot sum to more than 1.

Now we can say that the documents are dissimilar if each term is contained in one, and only one, of the documents. If the main terms we are tagging the document with are completely different, then surely their dissimilarity is at its peak!

Similarity is a bit harder to intuit. The idea is that if the terms of interest are distributed similarly *relative to each other*, then the documents are similar – so if the two terms each make up half of all the words in the document, or a third each, or a quarter each, and so on. Of course, they can’t sum to more than 1, so they can’t each be, say, two-thirds of the document!

Visually, this means that if the scores are plotted on a graph, then the documents are most dissimilar if the two lines form a right angle to each other, and are most similar when the two lines lie on top of each other.

Well, we want to express this intuition mathematically. So what we want is a (mathematical) function that has the property of becoming 0 for a right-angle input, and becoming 1 for a 0-angle input. For various reasons, a very popular function that has these properties is the *cosine* function. That is, we define the similarity score between these two documents to be *cos(Θ)*, where Θ is the angle between the lines formed on the graph. These lines are also called the *vectors* representing the two points.
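To see this numerically, here is a small sketch that computes cos(Θ) between two 2-D term vectors using the standard dot-product formula (the vectors are made-up examples):

```ruby
# cos(theta) between two 2-D vectors: dot product divided by
# the product of the vector magnitudes.
def cos_between(v1, v2)
  dot = v1[0] * v2[0] + v1[1] * v2[1]
  mag = ->(v) { Math.sqrt(v[0]**2 + v[1]**2) }
  dot / (mag.call(v1) * mag.call(v2))
end

cos_between([0.5, 0.0], [0.0, 0.5]) # => 0.0 (right angle: most dissimilar)
cos_between([0.1, 0.2], [0.2, 0.4]) # ≈ 1.0  (same direction: most similar)
```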

# Deeper Into Similarity

To understand why we calculate similarity the way we do, it’s helpful to go beyond two tags, to a few more. Let’s say we have four documents all of which have some combination of animal and fruit words. Let’s imagine that our four documents look like this, with the following top term frequencies (we’ll imagine there are more terms we aren’t listing here that sum up the total frequencies to 1):

**Document A:** dog (0.3); cat (0.3); apple (0.1)

**Document B:** dog (0.4); cat (0.3); pineapple (0.15); snake (0.15)

**Document C:** apple (0.1); pineapple (0.1); dog (0.1); cat (0.1); snake (0.1)

**Document D:** apple (0.1); pineapple (0.1); dog (0.1)

So which of these documents are the most similar, and why? We are outside the world of two dimensions, so envisioning the angles between these vectors (that are in at least 5 dimensions) is hard.

This is a good discussion topic – maybe *C* and *B* are the most similar because they are the pair that share the largest number of terms (four). Or maybe *A* and *B* are the most similar because their common terms are distributed relatively similarly.
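One way to settle the discussion is to simply compute the cosine similarities (dot product over the product of magnitudes) for the frequencies listed above – a sketch:

```ruby
# Term-frequency vectors for the four example documents.
DOCS = {
  a: { "dog" => 0.3, "cat" => 0.3, "apple" => 0.1 },
  b: { "dog" => 0.4, "cat" => 0.3, "pineapple" => 0.15, "snake" => 0.15 },
  c: { "apple" => 0.1, "pineapple" => 0.1, "dog" => 0.1, "cat" => 0.1, "snake" => 0.1 },
  d: { "apple" => 0.1, "pineapple" => 0.1, "dog" => 0.1 }
}.freeze

# Cosine similarity between two term-frequency hashes.
def cosine_sim(a, b)
  dot = (a.keys & b.keys).sum { |k| a[k] * b[k] }
  mag = ->(v) { Math.sqrt(v.values.sum { |f| f * f }) }
  dot / (mag.call(a) * mag.call(b))
end

cosine_sim(DOCS[:a], DOCS[:b]) # ≈ 0.89
cosine_sim(DOCS[:b], DOCS[:c]) # ≈ 0.82
```

On this toy data, *A* and *B* edge out *B* and *C*: shared terms with similar relative weights matter more than the sheer count of shared terms.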

# Calculating the Cosine Similarity

How do we get the value of cos Θ? To derive this from basic ideas, you’ll need to understand a little bit of vector algebra and a little bit of trigonometry, neither of which we’ll go into here. The short-ish answer is that the cosine similarity between two points (x₁, y₁) and (x₂, y₂) is the **dot product divided by the product of the vector magnitudes**.

Specifically, it’s calculated like this:

cos Θ = (x₁·x₂ + y₁·y₂) / (√(x₁² + y₁²) · √(x₂² + y₂²))

Now we have to take a big mental leap, because we know that our documents are not described by just two tags. Our tag lists run into the tens or even hundreds of terms. Additionally, remember that each term is a dimension in itself – so if a document doesn’t have a particular term in it, that term’s score is 0. The vector that defines a document is then as large as the number of possible words in the document’s language, including all known names of people, companies, chemical substances, etc.

That would seem to make each document an extremely large vector, but because a term that doesn’t exist will have a score of 0, it will not participate either in the dot product (numerator) or the vector magnitudes (denominator), when computing cosine similarity.

So, for the numerator, we only care about those terms that occur in *both* documents being compared (each document’s magnitude still sums over all of its own terms). This makes document comparison much simpler by *limiting the document universe to only the words that are actually mentioned in the documents of interest.*

Let’s write a couple of methods to calculate cosine similarity. One for its numerator:

```ruby
def dot_product(a, b)
  common_keys = a.keys & b.keys
  common_keys.inject(0) do |memo, k|
    memo + a[k] * b[k]
  end
end
```

And one for its denominator:

```ruby
def magnitude(a)
  sqr_mag = a.inject(0) do |memo, (_, freq)|
    memo + freq * freq
  end
  Math.sqrt(sqr_mag)
end
```

The cosine similarity of two documents is then written as:

```ruby
def cosine_sim(a, b)
  dot_product(a, b) / (magnitude(a) * magnitude(b))
end
```
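As a quick end-to-end check with two made-up two-term documents (the three methods are restated so this snippet runs on its own):

```ruby
def dot_product(a, b)
  (a.keys & b.keys).inject(0) { |memo, k| memo + a[k] * b[k] }
end

def magnitude(a)
  Math.sqrt(a.inject(0) { |memo, (_, f)| memo + f * f })
end

def cosine_sim(a, b)
  dot_product(a, b) / (magnitude(a) * magnitude(b))
end

beer = { "beer" => 0.5, "brewing" => 0.5 }
wine = { "beer" => 0.5, "wine" => 0.5 }
cosine_sim(beer, beer) # ≈ 1.0 (identical documents)
cosine_sim(beer, wine) # ≈ 0.5 (only "beer" in common)
```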

# Better Tag Detection

Having the cosine similarity allows us to take a less-naïve approach to attaching tags to documents. The tag for a document can take advantage of not just the content in that document but the content in other documents that are similar to it. If two documents are about the same topic, then the tags describing the topic should occur fairly regularly in *both* of them.

What we are doing is extending our First Assumption (“The topic of a document is in the document itself”) and making it:

**First Assumption (modified):** The topic of a document is in that document, as well as in other documents similar to it.

Additionally, we now have the cosine similarity metric to help us. The common terms that we should pick, given two documents, should be those that contribute the most to this metric. If we simply sort the terms that are common to both documents in decreasing order of the product of their respective term frequencies, we get the terms sorted by the highest “contribution.”
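A sketch of that sorting step (`terms_by_commonality` is a hypothetical helper name; `a` and `b` are term-frequency hashes like the ones we have been building):

```ruby
# Terms common to both documents, sorted by how much each term
# contributes to the cosine similarity numerator (the product of
# its frequencies in the two documents).
def terms_by_commonality(a, b)
  (a.keys & b.keys).sort_by { |k| -(a[k] * b[k]) }
end

doc1 = { "beer" => 0.04, "grain" => 0.02, "wort" => 0.02 }
doc2 = { "beer" => 0.05, "brewery" => 0.03, "grain" => 0.01 }
terms_by_commonality(doc1, doc2)
# => ["beer", "grain"]
```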

Here’s what happens when we run this algorithm on three documents, **A**, **B**, and **C**, about beer making:

```
Document A: beer, grain, step, wort, water, sugar, mash, image, sparge, brew
Document B: brewing, brewery, beer, brewer, hess, business, open, founder, edmunds, job
Similarity is 0.2268473916106815
Terms by commonality frequency are: beer, brewing, want, brew, going, step, yeast, way, open, bottles

Document B: brewing, brewery, beer, brewer, hess, business, open, founder, edmunds, job
Document C: beer, craft, brewery, start, working, industry, dream, partners, help, homebrew
Similarity is 0.3893728936308465
Terms by commonality frequency are: beer, brewery, craft, brewing, industry, brewers, homebrew, brewer, job, take
```

If we now look only at the terms that are in both lists, sorted by “commonality frequency,” the ones that jump out are terms like “brewing” and “beer.” As we add more similar documents to this algorithm (in effect, training it over time), this list will grow to include more relevant terms like these.

Clearly, the new terms are more about beer making in general, than the terms we got by looking at each document in isolation!

So far, we have figured out how to use cosine similarity to improve the quality of the auto tags assigned to documents. In the next, and final, post on this topic, we will take this a step further and look at collections of documents, and the principles involved in using these collections to create “classification models” that auto-categorize documents.
