TF/IDF. WTF? Part II

In our previous post on this topic, we started exploring some basic text analysis techniques. We looked at term counts, or term frequencies, as a measure to help generate tags automatically. The list of words by term frequency is sometimes also called a “word cloud”, especially when it is visualized by arranging the words in a circular shape, with more frequent words displayed in larger text.
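As a quick refresher, term counting can be sketched in a few lines of Python. This is only an illustration: the function name `term_frequencies`, the tokenizing regular expression, and the sample sentence are assumptions made for this sketch, not code from the earlier post.

```python
import re
from collections import Counter

def term_frequencies(text):
    """Count how often each lowercased word appears in the text."""
    # Assumption: a "word" is a run of letters or apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)

# Hypothetical sample document, used only for illustration.
sample = "The cat sat on the mat. The mat was flat."
top_terms = term_frequencies(sample).most_common(2)
print(top_terms)  # [('the', 3), ('mat', 2)]
```

Even in this tiny sample, the most frequent term is ‘the’, which says nothing about the subject of the text.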

Our first cut at a tagging algorithm probably surfaced a few good candidates, but it also ran into an immediate problem: the most common word is almost always one of ‘a’, ‘an’, ‘the’, ‘of’, ‘that’, or something similar.
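One common way to deal with such words is to drop them before counting. The sketch below shows that idea under stated assumptions: the `STOP_WORDS` set is a tiny hand-picked list taken from the examples above, `candidate_tags` is a hypothetical helper name, and this is not necessarily the approach the full post goes on to take.

```python
import re
from collections import Counter

# Assumption: a deliberately small, illustrative stop-word list.
STOP_WORDS = {"a", "an", "the", "of", "that"}

def candidate_tags(text, limit=3):
    """Return the most frequent words that are not in STOP_WORDS."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(word for word in words if word not in STOP_WORDS)
    return [word for word, _ in counts.most_common(limit)]

# Hypothetical sample document, used only for illustration.
sample = "The cat sat on the mat. The mat was flat."
print(candidate_tags(sample, limit=1))  # ['mat']
```

A fixed list like this only goes so far, since which words count as uninformative depends on the collection of documents being tagged.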

Continue reading “TF/IDF. WTF? Part II”

TF/IDF. WTF? Part I

In this post, we will go over some basic theory behind statistical analysis of text. This theory encompasses some of the ideas used in a lot of modern big data analytics and is surprisingly relevant and useful even today, over forty years after it was invented.

Let’s start with a simple question: what is a document about? Put another way, if we had a document that needed to be tagged, how could we do that automatically, that is, write a program to do it?

Continue reading “TF/IDF. WTF? Part I”

Some Common Tricks with Ruby on Rails

Today we will use Ruby on Rails along with some JavaScript libraries to code a few common patterns:

  • Autocomplete in a text box
  • Uploading files by dragging and dropping them into an area inside a browser

Autocomplete

The JavaScript library jQuery.token-input is great for quickly building a JavaScript-based autocomplete solution. You have to do only three things to get started:

Continue reading “Some Common Tricks with Ruby on Rails”