HTML and SQL Basics Using A Ruby Web Scraper Example
In this lesson, we will learn some basics about two languages that are fundamental to how many web applications work: HTML and SQL. To make this concrete, the goal of this exercise is to write a Ruby script that looks at the newest stories on Reddit and inserts into a local database the following information about the top 10 of those stories: the author, how many points it has, its category, its title, and the date it was posted.
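Before touching any HTML or SQL, it can help to pin down what one scraped story looks like in Ruby. The following is a minimal sketch, not the final script: the `Story` struct, its field names, the `top_stories` helper, and the sample values are all made up for illustration.

```ruby
# One scraped story. The five fields mirror the goal described above;
# the struct name and field names are illustrative choices.
Story = Struct.new(:author, :points, :category, :title, :posted_at, keyword_init: true)

# Keep only the first `limit` stories of a listing (hypothetical helper).
def top_stories(stories, limit = 10)
  stories.first(limit)
end

# Made-up sample data standing in for a scraped listing.
sample = (1..12).map do |i|
  Story.new(author: "user#{i}", points: i * 3, category: "programming",
            title: "Story number #{i}", posted_at: "2014-01-01")
end

selected = top_stories(sample)
selected.each { |story| puts "#{story.title} by #{story.author} (#{story.points} points)" }
```

Holding each story in one small object keeps the scraping step and the database step separate: the scraper only has to produce `Story` values, and the database code only has to store them.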
HTML, or Hypertext Markup Language, is the markup language that browsers read and render as web pages. It has been around for over thirty years and was initially designed by Tim Berners-Lee (who recently appeared on the Internets pleading for your help in keeping the Internets open; go watch this video!).
SQL, or Structured Query Language, is used in creating and querying what are called “relational” databases – where the stored data refers to various objects that have relationships to each other (a typical example would be an e-commerce database, where products are related to the companies that sell them.)
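To get a feel for what "related" means, here is a small sketch that mimics the e-commerce example in plain Ruby, with no database involved. The company and product rows are invented, and the SQL in the comment is only the rough equivalent of what the Ruby code computes.

```ruby
# Made-up rows standing in for two database tables.
COMPANIES = [
  { id: 1, name: "Acme" },
  { id: 2, name: "Globex" }
].freeze

# Each product points at the company that sells it through company_id,
# which plays the role of a foreign key.
PRODUCTS = [
  { id: 10, name: "Anvil",  company_id: 1 },
  { id: 11, name: "Rocket", company_id: 1 },
  { id: 12, name: "Widget", company_id: 2 }
].freeze

# Roughly what this SQL query would return:
#   SELECT products.name, companies.name
#   FROM products JOIN companies ON products.company_id = companies.id;
companies_by_id = COMPANIES.to_h { |company| [company[:id], company] }
joined = PRODUCTS.map do |product|
  [product[:name], companies_by_id.fetch(product[:company_id])[:name]]
end

joined.each { |product, company| puts "#{product} is sold by #{company}" }
```

A relational database does this matching for you: you describe the relationship in the query, and the database works out which rows belong together.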
This lesson uses Ruby to build some simple pieces of code that illustrate how these languages work. A basic working knowledge of Ruby is essential; knowledge of the libraries used here is not. The various applications referenced in this post work best on a *nix machine, such as a flavor of Linux or OS X. If you’re running Windows, use Vagrant or another virtual machine setup that provides a *nix environment on Windows.
We will use SQLite3 as our relational database.
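As a rough idea of where this is heading, the SQL for storing the five pieces of story information might look like the sketch below. The table name `stories`, the column names, and the file name `reddit.db` are assumptions made for this example. The SQL is kept in plain Ruby strings so the snippet runs without any gems; the commented lines show how it would typically be handed to the sqlite3 gem.

```ruby
# Hypothetical schema: one row per story, one column per field we want to keep.
CREATE_STORIES_SQL = <<~SQL
  CREATE TABLE IF NOT EXISTS stories (
    id        INTEGER PRIMARY KEY,
    author    TEXT,
    points    INTEGER,
    category  TEXT,
    title     TEXT,
    posted_at TEXT
  );
SQL

# The question marks are placeholders that the database driver fills in,
# so values are never pasted into the SQL string by hand.
INSERT_STORY_SQL = <<~SQL
  INSERT INTO stories (author, points, category, title, posted_at)
  VALUES (?, ?, ?, ?, ?);
SQL

# With the sqlite3 gem installed, these would typically be run like this
# (the values are made-up sample data):
#   require "sqlite3"
#   db = SQLite3::Database.new("reddit.db")
#   db.execute(CREATE_STORIES_SQL)
#   db.execute(INSERT_STORY_SQL, ["alice", 42, "programming", "A title", "2014-01-01"])
puts CREATE_STORIES_SQL
puts INSERT_STORY_SQL
```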
The first two lessons to read are on HTML and SQL. Check them out on Conversational Coding, the beginners’ programming site that I recently started with a friend.
Once you have that figured out, we can go back to our initial idea of reading the Reddit page of top posts.
The first thing you have to do when scraping any webpage is to examine its HTML structure. The easiest way to do this is to use the HTML visualization and debugging aids on your browser – both Firefox and Chrome have some pretty slick interfaces to help you navigate the HTML or the DOM for any webpage.
We’ll use Firefox and its “Inspect Element” feature for this lesson – right click on any webpage and you’ll see it there. (On Chrome, you can get similar information using the Developer Tools interface under the Settings button at the top-right of the Chrome window.)
When you look through the HTML tree structure that opens up at the bottom of the Firefox window, you can see how different parts of the page will highlight as you move the mouse around the tree structure.
Here, we are selecting a div that matches a single post; consequently, that post is highlighted in the browser screen.
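The same idea, picking out the div that wraps each post, can be sketched in Ruby. This example runs on a tiny hand-written snippet rather than the real Reddit page, and the class names `post` and `title` are invented for illustration; use whatever names the inspector actually shows you. It uses REXML, which ships with Ruby but only accepts well-formed markup, so a real-world page would normally need a more forgiving parser such as Nokogiri.

```ruby
require "rexml/document"

# A small, well-formed stand-in for a listing page (not Reddit's real markup).
SAMPLE_HTML = <<~HTML
  <html>
    <body>
      <div id="listing">
        <div class="post"><a class="title" href="/a">First story</a></div>
        <div class="post"><a class="title" href="/b">Second story</a></div>
        <div class="sidebar">Not a post</div>
      </div>
    </body>
  </html>
HTML

doc = REXML::Document.new(SAMPLE_HTML)

# Select every div whose class attribute is exactly "post".
post_divs = REXML::XPath.match(doc, "//div[@class='post']")

# Inside each matching div, read the text of its link.
titles = post_divs.map { |div| div.elements["a"].text }
puts titles
```

The XPath expression plays the same role as clicking through the tree in the inspector: it names the kind of element that wraps one post, and the parser returns every element that fits.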