Sameer Siruguri


HTML and SQL Basics Using A Ruby Web Scraper Example

In this lesson, we will learn some basics about two languages that are fundamental to how many web applications work: HTML and SQL. To make this concrete, the goal of this exercise is to write a Ruby script that looks at the newest stories on Reddit and inserts into a local database the following information about the top 10 of those stories: the author, the number of points, the category, the title, and the date it was posted.

HTML, or HyperText Markup Language, is the markup language that web browsers read and render. It has been around since the early 1990s and was designed initially by Tim Berners-Lee (who recently appeared on the Internets pleading for your help in keeping the Internets open – go watch this video!).

SQL, or Structured Query Language, is used in creating and querying what are called “relational” databases – where the stored data refers to various objects that have relationships to each other (a typical example would be an e-commerce database, where products are related to the companies that sell them.)

This lesson uses Ruby to build some simple pieces of code that illustrate how these languages work. A basic programming knowledge of Ruby is essential; knowledge of the libraries used here is not. The various applications referenced in this post work best on a *nix machine, such as a flavor of Linux or OS X. If you’re running Windows, use Vagrant or a similar tool to set up a virtual machine that provides a *nix experience on Windows.

We will use SQLite3 as our relational database.
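As a rough sketch of where we are headed, the five fields listed in the goal above map naturally onto one table. The table name `stories`, the column names, and the sample row below are all assumptions made for illustration. The snippet uses the `sqlite3` gem if it is installed, and otherwise just prints the SQL it would have run.

```ruby
# Sketch only: the table name, column names, and sample data are assumptions.
Story = Struct.new(:author, :points, :category, :title, :posted_at)

# One row per story; SQLite has no date type, so the date is stored as text.
CREATE_STORIES_SQL = <<~SQL
  CREATE TABLE IF NOT EXISTS stories (
    id        INTEGER PRIMARY KEY,
    author    TEXT,
    points    INTEGER,
    category  TEXT,
    title     TEXT,
    posted_at TEXT
  )
SQL

# The ? placeholders are filled in by the database driver, which avoids
# pasting untrusted text from a web page straight into the SQL string.
INSERT_STORY_SQL = <<~SQL
  INSERT INTO stories (author, points, category, title, posted_at)
  VALUES (?, ?, ?, ?, ?)
SQL

sample = Story.new("example_user", 42, "programming",
                   "A made-up title", "2014-01-01T12:00:00Z")

begin
  require "sqlite3"
  db = SQLite3::Database.new(":memory:") # throwaway in-memory database
  db.execute(CREATE_STORIES_SQL)
  db.execute(INSERT_STORY_SQL, sample.to_a)
  rows = db.execute("SELECT author, points, title FROM stories ORDER BY points DESC")
  rows.each do |author, points, title|
    puts "#{points} points: #{title} (by #{author})"
  end
rescue LoadError
  puts "sqlite3 gem not installed; the SQL that would run is:"
  puts CREATE_STORIES_SQL, INSERT_STORY_SQL
end
```

Replacing `":memory:"` with a file name would keep the data on disk between runs.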

The first two lessons to read are on HTML and SQL – check them out on the beginners’ programming site that I recently started with a friend, called Conversational Coding.

  1. HTML: What is HTML and how do you process it? Also, what is a DOM?
  2. SQL: How to talk to a database

Once you have that figured out, we can go back to our initial idea of reading the Reddit page of top posts.

The first thing you have to do when scraping any webpage is to examine its HTML structure. The easiest way to do this is to use the HTML visualization and debugging aids on your browser – both Firefox and Chrome have some pretty slick interfaces to help you navigate the HTML or the DOM for any webpage.

We’ll use Firefox and its “Inspect Element” feature for this lesson – right click on any webpage and you’ll see it there. (On Chrome, you can get similar information using the Developer Tools interface under the Settings button at the top-right of the Chrome window.)

Reddit Top News articles

When you look through the HTML tree structure that opens up at the bottom of the Firefox window, you can see how different parts of the page are highlighted as you move the mouse over the tree.

Highlighting an element

Here, we are selecting a div that matches a single post – consequently, that post is highlighted in the browser screen.
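The same kind of selection can be sketched in code. The markup below is a made-up, simplified stand-in for a listing page, not Reddit's actual HTML, and the class names and `data-author` attribute are assumptions. It uses REXML, which comes with standard Ruby installs, and that only works here because the fragment is well-formed XML. Real pages usually are not, so a dedicated HTML parser such as Nokogiri is the more common choice.

```ruby
require "rexml/document"

# Hypothetical markup: each post is a div whose class list contains "thing".
html = <<~HTML
  <div id="siteTable">
    <div class="thing link" data-author="alice">
      <a class="title" href="http://example.com/a">First made-up post</a>
    </div>
    <div class="thing link" data-author="bob">
      <a class="title" href="http://example.com/b">Second made-up post</a>
    </div>
    <div class="clearleft"></div>
  </div>
HTML

doc = REXML::Document.new(html)

# Select every div that matches a single post, much like picking one
# in the browser's inspector.
posts = REXML::XPath.match(doc, "//div[contains(@class, 'thing')]")

# Pull out the pieces we care about from each matching div.
summaries = posts.map do |post|
  title_link = post.elements["a[@class='title']"]
  { author: post.attributes["data-author"], title: title_link.text }
end

summaries.each { |s| puts "#{s[:title]} (by #{s[:author]})" }
```

The XPath expression plays the role of the mouse here: it names the pattern that every post div shares, so the same code picks up each post on the page.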

 
