Wednesday, October 07, 2015

Web Scraping with Ruby

Today I spent some time learning how to do web scraping with Ruby, following this article, which is admittedly out of date:

I had to make a few edits along the way to fix some bugs and thought I would share. Look for "UPDATE" below.

# --- Get a list of all US Presidents' names, loop through to check for any with last names >6 characters and see if they have died. Calculate the average age for those Presidents ---

# open the required libraries
require 'rubygems'
require 'nokogiri'
require 'open-uri'

# Using nokogiri, fetch Wikipedia's list of presidents page
# UPDATE: Had to use https
list_of_presidents = Nokogiri::HTML(open(''))

# Using another nokogiri method, grab the second column (td) from every row (tr), and from those, grab the first hyperlink (a) which should also contain the President's name
# UPDATE: Had to modify XPath to match Wikipedia's current layout
an_array_of_links = list_of_presidents.xpath("//tr/td[2]/b/a[1]")
# UPDATE: I changed the output a little, so used a few different variables
long_name_count = 0
dead_prez_count = 0
alive_prez_count = 0
total_age = 0

an_array_of_links.each do |link_to_test|
  # The statement above can be read as: for each element in an_array_of_links, do
  # the following code (until the matching "end" line).
  # As you go through each element, the variable used to reference the element is named "link_to_test".

  # break up the name string by spaces, then take the last element of the resulting array to get the last name
  last_name = link_to_test.content.split(' ')[-1]

  if last_name.length > 6
    long_name_count += 1
    the_link_to_the_presidents_page = link_to_test["href"]
    # The value of href is going to be something like "/wiki/George_Washington".
    # That's an address relative to the Wikipedia site,
    # so we need to prepend "" to have a valid address...
    # UPDATE: Had to use https
    the_link_to_the_presidents_page = "" + the_link_to_the_presidents_page

    # now let's fetch that page
    the_presidents_page = Nokogiri::HTML(open(the_link_to_the_presidents_page))

    # check if they died
    # UPDATE: Had to change this to remove .content for cases where there is no death_date, then change the 'if' immediately below
    death_date = the_presidents_page.xpath("//th[contains(text(), 'Died')]/following-sibling::*")[0]
    if death_date && death_date.content && death_date.content[0]
      # check what their age was
      # UPDATE: Had to change this to add .content and add * to the regex to get all digits of the President's age
      # String#[] with a capture index returns nil (rather than raising) when there is no "aged XX" text
      age_at_death = death_date.content[/aged.+?([0-9]*)/, 1]
      if age_at_death
        # we only get here if there was a "Died" table cell AND a text pattern similar to: "aged XX"
        puts "Age of #{link_to_test.content} is: #{age_at_death}"
        # UPDATE: Had to change to age_at_death.to_i to use the full age, not just a single digit
        # technically, age_at_death is a String. to_i will make it an Integer so we can safely add it to total_age
        total_age += age_at_death.to_i
        dead_prez_count += 1
      end
    else
      puts "#{link_to_test.content} is still alive"
      alive_prez_count += 1
    end
  end
end
# OK, we're at the end of the each loop. Go back to the top

puts "Total Presidents: #{an_array_of_links.count}"
puts "...with Surname >6 characters: #{long_name_count}, (#{an_array_of_links.count - long_name_count} had short names)"
puts "...that have died: #{dead_prez_count}, (#{alive_prez_count} are still alive)"
puts "Of dead presidents, their total age is #{total_age} and average age is #{total_age / dead_prez_count}"
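
If you want to try the string handling without hitting Wikipedia, here is a minimal sketch that pulls out just the last-name split and the "aged XX" regex. The helper names (last_name, age_at_death) and the sample strings are my own assumptions for illustration, not part of the script above. It uses String#[] with a regexp and a capture index, which returns nil when the pattern is absent. It also treats an empty capture as "no age", because [0-9]* is allowed to match zero digits.

```ruby
# Hypothetical helpers (not part of the script above) that isolate the
# string handling so it can be tried without any network access.

# Last whitespace-separated word of a name, same approach as the script.
def last_name(full_name)
  full_name.split(' ')[-1]
end

# Age captured from a "Died"-style string as an Integer,
# or nil when no digits follow "aged".
def age_at_death(died_text)
  age = died_text[/aged.+?([0-9]*)/, 1]
  age.nil? || age.empty? ? nil : age.to_i
end

# Made-up sample strings, shaped like the text of a "Died" table cell.
samples = {
  'George Washington' => 'December 14, 1799 (aged 67) Mount Vernon, Virginia',
  'Example Person'    => 'date unknown'
}

samples.each do |name, died_text|
  surname = last_name(name)
  puts "#{name}: surname has #{surname.length} characters, age at death #{age_at_death(died_text).inspect}"
end
```

Returning nil for "no age found" keeps the caller's check to a plain `if`, and converting with to_i inside the helper means the total can be summed without further conversion.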
