April 18, 2011

HOWTO: Accessing Web Pages in Ruby

I was surprised by how little of this turned up when I searched, though it may be so well known that the most popular web results cover more complicated scenarios. I recently needed to use filenames as keys into a web-based API, fetch the page the API returned, and do some simple processing on it. I was happy with how simple this was to do in Ruby. The result is wget-like functionality from within Ruby.
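
For reference, the wget-like part on its own is only a few lines. This is a minimal sketch, assuming Ruby 2.5 or later (where open-uri provides URI.open); the method name fetch_to_file and the URL in the usage comment are made up for illustration.

```ruby
require 'open-uri'

# Hypothetical helper: download a URL and save the body to a local file,
# roughly what `wget -O path url` does.
def fetch_to_file(url, path)
  URI.open(url) do |page|
    File.binwrite(path, page.read) # binwrite keeps image bytes intact
  end
end

# Usage (needs network access, so it is left commented out):
# fetch_to_file("http://example.com/", "example.html")
```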

Basically, I'll let my code do the talking. It demonstrates two features of Ruby, Dir and open-uri, both of which are pretty cool. This took me about 15 minutes to code up ... I was impressed (no, not with myself, with the language!).

# Problem: I downloaded thousands of images
# from a public repository, image-net.org.
# I chose 70 to use in an experiment. I saved
# those 70 with their original filenames, but
# didn't record the original URLs, which I now
# realize I need for that experiment.
# image-net has an API for getting URLs for an entire
# image set, but I only used a handful (< 1% in
# many cases) from any given set.
# The only link between the downloaded image and
# the API is the filename.
# Solution: Ruby.

# include the open-uri tools
require 'open-uri'

# Get list of files in the current folder 
# Some of them I care about, some I don't
dir_contents = Dir.entries(Dir.pwd)

# Set up some arrays ...
imgs = [] # for image IDs, "[wnid]_[picnum]"
wnids = [] # for image set ids (only the wnid part)

# Parse filenames for image IDs and wnids
dir_contents.each { |dc|
   if dc =~ /((n.+)_.+)\.JPEG/ then
      imgs.push($1) # save the image ID
      if (!wnids.include?($2)) then
         wnids.push($2) # save unseen wnid parts
      end
   end
}

# URL prefix
base = "http://www.image-net.org/api/text/imagenet.synset.geturls.getmapping?wnid="

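# Roll through image sets
wnids.each { |wnid|
   # URI.open comes from open-uri; on Ruby 3.0+ a bare open() no longer fetches URLs
   URI.open(base + wnid) { |page| # download the page, get handle
      page.each_line { |line| # for each line on the page
         # parse the img ID and URL, skipping lines that don't match
         if line =~ /(\S+)\s+(\S+)/ && imgs.include?($1) then
            puts $2 # print the URL if I used it
         end
      }
   }
}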
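
The matching step can also be tried offline, without hitting the API. This is a minimal sketch, assuming each line of the page holds an image ID, whitespace, then a URL, as the script above expects. The method name urls_for, the sample IDs, and the example.com URLs are all made up for illustration and are not real image-net data.

```ruby
require 'set'
require 'stringio'

# Hypothetical helper: collect the URLs whose image ID is in wanted_ids.
# io can be anything with each_line (a downloaded page, a File, a StringIO).
def urls_for(io, wanted_ids)
  wanted = wanted_ids.to_set # Set lookup instead of scanning an array per line
  urls = []
  io.each_line do |line|
    id, url = line.split # split on whitespace: image ID, then URL
    urls << url if url && wanted.include?(id)
  end
  urls
end

# Made-up sample standing in for one page of API output
sample_page = StringIO.new(<<~PAGE)
  n00000001_1 http://example.com/a.jpg
  n00000001_2 http://example.com/b.jpg
  n00000001_3 http://example.com/c.jpg
PAGE

puts urls_for(sample_page, ["n00000001_1", "n00000001_3"])
```

With thousands of image IDs, the Set keeps each lookup cheap, where Array#include? rescans the whole array for every line of the page.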
