Searching MediaWikis with Ruby (or treating MediaWiki a little like a database)

Here’s a little treat that I threw together to help me search Wikipedia for film data. It’s pretty simple. Example usage below:

wikipedia = Wikipedia::Search.new "http://en.wikipedia.org"
result = wikipedia.search("Firefox")
if result.is_a? String
    # Then I've obtained the content for an actual page in a MediaWiki
else
    # I have an Array containing Hashes with metadata about the top 20 candidates
    # Now, for giggles, I'll get the content for the first hit
    html = Net::HTTP.get_response(URI.parse(result.first[:url])).body
end

The returned Array of Hashes, in the second case, has three keys:

:url - The complete URL to the page in the MediaWiki
:title - The Wikipedia page title
:weight - The relevance percentage supplied by the MediaWiki search (as a String)
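
As a rough sketch of working with that Array, here is one way to pick the strongest candidate. The sample data and the best_hit helper are made up for illustration (they are not part of the class below); the only thing taken from the real code is that :weight comes back as a String, so it needs converting before comparing.

```ruby
# Hypothetical sample of what #search might hand back for an ambiguous query.
# The URLs, titles and weights are invented for this example.
sample_results = [
  { :url => "http://en.wikipedia.org/wiki/Firefox_(film)",  :title => "Firefox (film)",  :weight => "87.5" },
  { :url => "http://en.wikipedia.org/wiki/Firefox",         :title => "Firefox",         :weight => "100.0" },
  { :url => "http://en.wikipedia.org/wiki/Firefox_(novel)", :title => "Firefox (novel)", :weight => "64.2" }
]

# Hypothetical helper: return the hit with the highest relevance.
# :weight is a String captured by a regexp, so convert it with to_f first.
def best_hit(results)
  results.max_by { |hit| hit[:weight].to_f }
end

best = best_hit(sample_results)
puts "#{best[:title]} -> #{best[:url]}"
```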

Pretty simple? I thought so.

Code follows:

require 'net/http'
require 'uri'

module Wikipedia
  class Search

    SEARCH_PG_PTN = 'wgPageName = "Special:Search";'
    SEARCH_ITEM_START_PTN = '<li style="padding-bottom: 1em;">'
    SEARCH_ITEM_END_PTN = '</li>'
    ITEM_HIT_START = '<a href='

    # <i>wiki_url</i>: The URL of the wiki to search
    def initialize(wiki_url)
      wiki_url.chop! if wiki_url =~ /\/$/
      @wiki_url = wiki_url
      @search_url = wiki_url + "/wiki/Special:Search"
    end

    # borrowed from Net::HTTP
    def fetch(uri_str, limit = 10)
      raise ArgumentError, 'HTTP redirect too deep' if limit == 0
      response = Net::HTTP.get_response(URI.parse(uri_str))
      case response
        when Net::HTTPSuccess     then response
        when Net::HTTPRedirection then fetch(response['location'], limit - 1)
        else response.error!
      end
    end

    def search(query_str)
      response = Net::HTTP.post_form URI.parse(@search_url), :search => query_str, :go => "Go"
      if response.body =~ /#{SEARCH_PG_PTN}/
        multiple_search_results_in response
      elsif response.is_a? Net::HTTPRedirection
        fetch(response['location']).body
      else
        response.body  
      end
    end

    private
    def multiple_search_results_in(response)
      raw = response.body.split(/#{SEARCH_ITEM_START_PTN}|#{SEARCH_ITEM_END_PTN}/)
      hits = raw.select { |l| l =~ /^#{ITEM_HIT_START}/ }
      hits.collect do |h|
        h =~ /href="(.*)" title.*>(.*)<\/a>.*Relevance: (.*)%/
        { :url => @wiki_url + $1, :title => $2, :weight => $3 }
      end
    end
  end
end
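
To see the hit parsing without touching the network, here is a small offline sketch. The HTML fragment is invented: its shape is only an assumption inferred from the patterns in the class above, not captured from a real wiki. parse_hits, ITEM_START, ITEM_END and SAMPLE_BODY are names made up for this example, and the capture groups are deliberately stricter than the greedy ones in the original regexp.

```ruby
# Delimiters copied from the patterns the class above splits on.
ITEM_START = '<li style="padding-bottom: 1em;">'
ITEM_END   = '</li>'

# Invented search-results fragment (assumed markup, not real wiki output).
SAMPLE_BODY = <<~HTML
  <ul>
  <li style="padding-bottom: 1em;"><a href="/wiki/Firefox_(film)" title="Firefox (film)">Firefox (film)</a><br />Relevance: 98.7% - 12 KB</li>
  <li style="padding-bottom: 1em;"><a href="/wiki/Mozilla_Firefox" title="Mozilla Firefox">Mozilla Firefox</a><br />Relevance: 91.2% - 80 KB</li>
  </ul>
HTML

# Hypothetical stand-alone version of the private parsing step.
def parse_hits(body, wiki_url)
  # Regexp.escape keeps the literal delimiters from being read as regexp syntax.
  chunks = body.split(/#{Regexp.escape(ITEM_START)}|#{Regexp.escape(ITEM_END)}/)
  hits = chunks.select { |chunk| chunk.start_with?('<a href=') }
  hits.map do |chunk|
    match = chunk.match(/href="([^"]*)" title[^>]*>(.*?)<\/a>.*Relevance: ([\d.]+)%/)
    next unless match # skip anything that does not look like a hit
    { :url => wiki_url + match[1], :title => match[2], :weight => match[3] }
  end.compact
end

parse_hits(SAMPLE_BODY, "http://en.wikipedia.org").each do |hit|
  puts "#{hit[:weight]}%  #{hit[:title]}  #{hit[:url]}"
end
```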

Update: It still isn’t 100% functional. After beating my head against MediaWiki for a whole weekend, it turns out that users enter film metadata in several different permutations. Fetching the movie poster is simple enough. However, the writer, director, and starring sections are quite a bit more complex. I’m fairly close, though. The problem is more easily solved if I simplify the regexps to just yank out the contents of entire cells within the table, strip the HTML, and otherwise leave the cell contents intact without making assumptions about the content (and further borking the formatting).
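
The "grab whole cells, strip the HTML" idea might look something like the sketch below. This is not the finished parser: the table fragment is invented, the th/td row layout is an assumption about how a film infobox is marked up, and infobox_cells, strip_html and SAMPLE_TABLE are hypothetical names.

```ruby
# Invented infobox fragment; real pages vary a lot, which is the whole problem.
SAMPLE_TABLE = <<~HTML
  <table class="infobox">
  <tr><th>Directed by</th><td><a href="/wiki/Clint_Eastwood">Clint Eastwood</a></td></tr>
  <tr><th>Starring</th><td><a href="/wiki/Clint_Eastwood">Clint Eastwood</a><br /><a href="/wiki/Freddie_Jones">Freddie Jones</a></td></tr>
  </table>
HTML

# Turn <br> into newlines, drop every other tag, keep the text as-is.
def strip_html(fragment)
  fragment.gsub(/<br\s*\/?>/i, "\n").gsub(/<[^>]+>/, "").strip
end

# Pull each label/cell pair out whole, without guessing at what is inside.
def infobox_cells(html)
  pairs = html.scan(/<th[^>]*>(.*?)<\/th>\s*<td[^>]*>(.*?)<\/td>/m)
  pairs.each_with_object({}) do |(label, cell), cells|
    cells[strip_html(label)] = strip_html(cell)
  end
end

infobox_cells(SAMPLE_TABLE).each do |label, text|
  puts "#{label}: #{text.inspect}"
end
```

Splitting a cell into individual names is left to the caller, since that is the step where the permutations bite.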

2 comments ↓

#1 green on 08.26.08 at 10:20 am

hey i’m very interested in a wikipedia movie parser. did you finish your project? i just started to learn ruby .. and rails and i guess if I find nothing else I will take the code above and try to complete it

#2 Evan on 08.26.08 at 10:59 am

Huh. You know what? I did get it working — but oddly I didn’t publish it to rubyforge at the time.

Lemme go dig it up and plop it on github. IIRC, it wasn’t perfect but it was adequate for my needs to dig up a plot summary and movie poster.

Actor, director, and other information, it turned out, was inconsistently formatted between movies. I could extract the information but parsing out individual names of people was annoying due to the # of permutations (not saying it was impossible — but I wasn’t feeling as creative at the time).

The film posters, however, were predictable enough. As was capturing the first textual paragraph which, typically, is a plot summary.

I’ll slap it on the TODO list. Might even have a chance to look at it tonight from the hotel room.
