Searching MediaWikis with Ruby (or treating MediaWiki a little like a database)

Here’s a little treat that I threw together to help me search Wikipedia for film data. It’s pretty simple. Example usage below:

wikipedia = Wikipedia::Search.new "http://en.wikipedia.org"
result = wikipedia.search("Firefox")
if result.is_a? String
    # Then I've obtained the content for an actual page in a MediaWiki
else
    # I have an Array containing Hashes with metadata about the top 20 candidates
    # Now, for giggles, I'll get the content for the first hit
    html = Net::HTTP.get_response(URI.parse(result.first[:url])).body
end

In the second case, each Hash in the returned Array has three keys:

:url - The complete URL to the page in the MediaWiki
:title - The Wikipedia page title
:weight - The percentile weight supplied by the MediaWiki search
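For instance, a single entry might look something like this (the values are made up for illustration; note that :weight comes straight out of a regexp capture, so it's a String):

{ :url => "http://en.wikipedia.org/wiki/Mozilla_Firefox", :title => "Mozilla Firefox", :weight => "100" }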

Pretty simple? I thought so.

Code follows:

require 'net/http'
require 'uri'

module Wikipedia
  class Search

    SEARCH_PG_PTN = 'wgPageName = "Special:Search";'
    SEARCH_ITEM_START_PTN = '<li style="padding-bottom: 1em;">'
    SEARCH_ITEM_END_PTN = '</li>'
    ITEM_HIT_START = '<a href='

    # wiki_url: The URL of the wiki to search
    def initialize(wiki_url)
      wiki_url.chop! if wiki_url =~ /\/$/
      @wiki_url = wiki_url
      @search_url = wiki_url + "/wiki/Special:Search"
    end

    # borrowed from Net::HTTP
    def fetch(uri_str, limit = 10)
      raise ArgumentError, 'HTTP redirect too deep' if limit == 0
      response = Net::HTTP.get_response(URI.parse(uri_str))
      case response
        when Net::HTTPSuccess     then response
        when Net::HTTPRedirection then fetch(response['location'], limit - 1)
        else response.error!
      end
    end

    def search(query_str)
      response = Net::HTTP.post_form URI.parse(@search_url), :search => query_str, :go => "Go"
      if response.body =~ /#{SEARCH_PG_PTN}/
        multiple_search_results_in response
      elsif response.is_a? Net::HTTPRedirection
        fetch(response['location']).body
      else
        response.body
      end
    end

    private
    def multiple_search_results_in(response)
      raw = response.body.split(/#{SEARCH_ITEM_START_PTN}|#{SEARCH_ITEM_END_PTN}/)
      hits = raw.select { |l| l =~ /^#{ITEM_HIT_START}/ }
      hits.collect do |h|
        h =~ /href="(.*)" title.*>(.*)<\/a>.*Relevance: (.*)%/
        { :url => @wiki_url + $1, :title => $2, :weight => $3 }
      end
    end
  end
end

Update: And it still isn’t 100% functional yet either. It turns out, after beating my head against MediaWiki for a whole weekend, that there are several permutations of user-entered metadata in MediaWiki. Fetching the movie poster would be simple enough, but the writer, director, and starring sections are quite a bit more complex. I’m actually fairly close. The problem can be solved more easily if I simplify the regexps to just yank out the contents of entire cells within the table, strip the HTML, and otherwise leave the cell contents intact without making assumptions about the content (and further borking the formatting).
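A rough sketch of what I mean by "strip the HTML and leave the cell intact" (the strip_html helper and the sample cell below are hypothetical, not part of the class above):

# Remove tags and collapse whitespace, keeping the cell's text as-is
def strip_html(cell_html)
  cell_html.gsub(/<[^>]+>/, '').gsub(/\s+/, ' ').strip
end

strip_html('<td><a href="/wiki/Ridley_Scott" title="Ridley Scott">Ridley Scott</a></td>')
# => "Ridley Scott"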

Posted by evan on Saturday, May 12, 2007
