Searching MediaWikis with Ruby (or treating MediaWiki a little like a database)
Here’s a little treat that I threw together to help me search Wikipedia for film data. It’s pretty simple. Example usage below:
wikipedia = Wikipedia::Search.new "http://en.wikipedia.org"
result = wikipedia.search("Firefox")
if result.is_a? String
  # Then I've obtained the content for an actual page in a MediaWiki
else
  # I have an Array containing Hashes with metadata about the top 20 candidates
  # Now, for giggles, I'll get the content for the first hit
  html = Net::HTTP.get_response(URI.parse(result.first[:url])).body
end
The returned Array of Hashes, in the second case, has three keys:
:url - The complete URL to the page in the MediaWiki
:title - The Wikipedia page title
:weight - The percentile weight supplied by the MediaWiki search
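As a rough sketch of how those Hashes might be used, here is one way to pick the highest-weighted candidate. The sample data and the best_hit helper are made up for illustration and are not part of the class. It assumes :weight arrives as a String (the search code captures it with a regexp), so it gets converted to a Float before comparing.

```ruby
# Hypothetical sample data shaped like the Array of Hashes described above.
hits = [
  { :url => "http://en.wikipedia.org/wiki/Firefox_(film)",  :title => "Firefox (film)",  :weight => "87.5" },
  { :url => "http://en.wikipedia.org/wiki/Mozilla_Firefox", :title => "Mozilla Firefox", :weight => "100.0" },
  { :url => "http://en.wikipedia.org/wiki/Firefox_(novel)", :title => "Firefox (novel)", :weight => "62.1" }
]

# Hypothetical helper: returns the Hash with the largest numeric weight.
# :weight is assumed to be a String, so convert it before comparing.
def best_hit(hits)
  hits.max_by { |h| h[:weight].to_f }
end

best = best_hit(hits)
puts "#{best[:title]} (#{best[:weight]}%)"
```

Comparing the raw Strings would sort "100.0" below "87.5", which is why the conversion matters.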
Pretty simple? I thought so.
Code follows:
require 'net/http'
require 'uri'

module Wikipedia
  class Search
    SEARCH_PG_PTN = 'wgPageName = "Special:Search";'
    SEARCH_ITEM_START_PTN = '<li style="padding-bottom: 1em;">'
    SEARCH_ITEM_END_PTN = '</li>'
    ITEM_HIT_START = '<a href='

    # <i>wiki_url</i>: The URL of the wiki to search
    def initialize(wiki_url)
      wiki_url.chop! if wiki_url =~ /\/$/
      @wiki_url = wiki_url
      @search_url = wiki_url + "/wiki/Special:Search"
    end

    # borrowed from Net::HTTP
    def fetch(uri_str, limit = 10)
      raise ArgumentError, 'HTTP redirect too deep' if limit == 0
      response = Net::HTTP.get_response(URI.parse(uri_str))
      case response
      when Net::HTTPSuccess then response
      when Net::HTTPRedirection then fetch(response['location'], limit - 1)
      else response.error!
      end
    end

    def search(query_str)
      response = Net::HTTP.post_form URI.parse(@search_url), :search => query_str, :go => "Go"
      if response.body =~ /#{SEARCH_PG_PTN}/
        multiple_search_results_in response
      elsif response.is_a? Net::HTTPRedirection
        fetch(response['location']).body
      else
        response.body
      end
    end

    private

    def multiple_search_results_in(response)
      raw = response.body.split(/#{SEARCH_ITEM_START_PTN}|#{SEARCH_ITEM_END_PTN}/)
      hits = raw.select { |l| l =~ /^#{ITEM_HIT_START}/ }
      hits.collect do |h|
        h =~ /href="(.*)" title.*>(.*)<\/a>.*Relevance: (.*)%/
        { :url => @wiki_url + $1, :title => $2, :weight => $3 }
      end
    end
  end
end
Update: It still isn’t 100% functional. After a whole weekend of beating my head against MediaWiki, I found that users enter metadata into MediaWiki in several different permutations. Fetching the movie poster is simple enough. However, the writer, director, and starring sections are quite a bit more complex. I’m fairly close, though. The problem gets easier if I simplify the regexps to yank out the contents of entire table cells, strip the HTML, and otherwise leave the cell contents intact without making assumptions about the content (and further borking the formatting).
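A minimal sketch of that cell-yanking idea might look like the following. The cell_texts helper and the sample markup are hypothetical, not the code I actually ended up with, and a regexp like this assumes reasonably tidy markup with no nested tables.

```ruby
# Hypothetical helper: grab the contents of every <td>/<th> cell, strip the
# HTML tags, and collapse whitespace, leaving the cell text otherwise intact.
def cell_texts(html)
  html.scan(/<t[dh]\b[^>]*>(.*?)<\/t[dh]>/m).flatten.map do |cell|
    cell.gsub(/<[^>]+>/, " ").gsub(/\s+/, " ").strip
  end
end

# Made-up sample markup, loosely shaped like a film infobox.
sample = <<-HTML
<table class="infobox">
<tr><th>Directed by</th><td><a href="/wiki/Clint_Eastwood">Clint Eastwood</a></td></tr>
<tr><th>Starring</th><td><a href="/wiki/Clint_Eastwood">Clint Eastwood</a><br />
<a href="/wiki/Freddie_Jones">Freddie Jones</a></td></tr>
</table>
HTML

cell_texts(sample).each { |text| puts text }
```

Tags are replaced with a space rather than an empty string so that names separated only by a line break tag don't get glued together.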
Posted by evan on Saturday, May 12, 2007
My name is Evan Light and, yes, I am a nerd. I'm also a professional software developer who, after spending one too many years contracting to the federal government, escaped into the far more enjoyable commercial world. Having spent several years using C and even more using Java (the latter very nearly caused me to give up programming entirely), I consider myself fortunate to have discovered Ruby and to use it as part of my daily work.