Here’s a little treat that I threw together to help me search Wikipedia for film data. It’s pretty simple. Example usage below:
wikipedia = Wikipedia::Search.new "http://en.wikipedia.org"
result = wikipedia.search("Firefox")
if result.is_a? String
  # Then I've obtained the content for an actual page in a MediaWiki
else
  # I have an Array containing Hashes with metadata about the top 20 candidates
  # Now, for giggles, I'll get the content for the first hit
  html = Net::HTTP.get_response(URI.parse(result.first[:url])).body
end
The returned Array of Hashes, in the second case, has three keys:
:url - The complete URL to the page in the MediaWiki
:title - The Wikipedia page title
:weight - The relevance percentage supplied by the MediaWiki search (as a String)
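As a rough sketch of how the Array of Hashes might be used, the snippet below ranks the hits by :weight and prefers a title that looks like a film article. The sample hits are made up for illustration (they are not real search output), and the "(film)" title check is only an assumption about how film pages tend to be named.

```ruby
# Hypothetical sample of what search() might return for an ambiguous query.
# The URLs, titles, and weights are illustrative, not captured from a wiki.
hits = [
  { :url => "http://en.wikipedia.org/wiki/Firefox_(film)",  :title => "Firefox (film)",  :weight => "87.5" },
  { :url => "http://en.wikipedia.org/wiki/Firefox",         :title => "Firefox",         :weight => "100.0" },
  { :url => "http://en.wikipedia.org/wiki/Firefox_(novel)", :title => "Firefox (novel)", :weight => "62.1" }
]

# :weight is captured from the page text, so it is a String;
# convert it to a Float before comparing.
ranked = hits.sort_by { |hit| -hit[:weight].to_f }

# Prefer a hit whose title looks like a film article (an assumption);
# otherwise fall back to the highest-weighted hit.
best = ranked.find { |hit| hit[:title] =~ /\(film\)/ } || ranked.first

puts "#{best[:title]} (#{best[:weight]}%) -> #{best[:url]}"
```

From there, best[:url] can be fetched the same way as result.first[:url] in the usage example.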
Pretty simple? I thought so.
Code follows:
require 'net/http'
require 'uri'

module Wikipedia
  class Search
    SEARCH_PG_PTN = 'wgPageName = "Special:Search";'
    SEARCH_ITEM_START_PTN = '<li style="padding-bottom: 1em;">'
    SEARCH_ITEM_END_PTN = '</li>'
    ITEM_HIT_START = '<a href='

    # <i>wiki_url</i>: The URL of the wiki to search
    def initialize(wiki_url)
      # Drop a trailing slash without modifying the caller's string
      @wiki_url = wiki_url.sub(/\/$/, '')
      @search_url = @wiki_url + "/wiki/Special:Search"
    end

    # Follows redirects up to <i>limit</i> times; borrowed from Net::HTTP
    def fetch(uri_str, limit = 10)
      raise ArgumentError, 'HTTP redirect too deep' if limit == 0
      response = Net::HTTP.get_response(URI.parse(uri_str))
      case response
      when Net::HTTPSuccess then response
      when Net::HTTPRedirection then fetch(response['location'], limit - 1)
      else response.error!
      end
    end

    # Returns the page body (a String) on a direct hit, or an Array of
    # Hashes when the wiki answers with a list of search results
    def search(query_str)
      response = Net::HTTP.post_form URI.parse(@search_url), :search => query_str, :go => "Go"
      if response.body =~ /#{SEARCH_PG_PTN}/
        multiple_search_results_in response
      elsif response.is_a? Net::HTTPRedirection
        fetch(response['location']).body
      else
        response.body
      end
    end

    private

    # Splits the results page into list items and pulls the URL, title,
    # and relevance out of each item that starts with a link
    def multiple_search_results_in(response)
      raw = response.body.split(/#{SEARCH_ITEM_START_PTN}|#{SEARCH_ITEM_END_PTN}/)
      hits = raw.select { |l| l =~ /^#{ITEM_HIT_START}/ }
      hits.collect do |h|
        h =~ /href="(.*)" title.*>(.*)<\/a>.*Relevance: (.*)%/
        { :url => @wiki_url + $1, :title => $2, :weight => $3 }
      end
    end
  end
end
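To poke at the hit-parsing step without touching the network, something like the following sketch may help. The sample line is hand-written to fit the shape the regexp expects (it is not captured from a live MediaWiki), parse_hit is a hypothetical helper name, and the lazy quantifiers and the nil guard are substitutions of mine rather than what the class above does.

```ruby
# Hypothetical helper: parse one search-result line into a Hash,
# or return nil when the line does not look like a hit.
def parse_hit(line, wiki_url)
  # Same idea as the regexp in multiple_search_results_in, but with lazy
  # quantifiers so each capture stops at the first possible delimiter.
  match = /href="(.*?)" title.*?>(.*?)<\/a>.*Relevance: (.*?)%/.match(line)
  return nil unless match
  { :url => wiki_url + match[1], :title => match[2], :weight => match[3] }
end

# Hand-written line shaped to fit the pattern; not real search output.
sample = '<a href="/wiki/Firefox_(film)" title="Firefox (film)">Firefox (film)</a> Relevance: 87.5% - 12 KB'

hit = parse_hit(sample, "http://en.wikipedia.org")
puts hit.inspect
```

Returning nil for a line that does not match avoids the nil-concatenation error that $1 would otherwise cause when a list item has an unexpected layout.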
Update: And it still isn’t 100% functional yet. After beating my head against MediaWiki for a whole weekend, it turns out that users enter metadata into MediaWiki in several different permutations. Fetching the movie poster is simple enough. However, the writer, director, and starring sections are quite a bit more complex. I’m fairly close, though. The problem becomes easier to solve if I simplify the regexps to yank out the contents of entire table cells, strip the HTML, and otherwise leave the cell contents intact, without making assumptions about the content (and further borking the formatting).
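The cell-level approach could be sketched roughly as follows. The HTML fragment is a hand-written stand-in for a film infobox (real pages vary a lot), strip_html and infobox_cells are hypothetical helper names, and turning line breaks into commas is my own choice for keeping multiple names readable.

```ruby
# Hand-written fragment resembling a film infobox; not copied from a real page.
html = <<-HTML
<table class="infobox">
<tr><th>Directed by</th><td><a href="/wiki/Clint_Eastwood">Clint Eastwood</a></td></tr>
<tr><th>Starring</th><td><a href="/wiki/Clint_Eastwood">Clint Eastwood</a><br />
<a href="/wiki/Freddie_Jones">Freddie Jones</a></td></tr>
</table>
HTML

# Hypothetical helper: turn <br> into a comma, drop every other tag,
# and collapse runs of whitespace.
def strip_html(fragment)
  fragment.gsub(/<br\s*\/?>/i, ", ").gsub(/<[^>]+>/, "").gsub(/\s+/, " ").strip
end

# Hypothetical helper: pair each header cell with the whole data cell that
# follows it, making no assumptions about what the cell contains.
def infobox_cells(html)
  cells = {}
  html.scan(/<th[^>]*>(.*?)<\/th>\s*<td[^>]*>(.*?)<\/td>/m) do |label, value|
    cells[strip_html(label)] = strip_html(value)
  end
  cells
end

info = infobox_cells(html)
puts info["Directed by"]
puts info["Starring"]
```

Splitting a cell such as "Starring" into individual names is deliberately left out, since that is where the formatting permutations start to bite.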



2 comments
Hey, I’m very interested in a Wikipedia movie parser. Did you finish your project? I just started to learn Ruby and Rails, and I guess if I find nothing else I will take the code above and try to complete it.
Huh. You know what? I did get it working — but oddly I didn’t publish it to rubyforge at the time.
Lemme go dig it up and plop it on github. IIRC, it wasn’t perfect but it was adequate for my needs to dig up a plot summary and movie poster.
Actor, director, and other information, it turned out, was inconsistently formatted between movies. I could extract the information but parsing out individual names of people was annoying due to the # of permutations (not saying it was impossible — but I wasn’t feeling as creative at the time).
The film posters, however, were predictable enough. As was capturing the first textual paragraph which, typically, is a plot summary.
I’ll slap it on the TODO list. Might even have a chance to look at it tonight from the hotel room.