February 2008 Archives

Rinda or "Hardware is Cheap So Let's Use More of it!"

Recently, I was writing some simple statistical calculation software that ran tests on small data sets using a floating sample “window” through the data. Basically, this becomes an O(n^2) pass of Mann-Whitney U tests. It was fast enough for small data sets, that is, until I was asked to scale the data size by about an order of magnitude.

What to do? We had plenty of hardware and I was developing on a quad-core Xeon (!!!), so why not throw more hardware at it? I was only using one core. So how do I get to the other three?

Enter Rinda – a Ruby implementation of Linda.

Rinda allows multiple Ruby processes to easily share data. This is accomplished by having a central Ruby VM, known as the Ring server, act as a centralized object repository that other Ruby VMs connect to as clients. Clients locate the Ring server with a UDP broadcast and then communicate with each other over DRb by:

  • Writing tuples to the Ring
  • Reading/Taking tuples from the Ring through a simple tuple-based query system (FYI: for anyone who already knows Erlang, this should look familiar). The key difference between a read and a take, as you may imagine, is that a read does only that whereas a take combines a read and a delete as a single operation.
This provides CRUD although without an explicit Update; Update = “take” + update the taken object + “write”.
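
Here is a minimal sketch of that write/read/take/"update" cycle. To keep it runnable without a Ring server, it uses a plain in-process Rinda::TupleSpace; the :counter tuple and its fields are made up purely for illustration.

```ruby
require 'rinda/tuplespace'

# An in-process TupleSpace; a real client would fetch one from the Ring.
ts = Rinda::TupleSpace.new

# Create: write a tuple (hypothetical layout: tag, name, value).
ts.write([:counter, "hits", 1])

# Read: returns a matching tuple and leaves it in the space.
peek = ts.read([:counter, "hits", nil])

# Update = take (read + delete), change the value, write it back.
_, name, value = ts.take([:counter, "hits", nil])
ts.write([:counter, name, value + 1])

updated   = ts.read([:counter, "hits", nil])
remaining = ts.read_all([:counter, nil, nil]).size
```

Because take removes the tuple, no other client can grab the same counter while it is being updated, which is what makes this a workable stand-in for an explicit Update.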

You may be asking, in the context of Ruby, WTF is a tuple? It’s just an Array of arbitrary objects. Heady stuff, right? Again, this ought to look familiar if you know LISP/Erlang/probably-insert-your-functional-language-of-choice.

Let’s look at some simple code examples:

Setting up the Ring server is a piece of cake. Eric Hodel’s write-up from a few years back served as my guide. The code below is lifted directly from his page and is adequate for simple parallel processing, but bear in mind that it does nothing to clean up “stale” objects: objects that are put into the Ring but never removed.

  #!/usr/bin/env ruby -w
  # ringserver.rb
  # Rinda RingServer

  require 'rinda/ring'
  require 'rinda/tuplespace'

  # start DRb
  DRb.start_service

  # Create a TupleSpace to hold named services, and start running
  Rinda::RingServer.new Rinda::TupleSpace.new

  # Wait until the user explicitly kills the server.
  DRb.thread.join
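
On the stale-object point: one way to limit the damage, sketched below, is to have clients give tuples a lifetime when they write them. TupleSpace#write takes an optional second argument, a number of seconds after which the tuple expires. This is not part of Eric's server code; the :job tuples are invented for the example, and an in-process TupleSpace stands in for the Ring.

```ruby
require 'rinda/tuplespace'

ts = Rinda::TupleSpace.new

# A lifetime of 0 seconds means the tuple is already expired, so it is
# never visible to readers. Used here only to demonstrate expiry.
ts.write([:job, 1, "already stale"], 0)

# This tuple stays readable for an hour unless somebody takes it first.
ts.write([:job, 2, "fresh"], 3600)

live_jobs = ts.read_all([:job, nil, nil])
```

Only the second tuple should show up in live_jobs.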

Obviously, clients need to be able to connect to the Rinda Ring. This is handily accomplished through the Rinda::RingFinger.

  # pretend this is in "myrinda.rb"
  require 'rinda/ring'

  def tuplespace
    DRb.start_service

    # Fetch the first TupleSpace
    Rinda::RingFinger.primary 
  end

  ts = tuplespace

And there you have it. You’ve connected to your Ring, if it’s up, or you’ve received an Exception because it isn’t ;-). You may have noticed that the comment indicates that we’re fetching a TupleSpace from the RingServer. Evidently, Rinda Rings can contain multiple TupleSpaces. You may wish to use specific TupleSpaces for “development” and “production” environments, but they are still going to use the same Rinda Ring.

Now let’s have a client connect and write some data.

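  # Ruby 1.9+ no longer searches the current directory, so load myrinda.rb
  # relative to this script.
  require_relative 'myrinda'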

  ts = tuplespace  
  ts.write([:blog, "Shedding Light on Ruby", "http://evan.tiggerpalace.com"])
  ts.write([:blog, "Northern Virginia Ruby User's Group", "http://novarug.org"])

This writes the tuples [:blog, “Shedding Light on Ruby”, “http://evan.tiggerpalace.com”] and [:blog, “Northern Virginia Ruby User’s Group”, “http://novarug.org”] to the Ring. See the commonalities between the two tuples? There’s a reason for that. Bear with me.

But, Evan, how do we get this data out of the Ring? This is where I think it gets cool. Let’s pretend that the below code example is another client, connecting to the ring, and trying to interrogate the Ring for information about blogs.

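  # Ruby 1.9+ no longer searches the current directory, so load myrinda.rb
  # relative to this script.
  require_relative 'myrinda'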

  ts = tuplespace
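  # read_all returns every matching tuple and leaves them in the Ring;
  # take would return (and remove) only a single match.
  blogs = ts.read_all([:blog, nil, nil])
  blogs.each do |b|
    puts "#{b[1]}: #{b[2]}"
  end

The above code should print the following to the console:

  Shedding Light on Ruby: http://evan.tiggerpalace.com
  Northern Virginia Ruby User's Group: http://novarug.org

We’re telling the TupleSpace that we’re looking for Arrays/tuples of 3-arity that begin with the symbol :blog. Because we call read_all, we get an Array of every match back, and the tuples stay in the Ring. TupleSpace#read and TupleSpace#take, by contrast, each return a single matching tuple (take also removes it), so you would call them in a loop to drain the Ring. The Array passed to read_all/read/take is a template used to perform a pattern match (à la Erlang) against the contents of the TupleSpace. The nil entries in the Array tell the Ring server that we’re interested in the existence of these fields but not their values.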
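
Templates can be pickier than nil wildcards. As I understand it, Rinda compares each template slot with == and ===, so a Class or a Regexp works as a matcher too. The sketch below shows a few variations against an in-process TupleSpace (no Ring server needed); the :feed tuple is made up just to have a non-match.

```ruby
require 'rinda/tuplespace'

ts = Rinda::TupleSpace.new
ts.write([:blog, "Shedding Light on Ruby", "http://evan.tiggerpalace.com"])
ts.write([:blog, "Northern Virginia Ruby User's Group", "http://novarug.org"])
ts.write([:feed, "Some feed"]) # hypothetical tuple with a different shape

# nil matches anything, but the arity (3) must still agree.
all_blogs = ts.read_all([:blog, nil, nil])

# A Class matches any instance of it (String === "...").
typed = ts.read_all([:blog, String, String])

# A Regexp matches Strings it would match with =~.
org_sites = ts.read_all([:blog, nil, /\.org\z/])

# Wrong arity: a 2-element template never matches a 3-element tuple.
wrong_arity = ts.read_all([:blog, nil])
```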

Finally, the TupleSpace#take method also lets you pass in a timeout. However, by default, TupleSpace#take will cause the caller to block until data arrives in the Ring that matches the take’s query.
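
A small sketch of the timeout in action, again against an in-process TupleSpace. When the timeout runs out, take raises Rinda::RequestExpiredError; take_or_nil is a hypothetical helper of my own, not part of Rinda. I use a timeout of 0 (don't wait at all) because, as far as I can tell, non-zero timeouts are only checked periodically by the TupleSpace's housekeeping thread, so don't count on sub-second precision.

```ruby
require 'rinda/tuplespace'

# Hypothetical helper: take a tuple, or return nil if the timeout expires.
def take_or_nil(space, template, timeout)
  space.take(template, timeout)
rescue Rinda::RequestExpiredError
  nil
end

ts = Rinda::TupleSpace.new

# Nothing matches yet, so a zero timeout gives up immediately.
missing = take_or_nil(ts, [:blog, nil, nil], 0)

ts.write([:blog, "Shedding Light on Ruby", "http://evan.tiggerpalace.com"])

# Now there is a match, so take returns it (and removes it).
found = take_or_nil(ts, [:blog, nil, nil], 0)
```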

If you want to distribute some heavy number (or other) crunching, and your input data can be processed in independent chunks, then you can parallelize the processing. Just write a “worker” that consumes tuples of one type from the Ring, chews on the data, and spits the results out in another tuple on the Ring (remember that the “type” of a tuple is just its format: its arity and payload; symbols help here). The VM that was doing the heavy lifting before now partitions the data and writes it to the Ring in tuples recognizable to the “workers”, and the workers do their business. Your writing VM then interrogates the Ring for completed tuples. Once the number of output tuples equals the number of input tuples, the process is complete. You take your results, reassemble them as necessary, and off you go.
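
The shape of that master/worker arrangement can be sketched in a single file. This is not my actual statistics code: the "work" is just summing a chunk, the :job/:result/:stop tuple layouts are invented for the example, and the workers are threads sharing an in-process TupleSpace. In a real setup each worker would be a separate Ruby process (one per core) that gets its TupleSpace from Rinda::RingFinger.primary.

```ruby
require 'rinda/tuplespace'

ts = Rinda::TupleSpace.new
worker_count = 3

# Workers: take a job, chew on it, write a result. A :stop id ends the loop.
workers = Array.new(worker_count) do
  Thread.new do
    loop do
      _, id, chunk = ts.take([:job, nil, nil])
      break if id == :stop
      ts.write([:result, id, chunk.sum])
    end
  end
end

# Master: partition the data and write one job tuple per chunk.
data   = (1..12).to_a
chunks = data.each_slice(4).to_a
chunks.each_with_index { |chunk, id| ts.write([:job, id, chunk]) }

# Master: block until there is one result per job.
results = chunks.map { ts.take([:result, nil, nil]) }

# Tell the workers to shut down, then wait for them.
worker_count.times { ts.write([:job, :stop, nil]) }
workers.each(&:join)

# Reassemble in job order; results may arrive in any order.
sums  = results.sort_by { |_, id, _| id }.map { |_, _, sum| sum }
total = sums.sum
```

Tagging every job with an id is what lets the master put the pieces back in order no matter which worker finishes first.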

And that pretty much sums it up. Sure, Rinda has an Achilles’ heel: you can only have one Ring server on a LAN, so it’s a SPOF. But for not ridiculously heavy lifting, Rinda is useful enough. Speaking from my experience, I’ve had as many as eight cores on a single network banging on statistical calculations.

In summary:

  • Pros: Simple parallelizable programming model for Ruby. Adequate for non-critical tasks.
  • Cons: Does not scale well due to Ring SPOF

Posted by evan on Feb 25, 2008

MediaWiki Film lookup gem

After far too much goofing off, I’ve finally gotten off of my tuchus (metaphorically only ;) ) in order to write some code. In a few short hours of work, I’ve almost finished a first pass at my Wikipedia film gem. Its sole purpose is to help me automate the download of movie synopses and posters for movies stored on iTunes for display on our AppleTV.

I’d left this project fallow for at least a half a year now. It was amusing to return to it later, with far better Ruby chops, and get it working. Now that the unit tests pass and the movie lookup driver seems to handle the majority of the bizarre errors that can occur as a result of the imperfect taxonomy used by Wikipedia, I’ll probably post a link in the next few days.

Posted by evan on Feb 09, 2008