Skip to content

Power of two random choices

Introducing this pragmatic load balancing algorithm and describing a real world use case

Published at: 2024-02-29

The power of two random choices

The paper presents a simple idea:

  • Random load balancing can lead to bad maximum load: the worst case of an overloaded node
  • Making two random choices lowers the maximum load significantly
  • Making n >= 2 random choices continues lowering the maximum load

It's basically a very easy way to avoid the pitfall of one-choice random load balancing and have a good load balancing result with low effort.

My issue: hot ssh connections

I wrote a Go service that needed to copy files using rsync from a destination to multiple NFS sources when notified via a message queue. To invoke rsync, I simply used os.Exec("rsync", args...).

In the first iteration of the code, I let rsync establish its own ssh connection. What we learned is that when receiving ~X000 copy requests through the queue at the same time, my service would launch X000 rsync processes to the same destination; sshd on the NFS destinations would reject many of these connections.

My first fix was to introduce a pool of persistent ssh connections - launched using os.Exec("ssh", args...) and using standard Unix ssh control paths. The SSH connections were represented in structs like so:

struct SshConn {
    controlPath string
}

I would create these (in a staggered manner) at app startup to start with a stable pool of politely-established connections on which to multiplex different rsync processes to copy files.

To associate an rsync process with an ssh connection, what did I do? You guessed it: randomly pick an ssh connection to use for an rsync process. Basically, I just did sshConn := sshPool.chooseRandom() and used it in the rsync process like rsync -e "ssh -o 'ControlPath=.... This resulted in the problem that in some cases, files would get super stuck. The rsync processes would crawl to a complete halt. Why? Hot connections! The random load balancing was leading to certain ssh connections getting overloaded.

The fix

I was hanging around with paper readers at the time and the power of two choices paper was making the rounds. It was almost poetic how it lined up, but I was able to do the following:

struct SshConn {
    controlPath string
    load        int    // i forget how i made this goroutine-safe... sync.mutex?
}

Actually, now that I tell the story, it wasn't true two-random-choices, but an informed random choice:

  • Pick two random choices
  • Select the one with the lower load, i.e. in use by fewer rsync processes
  • Increment sshConn.load (in a goroutine-safe manner) from the goroutine kicked off per rsync process

Conclusion

This was a pretty short story and vague (it's from 2017 and I've forgotten and omitted many details). But the idea is, think about how a paper may help you in an actual task in front of your face. That's the best (or only, in my true opinion) way to learn something: using it.

Also the inverse: look at your day-to-day tasks with an optimistic eye that despite their predictable surface (copying files from one place to another), they have some details that can be solved using a more interesting toolbox. Hone that toolbox.

Comments