Power of two random choices
Introducing this pragmatic load balancing algorithm and describing a real-world use case.
The power of two random choices
The paper presents a simple idea:
- Purely random load balancing can lead to a bad maximum load: one unlucky node ends up far more loaded than the rest
- Making two random choices and picking the less loaded node lowers the maximum load dramatically
- Making n >= 2 random choices keeps lowering the maximum load, though the biggest win is going from one choice to two
It's basically a very easy way to avoid the pitfall of one-choice random load balancing and get a good load balancing result with little effort.
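To make the idea concrete, here's a minimal sketch of two-choice selection over a slice of per-node load counters. This is my illustration, not code from the paper:

package main

import (
    "fmt"
    "math/rand"
)

// pickTwoChoices returns the index of the less loaded of two randomly
// chosen nodes. One-choice random balancing would just return
// rand.Intn(len(loads)).
func pickTwoChoices(loads []int) int {
    a := rand.Intn(len(loads))
    b := rand.Intn(len(loads))
    if loads[b] < loads[a] {
        return b
    }
    return a
}

func main() {
    loads := make([]int, 10)
    for i := 0; i < 1000; i++ {
        loads[pickTwoChoices(loads)]++
    }
    fmt.Println(loads) // the counts stay much closer to even than with one choice
}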
My issue: hot ssh connections
I wrote a Go service that needed to copy files with rsync from a source to multiple NFS destinations when notified via a message queue. To invoke rsync, I simply shelled out using Go's os/exec package: exec.Command("rsync", args...).
In the first iteration of the code, I let each rsync establish its own ssh connection. What I learned was that when ~X000 copy requests arrived through the queue at the same time, my service would launch X000 rsync processes to the same destination, and sshd on the NFS destinations would reject many of those connections.
My first fix was to introduce a pool of persistent ssh connections - launched with exec.Command("ssh", args...) and multiplexed using standard Unix ssh control paths. The ssh connections were represented in structs like so:
type SshConn struct {
    controlPath string
}
I would create these (in a staggered manner) at app startup, so the service began with a stable pool of politely established connections on which to multiplex the rsync processes copying files.
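For context, establishing one of these control-master connections looked roughly like the sketch below; the function name and exact flag set are reconstructions rather than the original code:

package sshpool

import "os/exec"

// startMaster launches a persistent ssh master process whose control
// socket lives at controlPath, so later rsync invocations can multiplex
// over it instead of opening fresh connections.
func startMaster(host, controlPath string) (*exec.Cmd, error) {
    cmd := exec.Command("ssh",
        "-M", "-N", // master mode, no remote command
        "-o", "ControlPath="+controlPath,
        "-o", "ControlPersist=yes",
        host)
    if err := cmd.Start(); err != nil {
        return nil, err
    }
    return cmd, nil
}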
To associate an rsync process with an ssh connection, what did I do? You guessed it: randomly pick an ssh connection for each rsync process. Basically, I just did sshConn := sshPool.chooseRandom() and used it in the rsync invocation like rsync -e "ssh -o 'ControlPath=...'". This led to a problem: in some cases, files would get super stuck. The rsync processes would slow to a crawl or stall completely. Why? Hot connections! The random load balancing was letting certain ssh connections get overloaded.
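Put together, the one-choice version looked something like this (the pool type and method names are reconstructed from memory, not the exact code):

package sshpool

import (
    "math/rand"
    "os/exec"
)

type SshConn struct {
    controlPath string
}

type SshPool struct {
    conns []*SshConn
}

// chooseRandom is the one-choice selection that led to hot connections.
func (p *SshPool) chooseRandom() *SshConn {
    return p.conns[rand.Intn(len(p.conns))]
}

// copyFile handles one copy request: it runs rsync over a randomly
// chosen connection's control path.
func (p *SshPool) copyFile(src, dst string) error {
    conn := p.chooseRandom()
    cmd := exec.Command("rsync",
        "-e", "ssh -o ControlPath="+conn.controlPath,
        src, dst)
    return cmd.Run()
}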
The fix
I was hanging around with paper readers at the time and the power of two choices paper was making the rounds. It was almost poetic how it lined up, and I was able to do the following:
type SshConn struct {
    controlPath string
    load        int // i forget how i made this goroutine-safe... sync.Mutex?
}
Actually, now that I tell the story, it wasn't true two-random-choices, but an informed random choice (see the sketch after this list):
- Pick two ssh connections at random
- Select the one with the lower load, i.e. the one in use by fewer rsync processes
- Increment sshConn.load (in a goroutine-safe manner) from the goroutine kicked off per rsync process
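In code, the informed choice looked roughly like this. Again, a reconstruction: sync/atomic is one way to keep the load counter goroutine-safe, and a sync.Mutex would work just as well:

package sshpool

import (
    "math/rand"
    "os/exec"
    "sync/atomic"
)

type SshConn struct {
    controlPath string
    load        int64 // rsync processes currently multiplexed on this connection
}

type SshPool struct {
    conns []*SshConn
}

// chooseTwo picks two connections at random and returns the one
// currently in use by fewer rsync processes.
func (p *SshPool) chooseTwo() *SshConn {
    a := p.conns[rand.Intn(len(p.conns))]
    b := p.conns[rand.Intn(len(p.conns))]
    if atomic.LoadInt64(&b.load) < atomic.LoadInt64(&a.load) {
        return b
    }
    return a
}

// copyFile is kicked off in its own goroutine per copy request; it bumps
// the chosen connection's load counter for the lifetime of the rsync process.
func (p *SshPool) copyFile(src, dst string) error {
    conn := p.chooseTwo()
    atomic.AddInt64(&conn.load, 1)
    defer atomic.AddInt64(&conn.load, -1)
    cmd := exec.Command("rsync",
        "-e", "ssh -o ControlPath="+conn.controlPath,
        src, dst)
    return cmd.Run()
}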
Conclusion
This was a pretty short and vague story (it's from 2017 and I've forgotten and omitted many details). But the idea is: think about how a paper might help you with an actual task in front of your face. That's the best (or, honestly, the only) way to learn something: using it.
Also the inverse: look at your day-to-day tasks with an optimistic eye. Despite their predictable surface (copying files from one place to another), they contain details that can be solved with a more interesting toolbox. Hone that toolbox.