We are currently migrating to a different Linux distribution at work and, as part of that, introducing Puppet to manage these machines (about 130 desktop PCs and 20 servers).
Recently we started noticing failed puppetd runs from time to time. The cause turned out to be that our puppet clients were running in batches, so every 30 minutes our puppetmaster suffered a huge load spike.
Having read R.I. Pienaar’s Scheduling Puppet with MCollective blog post, I figured I needed a similar solution. Installing MCollective crossed my mind for a moment, but it seemed overkill. Instead I came up with the following simple, pragmatic solution.
I restart each puppetd process at a host-specific time, thus spreading them reasonably well across puppet’s run interval.
The crucial component is my external node classifier, which I use to calculate the host-specific time. It’s a ruby script that knows about all our machines and can easily compute a node’s rank (in our case based on the hostname) modulo 30, puppet’s run interval. This rank determines when a daily cron job restarts puppetd.
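The rank calculation could be sketched roughly like this (the host list, interval constant and method name are my illustrative assumptions, not the actual classifier code):

```ruby
# Sketch of a rank calculation for an external node classifier.
# ALL_HOSTS stands in for whatever source the ENC uses to know
# about all machines; RUN_INTERVAL is puppet's run interval in minutes.
RUN_INTERVAL = 30
ALL_HOSTS = %w[a.foo.com b.foo.com c.foo.com].sort

# A node's rank is its position in the sorted host list,
# wrapped around modulo the run interval.
def rank(hostname)
  ALL_HOSTS.index(hostname) % RUN_INTERVAL
end

puts rank('b.foo.com')  # → 1
```

With more than 30 hosts, the modulo simply wraps around, so several hosts share a minute, which is still a far better spread than all of them starting at once.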
Let’s look at an example. Say you have hosts a.foo.com, b.foo.com and c.foo.com; their ranks would be 0, 1 and 2 respectively. On host a.foo.com, the cronjob would run at 11:00, on b.foo.com at 11:01 and on c.foo.com at 11:02. (I chose the window between 11:00 and 11:30, as I figured most people would be at work at that time and thus their desktop PCs switched on.)
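On b.foo.com (rank 1), the generated cron entry might look something like this; the exact file location and restart command are assumptions and will vary by distribution:

```
# /etc/cron.d/restart-puppetd — restart puppetd daily at 11:<rank>
1 11 * * * root /etc/init.d/puppet restart
```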
As you can see, the result is far from perfect, but still quite impressive:
I know, load graphs are mostly useless, but it’s all the data I have ATM. ;)
Where 4 cores couldn’t handle the load spikes before, 2 cores are now evenly used and I might even get by with just one.
I would very much like to see some kind of scheduling support in the puppetmaster, organizing clients for maximal spread.