Posted 13 July 2015
I recently had a bit of an Mcollective problem to look at. It turned out to be very simple to fix, but finding it was pretty gnarly.
For the uninitiated, Mcollective is an orchestration framework: a server daemon runs on each of your (Puppetised) machines and communicates over ActiveMQ (or other messaging middleware) with client tools. The idea is that you can use Mcollective to orchestrate Puppet runs, start and stop services, and potentially collect information from your nodes into some sort of consolidated view on your Puppet masters.
Mcollective seems to me to be both straightforward and esoteric. The broad shape is what you'd expect - it has server and client parts, and plenty of helpful API to abstract away the underlying messaging system, and even, to some degree, the Puppet infrastructure it's there to support. That's all pretty nice, but writing agents is relatively tricky - I haven't found anything resembling a testing setup that you could use for unit or integration testing. The best we seem to have is to deploy to a development network and hope for the best. That, and the scant comments in the code, sadly make this a pretty typical Puppet Labs product.
That said, our requirements aren't too taxing, so writing the agents isn't too much work. There's the usual head-bending to figure out which bits of your code go in the server and client parts of Mcollective, but that's to be expected. In most of our use cases, we're either collecting data from the clients and writing it into our Redis data stores, or else we're telling clients to do something.
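To make the server/client split concrete, here's a minimal sketch of the server-side half of an agent. Mcollective agents subclass `RPC::Agent` and define actions that read from `request` and write to `reply`; the tiny stand-in base class below is purely so the sketch runs without the mcollective gem installed, and the `Service` agent and its hard-coded status are illustrative, not a real deployed agent.

```ruby
module MCollective
  module RPC
    # Tiny stand-in for Mcollective's RPC::Agent base class, just enough
    # to exercise the action/request/reply shape; the real class does
    # far more (validation, DDL, middleware plumbing).
    class Agent
      def self.action(name, &block)
        (@actions ||= {})[name] = block
      end

      def self.run(name, request)
        agent = new
        agent.instance_variable_set(:@request, request)
        agent.instance_variable_set(:@reply, {})
        agent.instance_exec(&@actions.fetch(name))
        agent.instance_variable_get(:@reply)
      end

      private

      def request; @request; end
      def reply;   @reply;   end
    end
  end
end

# The server-side half of a hypothetical agent: an action answering
# status queries. A client would invoke this over the middleware.
module MCollective
  module Agent
    class Service < RPC::Agent
      action "status" do
        # A real agent would ask the init system; hard-coded here to
        # illustrate the request/reply shape.
        reply[:service] = request[:service]
        reply[:status]  = "running"
      end
    end
  end
end

result = MCollective::Agent::Service.run("status", service: "tomcat")
```

The client half lives in a separate file and calls the action by name over the middleware, which is exactly where the head-bending about what goes where comes in.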
For data collection, we mostly use an Mcollective Registration Agent, though some sort of command line inspection is also an option. Registration agents run every 5 minutes, which is fine for general infrastructure information but not much good for monitoring. We tend to use ours as a way to get what's in the Puppet manifests into Redis. That means we can inventory what's actually been deployed on the clients by asking our Puppet masters.
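The registration flow boils down to: build a payload of node identity plus the Puppet-derived facts and classes, and write it into Redis under a per-node key. A sketch of that, with an in-memory stand-in for the Redis connection so it's self-contained (a real agent would use the redis gem; the key scheme and payload fields here are assumptions, not our actual schema):

```ruby
require 'json'
require 'time'

# Stand-in for a Redis connection: a hash mimicking SET/GET so the
# sketch runs without a Redis server or the redis gem.
class FakeRedis
  def initialize; @store = {}; end
  def set(key, value); @store[key] = value; end
  def get(key); @store[key]; end
end

# Roughly what a registration agent pushes every five minutes: the node's
# identity plus whatever facts and classes Puppet applied to it.
def register(redis, identity, facts, classes)
  payload = {
    "identity"   => identity,
    "facts"      => facts,
    "classes"    => classes,
    "updated_at" => Time.now.utc.iso8601,
  }
  redis.set("mcollective:node:#{identity}", JSON.generate(payload))
end

redis = FakeRedis.new
register(redis, "web01.example.com",
         { "osfamily" => "RedHat" },
         ["tomcat", "nagios::client"])
stored = JSON.parse(redis.get("mcollective:node:web01.example.com"))
```

With the data keyed per node like this, an inventory query against the Puppet masters is just a scan over the `mcollective:node:*` keys.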
Apart from 'ping', I don't think we have any command line data collectors. About a year ago I did play around with using Mcollective to trigger and report on Nagios checks on the clients. This is quite an interesting area, although not one we use at my employer. The idea is to deploy (say) a Tomcat check on all your Tomcat boxes; Nagios (or whatever) then uses Mcollective to tell the clients to perform their Tomcat checks and return the results. What you end up with is an aggregate of the statuses of all your Tomcats, and if any one of them is in a 'Critical' state, you generate an alert. The cool thing about this is that a common problem (such as the database being down) causes just one alert, even though a few dozen Tomcats may be affected. It's an elegant solution to over-alerting, and avoids needing much alert suppression or rate limiting (which risks losing alerts).
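The aggregation itself is just worst-state-wins over the standard Nagios states. A sketch of that collapse step, assuming the clients have already returned per-host results (the `aggregate` function and its input shape are mine, not from any Mcollective plugin):

```ruby
# Standard Nagios plugin states, ranked by alerting severity so that
# CRITICAL outranks everything else.
SEVERITY = { "OK" => 0, "UNKNOWN" => 1, "WARNING" => 2, "CRITICAL" => 3 }

# Collapse many per-host check results into one aggregate state: dozens
# of CRITICAL Tomcats become a single CRITICAL result, i.e. one alert.
def aggregate(results)
  worst   = results.max_by { |r| SEVERITY.fetch(r[:state]) }
  failing = results.reject { |r| r[:state] == "OK" }.map { |r| r[:host] }
  { state: worst[:state], failing: failing }
end

checks = [
  { host: "tomcat01", state: "OK" },
  { host: "tomcat02", state: "CRITICAL" },
  { host: "tomcat03", state: "CRITICAL" },
]
summary = aggregate(checks)
```

Two broken Tomcats, one CRITICAL summary - and the failing host list is still there when you want to drill in.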
The other way we use Mcollective is to trigger things to happen on the clients. The primary use case is to trigger Puppet runs. One of my colleagues wrote an improved Puppet agent for Mcollective (which I believe is on Github), which starts the runs and properly checks their outcomes. The stock agent isn't so good at that - we found it would report success and failure incorrectly at times.
Whichever way it's being used, my particular problem this time around was that the Registration Agents weren't talking to Redis properly. It turned out that all I needed to do was update the Redis gem on the box, but finding that out took a bit of 'printf'-style debugging in the Mcollective code. It would be really handy to have a proper testing capability when working with Mcollective...