Posted 2 October 2015
I used Check MK quite a bit at my previous employer, and I used Nagios (from which it evolved) for several years before that. When the subject of monitoring came up at work, it seemed like an obvious choice. It's got the usual esoterics you'd expect (!), but for the most part it's pretty easy to get going.
I was just about to get GUI hacking and add all my client hosts to it when I thought there must be a better way to do this. I've seen Puppet use Exported Resources to automatically add hosts to Check MK when the agent is installed on a client machine. Doing this with Ansible is (probably) possible, but since I'll be installing the Check MK agents on every machine, for now I'm just using the list of hosts Ansible knows about as the source of truth.
Getting the agent installed on all the machines in the estate isn't too hard with Ansible: you install an RPM and do a bit of hosts.allow sort of work, and that's about it. The agent uses xinetd, which seems both good and terrible to me. I guess it's safer than needing some proprietary daemon that may or may not be secure, and as it's a read-only service it's probably a good enough solution.
As I say, Check MK can be a bit esoteric, and adding hosts dynamically to the server is no exception (once they're in, the Inventory feature is great for discovering what services to monitor though). To be fair to Check MK, it's a product in development flux - it used to be one thing, now it's another and soon it'll be something else again. As a result some of the documentation is a bit misleading. All that said, I'm not the first to do this and so there are some tools and scripts around to help. I've got to go through the "install, create, delete, uninstall" loop a few times to get things to work properly, which is a shame, but not unexpected.
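To give a flavour of the "Ansible's host list as the source of truth" idea, here is a minimal Python sketch, not the actual scripts I ended up with. It assumes a simple INI-style Ansible inventory and assumes the Check MK server picks up hosts from an all_hosts list in a .mk config fragment; the function names and the sample inventory are made up for illustration, and it doesn't expand host ranges like web[01:10].

```python
def hosts_from_inventory(text):
    """Return the sorted, de-duplicated host names from INI-style inventory text.

    Simplification: [group:vars] and [group:children] sections are skipped,
    and host range patterns are not expanded.
    """
    hosts = set()
    skip_section = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", ";")):
            continue  # blank line or comment
        if line.startswith("[") and line.endswith("]"):
            # Section header: only plain groups contain host lines.
            name = line[1:-1]
            skip_section = name.endswith((":vars", ":children"))
            continue
        if skip_section:
            continue
        # The host name is the first token; anything after it is host variables.
        hosts.add(line.split()[0])
    return sorted(hosts)


def render_all_hosts(hosts):
    """Render a config fragment that appends the hosts to all_hosts."""
    lines = [
        "# Generated from the Ansible inventory; do not edit by hand.",
        "all_hosts += [",
    ]
    lines += ['    "%s",' % host for host in hosts]
    lines.append("]")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical inventory, purely for demonstration.
    sample = """
[web]
web1.example.com ansible_host=10.0.0.1
web2.example.com

[db]
db1.example.com

[web:vars]
http_port=80
"""
    print(render_all_hosts(hosts_from_inventory(sample)))
```

The output would be dropped onto the monitoring server as a config fragment, after which you'd still need to run the inventory and reload the monitoring core before anything shows up. Where exactly the file lives depends on how Check MK was installed, so treat the destination as something to check against your own setup.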
Once all that's done, Check MK should be a good basis for our future operations monitoring. I'm hoping to soak up all the random emails and "oops, that's not right" moments and get them into a single screen that we can circulate as the place to see how the world is working.
Years ago I wrote an Expert System and some glue logic to grab a load of Nagios state and work out things like "Internet is up, Email is up", with 'drill-down' so people could see what wasn't working if we had a problem. It'd be nice to dust that off again - we'll see if there's a need/time...