Ansible Connected CheckMK

Posted 12 July 2016

CheckMK is one of many monitoring systems around (I've talked about it a bit before). Its roots are in Nagios, which was once pretty much the de facto solution we all used. CheckMK has a lot of its own eccentricities, and the UI is a long way from perfect. However, it's actually a pretty solid monitoring solution, and it scales pretty well when you want to run a hundred checks on dozens of servers. There's lots of room for improvement, but CheckMK does a lot of good things.

One problem with CheckMK and many other tools is that you've got to configure it. Entering every hostname and the checks it should run when a host is created, and deleting them again when it's removed, used to be an almost one-time job, but in the cloudy world we now work in, that's just not sustainable. CheckMK makes an attempt to solve this by using 'inventoried' checks. That is, it asks a client what checks it is reporting, and then uses that as its basic config for that host. If a new check gets deployed or enabled, then one of the built-in checks will tell you that something isn't being monitored yet (so you can re-inventory the host and thus add it to monitoring).

Inventories are both good and bad. They're good because all you have to do is throw a load of checks onto a box, run the inventory and then they're monitored. They're bad because some checks are written in such a way that they pretend they don't exist if the application they're checking isn't running. That means that if you happen to run an inventory when there are services stopped, then your monitoring gets screwed up. The obvious way to solve this is to only deploy checks onto boxes when you want them, and make sure they always return something (so presumably a Critical or Unknown if the application isn't running). Ansible (or other config management) can really help here - when you deploy an application, you also deploy a check for it (which is as simple as dropping a script into a specific directory). If all your applications are deployed by Ansible, then they're all guaranteed to be monitored too (after an inventory) - perfect!
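To make the "always return something" idea concrete, here is a minimal sketch of a local check that reports a state whether or not the application is up. The service name "myapp" and the use of pgrep are assumptions for illustration only; the output line follows the usual local check layout of status, service name, performance data (or "-") and free text.

```python
#!/usr/bin/env python3
"""Sketch of a CheckMK local check that never stays silent."""
import subprocess

SERVICE = "myapp"  # hypothetical application name, not from the original setup


def process_running(name):
    """Return True or False from pgrep, or None if pgrep could not answer."""
    try:
        result = subprocess.run(
            ["pgrep", "-x", name],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except OSError:
        return None  # pgrep is not installed or could not be started
    if result.returncode == 0:
        return True
    if result.returncode == 1:
        return False
    return None  # any other exit code means pgrep itself had a problem


def local_check_line(service, running):
    """Build one local check line: <status> <service> <perfdata> <text>."""
    if running is True:
        return f"0 {service} - OK - process is running"
    if running is False:
        return f"2 {service} - CRITICAL - process is not running"
    return f"3 {service} - UNKNOWN - could not determine process state"


if __name__ == "__main__":
    # Always print exactly one line, so an inventory sees the check
    # even when the application happens to be stopped.
    print(local_check_line(SERVICE, process_running(SERVICE)))
```

Because the script prints a line in every case, an inventory run while the application is stopped still picks the service up, just in a Critical state.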

After you've dealt with the details of inventories of individual hosts, you need to make sure that new hosts get monitored. Having a host up and running but unknown to monitoring is a perennial problem in IT. Again, Ansible can solve this problem by configuring the host into CheckMK for you. Ansible can do this relatively easily by iterating through groups['all'] and suchlike. There are two trip-hazards here though: the first is that CheckMK doesn't make this very easy to do, and the second is that you need to run Ansible against your monitoring host after the new host has been added. Depending on how you set things up, you may also need Ansible to do some Fact Gathering on the new host before you run it against your monitoring server.

In the setup I've implemented, we use Ansible's local facts mechanism (facts.d) to run a script which calls the CheckMK agent to get the output of all of its local checks (the ones we've deployed to the box). Calling the agent directly (rather than looking in the scripts directory) is actually easier, because it means we get multi-item checks included and we also get so-called 'cached checks' (which don't run as frequently as the others). The list of check names becomes an Ansible Fact, which will be collected next time we do some sort of Fact Gathering.
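A facts.d script along these lines might look like the sketch below. It is not the script from the setup described here: the fact key "local_checks", the assumption that the agent is on the PATH as check_mk_agent, and the handling of a "cached(...)" prefix on cached lines are all illustrative guesses about the agent output.

```python
#!/usr/bin/env python3
"""Sketch of an Ansible facts.d script listing CheckMK local check names."""
import json
import re
import subprocess

SECTION_HEADER = re.compile(r"^<<<(.*)>>>$")
# Assumed shape of the prefix the agent puts on cached local checks.
CACHED_PREFIX = re.compile(r"^cached\(\d+,\d+\)\s+")


def local_check_names(agent_output):
    """Return the sorted, de-duplicated service names in the local section."""
    names = set()
    in_local = False
    for raw_line in agent_output.splitlines():
        line = raw_line.strip()
        header = SECTION_HEADER.match(line)
        if header:
            # A section name may carry options after a colon, e.g. "local:sep(0)".
            in_local = header.group(1).split(":", 1)[0] == "local"
            continue
        if not in_local or not line:
            continue
        line = CACHED_PREFIX.sub("", line)
        fields = line.split(None, 2)
        if len(fields) >= 2:
            names.add(fields[1])  # field 0 is the status, field 1 the name
    return sorted(names)


def main():
    """Run the agent and print the check names as JSON for Ansible."""
    agent_output = subprocess.run(
        ["check_mk_agent"],  # assumed to be on the PATH
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    print(json.dumps({"local_checks": local_check_names(agent_output)}))
```

A deployed copy would call main() at the bottom, be marked executable and be saved with a .fact extension in the facts.d directory, at which point the list shows up under ansible_local on the next Fact Gathering run.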

Additionally, we run a Fact Gathering task every 15 minutes, which populates a Redis Fact Cache. This means that a new server will be queried soon after its config is added to Ansible, and we'll get a list of the CheckMK checks it's got installed on it. Any new checks deployed anywhere will also be known pretty soon after deployment.

Lastly, the Ansible setup for CheckMK. This is the behemoth of the workflow - the details of it are quite complex (CheckMK's rules.mk and groups.mk don't lend themselves well to templating in the traditional sense). There are also a couple of intermediate steps we have to do. The first is that we write out a file for each host we know about; in that file we write one CheckMK check name per line. This means that if the Facts change, so does the file. If the file changes, we know we have to do a CheckMK inventory on that host. If any inventories take place then we also have to restart CheckMK to make it pick up the changes. Finally, we have to delete the intermediate files if the host is removed from Ansible. All this means the role is relatively complex, although nothing it does is especially out-of-the-ordinary for Ansible. The key thing we do, though, is build the main.mk list of known hosts from the Ansible configs - that means it's impossible for a host to exist in Ansible's world without also coming under CheckMK's watchful eye.
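The real role does this with ordinary Ansible tasks, but the bookkeeping is easier to see in a few lines of Python. The sketch below is only an illustration of the logic: the function names, the ".checks" file suffix and the state directory are hypothetical, and the all_hosts rendering assumes the plain list form of that variable in main.mk.

```python
from pathlib import Path


def sync_check_files(state_dir, checks_by_host):
    """Write one file per host (one check name per line), prune stale files.

    Returns (changed, removed): hosts whose file was created or rewritten,
    and hosts whose file was deleted because they are no longer known.
    """
    state_dir = Path(state_dir)
    state_dir.mkdir(parents=True, exist_ok=True)

    changed = []
    for host, checks in sorted(checks_by_host.items()):
        content = "".join(f"{name}\n" for name in sorted(checks))
        path = state_dir / f"{host}.checks"
        # Only touch the file when the check list really differs, so an
        # unchanged host does not trigger a needless inventory.
        if not path.exists() or path.read_text() != content:
            path.write_text(content)
            changed.append(host)

    removed = []
    for path in sorted(state_dir.glob("*.checks")):
        host = path.stem  # file name without the trailing ".checks"
        if host not in checks_by_host:
            path.unlink()
            removed.append(host)

    return changed, removed


def render_all_hosts(hosts):
    """Render the known hosts as an all_hosts list for main.mk."""
    lines = ["all_hosts = ["]
    lines += [f'    "{host}",' for host in sorted(hosts)]
    lines.append("]")
    return "\n".join(lines) + "\n"
```

Every host in the first returned list needs a fresh inventory, and if that list is non-empty CheckMK needs a restart afterwards. Sorting the check names before writing keeps the files stable, so a mere reordering of the Facts does not count as a change.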

All this is relatively complex, and the learning curve for anyone new to it is not gentle. It took quite a few iterations to implement correctly, although now it's done it really has taken most of the hassle out of CheckMK configs. It's still possible to use CheckMK's UI to configure some things, although usually I find I can use the UI to work out the correct configs and then rebuild those configs in Ansible, so that whatever I did in the UI ends up under config control. Ultimately though, new hosts get added, old ones get removed and all the services on a box get checked, regardless of what they are or when they got deployed, and that means we've got good control over our environment.

Tags: #checkmk #ansible

Ralph Bolton
