Monitoring the Cave
At Data Cave, we have a lot of equipment in our building, from routers and switches to water chillers and pumps to electrical switchgear and diesel generators. There’s a lot of “stuff” happening, all the time. To help us keep tabs on all of this, we have set up several pieces of software.
Firstly, there’s Nagios. It monitors everything of significance in the building, and alerts us to any problems that may arise. At the time of this writing, it’s checking 240 services across 108 hosts. There are warnings if server room temps get out of range, checks to ensure surveillance cameras are online, and notifications when internet routing tables change, to name just a few examples. Nagios is free and open-source software, and it’s become a sort of industry standard. Yes, there are shortcomings, which many alternatives try to address. However, it can handle any monitoring-and-alerting scenario you can imagine, and it never crashes. Just in case it does, though, we have an instance of monit watching it, and an automatically synced failover Nagios instance, which has never engaged to date.
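To give a feel for what one of these checks looks like, here is a minimal sketch of a Nagios-style plugin for a server room temperature reading, written in Python. It is not one of our actual checks: the function name `check_temperature` and the thresholds of 27 and 32 degrees Celsius are made up for illustration. What it does rely on is the standard plugin convention, where the exit code carries the state (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) and the single line of output may carry performance data after a pipe character.

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_temperature(temp_c, warn_c=27.0, crit_c=32.0):
    """Return (exit_code, status_line) for a reading in degrees Celsius.

    The default thresholds are illustrative assumptions, not real settings.
    """
    if temp_c is None:
        return UNKNOWN, "TEMP UNKNOWN - no reading available"
    if temp_c >= crit_c:
        state, code = "CRITICAL", CRITICAL
    elif temp_c >= warn_c:
        state, code = "WARNING", WARNING
    else:
        state, code = "OK", OK
    # Text before the pipe is for humans; text after it is performance data
    # in the form label=value;warn;crit, which graphing add-ons can pick up.
    line = f"TEMP {state} - {temp_c:.1f}C | temp={temp_c:.1f};{warn_c};{crit_c}"
    return code, line


def main(argv):
    """Hypothetical entry point: take the reading from the first argument."""
    reading = float(argv[1]) if len(argv) > 1 else None
    code, line = check_temperature(reading)
    print(line)
    return code
```

A real plugin would finish with `sys.exit(main(sys.argv))` so that Nagios sees the exit code, and would read the temperature from a sensor rather than from the command line.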
There are many expensive closed-source alternatives to Nagios. They will try to tell you that they are easier to configure. I haven’t found that to be the case. You still must learn how to configure things their way, and that’s just a different learning curve. At a previous place of employment, I was told I had to scrap a working Nagios setup and replace it with a proprietary package. It had taken me 4 weeks to fully configure my instance of Nagios. Even with the head start of knowing everything I wanted to monitor, it still took me 3 weeks (and several support calls) to get the proprietary solution working. To add insult to injury, there were a couple of checks I was making with Nagios that were simply impossible with the commercial package. Of course, your mileage may vary.
Secondly, there’s Cacti. Again, it’s become a bit of a standard, but, again, there are alternatives designed to address its weaknesses. While Nagios is a monitoring and alerting tool, Cacti graphs the data and reveals trends. It was designed for tracking network bandwidth, but we’ve extended it to monitor power consumption as well. For graphing trends of less importance than bandwidth and power, we use pnp4nagios.
On top of Nagios and Cacti, we’ve created two Rails applications. The first is Haystack. It’s a dead-simple Simple Network Management Protocol (SNMP) polling application. Once a minute, it grabs all the SNMP data it can from the manufacturer’s SNMP counter tree on each piece of equipment, and simply stores it in a database. There’s a very simple interface for searching through and pulling out data of interest, but it’s primarily there in case something happens for which we need to review the data in detail.
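The poll-and-store idea can be sketched in a few lines. This is not Haystack’s code (Haystack is a Rails application); it is a Python illustration under stated assumptions. The table layout, the function names, and the `fetch_counters` callable are all hypothetical: `fetch_counters` stands in for whatever SNMP walk returns a mapping of OID to value for one host, and SQLite stands in for the database.

```python
import sqlite3
import time


def open_store(path=":memory:"):
    """Open the sample store (an assumed one-table schema) and return it."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS samples ("
        " host TEXT NOT NULL, oid TEXT NOT NULL,"
        " value TEXT NOT NULL, polled_at REAL NOT NULL)"
    )
    return conn


def poll_once(conn, hosts, fetch_counters, now=None):
    """Store every counter returned for every host; return rows written."""
    now = time.time() if now is None else now
    rows = []
    for host in hosts:
        try:
            counters = fetch_counters(host)  # hypothetical SNMP walk
        except OSError:
            continue  # unreachable host: skip this cycle, retry next time
        rows.extend((host, oid, str(value), now) for oid, value in counters.items())
    with conn:  # commit all rows from this cycle together
        conn.executemany("INSERT INTO samples VALUES (?, ?, ?, ?)", rows)
    return len(rows)


def find_samples(conn, host, oid_prefix):
    """Return (oid, value, polled_at) rows for one host under an OID prefix."""
    cur = conn.execute(
        "SELECT oid, value, polled_at FROM samples"
        " WHERE host = ? AND oid LIKE ? ORDER BY polled_at, oid",
        (host, oid_prefix + "%"),
    )
    return cur.fetchall()
```

A scheduler would call `poll_once` every 60 seconds. Storing values as text without interpreting them keeps the poller simple, and searching by the `1.3.6.1.4.1` prefix pulls out everything under a vendor’s enterprise subtree when something needs a closer look.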
The second custom package we’ve written is ComHub, which we’ve released as open-source software on GitHub. ComHub receives the messages from all of our systems, and then routes them to the appropriate people. Each person in ComHub can subscribe to various keywords, choose methods for receiving alerts, configure time windows for when to receive them, and set up filters for things they don’t care about. Rather than blindly sending SMS messages through the cellular carriers’ email gateways and hoping they arrive, the system uses Twilio to deliver text messages. With this service, we can not only get notification that the user’s cell phone actually received the alert, but also process responses to the alerts in the application. We use this primarily for acknowledging alerts in Nagios, or, optionally, putting the service or host into a planned downtime window.
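The routing rules described above (keywords, delivery methods, time windows, filters) can be illustrated with a small Python sketch. This is not ComHub’s implementation, which is a Rails application; the `Subscriber` class, its field names, and the `route` function are hypothetical, and the matching here is plain case-insensitive substring matching chosen for brevity.

```python
from dataclasses import dataclass, field
from datetime import time


@dataclass
class Subscriber:
    """One person's preferences (hypothetical model, lowercase keywords)."""
    name: str
    keywords: set                      # alert must contain one of these
    methods: list                      # e.g. ["sms", "email"]
    window_start: time = time(0, 0)    # default window covers the whole day
    window_end: time = time(23, 59, 59)
    ignore: set = field(default_factory=set)  # alerts containing these are dropped

    def in_window(self, at):
        if self.window_start <= self.window_end:
            return self.window_start <= at <= self.window_end
        # The window wraps past midnight, e.g. 22:00 to 06:00.
        return at >= self.window_start or at <= self.window_end

    def wants(self, message, at):
        text = message.lower()
        if any(word in text for word in self.ignore):
            return False
        return self.in_window(at) and any(word in text for word in self.keywords)


def route(message, subscribers, at):
    """Return (name, method) pairs for everyone who should get this message."""
    return [(s.name, m) for s in subscribers if s.wants(message, at) for m in s.methods]
```

For example, a subscriber with `keywords={"chiller"}`, `methods=["sms"]`, and a 22:00 to 06:00 window would be texted about a chiller alert at 23:30 but not at noon. Delivery itself, including the Twilio receipts and replies, would sit behind each method name.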
Running hand-in-hand with Nagios and Cacti is SNMPTT. It receives any and all SNMP traps that might get sent out from any piece of equipment. It filters some useless informational alerts for us, and then routes the messages on through ComHub. It can be integrated with Nagios, but we found it easier to make ComHub the integration point.
In the coming months, I’ll talk more about how we integrate our smart door controller system into this setup.