As a data center, Data Cave has a lot of equipment. (A lot of the interesting bits have been discussed on this blog.) All of it is constantly monitored and measured by a number of software programs, including open source favorites like Nagios, Cacti, and SNMPTT. If a problem is encountered, alerts are raised to notify us of the issue. We have built devices that monitor “dumb” equipment and integrated them into our systems, and we also have “smart” equipment that can alert us to problems on its own. Some subsystems have their own alerting mechanisms entirely, which we have made cooperate with ours.
Typically, these kinds of systems send their notifications via email. It’s the common denominator. However, almost everyone carries a smartphone these days, and most people prefer getting a text message to an email. Even for technical folks whose smartphones are integrated with their email systems, text messages are usually quicker and more efficient.
Enter ComHub. It’s Data Cave’s web application, based on Ruby on Rails, designed to work around these limitations. It can receive messages in any number of ways, make decisions on who needs to be alerted, and then send notifications via several different methods, as appropriate.
Once a message is in the system, recipients are processed, time windows are queried, subscriptions to keywords are checked, filters are referenced, and notifications are sent out. This is where ComHub really shines. By using a service like Twilio to deliver messages to cell phones, ComHub has a direct link to the cell network.
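To make those checks concrete, here is a minimal sketch in plain Ruby of how a message might be matched against recipients. It is not ComHub's actual code: the `Recipient` fields, the `in_window?` and `recipients_for` helpers, and the message hash layout are all assumptions made up for illustration.

```ruby
# Hypothetical recipient record; the field names are illustrative, not ComHub's schema.
# window is [start_hour, end_hour) on a 24-hour clock.
Recipient = Struct.new(:name, :keywords, :window, :muted_sources, keyword_init: true)

# True when `hour` falls inside the window, including windows that wrap past midnight.
def in_window?(window, hour)
  start_hour, end_hour = window
  if start_hour <= end_hour
    hour >= start_hour && hour < end_hour
  else
    hour >= start_hour || hour < end_hour
  end
end

# Select recipients who are on duty, subscribed to a keyword found in the
# message body, and have not filtered out the originating system.
def recipients_for(message, recipients, hour)
  text = message[:body].downcase
  recipients.select do |r|
    in_window?(r.window, hour) &&
      r.keywords.any? { |k| text.include?(k.downcase) } &&
      !r.muted_sources.include?(message[:source])
  end
end

team = [
  Recipient.new(name: "alice", keywords: ["ups", "generator"], window: [8, 18], muted_sources: []),
  Recipient.new(name: "bob", keywords: ["ups"], window: [22, 6], muted_sources: ["cacti"])
]
message = { source: "nagios", body: "UPS on battery power" }

recipients_for(message, team, 10).map(&:name) # => ["alice"]
recipients_for(message, team, 23).map(&:name) # => ["bob"]
```

In a real deployment, the selected recipients would then be handed to a delivery step (an SMS gateway, with email as the fallback); that part is left out here because it depends on the provider.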
ComHub routes messages from multiple subsystems, throttles messages if we have a cascading failure, and even sends acknowledgements back to originating systems. In the usual workflow, email can be avoided entirely, though it is used as a backup. There are rough edges and tons of ideas still to implement, but we are using it in production right now and have made it available as open source software. You can download a copy from GitHub.
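Throttling during a cascading failure can be done in several ways. The sketch below shows one common approach, a sliding-window limit, and is only an assumption about the general idea: the `Throttle` class, its `limit` and `period` parameters, and the `allow?` method are hypothetical names, not ComHub's implementation.

```ruby
# Sliding-window throttle (illustrative): allow at most `limit`
# notifications within any `period` seconds.
class Throttle
  def initialize(limit:, period:)
    @limit = limit
    @period = period
    @sent_at = [] # timestamps of notifications that were allowed
  end

  # Returns true and records the send if a notification may go out at `now`
  # (seconds); returns false if the window is already full.
  def allow?(now = Time.now.to_f)
    @sent_at.reject! { |t| now - t >= @period } # forget sends outside the window
    return false if @sent_at.size >= @limit

    @sent_at << now
    true
  end
end

throttle = Throttle.new(limit: 2, period: 60)
throttle.allow?(0.0)  # => true
throttle.allow?(1.0)  # => true
throttle.allow?(2.0)  # => false, two messages already sent in the last 60 seconds
throttle.allow?(60.5) # => true, the send at 0.0 has aged out
```

Keeping one throttle per recipient or per originating system would let a noisy subsystem be quieted without silencing unrelated alerts.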
ComHub’s integration with all of our other systems means nothing of interest happens inside Data Cave without multiple people knowing about it. If anything needs our attention, we can respond immediately, usually from wherever we happen to be.
We take our monitoring seriously.