Monitoring and beer
Published on December 2, 2011 by Filip Van Tittelboom
Our partners expect their web applications to be available 24/7, so one of the less visible services we provide is monitoring. If a problem arises, we want to be notified. If it occurs after business hours, the operator on call should be notified. This article explains how we implement these requirements.
We use Nagios to monitor all our hosts and all the services running on them. When something goes wrong, an alert lights up on our desktops, and an e-mail and an SMS are sent out. The great thing about Nagios is its extensibility. There is a large collection of readily available check plugins for monitoring things such as Apache, MySQL and system load. It also provides a way to write custom plugins, so we can check very specific aspects of any service or application.
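To give an idea of what a custom plugin looks like, here is a minimal sketch in Python. It is not one of our actual checks: the job queue, the `run_check` name and the thresholds are all made up for illustration. What it does follow is the standard plugin convention: one status line of output (optionally with performance data after the `|`) plus an exit code of 0, 1, 2 or 3 for OK, WARNING, CRITICAL or UNKNOWN.

```python
# Exit codes that Nagios plugins use to report their state.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def run_check(pending, warn=100, crit=500):
    """Return (exit_code, status_line) for a hypothetical job-queue check.

    `pending` is the number of jobs waiting in the queue, or None when
    the value could not be read. The thresholds are assumptions.
    """
    if pending is None:
        return UNKNOWN, "QUEUE UNKNOWN - could not read queue length"
    if pending >= crit:
        code = CRITICAL
    elif pending >= warn:
        code = WARNING
    else:
        code = OK
    # Status text first, then performance data as label=value;warn;crit.
    line = (
        f"QUEUE {LABELS[code]} - {pending} pending jobs"
        f" | pending={pending};{warn};{crit}"
    )
    return code, line
```

A real plugin would fetch the value itself, print the status line and then call `sys.exit(code)` so that Nagios can pick up the state.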
I know what you are all thinking: “But what if your Nagios server fails?” We use a redundant setup: two Nagios servers running at two different cloud providers. One acts as the master and the other as the slave. The slave constantly checks the status of the master. If the master stops responding for any reason, the slave springs into action.
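The failover logic can be sketched roughly as follows. This is an illustration, not our actual configuration: the `StandbyMonitor` class, the threshold of three consecutive failed checks and the rule that the slave steps down again once the master is back are all assumptions. The probe is passed in as a function so the sketch stays independent of how the master is actually checked.

```python
from typing import Callable


class StandbyMonitor:
    """Tracks the master's health from the slave's point of view."""

    def __init__(self, probe_master: Callable[[], bool], max_failures: int = 3):
        self.probe_master = probe_master  # returns True if the master answered
        self.max_failures = max_failures  # assumed threshold before taking over
        self.failures = 0
        self.active = False

    def tick(self) -> bool:
        """Run one check and return True while the slave should be active."""
        if self.probe_master():
            # Master is healthy: reset the counter and stand down.
            self.failures = 0
            self.active = False
        else:
            self.failures += 1
            if self.failures >= self.max_failures:
                # Too many missed checks in a row: take over.
                self.active = True
        return self.active
```

Waiting for a few consecutive failures instead of reacting to the first one keeps a single lost packet from triggering a takeover.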
We have an on-call rotation divided among three operators here at iLibris. The operator on call is available outside business hours, around the clock, including weekends. But what if the operator on call is unavailable (even scruffy sysadmins take a shower once in a while)? In that case the escalation kicks in: everybody gets notified. If an operator lets that happen, he’ll have to pay a beer tax, though.
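For the curious, the rotation and the escalation boil down to something like the sketch below. Everything specific in it is assumed for the sake of the example: the operator names, the weekly hand-over on Monday, the 15-minute acknowledgement window and the function names.

```python
from datetime import date

OPERATORS = ["alice", "bob", "carol"]  # hypothetical names


def on_call_for(day: date, operators=OPERATORS) -> str:
    """Pick the operator on call, rotating weekly (assumed) on Mondays."""
    # Ordinal 1 is a Monday, so this counts whole Monday-to-Sunday weeks.
    week = (day.toordinal() - 1) // 7
    return operators[week % len(operators)]


def notify_targets(on_call, minutes_unacknowledged,
                   escalate_after=15, operators=OPERATORS):
    """Return (people_to_notify, who_owes_beer).

    The alert goes to the operator on call first. Once it has been left
    unacknowledged for `escalate_after` minutes (an assumed value),
    everybody is notified and the operator on call owes the beer tax.
    """
    if minutes_unacknowledged < escalate_after:
        return [on_call], None
    return list(operators), on_call
```

So `notify_targets("bob", 5)` only bothers Bob, while `notify_targets("bob", 20)` wakes up everybody and puts a round on Bob's tab.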
Anyway, this is how we monitor our systems. Our goal is to fix any problem before it’s even noticed.