Given that Icinga was developed out of a need for additional functionality in open source monitoring, it’s little surprise that the tool has become indispensable for so many IT professionals. Its configurability and flexibility allow for a sophisticated approach to monitoring that scales and extends to large, complex environments.
The 2016 State of Monitoring Report confirmed the popularity of Icinga, with a large number of respondents naming it their primary systems monitoring tool. Of those who reported using the tool, almost 40% stated that they work in an enterprise organization with 1000+ employees. The vast majority reported that they push changes to production several times per day, and that infrastructure changes are deployed a few times per week. But at the same time, 64% stated that their IT team employs 20 or fewer people. Given the complexity and scale of the environments that Icinga monitors, and the ability of the tool to create sophisticated and detailed checks, it is perhaps understandable that resource-strapped IT teams encounter challenges managing and triaging high volumes of alerts.
Too much of a good thing?
In effect, sysadmins can become victims of their own success. They’ve configured the tool to identify potential problems effectively, but separating critical issues from warnings, and high-impact issues from low-impact ones, is challenging, particularly in complex, noisy environments.
Plus, most sysadmins don’t just have Icinga alerts to consider. According to the State of Monitoring Report, almost all Icinga users run anywhere from two to ten additional tools in their stack, which means that multiple systems may trigger alerts related to the same issue. With nothing to unify the various tools, IT teams are left with a pile of manual work to establish relationships among alerts before triage can begin. That chaos leads to missed SLAs, long mean time to resolve (MTTR), and, ultimately, customer outages. When you consider the sheer amount of machine data generated by agile, fragmented systems, there’s really no question that the scale goes beyond what any human – or even a small army of humans – is capable of managing.
Alert correlation – why it matters
BigPanda helps companies improve detection, accelerate remediation, and increase productivity through automated alert correlation. For a tool like Icinga, which is capable of doing so much, alert correlation provides an additional – and critical – layer of insight that allows sysadmins to effectively prioritize alerts and know what to do next.
To illustrate, let’s consider a typical “day in the life” of your average app. At any given time, checks on the app could complain of a whole host of issues: disk space, CPU consumption, low memory. Depending on the nature of these issues, some might be urgent while others aren’t. Some that aren’t urgent may be precursors to future high-severity events. How do you know what matters and what to do next? Enter BigPanda.
The BigPanda correlation engine consumes alerts generated by Icinga, in addition to any other monitoring tool an organization might be using, and then normalizes and groups them into related, high-level incidents. This not only significantly reduces the number of items that IT has to deal with, but it also centralizes alerts from all monitoring tools into a single pane of glass for easy management and tracking. BigPanda automatically enriches incidents with contextual information – such as runbooks, metrics, related incidents, and configuration items and code deploys – helping you quickly gain the understanding required to properly prioritize and remediate incidents.
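To make the normalization step concrete, here is a minimal Python sketch of mapping alerts from two tools onto one shared schema. It is an illustration only, not BigPanda's data model: the `Alert` fields, the payload keys (`host_name`, `service`, `node`, `metric`, and so on), and the second tool are all hypothetical. The only real-world fact it relies on is Icinga's numeric service states (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN).

```python
from dataclasses import dataclass


# Hypothetical common schema; the field names are illustrative assumptions.
@dataclass(frozen=True)
class Alert:
    source: str
    host: str
    check: str
    status: str
    timestamp: float


# Icinga service states: 0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN.
ICINGA_STATES = {0: "ok", 1: "warning", 2: "critical", 3: "unknown"}


def normalize_icinga(raw: dict) -> Alert:
    """Map an Icinga-style payload (assumed keys) onto the common schema."""
    return Alert(
        source="icinga",
        host=raw["host_name"].lower(),
        check=raw["service"].lower(),
        status=ICINGA_STATES.get(raw["state"], "unknown"),
        timestamp=float(raw["timestamp"]),
    )


def normalize_other(raw: dict) -> Alert:
    """Map a payload from a second, made-up tool onto the same schema."""
    return Alert(
        source="other_tool",
        host=raw["node"].lower(),
        check=raw["metric"].lower(),
        status=raw["severity"].lower(),
        timestamp=float(raw["ts"]),
    )
```

Once both tools emit the same `Alert` shape, alerts about the same host and check can be compared directly, whichever tool raised them.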
The BigPanda integration: How it works
BigPanda intelligently correlates alerts across monitoring systems by evaluating three main parameters:
Check: similar checks or error conditions across alerts and alert sources are a strong indicator that the alerts are related.
Time: the rate at which related alerts occur. Alerts occurring around the same time are more likely to be related than alerts occurring far apart.
Context: the host, host group, service, application, cloud, or other infrastructure element that emits the alerts. Alerts are more likely to be related when they come from the same components of your infrastructure.
Every new alert generated by Icinga will be automatically correlated against existing active incidents in BigPanda. If there’s no match, a new incident will be created. The system works out-of-the-box, without requiring customers to define an explicit set of rules or build a dependency model for their environments. However, BigPanda customers can customize the correlation rules to suit unique requirements based on their specific systems, applications, topologies, and division of duties. On average, BigPanda users benefit from compression rates upwards of 90% between raw alerts and correlated incidents.
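The matching flow described above can be sketched in a few lines of Python. This is a deliberately naive illustration, not BigPanda's actual algorithm: the five-minute window, the `Alert` and `Incident` classes, and the rule "close in time and sharing either a host or a check" are all assumptions chosen to show how check, time, and context might combine.

```python
from dataclasses import dataclass, field

WINDOW_SECONDS = 300  # assumed correlation window, not a BigPanda default


@dataclass
class Alert:
    host: str         # context: where the alert came from
    check: str        # check: what was being tested
    timestamp: float  # time: when the alert fired (seconds)


@dataclass
class Incident:
    alerts: list = field(default_factory=list)

    def matches(self, alert: Alert) -> bool:
        """Hypothetical rule: recent enough, and same host or same check."""
        last = self.alerts[-1]
        close_in_time = abs(alert.timestamp - last.timestamp) <= WINDOW_SECONDS
        same_context = any(a.host == alert.host for a in self.alerts)
        same_check = any(a.check == alert.check for a in self.alerts)
        return close_in_time and (same_context or same_check)


def correlate(alerts):
    """Attach each alert to the first matching incident, else open a new one."""
    incidents = []
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        for incident in incidents:
            if incident.matches(alert):
                incident.alerts.append(alert)
                break
        else:  # no existing incident matched
            incidents.append(Incident([alert]))
    return incidents


def compression_rate(n_alerts: int, n_incidents: int) -> float:
    """Fraction of raw alerts absorbed by grouping them into incidents."""
    return 1 - n_incidents / n_alerts
```

Under this definition of compression, 100 raw alerts grouped into 10 incidents is a 90% compression rate. A real engine would weigh many more signals; the sketch only shows the shape of the match-or-create loop.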
The best part? BigPanda can be set up in minutes with clicks, not code. We offer a tight and seamless integration that supports both Icinga and Icinga 2, and is ready in a few easy steps.
This is a guest blog post by our partner BigPanda.