Icinga 2 v2.5 released

We’ve come a long way with our new release Icinga 2 v2.5. After the 2.4 release in November we’ve focused on fixing many of the remaining bugs. 2.5 isn’t just a feature release – it also includes all the bugfixes from the past months.

 

InfluxDB

[Screenshot: Vagrant box with Icinga 2, InfluxDB and Grafana]

A big thank you to Simon Murray & DataCentred for contributing the new InfluxDB feature! Dive into the documentation details or just try it out yourself. We’ve also added a new Vagrant box “icinga2x-influxdb” just for Icinga 2, InfluxDB and Grafana :)
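
Enabling the feature boils down to an InfluxdbWriter object. A minimal sketch, assuming a local InfluxDB listening on the default port and a database named “icinga2”:

object InfluxdbWriter "influxdb" {
  host = "127.0.0.1"
  port = 8086
  database = "icinga2"
}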

 

Timeperiod Excludes

Don’t want to be notified during the holidays? Add an on-call exclusion for a specific time period? We’re really happy that Philipp Dallig contributed the long-awaited time period exclusion and inclusion feature to Icinga 2. You’ll also find updated examples for specific time ranges in the documentation.
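
A minimal sketch of how excludes could look – the period names and ranges below are made-up examples:

object TimePeriod "holidays" {
  import "legacy-timeperiod"

  ranges = {
    "december 25" = "00:00-24:00"
    "january 1" = "00:00-24:00"
  }
}

object TimePeriod "oncall" {
  import "legacy-timeperiod"

  ranges = {
    "saturday" = "00:00-24:00"
    "sunday" = "00:00-24:00"
  }

  excludes = [ "holidays" ]
}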

 

IDO Performance

The IDO database feature now supports an incremental config dump. Future restarts only update what really changed instead of dumping the full config. This tremendously decreases the database reconnect time on restart, making it 10 times (!) faster than before. We’ve tested this in large-scale customer environments (for example, 60k services, 2000 clients and 60k dependencies running in an HA cluster).

2.4.10

[2016-08-17 13:06:03 +0200] information/IdoMysqlConnection: Finished reconnecting to MySQL IDO database in 320.08 second(s).

2.5.0

[2016-08-17 14:31:17 +0200] information/IdoMysqlConnection: Finished reconnecting to MySQL IDO database in 29.4937 second(s).

 

API

Two new endpoints allow you to fetch global variables (/v1/variables) as well as defined template names (/v1/templates). Notification state and type filters can now be specified as string values (e.g. “OK”).
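
Querying them could look like this – host, port and the root/icinga credentials are assumptions for a local test setup:

$ curl -k -s -u root:icinga 'https://localhost:5665/v1/variables'
$ curl -k -s -u root:icinga 'https://localhost:5665/v1/templates'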

There’s also a new API action /v1/actions/generate-ticket which allows you to fetch the ticket required for client setups with CSR auto-signing. That way your automated setups will work like a breeze – make sure to check the updated documentation bits too.
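
A sketch of such a call, assuming the same local test credentials and a made-up client common name:

$ curl -k -s -u root:icinga -H 'Accept: application/json' \
 -X POST 'https://localhost:5665/v1/actions/generate-ticket' \
 -d '{ "cn": "icinga2-client1.localdomain" }'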

 

Cluster

We’ve fixed a bug where one faulty client would cause other clients to disconnect. While analysing cluster stability issues we’ve also added more detailed log messages. There is a known issue with message routing for zones with more than two endpoints – we recommend having only two endpoints per zone for now.

Uwe Ebel contributed the API/cluster configuration attributes for the accepted cipher list as well as the minimum TLS version – thanks a lot!
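
Added to the existing ApiListener object (usually in features-available/api.conf), this could look like the following sketch – the cipher string is only an example, not a recommendation:

object ApiListener "api" {
  // ... existing certificate settings ...

  cipher_list = "ALL:!LOW:!WEAK:!MEDIUM:!EXP:!NULL"
  tls_protocolmin = "TLSv1.1"
}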

 

Documentation

[Image: distributed monitoring automation with a Docker client]

We weren’t happy with the documentation chapters explaining how the cluster works and how the Icinga 2 client has to be installed. It was complicated, and our community channels were literally flooded with questions. We’ve taken the hard road and purged the old content.

You’ll now find two new chapters inside the documentation:

  • Service Monitoring – a good starting point for plugin integration and more examples based on the numerous CheckCommand definitions (thanks everyone for contributing!).
  • Distributed Monitoring with Master, Satellites and Clients – from roles to zones to setup to configuration modes (“top down” and “bottom up”) to scenarios. All done with real-world examples and newly added images helping you get your distributed environment going.

The distributed monitoring chapter also explains how to automate your client setup, using a Docker client as an example. We’ve also made sure to add best practices and advanced hints which we learned from you, our community :)

 

More Release Highlights

  • Debian and RHEL packages for vim/nano syntax highlighting.
  • DateTime type for formatting timestamps in the configuration DSL.
  • Performance improvements for config validation.
  • Many bugfixes for check execution, notifications, downtimes, etc.

 

Changes

When upgrading your distributed environment, make sure to upgrade your master and satellite instances to v2.5 first. Clients using v2.4.x may still work, but should be upgraded as well.

An IDO database schema update is required (2.5.0.sql). The categories attribute now requires the array notation (a deprecation warning is logged for the old syntax).
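
For example, a bitmask-style definition like the first line below should be migrated to the array notation shown in the second (the category names are the common defaults):

categories = DbCatConfig | DbCatState             // deprecated notation
categories = [ "DbCatConfig", "DbCatState" ]      // new array notation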

On new installations the icinga2.conf file includes plugins-contrib, manubulon, nscp and windows-plugins by default. This makes deploying checks even easier, but may collide with your own CheckCommands synced in global zones.
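
The relevant include lines in icinga2.conf then look roughly like this:

include <plugins>
include <plugins-contrib>
include <manubulon>
include <windows-plugins>
include <nscp>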

 

Update Icinga 2

As always, test the new release in your staging environment prior to upgrading production. Make sure to read the full Changelog. Note: There was a release-critical bug in 2.5.0, so we decided to go for a fixed 2.5.1 release.

A second note: The notification bug was not fully fixed either. Go for 2.5.3, which includes this fix and one for group members in DB IDO.

Updated packages will be available soon.

Special thanks to all contributors and testers making Icinga 2 v2.5 great! Simon, Philipp, Uwe, Rune, Tobias, Blerim, Bernd, Eric, Markus, Hannes, Dirk, Matthias, Emanuel … you know who you are :)

Share your Icinga 2 love

What really drives us making Icinga a great monitoring solution is community feedback and appreciation.

One thing which is really, really cool – when someone sends you an email and says “Look. Icinga 2 works fine. Awesome work.” – attaching a screenshot with a hell of a lot of CPU cores and RAM. I cannot tell you this time which company he’s working for (only that the company is a NETWAYS customer we’ve been working with). This is JUST AWESOME.

[Screenshot: htop showing 144 CPU cores on an Icinga 2 v2.4.10 system]

For reference I’ve also kindly requested a screenshot of the config validation to get an idea about the numbers and the time it takes. We’ve also seen even bigger customer environments with 100k services and 6000 Icinga 2 clients. More to come ;)

[Screenshot: Icinga 2 v2.4.10 config validation]

Now that we are like “WTF”, “oh. wow.” and “I want that hardware.” – I’d like to ask you to do us a favor :)

No matter how big or small your Icinga 2 environment is – please send us your screenshots of “htop” and “icinga2 daemon -C” (the numbers at the end) to info@icinga.org :-)

Additional performance details are highly appreciated as well.

  • $ time icinga2 daemon -C (compare 2.4.10 and the upcoming 2.5.0)
  • IDO Mysql/Pgsql Reconnect-Logging (compare 2.4.10 and the upcoming 2.5.0)

Icinga 2 doesn’t phone home, so our motivation of seeing Icinga in production worldwide comes from you – share your Icinga love with us :)

I promise to highlight your stories here in the future. If we meet at an Icinga Camp I’ll bring you some of those famous “dragee keksi” too ;)

Analyse Icinga 2 problems using the console & API

Lately we’ve been investigating a problem with the check scheduler. It resulted in check results being late and wasn’t easy to tackle – it could have been the check scheduler, cluster messages or anything else. We’ve been analysing customer and partner environments in depth and learned a lot ourselves, which we’d like to share with you here.

One thing you can normally do is grep the debug log and analyse the problem in depth. It is also possible to query the Icinga 2 API, fetching interesting object attributes such as “last_check”, “next_check” or even “last_check_result”.
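
Such an API query could look like this – host, credentials and the service name are assumptions for a local test setup:

$ curl -k -s -u root:icinga 'https://localhost:5665/v1/objects/services/icinga2-master1.localdomain!icinga?attrs=last_check&attrs=next_check'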

But what if you want to calculate things for better analysis, e.g. fetch the number of all services on an HA cluster node where the check results are late?

The “icinga2 console” CLI command connected to a running Icinga 2 node using the API is key here.

Primarily the icinga2 console allows for testing config expressions, but it can also be used to fetch all objects. Backed by the Icinga 2 DSL capabilities, the console fires the “execute-script” action towards the Icinga 2 API. Note: Now we are really into programming things here. If you say – hey, I’m not a coder – keep on learning the Icinga 2 DSL. If you require in-depth help with problems, kindly join the community channels and/or ask our partners for professional support.

 

Preparations

Start the “icinga2 console” using the --connect parameter. You can put the API credentials into shell environment variables, which is more secure than passing them in the connect string.

$ ICINGA2_API_USERNAME=root ICINGA2_API_PASSWORD=icinga icinga2 console --connect 'https://localhost:5665/'

 

Fetch the last check time from a service

The following example fetches the service object “icinga” for the local NodeName host and its “last_check” attribute. This involves a function call to get_service().

 => get_service(NodeName, "icinga").last_check
1469784497.333508

In case you prefer a readable time string instead of the raw UNIX timestamp, the upcoming 2.5 release adds the possibility to format the time value as a string (DateTime).

 => DateTime(get_service(NodeName, "icinga").last_check).to_string()
"2016-07-29 13:17:57 +0200"

 

Fetch all services and their last check

Fetching all service objects and printing their name and last_check attribute involves a temporary array and a for loop iterating over all service objects. The final “res” statement prints its value to the console.

 => var res = []; for (s in get_objects(Service)) { res.add([s.__name, s.last_check]) }; res

 

Fetch all services where the check result is late

Now it is time to apply a filter to the services list retrieved from “get_objects(Service)”. Versions prior to 2.5 can solve this by adding your own custom prototype method to the Array class like this:

Array.prototype.filter = function(p) { var res = []; for (o in this) { if (p(o)) { res.add(o) } }; res }

In case you’re already using Icinga 2 v2.5 you can use the built-in method. Gunnar implemented that method as part of issue #12247. You may also persist this configuration inside the icinga2.conf file – it is just a restart away.

The Array#filter method requires a function callback as a parameter. This function is executed and evaluated for each array element, returning a boolean value. All elements which match will be inserted into the newly returned array.

Either use a globally defined function or define a lambda inline. The following lambda takes “s” as its parameter and checks whether the value of “s.last_check” is less than the current time minus two times the value of “s.check_interval”. That way you can easily find late check results without any hardcoded offset, normalised on the configured check interval.

s => s.last_check < get_time() - 2 * s.check_interval

Now let’s fetch all service object names and the formatted last_check timestamp into the “res” array, where the time since last_check is greater than our defined check_interval offset. Note: The get_time() function returns the current UNIX timestamp.

 => var res = []; for (s in get_objects(Service).filter(s => s.last_check < get_time() - 2 * s.check_interval)) { res.add([s.__name, DateTime(s.last_check).to_string()]) }; res
[ [ "10807-host!10807-service", "2016-06-10 15:54:55 +0200" ], [ "mbmif.int.netways.de!disk /", "2016-01-26 16:32:29 +0100" ] ]

If you are not necessarily interested in the names but in a general count, just use Array#len on the returned “res” array.

 => var res = []; for (s in get_objects(Service).filter(s => s.last_check < get_time() - 2 * s.check_interval)) { res.add([s.__name, DateTime(s.last_check).to_string()]) }; res.len()
2.000000

[Screenshot: icinga2 console --connect example]

Count late check results in HA setup

If you are especially interested in the services being checked on a specific HA cluster node, the calculation needs some adjustments. A host or service object which is not checked on the current HA endpoint is marked as “paused = true”. Vice versa, all locally scheduled and checked objects are marked as “paused = false”.

The solution is simple, based on what you’ve learned above already. Change the result set into a dictionary like this:

var res = {}

The key is extracted from the current service “s” attribute “paused”; the value increments the current value for this key. That way we’ll end up with a dictionary containing “false” and “true” as keys and the number of late check results for each. If you are asking – why should I care about checked objects with “paused = true”, since they are run on the other endpoint in my HA cluster? Simple as it sounds – the check results are replicated from the other node to the local one. If they are not fresh, either they are not actively scheduled/executed, or the cluster communication is not fully intact.

 => var res = {}; for (s in get_objects(Service).filter(s => s.last_check < get_time() - 2 * s.check_interval)) { res[s.paused] += 1 }; res
{
	@false = 2.000000
	@true = 1.000000
}

Depending on which HA node the icinga2 console is connected to, the counts for true and false should be swapped.

 

Check how often parent services are used

This example iterates through all Dependency objects and provides a unique count of all “parent_service_name” definitions. We used that to gather insights into setups with clients using command_endpoint and a possible health check. get_objects() works for any config object type; in this example we’re iterating over the Dependency objects and counting how often each parent service name occurs.

 => var res = {}; for (dep in get_objects(Dependency)) { res[dep.parent_service_name] += 1 }; res
{
    "icinga" = 54297.000000
    "vmware-health" = 5.000000
}

Such an analysis helps to understand whether checks are not executed because they are bound to dependencies, or whether they cause additional checks.

 

How many hosts with service checks using Command Endpoints are not connected

This analysis is based on the assumption that a command endpoint check (“Command Execution Bridge”) changes its state to UNKNOWN (3) and puts the string “connected” into its output.

This customer setup consists of a three-level cluster – an HA master zone, satellites for specific regions, and clients which are checked from the satellite zones using command endpoint.

The idea is to filter by the satellite zones and check how many client endpoints below those satellites are not connected.

  • Iterate over all service objects using get_objects(Service)
  • Check if their state is 3 (UNKNOWN) and the output of the last_check_result matches “connected”
  • Store the matching host name in the “res” dictionary: use the service’s zone as key and append the host name as an array element

Now we would have a “res” dictionary with all zones as keys and an array of matching host names for each zone. That array may have duplicates (multiple services not connected for one host).

 => var res = {}; for (s in get_objects(Service)) { if (s.state==3) { if (match("*connected*", s.last_check_result.output)) { res[s.zone] += [s.host_name] } } };  res

Therefore we apply an additional iteration over the “res” dictionary. Note: Array#unique doesn’t exist in versions prior to 2.5, but you can build it like this:

Array.prototype.unique = function() { var res = []; for (o in this) { if (o !in res) { res.add(o) } }; res }

  • Iterate over the “res” dictionary and override the current key’s value
  • Make the array elements unique and only store the length

Now we have a “res” dictionary which holds all zones and the number of hosts where one or more services are currently not connected.

 => var res = {}; for (s in get_objects(Service)) { if (s.state==3) { if (match("*connected*", s.last_check_result.output)) { res[s.zone] += [s.host_name] } } };  for (k => v in res) { res[k] = len(v.unique()) }; res
{
	Asia = 31.000000
	Europe = 214.000000
	USA = 207.000000
}

 

Find services which use a command_endpoint from a parent zone

Those checks won’t work due to security restrictions, yet it is tremendously hard to figure out why they are not executed. Gunnar has therefore used the Icinga 2 DSL to implement a function which provides such lookups.

This little helper function is registered globally for later usage. It simply looks up the zone for the given endpoint object.

globals.zone_for_endpoint = function(endpoint) { for (zone in get_objects(Zone)) { if (endpoint.name in zone.endpoints) { return zone } }; null }

This is the big one which checks the zone hierarchy for the used command_endpoint. We’re going to use it as a filter callback later on.

  • If there isn’t any command_endpoint attribute set, return false (“hierarchy is valid”)
  • Fetch the check_endpoint object from the given command_endpoint name
  • Fetch the check_zone_name string from the check_endpoint
  • Set the authoritative zone (auth_zone_name) from the checkable’s zone name
  • Iterate over all zones and their parents, starting from the current checkable zone (auth_zone_name)
  • If the command_endpoint’s zone (check_zone_name) matches one of those parent zones, the hierarchy is invalid
  • If not, jump to the next zone level above and check again

globals.is_invalid_hierarchy = function(c) {
  /* no command_endpoint set - the hierarchy is valid */
  if (!c.command_endpoint) {
    return false
  }

  /* resolve the endpoint object and the name of its zone */
  var check_endpoint = get_object(Endpoint, c.command_endpoint)
  var check_zone_name = zone_for_endpoint(check_endpoint).name

  /* walk up the zone hierarchy, starting at the checkable's zone */
  var auth_zone_name = c.zone

  while (auth_zone_name) {
    auth_zone_name = get_object(Zone, auth_zone_name).parent

    /* the command endpoint lives in a parent zone - invalid */
    if (auth_zone_name == check_zone_name) {
      return true
    }
  }

  return false
}

That way this function returns a boolean value which can be evaluated for all checkable objects. Note: Execute this from a satellite (or any child) zone. The result would now print all service objects and their attributes. Another trick – Array#map takes a lambda function which replaces each array element (service object) with just the full service name (s.__name). In versions prior to 2.5 you can manually define it like this:

Array.prototype.map = function(m) { var res = []; for (o in this) { res.add(m(o)) }; res }

The result is pretty straightforward:

 => get_objects(Service).filter(s => is_invalid_hierarchy(s)).map(s => s.__name)
[ "icinga-master01.domain.com!proc ntp", "icinga-master02.domain.com!proc ntp" ]

 

Conclusion

While it may look terribly hard to implement and understand – once you’re in the flow you’ll never look back. Let us know which debugging tricks and analysis you’ve already done using the Icinga 2 console & API :)

Note: In these examples the icinga2 console is used solely read-only, for debugging purposes. Keep in mind that an “execute-script” action pushed towards a running Icinga 2 is the same as operating as “root” on your server. Don’t try to modify or delete things unless you know what you’re doing.

Monthly Snap July – Dev Updates, Events & Social

Sometimes you have so much to do that it is pretty hard to give an update on what’s going on. Sounds familiar? Welcome to my world ;-) I’d like to try a new format of information updates on a monthly basis – an idea kindly borrowed from my employer’s blog.

These details should inform you about what’s cool, what’s going on, what’s cooking, and what’s coming up. Please let us know what you think about it!

Development Updates

 

Icinga 2

[Image: Icinga Director meme]

Kudos to Christian Stankowic

We’ve debugged Icinga 2 quite in depth in several customer environments in the past month. There are plenty of fixes coming with the next 2.5 major release. This includes a bug fix for command endpoint message routing, client disconnects when another client fails (“not signed by CA”) and numerous other bug fixes. One thing which came up – if you have more than two endpoints in one zone, there is a known bug with check result messages. It is currently advised to only have two endpoints per zone until we investigate the issue further.

Last week we fixed a bug in the check scheduler (one of those release-critical issues). Right now we are heavily investigating a possible IDO deadlock and further notification issues. Once these are fixed we’ll happily continue testing Icinga 2 and make v2.5 a stable release you can count on. Our plans target August as the release month. Get your hands dirty and help test the snapshot packages!

Icinga Web 2

Not much to say this time about the web framework. There is some consolidation work going on with official modules under the hood. The Icinga Director is under heavy development as always. Follow its development closely on GitHub and the issue tracker.

Icinga Exchange and Accounts

Under the hood the developers are working on bug fixes and integrating new features such as syncing git tags and releases or fixing the tag search. An upgrade of the live system is expected soon.

Upcoming events

If you are an early bird – there is a 25% discount for Icinga Camp Berlin 2017 waiting for you. We’re also looking for speakers and community members at our upcoming Icinga Camps.

Team Icinga will also attend OSMC in late November accompanied by lots of cool talks related to monitoring and Icinga.

Social

Community members are active everywhere. We do see a lot of questions asked over at monitoring-portal.org. Chime in and lend us a hand with sharing your knowledge! Thanks in advance :)

 

The Icinga 2 book (currently German only) is getting a lot of nice feedback. The two authors Lennart and Thomas told us that they are in contact with the publishers to create an English version as well. For those waiting for an ebook – it is now available (again, German only).

 

Moving from Nagios to Icinga 2 – a journey worth a look? Even if you don’t speak German you should definitely follow Marianne’s blog posts. So much good feedback – and she is also actively reporting issues. Thanks a lot for your appreciation!

 

Jens is actively migrating the current Icinga 1.x environment to Icinga 2 at Müller. Most recently he discovered the possibilities of the Icinga Director.

 

The upcoming Nagstamon 2.0 release features Icinga Web 2 support. Kindly test and give the developer feedback!

 

Icinga Web 2 and also the Icinga Director are a result of many discussions and plenty of hours of development. We are proud that our users love it :)

 

Thomas (the author of the Icinga 2 book) was working on a Logstash check plugin for the new stats API available with Logstash 5.0. He didn’t realize that Jordan Sissel himself sent in a patch ;)

Last but not least – a Grafana Dashboard using the Icinga 2 API. What the heck? ;-)

 

Oh and Blerim is now officially part of our team. You’ll hear more from him in the next months :)

Monitoring MySQL database size

Our community support channels provide interesting insights into how things are being monitored in various environments. Sometimes it is not only about finding the right configuration syntax or fiddling with the perfect cluster setup. This time I’d like to share a solution for a common problem that I discovered while I was helping another Icinga user. :-)

Monitor the size of a database

Sounds easy if you are familiar with MySQL and the common check plugins – but putting it all together might get complicated, especially for beginners.

Luckily, the question already provided a sample SQL query for fetching the database size:

MariaDB [(none)]> select sum(data_length + index_length) / 1024 / 1024 as "db size" from information_schema.tables where table_schema = 'icinga';
+-------------+
| db size     |
+-------------+
| 31.09375000 |
+-------------+
1 row in set (0.01 sec)

Two questions arise:

  • Is there a plugin which automatically checks the database size from a given parameter?
  • Alternatively, can I just run this query and compare the returned integer value in MB?

Find a plugin and integrate it

There’s a basic check_mysql plugin that is part of the Monitoring Plugins project. Additionally, check_mysql_health has proven itself in many environments: it offers fast and easy monitoring. This plugin is also part of the Icinga training sessions, demonstrating its power.

Once you’ve successfully installed it into your plugin directory (I’m skipping detailed installation instructions here), let’s go for a CheckCommand definition. The Icinga 2 Template Library (ITL) already provides such a definition inside the contributed plugins section. Include the plugins in the file /etc/icinga2/icinga2.conf:

include <plugins-contrib>

Also, make sure to define the constant PluginContribDir in the file /etc/icinga2/constants.conf:

const PluginContribDir = "/usr/lib64/nagios/plugins"

Now it is time to read about the required parameters in the documentation. We’ll need that information for setting the appropriate custom attributes later.

Create a Host and Service Apply Rule

[Screenshot: mysql_health db-size check in Icinga Web 2]

One thing I’m always keen on: use a custom attribute dictionary on the host and allow passing as many custom parameters to the service objects as possible. Combine this with an apply for rule and use the possibilities of the Icinga 2 DSL.

In order to use the mysql_health CheckCommand we’ll need to set at least the following custom attributes:

  • mysql_health_hostname: Defaults to the host’s address attribute (optional).
  • mysql_health_username: MySQL database user with the appropriate permissions for the information_schema database here.
  • mysql_health_password: MySQL database user password.
  • mysql_health_mode: “sql”, since we want to run a generic SQL query here.
  • mysql_health_name: SQL query string we want to execute; ensure that it returns a single number/count.
  • mysql_health_name2: In combination with the “sql” mode this sets the performance data label/output prefix.
  • mysql_health_units: The default calculation uses MB, so we’ll tell the plugin to use it as performance data unit.
  • mysql_health_warning: Warning threshold in MB.
  • mysql_health_critical: Critical threshold in MB.

This is a long list, but once you’ve carefully read the plugin documentation and tested the various parameters it will become clearer.
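
Mapped to plain plugin parameters, a manual test run could look roughly like this (credentials and thresholds are just examples):

$ check_mysql_health --hostname 127.0.0.1 --username root --password icingar0xx \
  --mode sql --name "select sum(data_length + index_length) / 1024 / 1024 from information_schema.tables where table_schema = 'icinga';" \
  --name2 db_size --units MB --warning 4096 --critical 8192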

Let’s construct an apply for rule generating services based on the host custom attribute databases (this is a dictionary/hash with the database name as key and multiple parameters, e.g. for thresholds).

apply Service "db-size-" for (db_name => config in host.vars.databases) {

Define the intervals and include the mysql_health check command:

  check_interval = 1m
  retry_interval = 30s

  check_command = "mysql_health"

Check whether the host dictionary provides additional configuration (such as a different database username or password) and set a default. In this example the root user has access to information_schema.

  if (config.mysql_health_username) {
    vars.mysql_health_username = config.mysql_health_username
  } else {
    vars.mysql_health_username = "root"
  }
  if (config.mysql_health_password) {
    vars.mysql_health_password = config.mysql_health_password
  } else {
    vars.mysql_health_password = "icingar0xx"
  }

Now specify the sql mode and build the query. Cool thing – the query is based on the current database name that we are generating a service object for. That way we don’t need two separate apply rules for the databases icinga and icingaweb2 later on.

  vars.mysql_health_mode = "sql"
  vars.mysql_health_name = "select sum(data_length + index_length) / 1024 / 1024 from information_schema.tables where table_schema = '" + db_name + "';"
  vars.mysql_health_name2 = "db_size"
  vars.mysql_health_units = "MB"

Optionally, inherit the warning and critical thresholds defined in the host dictionary databases. Its value is mapped into the config dictionary in the local apply for scope. Any additional parameters are inherited into the service custom attributes in vars.

  if (config.mysql_health_warning) {
    vars.mysql_health_warning = config.mysql_health_warning
  }
  if (config.mysql_health_critical) {
    vars.mysql_health_critical = config.mysql_health_critical
  }

  vars += config
}

Question for the reader: What should the host object look like in order to generate services? :-)

The answer is simple – based on existing examples and the documentation, it is pretty straightforward:

object Host "icingamaster" {
  address = "127.0.0.1"
  check_command = "hostalive"

  /* database checks */
  vars.databases["icinga"] = {
    mysql_health_warning = 4096 //MB
    mysql_health_critical = 8192 //MB
  }
  vars.databases["icingaweb2"] = {
    mysql_health_warning = 4096 //MB
    mysql_health_critical = 8192 //MB
  }
}

Voilà – validate your configuration and reload the Icinga 2 service.
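
On a systemd-based system that typically boils down to these two commands (the service name may vary on your platform):

$ icinga2 daemon -C
$ systemctl reload icinga2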

Since this is a real-world example, I’ve also integrated it into the icinga2x Vagrant box. :-)

Conclusion

[Screenshot: mysql_health db-size graph in Grafana]

While it is not always clear which plugin is the best, it’s always worth looking into the existing ITL CheckCommand definitions. Maybe there already is one which also provides the perfect answer to your questions. If not, hop onto Icinga Exchange and submit the newly created CheckCommand definition upstream. :-)

check_mysql_health provides many possibilities for monitoring your databases (local or remote) and is fairly easy to set up. Once you know the required monitoring metrics, e.g. by manually executing the plugin or querying your database, the integration always follows the same pattern (CheckCommand, host, and service configuration).

Once the Icinga 2 configuration validation returns OK, reload the daemon and enjoy fancy monitoring in Icinga Web 2 and graphs in your preferred metrics dashboard.