Lately we’ve been investigating a problem with the check scheduler that resulted in late check results. It wasn’t easy to pin down whether the cause was the check scheduler, cluster messages or something else. We’ve been analysing customer and partner environments in depth and learned a lot ourselves, which we’d like to share with you here.

One thing you can normally do is grep the debug log and analyse the problem in depth. It is also possible to query the Icinga 2 API and fetch interesting object attributes such as “last_check”, “next_check” or even “last_check_result”.
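
As a rough illustration of such an API query, here is a minimal Python sketch that only builds the request URL for the objects endpoint, restricted to the attributes mentioned above. The helper name build_services_query is hypothetical; the host and port are taken from the connect string used later in this post, and authentication and TLS handling are left out on purpose.

```python
from urllib.parse import urlencode


def build_services_query(base_url, attrs):
    """Return a URL asking the objects API for selected service attributes."""
    # doseq=True repeats the key for each list item:
    # attrs=last_check&attrs=next_check&...
    query = urlencode({"attrs": attrs}, doseq=True)
    return f"{base_url.rstrip('/')}/v1/objects/services?{query}"


url = build_services_query(
    "https://localhost:5665/",
    ["last_check", "next_check", "last_check_result"],
)
print(url)
```

The resulting URL can be passed to any HTTP client together with the API credentials.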

But what if you want to calculate things for better analysis, e.g. fetch the number of all services on a HA cluster node where the check results are late?

The “icinga2 console” CLI command connected to a running Icinga 2 node using the API is key here.

Primarily, the icinga2 console allows for testing config expressions, but it can also be used to fetch all objects. Using the Icinga 2 DSL capabilities, the console fires the “execute-script” action against the Icinga 2 API. Note: Now we are really into programming things here. If you say “hey, I’m not a coder”, keep on learning the Icinga 2 DSL. If you require in-depth help with problems, kindly join the community channels and/or ask our partners for professional support.

 

Preparations

Start the “icinga2 console” using the --connect parameter. You can put the API credentials into your shell environment, which is more secure than passing them in the connect string.

$ ICINGA2_API_USERNAME=root ICINGA2_API_PASSWORD=icinga icinga2 console --connect 'https://localhost:5665/'

 

Fetch the last check time from a service

The following example fetches the service object “icinga” for the local NodeName host and its “last_check” attribute. This involves a function call to get_service().

 => get_service(NodeName, "icinga").last_check
1469784497.333508

If you prefer a human-readable timestamp, the upcoming 2.5 release adds the possibility to format the time value as a string (DateTime).

 => DateTime(get_service(NodeName, "icinga").last_check).to_string()
"2016-07-29 13:17:57 +0200"

 

Fetch all services and their last check

Fetching all service objects and printing their name and last_check attribute involves a temporary array and a for loop iterating over all service objects. The final “res” expression prints the array to the console.

 => var res = []; for (s in get_objects(Service)) { res.add([s.__name, s.last_check]) }; res

 

Fetch all services where the check result is late

Now it is time to apply a filter to the services list retrieved from “get_objects(Service)”. On versions prior to 2.5 you can solve this by adding your own custom prototype method to the Array class like this:

Array.prototype.filter = function(p) { var res = []; for (o in this) { if (p(o)) { res.add(o) } }; res }

In case you’re already using Icinga 2 v2.5 you can use the built-in method. Gunnar implemented that method as part of issue #12247. You may also persist this configuration inside the icinga2.conf file – it is just a restart away.

The Array#filter method requires a function callback as parameter. This function is executed and evaluated for each array element returning a boolean value. All elements which match will be inserted into the newly returned array.

You can either pass a globally defined function or define a lambda function inline. The following lambda takes “s” as parameter and checks whether “s.last_check” is less than the current time minus 2 times the value of “s.check_interval”. That way you can find late check results without any hardcoded offset, normalised on the configured check interval.

s => s.last_check < get_time() - 2 * s.check_interval

Now let’s fetch the name and the formatted last_check timestamp of every service whose last_check is older than twice its check_interval, and store them in the “res” array. Note: The get_time() function returns the current unix timestamp.

 => var res = []; for (s in get_objects(Service).filter(s => s.last_check < get_time() - 2 * s.check_interval)) { res.add([s.__name, DateTime(s.last_check).to_string()]) }; res
[ [ "10807-host!10807-service", "2016-06-10 15:54:55 +0200" ], [ "mbmif.int.netways.de!disk /", "2016-01-26 16:32:29 +0100" ] ]
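
To make the lateness condition easier to follow outside the console, here is a small Python model of the same predicate. It is not Icinga 2 DSL: the dictionaries stand in for service objects, and the sample names and values are made up for illustration.

```python
import time


def is_late(service, now=None):
    """True if last_check is older than two check intervals."""
    if now is None:
        now = time.time()
    return service["last_check"] < now - 2 * service["check_interval"]


# Hypothetical sample data, not taken from a real environment.
now = 1_000_000.0
services = [
    {"name": "host1!ping4", "last_check": now - 30, "check_interval": 60},
    {"name": "host1!disk", "last_check": now - 500, "check_interval": 60},
    {"name": "host2!load", "last_check": now - 121, "check_interval": 60},
]

late = [s["name"] for s in services if is_late(s, now)]
print(late)  # ['host1!disk', 'host2!load']
```

A service checked 30 seconds ago with a 60 second interval is fine, while one checked 121 seconds ago already counts as late.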

If you are not necessarily interested in names but a general count, just use Array#len on the returned “res” array.

 => var res = []; for (s in get_objects(Service).filter(s => s.last_check < get_time() - 2 * s.check_interval)) { res.add([s.__name, DateTime(s.last_check).to_string()]) }; res.len()
2.000000


Count late check results in HA setup

If you are especially interested in the services being checked on a specific HA cluster node, the calculation needs some adjustments. A host or service object which is not checked on the current HA endpoint is marked as “paused = true”. Vice versa, all scheduled and checked objects are marked as “paused = false”.

The solution is simple based on what you’ve learned above already. Change the result set into a dictionary like this:

var res = {}

The key is taken from the “paused” attribute of the current service “s”, and the value for this key is incremented by one. That way we’ll end up with a dictionary containing “false” and “true” as keys and the number of late check results for both. You may ask: why should I care about objects with “paused = true”, they are checked on the other endpoint in my HA cluster? The answer is simple: the check results are replicated from the other node to the local one. If they are not fresh, the checks are either not actively scheduled/executed, or the cluster communication is not fully intact.

 => var res = {}; for (s in get_objects(Service).filter(s => s.last_check < get_time() - 2 * s.check_interval)) { res[s.paused] += 1 }; res
{
	@false = 2.000000
	@true = 1.000000
}

Depending on which HA node the icinga2 console is connected to, the counts for true and false should be swapped.
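
The counting step can be modelled in Python as well. This is only a sketch of the logic, not Icinga 2 DSL: the function name count_late_by_paused and the sample data are hypothetical, and a Counter plays the role of the “res” dictionary.

```python
from collections import Counter


def count_late_by_paused(services, now):
    """Count late services, grouped by their 'paused' flag."""
    counts = Counter()
    for s in services:
        # Same condition as above: older than two check intervals.
        if s["last_check"] < now - 2 * s["check_interval"]:
            counts[s["paused"]] += 1
    return counts


# Hypothetical sample data: three late services and one fresh one.
now = 1_000_000.0
services = [
    {"paused": False, "last_check": now - 600, "check_interval": 60},
    {"paused": False, "last_check": now - 900, "check_interval": 300},
    {"paused": True, "last_check": now - 400, "check_interval": 60},
    {"paused": True, "last_check": now - 10, "check_interval": 60},
]

counts = count_late_by_paused(services, now)
print(counts[False], counts[True])  # 2 1
```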

 

Check how often parent services are used

This example iterates over all Dependency objects and counts how often each “parent_service_name” occurs. We used that to gather insights with clients using command_endpoint and a possible health check. get_objects() works for any config object type.

 => var res = {}; for (dep in get_objects(Dependency)) { res[dep.parent_service_name] += 1 }; res
{
    "icinga" = 54297.000000
    "vmware-health" = 5.000000
}

Such an analysis helps to understand whether checks are not executed because of dependencies, or whether dependencies are causing additional checks.

 

How many hosts with service checks using Command Endpoints are not connected

This analysis is based on the assumption that a command endpoint check (“Command Execution Bridge”) changes its state to UNKNOWN (3) and puts the string “connected” into its output.

This customer setup consists of a three level cluster – a HA master zone, satellites for specific regions and clients which are checked from the satellite zones using command endpoint.

The idea is to filter by the satellite zones and check how many client endpoints below those satellites are not connected.

  • Iterate over all service objects using get_objects(Service)
  • Check if their state is 3 (UNKNOWN) and the output of the last_check_result matches “connected”
  • Store the host name of each matching service in the “res” dictionary. Use the service zone as key and append the host name as array element

Now we would have a “res” dictionary with all zones as keys and an array of matching host names for each zone. That array may have duplicates (multiple services not connected for one host).

 => var res = {}; for (s in get_objects(Service)) { if (s.state==3) { if (match("*connected*", s.last_check_result.output)) { res[s.zone] += [s.host_name] } } };  res

Therefore we apply an additional iteration over the “res” dictionary. Note: Array#unique doesn’t exist in versions prior to 2.5 but you can build it like this:

Array.prototype.unique = function() { var res = []; for (o in this) { if (o !in res) { res.add(o) } }; res }

  • Iterate over the “res” dictionary and override the current key
  • Make the array elements unique and only store the length

Now we have a “res” dictionary which holds all zones and the number of hosts where one or more services are currently not connected.

 => var res = {}; for (s in get_objects(Service)) { if (s.state==3) { if (match("*connected*", s.last_check_result.output)) { res[s.zone] += [s.host_name] } } };  for (k => v in res) { res[k] = len(v.unique()) }; res
{
	Asia = 31.000000
	Europe = 214.000000
	USA = 207.000000
}
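
For readers who find the one-liner dense, here is the same grouping idea as a Python sketch. It is a model of the logic rather than Icinga 2 DSL: a set per zone replaces the array plus Array#unique step, a substring test stands in for match("*connected*", ...), and all sample names and outputs are invented.

```python
from collections import defaultdict


def disconnected_hosts_per_zone(services):
    """Map each zone to the number of distinct hosts with a 'not connected' service."""
    hosts = defaultdict(set)
    for s in services:
        # State 3 is UNKNOWN; the output must contain "connected".
        if s["state"] == 3 and "connected" in s["output"]:
            hosts[s["zone"]].add(s["host_name"])
    # Sets drop duplicate host names, so only the size needs to be stored.
    return {zone: len(names) for zone, names in hosts.items()}


# Hypothetical sample data; the output strings are illustrative only.
services = [
    {"zone": "Europe", "host_name": "eu-1", "state": 3, "output": "endpoint is not connected"},
    {"zone": "Europe", "host_name": "eu-1", "state": 3, "output": "endpoint is not connected"},
    {"zone": "Europe", "host_name": "eu-2", "state": 3, "output": "endpoint is not connected"},
    {"zone": "Asia", "host_name": "as-1", "state": 3, "output": "endpoint is not connected"},
    {"zone": "Asia", "host_name": "as-2", "state": 0, "output": "OK"},
    {"zone": "USA", "host_name": "us-1", "state": 3, "output": "plugin timed out"},
]

result = disconnected_hosts_per_zone(services)
print(result)  # {'Europe': 2, 'Asia': 1}
```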

 

Find services which use a command_endpoint from a parent zone

Those checks won’t work due to security restrictions, but it is tremendously hard to figure out why they are not executed. Gunnar has therefore used the Icinga 2 DSL to implement a function which provides such lookups.

This little helper function is registered globally for later usage. It just returns the zone which the given endpoint belongs to.

globals.zone_for_endpoint = function(endpoint) { for (zone in get_objects(Zone)) { if (endpoint.name in zone.endpoints) { return zone } }; null }

This is the big one which checks against the hierarchy for the used command_endpoint. We’re gonna use that as comparator function callback later on.

  • If there isn’t any command_endpoint attribute set, return false (“hierarchy is valid”)
  • Fetch the check_endpoint object from the given command_endpoint name
  • Fetch the check_zone_name string from the check_endpoint
  • Set the authoritative zone (auth_zone_name) from the checkable’s zone name
  • Walk up the zone hierarchy, starting from the parent of the checkable’s zone (auth_zone_name)
  • If the command_endpoint’s zone (check_zone_name) matches one of those parent zones, the hierarchy is invalid (return true)
  • If not, jump to the next zone level above and check again; if no parent zone matches, return false

globals.is_invalid_hierarchy = function(c) {
  if (!c.command_endpoint) {
    return false
  }
  var check_endpoint = get_object(Endpoint, c.command_endpoint)
  var check_zone_name = zone_for_endpoint(check_endpoint).name
  var auth_zone_name = c.zone
  while (auth_zone_name) {
    var auth_zone_name = get_object(Zone, auth_zone_name).parent
    if (auth_zone_name == check_zone_name) {
      return true
    }
  }
  return false
}

That way this function returns a boolean value which can be evaluated for all checkable objects. Note: Execute that from a satellite (or any child) zone. Filtering with this function alone would print all matching service objects including their attributes. Another trick: Array#map takes a lambda function which replaces each array element (service object) with just the full service name (s.__name). In versions prior to 2.5 you can manually define it like this:

Array.prototype.map = function(m) { var res = []; for (o in this) { res.add(m(o)) }; res }

The result is pretty straightforward:

 => get_objects(Service).filter(s => is_invalid_hierarchy(s)).map(s => s.__name)
[ "icinga-master01.domain.com!proc ntp", "icinga-master02.domain.com!proc ntp" ]
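
The hierarchy walk itself can be sketched in Python to show what the function decides. This is a simplified model, not the DSL function: zones are plain strings, the hypothetical “parents” dictionary maps each zone to its parent, and the zone names are invented.

```python
def is_invalid_hierarchy(checkable_zone, endpoint_zone, parents):
    """True if endpoint_zone is a parent (at any level) of checkable_zone."""
    # Start at the parent of the checkable's zone, then walk upwards.
    zone = parents.get(checkable_zone)
    while zone is not None:
        if zone == endpoint_zone:
            return True
        zone = parents.get(zone)
    return False


# Hypothetical three level setup: master -> satellite -> client.
parents = {"master": None, "satellite-eu": "master", "client1": "satellite-eu"}

# A satellite service pointing at a master endpoint is invalid.
print(is_invalid_hierarchy("satellite-eu", "master", parents))   # True
# A satellite service pointing at a client endpoint is fine.
print(is_invalid_hierarchy("satellite-eu", "client1", parents))  # False
```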

 

Conclusion

While it may look terribly hard to implement and understand, once you’re in the flow you’ll never look back. Let us know which debugging tricks and analyses you’ve already done using the Icinga 2 console & API :)

Note: The icinga2 console is used solely read-only for debugging purposes in these examples. Keep in mind that an “execute-script” action sent to a running Icinga 2 is comparable to operating as “root” on your server. Don’t try to modify or delete things unless you know what you’re doing.