Custom Trigger Condition - Trigger alert only if device is down for an hour

Hello all,

I’ve been stuck for a while on setting up a few custom alerts. To be honest, I might be overthinking it, but I would appreciate any thoughts on this.

So, I have heartbeat checking a few IoT devices to see if they are up or down. (Yes, simple as that :P)
My problem, though, is that I want to alert only if a device has been down for more than an hour.

This means I would need to split the hour into three checks, one every 20 minutes. If all three checks report the device as down, then send an email.

OK, I know that I might have to use either a Trigger condition or the Query to define all this, but it's too unfamiliar to me.

Any thoughts/guidance would be really appreciated.

Thanks for your time.

Best,
Dimitris

Hi @dimitris,

One possible configuration for your Monitor is to have the query cover the last hour (how frequently the Monitor itself runs is up to you). The Trigger condition will then depend on what exactly your heartbeat metric looks like. Let’s say, for example, that it is a field indexed into a log index that the Monitor checks. We’ll call this field is_iot_device_up: a boolean that is true when the heartbeat reports the device as up and false when it is down.

With that, your Trigger condition could be something like:

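// Becomes true as soon as any returned document reports the device as down.
def deviceWasDown = false;
for (hit in ctx.results[0].hits.hits) {
	if (!hit._source.is_iot_device_up) {
		deviceWasDown = true;
		break; // one "down" document is enough, no need to keep scanning
	}
}

return deviceWasDown;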

This would mean the Trigger condition is met if any document from the last hour shows the device as down. The downside is that if the metric is reported often and there was a brief, transient outage, the condition is still met with the example above.
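For the original goal (alert only when the device stayed down for the whole hour), one option is to invert the check so that a single "up" document cancels the alert. This is only a minimal sketch: it assumes the same hypothetical is_iot_device_up field, and that the query's size is large enough to return every heartbeat from the last hour (a search returns 10 hits by default).

```painless
// Sketch only: assumes is_iot_device_up is a boolean in each hit's _source
// and that the query returns all heartbeats from the last hour.
def hits = ctx.results[0].hits.hits;
def sawUp = false;
for (hit in hits) {
	if (hit._source.is_iot_device_up) {
		sawUp = true;
		break;
	}
}

// Trigger only if at least one heartbeat was returned and none reported "up".
return hits.size() > 0 && !sawUp;
```

Note that this variant does not fire when no heartbeat documents exist at all; whether a silent device should count as "down" is a choice you would have to make for your setup.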

To avoid triggering on transient downtime, you could add a date histogram aggregation to the query so that the documents are bucketed into time intervals of your choosing. The Trigger condition can then iterate over those buckets, which gives you more control over how many intervals of downtime are needed before it is met.
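As a rough illustration of the bucket approach, here is a sketch of a Trigger condition that counts 20-minute intervals in which the device never reported "up". The aggregation names per_interval and up, the @timestamp field, and the threshold of three are all assumptions made for the example, not something taken from your setup; the aggregation the script expects is shown in the comment.

```painless
/*
 Sketch only. Assumes the Monitor query covers the last hour and defines:

   "aggs": {
     "per_interval": {
       "date_histogram": { "field": "@timestamp", "fixed_interval": "20m" },
       "aggs": {
         "up": { "filter": { "term": { "is_iot_device_up": true } } }
       }
     }
   }

 per_interval, up and @timestamp are placeholder names.
*/
def buckets = ctx.results[0].aggregations.per_interval.buckets;
int downBuckets = 0;
for (bucket in buckets) {
	// A bucket counts as "down" when it has heartbeats but none reported "up".
	if (bucket.doc_count > 0 && bucket.up.doc_count == 0) {
		downBuckets += 1;
	}
}

// Trigger only if at least three 20-minute intervals were entirely down.
return downBuckets >= 3;
```

Keep in mind that histogram buckets are aligned to fixed clock boundaries rather than to the time the Monitor runs, so a one-hour window can be split across four buckets with partial ones at either end. You may need to adjust the threshold or the query range to match what "down for an hour" should mean for you.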

Let me know if you have any other questions.