Understanding Integration Insights

I have a quick question... I started observing more frequent errors in the integrations, which prompted a couple of changes in the Azure function that processes the webhook. I think it was primarily timing out because so many devices were hammering the webhook at nearly the exact same second. I took some time to flatten the peak of when each device reports its data, which improved things. However, I'm still seeing some errors in the logs but not on the integration traffic page:

My question... why would the integration traffic tab under fleet health list 0 errors for the 2:40 time slot:

But then when I look under integrations for that same 2:40 PM timestamp, it logs a total of 9 errors:

Is this simply a timeout error that occurred once, after which the webhook integration was attempted a second time and succeeded? Or why else would it be logged as an error in one place but not the other?

Just trying to make sure I understand these metrics and shore up the integrations, ensuring the overall webhook integration stays healthy.

I don't know the reason for sure, but it seems likely that it's one of the following:

Every integration is attempted up to 3 times in case of failure. If the individual trigger event eventually succeeded, it would not be flagged as an error in the integration traffic report, but when viewing the details, the individual attempts are always shown.
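To illustrate the general idea (this is just a sketch of a typical retry pattern; all names here are made up, not the platform's actual code):

```python
def attempt_delivery(fail_first: bool, attempt: int) -> bool:
    """Simulated webhook delivery: fails on attempt 1, succeeds on the retry."""
    return not (fail_first and attempt == 1)

def deliver_with_retries(fail_first: bool, max_attempts: int = 3):
    """Each failed attempt is logged individually, but the trigger event as
    a whole only counts as an error if every attempt fails."""
    attempt_log = []
    for attempt in range(1, max_attempts + 1):
        ok = attempt_delivery(fail_first, attempt)
        attempt_log.append("success" if ok else "error")  # per-attempt log entry
        if ok:
            return True, attempt_log   # aggregate result: success
    return False, attempt_log          # aggregate result: error

ok, log = deliver_with_retries(fail_first=True)
print(ok, log)  # True ['error', 'success']
```

So the detailed logs would show the failed first attempt, while the aggregate traffic report counts the event as a success.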

The other possibility is that, because of rounding, the error slice would have been so small it was not rendered in the graph.


When I hover over the graph in fleet health it also indicates 0 faults, so I don't think it's a rounding error. It must be something that errors on the first attempt, logs it, but then succeeds on the retry. Having any errors isn't ideal, but I guess as long as the data eventually makes it, that's OK. I think I'll try to flatten the peak of when the fleet of devices report even more. The entire fleet used to report at nearly the exact same second, based on system time.

My first attempt at leveling the peaks of when devices report uses 16 different reporting offsets, randomly assigning each device one of those offsets. I.e., instead of all devices in the fleet reporting at 01:00:00, I assigned 16 different offsets, each 5 seconds apart: 1/16 of the devices report at 01:00:05, 1/16 report at 01:00:10, and so on. This helped immensely: near 0 errors in fleet health now, with only occasional errors in the logs.
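For what it's worth, instead of storing a random per-device assignment, the slot can be derived deterministically by hashing the device ID, something like this sketch (the names and the hashing choice are mine, not from the platform):

```python
import hashlib

NUM_SLOTS = 16     # 16 different reporting offsets
SLOT_SECONDS = 5   # each offset 5 seconds apart

def report_offset(device_id: str) -> int:
    """Map a device ID to a stable reporting offset in seconds (5, 10, ..., 80).

    Hashing the ID spreads the fleet roughly evenly across the 16 slots
    without having to store a separate assignment for each device.
    """
    slot = int(hashlib.sha256(device_id.encode()).hexdigest(), 16) % NUM_SLOTS
    return (slot + 1) * SLOT_SECONDS

print(report_offset("device-0042"))  # stable value in 5..80, multiple of 5
```

The same device always lands in the same slot, so its reporting time stays consistent from hour to hour.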

I may make that even more granular and use 1-second offsets over a 1-minute period, or maybe use groups of 5-second offsets over the full 5-minute period so the data ingest is "flat".

That should work. Another common technique is to measure on fixed buckets, say aligned to the hour, but then delay a random interval of up to a few minutes before publishing the data. That way your back-end isn't hit by all the traffic at once.
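A minimal sketch of that pattern (the bucket size, jitter bound, and function names are just illustrative):

```python
import random
import time

BUCKET_SECONDS = 3600      # measure on fixed hourly buckets
MAX_JITTER_SECONDS = 180   # then delay publishing by up to 3 minutes

def bucket_start(now: float) -> int:
    """Align a timestamp to the start of its hourly bucket, so the reading
    stays associated with that hour even though publishing is delayed."""
    return int(now // BUCKET_SECONDS) * BUCKET_SECONDS

def publish_with_jitter(reading: dict, publish) -> None:
    """Sleep a random interval before publishing so the fleet's requests
    are spread out instead of all landing at the same second."""
    time.sleep(random.uniform(0, MAX_JITTER_SECONDS))
    publish(reading)

reading = {"bucket": bucket_start(time.time()), "value": 21.5}
# publish_with_jitter(reading, post_to_webhook)  # post_to_webhook is your HTTP call
```

The key point is timestamping the reading with the bucket boundary, not the publish time, so the jitter never skews the data itself.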
