[Fixed] Webhooks callbacks are not reliable!

@Dave

Tremendous effort and stick-to-itiveness that is greatly appreciated.

responseTopic is a great add, I can’t wait to get in there and start using that!!!

Thanks!!

1 Like

No drops since yesterday evening. Thanks @Dave!

1 Like

Fantastic @Dave, I was looking forward to this!

1 Like

Thank you @Dave, this has been a thorn in my side since day one and now - Poof - it’s no longer an issue. Not a single drop since yesterday, everything is running predictably, this is excellent.

Should we change the name of the initial post to or something?

4 Likes

Hi @mayhew1955,

I’m glad it’s behaving much better for you now! As things continue to scale I’m sure I’ll need to keep an eye on it, but in the meantime I’m very happy to hear that. :slight_smile:

It’s up to you whether you feel good about updating the thread topic. If you want to, I might suggest something like: [Fixed?]

Thanks!
David

1 Like

I’m having trouble with webhook responses… I keep having to delete the webhook and recreate it to get it working.
After getting it working, on my dashboard I can see that I keep getting a response error. The data message says “Rate limit exceeded” (see attachment). Is this error coming from Particle’s rate limits? If so what are the rate limits? And why would the hook stop responding after a period of time? Could it be because I have specified a deviceID in the hook? Or is that an old issue?


Thanks
Stu

This might be for you (found via docs Search field for “webhooks limit” ;-))

https://docs.particle.io/guide/tools-and-features/webhooks/#limits

1 Like

Webhooks get disabled after 10 errors in a row.
https://docs.particle.io/guide/tools-and-features/webhooks/#errors

There are some limits too as @ScruffR linked :slight_smile:

2 Likes

Thanks @ScruffR and @MORA, that’s very clear around the limits! :smile: I can only assume then that I must be getting 400 errors from the server, which is disabling the webhook. Is there a way to see if the hook has been disabled, other than it just not working?

Are there any other reasons a hook would be disabled, other than >10 errors from the server?

Also, does having the device ID specified in the hook still affect things?

Could it be that your webhook is not actually disabled but just exceeded the limit of 10 triggers per minute per device and would recover after some timeout?
https://docs.particle.io/guide/tools-and-features/webhooks/#limits-by-user

If you didn't get a "disabled" message as described here, it should still be alive but might be muted:

https://docs.particle.io/guide/tools-and-features/webhooks/#errors
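To make the "muted, then recovers" behaviour concrete, here is a minimal sketch of a per-device limiter. It assumes a sliding 60-second window; the thread only gives the figure of 10 triggers per minute per device, so the window mechanics and the name `PerDeviceRateLimiter` are hypothetical and not Particle's implementation.

```python
from collections import deque


class PerDeviceRateLimiter:
    """Sliding-window limiter: at most `limit` triggers per `window_s` seconds per device."""

    def __init__(self, limit=10, window_s=60.0):
        self.limit = limit          # 10 per minute is the figure quoted in the thread
        self.window_s = window_s    # assumed sliding window, not confirmed by the docs
        self._events = {}           # device_id -> deque of accepted trigger timestamps

    def allow(self, device_id, now):
        """Return True if a trigger at time `now` (seconds) stays within the limit."""
        q = self._events.setdefault(device_id, deque())
        # Drop timestamps that have aged out of the window.
        while q and now - q[0] >= self.window_s:
            q.popleft()
        if len(q) >= self.limit:
            return False            # muted for now; recovers as old triggers age out
        q.append(now)
        return True


limiter = PerDeviceRateLimiter()
# Twelve triggers one second apart: the first ten pass, the last two are muted.
results = [limiter.allow("device-a", float(t)) for t in range(12)]
```

Under this model a hook that hits the limit is never disabled; it simply starts passing triggers again once enough of the earlier ones are more than a minute old.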

@ScruffR When I get the rate-limit errors (as in the screenshot), I still get a response every so often (i.e. so it is recovering as you described). This is happening in a test sketch I created in order to debug the webhook issue. This output was happening after deleting and then recreating the webhook.

HOWEVER, on my main project file, I am only using the webhook very occasionally. The real issue occurs in between turning on the device for a few days, when the webhook doesn’t work at all: no rate limit errors or responses when I looked on the dashboard. However, since I was not monitoring on the dashboard or the CLI at the time it stopped working, I can’t tell if/why/when the webhook was disabled.

Have you got an idea when you had used this hook last?
There was a problem with all webhooks some two weeks ago, and many people had to recreate their webhooks due to that error (as a quick fix, until Particle could figure out the cause and a long-term solution).
So if you had not used the hook round about that time, then this might be the actual cause.

Another thing I seem to remember is that webhooks that are used too infrequently (no action for months) will get kicked.

One time it could have stopped working was probably around two weeks ago. I recreated it sometime last week and then turned the device on this morning and had the same issue - no responses through the webhook. But now working after recreating again…

Thanks all for posting the limits! I’m working on changing the behavior so it doesn’t self-destruct, but instead just sleeps for a time. I’m also investigating the reports of hooks that stop responding, and adding extra feedback so you can know what state your hook is in (if it’s sleeping).
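As a rough illustration of "sleep instead of self-destruct", the sketch below tracks consecutive errors and reports a state. The threshold of 10 comes from this thread; the class name `HookState`, the 300-second sleep, and the state names are assumptions made for the example, not the actual implementation.

```python
class HookState:
    """Tracks consecutive errors and puts the hook to sleep instead of disabling it."""

    def __init__(self, max_errors=10, sleep_s=300.0):
        self.max_errors = max_errors   # 10 errors in a row, as quoted in the thread
        self.sleep_s = sleep_s         # assumed duration; the thread gives none
        self.consecutive_errors = 0
        self.sleeping_until = None

    def record(self, ok, now):
        """Record the outcome of one delivery attempt at time `now` (seconds)."""
        if ok:
            self.consecutive_errors = 0    # any success resets the streak
            return
        self.consecutive_errors += 1
        if self.consecutive_errors >= self.max_errors:
            # Sleep for a while rather than deleting or disabling the hook.
            self.sleeping_until = now + self.sleep_s
            self.consecutive_errors = 0

    def status(self, now):
        """Return 'sleeping' or 'active' so the owner can see what state the hook is in."""
        if self.sleeping_until is not None and now < self.sleeping_until:
            return "sleeping"
        return "active"
```

Exposing `status` is the part that addresses the question asked earlier in the thread: the owner can tell a sleeping hook apart from one that is simply not being triggered.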

Thanks!
David

3 Likes

Great, it’s a bit silly that it requires resetting the webhook over the not-so-friendly command line just because a website was down/broken 10 times in a row.

1 Like

I’ve had webhooks become unresponsive when the destination (AWS HTTP gateway) is up and working and no limits have been breached (that I could see…); on particle subscribe mine there is no response. I’ve also seen that if you add a new webhook to flush out the system, but keep the old webhook with the same destination, then it’ll start working, but the responses will be tagged with the old webhook name. 2 months of debugging at: System issue? Webhooks failing but status.particle.io says everything is green

1 Like

Hi @Dave, lately I’ve been seeing some interesting behaviour with webhooks. a) if I return a non-success, ie warning, the hook will retry and cause it to go into too many errors, sleeping.
EDIT: to be clear, if I return an API error from my webhook, depending on the server code I actually don’t want the particle webhook to retry. Whereas at the moment I understand it causes the particle webhook to retry, potentially sleeping to keep trying?

The other interesting thing is simultaneous hook requests. Say I have 200 active clients, posting 4 times a minute, is that tripping the mentioned 100 simultaneous requests for a hook?

I need to know pretty precisely the answers as I’m wondering if I’m seeing issues because of limits in the particle chain.

As you know, I’m super keen to directly send traffic straight to my AWS HTTPS gateway endpoints…

Hi @mterrill,

Good questions! I'll try to answer them one by one:

> a) if I return a non-success, ie warning, the hook will retry and cause it to go into too many errors, sleeping.

We often see that when people experience webhook issues, it's because the receiving API can't handle demand or experienced downtime. We added automatic retry on error for the hooks, so if you return an error code, the hook will back off and retry a few times. You can avoid this by not returning an error code.
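A minimal sketch of what "back off and retry a few times" could look like. The retry count, the doubling delays, and treating any status of 400 or above as an error are all assumptions for illustration; the function names are hypothetical and the real hook's schedule is not stated here.

```python
import time


def backoff_delays(retries=3, base_s=1.0, factor=2.0, cap_s=60.0):
    """Delays before each retry: base, base*factor, ... capped at cap_s."""
    return [min(cap_s, base_s * factor ** i) for i in range(retries)]


def send_with_retry(send, retries=3, sleep=time.sleep):
    """Call send() until it returns a status below 400 or retries run out.

    `send` is any zero-argument callable returning an HTTP status code.
    Returns (final_status, attempts_made).
    """
    status = send()
    attempts = 1
    for delay in backoff_delays(retries):
        if status < 400:
            break                  # success: stop retrying
        sleep(delay)               # wait longer before each new attempt
        status = send()
        attempts += 1
    return status, attempts


# Example: a server that fails twice and then succeeds.
responses = iter([500, 500, 200])
waited = []
final_status, attempts = send_with_retry(lambda: next(responses), sleep=waited.append)
# final_status == 200, attempts == 3, waited == [1.0, 2.0]
```

This also shows why returning an error code for bad client data is costly under such a scheme: every such response triggers the full retry sequence even though repeating the request cannot succeed.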

> Say I have 200 active clients, posting 4 times a minute, is that tripping the mentioned 100 simultaneous requests for a hook?

By default any given webhook can perform up to 100 simultaneous / concurrent requests. So it depends on how long your server takes to reply to a given request / how long the hook has to wait for a response. If each request takes a second, and your requests are spread out evenly, then it shouldn't be a problem, but if all 200 devices are publishing simultaneously, and the requests are really slow, then it might take a while for them to go through. Our enterprise integration tiers can provide higher throughput.
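The arithmetic for the 200-device case can be sketched with Little's law (average requests in flight = arrival rate × response time). This assumes publishes are spread evenly over time, which is the favourable case described above; the function names are made up for the example.

```python
def average_concurrency(devices, publishes_per_minute, response_time_s):
    """Average number of in-flight requests, assuming evenly spread publishes."""
    arrival_rate = devices * publishes_per_minute / 60.0   # requests per second
    return arrival_rate * response_time_s


def response_time_at_limit(devices, publishes_per_minute, limit=100):
    """Response time (seconds) at which average concurrency reaches `limit`."""
    arrival_rate = devices * publishes_per_minute / 60.0
    return limit / arrival_rate


# 200 devices x 4 publishes/minute = 800 requests/minute, about 13.3 per second.
in_flight = average_concurrency(200, 4, 1.0)       # about 13.3 at 1 s per request
slow_point = response_time_at_limit(200, 4, 100)   # about 7.5 s per request
```

So with evenly spread traffic the endpoint would need to take roughly 7.5 seconds per request before the average reaches 100 in flight. The burst case is different: if all 200 devices publish at the same instant, 200 requests arrive at once regardless of the average.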

The Particle Cloud is designed to be highly scalable and reliable, and built to support an unlimited number of devices. We provide a ton of flexibility for product creators, and lots of free features in the free developer tier, as well as SLAs and greater throughput for the enterprise tiers.

I know there is some work happening on a TLS library that should also make it easier for you to hit an HTTPS endpoint directly. I'm not sure if that's been released yet, but hopefully soon!

Thanks,
David

2 Likes

Thanks Dave, appreciate the info.

FYI, you may be interested in this article on AWS lambda + HTTP gateway error massaging. https://blog.jayway.com/2015/11/07/error-handling-in-api-gateway-and-aws-lambda/

I originally went through and changed the lambda code to only return success, but with error text. I’m going to rework that to return errors (as it should), but then for the HTTP gateway service to massage errors based on the errorMessage text into either 200’s or 500’s - 500’s being reserved for actual fatal errors like upstream authentication gateway failure. This isn’t particularly standards based, but at least the means exist to prevent downstream issues with particle.
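The mapping logic described above could be sketched like this. It is only an illustration of the decision, not API Gateway configuration; the `[Fatal]` prefix convention and the function name are hypothetical.

```python
import re

# Hypothetical convention: the Lambda prefixes errorMessage with "[Fatal]" only
# for real outages such as an upstream authentication gateway failure.
FATAL_PATTERNS = [re.compile(r"^\[Fatal\]")]


def gateway_status(error_message):
    """Pick the HTTP status the gateway returns for a Lambda result.

    `error_message` is None on success, otherwise the Lambda's errorMessage text.
    """
    if error_message is None:
        return 200
    if any(p.search(error_message) for p in FATAL_PATTERNS):
        return 500     # genuine failure: let the webhook retry logic see it
    # Invalid or unauthorised client data: report it in the body, but answer 200
    # so a bad payload does not count as an endpoint error.
    return 200


status_ok = gateway_status(None)
status_bad_payload = gateway_status("Invalid alarm endpoint requested")
status_outage = gateway_status("[Fatal] upstream authentication gateway failure")
```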

The fundamental issue is that the health of the gateway is being conflated with the validity of a client’s webhook data. A particular webhook request with a variable data payload can post invalid information, or request something not available or not authorised. The gateway can be in perfect health and just saying ‘thanks for that information, but you asked to post an alarm to an invalid endpoint’, which is my situation.

May I suggest, in an ideal world the particle webhook system would understand a few more codes.

    1. Fatal server error: don’t retry.
    2. Temporary error, i.e. overloaded or maintenance: go for the retry process webhooks currently use.
    3. Gateway timeout: this would be useful enhanced information for me, and for anyone else using an API gateway service (which is pretty much the architectural model folk should be using). This would simply convey that the webhook endpoint tried to relay and failed. Probably best to use the exponential backoff and retry method webhooks use now.
  • 400 (bad request), 401 (unauthorised) and 403 (forbidden) are all codes that the webhook system should not retry on, and they are useful diagnostic information to flash to the user in logs etc.
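The suggested behaviour could be sketched as a small classifier. The specific code assignments are assumptions based on standard HTTP semantics (503 Service Unavailable for temporary errors, 504 Gateway Timeout for gateway timeouts, 500 for fatal errors); the names `Action` and `classify` are hypothetical, and none of this is how webhooks currently behave.

```python
from enum import Enum


class Action(Enum):
    DELIVERED = "delivered"   # nothing more to do
    RETRY = "retry"           # temporary problem: back off and try again
    FAIL = "fail"             # do not retry; surface the code in the logs


def classify(status):
    """Decide what the hook should do with an HTTP status from the endpoint."""
    if 200 <= status < 300:
        return Action.DELIVERED
    if status in (503, 504):
        # Overloaded, under maintenance, or the gateway could not relay in time.
        return Action.RETRY
    # 400, 401, 403, 500 and anything else: retrying the same request will not
    # help, so report the code to the user instead.
    return Action.FAIL
```

With a split like this, a 403 for an unauthorised request would be logged once and dropped, while only 503 and 504 would count toward the retry and sleep logic.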

You probably have a long backlog of cards, but I’d have immediate use for being able to provide at least 200, 400/500, 501 so as to cover the three situations (all good, fatal error don’t retry, temporary error). Happy to jump on a call if you’d like to discuss any of the above.

Lastly, who is doing the TLS work? Glowfish wasn’t working; is there an internal Particle project to provide TLS? Can I assist with providing requirements etc. (i.e. Google & AWS microservice compatibility)?

Thanks!

1 Like

Hi @mterrill,

Thanks for the thoughtful reply! I agree that it would be helpful if that hook retry behavior were a bit smarter, and configurable. I brought your post to the attention of the team working on webhooks / integrations, and I’ll see if we can move this forward.

We’ve been working on a TLS library internally, but I think it’s very much beta right now, and doesn’t have a concrete release date. I think the hope is to release it relatively soon however.

Thanks!
David

1 Like