It’s really starting to drive me crazy. I don’t know if it’s on my end, or Particle trying to send the ‘yep, here’s the hook response’ to a Photon that may or may not be online.
There’s a strong likelihood that the issue is on my end, as I’ve changed some Lambda things to massage status codes towards Particle so they don’t trip the retry. However, that particular error doesn’t make sense, and it’s filling my `particle subscribe mine` output.
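For reference, the status-code massaging I mean is roughly the following (a minimal sketch; `processHook` is a stand-in for my real logic):

```typescript
import type { APIGatewayProxyEvent, APIGatewayProxyResult } from "aws-lambda";

// Stand-in for the real hook processing (DB writes, notifications, ...).
async function processHook(event: APIGatewayProxyEvent): Promise<string> {
  return `handled ${event.path}`;
}

// Always answer 200 so Particle never sees a "failure" status and retries;
// real errors travel in the response body instead of the status code.
export const handler = async (
  event: APIGatewayProxyEvent
): Promise<APIGatewayProxyResult> => {
  try {
    const result = await processHook(event);
    return { statusCode: 200, body: JSON.stringify({ ok: true, result }) };
  } catch (err) {
    return { statusCode: 200, body: JSON.stringify({ ok: false, error: String(err) }) };
  }
};
```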
Oh, and knowing what the Particle webhook request (upstream to the webhook destination) / response (to the Photon) timeouts are would help…
EDIT: it could be as simple as this: my upstream AWS Lambda takes 5 seconds (what its timeout is set to), while the Particle webhook gives up at, say, 3 seconds and throws an ESOCKETTIMEDOUT.
EDIT2: If I change my Lambda timeout to 3 seconds, then I get to see that return as a hook-response. Going to try 4 now…
EDIT3: 5 seconds seems to be what was breaking the Particle webhook. I can now see it cleanly returning a timeout without triggering ESOCKETTIMEDOUT.
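So the practical takeaway for me: make the Lambda answer well inside Particle’s 5-second window. Something like this sketch, where the deadline value and `doTheSlowWork` are my own placeholders:

```typescript
// Race the real work against a deadline comfortably under Particle's
// 5-second limit, so the hook always answers before ESOCKETTIMEDOUT fires.
const DEADLINE_MS = 4000; // assumption: ~1s of headroom under the 5s limit

function deadline(ms: number): Promise<never> {
  return new Promise((_, reject) =>
    setTimeout(() => reject(new Error(`no result within ${ms}ms`)), ms)
  );
}

// Placeholder for the slow part (push notifications etc.).
async function doTheSlowWork(payload: string): Promise<string> {
  return `processed ${payload.length} bytes`;
}

export const handler = async (event: { body?: string }) => {
  try {
    const result = await Promise.race([
      doTheSlowWork(event.body ?? ""),
      deadline(DEADLINE_MS),
    ]);
    return { statusCode: 200, body: JSON.stringify({ ok: true, result }) };
  } catch (err) {
    // A clean, in-budget timeout response instead of a hung socket.
    return { statusCode: 200, body: JSON.stringify({ ok: false, error: String(err) }) };
  }
};
```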
5 seconds is indeed the timeout when issuing the HTTP request to your server. There would never be an error sending the response if the Photon were offline, because it is a pub/sub system: if the Photon is not online, it simply would not see the response.
It’d be good to a) expand the ESOCKETTIMEDOUT error message to indicate that the webhook destination did not complete within 5 seconds, and
b) detail, in those docs we discussed via email, all the error messages (which I presume live in a neat list somewhere at the top of a code file).
Not to beat anyone up, but for context: it was quite frustrating, and many hours were consumed progressively hunting down and solving this issue. Where is the error coming from? Who timed out: the Particle Photon, the Particle webhook system, AWS API Gateway, or AWS Lambda? It took me quite a while to prove Particle wasn’t at fault; there was just a tight constraint and an ambiguous error message.
5 seconds isn’t a lot of time. I imagine a number of people are processing webhooks and sending push notifications, and that process can easily chew up 4 seconds. Knowing the constraints would inform a different (and more complicated) task-creation design pattern on reception of a Particle webhook.
I’m really looking forward to being able to issue a publish to a non-Particle destination using the TLS cert of my choosing!
Just to expand on the conversation here, I did some looking into this issue today. There are two different timeouts that can occur when delivering a webhook/integration: a connection timeout (ETIMEDOUT) and a read timeout (ESOCKETTIMEDOUT). A connection timeout occurs if the timeout is hit while Particle is attempting to establish a connection to the remote machine (the webhook destination). A read timeout occurs any time the server is too slow to send back part of the response.
In both scenarios, the timeout value is set to 5 seconds. Note that in either case the failed request will be retried: our system retries 3 times quickly, then sleeps and retries 3 more times, up to 10 total attempts.
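Given those retries, it’s worth making the receiving endpoint idempotent so a retried delivery isn’t processed twice. A minimal sketch, assuming the default webhook JSON includes `coreid` and `published_at` fields (the in-memory map here only dedupes within one warm Lambda container; a real deployment would want DynamoDB or similar):

```typescript
// Dedupe retried webhook deliveries on (device id, published_at).
// In-memory only: survives within one warm container, not across them.
const seen = new Map<string, number>();
const SEEN_TTL_MS = 5 * 60 * 1000;

export function isDuplicate(coreid: string, publishedAt: string): boolean {
  const key = `${coreid}:${publishedAt}`;
  const now = Date.now();
  // Drop expired entries so the map doesn't grow without bound.
  for (const [k, t] of seen) {
    if (now - t > SEEN_TTL_MS) seen.delete(k);
  }
  if (seen.has(key)) return true;
  seen.set(key, now);
  return false;
}
```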
You may also find a support email thread or community thread where I suggested that timing out after 5 seconds and retrying is not the best approach. When I last looked at it, anything but a 200 status code triggered the system to try again, which is obviously not standards-based, and in the past it caused the Particle system to ignore that webhook for my fleet. My code was returning a valid error to a client; the Particle webhook system kept thrashing it, and since it kept getting the error code back it would retry until it marked the webhook as invalid for the whole fleet.
That’s just silly. I heard the reasoning was that it would help people who were still developing and/or didn’t know how to keep a web service up. Well, it doesn’t help anyone else, and it flies in the face of logic and web standards.
So, two good options: publish an AWS-style best-practice architecture guide showing folks exactly how to set up a job queue system external to Particle that quickly accepts a job and always acknowledges it with a 200 (which is what I’ve done, sketched below), or fix your system so it handles 500/400/300 status codes appropriately. Also, 5 seconds is pretty short if folks don’t yet understand that they need a job-system middleman when using a webhook to send push notifications, for instance (which can easily take a few seconds of processing and acknowledgement).
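For anyone landing here later, the queue-middleman pattern is roughly this (a sketch, not a drop-in: `JOB_QUEUE_URL` and the choice of SQS are my assumptions; any queue that lets you ack fast works):

```typescript
import { SQSClient, SendMessageCommand } from "@aws-sdk/client-sqs";
import type { APIGatewayProxyEvent, APIGatewayProxyResult } from "aws-lambda";

const sqs = new SQSClient({});
const QUEUE_URL = process.env.JOB_QUEUE_URL!; // assumption: set in the Lambda env

// Accept the hook, queue the job, ack 200 right away. A separate worker
// drains the queue and does the slow push-notification work off the clock.
export const handler = async (
  event: APIGatewayProxyEvent
): Promise<APIGatewayProxyResult> => {
  await sqs.send(
    new SendMessageCommand({
      QueueUrl: QUEUE_URL,
      MessageBody: event.body ?? "{}",
    })
  );
  return { statusCode: 200, body: JSON.stringify({ queued: true }) };
};
```

The point is that the webhook’s responsibility ends the moment the job is safely queued, so Particle always gets its 200 well inside the 5-second window, no matter how long the downstream work takes.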