[Fixed] Webhook callbacks are not reliable!


#100

Thanks Dave, very keen to hear about that. I’ll get your local bottle shop (does that phrase translate to the US?) to ship the team Friday drinks when you release it! 100 simultaneous requests isn’t a lot; I’ll easily breach that soon, and it’s unnecessary cost and complexity for Particle to handle it.

ps, for posterity, the team may be interested in this:

  • You want AWS Lambda to succeed/fail as appropriate; you shouldn’t massage this based on Particle’s expectations, as it will affect invocation error reporting/alarms etc. throughout the AWS toolchain.
  • You can instead massage status codes in the Integration Response within API Gateway.

The interesting tricks that I spent about 10 hours on are:

  • AWS will only allow one transformation per returned status object, hence you can’t look for multiple strings (status":50\d / status":40\d) and route them to a common mapping. To keep Particle happy you pretty much need to map everything except really, really serious fatal failures as 200. That covers both Lambda errors and Lambda successes, which come back from Lambda in different shapes.
  • Even if you parseJson the escaped JSON string in the body reply, you cannot then do conditional logic on a property (i.e. myjsonstructure.elephants == 2). However, if you return an error via a proper JavaScript Error object it isn’t escaped; it’s pure JSON, and then you can look at its properties. So test the error object, not the potential body return.

This looks really simple, but it wasn’t, and it’s highly counterintuitive:

## API Gateway Integration Response mapping template (VTL).
## Parse both the Lambda errorMessage (if any) and the success body.
#set ($errorMessageObj = $util.parseJson($input.path('$.errorMessage')))
#set ($bodyObj = $util.parseJson($input.body))
#if($errorMessageObj.status == "")
## No Lambda error: pass the parsed success body straight through.
$bodyObj
#{else}
## Lambda error: return the raw errorMessage, which is already plain JSON.
$input.path('$.errorMessage')
#end

#101

@Dave, @zachary anyone,

Could someone please (urgently) tell me what ‘ESOCKETTIMEDOUT’ means?

Is that a timeout upstream at my webhook end (which I don’t believe, as it’s AWS API Gateway > Lambda, so that’s infinitely scalable with throttling set to 2000 simultaneous connections)?
Is it downstream, back to the Particle Photon?

My big problem is that hook-error (and hook-sent) doesn’t have any linkage to the original core ID that the transaction relates to. It’s a random error while other transactions are going through, but it causes an outage across clients.

The other thing is that hook-response doesn’t show the status code being sent to the client (very relevant to the conversation above: the status code received by Particle as a webhook response from my end, which is translated into a hook-response to the Particle Photon).
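
One workaround, which the hook-response events in the logs below follow, is to echo the status and coreid inside the response body itself so they survive into the event stream. A minimal sketch, where `buildHookResponse` is a hypothetical endpoint-side helper, not a Particle API:

```javascript
// Since hook-response doesn't expose the HTTP status Particle received,
// echo the status and coreid inside the response body so the device (and
// anyone watching the event stream) can see them.
function buildHookResponse(coreid, status, message) {
  return JSON.stringify({ status: status, coreid: coreid, alerted: [], message: message });
}
```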

Really appreciate some help as it’s a live issue and it’s pretty counterintuitive.

    {"name":"pushtemps","data":"{\"f1\":158.04,\"f1t\":196,\"f2\":32.00,\"f2t\":200,\"f3\":160.76,\"f3t\":200,\"pit\":243.64,\"pt\":248,\"ph\":268,\"pl\":208,\"sp\":248.0,\"fan\":43,\"mode\":\"NORMAL\",\"alarm\":\"none\",\"s\":\"C\"}","ttl":"60","published_at":"2017-02-03T00:44:39.064Z","coreid":"3c0043000a51353335323536"}
    {"name":"hook-sent/pushtemps","data":"undefined","ttl":"60","published_at":"2017-02-03T00:44:39.572Z","coreid":"particle-internal"}
    {"name":"pushtemps","data":"{\"f1\":133.00,\"f1t\":203,\"f2\":32.00,\"f2t\":130,\"f3\":32.00,\"f3t\":130,\"pit\":279.98,\"pt\":285,\"ph\":310,\"pl\":245,\"sp\":285.0,\"fan\":36,\"mode\":\"NORMAL\",\"alarm\":\"none\",\"s\":\"F\"}","ttl":"60","published_at":"2017-02-03T00:44:39.196Z","coreid":"370042001551353531343431"}
    {"name":"hook-sent/pushtemps","data":"undefined","ttl":"60","published_at":"2017-02-03T00:44:38.899Z","coreid":"particle-internal"}
    {"name":"hook-response/pushtemps_3c0043000a51353335323536/0","data":"{\"status\":200,\"coreid\":\"3c0043000a51353335323536\",\"alerted\":[],\"message\":\"No alarm so will finish\"}","ttl":"60","published_at":"2017-02-03T00:44:40.119Z","coreid":"particle-internal"}
    {"name":"hook-response/pushtemps_370042001551353531343431/0","data":"{\"status\":200,\"coreid\":\"370042001551353531343431\",\"alerted\":[],\"message\":\"No alarm so will finish\"}","ttl":"60","published_at":"2017-02-03T00:44:39.306Z","coreid":"particle-internal"}
    {"name":"hook-error/pushtemps/0","data":"ESOCKETTIMEDOUT","ttl":"60","published_at":"2017-02-03T00:44:40.301Z","coreid":"particle-internal"}
    {"name":"pushtemps","data":"{\"f1\":32.00,\"f1t\":195,\"f2\":159.52,\"f2t\":195,\"f3\":32.00,\"f3t\":170,\"pit\":257.11,\"pt\":225,\"ph\":300,\"pl\":185,\"sp\":225.0,\"fan\":0,\"mode\":\"NORMAL\",\"alarm\":\"none\",\"s\":\"F\"}","ttl":"60","published_at":"2017-02-03T00:44:43.290Z","coreid":"200028001947353236343033"}
    {"name":"hook-sent/pushtemps","data":"undefined","ttl":"60","published_at":"2017-02-03T00:44:42.768Z","coreid":"particle-internal"}
    {"name":"hook-sent/pushtemps","data":"undefined","ttl":"60","published_at":"2017-02-03T00:44:42.999Z","coreid":"particle-internal"}
    {"name":"hook-response/pushtemps_200028001947353236343033/0","data":"{\"status\":200,\"coreid\":\"200028001947353236343033\",\"alerted\":[],\"message\":\"No alarm so will finish\"}","ttl":"60","published_at":"2017-02-03T00:44:43.534Z","coreid":"particle-internal"}
    {"name":"hook-error/pushtemps/0","data":"ESOCKETTIMEDOUT","ttl":"60","published_at":"2017-02-03T00:44:47.791Z","coreid":"particle-internal"}

#102

Still getting ESOCKETTIMEDOUT issues…

,"ttl":"60","published_at":"2017-02-03T01:38:24.150Z","coreid":"370042001551353531343431"}
{"name":"hook-sent/pushtemps","data":"undefined","ttl":"60","published_at":"2017-02-03T01:38:24.892Z","coreid":"particle-internal"}
{"name":"hook-response/pushtemps_3c0043000a51353335323536/0","data":"{\"status\":200,\"coreid\":\"3c0043000a51353335323536\",\"alerted\":[],\"message\":\"No alarm so will finish\"}","ttl":"60","published_at":"2017-02-03T01:38:24.971Z","coreid":"particle-internal"}
{"name":"hook-response/pushtemps_370042001551353531343431/0","data":"{\"status\":200,\"coreid\":\"370042001551353531343431\",\"alerted\":[],\"message\":\"No alarm so will finish\"}","ttl":"60","published_at":"2017-02-03T01:38:25.289Z","coreid":"particle-internal"}
{"name":"pushtemps","data":"{\"f1\":32.00,\"f1t\":195,\"f2\":165.15,\"f2t\":195,\"f3\":32.00,\"f3t\":170,\"pit\":272.64,\"pt\":225,\"ph\":300,\"pl\":185,\"sp\":225.0,\"fan\":0,\"mode\":\"NORMAL\",\"alarm\":\"none\",\"s\":\"F\"}","ttl":"60","published_at":"2017-02-03T01:38:28.264Z","coreid":"200028001947353236343033"}
{"name":"hook-sent/pushtemps","data":"undefined","ttl":"60","published_at":"2017-02-03T01:38:27.737Z","coreid":"particle-internal"}
{"name":"hook-error/pushtemps/0","data":"ESOCKETTIMEDOUT","ttl":"60","published_at":"2017-02-03T01:38:32.761Z","coreid":"particle-internal"}
{"name":"pushtemps","data":"{\"f1\":114.01,\"f1t\":190,\"f2\":32.00,\"f2t\":190,\"f3\":32.00,\"f3t\":190,\"pit\":231.67,\"pt\":240,\"ph\":260,\"pl\":190,\"sp\":240.0,\"fan\":46,\"mode\":\"ZEROCOUNTERAIR\",\"alarm\":\"none\",\"s\":\"F\"}","ttl":"60","published_at":"2017-02-03T01:38:35.487Z","coreid":"25002f000b47353235303037"}
{"name":"hook-sent/pushtemps","data":"undefined","ttl":"60","published_at":"2017-02-03T01:38:35.262Z","coreid":"particle-internal"}
{"name":"pushtemps","data":"{\"f1\":161.15,\"f1t\":196,\"f2\":32.00,\"f2t\":200,\"f3\":165.86,\"f3t\":200,\"pit\":241.46,\"pt\":248,\"ph\":268,\"pl\":208,\"sp\":248.0,\"fan\":35,\"mode\":\"NORMAL\",\"alarm\":\"none\",\"s\":\"C\"}","ttl":"60","published_at":"2017-02-03T01:38:39.018Z","coreid":"3c0043000a51353335323536"}
{"name":"hook-sent/pushtemps","data":"undefined","ttl":"60","published_at":"2017-02-03T01:38:39.457Z","coreid":"particle-internal"}
{"name":"pushtemps","data":"{\"f1\":160.31,\"f1t\":203,\"f2\":32.00,\"f2t\":130,\"f3\":32.00,\"f3t\":130,\"pit\":281.04,\"pt\":285,\"ph\":310,\"pl\":245,\"sp\":285.0,\"fan\":29,\"mode\":\"NORMAL\",\"alarm\":\"none\",\"s\":\"F\"}","ttl":"60","published_at":"2017-02-03T01:38:39.149Z","coreid":"370042001551353531343431"}
{"name":"hook-sent/pushtemps","data":"undefined","ttl":"60","published_at":"2017-02-03T01:38:39.429Z","coreid":"particle-internal"}
{"name":"hook-error/pushtemps/0","data":"ESOCKETTIMEDOUT","ttl":"60","published_at":"2017-02-03T01:38:40.340Z","coreid":"particle-internal"}
{"name":"hook-response/pushtemps_3c0043000a51353335323536/0","data":"{\"status\":200,\"coreid\":\"3c0043000a51353335323536\",\"alerted\":[],\"message\":\"No alarm so will finish\"}","ttl":"60","published_at":"2017-02-03T01:38:40.641Z","coreid":"particle-internal"}
{"name":"hook-response/pushtemps_370042001551353531343431/0","

#103

Hi @mterrill,

Sorry about the slow reply on my end, this notification got buried in my inbox. If you need urgent tech support, Particle does have a paid support option if you think you might need official tech support after hours or with quick turnaround.

Sorry the error message wasn’t more immediately useful - ESOCKETTIMEDOUT means that you’re making too many simultaneous requests, and you’ve run out of available sockets for your hook. Normal user webhooks by default have a maximum of 100 simultaneous connections, and a given host may only be hit with up to 120 requests per minute. We have these default limits in place so that someone doesn’t accidentally use the Particle webhook apparatus to attack or flood other servers on the internet. Try spreading out your hook triggers over the course of a minute, so they’re not all making the same request simultaneously.
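
One way to spread triggers over the minute, as suggested above, is to give each device a deterministic offset derived from its ID. This is a hypothetical scheme sketched for illustration, not a Particle feature:

```javascript
// Derive a per-device publish offset (0..windowSeconds-1) from a hash of
// the device ID, so a fleet publishing "every minute" doesn't fire all of
// its webhook requests at the same instant.
function publishOffsetSeconds(deviceId, windowSeconds) {
  let hash = 0;
  for (let i = 0; i < deviceId.length; i++) {
    hash = (hash * 31 + deviceId.charCodeAt(i)) >>> 0; // keep it 32-bit unsigned
  }
  return hash % windowSeconds;
}
```

Each device would then delay its publish by its own offset within the window, keeping the fleet well under the 100-simultaneous-connection limit.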

Thanks,
David


#104

Hi @Dave, that interestingly contradicts what @bryce told me - i.e. that the webhook timed out. My testing showed that 5 seconds was the undocumented constraint, and that it was an upstream timeout to the webhook destination. I’ve asked that the docs be updated, and perhaps a more informative error message added.

We’ve talked about the rate before; last June a specific whitelist was created, though it may be useful to check that it’s still in place for https://r6g4i5ujf9.execute-api.us-west-2.amazonaws.com ? Worth noting that I’m more than happy to list the destination URL here because it has some of the world’s best rate limiting and security on it, though I appreciate the ‘do no harm’ netizen approach Particle has defaulted to for web properties that could easily be taken down by a bot attack like Mirai last October.

You’re right that it’s an interesting trap for young players though, but I’d hope the wildcard whitelist had been implemented to help them out, i.e. https://*.execute-api.*.amazonaws.com

EDIT: the other thread on ESOCKETTIMEDOUT was: What does ESOCKETTIMEDOUT mean? I created a more visible thread as I realised someone in the community who wasn’t watching this old thread might know.


#105

I am using firmware 0.6.0 with SYSTEM_THREAD(ENABLED); running on an Electron, and appear to still be losing webhook responses. The event log shows that the publish was processed and the response was received by the cloud, but it was never passed to the device. I have 3 devices under test and they all show this problem. There is no difference in the web log between publish/response pairs that work as expected and those that do not.

Here is the log of publish data as seen at the host - in the order they were received by the host.

14:02:55 //W Position_ 1500757366 150075717102_002 successful publish response

15:03:25 //W Position_ 1500760982 150075737402_002
15:03:25 //W Position_ 1500760967 150075737402_002
15:03:27 //W Position_ 1500760997 150075737402_002
15:08:37 //W Position_ 1500761311 150075737402_002
15:19:45 //W Position_ 1500761939 150075737402_002
15:19:45 //W Position_ 1500761969 150075737402_002
15:19:45 //W Position_ 1500761954 150075737402_002
15:20:08 //W Position_ 1500761984 150075737402_002
15:20:08 //W Position_ 1500761999 150075737402_002

time = time the publish data was received by the host
3rd column = device seconds when the publish was started
4th column = device seconds when the last response was received.
02_002 = response

Also, publish data is not received in the same order as it was published.

If no response is received by the device in 15 seconds the publish is repeated.
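
The retry rule just described (repeat the publish if no response arrives within 15 seconds) can be modelled as a small tracker. The real code is Electron firmware; this JavaScript sketch, with the hypothetical `makePublishTracker`, only models the logic:

```javascript
// Model of the retry rule: each publish is remembered by sequence number,
// a response acks it, and anything still pending after timeoutMs is due
// for re-publish (with its timer restarted).
function makePublishTracker(timeoutMs) {
  const pending = new Map(); // seq -> timestamp of last publish attempt
  return {
    published(seq, now) { pending.set(seq, now); },
    acked(seq) { pending.delete(seq); },
    // Return sequence numbers whose response is overdue and should be re-sent.
    due(now) {
      const retry = [];
      for (const [seq, t] of pending) {
        if (now - t >= timeoutMs) { retry.push(seq); pending.set(seq, now); }
      }
      return retry;
    }
  };
}
```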

Any ideas?


#106

How did you go? I use the published_at on my processing endpoint as the authoritative datetime and convert that to a datestamp for the record. I don’t trust the received time.
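
Treating published_at as the authoritative timestamp is straightforward on the endpoint; a minimal sketch, where `recordTimestamp` is a hypothetical helper:

```javascript
// Use the event's published_at (set by the Particle cloud) as the
// authoritative record timestamp, rather than the server's receive time.
function recordTimestamp(particleEvent) {
  const t = new Date(particleEvent.published_at);
  if (isNaN(t.getTime())) {
    throw new Error("bad published_at: " + particleEvent.published_at);
  }
  return t.getTime(); // epoch milliseconds for the stored record
}
```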


#107

My problem is not resolved. Webhook responses are still getting to the Particle cloud but not to the device. I have had no joy in getting Particle interested in investigating / solving the problem.
I solved my issue by adding an end-to-end data ack/nack protocol that makes sure the data is received and stored by the host.

There are many async activities involved in the publish / response process chain; I don’t think you can rely on the timing.
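
The host-side half of such an ack/nack protocol could be sketched roughly as below. `makeAckTracker` and its field names are hypothetical illustrations, not the poster’s actual code:

```javascript
// Host side of an end-to-end ack/nack protocol: the device tags each data
// block with a sequence number, the host acks what it has stored (in the
// webhook response body), and the device re-sends anything unacked.
function makeAckTracker() {
  const received = new Set();
  return {
    // Store a block and return the ack payload for the webhook response.
    store(seq) {
      received.add(seq);
      return { ack: seq };
    },
    // Sequence numbers missing below the highest seen (candidates to nack).
    missingBelow(high) {
      const gaps = [];
      for (let s = 0; s < high; s++) {
        if (!received.has(s)) gaps.push(s);
      }
      return gaps;
    }
  };
}
```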


#108

@KyleG


#109

Hi @Jibeset,

Person from Particle here! It’s very difficult to diagnose this problem without looking closely at your firmware and your requests / responses to/from your server targeted by the webhook.

Things to try:

  • use the latest system firmware if you can, it’s possible this issue was already solved.
  • include something unique (like a counter or device timestamp) in your publish to help differentiate continuity
  • make sure you’re not going over publish limits on the device, or trigger limits on your hook
  • make sure you’re not blocking for long periods of time on your device in ways that might interfere with receiving a message; SYSTEM_THREAD helps, but isn’t a cure-all.
  • use the _WITH_ACK flag when publishing (firmware 0.6.1 or later) -> https://docs.particle.io/reference/firmware/electron/#particle-publish-
  • make sure your server is returning a non-empty success response in order for an event to be generated back down to your device
  • watch your event feed using the console or a tool like the CLI, and confirm your hook-response messages are appearing there when comparing against the Electron.
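
The non-empty-response point in the list above can be sketched as a minimal handler. `handleWebhook` and its payload fields are hypothetical; the principle (stated above) is that the success body must never be empty, or no event is generated back down to the device:

```javascript
// Always return a non-empty success body from the webhook target so that
// Particle generates a hook-response event for the device.
function handleWebhook(body) {
  // Echo enough for the device to correlate the response.
  const payload = { status: 200, coreid: body.coreid || "unknown", message: "stored" };
  return { statusCode: 200, body: JSON.stringify(payload) }; // never empty
}
```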

I hope that helps!

Thanks,
David


#110

@Dave = thank you. Here are my responses to your suggestions

use the latest system firmware if you can, it’s possible this issue was already solved.
– I am using 0.6.0 across my products and upgrading is a pain at this time.

include something unique (like a counter or device timestamp) in your publish to help differentiate continuity
– I already include the publish time and a sequence number in the publish data. This is how I know that responses are not always being returned to the device.

make sure you’re not going over publish limits on the device, or trigger limits on your hook
– A device only sends one publish every 15 seconds (max) and the test involves 3 or 4 devices.
– There appear to be two failure modes: data received by the host but the response not received by the device, and data not received by the host - the latter is very rare. In both cases the recovery process makes sure all data gets to the host.

make sure you’re not blocking for long periods of time on your device that might interfere with you receiving a message, system_thread helps, but isn’t a cure all.
– My application executes the same loop: reading data from the Asset Tracker shield, saving the position every 15 seconds, and issuing a publish every 120 seconds. When the response is not received in 15 seconds, the publish is issued again. There are tests for “Particle.connected()” before each publish. The only possible variable is the reading / processing of data from the AT shield.

use the _WITH_ACK flag when publishing (firmware 0.6.1 or later) -> https://docs.particle.io/reference/firmware/electron/#particle-publish-
– would require an across-product upgrade - will look into this.

make sure your server is returning a non-empty success response in order for an event to be generated back down to your device
– The server always returns the sequence number of the data block received. This drives the device’s data-loss recovery process.

watch your event feed using the console or a tool like the CLI, and confirm your hook-response messages are appearing there when comparing to the electron.
– I have followed the sequences of publish / webhook / response on the console for hours and see no differences. The publish happens, the webhook happens, and the response is received by the cloud even when the device does not see it.

I have observed:

When there is a failure, data published 15 seconds apart will arrive at the host at the same time, and sometimes not in the same order as it was published.

I believe this has something to do with the GSM link going up and down. The failures often come in bunches - it may take 2 or more re-publishes to successfully get the response back.

See ticket# 47547 for a discussion I had with support.


#111

Hi @Jibeset,

Hmm, it’s still hard to know without seeing your firmware, but thank you for your responses.

Are you using the SIM cards and service provided by Particle, or a different sim? In what part of the world are you located?

Thanks,
David


#112

@Dave:
I am using the SIM that came with the Electron - in all cases. I am in Saratoga, CA. I am prepared to share the firmware with you, but not post it online.


#113

Hi @Jibeset,

Okay, thanks for the quick answers! I’ll DM you details on where you can send your code so I can try and identify where the issue is.

Thanks,
David


#114

David:
After sitting and watching two Electrons operate for 18 hours, I have correlated the loss of the response packet with the coming and going of the GSM connection.
My process was to watch the Electrons and note the times when the status light went from “breathing blue” to “flashing green” to “flashing blue” to “breathing blue” again - GSM connection lost and recovered. At the end of the 18 hours, I compared these times with the times when the response packet was not received by the device and found a perfect correlation - when response packets did not arrive at the device, there was always a GSM disconnect and reconnect around the same time. There were no response packets lost at any other times.
In a few cases the upstream (publish) packet was never received by the host when the publish was executed.
In all cases when the response packet was not received by the device, the publish was re-tried (one or more times) and was always eventually received by the host.

I go away happy that I understand the issue and have a valid method of recovery.

Thank you for your assistance.