Webhooks sometimes do not generate responses back to device

Hello!

I am kind of new to Particle (have been using it for less than a year now) but not new to microcontrollers :slight_smile:

We are developing hardware and software using Photons and P1s here. Our system provides home automation for people with disabilities. It comprises a main control unit (with a color graphic LCD) and peripheral units such as a bed controller, door opener and IR hub. All devices interact with each other using a custom UDP protocol, and they also interact with our backend using Particle Cloud and webhooks.

Our system is now more than 4k lines of code (yes, I know, that means lots of bugs!) and it makes use of several FSMs to control things like:

  • networking (WiFi connection and disconnection, Internet checking, cloud connection, etc.);
  • backend communication (we are using cloud messages and webhooks in order to post data, get data, ping the server, etc.);
  • GUI (drawing menus, icons, processing user input, etc.);
  • handling BLE devices (beacons and remote controlling smartphones and laptops);
  • communicating with other local devices (such as WeMos and HUE lights) by using UDP/TCP.

We have some devices which work as bridges to control Bed, Doors and IR controllers.

Well, everything was working fine until a couple of days ago, when we started noticing that our devices keep disconnecting and restarting. This behavior is caused by devices losing contact with our backend: they send messages and get no reply, consecutive pings fail, and the device then disconnects from the cloud, tests the Internet connection and tries to connect to the cloud again. A few minutes later more replies are lost, and the system gives up and restarts the whole networking system. This is our design and (I think) it is the way it is supposed to be.
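For readers following along, the escalation described above can be sketched roughly as a counter of consecutive missed replies. This is only an illustration in plain C++ (no Particle headers), not the actual production firmware: the class name `ReplyWatchdog`, the `LinkAction` values and the thresholds are all hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical actions the networking FSM could take; names are illustrative only.
enum class LinkAction { None, ReconnectCloud, RestartNetworking };

// Counts consecutive missed replies and escalates in two steps.
class ReplyWatchdog {
public:
    // reconnectAfter / restartAfter are assumed thresholds, not values from the thread.
    ReplyWatchdog(std::uint8_t reconnectAfter, std::uint8_t restartAfter)
        : reconnectAfter_(reconnectAfter), restartAfter_(restartAfter) {
        assert(reconnectAfter > 0 && reconnectAfter < restartAfter);
    }

    // Call when a reply (or ping response) arrives in time.
    void onReply() { missed_ = 0; }

    // Call when a reply times out; returns what the caller should do next.
    LinkAction onTimeout() {
        ++missed_;
        if (missed_ >= restartAfter_) {
            missed_ = 0;  // start counting afresh after a full restart
            return LinkAction::RestartNetworking;
        }
        if (missed_ == reconnectAfter_) {
            return LinkAction::ReconnectCloud;
        }
        return LinkAction::None;
    }

    std::uint8_t missed() const { return missed_; }

private:
    std::uint8_t reconnectAfter_;
    std::uint8_t restartAfter_;
    std::uint8_t missed_ = 0;
};
```

With `ReplyWatchdog w(3, 6)`, the third consecutive timeout asks for a cloud reconnect and the sixth asks for a networking restart; any reply in between resets the count.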

What is not supposed to happen is a device losing replies!

Some facts:
1- Our message handling FSM publishes only one message per second;
2- Using the console I can see that the events are published on the cloud;
3- I can also see that a webhook is triggered;
4- I can see that the webhook got a reply from our backend;
5- Much of the time that reply doesn’t reach the device as expected;
6- My messaging FSM retries the message, but most of the time retries also fail;
7- This is happening on different locations, different networks and providers;

In order to dig deeper I wrote some very simple code for one board that listens to any hook replies for another board (which runs our production firmware). The result is that when one board gets a reply, so does the other. But when one board does not get a reply, neither does the other! My conclusion is that the replies aren’t really there!

Well, I am 99% sure our firmware is OK, and I really think that for some reason the webhook or cloud is not relaying the message as it should. My tests point that way but, of course, it could still be a problem in our firmware. So, I ask: has anyone here experienced something similar?

Thank you!

Hey @FabioP!

Are you using the hook-sent and hook-response events in your testing? We just fixed an issue with those events getting lost some of the time and I’m wondering if you could retry your test and tell us if it’s any better?

Cullen

Hello Cullen!

I am using a generic listener for all my responses:

Particle.subscribe(System.deviceID() + "/hook-response/", serverHandler, MY_DEVICES);
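As a side note on why the prefix matters: a subscription acts as a prefix filter on event names, so a handler registered this way only fires for events whose names begin with the device ID followed by /hook-response/. The sketch below shows that matching in plain C++ (no Particle headers). The helper names, the device ID and the event names are made up for illustration, and it assumes the webhook uses a custom response topic that starts with the device ID.

```cpp
#include <string>

// True if eventName begins with prefix, mirroring a prefix-filter subscription.
bool matchesSubscription(const std::string& prefix, const std::string& eventName) {
    return eventName.compare(0, prefix.size(), prefix) == 0;
}

// Returns the part of the event name after the prefix (e.g. "ping/0"),
// or an empty string when the event does not match the subscription.
std::string topicSuffix(const std::string& prefix, const std::string& eventName) {
    if (!matchesSubscription(prefix, eventName)) {
        return "";
    }
    return eventName.substr(prefix.size());
}
```

For example, with the hypothetical prefix "0123456789abcdef/hook-response/", the event "0123456789abcdef/hook-response/ping/0" matches and yields the suffix "ping/0", while a plain "hook-response/ping/0" does not match at all.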

Should I use hook-received instead of hook-response?

Fabio

No, I think you’re totally right! I didn’t look closely at what I was typing and typed the wrong thing. :slight_smile: I’ve updated my post to reflect the actual name of the event instead of what I was thinking about. :wink:

I implemented a ping fail counter here, and most of my devices are experiencing about 50% ping drops. I can see that sometimes all messages get replies for a while and then lots of messages fail. Then, out of the blue, pings start to get replies again.

Right now one device is losing all replies, whereas another one is getting all its replies.
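A counter like the one mentioned above could look roughly like the following. This is a hedged sketch in plain C++, not the code actually running on the devices: the class name `PingStats` and the 32-ping window are assumptions. Tracking a recent window next to the lifetime total helps tell a bursty pattern (all fine, then everything lost) from a steady 50% loss.

```cpp
#include <bitset>
#include <cstddef>
#include <cstdint>

// Tracks lifetime and recent (sliding window) ping loss.
class PingStats {
public:
    static constexpr std::size_t kWindow = 32;  // assumed window size

    // Record the outcome of one ping: true if a reply arrived in time.
    void record(bool replied) {
        const bool lost = !replied;
        ++sent_;
        if (lost) ++lost_;

        if (count_ == kWindow) {
            // Window is full: the slot at head_ holds the oldest sample, drop it.
            if (window_[head_]) --windowLost_;
        } else {
            ++count_;
        }
        window_[head_] = lost;
        if (lost) ++windowLost_;
        head_ = (head_ + 1) % kWindow;
    }

    std::uint32_t sent() const { return sent_; }
    std::uint32_t lost() const { return lost_; }

    // Loss since start, as an integer percentage (rounded down).
    unsigned totalLossPercent() const {
        return sent_ ? static_cast<unsigned>(lost_ * 100ULL / sent_) : 0u;
    }

    // Loss over the last kWindow pings, as an integer percentage (rounded down).
    unsigned recentLossPercent() const {
        return count_ ? static_cast<unsigned>(windowLost_ * 100 / count_) : 0u;
    }

private:
    std::bitset<kWindow> window_;  // bit set = that ping was lost
    std::size_t head_ = 0;         // next slot to overwrite
    std::size_t count_ = 0;        // samples currently in the window
    std::size_t windowLost_ = 0;   // lost pings currently in the window
    std::uint32_t sent_ = 0;
    std::uint32_t lost_ = 0;
};
```

Calling `record()` from the ping timeout and reply paths and printing both percentages over serial would show whether the drops arrive in bursts.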

I also notice that sometimes the Photon’s LED blinks cyan (as if it were connecting to the cloud), but on the console the cloud connection looks OK (I have a handler catching network state changes, so I always get a message when WiFi connects/disconnects and when the cloud connects/disconnects):
System.on(network_status+cloud_status, networkHandler);

Anyway, the problem is still there… :frowning:

Hmmm, that’s curious.

So the problem we’re talking about is not getting the reply to the webhook sent to your backend?
Specifically the code that runs in response to Particle.subscribe(System.deviceID() + "/hook-response/", serverHandler, MY_DEVICES);?

I went to run a test on my own device and I haven’t seen any problems in the console for all my events that have been published after I made that change.

Cullen

So I notice that when I look at the device events, I only get the events that are published by the device.

But when I look at ALL events, I notice that I am able to see all of them reliably. I haven’t made my device subscribe to the hook-response event, so I haven’t tried that myself.

Is it possible that the Console is only showing you events for your device as you’re looking at the device page?

Cullen

In my last message, by console I meant my local serial console connected to the device’s VCP.

Generally speaking, if I use Particle’s web console and click on a device, I can see all its messages. From there I can see that every 15 seconds…

Wait!!!

Did you change something? No pings are failing here anymore!

I noticed that all pings were being replied and just reset all my testing devices to make sure their ping fail counter is not incrementing anymore!

I can confirm that as of now ALL my packets are arriving as expected!

Fabio

Cullen,

I didn’t change anything on my end, but for some reason all replies are arriving as expected now. I am glad it wasn’t my firmware, but we need to know what the issue was so that it won’t happen again :slight_smile:

I rewrote our communication FSM to avoid restarting networking in case of missing replies, but while that avoids the restarts, it doesn’t fix the underlying problem, because we might lose important packets anyway.

I’ll wait for a reply from Particle, OK?

Thanks again

Fabio

Yeah, the problem was that some new infrastructure we were testing had a problem with dropping hook-sent and hook-response events. It just took longer for the fix to deploy than I thought it would. Many things to learn as we start using new tools!

We’re talking internally about how to make sure this type of thing doesn’t happen again, and we’ve got a lot of great ideas that should go a long way to improving the situation.

In terms of restarting the cloud connection, the Device OS already does that when it detects a problem connecting to the Particle Cloud, so there’s not much need to do that manually as well unless you’re controlling the connection manually for some power savings. Generally leaving the Particle Cloud connection connected will use less data than disconnecting and reconnecting it manually.

So glad it’s working for you!

Cullen

Thank you Cullen!

Yes, we control everything manually! We decide when to connect to WiFi, we detect cloud drops and we can keep working offline if needed (as long as WiFi router provides an IP).

The problem we experienced with cloud connection/disconnection is that when you call Particle.connect() it blocks the user thread, so before trying to connect to the cloud our networking FSM checks whether we have Internet, and only then do we try to connect. We also try to synchronize it with our GUI state, so that the user does not notice that the application code is blocked.

Have you tried System Threaded mode?

I’m no firmware expert, but as far as I’m aware, Particle.connect() shouldn’t block while in System Threaded mode. That said, I believe WiFi.connect() DOES appear to block, because connecting to a WiFi network uses nearly all of the Photon’s resources while it’s connecting, and it is called first if WiFi is not already connected.

Electrons, because their networking interface is on a different processor, will be much less blocking on a Particle.Connect().

Happy Hacking!
Cullen

See this about the blocking/non-blocking nature of Particle.connect(): https://github.com/particle-iot/firmware/issues/1449#issuecomment-361840017

Hello!

Yes, we are using manual mode with threading enabled, and the link you posted (as far as I could understand) states that it is a blocking call even when the system is running in multi-threaded mode. This is what I saw when I was designing our networking FSM, and that is why we decided to have multiple steps and be very careful before calling both WiFi.connect() and Particle.connect().

That is why we test for Internet before attempting to connect to Particle Cloud, because if Internet is down, trying to connect to the cloud will cause a noticeable pause on our application.
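The multi-step approach described above can be sketched as a small state machine that only reaches the (possibly blocking) cloud connect step after the Internet check has passed. This is an illustration in plain C++ under stated assumptions, not the real networking FSM: `NetFsm`, `NetState` and the `NetHooks` callbacks are hypothetical names, and on a device the callbacks would wrap calls such as WiFi.ready(), Particle.connect() and Particle.connected() plus some Internet probe of your own.

```cpp
#include <cassert>
#include <functional>
#include <utility>

enum class NetState { WaitWifi, CheckInternet, ConnectCloud, Online };

// Injected operations, so the logic can be exercised off-device.
struct NetHooks {
    std::function<bool()> wifiReady;          // e.g. wraps WiFi.ready()
    std::function<bool()> internetReachable;  // hypothetical Internet probe
    std::function<void()> connectCloud;       // e.g. wraps Particle.connect(); may block
    std::function<bool()> cloudConnected;     // e.g. wraps Particle.connected()
};

class NetFsm {
public:
    explicit NetFsm(NetHooks hooks) : hooks_(std::move(hooks)) {
        assert(hooks_.wifiReady && hooks_.internetReachable &&
               hooks_.connectCloud && hooks_.cloudConnected);
    }

    NetState state() const { return state_; }

    // Advances at most one state per call so the caller's loop (GUI etc.) keeps running.
    void step() {
        if (!hooks_.wifiReady()) {
            state_ = NetState::WaitWifi;
            return;
        }
        switch (state_) {
        case NetState::WaitWifi:
            state_ = NetState::CheckInternet;
            break;
        case NetState::CheckInternet:
            // Stay here until the probe succeeds; never touch the cloud without Internet.
            if (hooks_.internetReachable()) state_ = NetState::ConnectCloud;
            break;
        case NetState::ConnectCloud:
            hooks_.connectCloud();
            state_ = hooks_.cloudConnected() ? NetState::Online
                                             : NetState::CheckInternet;
            break;
        case NetState::Online:
            if (!hooks_.cloudConnected()) state_ = NetState::CheckInternet;
            break;
        }
    }

private:
    NetHooks hooks_;
    NetState state_ = NetState::WaitWifi;
};
```

Because the ConnectCloud state is a separate step, the application can pick a convenient moment (for example, when the GUI is idle) before calling `step()` from that state.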

I would love to see how it is gonna be handled on your new Hybrid Processor modules (ESP32+CM4) since ESP32 will run all the networking stack independently from the “application processor”!

Fabio
