Hello!
I am fairly new to Particle (I have been using it for less than a year) but not new to microcontrollers.
We are developing hardware and software using Photons and P1s. Our system provides home automation for people with disabilities. It comprises a main control unit (with a color graphic LCD) and peripheral units such as a bed controller, door opener and IR hub. All devices interact with each other using a custom UDP protocol, and they also interact with our backend using Particle Cloud and webhooks.
Our system is now more than 4k lines of code (yes, I know, that means lots of bugs!) and it makes use of several FSMs to control things like:
- networking (WiFi connection and disconnection, Internet checking, cloud connection, etc);
- backend communication (we are using cloud messages and webhooks in order to post data, get data, ping the server, etc);
- GUI (drawing menus, icons, processing user input, etc);
- handling BLE devices (beacons and remote control of smartphones and laptops);
- communicating with other local devices (such as WeMos and HUE lights) using UDP/TCP.
We also have some devices that work as bridges to the bed, door and IR controllers.
Everything was working fine until a couple of days ago, when we noticed that our devices keep disconnecting and restarting. This happens because the devices lose contact with our backend: they send messages and get no reply, and consecutive pings fail. The device then disconnects from the cloud, tests the internet connection and tries to connect to the cloud again. A few minutes later more replies are lost, and the system gives up and restarts the whole networking system. This is by design and (I think) it is how it should behave.
What is not supposed to happen is a device losing replies!
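For readers who want the recovery sequence spelled out, here is a minimal sketch of it in plain C++. It is not our production code: the state names, the `NetFsm` type and both thresholds (`kMaxFailedPings`, `kMaxRecoveries`) are hypothetical, and it assumes the recovery counter is only cleared by a full networking restart.

```cpp
#include <cassert>

// Hypothetical states for the recovery sequence (names are illustrative).
enum class NetState { Online, CheckingInternet, ReconnectingCloud, RestartNetworking };

// Assumed thresholds; the actual limits are not stated in this post.
constexpr int kMaxFailedPings = 3;  // consecutive lost replies before recovery starts
constexpr int kMaxRecoveries  = 1;  // cloud reconnects allowed before a full restart

struct NetFsm {
    NetState state = NetState::Online;
    int failedPings = 0;  // consecutive pings without a reply
    int recoveries  = 0;  // cloud reconnects since the last full restart

    // Call once per ping while Online; replyReceived is false on timeout.
    void onPingResult(bool replyReceived) {
        if (state != NetState::Online) return;
        if (replyReceived) { failedPings = 0; return; }
        if (++failedPings < kMaxFailedPings) return;
        failedPings = 0;
        // Too many lost replies: try a cloud reconnect first, then give up.
        state = (recoveries < kMaxRecoveries) ? NetState::CheckingInternet
                                              : NetState::RestartNetworking;
    }

    // Result of the internet reachability test made after disconnecting.
    void onInternetCheck(bool reachable) {
        if (state != NetState::CheckingInternet) return;
        state = reachable ? NetState::ReconnectingCloud
                          : NetState::RestartNetworking;
    }

    // The cloud connection came back after a reconnect attempt.
    void onCloudConnected() {
        if (state != NetState::ReconnectingCloud) return;
        ++recoveries;
        assert(recoveries <= kMaxRecoveries);  // invariant of the transitions above
        state = NetState::Online;
    }

    // The whole networking system has been restarted; start from scratch.
    void onNetworkingRestarted() {
        if (state != NetState::RestartNetworking) return;
        failedPings = 0;
        recoveries  = 0;
        state = NetState::Online;
    }
};
```

With these assumed values, three lost replies in a row trigger one cloud reconnect, and a second run of three lost replies triggers the full networking restart, which is the pattern we are seeing in the field.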
Some facts:
1- Our message handling FSM publishes only one message per second;
2- Using the console I can see that the events are published on the cloud;
3- I can also see that a webhook is triggered;
4- I can see that the webhook got a reply from our backend;
5- Often that reply doesn’t reach the device as expected;
6- My messaging FSM retries the message, but most of the time retries also fail;
7- This is happening at different locations, on different networks and with different providers.
To dig deeper, I wrote a very simple program for a second board that listens to any hook replies addressed to another board (which runs our production firmware). The result: when one board gets a reply, so does the other, and when one board does not get a reply, neither does the other. My conclusion is that the replies really aren't there!
I am 99% sure our firmware is OK, and my tests suggest that for some reason the webhook or the cloud is not relaying the message as it should. Of course, it could still be a problem in our firmware. So I ask: has anyone here experienced something similar?
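The comparison between the two boards can be made precise with a small tally. The following is only a sketch in plain C++ under assumptions: `Observation`, `Tally`, `tallyReplies` and `lossIsShared` are hypothetical names, and each log entry stands for one request sent by the production board, recording whether each board saw the hook reply.

```cpp
#include <cassert>
#include <vector>

// One published request: did each board see the hook reply? (hypothetical record)
struct Observation {
    bool production;  // board running the production firmware
    bool listener;    // second board that only listens for hook replies
};

struct Tally {
    int both = 0;
    int neither = 0;
    int productionOnly = 0;
    int listenerOnly = 0;
};

// Classify every request by which boards received its reply.
Tally tallyReplies(const std::vector<Observation>& log) {
    Tally t;
    for (const Observation& o : log) {
        if (o.production && o.listener)        ++t.both;
        else if (!o.production && !o.listener) ++t.neither;
        else if (o.production)                 ++t.productionOnly;
        else                                   ++t.listenerOnly;
    }
    // Every request falls into exactly one bucket.
    assert(t.both + t.neither + t.productionOnly + t.listenerOnly
           == static_cast<int>(log.size()));
    return t;
}

// True when replies were lost and the two boards never disagreed,
// i.e. every missing reply was missing on both boards.
bool lossIsShared(const Tally& t) {
    return t.neither > 0 && t.productionOnly == 0 && t.listenerOnly == 0;
}
```

If `lossIsShared` holds over a long log, the missing replies never reached either subscriber, which is consistent with the loss happening before delivery to the individual devices rather than inside one firmware. It does not prove it, since both boards could still share some other common factor.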
Thank you!