Randomly failing to publish


#1

Hi,

I have tens of devices installed in different places working with OS 0.7.0 version since almost half a year. Since ~last month I had some devices that stop publishing data without any apparent reason. It seems that the connection is OK (diagnostics show all fields green), and ping reaches the device successfully. The only solution seems to restart the device.

Is there any clue to how to debug it? All local devices that I test doesn’t show this error, and the code was working sucessfully since a long time ago. Maybe some change in the Particle cloud might prevent some events to be published?

Thank you


#2

Still searching for this error, and update with new information.

After the connection problems, I just added some extra reliability in the communications, so now, after publishing the message, it waits until the matching webhook fires and got the webhook reply. The thing is that sometimes this doesn’t work well.

It gets subscribed into the webhook reply:

(void)Particle.subscribe(System.deviceID() + "/hook-response/" EVENT_NAME + "/", SubscriptionCallback, MY_DEVICES);

Then, after publishing, a webhook sems to reply well:

The thing is that the webhook reply doesn’t reach the device sometimes. Then, it keeps failing until it receives a “cloud connection down” and then “cloud connection up”.

If the devices sends a packet, why the Particle Cloud doesn’t send the reply to the device immediately? The cellular connection must be up in this moment!

Thank you


#3

More information:

  • Once the communications starts failing, a celular ping to 8.8.8.8 is performed, with -3 result (RESP_ERROR).
  • If a reset including ublox reset is performed (Cellular.command(30000, “AT+CFUN=16\r\n”)), the communications goes up again.

#4

Finally, I’m try to understard the low level communication using SerialLogHandler, the problem is that I think that the data send through uBlox is encripted somehow so need to know what is the content of the messages to really understand what is going on, and why some replies from the Particle cloud doesn’t reach the module with no aparent reason.

So, how can I interpret messages like:

31804.181 AT read + 14 2 “\r\n+USORF: 0,“52.55.104.91”,5684,104,”\x17\xfe\xfd\x00\x01\x00\x00\x00\x00\x00\\x00[\x00\x01\x00\x00\x00\x00\x00\\xdb\xce\xb6\xdc\xed\xe7\x8de\x9e\xd3\x9b\x8fk\xe0\xce\xe4Q\\x87U…


#5

You should add one thing to your failure handing code. If you are missing your response events, you should not only reset the mode (either AT+CFUN=16 or SLEEP_MODE_DEEP) but you may also want to reset your cloud session. The code to do that is here:


#6

In fact, I’m using the electronsample as a basic schema. SessionCheck is working, but it never fires (every time waitToSendState sends the check it works).

There are different incidents, all related to connectivity (software watchdog, losing sporadic webhook replies, losing webhook replies for a long time, etc). If I run the application without any connectivity, it NEVER has any problem, so I’m not sure I can do anything else to improve that.

Maybe I can show you privately how it works (or the logs) to check if there are some hidden bug? I’m using 0.7.0, and waiting for the stable 0.8.0 version before changing to it.


#7

I observed another problem with the internal millis() time:


[06/12/18 - 13:22:36:996] 3269.132 AT read + 17 “\r\n+USOST: 0,225\r\n”
[06/12/18 - 13:22:37:032] 3269.142 AT read OK 6 “\r\nOK\r\n”
[06/12/18 - 13:22:37:041] 0003269143 [comm.protocol] INFO: rcv’d message type=6
[06/12/18 - 13:22:37:043] 0003269143 [comm.protocol] ERROR: Event loop error 3
[06/12/18 - 13:22:37:044] 0003269144 [system] WARN: Communication loop error, closing cloud socket
[06/12/18 - 13:22:37:045] 0003269144 [system] INFO: Cloud: connecting
[06/12/18 - 13:22:37:045] 0003269146 [system] TRACE: sparkSocket Now =0
[06/12/18 - 13:22:37:046] 0003269146 [system] TRACE: Close Attempt
[06/12/18 - 13:22:37:047] socketClose(0)
[06/12/18 - 13:22:37:048] 3269.145 AT send 12 “AT+USOCL=0\r\n”
[06/12/18 - 13:22:37:049] ConnectionCheck: Cloud connection down
[06/12/18 - 13:22:38:437] 0000000006 [system] INFO: Power Management Initializing.
[06/12/18 - 13:22:38:441] 0000000013 [system] INFO: State of Charge: 85.63 %
[06/12/18 - 13:22:38:444] 0000000014 [system] INFO: Battery Voltage: 4.00V

At “3269.145 AT send”, the miliseconds value is 145, while the previous message is at 146. They are different type of messages, but it seems strange an increased value.


#8

This is probably due to FreeRTOS and sharing the same serial interface between multiple threads.
While one thread may have had its timeslice to created the string it might lose control before actually sending it to the interface and when it regains control will have to wait for the shared resource to become available which may well take more than a millisecond.

With up to 5 concurrent threads an out-of-order log of 5ms wouldn’t give me bad dreams.


#9

That seems the reason. I didn’t know that more than 1 thread was used by Particle OS (I don’t use threads for the application).