Randomly failing to publish

apica · November 16, 2018, 2:55pm

Hi,

I have tens of devices installed in different places working with OS 0.7.0 version since almost half a year. Since ~last month I had some devices that stop publishing data without any apparent reason. It seems that the connection is OK (diagnostics show all fields green), and ping reaches the device successfully. The only solution seems to restart the device.

Is there any clue to how to debug it? All local devices that I test doesn’t show this error, and the code was working sucessfully since a long time ago. Maybe some change in the Particle cloud might prevent some events to be published?

Thank you

apica · December 1, 2018, 4:56pm

Still searching for this error, and update with new information.

After the connection problems, I just added some extra reliability in the communications, so now, after publishing the message, it waits until the matching webhook fires and got the webhook reply. The thing is that sometimes this doesn't work well.

It gets subscribed into the webhook reply:

(void)Particle.subscribe(System.deviceID() + "/hook-response/" EVENT_NAME + "/", SubscriptionCallback, MY_DEVICES);

Then, after publishing, a webhook sems to reply well:

The thing is that the webhook reply doesn't reach the device sometimes. Then, it keeps failing until it receives a "cloud connection down" and then "cloud connection up".

If the devices sends a packet, why the Particle Cloud doesn't send the reply to the device immediately? The cellular connection must be up in this moment!

Thank you

apica · December 3, 2018, 11:49am

More information:

Once the communications starts failing, a celular ping to 8.8.8.8 is performed, with -3 result (RESP_ERROR).
If a reset including ublox reset is performed (Cellular.command(30000, “AT+CFUN=16\r\n”)), the communications goes up again.

apica · December 5, 2018, 8:34am

Finally, I'm try to understard the low level communication using SerialLogHandler, the problem is that I think that the data send through uBlox is encripted somehow so need to know what is the content of the messages to really understand what is going on, and why some replies from the Particle cloud doesn't reach the module with no aparent reason.

So, how can I interpret messages like:

31804.181 AT read + 14 2 "\r\n+USORF: 0,"52.55.104.91",5684,104,"\x17\xfe\xfd\x00\x01\x00\x00\x00\x00\x00\\x00[\x00\x01\x00\x00\x00\x00\x00\\xdb\xce\xb6\xdc\xed\xe7\x8de\x9e\xd3\x9b\x8fk\xe0\xce\xe4Q\\x87U...

rickkas7 · December 5, 2018, 5:14pm

You should add one thing to your failure handing code. If you are missing your response events, you should not only reset the mode (either AT+CFUN=16 or SLEEP_MODE_DEEP) but you may also want to reset your cloud session. The code to do that is here:

github.com

rickkas7/electronsample/blob/master/src/SessionCheck.cpp#L102


	// Failed to receive event
	if (++numFailures < NUM_FAILURES_BEFORE_RESET_SESSION) {
		// Try sending again just in case
		sendEvent();
		return;
	}


	ConnectionEvents::addEvent(ConnectionEvents::CONNECTION_EVENT_SESSION_RESET);


	// Too many tries, reset the session
	Particle.publish("spark/device/session/end", "", PRIVATE);


	delay(2000);


	// Reset the whole device just in case. If ConnectionCheck is present, do a full modem reset,
	// otherwise just reset.
	if (ConnectionCheck::getInstance()) {
		ConnectionCheck::getInstance()->fullModemReset();
	}
	else {
		System.reset();

apica · December 10, 2018, 11:01am

In fact, I’m using the electronsample as a basic schema. SessionCheck is working, but it never fires (every time waitToSendState sends the check it works).

There are different incidents, all related to connectivity (software watchdog, losing sporadic webhook replies, losing webhook replies for a long time, etc). If I run the application without any connectivity, it NEVER has any problem, so I’m not sure I can do anything else to improve that.

Maybe I can show you privately how it works (or the logs) to check if there are some hidden bug? I’m using 0.7.0, and waiting for the stable 0.8.0 version before changing to it.

apica · December 10, 2018, 12:31pm

I observed another problem with the internal millis() time:

...
[06/12/18 - 13:22:36:996] 3269.132 AT read + 17 "\r\n+USOST: 0,225\r\n"
[06/12/18 - 13:22:37:032] 3269.142 AT read OK 6 "\r\nOK\r\n"
[06/12/18 - 13:22:37:041] 0003269143 [comm.protocol] INFO: rcv'd message type=6
[06/12/18 - 13:22:37:043] 0003269143 [comm.protocol] ERROR: Event loop error 3
[06/12/18 - 13:22:37:044] 0003269144 [system] WARN: Communication loop error, closing cloud socket
[06/12/18 - 13:22:37:045] 0003269144 [system] INFO: Cloud: connecting
[06/12/18 - 13:22:37:045] 0003269146 [system] TRACE: sparkSocket Now =0
[06/12/18 - 13:22:37:046] 0003269146 [system] TRACE: Close Attempt
[06/12/18 - 13:22:37:047] socketClose(0)
[06/12/18 - 13:22:37:048] 3269.145 AT send 12 "AT+USOCL=0\r\n"
[06/12/18 - 13:22:37:049] ConnectionCheck: Cloud connection down
[06/12/18 - 13:22:38:437] 0000000006 [system] INFO: Power Management Initializing.
[06/12/18 - 13:22:38:441] 0000000013 [system] INFO: State of Charge: 85.63 %
[06/12/18 - 13:22:38:444] 0000000014 [system] INFO: Battery Voltage: 4.00V
...

At "3269.145 AT send", the miliseconds value is 145, while the previous message is at 146. They are different type of messages, but it seems strange an increased value.

ScruffR · December 10, 2018, 12:38pm

This is probably due to FreeRTOS and sharing the same serial interface between multiple threads.
While one thread may have had its timeslice to created the string it might lose control before actually sending it to the interface and when it regains control will have to wait for the shared resource to become available which may well take more than a millisecond.

With up to 5 concurrent threads an out-of-order log of 5ms wouldn’t give me bad dreams.

apica · December 10, 2018, 12:52pm

That seems the reason. I didn’t know that more than 1 thread was used by Particle OS (I don’t use threads for the application).

Topic		Replies	Views
"Lost" - Particle subscribe Firmware boron	3	327	February 19, 2021
Cellular Connection but no Cloud Connection? Connectivity	3	274	September 27, 2023
Particle.subscribe() not listening Cloud boron	7	440	September 16, 2020
A Sub-set of my Electrons cannot publish web hooks successfully Troubleshooting	5	709	December 3, 2018
Webhooks sometimes do not generate responses back to device General	13	856	June 28, 2018

Randomly failing to publish

Related topics