Using mesh.publish kills the cloud connection

xenon
Tags: #<Tag:0x00007fe223e9c958>

#1

I’ve noticed something odd with the Xenon. A simple program that uses Particle.Publish is very reliable, but add a Mesh.publish and it quickly loses it’s connection to the cloud. The Particle.Publish (w/ACK) will start returning false after awhile (~2min), though they will actually make it to the cloud for another few minutes before it starts blinking cyan and then nothing works. Oddly the Particle subscriptions, firmware updates, and function calls no longer make it through to the Xenon after that first failure (~2min). Happens on v1.1.0 as well as 1.2.1-rc.2.

Here’s a test app to demonstrate the issue, it typically takes 2min for it to start showing up, regardless of publish interval. After you can reproduce the issue, comment out the Mesh.publish and try again and you’ll see it remains connected.

SerialLogHandler logHandler(LOG_LEVEL_NONE, { { "app", LOG_LEVEL_TRACE  } });

unsigned long publishFrequencySec = 5;
unsigned long lastPublishTimeMs = 0;

void setup() {
	Particle.subscribe("test", handleMessage, MY_DEVICES); 
}
void loop() {
    if(millis() - lastPublishTimeMs > publishFrequencySec*1000UL) { 
        lastPublishTimeMs=millis();
        //Comment out the following line and the cloud connection remains stable
        Mesh.publish("test", String::format("uptime: %lu", millis()));
    	bool success = Particle.publish("test", String::format("uptime: %lu", millis()), 60, PRIVATE, WITH_ACK);
        if (success) {
            Log.info("Publish success");
        } else {
            Log.error("Publish failed.");
        }
    }
 }
void handleMessage(const char *event, const char *data)
{
	Log.info("Received message - " + String(event) + ": " + String(data));
}

Here’s what this looks like:

0000116505 [app] INFO: Publish success
0000116514 [app] INFO: Received message - test: uptime: 116200
0000121658 [app] INFO: Publish success
0000121667 [app] INFO: Received message - test: uptime: 121201
0000126567 [app] INFO: Publish success
0000126576 [app] INFO: Received message - test: uptime: 126202
0000151201 [app] ERROR: Publish failed.
0000171201 [app] ERROR: Publish failed.
0000186792 [app] ERROR: Publish failed.
0000206792 [app] ERROR: Publish failed.

Before boiling this issue down to that test app I tried numerous ways to get the Xenon to reconnect with no success. In v1.1.0 it still thought it was connected the whole time but with the latest 1.2.1-rc.2 firmware it can tell when it disconnects but can’t fix the connection on it’s own, no amount of turning on/off mesh/network/cloud has any effect. The only thing that works is a system.reset. A reset of the gateway Argon has no real effect, the Xenon will go offline and reconnect but publishes/functions/firmware updates will keep failing.


#2

Have you tried this with all logging on?

I ran in to this issue: Mesh.publish() alloc errors and function/variable issues a while back. I haven’t had a chance to re-test it on newer versions though. I’m wondering if you’ll see similar logging.


#3

Oh, you’re right. I don’t recall seeing this error before with my full program but it’s pretty obvious with TRACE level on my test app.

0000111324 [app] INFO: Publish success
0000111329 [app] INFO: Received message - test: uptime: 111129
0000111331 [comm.protocol] INFO: rcv'd message type=8
0000116527 [comm.protocol] TRACE: Reply recieved: type=2, code=0
0000116528 [comm.protocol] INFO: message id 112 complete with code 0.00
0000116529 [comm.protocol] INFO: rcv'd message type=13
0000116529 [app] INFO: Publish success
0000116536 [app] INFO: Received message - test: uptime: 116130
0000116536 [comm.protocol] INFO: rcv'd message type=8
0000121427 [net.th] TRACE: input failed to alloc
0000121433 [net.th] TRACE: input failed to alloc
0000126379 [net.th] TRACE: input failed to alloc
0000126386 [net.th] TRACE: input failed to alloc
0000134958 [net.th] TRACE: input failed to alloc
0000136184 [net.th] TRACE: input failed to alloc
0000137784 [net.th] TRACE: input failed to alloc
0000138712 [net.th] TRACE: input failed to alloc
0000140513 [net.th] TRACE: input failed to alloc
0000141129 [app] ERROR: Publish failed.
0000141434 [net.th] TRACE: input failed to alloc
0000141441 [net.th] TRACE: input failed to alloc
0000146695 [net.th] TRACE: input failed to alloc
0000146947 [net.th] TRACE: input failed to alloc
0000147249 [net.th] TRACE: input failed to alloc
0000148912 [net.th] TRACE: input failed to alloc
0000151250 [net.th] TRACE: input failed to alloc
0000154652 [net.th] TRACE: input failed to alloc
0000157404 [net.th] TRACE: input failed to alloc
0000158906 [net.th] TRACE: input failed to alloc
0000160155 [net.th] TRACE: input failed to alloc
0000161129 [app] ERROR: Publish failed.
0000161409 [net.th] TRACE: input failed to alloc
0000161415 [net.th] TRACE: input failed to alloc
0000164952 [net.th] TRACE: input failed to alloc
0000166608 [net.th] TRACE: input failed to alloc
0000167160 [net.th] TRACE: input failed to alloc
0000169380 [net.th] TRACE: input failed to alloc
0000170506 [net.th] TRACE: input failed to alloc
0000174273 [net.th] TRACE: input failed to alloc
0000176536 [comm.protocol] ERROR: Event loop error 1
0000176536 [system] WARN: Communication loop error, closing cloud socket
0000176537 [system] INFO: Cloud: disconnecting
0000176537 [system] INFO: Cloud: disconnected
0000176538 [system] INFO: Cloud: connecting
0000176539 [system] INFO: Read Server Address = type:1,domain:$id.udp-mesh.particle.io
0000176540 [system] INFO: Loaded cloud server address and port from session data
0000176541 [system] TRACE: Address type: 1
0000176542 [system] TRACE: Cloud socket=0, family=10, type=2, protocol=17
0000176543 [system] INFO: Cloud socket=0, connecting to 64:FF9B::36A6:B5BF#5684
0000176543 [system] TRACE: Cloud socket=0, connected to 64:FF9B::36A6:B5BF#5684
0000176544 [system] TRACE: Updating cloud keepalive for AF_INET6: 30000 -> 30000
0000176545 [system] TRACE: Applying new keepalive interval now
0000176546 [system] INFO: Cloud socket connected
0000176546 [system] INFO: Starting handshake: presense_announce=0
0000176547 [comm.protocol.handshake] INFO: Establish secure connection
0000176737 [comm.dtls] WARN: session has 0 uses
0000176743 [comm.dtls] INFO: (CMPL,RENEG,NO_SESS,ERR) restoreStatus=0
0000176743 [comm.dtls] INFO: out_ctr 0,1,0,0,0,0,0,239, next_coap_id=74
0000176744 [comm.dtls] INFO: app state crc: cached: 8cd594ee, actual: 8cd594ee
0000176745 [comm.dtls] WARN: skipping hello message
0000176745 [comm.dtls] INFO: restored session from persisted session data. next_msg_id=116
0000176746 [comm.dtls] INFO: session cmd (CLS,DIS,MOV,LOD,SAV): 2
0000176747 [comm.protocol.handshake] INFO: resumed session - not sending HELLO message
0000176751 [system] INFO: Send spark/hardware/max_binary event
0000176754 [system] INFO: Send subscriptions
0000176758 [comm.dtls] INFO: session cmd (CLS,DIS,MOV,LOD,SAV): 4
0000176758 [comm.dtls] INFO: session cmd (CLS,DIS,MOV,LOD,SAV): 3
0000176759 [comm] INFO: Sending TIME request
0000176760 [system] INFO: Cloud connected
0000176761 [app] ERROR: Publish failed.
0000176868 [net.th] TRACE: input failed to alloc
0000176875 [net.th] TRACE: input failed to alloc
0000176920 [net.th] TRACE: input failed to alloc
0000176928 [net.th] TRACE: input failed to alloc
0000176934 [net.th] TRACE: input failed to alloc
0000176940 [net.th] TRACE: input failed to alloc
0000176947 [net.th] TRACE: input failed to alloc
0000176953 [net.th] TRACE: input failed to alloc
0000176975 [net.th] TRACE: input failed to alloc
0000176981 [net.th] TRACE: input failed to alloc
0000178578 [net.th] TRACE: input failed to alloc

#4

Well I’m both happy and sad to see you’re seeing the same issue as I was. Where there’s two now I’ve sent a message to support to see if they can look.


#5

@Fragma can you still reproduce this? If so I might ask support to email you rather than me, as I’m not having much luck in reproducing it and they’d like to check some of their logs from when it’s happening.


#6

Yes, though it takes twice as long now, failing around 240sec, reproduced 3x. Go ahead and send them my way :wink:


#7

That “input failed to alloc” is tied to the following line indicating a failure to allocate a packet buffer while processing input on the OpenThread mesh.

For the Xenon any Particle.publish() has to travel across the mesh and out through a gateway so a mesh layer failure would also impact Particle.publish().

Needs confirmation, but it looks like you have a pretty minimal test case that showing possible leak of packet buffers impacting mesh/cloud connectivity on the Xenon. Thanks for the report and I’ll make sure it gets to the right place.


#8

Thanks for reporting this! I’ve managed to reproduce the issue. Please confirm whether you have other nodes within your mesh network doing mesh publishes. If so, a temporary workaround until we include the fix in one of the next releases is to add a dummy Mesh.subscribe() to any event. Without a single subscription we are apparently not reading the data out of the mesh pub/sub socket.

I’ve also created an issue on GitHub, so you may follow it for an update on when it gets fixed: https://github.com/particle-iot/device-os/issues/1828


#9

Thanks, I’ve confirmed adding a subscribe to the test program resolves my issue. I do have other devices on the mesh network and they already had a mesh subscription.