Code Not Working after 15 Days in the Field

Opening this one back up again as I have another device showing the same issue.
We are now on version 0.7.0.

Our code now watches for a successful publish by way of Particle.subscribe.
If not successful, we force a handshake as @rickkas7 suggested (Particle.publish("spark/device/session/end", "", PRIVATE);). The device then resets and runs again to hopefully handshake.
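
For anyone following along, here is a rough sketch of that confirm-or-recover logic, written as plain C++ so it can be reasoned about off-device. The `PublishWatchdog` class, the `Action` values, the timeout, and the failure count are all hypothetical names and numbers for illustration, not the code actually running on these devices. On a device, `markConfirmed()` would be called from the Particle.subscribe handler, and the returned action would map to the session-end publish or to a reset.

```cpp
#include <cstdint>

// Hypothetical recovery steps; the caller decides how to carry them out.
enum class Action { None, EndSession, Reset };

// Hypothetical helper: tracks whether a publish was echoed back by the cloud.
class PublishWatchdog {
public:
    PublishWatchdog(uint32_t timeoutMs, int maxFailures)
        : timeoutMs_(timeoutMs), maxFailures_(maxFailures) {}

    // Call right after sending a publish.
    void markSent(uint32_t nowMs) { sentAt_ = nowMs; waiting_ = true; }

    // Call from the subscribe handler when the event comes back.
    void markConfirmed() { waiting_ = false; failures_ = 0; }

    // Call periodically; returns the recovery step to take, if any.
    // Unsigned subtraction keeps the comparison valid across timer wraparound.
    Action poll(uint32_t nowMs) {
        if (!waiting_ || nowMs - sentAt_ < timeoutMs_) return Action::None;
        waiting_ = false;
        ++failures_;
        if (failures_ >= maxFailures_) {
            failures_ = 0;
            return Action::Reset;      // escalate after repeated misses
        }
        return Action::EndSession;     // first miss: try forcing a new session
    }

private:
    uint32_t timeoutMs_;
    int maxFailures_;
    uint32_t sentAt_ = 0;
    bool waiting_ = false;
    int failures_ = 0;
};
```

The escalation to `Action::Reset` after repeated misses is a design choice in this sketch, since the session-end publish only helps if the cloud still sees the connection.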

To recap, the device is waking, taking sensor measurements, turning on the modem, connecting to the cell network (via Hologram SIM) successfully, sending packets (according to both the serial monitor and the Hologram dashboard), and then going to sleep. But the publish is not being seen on the Particle Cloud. In fact, even though I can see the forced handshake seemingly publish (via serial monitor), the Particle Cloud never registers a handshake.

Any other ideas?

This honestly sounds like either the UDP packets are getting lost over the air and not making it to Particle, or there is an issue in the Particle backend. I'm guessing it's something in the backend that they need to fix, but I don't know. I just know it's consistently the same devices dealing with this problem.

Have you tried removing the battery for at least 5 seconds and then plugging it back in? Does it then handshake? If it doesn't, then something is really messed up. I tried doing a modem reset as suggested earlier, and even included code to force a handshake, but it only works sometimes.

Yes. An unplug of the battery will force a new handshake. In fact, a press of the reset button will even force a new handshake. Then everything works again as intended. Unfortunately, my devices are in enclosures. We do have a reset button that grounds the reset pin but this does not seem to work the same as the reset button on the board.
At any rate, after forcing a handshake it will run fine for maybe a month, coming out of sleep, checking sensors, and publishing several times a day like clockwork. Then suddenly I will find devices in this state. So far I have seen this issue on at least 4 devices.
As a failsafe, I have a function that runs the end-session publish that Rick suggested, but that only works if the Particle Cloud is already recognizing the connection, so it sort of defeats the purpose. What I would love is to replace that particular function with the equivalent of what happens after a battery pull or an actual reset. Although not ideal, that would be a great failsafe for this issue. But I haven't been able to figure out what exactly happens there that is different from a soft reset or grounding the reset pin.

This sounds a bit like a well known effect when running into heap fragmentation (often caused by extensive use of String objects).

I've seen this said a lot on Particle's forums, and there was certainly some truth to it back in the AVR Arduino days, but I wonder if today it's really a red herring. The major issue with String dates back to an avr-libc bug in free(). This was resolved (Google Code Archive - Long-term storage for Google Code Project Hosting.) and merged into the main Arduino codebase (Backported malloc and realloc from avr-libc 1.8.0 (without test code) by cmaglie · Pull Request #1329 · arduino/Arduino · GitHub).

Of course, Particle, being STM32-based, would never have pulled from avr-libc, so perhaps it has its own free() bug, or there's some other issue. However, The HATRED for String objects - "To String, or not to String" - Programming Questions - Arduino Forum has a very nice conversation about the true pitfalls, and there seems to be a firm consensus that String isn't inherently a problem; it's the implementation that can be.

The issue is not that you are running out of free heap memory as such (as may have been the case with a buggy free() implementation), but that, due to the lack of garbage collection, the heap will fragment over time. Eventually, even though there is still enough free space in total, no single block big enough for a particular allocation request can be found.
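
To make that concrete, here is a toy model of the effect. It is only a sketch: `ToyHeap` is a hypothetical first-fit allocator over fixed-size units and is not a model of the allocator in Particle's firmware.

```cpp
#include <cstddef>
#include <vector>

// Toy heap: each slot is one allocation unit; true means in use.
// Models first-fit placement only (an assumption for illustration).
class ToyHeap {
public:
    explicit ToyHeap(std::size_t units) : used_(units, false) {}

    // First-fit: returns the start index, or -1 if no contiguous run fits.
    long alloc(std::size_t n) {
        if (n == 0) return -1;
        std::size_t run = 0;
        for (std::size_t i = 0; i < used_.size(); ++i) {
            run = used_[i] ? 0 : run + 1;
            if (run == n) {
                std::size_t start = i + 1 - n;
                for (std::size_t j = start; j <= i; ++j) used_[j] = true;
                return static_cast<long>(start);
            }
        }
        return -1;
    }

    // Marks n units starting at 'start' as free again.
    void release(long start, std::size_t n) {
        for (std::size_t j = 0; j < n; ++j)
            used_[static_cast<std::size_t>(start) + j] = false;
    }

    // Total number of free units, contiguous or not.
    std::size_t totalFree() const {
        std::size_t count = 0;
        for (bool u : used_) if (!u) ++count;
        return count;
    }

    // Length of the longest contiguous run of free units.
    std::size_t largestFree() const {
        std::size_t best = 0, run = 0;
        for (bool u : used_) {
            run = u ? 0 : run + 1;
            if (run > best) best = run;
        }
        return best;
    }

private:
    std::vector<bool> used_;
};
```

Fill a 16-unit heap with eight 2-unit blocks, then release every other one. `totalFree()` reports 8 units, but `largestFree()` is only 2, so `alloc(4)` fails even though half the heap is free.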

But I agree, it's not the implementation of String in itself nor the mere use of it. Rather, because String objects mutate, using them without care and letting them grow and relocate tends to leave a trail of smallish free blocks on the heap that may eventually lead to issues.

In the linked forum discussion it's pointed out that a proper implementation of free() will merge the block being freed with any free memory immediately before and after it. The claim is that as a result fragmentation isn't really an issue (unless long strings are being used). This certainly rings true and was borne out in their testing.
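
As I understand the merging behavior, it looks roughly like the sketch below. The `FreeList` map and `freeCoalescing` function are hypothetical names for illustration; real allocators keep this bookkeeping in block headers rather than a `std::map`, and this is not avr-libc's or Particle's actual code.

```cpp
#include <cstddef>
#include <iterator>
#include <map>

// Free list keyed by block start offset -> block size (illustrative model).
using FreeList = std::map<std::size_t, std::size_t>;

// Returns a block to the free list, merging with free neighbours on either side.
void freeCoalescing(FreeList& fl, std::size_t start, std::size_t size) {
    auto next = fl.lower_bound(start);

    // Merge with the preceding free block if it ends exactly where this one starts.
    if (next != fl.begin()) {
        auto prev = std::prev(next);
        if (prev->first + prev->second == start) {
            start = prev->first;
            size += prev->second;
            fl.erase(prev);
        }
    }

    // Merge with the following free block if it starts exactly where this one ends.
    if (next != fl.end() && start + size == next->first) {
        size += next->second;
        fl.erase(next);
    }

    fl[start] = size;
}
```

Freeing offsets 0-3 and 8-11 leaves two separate free blocks; freeing 4-7 afterwards merges all three into one 12-unit block. Note that merging only helps when the neighbours are actually free: a single live block sitting between two free ones still keeps them apart.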

Of course, that discussion is somewhat academic since avr-libc behavior doesn't carry over to STM32. Still, I wonder if there have been known cases in the wild where Particle devices have rebooted because of stack/heap collision resulting from overuse of small strings.

It is a real problem and it happens all the time. The problem is not that memory is never freed, it is heap fragmentation. Maybe the Particle free implementation is less good or maybe the allocation pattern between the user code and particle cloud code is just pessimistic but the problem is real.

It is very easy to use Strings in a casual way that uses a lot of heap memory, and it is hard to use Strings in a way that is statically allocated and memory friendly. It is not impossible to use Strings, it is just hard to use them well, and over time (typically days to weeks) heap fragmentation eats up bits of RAM to the point where some large chunk like a TCP buffer cannot be allocated and your program crashes.

If you want to use Strings, globally allocate them and always use clear and append to set them. But then you might as well use char arrays and be done with it.

This is interesting. I do use String objects. Showing my naivety, I didn't realize they could even be an issue.
That said, my program never crashes. It continues to run dutifully. The sensor data is still collected and the Particle.publish still runs. It just isn't seen on the SSE stream until I somehow force a handshake.
Knowing this, do you still think this could be the issue? If so, I will do some work on reducing the use of String objects in my user code.
I am currently letting a device run with a trace log level to help track down this problem. But it might be a few weeks before I see anything.

Could be. It is hard to say without seeing the code. The typical symptom is an SOS crash with 8 red flashes indicating out of heap memory, but some things just block forever trying to allocate a large enough block of memory.
