Periodic Spark Core Reset

My co-developer and I each have a Spark Core programmed to receive and log sensor trips to a RAM-based circular buffer. We have both experienced random resets of the Core when the Core is apparently doing nothing (other than running the loop code of the firmware). The Cores are in different locations with different power and different WiFi connections to the Internet.

The interval between resets is several hours to a day or more, and they appear to occur randomly. The firmware running on my Core synchronizes time with the cloud once per day and publishes sensor trip events to the cloud when they are received, albeit the resets do not appear to happen on either of these actions. My co-developer has more Spark.publish() events and seems to get the reset more frequently, albeit not coincident with any particular event.

Although the resets are infrequent and seemingly random, we have each actually observed it happening once. Here is what happens:

  • the Core is breathing Cyan, normally.
  • suddenly, the Core starts flashing cyan or white (hard to tell) rapidly.
  • after a second or two of this, the Core starts flashing red. The red flashes appear to be the SOS message followed by some flashes. It is hard to count the flashes accurately and this only happens once, but it is possible that the number of flashes is 8 (code for out of heap space).
  • after one sequence of red flashes, the Core seems to reset. It flashes green (WiFi connection), then fast purple or white, and then it is back up breathing cyan again and running the firmware just fine, albeit after having run setup() and reset everything (thus losing our sensor log information).

When I observed this on my Core, I can guarantee that there were no sensor trips or anything that might be more than the regular loop() activity, which had been running this way for over 23 hours previously with no problem at all. I had my Core powered via a USB port on a laptop with power saving turned off, in order to guarantee that the resets are not caused by a power supply problem or AC glitch.

It would appear that the Core is frequently communicating over the cloud (at least opening or restoring a cloud connection) and that occasionally there is some communication delay or something that crashes the cloud communication code which in turn resets the Core. Why the Core would suddenly run out of heap space (if that is what the problem really is) is totally unexplained.

Anyone have any idea what is happening here? An unplanned reset even once per day, and even automatically recovering, is unacceptable as we lose our sensor data. Yes, the data could be logged to some external non-volatile storage, but we believe that these resets should not be happening. Since our firmware does not seem to be causing them, we don’t know how to solve this problem.

Any help would be appreciated. Even just an understanding of what the Core is doing in the “background” vis-a-vis cloud communication might be helpful.

Just to clarify – our firmware has 3 Spark.function() and 4 Spark.variable(), in addition to a Spark.time() and Spark.publish(). I am now re-testing with Spark.time() and Spark.publish() commented out.

To help us help you, could you show your code?

If it actually is a heap problem, are you using dynamic memory? That could be explicit, implicit (by frequently creating and destroying objects), or unnoticed, e.g. through String operations that trigger reallocation.

I’ll have to find a place to store the code to share it – it is large. And yes, there are string manipulations in the code that can dynamically allocate memory, albeit we have taken care to prevent memory leaks. But I think that the most important point here is that these resets happen when nothing at all is changing, no strings are being manipulated, nothing is being logged, etc. It happened again this morning, after about 22 hours of OK operation. The firmware is in loop() and is looking for new sensor trip data, but there is no new sensor trip data in these tests because the sensors are under my control and not being operated (and I have verified this by reading the logs periodically). So the code is in a simple loop that is doing nothing at all that is in any way memory related. It is not logging anything new, not writing anything anywhere, not building strings – nothing but reading inputs and deciding that nothing is there to process. Furthermore, I had previously stress tested the firmware with a sensor simulator that generated a new sensor event once every three seconds (more than 100 times faster than in normal operation). I ran this stress test for over 14 hours (nearly 15,000 sensor events) with no problems at all. If there was a memory leak in my firmware, it would have shown up in the stress test and not when there are 50 sensor events per day.

I believe that what is happening is that the Core is continually connecting or communicating to the cloud – just as a heartbeat (since I am not actually changing anything in this test). Very occasionally, there may be a slow Internet connection or something that causes this background connectivity to go into red flashing and the microprocessor on the Core to reset (or at least go back and execute setup() ). But this is just a theory, as I don’t understand what is actually happening vis-a-vis cloud communication when the firmware is just sitting there looping with nothing new to process. The problem is not momentarily lost cloud connections because the system has been designed to be tolerant of this (via the circular buffer log in the RAM). The problem is that my Core is resetting randomly, when nothing at all is happening, which effectively wipes out the RAM (re-initializes the buffer pointers, in any event).
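A minimal sketch of the kind of RAM-based circular buffer log mentioned here, written in plain C++ rather than taken from the actual firmware (which is not shown in this thread). The names TripEvent and TripLog and the record fields are hypothetical. All storage lives inside the object, so logging itself never touches the heap; when the buffer is full, the oldest entry is overwritten.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical record type; the real firmware's record layout is unknown.
struct TripEvent {
    std::uint32_t timestamp;
    std::uint8_t  sensorId;
};

// Fixed-capacity circular log with no dynamic allocation.
// When full, push() overwrites the oldest entry.
template <std::size_t N>
class TripLog {
public:
    void push(const TripEvent &e) {
        slots_[head_] = e;
        head_ = (head_ + 1) % N;
        if (count_ < N) ++count_;
    }

    std::size_t size() const { return count_; }

    // Index 0 is the oldest entry still stored.
    const TripEvent &at(std::size_t i) const {
        assert(i < count_);
        std::size_t oldest = (head_ + N - count_) % N;
        return slots_[(oldest + i) % N];
    }

private:
    TripEvent   slots_[N] = {};
    std::size_t head_  = 0;  // next slot to write
    std::size_t count_ = 0;  // number of valid entries
};
```

A global such as `TripLog<128> tripLog;` would be placed in static RAM at link time. Like any ordinary RAM variable, it is re-initialized when the firmware restarts, which matches the data loss described above.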

Hi @BobG

I am 90+% sure you have a memory problem because I have seen these symptoms before. This type of problem can be fixed but it often requires a bit of a rewrite.

What typically happens is that the Spark cloud connection, running in the background (outside of loop), allocates and deallocates memory dynamically when it needs to connect. Normally this isn’t a problem but when the last operation that your code did causes the heap to be fragmented such that it cannot allocate a large enough chunk of memory for the Spark cloud, it will fail in the way that you described. This can happen up to ten seconds after the operation that caused the heap fragmentation.
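To make the fragmentation idea concrete, here is a toy model in plain C++. It is not the Core's real allocator: HeapModel and makeFragmentedHeap are hypothetical names and the sizes are made up. It only shows how a heap can have plenty of free space in total and still be unable to satisfy one larger request.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy heap: each cell is either in use (true) or free (false).
// Illustration only; a real allocator is far more involved.
struct HeapModel {
    std::vector<bool> used;

    explicit HeapModel(std::size_t cells) : used(cells, false) {}

    // Mark cells [start, start + len) as in use or free.
    void mark(std::size_t start, std::size_t len, bool inUse) {
        assert(start + len <= used.size());
        for (std::size_t i = start; i < start + len; ++i) used[i] = inUse;
    }

    // Total number of free cells, contiguous or not.
    std::size_t totalFree() const {
        std::size_t n = 0;
        for (bool u : used) if (!u) ++n;
        return n;
    }

    // Length of the longest run of adjacent free cells.
    std::size_t largestFreeBlock() const {
        std::size_t best = 0, run = 0;
        for (bool u : used) {
            run = u ? 0 : run + 1;
            if (run > best) best = run;
        }
        return best;
    }

    // A request succeeds only if a single contiguous run is large enough.
    bool canAllocate(std::size_t len) const { return largestFreeBlock() >= len; }
};

// 100 cells with one small surviving allocation every 10 cells,
// as might be left behind by many short-lived temporary objects.
HeapModel makeFragmentedHeap() {
    HeapModel h(100);
    for (std::size_t i = 0; i < 100; i += 10) h.mark(i, 1, true);
    return h;
}
```

In this model 90 of the 100 cells are free, yet no free run is longer than 9 cells, so a request for 20 contiguous cells fails.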

String operations that return new String objects, like substring, are particularly problematic. You don’t have to call new or String to be allocating new objects. Moving your code to statically allocated char arrays, or to Strings that don’t change size or return new Strings, will generally fix the problem.
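As a rough sketch of the static-allocation approach, the following plain C++ formats a message into one fixed buffer with snprintf instead of concatenating Strings. The function formatTripEvent, the buffer eventBuffer, the 64-byte size, and the message layout are all assumptions for illustration, not part of any Spark API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>

// One statically allocated buffer, reused for every message.
// The 64-byte size is an assumed value.
static char eventBuffer[64];

// Hypothetical formatter: writes into a caller-supplied buffer and never
// allocates. Returns the message length, or -1 if it did not fit.
int formatTripEvent(char *out, std::size_t outSize,
                    int sensorId, unsigned long timestamp) {
    assert(out != nullptr && outSize > 0);
    int n = std::snprintf(out, outSize, "sensor=%d;time=%lu", sensorId, timestamp);
    if (n < 0 || static_cast<std::size_t>(n) >= outSize) {
        return -1;  // encoding error or truncated output
    }
    return n;
}
```

A call such as `formatTripEvent(eventBuffer, sizeof(eventBuffer), 3, 1234UL)` leaves the text in eventBuffer, which could then be handed to whatever publish or logging call the firmware uses. Because the buffer never changes size, repeated calls leave the heap untouched.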

As @ScruffR said, if you can share your code with us, we can help. You can also try boiling your code down to the minimum that will fail.

Thanks to both bko and ScruffR for your help and assistance. I will happily share the code with you but it is way too long to paste here. I can email it to you or post it somewhere - suggestions?

@bko: everything you have said makes sense to me and I do think that the problem may be heap fragmentation. The only issue that I have is that there is no activity at all happening within 10 seconds of the reset (no string building, etc.). The loop code is only reading and deciding that there is nothing new to do when the reset happens. This is the very strange part. Do you know how often this background cloud activity runs? It may be that the heap got fragmented enough hours before but the background activity didn’t take place until just prior to the time of reset. Does this make any sense to you? Do you know what size block of free RAM the background cloud communication needs? Does this vary if my Internet connection happens to be slow when it fires off (which might account for why it happens so infrequently)?

My large circular buffer takes up a few KB of RAM. As the next test, I have reduced the size significantly to see if extra free RAM makes any difference. We will see in a few hours or a day or so. Statistics:
Normal code: RAM = 16,196 (79.1%); Flash = 88,152 (79.7%)
New test: RAM = 14,720 (71.9%); Flash = 88,152 (79.7%)
“Blink”: RAM = 13,088 (63.9%); Flash = 77,356 (69.9%)

Thank you both for your help.

Hi @BobG

To put your code online, you can use Github or gist or pastebin or any other web service with a public folder like Dropbox or Box.

I don’t know how much RAM the cloud needs to establish a connection. I am sure part of that is how much RAM is used for the item in question since it could be a short 64-byte publish string or a long 622-byte string variable.

It is now 49 hours of running my code with reduced RAM (1.5KB less) and I am happy to report that no resets of the Core have happened. Thanks bko and ScruffR for suggesting that this is a RAM problem. There are places where I do string building in a manner that can cause heap fragmentation, and I will clean this all up. The issue remains, however, as to exactly what triggers this reset, as my code was doing nothing at all but looping, waiting for some data (which never comes because I am not generating any). There are functions and variables exposed to the cloud, but I am not accessing them. There are no publications, I have disabled time synchronization, and there is certainly no string building or other such activity going on at this time. So the trigger is some background activity that Spark must run periodically.

I did happen to see an “event” this morning. I just happened to be sitting near my Core and checking my e-mail when the Core suddenly started flashing white (perhaps cyan - hard to tell). The flashing went on for some time – 10 seconds or more – enough that I was wondering whether it would recover on its own. Then, it flashed red three times (no SOS) and came back breathing cyan. At that point, I found that my code was running just as it was before the event – no reset, everything restored and OK.

SPARK TEAM: we users would appreciate some more information about what “background” activities run on the Core, how often, and what they do, etc. Is there some “heartbeat” to maintain connection to the cloud? How often does this run? What resources does it normally consume (time, RAM, etc.)? What happens if there is a slow Internet or other issue with the heartbeat? How does it recover (I can see that it does) and what resources are used in the recovery? And finally, are interrupts disabled when these things happen and if so, for how long? It is obvious that the Core-cloud communication must be maintained, but it is also important for us to know that real-time code can run and not run out of resources or be unable to respond to interrupts or test inputs without losing track of the world that it is controlling. We need enough information to properly code around the necessary cloud communication activity so as not to lose valuable state data. Thanks.