Another unusable, never-connect Boron situation (Update: Console free RAM reporting issue)

  1. My valid, error-free code causes repeatable hard fault (red LED SOS) panic crashes on 1.3.1-rc1. It then reboots and reconnects fine.

  2. On higher Device OS versions, the same code does not cause the hard fault panic (further evidence that the fault is not in my code).

  3. However, on higher Device OS versions, as I have thoroughly documented in my past threads, and have seen anew for myself in recent days, cellular reconnection is utterly busted, and my device will go for long periods (perhaps eternally) flashing green.

  4. Now I cannot even get it to connect with my code on 1.5.2. Tinker on 1.5.2 connects, but my simple code hangs in a seemingly eternal green flash with SEMI_AUTOMATIC mode, Cellular.on(), and Cellular.connect().

  5. You may say, “grab the cellular logs using this: https://github.com/rickkas7/boron-clouddebug”. But this utterly fails. When I try to flash boron.bin, it says “Error writing firmware: CRC is invalid, use --force to override”. When I then force-flash and run particle serial monitor, I get nothing; there is no cloud information.

  6. Catch-22: I can’t have my Boron not crash with valid code unless the Device OS is newer than 1.3.1-rc1, yet I can’t have my Boron stably reconnect to the cloud (or, it seems right now, even connect at all from a cold power cycle) unless I stay on v1.3.1-rc1.

  7. Help. Does anyone else have these kinds of ridiculous-seeming issues? My old code, which uses MQTT to upload, does not cause the red panic issue, but I am trying to cut data usage by dropping MQTT and using Base64RK + Particle.publish to do the uploading. I am exasperated by this dysfunctional Boron platform, but at the same time I’ve made my custom PCBs for it, and I have a few mysterious units in the field which decided to be stable. My newest installation is not one of them: that Boron died and still hasn’t reconnected since Oct 8th (8 days ago), while my ElectricImp imp006 in the same spot, using the same power supply and receiving the same UART data from my sensor controller, has continued to upload reliably.

Further information:

  1. Now it connects when I change:

    retained uint8_t publishQueueRetainedBuffer[141*20];
    PublishQueueAsync publishQueue(publishQueueRetainedBuffer, sizeof(publishQueueRetainedBuffer));

to

    retained uint8_t publishQueueRetainedBuffer[141*10];
    PublishQueueAsync publishQueue(publishQueueRetainedBuffer, sizeof(publishQueueRetainedBuffer));

  2. It is absurd that using a little more stack memory causes the initial connection to the cloud to utterly fail.

  3. Why do my vitals show “117kB of 146kB RAM used” if I am in fact at the upper end of RAM usage, such that using a mere 1,410 more bytes of statically allocated stack causes it to fail?

  4. Why does the cloud compile fail with a memory overflow when I try to use “21*141” for the retained buffer size instead of “20*141”? Why is it behaving as if I am at the upper threshold of stack memory usage when the cloud console tells me I have far more free memory (“117kB of 146kB RAM used”)? This contradictory information is what caused this confusion and issue.

Thank you.

Now it won’t ever connect on 1.5.2, even when I take the memory way down, whereas Tinker connects. This is insane and inexplicable. Just an hour ago the same code would connect just like Tinker does. How many bytes of memory is someone allowed to statically allocate in order for the entire Boron product not to be useless and unusable, never connecting to the cloud?

UPDATE: After a power cycle, it connected, then immediately disconnected, and is now stuck flashing green. I can’t understand why Particle’s firmware is so fickle, random, sensitive, deceptive (saying there’s tons of RAM available when using just ~1 kB more prohibits all cellular connection), and unreliable.

It is possible that you are running into a recently discovered situation: if there is not 4K of contiguous free RAM available at the time of cellular connection, a Gen 3 cellular device (Boron, B Series SoM, Tracker SoM) can enter an infinite blinking-green state. This happens at least on 1.5.0 through 2.0.0-rc.2, possibly other versions as well. It can happen at startup or on reconnect.

To determine if this is the case, make sure your device is running 1.5.2 or later.

Add:

SerialLogHandler logHandler(LOG_LEVEL_TRACE);

to your code if you have not already done so.

Monitor the USB serial debug output.

If you see this message, repeatedly, while blinking green:

0118621303 [gsm0710muxer] INFO: Starting GSM07.10 muxer
0118621303 [ncp.client] ERROR: Failed to perform early initialization
0118621304 [ncp.client] TRACE: Hard resetting the modem

then you are experiencing this issue. The only solution right now is to make sure there is more RAM available.

Since you are apparently at the edge of available RAM, adding a RAM allocation failure handler may also be helpful for debugging:

Add a global:

int memoryFailure = -1;

Register an out of memory handler in setup:

    System.on(out_of_memory, outOfMemoryHandler);

A handler function:


void outOfMemoryHandler(system_event_t event, int data) {
    memoryFailure = data;
}

And from loop(), check the status:

    if (memoryFailure >= 0) {
        Log.info("out of memory event size=%d", memoryFailure);
        memoryFailure = -1;
    }

(The out of memory handler is called from the allocator’s thread. Since this may be a different thread with a small stack, calling Log.info from that context is not recommended, so a flag is set and is logged from loop.)
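Pulling the snippets above together, the flag-and-check pattern can also be sketched portably. Note this is my consolidation, not rickkas7's exact code: std::atomic (my addition) avoids a torn read when the allocator's thread and loop() run concurrently; on-device, System.on(out_of_memory, …) and Log.info would replace the stand-ins here.

```cpp
#include <atomic>

// Portable sketch of the handler pattern above. The handler runs on the
// allocator's thread, so it only records the failed allocation size;
// the main loop polls, reports, and clears the flag.
std::atomic<int> memoryFailure{-1};   // -1 = no failure recorded

void outOfMemoryHandler(int failedSize) {
    memoryFailure.store(failedSize);  // no logging from this context
}

// Called from loop(): returns the failed size once, then resets to -1.
int takeMemoryFailure() {
    return memoryFailure.exchange(-1);
}
```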

The forever-blinking-green muxer initialization failure will log an out-of-memory message for 4096 bytes.


Thank you @rickkas7. You are a genius, and my frustrations with the Boron platform are not directed at you. I will try this. I am also trying your PublishQueueAsyncRK library for the first time, with the goal of replacing MQTT and saving data.

Before I perform @rickkas7’s procedure, let me share a few things:

  1. My impressions are highly consistent with his theory that there needs to be at least 4 kB of contiguous free RAM or else the device will remain permanently disconnected (gee, nice to know that after 17 months of working with the Boron and trying to make it stable);

  2. This seems to further validate my previous claims that something after 1.3.1-rc1 ruined stability; and

  3. My experience today further validates that cellular reconnection is broken in Particle Device OS versions after 1.5.2. Yes, 1.5.2 as compared to 2.0.0-rc1/2. If you remove the antenna on 2.0.0-rc1/2 and reattach it, the device stays flashing green for a long, long time, or permanently. If you do the same on 1.5.2 or 1.3.1-rc1, it starts flashing blue to reconnect to the cloud literally the instant you reattach the antenna.

I feel better knowing that 1.5.2 has the same memory-leak cellular reconnection fix as 1.3.1-rc1 but has introduced the 4 kB issue.

So I am right that the only stable cellular version is circa 1.3.1-rc1 (although, for unknown and distinct reasons, it causes a red SOS panic crash with my code, unlike 1.5.2 and newer).

So I will see if I can get away with 1.5.2 and 4kb+ free RAM.

It is very frustrating, @rickkas7, not being able to go over 2048 bytes for the buffer with your PublishQueueAsyncRK library. A competitor offers a similar library that uses the full 4 MB of the built-in SPI flash chip.

Can you explain why I am at the upper end of free RAM when the Console is reporting “119kB of 147kB RAM used”? I should be able to safely make this buffer a whopping 27 kB given the Console information. Thanks.

Do you have any other retained variables in your code?
The Backup RAM section is not an arbitrary portion of RAM that can be placed anywhere in the RAM address space, but a dedicated area of (fixed) 3068 bytes. Hence the System.freeMemory() reading tells you little about the space available for retained variables.

AFAICT it’s not allocated on the stack either.
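That fixed Backup RAM budget can be sanity-checked with a little compile-time arithmetic. A minimal sketch, assuming the 3068-byte figure above and the 141-byte entry size used earlier in the thread; note the system may reserve part of Backup RAM for its own retained data, so the practical limit can be lower than the raw division suggests (as the failing 21-entry build showed):

```cpp
#include <cstddef>

// Backup RAM budget per ScruffR (Gen 3): a fixed, dedicated region,
// separate from the ordinary heap/stack that System.freeMemory() reports.
constexpr size_t kBackupRamBytes = 3068;
constexpr size_t kEntrySize = 141;   // per-event size used in this thread

// Raw upper bound on retained queue entries (ignores any system-reserved
// portion of Backup RAM, so the real limit may be slightly lower).
constexpr size_t kMaxEntries = kBackupRamBytes / kEntrySize;

// Compile-time guard for the 20-entry buffer actually used above.
static_assert(20 * kEntrySize <= kBackupRamBytes,
              "retained buffer exceeds Backup RAM");
```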

@ScruffR Thank you for illuminating the separateness of the “retained” SRAM memory which previously I did not realize. That would explain why I have tons of free RAM at the same time the compiler fails when my retained buffer is too big.

But this now makes it confusing why my Boron refused to connect to the cloud unless my retained buffer was small.

@rickkas7’s theory seemed likely, but your revelation indicates I had plenty of free RAM even while my use of most of the retained SRAM caused it to never connect.

It seems like there is another problem condition which I have run into, separate from <4kb RAM free, which will cause permanent disconnectivity.

I have the feeling this is a (somewhat) separate issue which only indirectly coincides with the amount of used retained memory.

Take this with a grain of salt but while Backup RAM is a “special” part of the RAM space it’s not entirely separate or barred off. So it may well be used for non-retained data when the system decides to do so (e.g. for heap allocation).
Another factor to consider is fragmentation of your plentiful RAM.
As Rick said, the issue arises when the system cannot find a contiguous block of 4 KB - if you happen to have a lot of free memory but that memory falls apart into loads of fragments smaller than 4 KB, the issue will still hit.

Another factor Rick didn’t mention but may play a role (pure conjecture here - to create a “viable” picture from only a few puzzle pieces available) would be if the system needed that 4KB block to live in a certain area of RAM (i.e. adjacent to Backup RAM).
This would also match my feeling that 3068 bytes is a somewhat odd number but may indicate that it’s only the remainder of a bigger area (with a nicer byte count :wink: ) that’s partly used for system data to be retained too.

BTW, independent of the “4KB bug” and Backup RAM topic, one thing you always need to consider on systems without garbage collection is heap fragmentation. Over time your free RAM may get fragmented, and the effects might hit unpredictably. Sometimes it causes the system to become unable to reconnect, sometimes it causes SOS panics, sometimes it causes code to fail silently - all depending on the code that tries to allocate dynamic memory.
Hence you need to check not only your own code but also the libraries you use for whether they may cause heap fragmentation. One well-known foe in that regard is the use of String where the strings keep growing/shrinking a lot or where many intermediate/temporary objects are created.
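One common mitigation for String-driven fragmentation is reserving the final size up front, so the buffer is allocated once instead of being repeatedly grown. A hedged sketch, with std::string standing in for Particle's String (both offer reserve(); the size estimate here is an assumption for illustration):

```cpp
#include <string>

// Sketch: pre-reserving a growing buffer trades many small reallocations
// (a fragmentation source) for one up-front allocation.
std::string buildPayload(int samples) {
    std::string out;
    out.reserve(8 * static_cast<size_t>(samples));  // rough size estimate
    for (int i = 0; i < samples; i++) {
        out += "sample,";   // appends without reallocating each time
    }
    return out;
}
```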
System.freeMemory() will not tell you about the “health” of your RAM space.
One way to debug such an issue would be to add some function that tries to allocate a fair sized chunk of data and reports whether it could or couldn’t. This function can be called on demand (e.g. via a dedicated button or serial command) whenever you find your device in some of these “limbo” states to find out whether this actually is the cause or not.
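That probe could be sketched like this, using plain malloc/free so it runs anywhere; on a Boron you would call it from loop() on a button press or serial command, and the 4096-byte figure mirrors the contiguous block the muxer reportedly needs:

```cpp
#include <cstdlib>

// Try to allocate a contiguous block of the given size; report success
// and immediately free it so the probe itself doesn't consume RAM.
bool canAllocate(size_t bytes) {
    void* p = malloc(bytes);
    if (p == nullptr) {
        return false;  // heap too fragmented (or too full) for this block
    }
    free(p);
    return true;
}
```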
