Serial1 RX randomly dies on Boron 2.0.1/red SOS crash with second Serial1.begin

My entire use case for Particle products is uploading, over LTE, binary UART packets sent once per minute by a Teensy 3.5 microcontroller.

I have a couple of remote installations running Boron 1.3.1-rc1 that do this great, but the entire platform is dysfunctional for me on 2.0.1, and I am fairly sure this issue was also happening on 1.5.x before I upgraded.

ISSUE: Serial1 RX randomly, repeatedly, eventually dies. It stops delivering data even though I have confirmed my once-per-minute UART packets are indeed being sent to it. Once it enters this state, a cloud function call that re-triggers “Serial1.begin(115200)” causes the Boron to hard-crash with a red SOS LED and reset. Re-calling “Serial1.begin(115200)” before the RX dropout happens returns normally, i.e., nothing happens, since Serial1 was already started; but it causes a kernel-panic crash in the state where Serial1.available() incorrectly returns 0.

Details: I am using SerialBufferRK because my packets are 256 bytes and the receive buffer is a frustrating 64 bytes, but I do not believe this is an issue with the SerialBufferRK library, because I have used it on older Borons where this problem never occurred. I believe this is a firmware issue Particle recently introduced. I can’t downgrade to 1.3.1-rc1 because I now use a base64 library that consistently crashes with a red SOS panic on that older version.

Does anyone else experience this? What is the cause?

Can you provide a minimal test code that exhibits this issue?
With the symptom description above it’s not easy to provide a definitive answer.

BTW, how many slow blinks follow the SOS signal?


Thanks @ScruffR.

I will have to invest more time to produce minimal example code, because I need to strip out the unused, non-influencing parts of my current code.

However, I can tell you the number of slow blinks from calling Serial1.begin in this state within just a few minutes, since the device predictably loses Serial1 input (it crashed 5 minutes ago, and it typically takes 5-10 minutes to enter the serial-loss state).

It just happened again; however, the way Particle modulates the red SOS light makes it incredibly difficult for me to discern the pattern. There is only a very small temporal difference between the pulse durations, and I am not an expert in Morse code. I’ll have to take a video of it in 5-10 minutes when it happens again and then scrutinize the LED pattern. Thank you for your help and patience.

Dear @ScruffR, after seeing it again and consulting https://docs.particle.io/tutorials/device-os/led/boron/#network-reset-fast-blinking-blue-, there is one (1) red blink after the SOS pattern, indicating, “1. Hard fault”.

This is what happens when I call Serial1.begin again during the dysfunctional condition in which Serial1 was not showing bytes it was actually receiving.

I did notice that, once I called Serial1.begin again, it briefly resumed receiving data (for 1 or 2 seconds) right before the SOS crash.

So there is some low-level problem with 2.0.1, possibly interacting with SerialBufferRK, where the Serial1 RX goes totally haywire.

I will post minimum reproducible code when able.


I doubt this is related, but I was also getting a red SOS when using Serial on Device OS 2.0.0. However, I was not using Serial RX to talk to another device, just doing Serial.println(). In my case, I converted all my Serial.print calls to Log.info and the issue went away. Just figured I’d share in case it could be related somehow. Good luck!

Hello @ScruffR and @Paul_M

I’m facing the same problem. I spent a lot of time debugging my own code but ended up narrowing the problem down to the Serial1.read functions.

As far as I can tell, given enough received data over time, or a continuous burst of data, some condition is hit and the firmware crashes.

Here is my minimal code for testing:

#include "Particle.h"
#include <functional>

SYSTEM_MODE(AUTOMATIC);

Thread *mainThread;

void stuff()
{
    Serial1.begin(230400);
    char *buff;
    buff = new char[100];
    size_t ndata{0};
    while(1)
    {
        //Now we can send the request and read the response
        memset(buff, 0, 100);
        Serial1.write("REQUEST:READ_ALL:\r\n");
        ndata = Serial1.readBytesUntil('\n', buff, 100);

        Serial.printf("Read: %u --", (unsigned)ndata);
        for(size_t i = 0; i < ndata; ++i)
        {
            if(buff[i] >= '!' && buff[i] <= '~')
            {
                Serial.printf("%c", buff[i]);
            }
            else
            {
                Serial.printf("|%d|", buff[i]);
            }
        }
        Serial.printf("-- \r\n");
    }
}
void SystemEvtsReport(system_event_t event, int data)
{
}

void setup()
{
    //Init onboard USB-Serial UART
    Serial.begin(115200);

    System.on(network_status | cloud_status, SystemEvtsReport);
    mainThread = new Thread("ccu_loop", stuff);
}

void loop()
{

}

and here is the Python code I’m using to send data:

import serial

ser = serial.Serial("/dev/serial0", 230400)

for x in range(100000):
	ser.write("hahashas".encode())

I’m aware that the internal serial buffer is 64 bytes per the documentation, so data loss is expected if you don’t read from Serial1 fast enough. However, it seems that once a significant number of bytes has been received (probably something above 64 bytes, judging by the difference between bytes read and bytes sent), a buffer overflow is happening somewhere and things go bad.

I have been looking at my code for longer than I should have, so if I missed any obvious errors in it, please let me know.

Also, I tested with both the 2.0.1 and 1.5.4-rc1 versions of Device OS, and the result is the same.
EDIT:
Actually, using 1.5.4-rc1 the board sometimes ends up in one of these states:

  • A halt: it just stops printing, and the RGB LED stops completely
  • Sometimes it goes into SOS
  • Sometimes it just stops printing the received data even though my script is still sending data via UART

It seems the behavior of 1.5.4 is not the same as 2.0.1, but considering that 1.5.4 has a lot more going on in the background, that seems like a reasonable scenario.

Device: Boron 2G/3G
DeviceOS: 2.0.1
#Blinks during SOS: 1 (Hard fault)


Can you provide a crashing example without the use of threads?
As this application note states, most APIs are not thread-safe.

Hello ScruffR,

I thought about that on my way home. I’ll try to swing by the lab today and test it without a new thread, or at least without using Serial (not Serial1).

Serial1 shouldn’t be shared with Device OS tasks, and only the “ccu_loop” thread would be using it, so it should still be a valid scenario if this is related to the producer/consumer rate.

Hi ScruffR,

So I created a test case without the use of threads, as you requested, and unfortunately the problem persists. For a moment I thought you were right on point with the non-thread-safe APIs, but after running the test a little longer the problem reappeared.

There are two scenarios:

  1. The firmware crashes and goes into SOS.
  2. The firmware keeps running, but Serial1 stops reading data from the serial port. Writes still work, as I can see my “REQUEST” message via minicom on the other side.
Read: 100 --kljfhsdkfjshdkflsjhfksdjhfslfjhsfldlfkjshahashashaAAAjdfkjdhfkjhfklsjfhskljfhsdkfjshdkflsjhfksdjhfsl--
Read: 100 --fjhsflkjsdhfkdjfhsdlfkjhsdlfkjshhfklsjfhskljfhsdkfjshdkflsjhfksdjhfslfjhsflkjsdhfkdjfhsdlfkjhsdlfkjs--
Read: 100 --hahashashaAAAjdfkjdhfkjhfkldjhfslfjhsflkjsdhfkdjfhsdlfkjhsdlfkjshahashashaAAAjdfkjdhfkjhfklsjfhskljf--
Read: 9 --hsdkfjshd--
Read: 0 ----
Read: 0 ----

It’s interesting to note that it tends to crash into SOS before it finishes connecting to the cloud, and it tends to stop reading if it does manage to complete the cloud connection.

Here is my non-thread version:

#include "Particle.h"
#include <functional>

SYSTEM_MODE(AUTOMATIC);

char *buff;

void stuff()
{
    size_t ndata{0};

    memset(buff, 0, 100);
    Serial1.write("REQUEST:READ_ALL:\r\n");
    ndata = Serial1.readBytesUntil('\n', buff, 100);

    Serial.printf("Read: %u --", (unsigned)ndata);
    for(size_t i = 0; i < ndata; ++i)
    {
        if(buff[i] >= '!' && buff[i] <= '~')
        {
            Serial.printf("%c", buff[i]);
        }
        else
        {
            Serial.printf("|%d|", buff[i]);
        }
    }
    Serial.printf("--\r\n");
}

void SystemEvtsReport(system_event_t event, int data)
{
}

void setup()
{
    //Init onboard USB-Serial UART
    Serial.begin(115200);

    Serial1.begin(230400);
    buff = new char[100];

    System.on(network_status | cloud_status, SystemEvtsReport);
}

void loop()
{
    stuff();
}

I also ran some tests using threads but without Serial.printf (trying to avoid conflicts), and it still crashes.


I’m thankful, @Gildons, that you were able to post your own minimized code reproducing this severe, critical issue before I could due to time constraints.

@rickkas7, we have demonstrated that recent Particle firmware updates have broken any beyond-momentary use of the Serial1 RX, and this completely kills my use case for Particle products.

I am not willing to revert to my stable 1.3.1-rc1 design because I end up paying twice (with the expensive 40-cent overage) using MQTT rather than Particle publish, and on the older version it’s impossible to upload binary data that way because base64 consistently crashes.

So when a competitor offers 10 cents/MB with a much more capable breakout board and a ton of working serial ports versus one broken serial, I could not justify it.

At the same time, I’d hate for this issue to be the killer of all the time I’ve invested in the possibility of Particle.

Do you know how to fix the Serial1 RX crash, @rickkas7? Thank you for your help.

@ScruffR,

If you have any other ideas or suggestions for tests to run on my side please let me know!

1 Like

Hi all,

Thank you for reporting this issue; I’ve relayed these concerns to our Engineering team and I’ll provide an update as soon as one is available.


I’m seeing a similar issue with the B5 SOM chipset.
In my case I’m communicating with a CAN bus module using Serial1. In some cases, writing causes a red SOS hard fault error.

Oddly enough, reading and writing Serial1 works fine in certain parts of the code but not in others. When I wrote a quick test program to look at this, it worked in some functions but not others, and I was able to get it to work predictably by moving the code around. It looks like something odd is happening at compile time.

The fix isn’t predictable, though: replicating the change that made my test code work doesn’t help in another project.

If it matters, I’m using the cloud compiler via the CLI.


Hi @all,

I’m currently suspecting that this problem is an edge case when pushing data from the USART into the middleware buffer (the 67-byte one).

With that in mind, I believe that I came up with a workaround while we wait for confirmation/fix of this issue.

I moved all my Serial1 calls (write() and readBytesUntil()) to a single point and surrounded them with Serial1.begin() and Serial1.end(). With that, I believe the internal buffer pointers are reset, and I can avoid the edge case as long as the message I’m trying to read is less than 67 bytes (in my case around 53).

        Serial1.begin(230400);
        Serial1.write("REQUEST:READ_ALL:\r\n");
        ndata = Serial1.readBytesUntil('\n', buff, 100);
        Serial1.end();

It looks like it’s working for now, but I hope we have a permanent fix soon!

1 Like

Hi @Gildons, I’m trying to reproduce this while using the Particle Debugger so that we can find a root cause. So far I have not been able to reproduce either scenario 1 (crash into SOS) or scenario 2 (no further serial input received) -- can you help me understand your test setup in more detail?

  • How do you force the device to crash into SOS? I have been starting the python serial sender script first, resetting the device and watching the serial output. For me, due to the AUTOMATIC mode of the example code, the cloud connection completes before the first Read: 100 appears in the serial output.

  • When you say that “Serial1 stops reading data”, do you mean that further sends from rerunning the python script are not received? For me, if I rerun the python script after it finishes, I can see that additional serial data is received and echoed by the firmware. One thing I added to the python script is a ser.close() at the end of the script.

Thanks for any help!

Hello @toddparticle,

Sure thing.

First, let me add some more information about the problem overall. The constant flow of data over the serial interface is not needed to make the problem happen; it was only a way for me to exercise the scenario. My actual code sends the request message, as shown in my test code, and receives back a message that is 53 bytes in size (terminated with ‘\n’, as implied by my readBytesUntil condition). I also want to point out that in this case I was running in a dedicated user thread.

As for the test variation without threads that ScruffR asked for, I ended up increasing the string sent by my script to around 100 characters (not terminated with ‘\n’).

  • I don’t need to do anything other than wait, although the SOS in this forced test seems to be more common with 1.5.4-rc1

  • Yes, rerunning or continuing to run the script results in Read: 0 ----, as a previous post shows. Calling ser.close() is a good call for the script, but I was personally running it from Python’s CLI, so I kept the port open between runs.

Let me know if you need more details!

Hi Folks,

Any update on this problem?
