Very simple SW, just sleep and publish, panics randomly

I have noticed that most of my Xenons reset several times a day, and the Particle console shows this:
“spark/device/last_reset panic, assert_failed”

I have reduced the SW to the bare minimum that still causes panics. It does nothing but publish and sleep.
Removing the sleep removes the problem, and so, it seems, does removing the publish. But this is pretty standard stuff and should work without problems.
This appears on Xenons, but I am not using any Xenon-specific features in the test program. Of course, the device is part of a mesh in order to publish, and I have one Argon in the mesh to connect to the internet.
I have also noticed that there are actually more resets than the console shows; the panic text is not always shown.
It behaves the same way with 1.4.4 and 1.5.0.

Is this a known problem? How can I debug it?

The code (could be even simpler than this):

unsigned long waitEndTime = 0;

enum State { wait, measure, ready } state = wait ;

void setup() 
{
    waitEndTime = millis() + 30*1000;
}

void loop() 
{
    if (state == wait)
    {   // Wait a while in beginning to allow cloud flash after reset. 
        // After this the SW is online so briefly that cloud flash is not possible
        if (millis() > waitEndTime){
            state = measure;
        }
    }
    else if (state == measure)
    {
        Particle.publish("test", "testing...", PRIVATE);
        state = ready;
    }
    else if (state == ready)
    {
        System.sleep(D2, RISING, 10);
        state = measure;
    }
}

There might be a race condition.
You may want to wrap your measure state block in an additional check for Particle.connected().

BTW, why are you not using switch() case for your FSM?
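
For illustration, here is a rough host-side sketch of the same state machine written with switch() and with the connected check folded into the measure state. `FakeDevice` and `step()` are hypothetical names made up for this sketch; the struct only stands in for millis(), Particle.connected(), Particle.publish() and System.sleep() so the logic can be tried off-device. It is not meant as a fix, only as a sketch of the structure.

```cpp
#include <cstdint>

// Hypothetical stand-in for the Device OS calls, so the sketch compiles off-device.
struct FakeDevice {
    uint32_t nowMs = 0;      // stands in for millis()
    bool connected = false;  // stands in for Particle.connected()
    int publishes = 0;       // counts would-be Particle.publish() calls
    int sleeps = 0;          // counts would-be System.sleep() calls
};

enum class State { Wait, Measure, Ready };

// One pass of loop(): performs the action for the current state
// and returns the state to use on the next pass.
State step(State state, FakeDevice& dev, uint32_t waitEndMs) {
    switch (state) {
    case State::Wait:
        // Stay awake for a while after reset so a cloud flash is still possible.
        return (dev.nowMs > waitEndMs) ? State::Measure : State::Wait;

    case State::Measure:
        // Publish only once the cloud connection is up; otherwise try again next pass.
        if (!dev.connected) {
            return State::Measure;
        }
        ++dev.publishes;  // on-device: Particle.publish("test", "testing...", PRIVATE);
        return State::Ready;

    case State::Ready:
        ++dev.sleeps;     // on-device: System.sleep(D2, RISING, 10);
        return State::Measure;
    }
    return state;  // not reached; keeps compilers quiet
}
```

On the device, loop() would simply do `state = step(state, ...)` with the real calls in place of the counters.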

OK, thanks. The goal is of course to wait until connected. I thought that publish would make sure of that.
I will update the code so that it loops until Particle.connected() returns true and then publishes. Let's see what effect that has. BTW, this is not mentioned in the documentation.

(Yes, switch-case is what is usually used. I just left it this way since it showed the problem; changing the code and testing the change takes about one day.)

Unfortunately it doesn’t (I don’t understand it either).

I added a test around the publish so that it publishes only when connected and then moves to the next state:

        if (Particle.connected()){
            Particle.publish("test", "testing...", PRIVATE);
            state = ready;
        }

Unfortunately, testing for connected did not have any impact. It still panics.
Any other ideas? Is no one else seeing this? It should happen with any Xenon that goes to sleep and then publishes. I have multiple Xenons with different SW, and all of those that follow this sleep/publish pattern panic too.

So the simplest program that panics is:

void setup() 
{
}
void loop() 
{
    if (Particle.connected()){
        Particle.publish("test", "testing...", PRIVATE);
        System.sleep(D2, RISING, 10);
    }
}

It is quite clear that there is no bug in this program.
Device OS has some faults that are related to waking up from sleep.

I guess this is something for the Particle folks to chime in on (@marekparticle, @avtolstoy).

However, since the Xenon won’t be getting too much attention anymore before its sunset with 1.6.0, can you reproduce the issue with an Argon or Boron?

My only Argon is on duty 24/7 and I do not have a Boron.
I am also afraid that this will not get fixed if it is related to mesh.

@no1089 is very plugged into Sleep right now, can you comment?


:slight_smile: I’m still figuring it out myself.
@meshmesh Are you running DeviceOS 1.4.4 or 1.5? 1.5 has an entirely revamped sleep configurator, that is a little tricky to figure out. You can’t use the old method of calling sleep any more.

One little note: Hibernate is not deep sleep, and Stop is standby - I might have that wrong too, so take this all with a grain of salt.

This is sleep on the Xenon with 1.5:

SystemSleepConfiguration config;

config.mode(SystemSleepMode::STOP).gpio(D0, RISING);

System.sleep(config);

Let me just double check how to wake up after x minutes.

This would contradict the STM nomenclature tho’
Stop and Standby are two entirely separate concepts as far as STM's naming goes.

Mingling them into one would probably cause a lot of confusion when reading older threads or seasoned developers transitioning to 1.5.0.

Stop sleep used to be System.sleep(wakePin, wakeEdge, period) while
Deep Sleep was STM’s Standby.

Wouldn’t that mean a breaking change that would require a major version number change to 2.x.x?

Minor version change allows for adding features and (slightly) altering behaviour, but breaking otherwise working code? I wouldn’t think so :flushed:


@no1089, I am using 1.5.0. The “old” sleep seems to work fine (most of the time), but Sleep 2.0 does not. I have another thread specifically about Sleep 2.0: Sleep 2.0 in 1.5.0 not working as expected. Maybe it is better to discuss the Sleep API there.

Ok, disregard my advice. I had myself very confused. Hibernate is the new Deep Sleep of the old Sleep method. :man_facepalming:

Let me get back to you on breaking changes.

So I was wrong again. The old API can still be used, but we encourage you to move code over to the new version.

I got the new API working. It seems that I have to connect manually after sleep, even in automatic mode. I will now test whether I still get panics with that.

I got a panic within two minutes, and another one two minutes later. It seems that with the new API it is even worse!

Send me your full code so I can check on my devices? You are running this on a Xenon?

SystemSleepConfiguration config;

config.mode(SystemSleepMode::STOP)
      .gpio(WKP, RISING)
      .duration(60s);
SystemSleepResult result = System.sleep(config);

That should work.

Yes, I am running this on a Xenon.
The shortest code that shows the panic problem is:

SYSTEM_THREAD(ENABLED);

SystemSleepConfiguration config;

void setup() 
{
    config.mode(SystemSleepMode::STOP)
       .gpio(WKP, RISING)
       .duration(5s);
}
void loop() 
{
    if (Particle.connected()){
        Particle.publish("test", "testing...", PRIVATE);
        System.sleep(config);
        Particle.connect();
    }
}

I got this problem to appear while running with the Particle Debugger. Call stack:

HAL_Delay_Microseconds@0x00039c54 (/home/tkk/projects/particle/device-os/hal/src/nRF52840/delay_hal.cpp:43)
panic_@0x000584fe (/home/tkk/projects/particle/device-os/services/src/panic.c:98)
__assert_func@0x0003b9ca (/home/tkk/projects/particle/device-os/hal/src/nRF52840/newlib.cpp:64)
ntf_enter@0x00074162 (/home/tkk/projects/particle/device-os/third_party/openthread/openthread/radio/nrf_802154_swi.c:340)
nrf_802154_swi_notify_receive_failed@0x00074282 (/home/tkk/projects/particle/device-os/third_party/openthread/openthread/radio/nrf_802154_swi.c:456)
nrf_802154_notify_receive_failed@0x00073f36 (/home/tkk/projects/particle/device-os/third_party/openthread/openthread/radio/nrf_802154_notification_swi.c:57)
receive_failed_notify@0x000751ae (/home/tkk/projects/particle/device-os/third_party/openthread/openthread/radio/nrf_802154_core.c:289)
irq_bcmatch_state_rx@0x0007643a (/home/tkk/projects/particle/device-os/third_party/openthread/openthread/radio/nrf_802154_core.c:1878)
irq_handler@0x0007643a (/home/tkk/projects/particle/device-os/third_party/openthread/openthread/radio/nrf_802154_core.c:2540)
nrf_802154_core_irq_handler@0x0007643a (/home/tkk/projects/particle/device-os/third_party/openthread/openthread/radio/nrf_802154_core.c:3077)
nrf_802154_radio_irq_handler@0x00074f02 (/home/tkk/projects/particle/device-os/third_party/openthread/openthread/radio/nrf_802154.c:241)
signal_handler@0x000749a0 (/home/tkk/projects/particle/device-os/third_party/openthread/openthread/radio/rsch/raal/softdevice/nrf_raal_softdevice.c:566)
??@0x00025cc4 (Unknown Source:0)

In ntf_enter, the code that asserts is:
assert(!ntf_queue_is_full());
Does that tell anyone anything?

I think I finally cracked this!
I debugged the problem and found out that the notify event queue (m_ntf_queue in device-os/third_party/openthread/openthread/radio/nrf_802154_swi.c) was full of NTF_TYPE_RECEIVE_FAILED events with error code NRF_802154_RX_ERROR_INVALID_DEST_ADDR. So other mesh devices were sending more frames than could fit into the event queue at the time. I also found out that this happens when another Xenon comes out of sleep. I have six other Xenons running, and three of them execute a periodic sleep with different intervals (5 min, 10 min and 1 hour). This caused a seemingly random pattern of panics.
When testing the sleep, I had one extra Xenon with a 5-second sleep interval, so the problem manifested almost immediately.
So if someone runs the test program without any other device executing a sleep sequence, the problem does not appear.

I have fixed this for now by ignoring these events: they come from frames that are not meant for this device, so they would eventually be more or less ignored anyway. I added the following in device-os/third_party/openthread/openthread/radio/nrf_802154_core.c:

/** Notify MAC layer that receive procedure failed. */
static void receive_failed_notify(nrf_802154_rx_error_t error)
{
    if (error == NRF_802154_RX_ERROR_INVALID_DEST_ADDR)
    {
        return; // this frame was not meant for us, ignore it
    }
    ...

The event then never gets inserted into the notify queue.
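
To illustrate the idea outside the firmware, here is a small host-side model of a bounded notification queue that receives a burst of receive-failed events before anything drains it. All names here (`RxError`, `NotifyQueue`, `deliverBurst`) and the capacity are assumptions made up for this sketch and are not the driver's actual types; `push()` returning false marks the point where the real code would hit `assert(!ntf_queue_is_full())`.

```cpp
#include <array>
#include <cstddef>

// Hypothetical error codes; only the "invalid destination address" case comes from the thread.
enum class RxError { InvalidFrame, InvalidDestAddr };

// Minimal bounded queue standing in for the driver's notification queue.
// The capacity is an arbitrary assumption, not the real queue size.
template <std::size_t Capacity>
class NotifyQueue {
public:
    // Returns false when the queue is full, i.e. where the firmware would assert.
    bool push(RxError e) {
        if (count_ == Capacity) {
            return false;
        }
        slots_[count_++] = e;
        return true;
    }
    std::size_t size() const { return count_; }

private:
    std::array<RxError, Capacity> slots_{};
    std::size_t count_ = 0;
};

// Feeds `frames` identical receive failures into the queue without draining it,
// as if they all arrived while the device was still busy waking up.
// With dropForeign set, events for frames addressed to other devices are discarded
// before they reach the queue. Returns true if the queue never overflowed.
template <std::size_t Capacity>
bool deliverBurst(NotifyQueue<Capacity>& q, std::size_t frames, RxError e, bool dropForeign) {
    for (std::size_t i = 0; i < frames; ++i) {
        if (dropForeign && e == RxError::InvalidDestAddr) {
            continue;  // same idea as the early return in receive_failed_notify()
        }
        if (!q.push(e)) {
            return false;  // overflow: this is the panic case
        }
    }
    return true;
}
```

In this model, a burst larger than the capacity overflows the queue when the filter is off, while with the filter on the foreign-address events never occupy a slot.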

This problem is probably related to receiving events from another device while this device is still waking up from sleep. It occurs more easily with tight sleep loops like the one in the example, but it occurs on all my Xenons that have a sleep cycle, regardless of the structure of the SW.

Should I create a bug report for this?
