E-Series resetting after publish

Hello—I’m seeing some unpredictable behavior with my E-Series when it’s in its publication sequence and I’m hoping someone may have some insight. The use case is that my sensor collects data for some time while out of cellular range and then when it comes back into range, it transmits an encoded version of the whole data file line-by-line in < 622 byte chunks. I’m using @rickkas7’s SPIFFS library though I think that’s unrelated to the problem (so sorry to bug you, @rickkas7!).

What I want to see (and sometimes do see) when I get a connection, is the continued looping through my file as it publishes line-by-line until there are no data left, then it goes to sleep.

However, sometimes what I see is it publishes only one line of data (confirmed in both the Console as well as in the endpoint to a webhook) after printing its publish_id to screen and then doesn’t print anything else to screen. It then does nothing that I can observe for ~ 10 seconds (it’s configured to publish roughly once per second) then resets, starting again at the very top of my script instead of staying within the loop where it should be for a couple minutes.

Here is the most relevant code chunk, I believe:

      // print out publish ID to screen
      Serial.println(publish_id);

      // publish data
#if USING_CELL_NETWORK_UPLOAD
      // I think a good idea is to use the MAC address of the device + date/time (filename)
      // if the publish is a success....
      // Serial.println("trying to publish...");
      if (Particle.publish(publish_id, data_to_publish, 60, PRIVATE))
#endif
        {

          Serial.println("Publish Success");

          // increment upload number
          upload_num++;
          // truncate the file
          bin_file.truncate(bin_file.length() - num_of_bytes_to_encode);
          bin_file.flush();

          // reset "no upload" timeout timer
          time_of_init_state_exit_ms = millis();
        }

#if USING_CELL_NETWORK_UPLOAD
        // if the publish wasn't successful
        else
        {
          // basically, don't truncate the last bit of the file so it will retry.
          // reset "no upload" timeout timer
          time_of_init_state_exit_ms = millis();
        }

        // keep connection alive (do once every 20 seconds or more often)
        Particle.process();
#endif

        // Debugging info about file length.
        Serial.println("File Length = " + String(bin_file.length()));

In my setup() loop, I have Serial.println("Reset Reason: " + String(System.resetReason())); so I can see that every time this happens and it wakes back up, it prints “Reset Reason: 0” but that’s not very descriptive based on the resetReason list that I’ve seen.

I have those #if blocks in there so that I can change the USING_CELL_NETWORK_UPLOAD define to zero to test everything else in my code without burning through a ton of data. Are those potentially problematic in the way they’re used here (I have a couple thousand lines of code so thought I should copy in only this snippet)? I wanted to copy enough to show that I have closed if and else blocks within the USING_CELL_NETWORK_UPLOAD scenario.

Some potentially relevant specs from the Console when I am seeing this bad behavior are:
88% battery charge
Good cellular signal
0 disconnect events
2579ms round-trip time
0 rate-limited publishes
31kB of 109kB RAM used

Anyone spot anything obviously wrong or poorly done here? Thank you!

It would be good to know some more about the context of these issues.
When you say “goes to sleep” what’s the sleep command you use?
How long is the device running between two occurences of this issue?
Since this seems to be connected to some kind of panic reset of the device, have you got any means to catch the SOS blink pattern before reset?
I see you’re using String in your snippet, so I guess you may be using it frequently in the other thousands of lines too, which can cause issues over time, so I always advise against String and rather use good-ol’ C strings (if you use deep sleep, heap fragmentation should be less of a problem than with stop mode sleep).
Not seeing the rest of you code, some additional things to look out for is tripple checking buffer integrity and boundary checks for any indexes you use, pointer validation, free memory trend during the lead-up to the issue, …

2 Likes

Thanks for jumping in to help, @ScruffR.

More recent test results first, then I'll answer your questions. After leaving it charging via USB overnight, I just reflashed the same firmware and observed the same behavior. Then I flashed very slightly modified firmware (I commented out the two #if/#endif blocks as I'd had #define USING_CELL_NETWORK_UPLOAD 1 anyway so no change, theoretically, by commenting out the if logic. Then I retested the upload and it worked as expected once. Then without (re)flashing anything, I tried another upload and it failed as previously described.

I've observed this previously where I had the exact same problem as described above then flashed with a minor change, observed proper behavior for a bit, then the problem returns. So that seems like an interesting clue, maybe free memory related?

As for your questions and comments,

      Cellular.on();
      System.sleep(SLEEP_MODE_DEEP);

I have it in SYSTEM_MODE(MANUAL) to save power by turning Cellular.off() when I know it's out of range. But I know to turn Cellular.on() just before putting it into SLEEP_MODE_DEEP in order to properly, fully turn everything off.

It depends. I have two different ways to reproduce it. One is a normal operation of the device which is that it is turned on via a WKP trigger and it logs data till it gets a stop recording command (more on this in a sec) then it begins to try to transmit over cellular. The other is that I have it in its CLI and issue a "U" command for upload which puts it in the exact same state as it would have entered via the actual use case described above. So in the former (normal operation/field deployment) case, it can be hours or days between occurrences; in the latter (CLI/benchtop) case, it can be minutes. The WKP and "Stop Recording" are both tied to a water sensor. When the water sensor is tripped, it hits the WKP pin and begins recording. When the water sensor circuit is open (out of water), it begins trying to send data via cellular.

The LED sequence I see (forgive me, I'm 90% sure this is correct but I'm red/green colorblind) is breathing cyan for a second or two (when cellular is turned on), then flashing green (seeking cell connection), breathing cyan for a couple seconds (transmitting that single message), then off. I have another red LED that is supposed to pulse at each successful transmission, or almost once per second (I'm limiting upload rate with while(((millis() - time_of_init_state_exit_ms) > 1000)) where time_of_init_state_exit_ms is reset to millis() after each successful transmission. But I see that red LED light up only once before the crash. I never see an SOS blink pattern in the RGB LED.

Noted. I'll take a look. Admittedly, I'll first have to look up what a lot of these suggestions mean (which is probably a good indicator that they may be related). Is there a straightforward way to monitor the free memory trend?

*** EDIT ***

I'll give that a shot.
*** End edit ***

Thanks again, I know this is a pretty involved conversation already. I really appreciate it!

1 Like

Not quite sure where this info comes from.
I know that you need to call Cellular.on() after cold start in order to have it completely power off with Cellular.off() prior to entering deep sleep (I'd also add a short delay just before too), but I'm not aware that you should turn it on to have System.sleep() turn it off.
Maybe @rickkas7 can chime in on that.

One thing I'm not sure about tho'
In your initial post you wrote

What "loop" do you refer to? The void loop() function or another loop that's surrounding the snippet you posted above?

I'm asking since when you use deep sleep, your code would be expected to start from the beginning and not stay within loop() (nor any other loop) so could it be that deep sleep is engaged erronously but immediately wakes again?

He already has here (note that I am not currently using SYSTEM_THREAD): Electron SLEEP_MODE_DEEP tips and examples. Also, that post doesn't suggest a short delay. Let me know if either of you think I'm misinterpreting something there.

Sorry, "loop" was a poor word choice. I did not mean loop() and I do expect it to start from the beginning every time it wakes as you describe. But I don't suspect that is necessarily related. What I meant by loop, rather, was that it should stay within case SYSTEM_STATE_DATA_UPLOAD: (see full case code below if you wish to dig in that deeply...) as long as bin_file.length() is not equal to zero (there are a couple other conditions that cause it to break but that's the main one that I was using to describe this inner while loop).

I've checked a handful of the other causes for a break from that case and none of them are occurring, which I can confirm individually by working out the byte-by-byte behavior manually or, more conclusively, for all cases having seen that it works properly for a given file sometimes even after not working (that is, resetting) for that same exact file and same exact messages sometimes.

////////////////////////////////////////////////////////////////////////////////
      case SYSTEM_STATE_DATA_UPLOAD:
      // upload data bit by bit either by serial or data upload

      // Some notes on this...
      // Data upload is done by uploading the very last chunk of data in the file
      //  and then truncating it.  This allows us to have a good "lost connection/partial upload scheme"
      //  as what is uploaded is removed from the file.  Files are natually uploaded in order of
      //  the last file in the SPI   Data is stored in 496byte "blocks" and data doesn't
      //  span across blocks, so data upload can be asyncronous, even within the file
      //  Also, data is compressed using base85 before being uploaded.

#if USING_CELL_NETWORK_UPLOAD
      // make sure that we are connected to the cloud and not doing more than 1/sec
      while(Particle.connected() && ((millis() - time_of_init_state_exit_ms) > 1000))
#else
      // make sure we are not outputting (over serial) more than 1x/sec
      while(((millis() - time_of_init_state_exit_ms) > 1000))
#endif
      {
        // multiple byte read
        // if the file's length is not divisable by the block length (496)
        if ((bin_file.length() % MAX_SPI_DATA_BLOCK_LENGTH) != 0)
        {
          // seek to the last chunk of data after blocks of 496
          bin_file.lseek(-1 * (bin_file.length() % MAX_SPI_DATA_BLOCK_LENGTH), SPIFFS_SEEK_END);
          // grab the data chunk
          num_of_bytes_to_encode = bin_file.readBytes((char *)data_to_encode, (bin_file.length() % MAX_SPI_DATA_BLOCK_LENGTH));
        }
        // otherwise just grab the last block of the file
        else
        {
          // seek to the last MAX_SPI_DATA_BLOCK_LENGTH of data and read it in
          bin_file.lseek((-1 * MAX_SPI_DATA_BLOCK_LENGTH), SPIFFS_SEEK_END);
          num_of_bytes_to_encode = bin_file.readBytes((char *)data_to_encode, MAX_SPI_DATA_BLOCK_LENGTH);
        }

        // pad with zeros till we have multiple of 4 for even conversion to b85
        //  IMPORTANT to keep track of how many extra so we don't truncate the file
        //    too much (multiple upload bug!)
        while (((num_of_bytes_to_encode + extra_bytes_to_encode) % 4) != 0)
        {
          data_to_encode[(num_of_bytes_to_encode + extra_bytes_to_encode)] = 0;
          extra_bytes_to_encode++;
        }

        // print out data to encode
        Serial.println("Data to Encode: (" + String((num_of_bytes_to_encode + extra_bytes_to_encode)) + "bytes)");
        for(int i = 0; i < (num_of_bytes_to_encode + extra_bytes_to_encode); i++) Serial.print(data_to_encode[i]);
        Serial.println();

        // convert to base85
        // function returns a pointer to a \0 at the end of the string
        //  ...so, subtracting the data_to_publish pointer gives us how many chars
        //  we should publish
        num_of_bytes_to_publish = (bintob85( data_to_publish, data_to_encode, (num_of_bytes_to_encode + extra_bytes_to_encode)) - data_to_publish);


        if (upload_num > MAX_NUM_UPLOADS)
       {
         Serial.println("Upload number > MAX_NUM_UPLOADS, deleting file and exiting!");

         // remove the uploaded file
         bin_file.remove();
         bin_file.close();

         // reset the upload attempts so it doesn't try on reboot
         num_upload_attempts_remaining = -1;

         // go to deep sleep
         glo_next_system_state = SYSTEM_STATE_DEEP_SLEEP;
         break;
       }
////////////////////////////////////////////////////////////////////////////////



      // print out data to publish that has been enocded
      Serial.println("Data to Publish: (" + String(num_of_bytes_to_publish) + "bytes)\n" + String(data_to_publish));

      // generate the publish id
      strcpy(publish_id, ("Sfin-" + String(glo_device_id) + "-" + bin_file_name + "-" + String(upload_num) + "\0")); // PJB commented out

      // print out publish ID to screen
      Serial.println(publish_id);

      // publish data
// #if USING_CELL_NETWORK_UPLOAD
      // I think a good idea is to use the MAC address of the device + date/time (filename)
      // if the publish is a success....
      // Serial.println("trying to publish...");
      if (Particle.publish(publish_id, data_to_publish, 60, PRIVATE))

// #endif
        {

          Serial.println("Publish Success");

          // increment upload number
          upload_num++;
          // truncate the file
          bin_file.truncate(bin_file.length() - num_of_bytes_to_encode);
          bin_file.flush();

          // reset "no upload" timeout timer
          time_of_init_state_exit_ms = millis();
        }

// #if USING_CELL_NETWORK_UPLOAD
        // if the publish wasn't successful
        else
        {
          // basically, don't truncate the last bit of the file so it will retry.
          // reset "no upload" timeout timer
          time_of_init_state_exit_ms = millis();
        }

        // keep connection alive (do once every 20 seconds or more often)
        Particle.process();
// #endif

        // Debugging info about file length.
        Serial.println("File Length = " + String(bin_file.length()));

        // if we have uploaded the entire file, close the file and go to sleep?
        if (bin_file.length() == 0)
        {
#if USING_CELL_NETWORK_UPLOAD
          // publish a EOF success event (TODO)
          //  if (Particle.publish(publish_id, data_to_publish, 60, PRIVATE))
#endif
          {
            Serial.println("File fully published, deleting file: ");

            // remove the uploaded file
            bin_file.remove();
            bin_file.close();

            // probably check to make sure there aren't any more files to upload (TODO)

            // reset the upload attempts so it doesn't try on reboot
            num_upload_attempts_remaining = -1;

            // go to deep sleep
            glo_next_system_state = SYSTEM_STATE_DEEP_SLEEP;
            break;
          }
        }
      }


      // if we aren't connected and CELL_SIGNAL_TIMEOUT_MS has gone by without a publish event
      //  shut the system down for a later reattempt
#if USING_CELL_NETWORK_UPLOAD
      if (!Particle.connected() && (millis() > (time_of_init_state_exit_ms + CELL_SIGNAL_TIMEOUT_MS)))
#endif

      if ((millis() > (time_of_init_state_exit_ms + CELL_SIGNAL_TIMEOUT_MS)))
      {
        // if this is the first failure (num_upload_attempts_remaining == -1)
        //  setup the number of reattempts to try and send to deep sleep
        if (num_upload_attempts_remaining == -1)
        {
          num_upload_attempts_remaining = NUM_UPLOAD_ATTEMPTS_MAX;
        }
        glo_next_system_state = SYSTEM_STATE_DEEP_SLEEP;
      }
      break;

A few more details from continued testing… Feel like I’m maybe getting closer but still can’t quite pinpoint my error and would love any suggestions!

The lead-up…
I wasn’t able to reproduce the issue for about four hours, running various short tests despite having seen it on two different PCBs and probably 30 times over roughly the past ten days. I did a lot of short tests (3 – 5 mins) and the only changes I made to the firmware were the following.

  1. This print statement
    Serial.println("Extra bytes to Encode: (" + String((extra_bytes_to_encode)) + "bytes)");
    in the UPLOAD case. It prints a number between 0 – 3 inclusive on the first message and then 0 for each subsequent message for any given file upload. I’d say that’s perfectly expected behavior. I was concerned that maybe extra_bytes_to_encode was increasing to infinity since it’s not reset within loop() but it is reset to 0 at the top of the .ino file and it is evidently calling that properly each time it wakes up.
  2. Test 2 was to print out pointers like this:
Serial.printf("Data to encode pointer %d\n", &data_to_encode);
Serial.printf("Data to publish pointer %d\n", &data_to_publish);

The response is
"Data to encode pointer 536884620
Data to publish pointer 536885120"
in each loop (those values were the same for each of the encode/publish pairs for each test I’ve done, including a 75 minute one). So that looks fine.
3) Test 1 (increase buffer size for data_to_encode, data_to publish) I haven’t done yet.

The crash!
Reproducing the crash finally occurred after a couple dozen successful short, 3 – 5 minute tests and one successful long (75 min) test; every short test after the long one has failed so far. The 20 short and even the one long were all transmitted fine. In fact, it worked so well for the first bunch of tests that I thought I was losing my mind because I couldn’t reproduce the failure. But the very first short test after the long one failed and every one since has as well. So it almost certainly has to do with those long tests and maybe the crashing of the freemem which I have seen decreasing by > 500 each iteration within the while loop of the UPLOAD state. But freemem starts fresh each time it wakes up, reading 85104 every time I start a new session. Then it decreases each iteration while publishing and truncating the bin_file one message at a time. This freemem decrease during the UPLOAD state is definitely worrisome but I can’t locate anything that could be chewing it up yet.

@rickkas7, if I can ping you directly here, I’m wondering if you might take a glance at my implementation of SpiffsParticleRK as shown in my previous post’s code. Thanks, everybody and anybody!

One possible reason mentioned earlier

Thanks again, @ScruffR, but still not able to isolate the issue. Spent another week fighting the same battle and we’ve now tried removing String everywhere, tracing freemem, and checking buffers and pointers without any success or obvious clues. I’ve also compiled using Particle Dev and the cloud compilation tool as well as Workbench with local compilation and Electron 0.8.0-rc.10.

@rickkas7, it’s seeming increasingly likely that we’re misusing your SPIFFS libraries but we’re running out of ideas as to how. Are you able to take a couple minutes seeing if anything jumps out at you in our implementation (described a couple posts ago: E-Series resetting after publish)? We’d really, really appreciate it.

It took a bit to implement the recommended changes but I think we’ve now tried everything suggested here to no avail. We’re able to reproduce the issue by going directly from SLEEP_MODE_DEEP into the UPLOAD case previously described/linked above.

Would love more thoughts if anyone has any! Thank you!