Memory leak in Gen3 network handling

boron-3g

#1

I isolated my issue in a simple program on a v1.2.1 Boron 3G doing only this:

  • Send a udp message on boot
  • Send a udp message every hour showing the free system memory

The messages look like this:

Jul 23 16:55:07 Booted. Device OS 1.2.1, Reset 140 0
Jul 23 17:54:42 uptime 1h, mem 64448
[…]
Jul 23 23:54:40 uptime 7h, mem 64448
Jul 24 00:54:41 uptime 8h, mem 58520
[…]
Jul 24 07:54:41 uptime 15h, mem 58520
Jul 24 08:54:40 uptime 16h, mem 53192
[…]
Jul 24 14:54:40 uptime 22h, mem 53192
Jul 24 15:54:40 uptime 23h, mem 47872
Jul 24 16:54:41 uptime 24h, mem 37288

I removed messages with no memory change. This continues until about 15kb of memory is left. Then the device either reboots or freezes while breathing cyan (no pings, and in another test an LED blinked from loop() stopped blinking). The memory always drops by (multiples of) ~5kb. I ran this code several times on different devices, sometimes without problems for multiple days, sometimes crashing after 1 day.
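The size of each drop can be checked directly from the logged values. The following is a minimal sketch in plain C++ (it runs on a host, not on the device); the names kFreeMemory and dropAt are made up for this illustration.

```cpp
#include <cstddef>
#include <cstdint>

// Free-memory readings in bytes, copied from the hourly messages above.
// Hours with an unchanged value are omitted, as in the post.
constexpr uint32_t kFreeMemory[] = {64448, 58520, 53192, 47872, 37288};
constexpr std::size_t kReadingCount = sizeof(kFreeMemory) / sizeof(kFreeMemory[0]);

// Drop in bytes between reading i-1 and reading i.
// Only valid for 1 <= i < kReadingCount.
constexpr uint32_t dropAt(std::size_t i) {
    return kFreeMemory[i - 1] - kFreeMemory[i];
}

// dropAt(1) = 5928, dropAt(2) = 5328, dropAt(3) = 5320, dropAt(4) = 10584
```

The first three drops are between 5.3 and 5.9 kB each, and the last one (10584 bytes) is roughly twice that, which fits the "multiples of ~5kb" observation.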

Why I believe this is a bug in the Device OS code:

  • Before I had this minimal code to reproduce this, I logged errors using the papertrail log handler and every time before the free memory dropped again, there was this error message:

    [system] ERROR: Failed to load session data from persistent storage

    There was no memory drop without this message and no message without memory drop.

  • I see this message at every boot (which is understandable). Getting this message during normal operation looks like temporary network problems, which can happen, but must not leak memory.

  • Someone reported the same behavior, memory dropping by the same ~5kb when doing things with the modem: Memory leak in Argon doing WiFi.off() / WiFi.on() and Boron doing Cellular.off() / Cellular.on()

Thank you in advance for any help on this. I don’t know where to continue searching for the source of this bug.

This is my code:

SYSTEM_THREAD(ENABLED);

UDP udp;
uint16_t udpPort = xxxxx;
const char *host = "xxxxx.com";

void setup() { }

void sendUdpMessage(String msg) {
  static int inited = 0;
  static IPAddress address;

  if(!address) {
    address = Cellular.resolve(host);
    if(!address) return;
  }

  if(!inited) {
    uint8_t udpBeginStatus = udp.begin(8888);
    if(udpBeginStatus != 0) inited = 1;
    else return;
  }

  String time = Time.format(Time.now(), TIME_FORMAT_ISO8601_FULL);
  String packet = String::format("<22>1 %s %s %s - - - %s", time.c_str(), System.deviceID().c_str(), "jay-0", msg.c_str());
  int ret = udp.sendPacket(packet, packet.length(), address, udpPort);

  if(ret < 1) inited = 0;
}

void loop() {
  /* Heartbeat */
  static system_tick_t lastHeartbeat = 0;
  static int uptime = 0;
  static int bootMessageSent = 0;

  if(millis() - lastHeartbeat >= 3600000) {
    lastHeartbeat = millis();
    uptime++;
    if(Particle.connected()) sendUdpMessage(String::format("uptime %ih, mem %lu", uptime, System.freeMemory()));
  }

  /* Boot message */
  if(!bootMessageSent && Particle.connected()) {
    sendUdpMessage(String::format("Booted. Device OS %s, Reset %i %lu", System.version().c_str(), System.resetReason(), System.resetReasonData()));
    bootMessageSent = 1;
  }
}

#2

Thanks for posting! I’m hoping that @avtolstoy or @cheong can attempt to replicate and get an issue filed in GitHub.


#3

Thank you!
Let me know if I can do any additional tests or help in other ways, I would really like to get this issue solved.


#4

Thanks for reporting the issue. This has been resolved in https://github.com/particle-iot/device-os/pull/1862 and the fix will be included in the upcoming 1.3.1-rc.1 release.


#5

I can confirm this fix solves my issue. The memory leak is gone.


#6

As I discovered later, the reason for triggering the memory leak bug was a hardware issue. I describe it here to help people in the future having the same problem:

I was powering the Boron via the VUSB pin with a powerful (3A) supply (always stable 5V here, verified with a scope), but I didn’t increase the PMIC input current limit. The default is 500mA, which is not enough for the Boron 2G/3G’s current peaks. More details: Correct power supply for Boron 2G/3G without LiPo battery

This led to voltage drops on VSYS, which caused the modem to become unresponsive. Device OS then restarted the modem, which leaked memory, as also observed in the thread linked above and confirmed here:

As this occurs infrequently, depending on cellular signal conditions, it is hard to debug. To reproduce the issue, I connected an electronic load to VSYS and drew current (~650mA) until the voltage dropped to 3V for about 300ms. This makes the modem unresponsive, but the voltage stays high enough that the Boron itself does not restart. Logs (level WARN) look like this:

0000063889 [gsm0710muxer] ERROR: The other end has not replied to keep alives (TESTs) 5 times, considering muxed connection dead
0000070480 [app] INFO: network_status_connecting
0000070629 [comm.protocol] ERROR: Event loop error 3
0000070632 [system] WARN: Communication loop error, closing cloud socket
0000083129 [app] INFO: network_status_connected
0000083137 [system] ERROR: Failed to load session data from persistent storage

#7

This should also be resolved in 1.3.1-rc.1: https://github.com/particle-iot/device-os/pull/1846


#8

Hi @nils,
I am having a similar issue with a Boron: “device reboots or it freezes breathing cyan”.
I didn’t do exactly the same debugging as you (congrats on the very detailed work!), but the symptoms seem to be the same, and I’m also powering it solely via a DC supply on the VUSB pin, without a battery or a USB cable.
I would appreciate it if you could confirm that you solved your problem by only changing the PMIC settings. Did you only include something like this in your code? Or did you use other PMIC functions? Which current did you use as a limit?

PMIC pmic;
pmic.setInputCurrentLimit(2000);

Thanks,
David


#9

Hi @david,
the reboot or freeze was caused by the memory leak bug, which has been fixed since 1.3.1-rc.1. This bug reduced the free heap by ~5kb on every modem restart, so a reboot or freeze occurred after a number of modem restarts. The modem restarts were caused by the power issue, so the power issue was only an indirect cause of the reboot/freeze!
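As a rough estimate of how many modem restarts it took to get from a fresh boot to the critical level, here is a small sketch in plain C++ (host-side, not device code). The function name and the constants are assumptions made for this illustration: the boot value and the per-restart leak are taken from the readings in the first post, and "about 15kb left" is rounded to 15000 bytes.

```cpp
#include <cstdint>

// Number of leaking modem restarts that fit between the free heap at boot
// and the level at which the device rebooted or froze (integer division).
constexpr uint32_t restartsUntilCritical(uint32_t freeAtBoot,
                                         uint32_t criticalFree,
                                         uint32_t leakPerRestart) {
    return (freeAtBoot - criticalFree) / leakPerRestart;
}

constexpr uint32_t kFreeAtBoot = 64448;     // first hourly reading in the first post
constexpr uint32_t kCriticalFree = 15000;   // assumption: "about 15kb left", rounded
constexpr uint32_t kLeakPerRestart = 5320;  // one observed drop (53192 -> 47872)

// restartsUntilCritical(kFreeAtBoot, kCriticalFree, kLeakPerRestart) = 9
```

So on affected versions, roughly nine modem restarts were enough to exhaust the heap, which matches a crash after about a day when the restarts happen every few hours.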

If you are using 1.3.1-rc.1, your reboots or freezes have another cause as the memory leak is fixed. If not, please update :wink:
Check also if your power supply can handle the current and if the voltage is stable in case of current spikes.

This is my code:

PMIC pmic;
pmic.begin();
pmic.setInputCurrentLimit(3000);
pmic.disableCharging();
pmic.disableBATFET();

byte DATA = pmic.readChargeTermRegister();
DATA &= 0b11000110;
Wire1.beginTransmission(PMIC_ADDRESS);
Wire1.write(CHARGE_TIMER_CONTROL_REGISTER);
Wire1.write(DATA);
Wire1.endTransmission(true);
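The bit mask in the last block is easier to follow when written out. The sketch below is plain C++ (host-side, no Particle headers) and only shows which bits the mask keeps and clears. The register layout in the comments is what I recall from the BQ24195 datasheet for register 0x05, and 0x9A as the reset value is likewise from memory, so treat both as assumptions and check the datasheet. The constant and function names are made up for this illustration.

```cpp
#include <cstdint>

// Mask from the post above: keeps bits 7, 6, 2 and 1, clears bits 5, 4, 3 and 0.
constexpr uint8_t kKeepMask = 0b11000110;

// Assumed layout of the charge termination/timer control register:
//   bit 7    EN_TERM     bit 6    TERM_STAT
//   bits 5:4 WATCHDOG    bit 3    EN_TIMER
//   bits 2:1 CHG_TIMER   bit 0    reserved
constexpr uint8_t kWatchdogBits = 0b00110000;
constexpr uint8_t kSafetyTimerEnableBit = 0b00001000;

// Same operation as `DATA &= 0b11000110;` in the post.
constexpr uint8_t applyMask(uint8_t reg) {
    return reg & kKeepMask;
}

// With the assumed reset value 0x9A (0b10011010), applyMask(0x9A) is
// 0x82 (0b10000010): watchdog bits and safety-timer enable bit are zero.
```

Under these assumptions, the write disables the PMIC's I2C watchdog timer and the charge safety timer while leaving the termination settings untouched.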

#10

Hi @nils ,
Thank you for your advice. I was using Device OS 1.2.1. I tried using your code and also upgrading to 1.3.1, but I keep having connectivity issues: the device breathes cyan (I even made it log its connection to the cloud, which shows it’s connected), but it sometimes doesn’t receive commands via Particle functions. Also, I sometimes get the same error as you: [comm.protocol] ERROR: Event loop error 3.

Regarding your suggestion about the power supply, the current is fine (16A), but I’m not sure about the spikes. I’ll try to get the decoupling capacitors closer to the device; right now they are fairly close but connected by the jumper wires used for breadboards (maybe the wire length makes the distance too long?!).

I tried to get more logs to debug what’s going on, however, even adding SerialLogHandler logHandler(LOG_LEVEL_ALL);, the serial monitor only showed the following message twice:

0000243138 [comm.protocol] ERROR: Event loop error 3
0000243139 [system] WARN: Communication loop error, closing cloud socket
0000243142 [system] ERROR: Failed to load session data from persistent storage

It didn’t show any other messages like the ones in your previous posts (system, app, AT commands, etc.). I tried updating the firmware via the CLI, using particle update, in order to get a Device OS build with DEBUG_BUILD=y (according to this post: MAJOR BUG in 1.0.1-rc.1 and 1.0.1), but it didn’t make a difference. Are you using any special configuration or code to get all those different logs printed to your serial monitor?

Thanks,
David


#11

Hi @david

Avoid breadboard jumper wires for power lines or for anything that has to work reliably! They are very cheap because they are low quality and have very thin conductors. Their resistance is too high for those currents, which leads to voltage drops. I also had several problems with loose connections inside these cables, which makes them unreliable. Please use proper wiring and keep the wires short.
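To put the resistance argument into numbers, here is a small Ohm's-law sketch in plain C++ (host-side). All three values are hypothetical and chosen only for illustration; none of them was measured in this thread.

```cpp
// Ohm's law: voltage lost across a wire carrying the given current.
constexpr double wireVoltageDrop(double currentAmps, double resistanceOhms) {
    return currentAmps * resistanceOhms;
}

// Illustrative values only, not measurements.
constexpr double kPeakCurrentAmps = 2.0;           // assumed modem current peak
constexpr double kJumperPathOhms = 0.2;            // assumed jumper wires plus contacts
constexpr double kShortThickWirePathOhms = 0.02;   // assumed short, thicker wiring

// Jumper path:      2.0 A * 0.2 ohm  = 0.4 V lost during a peak
// Short thick wire: 2.0 A * 0.02 ohm = 0.04 V lost during a peak
```

With these assumed values, the jumper wires alone would cost a few hundred millivolts during every current peak, which is why shorter and thicker wiring makes such a difference.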

Decoupling only helps if the caps are large enough and have low ESR, but I don’t know numbers here.

You should also use an oscilloscope to check whether your input voltage drops whenever you experience software problems. Check whether adding a battery resolves your issues. If it does not, this is not a power issue.

As far as I know, your log shows a lost cloud connection which is reestablished. I don’t think this alone is a problem, the connection can be lost sometimes because of network issues. If it’s reestablished, it should be fine.

I have no special log configuration, just a standard log handler. You don’t see the gsm0710muxer error because it’s not there; you have a different problem than I had. The app messages come from my code; they are just callbacks on network system events to visualize them for debugging.


#12

Hi @nils ,
I appreciate your answers, you gave me some good advice, I have some work to do on that.

Correct, I’ll redo my wiring.

In a similar project with a Boron and the same power supply, the problems were gone when I used a battery. I solved it there by using decoupling caps (1x 1000uF and 2x 0.01uF) very close to the VUSB pin and shorter wires, and the device worked very well after that. So I’m guessing it’s really a wiring problem in my current prototype.

Thanks for the points about logging, I’ve not used them before.

Thanks again for all the advice.
David


#13

As a reference, in case someone with a similar issue reads this topic, changing the wiring solved the problem. The decoupling caps need to be very close to the Boron.

Thanks again @nils , great inputs.


#14

Thank you for the feedback, glad you solved it!