Photon locked up (breathing green) after several hours running custom code

So I have written a rather large application that runs in the Photon module. It runs perfectly for several hours on end, but after a while the application stops running and the Photon module starts breathing green.

Can anyone list specific times when you have seen this and what caused it?

I know that beginTransmission on Wire(I2C) can cause this as I have seen it in the past.

I would post code but it is literally way over 1000 lines with all classes together. I am using Serial, Serial1, and I2C bus. I have determined that this is not caused by improper connection to I2C. I am also using EEPROM but that is only used at startup to read out settings, it succeeds.

Any insight whatsoever would be appreciated. I have been looking for this bug for several days now and am coming up empty handed. Problem is it takes several hours for it to occur but does not happen after a specific amount of time.

One other note. When the photon module runs into this if I hit the reset button it will do it again after my initial setup routine. However if I flash new firmware by putting it into safe mode the bug goes away and takes several hours to occur again. So this bug does not go away after reset.

Breathing green indicates an active WiFi connection without a Particle cloud connection.

I’ve no immediate answer to your question, but some more questions
What SYSTEM_MODE() are you using and are you using SYSTEM_THREAD(ENABLED)?
Would you try to test this snippet in your code?

  if (!Particle.connected())
  {
    Serial.println("Device offline");
    WiFi.off();
    delay(1000);
    WiFi.on();
    Particle.connect();
    Serial.println("Initiate reconnect");
    waitUntil(Particle.connected);
    Serial.println("reconnected");
  }

Hey @ScruffR

Thanks for the insight. I was aware of the breathing green LED indicating loss of Cloud connection but retention of WiFi connection. Generally speaking though a reset fixes this problem.

Also, I have made sure our network has a good internet connection when this occurs (we have a direct fiber optic connection at our building).

I should also note that I am not using any Particle functions, variables, events, etc. Literally the only reason I have it connected to Particle at all right now is for Over the Air Flashing of firmware.

I do not have system thread enabled.

I will try to add code to my loop as you suggested to check particle cloud connection and reconnect if needed.

In the end this firmware will use system mode manual as it does not require Particle cloud or even internet connection. The photon simply acts as the processor for this application and is put in AP mode in order to configure the settings when needed through the embedded web interface. That said if Particle Cloud connection is all that is causing this I would not worry about it at all. I may try putting it into system mode manual just to see if the problem goes away completely or not.

This statement of yours is what I was getting at with my code snippet.

Since a reset doesn't necessarily power off the WiFi module, an explicit WiFi.off() might help clearing some "clogs" that built up over time but won't be cured with a reset.
If this helped, we might get closer to nailing the actual reason.

Oh my. I did not realize that reset did not reset the WiFi module. Hmmm… I wonder why not? Developers must have had a reason but that makes a lot of sense now as to why reset does not correct the problem. Thank you for pointing that out. I am about 99% sure I know how to move forward now if this is the case.

It does reset it, but a reset doesn't always clear everything out which a power-down would.
e.g. when you have retained variables in your Photon code and hit reset, their contents will stay intact, but when you pull the plug they'll be gone.
A reset doesn't have to clear all memory but only invalidate all references and set the registers to a well defined starting condition.
But a possible question remains: "Is this really well defined?"

I'm not saying this is what happens, but it's a hypothesis that can be tested.

A possible reason for not always clearing everything might be restart time.
e.g. the Electron keeps the cellular modem powered during a reset to save on data usage by avoiding a renegotiation with the cell towers.
Another "reason" might be bugs :wink:

Hey @ScruffR

I did not implement your test yet but I believe I may have found the smoking gun. I had a blocking call which I believe was breaking the application when an i2c chip did not respond. I had hastily put together the code and forgot to implement a timeout when waiting for 2 bytes to be returned via i2c through the Wire library. Here is the original code:

void MCP23017_16_IO::readAllInputs(int* buffer){

Wire.beginTransmission(address);
Wire.write(18);
byte status = Wire.endTransmission();
if(status != 0){
    initialized = false;
}else{
    // Serial.println("Command Successful");
    initialized = true;
}
Wire.requestFrom(address, 2);
while(Wire.available() != 2);
buffer[0] = Wire.read();
buffer[1] = Wire.read();

return;
}

As you can see, the line `while(Wire.available() != 2);` is absolutely horrendous and I have absolutely no right to be writing code like that! Like I said though, I just put it together quickly and did not think about what I was doing. I determined fully that this block of code was what was causing the failure. To correct it I implemented a timeout. See the following:

void MCP23017_16_IO::readAllInputs(int* buffer){

if(!initialized){
	//ReInitialize i2c chip.
	setAddress(address1, address2, address3);
}

Wire.beginTransmission(address);
Wire.write(18);
byte status = Wire.endTransmission();
if(status != 0){
	//Chip did not respond to write.
    initialized = false;
    buffer[0] = 256;
    buffer[1] = 256;
    Wire.end();
    delay(50);
    Wire.begin();
    return;
}else{
    initialized = true;
}
Wire.requestFrom(address, 2);
unsigned long startTime = millis();
while(Wire.available() != 2 && millis()<startTime+timeout);
if(Wire.available() != 2){
	buffer[0] = 256;
	buffer[1] = 256;
	initialized = false;
	Wire.end();
	delay(50);
	Wire.begin();
	return;
}
//All clear
buffer[0] = Wire.read();
buffer[1] = Wire.read();

return;
}

As you can see, I have a timeout now. The timeout variable is an unsigned long set to 100 ms (way more than enough time for this I2C device to respond). You may be wondering why I do not call the initialization routine immediately after the failure. This function is called in the loop, and if I get an int value of 256 in my buffer then I know the read failed and I alert the user via serial console prints. I want to know when this happens, and I do not want to make this function really long or use goto statements, so I thought this was the cleanest way to implement it. When the main application sees this error it knows the input board is down and can try to re-initialize it while letting the user know what is going on.
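
For readers following along, the caller side of that scheme might look roughly like the sketch below. It is a host-side illustration only, not the code from this project: `FakeBoard`, `readFailed`, `pollInputs` and `kReadFailed` are hypothetical names, and the fake board simply returns fixed values in place of real I2C traffic.

```cpp
// Sentinel from the post above: a valid 8-bit port read fits in 0..255,
// so 256 can never be real data and is safe to use as a failure marker.
constexpr int kReadFailed = 256;

// Hypothetical stand-in for the input board. On a device this role would be
// played by the MCP23017_16_IO class talking over Wire.
struct FakeBoard {
    bool responding;
    int reinitCount;

    explicit FakeBoard(bool isResponding)
        : responding(isResponding), reinitCount(0) {}

    // Fills the buffer with two port values, or with the sentinel on failure.
    void readAllInputs(int* buffer) const {
        if (!responding) {
            buffer[0] = kReadFailed;
            buffer[1] = kReadFailed;
            return;
        }
        buffer[0] = 0x0F;  // arbitrary example data for port A
        buffer[1] = 0xF0;  // arbitrary example data for port B
    }

    // Pretend re-initialization that brings the board back.
    void reinitialize() {
        ++reinitCount;
        responding = true;
    }
};

// True when either slot carries the failure sentinel.
bool readFailed(const int* buffer) {
    return buffer[0] == kReadFailed || buffer[1] == kReadFailed;
}

// One pass of the main loop: read, and on failure try to bring the board back.
// Returns true only when the buffer holds valid data.
bool pollInputs(FakeBoard& board, int* buffer) {
    board.readAllInputs(buffer);
    if (readFailed(buffer)) {
        // On a device this is where the user would be alerted over Serial.
        board.reinitialize();
        return false;
    }
    return true;
}
```

Keeping the read function short and letting the loop decide what to do with a failure means the retry policy lives in one place.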

Thank you for all of your input @ScruffR. At the end of the day this was just bad code. That said this may be a good candidate for SOS messages on the Photon RGB. All this is is a blocking call. You can replicate it with this code:

void setup() {

}

void loop() {
     while(true);
}

Why not implement a blocking call SOS message? Seems like it would be simple enough but who knows. I know this is pretty quick to spot in 99% of applications which are relatively short, but when you have a large application like this where that failure only causes a block every once in a great while(several hours) it can be hard to track down or know what caused it. If I would have had an SOS message that let me know I had a blocking code I could have quickly known what was causing the problem and what to look for. Would have literally saved me days. Just a thought. What do you think?

Because it's not an error from which the device can't recover.

If blocking code were a candidate for an SOS call, then a thousand other reasons might be just as good candidates, and you again wouldn't know what exactly was wrong (or you'd have to count a thousand blinks :wink: )

The interesting thing is, why was it going into breathing green and not blinking cyan, which would have made the blocking nature of your code more obvious - or are you not using AUTOMATIC mode (which I asked initially already :wink: ).

BTW, the way you've done your timeout can cause issues near the limit of uint32_t, because startTime+timeout can overflow when millis() is close to wrapping around. Your range checks should also be tolerant of unexpected results (e.g. Wire.available() >= 2, which would not really be an error condition).
You might rather go with this scheme

while(Wire.available() < 2 && millis() - startTime < timeout);

And if you have longer running blocking code, you might go for this

while(millis() - startTime < 20000) Particle.process();  // this keeps the cloud happy
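
To illustrate the overflow point, here is a minimal host-side sketch comparing the two comparison styles with plain 32-bit values in place of millis(). The function names are made up for this example; only the arithmetic is the point.

```cpp
#include <cstdint>

// Deadline style: computes start + timeout, which wraps past 2^32 - 1.
// Near the wrap the deadline becomes a small number and the wait ends at once.
bool stillWaitingNaive(uint32_t now, uint32_t start, uint32_t timeout) {
    return now < static_cast<uint32_t>(start + timeout);
}

// Elapsed style: now - start is computed modulo 2^32, so it yields the true
// elapsed time even when the counter wrapped between start and now.
bool stillWaitingWrapSafe(uint32_t now, uint32_t start, uint32_t timeout) {
    return static_cast<uint32_t>(now - start) < timeout;
}
```

With start = 0xFFFFFFF0 and a 100 ms timeout, the deadline style computes a wrapped deadline of 84, so at now = 0xFFFFFFF5 (5 ms later) it already reports the wait as over. The elapsed style reports 5 ms elapsed and keeps waiting until 100 ms have really passed.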

Thanks @ScruffR Great information. I will take another look at my timeout scheme.

If you simply put the sample code below into a Photon module, it will not recover. It will sit there and breathe green until the end of time. So is it really recoverable? I get what you are saying about drawing a line in the sand with SOS messages, but if you flash that code, the Photon will sit there and breathe green and will not recover.

void setup() {

}

void loop() {
     while(true);
}

Also I have always left SYSTEM_MODE set to automatic. I do not set it and that is the default. This is literally all the code flashed into the test module from BUILD. Here is the screen shot:

If you flash this code it will connect to the cloud, run your code for a few seconds then it will go to breathing green because your application is blocking.

At the end of the day do not write blocking calls right :smiley: .

As a side note if you see the photon go from Particle Cloud connection(breathing Cyan) to breathing Green and you know you have good internet access on the network then review your code for blocking calls because that seems to be what that means.

Yes if the code was breaking out of that loop.
You'd still see the same breathing green (which I still don't understand not being cyan blinking - have to check back with Particle for that) with this code

volatile int someVar;
void ISR() 
{
  someVar=1;
}
...
void loop()
{
  while(someVar);
}

Would you expect an SOS now after 10 sec (or 100 days) blocking? It's definitely recoverable but still blinking cyan (or breathing green after 10 sec).

A non-recoverable SOS puts the controller in a state that nothing but a reset can get it out of.
But with your or my code the controller is in a perfectly running condition - it just obediently turns its rounds in a loop which you told it to do.
Business logic code flow and hardware code execution are two different levels and SOS faults belong to the latter.

These SOS faults are hardware faults defined by the manufacturer and a blocking code is not part of that list.


BTW, add Particle.process() to your loop - it'll still be a blocking loop (relative to loop()) but won't start breathing green - SOS or not?
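
The general shape of that idea, a wait that keeps servicing background work on every pass, can be sketched on a host machine as below. This is only an illustration under assumptions: `waitWithService` is a hypothetical helper, and on a Photon the `service` callback would be Particle.process() and `nowMs` would be millis().

```cpp
#include <cstdint>
#include <functional>

// Waits until done() returns true or timeoutMs has elapsed, calling service()
// once per pass so background work is not starved while we wait.
// Returns true if done() became true, false if the timeout expired first.
bool waitWithService(const std::function<bool()>& done,
                     const std::function<void()>& service,
                     const std::function<uint32_t()>& nowMs,
                     uint32_t timeoutMs) {
    const uint32_t start = nowMs();
    while (!done()) {
        // Elapsed-time comparison stays correct across counter wraparound.
        if (static_cast<uint32_t>(nowMs() - start) >= timeoutMs) {
            return false;
        }
        service();
    }
    return true;
}
```

The loop is still blocking from the point of view of loop(), but the background work keeps running, and the timeout guarantees the wait ends even if the condition never becomes true.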

I think my biggest issue with it going into breathing green is that this indicates the module is connected to the WiFi network but not to the cloud. If you set the system mode to manual and then connect WiFi, you would just see a breathing green LED. This seems to be something else, and maybe it just needs a different flash code, perhaps not SOS, but something to let the user know the module is running but is prevented from connecting to the Particle cloud (basically, the thread communicating with Particle’s cloud is not able to run).

Alternatively, just a small note in the Cloud Not Connected section under Device Modes in the docs here would suffice:
https://docs.particle.io/guide/getting-started/modes/photon/#cloud-not-connected
I did review that section when I got the breathing LED, but it did not tell me anything I did not already know. Just a note there that could list common causes for this including blocking code. If that note would have been there I would have said ah… that makes sense.

Maybe I am way off base but I would just like to see something other than breathing green. Heck I don’t know. I’m just tickled I have everything running at this point. I was just trying to think of some insight I could provide so maybe others do not run into this in the future. Just trying to do my part to provide feedback.


You’ve stirred up something there :wink:

I’ve just conferred with Particle and other Elites and that breathing green thing is somewhat “disturbing”.
The change in behaviour in case of blocking code was (silently) introduced with some recent update.
Some knew about it, others didn’t, and the intended behaviour is not quite clear, since breathing green is meant as a code for a deliberate mode decision.

And - as you pointed out - the unanimous call for better documentation was raised.


Great news @ScruffR

Like I said, just a note there in the Device Modes documentation would suffice for me. I did look there. Of course a different RGB code would work too; that decision is up to the powers that be.

Umm… this sounds all too familiar.

Throw it on the pile !! (Brick Pile)

I wish I could give you the silver bullet solution. But I haven’t been able to cure the Breathing Green light of death for the Photon.

Good Luck

What IOTrav is seeing is due to blocking code (which he found and cured) so the device is not dead, the code just paralysed it.
Have you checked your code for something like that?

I have spent soooo many, too many, hours, days and weeks looking for reasons why the Photons (4 of them) breathe green, won’t claim, and won’t stay connected. Yes, I’ve looked at the code; I’ve abandoned the code and started over with the same dead Photon results. Believe me, I wish they would work: great layout, poor execution.


Did you try putting the modules into safe mode and reflashing the Tinker app? I have never heard of this, and I have probably flashed 300 modules.
