Long Term Photon Connection Stability


#142

This isn’t a rare event. Both of my Photons exhibit this problem. Both use SYSTEM_THREAD(ENABLED), and both lock up user code. The symptom is rapid flashing green, with the user thread not running. A single push of the reset button or a power cycle, and the device immediately reconnects and runs correctly.

I added code to both programs that detects a loss of WiFi and/or of the Particle cloud connection lasting 30 seconds and then issues a System.reset(). This sometimes works, but if the user code stops executing, even this fallback is useless. Where the old code without the reset() failed every two or three days, the new code with reset() fails about once a week. I have no way of knowing how often my reset logic restarts the device correctly, but I suspect a couple of times per week.
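The 30-second fallback described above can be written as a small state machine that is easy to test off-device. This is a minimal sketch, not the poster's actual code: `ConnectionWatchdog` and `update()` are hypothetical names, and the connection flag would come from something like `WiFi.ready() && Particle.connected()`. Because it runs as ordinary user code, it shares the limitation noted above: if the user thread stops, nothing calls `update()`.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: decides when a device has been offline long enough
// that a reset is warranted. Pure logic, so it can be unit-tested on a PC.
class ConnectionWatchdog {
public:
    explicit ConnectionWatchdog(uint32_t timeout_ms) : timeout_ms_(timeout_ms) {
        assert(timeout_ms_ > 0);
    }

    // Call periodically with the current time (e.g. millis()) and the
    // connection state. Returns true once the link has been down for
    // at least timeout_ms without interruption.
    bool update(uint32_t now_ms, bool connected) {
        if (connected) {
            offline_ = false;              // healthy: forget any earlier outage
            return false;
        }
        if (!offline_) {
            offline_ = true;               // first observation of this outage
            offline_since_ms_ = now_ms;
        }
        // Unsigned subtraction stays correct across the 32-bit millis() rollover.
        return static_cast<uint32_t>(now_ms - offline_since_ms_) >= timeout_ms_;
    }

private:
    uint32_t timeout_ms_;
    uint32_t offline_since_ms_ = 0;
    bool offline_ = false;
};
```

On the device, the call site inside `loop()` would look something like `if (netWatch.update(millis(), WiFi.ready() && Particle.connected())) System.reset();`.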

This issue has been happening since I switched from the old Core to the Photon. My Cores continue to execute flawlessly and are very robust. I consider the Photons fragile and unreliable, and with no more Cores to fall back on, my Particle development has been halted, at least until this is fixed. Sorry.


#143

That’s interesting. I never got the Cores to work reliably. I just gave up on them, while the Photons are extremely stable for me. One of them has been running for weeks without interruption. So it is the exact opposite for me.

(Note that I did not try the Cores again after I got my first Photon. Maybe they are fine now with newer software versions, but I did not bother to check.)


#144

Hi @Muskie. Thanks for reporting your experiences; I’m sorry they are not more positive. I hope you can appreciate that with WiFi, issues may be environmental. I haven’t been able to reproduce them, but I will be adding a connection timeout to mitigate the problem when it occurs. It would be much appreciated if you could test the timeout fix when it’s available to see whether it improves your experience.

Our goal is for the Photon to be a rock-solid platform, so we take these stability concerns seriously.


#145

Hello @mdma, thank you for the reply. I’m sorry to be so down on the Photon, but it is getting tedious having to reset them. My four Cores are on the same WiFi network and have no issues at all, so I conclude the problem must be Photon-related. I will try the new firmware fix once it becomes available and let you know how it goes.


#146

No problem at all - good to hear your experiences - if it’s not working then we want to hear about it. :smile: (And when it starts working, well, we want to hear about that too!)


#147

Perhaps I am missing something here. The heart of the Photon product is WiFi connectivity. Admittedly, environmental factors, as mdma says, can cause unpredictable behavior in this connectivity. But the most frustrating and most important issue for me, and probably for many others, is user code lock-up. If I am monitoring and controlling temperatures and CO2 levels in an indoor agricultural facility with $70,000 worth of product growing inside, I need my system to execute my user code for months at a time without requiring a grower to do a hard reset at 2:00am. I expect a totally unambiguous coding method that ensures my user code will execute regardless of WiFi signal strength and/or connection. Many systems can live without logging data to the cloud or having a web interface 24/7, but they should surely continue to execute. It has been far too long for the Photon to still hang because of a WiFi signal issue, and for Stevie to be happy to report that he has one Photon running for weeks without interruption. I am giving the Electrons a shot now and really hope they are more robust.


#148

If you don’t need the WiFi and want to be in full control of the connection, have you considered the other system modes?
Alternatively, there’s threading, though it’s still in beta, and issues are to be expected. It is designed to separate background (WiFi) tasks from user code, thus enabling your code to run even without a connection.
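What the threading separation buys can be illustrated with a host-side C++ analogy. This is not Particle's implementation: `run_app_loop` and the fake blocking connect are assumptions for illustration. A background thread stands in for the system thread and blocks on a slow "connect", while the application loop keeps running on its own thread. On a Photon, the equivalent switch is `SYSTEM_THREAD(ENABLED)`, which moves system (networking and cloud) processing off the application thread.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <thread>

using namespace std::chrono_literals;

// Result of one simulated run.
struct RunResult {
    int loop_iterations;          // how many times the "user loop" ran
    bool connected_during_loop;   // whether the network came up meanwhile
};

// Simulates a system thread doing a slow, blocking connect while the
// application loop continues independently of it.
RunResult run_app_loop(int iterations, std::chrono::milliseconds connect_time) {
    assert(iterations >= 0);
    std::atomic<bool> connected{false};

    // "System thread": blocks for connect_time, like a slow WiFi join.
    std::thread system_thread([&] {
        std::this_thread::sleep_for(connect_time);
        connected = true;
    });

    // "Application thread": keeps doing its control work regardless.
    int count = 0;
    bool seen_connected = false;
    for (int i = 0; i < iterations; ++i) {
        ++count;                                   // e.g. read sensors, drive outputs
        seen_connected = seen_connected || connected.load();
        std::this_thread::sleep_for(10ms);
    }

    system_thread.join();                          // clean shutdown for the demo
    return {count, seen_connected};
}
```

With a 500 ms "connect" and five 10 ms loop passes, the loop finishes all of its work before the connection ever comes up.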


#149

And you can: just look at the keywords @Moors7 has linked you to.
These have been available for a really long time already.


#150

I actually have more than one which do not have a problem. But I usually run them for days and not weeks.

But to be honest: if you have $70,000 riding on the stability of the Photon, I would probably change the system and split it into two parts:

  • One microcontroller which monitors and controls the temperature. That could probably be a dead simple system like an ATmega328-based Arduino or similar.
  • The Photon or Electron to do the WiFi connection.

The controller would run the control loop continuously and periodically report its data to the Photon over a serial connection.
That way reporting is decoupled from control, which greatly reduces the risk that something gets stuck. The Photon is a much more complex system with many libraries involved, so the risk that something goes wrong is much higher than with a simple Arduino system.
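On the Photon side of such a split, the job reduces to collecting lines from the controller and forwarding them. A minimal sketch, assuming a hypothetical newline-terminated text format such as `T=23.5;CO2=810`; the `LineAssembler` class is illustrative, not an existing library. On the device you would feed it bytes from `Serial1.read()` and pass each complete line on, for example with `Particle.publish()`. Its bounded buffer means a missing newline can never grow memory without limit, which matters on a device expected to run for months.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: turns a byte stream from the controller's UART
// into complete, newline-terminated lines of bounded length.
class LineAssembler {
public:
    explicit LineAssembler(std::size_t max_len) : max_len_(max_len) {
        assert(max_len_ > 0);
        buf_.reserve(max_len_);             // allocate once, up front
    }

    // Feed one byte. Returns true when a complete line has been written to `out`.
    bool feed(char c, std::string& out) {
        if (c == '\r') return false;        // tolerate CRLF line endings
        if (c == '\n') {
            const bool ok = !overflow_ && !buf_.empty();
            if (ok) out = buf_;
            buf_.clear();
            overflow_ = false;              // resynchronise on the next line
            return ok;
        }
        if (buf_.size() < max_len_) {
            buf_.push_back(c);
        } else {
            overflow_ = true;               // over-long line: discard it entirely
        }
        return false;
    }

private:
    std::size_t max_len_;
    std::string buf_;
    bool overflow_ = false;
};
```

Dropping a garbled or over-long line, rather than trying to repair it, keeps the reporting side simple: the controller will send fresh data on its next report anyway.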

If you use such a setup and the reporting gets stuck, you can simply reset the Photon when you have time. There is no urgency like the one you describe, because the control function is independent of it.


#151

Thanks Moors7, Stevie & ScruffR. I have tried semi-automatic mode, but not recently, and I am sure I have not exhausted all the possible ways to make connectivity self-checking robust in this mode. A few months back I tried various things in semi-automatic mode for about 2-3 months, with lots of frustration from having to hard reset every 1 to 2 days, occasionally 3. I did not implement any WiFi connection-loss detection with System.reset(), but, as Muskie reported above, when the user code halts that approach is also useless; it still fails for Muskie about once a week. I have been considering Stevie’s exact solution for quite some time, but I am having a hard time taking the plunge into a lot more complexity: two micros to code and maintain, two interfaces, yet another layer of communication between them, and on and on. It is also a tremendous waste of CPU and peripherals, and no longer in one nice package. I totally agree that, conceptually, this two-micro solution should have zero lock-up issues, as I have had quite a few 328 and 1284 systems run flawlessly for many months or more. I surely appreciate and respect all the hard work at Particle, but I just expected a little more reliability and less coding complexity after so much time.


#152

In all fairness, adding something as fragile as WiFi to any single-threaded system is bound to cause ‘issues’ sooner or later. With multithreading, most of these can be averted, so your user code doesn’t get interrupted (unless it explicitly requires a connection). The system modes (manual in particular) should give you the utmost control over the connection: you decide when and why it connects, and also what it does while it’s not connected or can’t find a connection.
Either of those approaches should let the user code run in the absence of connectivity, with multithreading being the most elegant.
Even though it’s still in beta, do give multithreading a try, since it seems to work fairly well. It doesn’t hurt to try, right?
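The manual-mode approach described above boils down to a small state machine that the application owns. A minimal sketch, assuming `loop()` calls `step()` once per pass; `ConnectSupervisor` and `NetAction` are hypothetical names. The caller would map `StartConnect` and `Abandon` onto connection calls such as `Particle.connect()` and `Particle.disconnect()`; whether those calls block depends on the system mode and firmware version, so check the documentation for yours. Whatever state the supervisor is in, the control code in `loop()` keeps running.

```cpp
#include <cassert>
#include <cstdint>

// What the application should do after a call to step().
enum class NetAction { None, StartConnect, Abandon };

// Hypothetical non-blocking connection supervisor for MANUAL (or
// SEMI_AUTOMATIC) mode: the application decides when to connect,
// how long to try, and how long to back off after a failure or drop.
class ConnectSupervisor {
public:
    ConnectSupervisor(uint32_t attempt_timeout_ms, uint32_t retry_delay_ms)
        : attempt_timeout_ms_(attempt_timeout_ms), retry_delay_ms_(retry_delay_ms) {
        assert(attempt_timeout_ms_ > 0);
    }

    // Call once per loop() pass with the current time and connection state.
    NetAction step(uint32_t now_ms, bool connected) {
        switch (state_) {
        case State::Start:
            if (connected) return go_online();
            return begin_attempt(now_ms);
        case State::Connecting:
            if (connected) return go_online();
            if (elapsed(now_ms) < attempt_timeout_ms_) return NetAction::None;
            back_off(now_ms);                 // this attempt took too long
            return NetAction::Abandon;
        case State::Online:
            if (connected) return NetAction::None;
            back_off(now_ms);                 // link dropped: do not retry at once
            return NetAction::None;
        case State::Waiting:
            if (connected) return go_online();
            if (elapsed(now_ms) < retry_delay_ms_) return NetAction::None;
            return begin_attempt(now_ms);
        }
        return NetAction::None;               // not reached
    }

private:
    enum class State { Start, Connecting, Online, Waiting };

    // Unsigned subtraction stays correct across the millis() rollover.
    uint32_t elapsed(uint32_t now_ms) const { return now_ms - mark_ms_; }
    NetAction go_online() { state_ = State::Online; return NetAction::None; }
    NetAction begin_attempt(uint32_t now_ms) {
        state_ = State::Connecting;
        mark_ms_ = now_ms;
        return NetAction::StartConnect;
    }
    void back_off(uint32_t now_ms) { state_ = State::Waiting; mark_ms_ = now_ms; }

    uint32_t attempt_timeout_ms_;
    uint32_t retry_delay_ms_;
    uint32_t mark_ms_ = 0;
    State state_ = State::Start;
};
```

Because `step()` only returns a decision and never waits, a stuck or slow connection attempt cannot stall the rest of `loop()` through this code path.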


#153

Thanks Moors7. If I continue with the Photon, I will definitely give multithreading and manual mode a go. However, I am hoping to avoid most of these problems with the Electron, although I am sure it will have its own set of issues. I have not yet found much discussion of multithreading or of connectivity and user-code problems on the Electron. Do you have any experience with it, or knowledge of potential pitfalls?


#154

You may want to search this forum. It seems that a few folks are working on projects that experience cellular connectivity issues, particularly when using power-saving features like the sleep functions.


#155

The factory default firmware on the Electron doesn’t support multithreading; this is coming in 0.5.0.


#156

I am using multithreading in my code and have found it very helpful and stable. But since I am not having issues with the WiFi, it is not too important for me.

And I agree with @Moors7’s comment about WiFi and single-threaded systems. On the other hand, adding multithreading always adds the potential for race conditions, deadlocks and the like, which will kick in at exactly the moment they do the most damage. I write a lot of multithreaded code, but for something mission-critical (like betting $70,000 on it) I would prefer single-threaded. It is just much more deterministic…


#157

The design of the threading system uses the active object pattern: system things are accessed by the system thread, and application things by the application thread. When the application wants to access a system resource, it posts an event to the system thread, which accesses the resource on behalf of the application. By doing this, we avoid needing locks to synchronize access to system resources, and hence remove the possibility of deadlocks, since no system resource is shared between threads.
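The active object idea described above can be shown in portable C++. This is a host-side sketch of the general pattern, not Device OS source: `ActiveObject`, `post()` and `call()` are illustrative names. The resource (in the usage below just an `int`) is touched only by the worker thread; the single lock in the sketch protects the request queue, which is what lets the pattern avoid locking around the resource itself.

```cpp
#include <cassert>
#include <condition_variable>
#include <functional>
#include <future>
#include <memory>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>

// Minimal active object: one worker thread owns a resource and executes
// requests posted to it from other threads, strictly in FIFO order.
class ActiveObject {
public:
    ActiveObject() : worker_([this] { run(); }) {}

    ~ActiveObject() {
        // The shutdown request is queued behind any pending work.
        post([this] { running_ = false; });
        worker_.join();
    }

    ActiveObject(const ActiveObject&) = delete;
    ActiveObject& operator=(const ActiveObject&) = delete;

    // Queue a task for the worker; callers never touch the worker's state.
    void post(std::function<void()> task) {
        assert(task);
        {
            std::lock_guard<std::mutex> lock(mutex_);
            queue_.push(std::move(task));
        }
        cv_.notify_one();
    }

    // Queue a task and get a future for its result.
    template <typename F>
    auto call(F f) -> std::future<decltype(f())> {
        using R = decltype(f());
        auto task = std::make_shared<std::packaged_task<R()>>(std::move(f));
        std::future<R> result = task->get_future();
        post([task] { (*task)(); });
        return result;
    }

private:
    void run() {
        while (running_) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                cv_.wait(lock, [this] { return !queue_.empty(); });
                task = std::move(queue_.front());
                queue_.pop();
            }
            task();   // runs on the worker thread only, outside the lock
        }
    }

    bool running_ = true;   // read and written only by the worker thread
    std::mutex mutex_;
    std::condition_variable cv_;
    std::queue<std::function<void()>> queue_;
    std::thread worker_;    // declared last so everything it uses exists first
};
```

A caller that needs an answer uses `call()` and waits on the future, which mirrors the application posting an event and the system thread doing the work on its behalf.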

On the Photon, even in single-threaded mode (from the application’s perspective), WICED is implemented as multiple threads for the networking stack and for communication with the WiFi module, among other things. So you are never truly running single-threaded, even when application code and system code run on the same thread.


#158

My stability issues were related to UDP. After switching to httpclient, my device has not locked up yet, so far in 2 weeks, which is OK for now: a bit more overhead, but nothing major.


#159

Sorry for the late reply.

After testing for a few days, we fixed the issue.

ISSUE: When the router is reset, the Photon gets stuck in infinite fast green blinking.

We tracked this down: when the router turned off, our MQTT client tried to reconnect, locked up the TCP client, and stayed there forever.

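if (mqttHandler.isConnected()) {
    mqttHandler.loop();
} else {
    // Retries as soon as the MQTT link drops, gated only on Particle.connected(),
    // which may still report true right after the router goes down.
    if (Particle.connected()) {
        if ((millis() - timeMqttRetry) >= 10000) {
            timeMqttRetry = millis();
            mqttConnect();
        }
    }
}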

Actually, when the router disconnects, our MQTT client notices immediately and retries, because the Particle.connected() condition above still returns true. Particle.connected() may take a little while to refresh its status, and so the Photon gets stuck in infinite fast green blinking.

SOLUTION:
Mode is AUTOMATIC with SYSTEM_THREAD(ENABLED).
Do NOT retry TCP or UDP connections immediately after a disconnect.

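if (mqttHandler.isConnected()) {
    // Keep pushing the retry timer forward while connected, so the first
    // reconnect attempt happens at least 10 s after the link drops.
    timeMqttRetry = millis();
    mqttHandler.loop();
} else {
    // Only retry when both WiFi and the cloud connection report as up.
    if (WiFi.ready() && Particle.connected()) {
        if ((millis() - timeMqttRetry) >= 10000) {
            timeMqttRetry = millis();
            mqttConnect();
        }
    }
}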

Thanks to @mhazley for hint regarding fast green blinking due to UDP socket.


#160

Great thread. I just want to share my latest problem as well.

I am using:
SYSTEM_MODE(SEMI_AUTOMATIC)
SYSTEM_THREAD(ENABLED)

My critical code runs on a separate thread, driven by a 1-second timer:
Timer timer(1000, timer_every_second);

That solved 99% of the issues.
The WiFi/cloud connection can go up and down as it wants. Yes, the LED stops “breathing”, but the device eventually reconnects.
I check every hour, and if it is disconnected, I reconnect.

The critical code is not affected, since it is on another thread.
I checked that carefully by counting uptime seconds per day. None were missed.
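The uptime check mentioned above can be made explicit by timestamping each timer tick. A minimal sketch, assuming the timer callback (for example `timer_every_second`) passes `millis()` to `tick()`; `TickAuditor` is a hypothetical name. Each gap is rounded to whole periods, so ordinary jitter is ignored, while any gap of 1.5 periods or more is counted as missed ticks. Keeping the callback this small also fits the usual advice that software timer callbacks should be short and non-blocking.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical audit helper for a periodic timer callback: records each
// tick's timestamp and counts whole periods that passed without a tick.
class TickAuditor {
public:
    explicit TickAuditor(uint32_t period_ms) : period_ms_(period_ms) {
        assert(period_ms_ > 0);
    }

    // Call from the timer callback with the current time (e.g. millis()).
    void tick(uint32_t now_ms) {
        if (has_last_) {
            const uint32_t gap = now_ms - last_ms_;                       // rollover-safe
            const uint32_t periods = (gap + period_ms_ / 2) / period_ms_; // round to nearest
            if (periods > 1) missed_ += periods - 1;                      // extra periods = missed ticks
        }
        last_ms_ = now_ms;
        has_last_ = true;
        ++ticks_;
    }

    uint32_t ticks() const { return ticks_; }
    uint32_t missed() const { return missed_; }

private:
    uint32_t period_ms_;
    uint32_t last_ms_ = 0;
    bool has_last_ = false;
    uint32_t ticks_ = 0;
    uint32_t missed_ = 0;
};
```

Publishing `missed()` once a day, alongside `ticks()`, turns "none missed" into a number that can be checked remotely.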

But it still eventually hangs completely, about once a week.
By hanging I mean no blinking at all: the LED is off, not red/cyan/green.
So it is likely some sort of very low-level problem.
A reset works, and the device then comes up fine and runs for a long time.

I did have to mask interrupts for a very short time to keep the communication between threads clean. But that is maybe 10 lines of simple code.
(True multithreaded code is not for the faint of heart.)

I am looking into watchdog timers next…
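A software watchdog rests on one idea: the application must check in periodically, and if it stops doing so, a monitor takes a recovery action. The sketch below shows that idea in portable C++ with a monitor thread; `SoftWatchdog`, `kick()` and `expired()` are hypothetical names. Particle's Device OS provides an `ApplicationWatchdog` class built on the same principle, where the recovery action is typically `System.reset()`. One limitation matters for the LED-off hang described above: a watchdog that is itself a thread cannot recover a device whose scheduler has stopped. That case is usually the job of the MCU's hardware watchdog.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <thread>
#include <utility>

// Host-side sketch of a software watchdog: a monitor thread calls
// on_expire() once if kick() is not called within `timeout`.
class SoftWatchdog {
public:
    using Clock = std::chrono::steady_clock;

    SoftWatchdog(std::chrono::milliseconds timeout, std::function<void()> on_expire)
        : timeout_(timeout), on_expire_(std::move(on_expire)),
          last_kick_(Clock::now()), monitor_([this] { run(); }) {
        assert(timeout_.count() > 0);
    }

    ~SoftWatchdog() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stop_ = true;
        }
        cv_.notify_one();
        monitor_.join();
    }

    // Called by the application to prove it is still making progress.
    void kick() {
        std::lock_guard<std::mutex> lock(mutex_);
        last_kick_ = Clock::now();
    }

    bool expired() const { return expired_.load(); }

private:
    void run() {
        std::unique_lock<std::mutex> lock(mutex_);
        while (!stop_) {
            const auto deadline = last_kick_ + timeout_;
            if (Clock::now() >= deadline) {
                expired_ = true;
                lock.unlock();
                on_expire_();      // e.g. System.reset() on a real device
                return;            // fire once
            }
            // Sleep until the current deadline; a kick() moves the deadline,
            // which is picked up on the next pass of the loop.
            cv_.wait_until(lock, deadline, [this] { return stop_; });
        }
    }

    std::chrono::milliseconds timeout_;
    std::function<void()> on_expire_;
    std::mutex mutex_;
    std::condition_variable cv_;
    Clock::time_point last_kick_;
    bool stop_ = false;
    std::atomic<bool> expired_{false};
    std::thread monitor_;          // started last, after everything it uses exists
};
```

The check-in belongs at the point that proves real progress, such as the end of each control cycle, rather than in a place that keeps running even when the control work is stuck.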


#161

Thanks for the report. The smoking gun here is disabling interrupts. Can you be certain that all code paths after disabling interrupts lead to them being re-enabled? What hardware peripherals are in use?
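One way to make the answer to that question "yes" by construction is to tie re-enabling to scope exit. A minimal sketch: on a Photon the underlying calls would be the Wiring functions `noInterrupts()` and `interrupts()`; here they are replaced by hypothetical stubs (`irq_enabled`, `irq_disable`, `irq_enable`) that only track a flag, so the logic can be tested on a PC. The guard restores the state it found rather than unconditionally enabling, so nested guards behave correctly, and every exit path, including early returns, runs the destructor.

```cpp
#include <cassert>

// Hypothetical stand-ins for the device's interrupt control; on a Photon
// these would wrap noInterrupts()/interrupts().
static bool g_irq_enabled = true;
inline bool irq_enabled() { return g_irq_enabled; }
inline void irq_disable() { g_irq_enabled = false; }
inline void irq_enable()  { g_irq_enabled = true; }

// RAII guard: interrupts are masked for exactly the guard's lifetime, and
// the previous state is restored on every exit path from the scope.
class InterruptGuard {
public:
    InterruptGuard() : was_enabled_(irq_enabled()) { irq_disable(); }
    ~InterruptGuard() { if (was_enabled_) irq_enable(); }   // restore prior state

    InterruptGuard(const InterruptGuard&) = delete;
    InterruptGuard& operator=(const InterruptGuard&) = delete;

private:
    bool was_enabled_;
};

// Example shared value exchanged with an ISR or another thread.
static volatile int g_shared = 0;

int update_shared(int value) {
    InterruptGuard guard;            // interrupts masked from here...
    assert(!irq_enabled());
    if (value < 0) return -1;        // ...an early return still runs ~InterruptGuard
    g_shared = value;                // keep the masked region this short
    return 0;
}                                    // ...until here, on every path
```

With this shape, adding a new early return or error path later cannot leave interrupts disabled by accident.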