OK, so here comes my first update from my test with the code I posted above. I had a packet loss at the 16 hour 30 minute mark (not too worrisome when using UDP).
The bad news: it then reset and reconnected after an uptime of 1 day, 33 minutes and 40 seconds.
The good news: I have looked at my router logs and they clearly show that at exactly that time my Internet connection was cut and re-established. I’m using broadband with a dynamic IP and a maximum 24-hour lease on it, so the router actively cuts the connection in the middle of the night to prevent it happening at other times.
Since then I have seen 2 more packet losses but no further reset. I will now try the same setup but without a cloud connection.
Just a point that may or may not be related: I find the greatest rate of failure is associated with the Photon I have talking to Ubidots through a webhook. I tried a few solutions put forward on this forum with varying degrees of success in re-establishing WiFi (some lasted about 11 hours), but the end result is always the flashing cyan problem. My immediate need does not require a high sample rate (about 1 sample every 6 minutes), so for now I am simply resetting the Photon after every webhook broadcast and have been able to maintain reliable broadcasts for two days now. OK, I know this is not a solution, but it’s keeping my project up for now.

I have had this flashing cyan lockup open as a support ticket for a few days now without response from Particle, other than an initial couple of suggestions about improving the WiFi, but with the large number of people affected I imagine the Particle guys are working on it. Would be nice to get an update from them.
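In case it helps anyone else, the stopgap is roughly this shape (the event name, pin and delays are placeholders rather than my exact code):

```cpp
// Stopgap only: take one sample, publish the event that triggers the Ubidots
// webhook, then reset the Photon so it comes back with a fresh connection.
// Event name, pin and delays below are placeholders.
#include "application.h"

void setup() {
}

void loop() {
    delay(6UL * 60UL * 1000UL);               // roughly one sample every 6 minutes

    int raw = analogRead(A0);                 // placeholder sensor read
    char data[16];
    snprintf(data, sizeof(data), "%d", raw);
    Spark.publish("ubidots_data", data);      // event the webhook listens for

    delay(5000);                              // give the publish time to go out
    System.reset();                           // reset after every webhook broadcast
}
```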
I’ll add another datapoint - two photons under test here have been running for over 5 days without error.
As I’ve said previously, I believe the interruptions are induced by internet/ISP weirdness. Obviously the Photon should be able to ride through transient problems with Internet connectivity. I will update as I learn more, but without a failure here, I have no new data to offer.
Hi Andy, Just to be clear on the ISP / Internet weirdness, my Spark Core runs without missing a beat in the same installation position - and I have pretty much zero dropout from other devices. The Core is using a wire antenna - and does drop out from time to time but does not go into this apparent firmware loop of flashing cyan without reconnection.
Yes - I had 2 photons in a failing mode, and 1 core that continued to be connected to the cloud.
However, when I snoop on what’s actually happening with wireshark, I see that the core closes and reopens the TCP connection to the cloud much more frequently than the photon, so different behaviour doesn’t surprise me that much.
So after several days, I’ve found that setting my router to “Legacy mode”, thus limiting the rate to 54 Mbps, and setting the channel bandwidth to 20 MHz (rather than 20/40 MHz or auto), allows my Photons to behave themselves. So the Photons just aren’t robust to interference yet. Beyond that vague analysis, I have no clue. I’m temporarily happy I can at least get back to building projects rather than thinking it’s all my fault; now problems are just mostly my fault…
So I have now run two nights with the Photon on WiFi but not connected to the cloud, without hiccups.
I couldn’t run non-stop because of an unrelated hardware problem: slightly pressing on the metal plate on top to check the temperature made it touch the chips and caused a reset. So at the moment my guess would be that the problem lies with the cloud heartbeats somehow triggering resets too aggressively, and I would guess that the flashing cyan is then a further problem with the reset under such conditions. I haven’t personally seen the flashing cyan though, only resets.
There might also be a relation to my router handing out an IPv6 prefix, but I doubt it, as I can’t see how that would only affect the Photon when connected to the cloud. Maybe some of the other people could try the stability without a cloud connection, e.g. by just sticking a Spark.disconnect() in the setup() of the code I posted.
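Stripped down to just the cloud-related part, that would look something like this (the rest of the test code stays as posted):

```cpp
// Same test code as before, but with the cloud connection dropped right after
// boot. The UDP test itself is omitted here; only the cloud-related part is shown.
#include "application.h"

void setup() {
    Spark.disconnect();   // stay on WiFi, but close the cloud session
    // (alternatively, SYSTEM_MODE(SEMI_AUTOMATIC) plus WiFi.connect() would
    //  avoid the cloud connection from the start)
}

void loop() {
    // ... the rest of the posted test code runs unchanged ...
}
```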
@bpr I’m a bit confused about what effect that legacy mode could have, because it sounds like something that would also affect Photons running without a cloud connection, and at least for my issue it doesn’t sound like it would trigger the reset exactly when the Internet was re-established (the router usually doesn’t drop the WiFi connection during that).
@DRCO For the Core running without missing a beat, are you checking for plain resets or only for resets that hang? What do you mean by “does drop out”? In the hiccup I had it was just a reset without hanging, which is quite a bit less noticeable.
Hi, note that my Photons are not connected to the cloud. In the initial test I did, they were not using the Internet at all, just exposing a web server to the LAN (a rough sketch is below). They failed frequently. I added some pings to the Internet to be able to assess whether the WLAN is dead or not. I can check whether it runs better with the newest version without the ping.
The Photons are very close to the router, other devices are not failing in the same location.
Something which may have something to do with the problem: I do have a WLAN repeater which is also visible (but with lower signal strength) in the same position. I will see if that affects the problem.
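In rough outline, the test setup mentioned above is just a small TCP server on the LAN plus an occasional ping (the ping target and interval here are placeholders, and how the cloud connection is handled is left out):

```cpp
// LAN-only test: serve a trivial response to the local network and
// periodically ping an outside host, to tell "WLAN dead" from "uplink dead".
// The ping target and interval are placeholders; cloud handling is left out.
#include "application.h"

TCPServer server(80);
unsigned long lastPing = 0;

void setup() {
    server.begin();
}

void loop() {
    TCPClient client = server.available();
    if (client.connected()) {
        client.println("HTTP/1.0 200 OK\r\nContent-Type: text/plain\r\n");
        client.println("alive");
        client.stop();
    }

    if (millis() - lastPing > 60000UL) {                 // once a minute
        lastPing = millis();
        int received = WiFi.ping(IPAddress(8, 8, 8, 8), 3);
        if (received == 0) {
            // no replies: the uplink (or the WLAN itself) looks dead
        }
    }
}
```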
@Stevie Not connected to the cloud means you actively did Spark.disconnect()? Because flashing code OTA uses the cloud, and as long as the Photon breathes cyan it maintains a cloud connection even if you don’t use any cloud features.
I’m experiencing the exact same problems with many devices; even the stock firmware sometimes hard faults. The worst bug that I have encountered is that the device’s WiFi module seemingly “dies” and the device is stuck flashing green; only reflashing the WiFi firmware recovers it, but the problem is sure to reoccur.
I strongly suspect there may be more than one root cause here, and some of these may be how the photon reacts to some very site/ISP-specific situations.
I want to advise everyone participating in and watching this thread not to jump to conclusions too quickly; this will take some systematic work to isolate and understand the failure modes.
Not sure if this is useful, but I had a Photon sampling an analog input and publishing to a webhook every 6 minutes using a “for” loop counter, then resetting the device using System.reset(). This approach worked fine for three days. I then changed the approach to use millis() as the timer method, and the unit crashed into the dreaded rapid blue flash within 6 hours. Please note, I shortened the timer cycle to around 3 minutes in the millis() version.
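The millis() version was essentially a non-blocking timer of this shape (simplified here, with placeholder names; not the exact code):

```cpp
// millis()-based timing: sample and publish roughly every 3 minutes without
// blocking in delay(). Event name and pin are placeholders.
#include "application.h"

const unsigned long INTERVAL_MS = 3UL * 60UL * 1000UL;
unsigned long lastPublish = 0;

void setup() {
}

void loop() {
    if (millis() - lastPublish >= INTERVAL_MS) {   // rollover-safe comparison
        lastPublish = millis();

        int raw = analogRead(A0);
        char data[16];
        snprintf(data, sizeof(data), "%d", raw);
        Spark.publish("sensor_reading", data);     // event the webhook listens for
    }
}
```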
Yes, I agree multiple root causes seem very likely. My Photon has been running for 2 days now with the code I posted plus Spark.disconnect() in setup(), so everything points to the cloud connection somehow causing the reset. Also, there was a thread on TCPClient causing hard resets; maybe this has something to do with it, as my own code doesn’t use it but the cloud connection might use parts of it. Also, @Stevie mentioned his first failing Photons were running a web server, also using TCP, which might explain those failing even without a cloud connection.
One of my Photons is running a prototype application reporting thermocouple readings every 5 minutes. My dashboard clearly shows that it made its last report at 00:29:07 Aug 4 (local time), after running for over two days without event.
I had previously started capturing all network traffic to/from its MAC address.
I have edited the resulting pcap file to eliminate the 2+ days of normal activity and an additional Photon (I still have the full capture and can provide it, if that proves useful). The trimmed pcap file is available for download here
I have not analysed it in detail, as I wanted to get as many eyes on this as possible quickly, but here is my preliminary analysis:
We see normal activity up to about packet 53.
Packets 54/55/56 are TCP retransmissions, it looks like either the cloud service has gone out to lunch, or my ISP or firewall is having a hiccup.
Regardless, after only 12 seconds, the photon gives up on the TCP connection to the cloud and closes it (packet 57.)
Then it looks like it cycles round to open a new socket to the cloud, and starts issuing DNS requests to my default gateway (192.168.121.1) [which announces itself as the DNS server during DHCP] (packets 58,59,62, etc)
These all go unanswered, probably for exactly the same reason that the initial TCP connection was dropped, because my ISP/firewall is busy having a fit, or rebooting a core router or similar.
Regardless, the last DNS request (packet 72) is made at 00:29:33, approximately 71 seconds after the first sign of trouble. After that we see the Photon is clearly alive and refreshing its DHCP lease, but it never attempts to re-establish the TCP connection to the cloud (as witnessed by the absence of further DNS requests).
That is, until I hit the reset button at 21:09:42 Aug-5 (packet 213), then everything takes off like normal.
There must be a failure scenario involving DNS that results in the connection attempt never being retried (maybe it thinks the error returned is permanent).
I think the TCP connection is being dropped a little aggressively (12 secs ?!?), and DNS gives up pretty quickly too (~1min) - although I recognise these may be a conscious trade-off vs. responsiveness.
Needless to say, everything else on my network rides through these ISP/firewall events without a visible problem.
This trace shows a single photon, but there was a second photon that failed in exactly the same way at or about exactly the same time (I can provide pcap of both, if that is useful to anyone.)
Additional analysis/comments welcome. I may not have time to look too deeply into the firmware tonight. If anyone can save me time and provide pointers to where the cloud connection is initiated and maintained, that would be helpful - I used to know in the days of the core, but have not had reason to dig deeply into that part of the new firmware.
Great investigative work! This looks quite similar to my issue, though it’s interesting that you see the Photon hanging while mine resets. Those Photons are on 0.4.3, I assume? Are you using the Web IDE? Maybe it really depends on the timing: my broadband reconnect should have taken less than 1 minute, so less than the 71 seconds at which your Photon gave up on DNS queries. Also, in my case it would then find itself using a different public IP, which might somehow trigger different behaviour afterwards.
I think we really need someone from the firmware team to look at the whole cloud/TCP/DNS keep-alive cycle. If I find the time, my next test will be to use TCPClient to do the same test I’m currently doing with UDP, but against a server outside my local network (with UDP and no cloud connection it has now been running for 3 days and 7 hours), so the hardware and WiFi seem to be stable enough.
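Roughly, that planned TCP test would look like this (the host and port are placeholders for a server outside my LAN; as with the UDP test, the cloud stays disconnected):

```cpp
// Planned test: same heartbeat idea as the UDP test, but over a TCP connection
// to a server outside the local network. Host and port are placeholders.
#include "application.h"

TCPClient client;
unsigned long lastBeat = 0;

void setup() {
    Spark.disconnect();                              // as before, no cloud connection
}

void loop() {
    if (millis() - lastBeat >= 60000UL) {            // one heartbeat per minute
        lastBeat = millis();

        if (!client.connected()) {
            client.stop();
            client.connect("example.com", 9000);     // placeholder external server
        }
        if (client.connected()) {
            client.println(millis());                // send uptime as the heartbeat
        }
    }

    while (client.available()) {
        client.read();                               // drain anything sent back
    }
}
```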
My Photons are running 0.4.3 (built locally from the firmware/latest branch [pre-0.4.4]); the Photon under the spotlight here is running an app built & loaded OTA using the Web IDE. Code is basically:
I will add to this thread so I can keep track of any posts regarding a solution to this issue. I recently posted “Photon Reliability” and “Photon losing connectivity and Resets”, as the Photon seems unable to maintain a connection for more than a day, as noted in this thread. It might drop and recover under firmware version 0.4.3, but eventually it does not recover and just ends up sitting forever with a blinking green light. It also resets occasionally and at random times, restarting the application (from setup()). One poster said this can happen when the WiFi connection is lost; it sometimes will reset at this point. This is noted in bug reports on GitHub (#516 and #490).
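Until there’s a proper fix, something like an application-level watchdog might at least limit the damage (a rough sketch, untested; it only helps if the application loop keeps running, which per the reports in this thread isn’t always the case):

```cpp
// Rough sketch of an application-level connectivity watchdog (untested):
// if WiFi or the cloud has been gone for more than a few minutes, force a
// reset instead of sitting in a dead state forever. The timeout is arbitrary.
#include "application.h"

const unsigned long CONNECTION_TIMEOUT_MS = 5UL * 60UL * 1000UL;  // 5 minutes
unsigned long lastConnected = 0;

void setup() {
}

void loop() {
    if (WiFi.ready() && Spark.connected()) {
        lastConnected = millis();
    } else if (millis() - lastConnected > CONNECTION_TIMEOUT_MS) {
        System.reset();   // last resort: start over rather than hang
    }

    // ... normal application code ...
}
```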