Long Term Photon Connection Stability

Well this is somewhat worrying :unamused:

I am experiencing the exact same problems with many devices - even the stock firmware sometimes hard faults. The worst bug that I have encountered is that the device's wifi module seemingly “dies” and the device is stuck flashing green; only reflashing the wifi firmware recovers it, and the problem is sure to recur.

I strongly suspect there is more than one root cause here, and some of them may come down to how the photon reacts to very site/ISP-specific situations.

I want to advise everyone participating in and watching this thread not to jump to conclusions too quickly; this will take some systematic work to isolate and understand the failure modes.

4 Likes

Not sure if this is useful, but I had a Photon sampling an analog input and publishing to a webhook every 6 minutes using a “for” loop counter, then resetting the device using System.reset(). This approach worked fine for three days. I then changed the approach to use millis() as the timer method, and the unit crashed into the dreaded rapid blue flash within 6 hours. Please note that I also shortened the timer cycle to around 3 minutes when I switched to millis().
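For reference, the millis()-based pattern described above looks roughly like this - a minimal sketch with a placeholder event name and analog pin, not the original code:

const unsigned long PUBLISH_INTERVAL = 3UL * 60UL * 1000UL;  // ~3 minutes, as in the test above
unsigned long lastPublish = 0;

void setup()
{
    // sensor setup would go here
}

void loop()
{
    // millis()-based timing instead of a for-loop counter + System.reset()
    if (millis() - lastPublish >= PUBLISH_INTERVAL)
    {
        lastPublish = millis();
        Spark.publish("analog-reading", String(analogRead(A0)));  // placeholder event name and pin
    }
}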

Yes, I agree multiple root causes seem very likely. My Photon has been running for 2 days now with the code I posted plus Spark.disconnect() in setup(), so everything points to the cloud connection somehow causing the reset. There was also a thread on TCPClient causing hard resets; maybe that has something to do with it, as my own code doesn’t use it but the cloud connection might use parts of it. Also, @Stevie mentioned his first failing Photons were running a web server that also used TCP, which might explain those failing even without a cloud connection.
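For clarity, the cloud-disconnected test setup mentioned above amounts to something like this (just a sketch, not the actual posted code):

void setup()
{
    Spark.disconnect();   // drop the cloud connection so only WiFi/local traffic remains
    // ... rest of the original test setup
}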

OK - got some hard data.

One of my photons is running a prototype application reporting thermocouple readings every 5 minutes. My dashboard clearly shows that it made its last report at 00:29:07 Aug 4 (local time), after running for over two days without incident.

I had previously started capturing all network traffic to/from its MAC address.

I have edited the resulting pcap file to eliminate the 2+ days of normal activity and the traffic from an additional photon (I still have the full capture and can provide it if that proves useful). The trimmed pcap file is available for download here

I have not analysed it in detail, as I wanted to get as many eyes on this as possible quickly, but here is my preliminary analysis:

  • We see normal activity up to about packet 53.
  • Packets 54/55/56 are TCP retransmissions; it looks like either the cloud service has gone out to lunch, or my ISP or firewall is having a hiccup.
  • Regardless, after only 12 seconds, the photon gives up on the TCP connection to the cloud and closes it (packet 57).
  • Then it looks like it cycles around to open a new socket to the cloud, and starts issuing DNS requests to my default gateway (192.168.121.1) [which announces itself as the DNS server during DHCP] (packets 58, 59, 62, etc.)
  • These all go unanswered, probably for exactly the same reason that the initial TCP connection was dropped: my ISP/firewall is busy having a fit, or rebooting a core router, or similar.
  • Regardless, the last DNS request (packet 72) is made at 00:29:33, approximately 71 seconds after the first sign of trouble. After that we see the photon is clearly alive and refreshing its DHCP lease, but it never attempts to re-establish the TCP connection to the cloud (as witnessed by the absence of further DNS requests).
  • That is, until I hit the reset button at 21:09:42 Aug-5 (packet 213); then everything takes off like normal.

There must be a failure scenario involving DNS that results in the connection attempt never being retried (maybe it thinks the error returned is permanent).

I think the TCP connection is being dropped a little aggressively (12 secs ?!?), and DNS gives up pretty quickly too (~1min) - although I recognise these may be a conscious trade-off vs. responsiveness.

Needless to say, everything else on my network rides through these ISP/firewall events without a visible problem.

This trace shows a single photon, but a second photon failed in exactly the same way at almost exactly the same time (I can provide pcaps of both, if that is useful to anyone).

Additional analysis/comments welcome. I may not have time to look too deeply into the firmware tonight. If anyone can save me time and provide pointers to where the cloud connection is initiated and maintained, that would be helpful - I used to know in the days of the core, but have not had reason to dig deeply into that part of the new firmware.

8 Likes

Great investigative work! This looks quite similar to my issue; interesting, though, that you see the Photon hanging while mine resets. Those Photons are on 0.4.3, I assume? Are you using the Web IDE? Maybe it really depends on the timing - my broadband reconnects should have been below 1 minute, so below the 71 seconds at which your Photon gave up DNS queries. Also, in my case it would then find itself using a different public IP, which might somehow trigger different behavior afterwards.

I think we really need someone from the firmware team to look at the whole cloud/TCP/DNS keep-alive cycle. If I find the time, my next test will use TCPClient to do the same test I’m currently doing with UDP, but against a server outside my local network (with UDP and no cloud connection it has now been running for 3 days and 7 hours), so the hardware and WiFi seem to be stable enough.
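For anyone who wants to reproduce that cloud-less UDP heartbeat, a minimal sketch along these lines should do it (the target address, port and interval here are placeholders, not taken from the original test):

UDP udp;
IPAddress remoteHost(192, 168, 1, 10);        // placeholder: a host on the local network
const uint16_t REMOTE_PORT = 8888;            // placeholder port

void setup()
{
    Spark.disconnect();                       // take the cloud connection out of the picture
    udp.begin(8888);                          // local port for the UDP socket
}

void loop()
{
    udp.beginPacket(remoteHost, REMOTE_PORT); // one small datagram per interval
    udp.write((const uint8_t*)"ping", 4);
    udp.endPacket();
    delay(60 * 1000);                         // heartbeat once a minute
}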

1 Like

My photons are running 0.4.3 (built locally from the firmware/latest branch [pre-0.4.4]) - the photon under the spotlight here is running an app built & loaded OTA using the Web IDE. The code is basically:

void setup()
{
    init_hardware();
}

void loop()
{
    read_stuff_from_hardware();
    Spark.publish("readings", json_payload);   // i.e. that stuff, in JSON form
    sleep(600);                                // pause between reports
}

No deep magic there. The other one was running Tinker.

1 Like

Issue #490 would result in a reset given a situation similar to the one I captured and described above.

I have no idea why my photons do not reset under this condition.

2 Likes

I will add to this thread so I can keep track of any posts regarding a solution to this issue. I recently posted “Photon Reliability” and “Photon losing connectivity and Resets”, as the Photon seems unable to maintain a connection for more than a day, as noted in this thread. It might drop and recover under firmware version 0.4.3, but eventually it does not recover and just ends up sitting forever with a blinking green light. It also resets occasionally, at random times, restarting the application (from setup()). One poster said this can happen when the WiFi connection is lost; it sometimes resets at that point. This is noted in the bug list on GitHub (issues #516 and #490).

1 Like

Hi All,

Just wanted to pop in and say thank you for posting these findings! This is very helpful, and I’ve brought this thread to the attention of our firmware team. This is an important issue, so any evidence or specifics are greatly appreciated. :slight_smile:

Thank you!
David

4 Likes

We are all looking forward to a resolution of the connectivity and reset issues. I am sure that if the firmware team wishes to beta-test a firmware change, many of us here would be ready to do so. We can’t deploy a serious application if we are unable to keep the Photon running 24 x 7 x 365.

2 Likes

I really want to understand these DNS-related failure modes, both the SOS and the cloud-connection-not-retried.

I will work on creating a failure scenario on demand this weekend - by manipulating firewall rules to isolate a photon under machine control.

I’d still appreciate any pointers to the cloud connection initiation code in the current firmware tree. Sure, I can grep as well as the best of them, but I think there’s more abstraction these days, so a couple of well-written hints will probably save a bunch of time and frustration, and make a positive outcome more likely.

3 Likes

I took a look at this as I’ve not seen any Core/Photon network traffic in Wireshark. Interesting.

The only thing that I can add is that it looks like the ten DNS requests form two streams/threads, meaning that DNS requests #3, #5, #7 and #9 are resends of #1, and similarly for the even-numbered ones, offset by five seconds.

I am going to guess that it’s trying what it thinks are two different DNS servers, only querying the second server once the first hasn’t responded within five seconds. For each of the two streams, it retries after 11 seconds, then after a further ~12 seconds, then after a further 14 seconds, and finally after a further 18 seconds.
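Purely to illustrate that inferred schedule (the numbers are read off the capture, not taken from the firmware), the two interleaved streams line up roughly like this:

#include <cstdio>

// Reconstruct the inferred DNS retry schedule: two interleaved streams,
// the second offset by 5 seconds, with per-stream gaps of ~11, 12, 14 and 18 s.
int main()
{
    const int gaps[] = {0, 11, 12, 14, 18};   // seconds between queries within one stream
    for (int stream = 0; stream < 2; ++stream)
    {
        int t = stream * 5;                   // second stream starts 5 s after the first
        for (int i = 0; i < 5; ++i)
        {
            t += gaps[i];
            printf("stream %d, query %d at ~%d s\n", stream + 1, i + 1, t);
        }
    }
    return 0;
}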

Some new sleuthing added here: Spark Firmware issue #490

Hope it helps… :smile:

OK - I have the zombie cyan flash mode in captivity.

Using ACLs on an ethernet switch, I am able to isolate a photon from the internet on demand - it stays connected to the access point, but the ACL drops all ethernet packets between the photon and the internet. I can turn this on and off via a couple of expect scripts.

When I isolate the photon, the behaviour exactly mirrors the observed behaviour during a “normal” internet outage. The packet captures look exactly the same too (in this case, I installed the ACLs after packet 42).

Removing the ACLs and allowing traffic to pass again does nothing for the photon; it continues to flash cyan and never retries opening the cloud connection. As soon as I hit reset, it takes right off.

Note that I’ve not been able to duplicate the SOS failure yet, but once I’ve run this one to ground, I’ll look at the test case listed in #490.

6 Likes

I think one problem is that the return value of inet_gethostbyname() in Spark_Connect() is not being checked. Instead, the value of the returned IP address is compared against 0.0.0.0.

I’ve not fired up the debugger yet (it’s late here, and I have [real life] stuff to do tomorrow), but what I suspect is happening is:

  1. inet_gethostbyname() on a photon maps to wiced_hostname_lookup() [OK, I don’t suspect this, it is true]
  2. on failure (or at least on failures like the ones I can inject), wiced_hostname_lookup() writes garbage to the storage pointed to by the HAL_IPAddress pointer.
  3. hence ip_addr is garbage (the constructor does init it to 0.0.0.0, so it must be being overwritten) - and the error-path logic depends on seeing 0.0.0.0 on failure
  4. the stack tries to open a socket to the specified (garbage) IP address.
  5. most times, this is relatively benign, because the IP address is not in the local subnet, and so the SYN does not appear over the air, and it does not appear to cause the stack to crash.
  6. sometimes, maybe, the garbage IP address proves fatal to the stack

This doesn’t explain why I don’t see the logic in manage_cloud_connection() & friends kick in and retry, but the ad-hoc state machine logic could have flaws that I’ll not spot by just staring at it.

I don’t know when I’ll have the time to fire up a debug build and test this logic, so I’ll throw this out there in case anyone else wants to have a stab at it before I’m able to spend more cycles on it.
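To make that suspected chain concrete, here is a self-contained sketch of the pattern described above - the names below are illustrative stand-ins, not the real firmware functions:

#include <cstdint>
#include <cstdio>

struct FakeIPAddress { uint32_t ipv4 = 0; };         // starts out as 0.0.0.0, like HAL_IPAddress

// Stand-in for the failing lookup: it reports failure via its return value but
// still scribbles over the caller's address (a fixed "garbage" value here, in
// place of whatever uninitialized data is really on the stack).
int fake_hostname_lookup(const char* hostname, FakeIPAddress* out)
{
    (void)hostname;                                  // unused in this stand-in
    out->ipv4 = 0xDEADBEEF;                          // clobbers the 0.0.0.0 sentinel
    return -1;                                       // lookup failed
}

int main()
{
    FakeIPAddress addr;
    fake_hostname_lookup("cloud.example.invalid", &addr);  // return value ignored
    if (addr.ipv4 != 0)                              // error path keys off 0.0.0.0 instead
    {
        // a failed lookup still falls through to the socket-open code with a garbage address
        printf("would try to open a socket to 0x%08x\n", (unsigned)addr.ipv4);
    }
    return 0;
}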

7 Likes

Updates:

  1. the photon inet_gethostbyname() call overwrites the value at the ipaddress pointer regardless of the success/failure of the wiced_hostname_lookup() call. Assuming wiced_hostname_lookup() doesn’t write anything to the local ipaddress variable on failure, we’ll put garbage from the uninitialized local variable into the IP address.
  2. Looking at the DNS packets on the wire, it looks like we call Spark_Connect() five times before stalling for some as-yet-unknown reason, because I see the DNS queries go out on the wire as five sets of 2 requests. Within each set, the 2 request packets are spaced 5 seconds apart, and the last parameter to wiced_hostname_lookup() is 5000, which looks suspiciously like a timeout in ms to me.
3 Likes

Yet Another Update:

The following patch seems to make things better - in my test case I now see the DNS queries repeating (in pairs 5 seconds apart) with the delay between the pairs backing off, as the code suggests.

diff --git a/hal/src/photon/inet_hal.cpp b/hal/src/photon/inet_hal.cpp
index b42812b..ea8eea8 100644
--- a/hal/src/photon/inet_hal.cpp
+++ b/hal/src/photon/inet_hal.cpp
@@ -31,7 +31,9 @@ int inet_gethostbyname(const char* hostname, uint16_t hostnameLen, HAL_IPAddress
     wiced_ip_address_t address;
     address.version = WICED_IPV4;
     wiced_result_t result = wiced_hostname_lookup (hostname, &address, 5000);
-    out_ip_addr->ipv4 = GET_IPV4_ADDRESS(address);
+    if (result == WICED_SUCCESS) {
+        out_ip_addr->ipv4 = GET_IPV4_ADDRESS(address);
+    }
     return -result;
 }
 

Now - there are still some mysteries, like why I do not see a TCP connection attempted to the hardwired IP address in the presence of DNS failures - but that can be a problem for another day.

If other code is relying on the *ipaddr == “0.0.0.0” behaviour in the face of DNS failures, I would expect this workaround to help those cases too.

However, I think the correct fix is to test the return value of the functions, and not rely on a side-effect (maybe this is a hangover from some CC3000 quirk, who knows) - either way, the return value should be the first thing checked.
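As a sketch of what “check the return value first” would look like at a call site (resolve_cloud_host() is a hypothetical helper, not real firmware code; the inet_gethostbyname() call shape follows the wiring-level signature):

int resolve_cloud_host(const char* hostname, network_interface_t nif, IPAddress& ip_addr)
{
    int rv = inet_gethostbyname(hostname, strlen(hostname), ip_addr, nif, NULL);
    if (rv != 0)
    {
        // surface the DNS failure so the caller can schedule a retry, rather
        // than falling through with whatever happens to be in ip_addr
        return rv;
    }
    // only now is ip_addr trustworthy enough to hand to the socket-open code
    return 0;
}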

If anyone following this thread is able to build locally and try this patch, please report your findings here.

6 Likes

I found another loosely related problem in spark_wiring_tcpclient.cpp:

diff --git a/wiring/src/spark_wiring_tcpclient.cpp b/wiring/src/spark_wiring_tcpclient.cpp
index badfcf4..16adda0 100644
--- a/wiring/src/spark_wiring_tcpclient.cpp
+++ b/wiring/src/spark_wiring_tcpclient.cpp
@@ -58,7 +58,7 @@ int TCPClient::connect(const char* host, uint16_t port, network_interface_t nif)
       {
         IPAddress ip_addr;
 
-        if(inet_gethostbyname(host, strlen(host), ip_addr, nif, NULL) == 0)
+        if((rv = inet_gethostbyname(host, strlen(host), ip_addr, nif, NULL)) == 0)
         {
                 return connect(ip_addr, port, nif);
         }
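On the application side the same principle applies - a hedged example of user code that checks TCPClient::connect()'s return value instead of assuming the lookup succeeded (hostname, port and retry delay are placeholders):

TCPClient client;

void setup()
{
}

void loop()
{
    if (client.connect("example.com", 80))    // hostname resolution happens in here
    {
        client.println("GET / HTTP/1.0\r\n");
        // ... read the response, then ...
        client.stop();
    }
    else
    {
        // lookup or connect failed - back off and retry later rather than spinning
        delay(30000);
    }
}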
7 Likes