Long Term Photon Connection Stability

Hi All,

Just wanted to pop in and say thank you for posting these findings! This is very helpful, and I’ve brought this thread to the attention of our firmware team. This is an important issue, so any evidence or specifics are greatly appreciated. :slight_smile:

Thank you!
David

4 Likes

We are all looking forward to a resolution of the connectivity and reset issue. I am sure that if the firmware team wishes to Beta a firmware change, many of us here would be ready to do so. We can’t deploy a serious application if we are unable to keep the Photon running 24 x 7 x 365.

2 Likes

I really want to understand these DNS-related failure modes, both the SOS and the cloud-connection-not-retried.

I will work on creating a failure scenario on demand this weekend - by manipulating firewall rules to isolate a photon under machine control.

I’d still appreciate any pointers to the cloud connection initiation code in the current firmware tree. Sure, I can grep as well as the best of them, but I think there’s more abstraction these days, so a couple of well-written hints will probably save a bunch of time and frustration, and make a positive outcome more likely.

3 Likes

I took a look at this as I’ve not seen any Core/Photon network traffic in Wireshark. Interesting.

The only thing that I can add is that it looks like the ten DNS requests are two streams/threads, meaning that DNS requests #3, #5, #7 and #9 are resends of #1, and similarly the even ones; offset by five seconds.

I am going to guess that it’s trying what it thinks are two different DNS servers, only querying the second server once the first hasn’t responded in five seconds. For each of the two streams it retries after 11 seconds, then after a further ~12 seconds, then after a further 14 seconds, then finally after a further 18 seconds.
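
For anyone who wants to compare this against their own capture, here is a rough C++ sketch of the timeline described above. The gaps (11, 12, 14, 18 seconds) and the 5-second offset are just the values read off this one capture, and the struct and function names are made up for illustration; none of this is taken from the firmware.

```cpp
#include <cassert>
#include <vector>

// Gaps (in seconds) between successive sends within one stream, as estimated
// from the capture: 11, then ~12, then 14, then 18.
static const std::vector<int> kRetryGapsSec = {11, 12, 14, 18};

// Observed offset between the two streams (guessed to be a second DNS server).
static const int kStreamOffsetSec = 5;

struct DnsSend {
    int time_sec;  // seconds since request #1
    int stream;    // 0 = first stream, 1 = second stream
};

// Builds the ten sends in the order they would appear on the wire.
std::vector<DnsSend> build_timeline() {
    std::vector<DnsSend> sends;
    int t = 0;
    sends.push_back({t, 0});
    sends.push_back({t + kStreamOffsetSec, 1});
    for (int gap : kRetryGapsSec) {
        // Appending in pairs stays time-ordered only while each gap
        // is longer than the offset between the two streams.
        assert(gap > kStreamOffsetSec);
        t += gap;
        sends.push_back({t, 0});
        sends.push_back({t + kStreamOffsetSec, 1});
    }
    return sends;
}
```

With these numbers the ten requests land at 0, 5, 11, 16, 23, 28, 37, 42, 55 and 60 seconds, so the odd-numbered requests belong to the first stream and the even-numbered ones to the second.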

Some new sleuthing added here: Spark Firmware issue #490

Hope it helps… :smile:

OK - I have the zombie cyan flash mode in captivity.

Using ACLs on an ethernet switch, I am able to isolate a photon from the internet on demand - it stays connected to the access point, but the ACL drops all ethernet packets between the photon and the internet. I can turn this on and off on demand via a couple of expect scripts.

When I isolate the photon, the behaviour exactly mirrors the observed behaviour during a “normal” internet outage. The packet captures look exactly the same too (in this case, I installed the ACLs after packet 42):

Removing the ACLs and allowing traffic to pass again does nothing for the photon: it continues to flash cyan and never retries opening the cloud connection. As soon as I hit reset, it takes right off.

Note that I’ve not been able to duplicate the SOS failure yet, but once I’ve run this one to ground, I’ll look at the test case listed in #490.

6 Likes

I think one problem is that the return value of inet_gethostbyname() in Spark_Connect() is not being checked. Instead the value of the returned IP address is compared against 0.0.0.0.

I’ve not fired up the debugger yet (it’s late here, and I have [real life] stuff to do tomorrow), but what I suspect is happening is:

  1. inet_gethostbyname() on a photon maps to wiced_hostname_lookup() [OK, I don’t suspect this, it is true]
  2. on failure (or at least on failures like I can inject) - wiced_hostname_lookup() writes garbage to the storage at the end of the HAL_IPAddress pointer.
  3. hence ip_addr is garbage (the constructor does init to 0.0.0.0, so it must be being overwritten) - the error path logic depends on 0.0.0.0 on failure
  4. the stack tries to open a socket to the specified (garbage) ip address.
  5. most times, this is relatively benign, because the ip address is not in the local subnet, and so the SYN does not appear over the air, and it does not appear to cause the stack to crash.
  6. sometimes, maybe the garbage IP address proves fatal to the stack
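
To make the suspicion in steps 2 to 4 concrete, here is a small stand-alone C++ sketch. It is not the firmware code: FakeIPAddress and fake_gethostbyname() are hypothetical stand-ins, and the assumption that the output holds a non-zero junk value after a failed lookup is exactly the unverified part of the theory. It only shows why a check against 0.0.0.0 would miss the failure in that case.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for HAL_IPAddress; like the real constructor,
// it starts out as 0.0.0.0.
struct FakeIPAddress {
    uint32_t ipv4 = 0;
};

// Hypothetical stand-in for the hostname lookup. ASSUMPTION: after a failed
// lookup the output holds a non-zero junk value, as suspected in step 2.
// 0xDEADBEEF is only a visible placeholder for that junk.
int fake_gethostbyname(bool dns_reachable, FakeIPAddress* out) {
    assert(out != nullptr);
    if (!dns_reachable) {
        out->ipv4 = 0xDEADBEEFu;
        return -1;
    }
    out->ipv4 = 0xC0A80001u;  // 192.168.0.1, an arbitrary example address
    return 0;
}

// Sentinel-style check: ignores the return value and looks only at the address.
bool sentinel_check_reports_failure(bool dns_reachable) {
    FakeIPAddress ip;
    fake_gethostbyname(dns_reachable, &ip);
    return ip.ipv4 == 0;
}
```

Under that assumption, sentinel_check_reports_failure(false) returns false: the lookup failed, but the caller sees a non-zero address and would carry on to open a socket to it.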

This doesn’t explain why I don’t see the logic in manage_cloud_connection() & friends kick in and retry, but the ad-hoc state machine logic could have flaws that I’ll not spot by just staring at it.

I don’t know when I’ll have the time to fire up a debug build and test this logic, so I’ll throw this out there in case anyone else wants to have a stab at it before I’m able to spend more cycles on it.

7 Likes

Updates:

  1. the photon inet_gethostbyname() call overwrites the value at the ipaddress pointer regardless of success/failure of the wiced_hostname_lookup() call. Assuming wiced_hostname_lookup() doesn’t write anything to the local ipaddress variable on failure, we’ll put garbage from the uninitialized local variable into the ip address.
  2. Looking at the DNS packets on the wire it looks like we call Spark_Connect() five times, before stalling for some as-yet-unknown reason, because I see the DNS queries go out on the wire as five sets of 2 requests. Within each set, the 2 request packets are spaced out by 5 seconds, and the last parameter to wiced_hostname_lookup() is 5000, which looks suspiciously like a timeout in ms to me.
3 Likes

Yet Another Update:

The following patch seems to make things better - in my test case I now see the DNS queries repeating (in pairs 5 seconds apart) with the delay between the pairs backing off, as the code suggests.

diff --git a/hal/src/photon/inet_hal.cpp b/hal/src/photon/inet_hal.cpp
index b42812b..ea8eea8 100644
--- a/hal/src/photon/inet_hal.cpp
+++ b/hal/src/photon/inet_hal.cpp
@@ -31,7 +31,9 @@ int inet_gethostbyname(const char* hostname, uint16_t hostnameLen, HAL_IPAddress
     wiced_ip_address_t address;
     address.version = WICED_IPV4;
     wiced_result_t result = wiced_hostname_lookup (hostname, &address, 5000);
-    out_ip_addr->ipv4 = GET_IPV4_ADDRESS(address);
+    if (result == WICED_SUCCESS) {
+        out_ip_addr->ipv4 = GET_IPV4_ADDRESS(address);
+    }
     return -result;
 }
 

Now - there are still some mysteries, like why I do not see a TCP connection attempted to the hardwired IP address, in the presence of DNS failures - but that can be a problem for another day.

If other code is relying on the *ipaddr == “0.0.0.0” behaviour in the face of DNS failures, I would expect to see this workaround help those too.

However, I think the correct fix is to test the return value of the functions, and not rely on a side-effect (maybe this is a hangover from some CC3000 quirk, who knows) - either way, the return value should be the first thing checked.
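
As a rough illustration of "return value first", here is a stand-alone C++ sketch. The types and functions are hypothetical stand-ins rather than the firmware's own, and the stand-in lookup deliberately leaves junk in its output on failure to show that the caller no longer cares what the output contains.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for HAL_IPAddress (starts out as 0.0.0.0).
struct FakeIPAddress {
    uint32_t ipv4 = 0;
};

// Hypothetical lookup: returns 0 on success, non-zero on failure.
// On failure the output is deliberately left holding junk.
int fake_gethostbyname(bool dns_reachable, FakeIPAddress* out) {
    assert(out != nullptr);
    if (!dns_reachable) {
        out->ipv4 = 0xDEADBEEFu;  // placeholder for an unspecified value
        return -1;
    }
    out->ipv4 = 0xC0A80001u;      // 192.168.0.1, an arbitrary example address
    return 0;
}

// Checks the return value before touching the address. The caller's
// variable is only updated when the lookup reports success.
bool resolve(bool dns_reachable, FakeIPAddress* resolved) {
    assert(resolved != nullptr);
    FakeIPAddress tmp;
    const int rv = fake_gethostbyname(dns_reachable, &tmp);
    if (rv != 0) {
        return false;  // failure: *resolved is left untouched
    }
    *resolved = tmp;
    return true;
}
```

With this shape, a failed lookup leaves the caller's address at whatever it was before (0.0.0.0 here), so code that still relies on the 0.0.0.0 convention keeps working too.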

If anyone following this thread is able to build locally and try this patch, please report your findings here.

6 Likes

I found another loosely related problem in spark_wiring_tcpclient.cpp:

diff --git a/wiring/src/spark_wiring_tcpclient.cpp b/wiring/src/spark_wiring_tcpclient.cpp
index badfcf4..16adda0 100644
--- a/wiring/src/spark_wiring_tcpclient.cpp
+++ b/wiring/src/spark_wiring_tcpclient.cpp
@@ -58,7 +58,7 @@ int TCPClient::connect(const char* host, uint16_t port, network_interface_t nif)
       {
         IPAddress ip_addr;
 
-        if(inet_gethostbyname(host, strlen(host), ip_addr, nif, NULL) == 0)
+        if((rv = inet_gethostbyname(host, strlen(host), ip_addr, nif, NULL)) == 0)
         {
                 return connect(ip_addr, port, nif);
         }
7 Likes

@AndyW I have to say a lot of this is going over my head, but I really appreciate the time you (and others) are putting into this! It will be great to get it sorted. I’m not an expert on this, but I have 2 photons and a board running a P1 setup (and have previously had problems relating to this thread), so if there’s any test code you’d like me to run for an extended period then please let me know!

Overnight testing (isolating the photon from the internet for 10 minutes every hour or so) suggests that this workaround is helpful keeping photons alive and kicking.

I’ll issue a PR against firmware/develop sometime soon(ish).

4 Likes

Hi @AndyW, I have tested your two patches related to the inet_gethostbyname function and they don’t solve the issue with the connect() call when using the test code in #490. The Photon still panics when the server is not available after calling connect() and then connected(), in this specific order.

Please take a look at my latest comments in #490; maybe they will help you find a workaround by touching the connect() and connected() code in /wiring/src/spark_wiring_tcpclient.cpp. It is hard for me to understand what goes on underneath that causes a PANIC, or what makes the call always return 1 when the socket is closed at the server side, etc.

Thanks in advance for your testing.

OK, thanks for the testing. I’ll look at your comments and will also pivot over to your test cases as soon as I have created the PR for these two fixes. It might not be until much later tonight or tomorrow (US central time) however.

1 Like

Created pull request #527 to address the inet_gethostbyname() errors and usage.

I have not tested against the core - perhaps someone at Particle-central can do that as they inspect the PR.

I now understand why I was not seeing the connection attempt to a hardcoded IP address in the face of inet_gethostbyname() failure.

The logic tested for dns failure and lack of IP address separately, but dns failure was indicated by a lack of IP address, so it would never take the path of using the hardcoded IP address.
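
Here is a stand-alone C++ sketch of that kind of dead branch. The shape of choose_flawed() is a guess at the structure described above, not a copy of the firmware, and the enum and function names are invented for illustration.

```cpp
#include <cassert>
#include <cstdint>

enum class Action { ConnectResolved, ConnectHardcoded, NoFallback };

// HYPOTHETICAL shape of the flawed logic: DNS failure is inferred from the
// address being 0.0.0.0, and "no address" is tested separately and first.
Action choose_flawed(uint32_t resolved_ipv4) {
    const bool have_ip = (resolved_ipv4 != 0);
    const bool dns_failed = !have_ip;     // failure inferred from the sentinel
    if (!have_ip) {
        return Action::NoFallback;        // always taken when DNS fails
    }
    if (dns_failed) {
        return Action::ConnectHardcoded;  // unreachable: have_ip implies !dns_failed
    }
    return Action::ConnectResolved;
}

// Sketch of a decision keyed on the lookup's return value instead.
Action choose_by_return_value(int lookup_rv, uint32_t resolved_ipv4) {
    if (lookup_rv == 0 && resolved_ipv4 != 0) {
        return Action::ConnectResolved;
    }
    return Action::ConnectHardcoded;      // fallback is now reachable
}

// Small self-check showing the difference between the two versions.
void self_check() {
    assert(choose_flawed(0) == Action::NoFallback);
    assert(choose_by_return_value(-1, 0) == Action::ConnectHardcoded);
}
```

In the flawed version there is no input that reaches the hardcoded-address branch, which would match never seeing that connection attempt on the wire.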

@jrodas, I’ll move onto your test case for #490 tonight, and see what I can discover.

6 Likes

I just noticed an issue with long term stability. I had a Photon running for just under a week and it just lost connection to the :spark: Cloud. I have the Photon sync its time with the Cloud once a day – my Cores do not and they lost connection after about 36 hours.

This is the first time I’ve had the problem. A quick reset fixed everything.

Thanks for working this!

We’ve recently added some stability fixes to the photon socket layer that we hope will improve long term stability.

4 Likes

There has been frenzied activity in the last few days on the develop branch that has yielded some remarkable improvements in TCPClient stability.

I do not know what the release vehicle will be for this work or when it will be available for the web IDE, and there are still a couple of loose ends to work on. However, one of my test apps used to last a few days at most before it reverted to flashing cyan; it has now been running for many, many days.

5 Likes

OK, I have been running my long-term connectivity test again, now powered by solar, with 0.4.4-rc3 and connected to the cloud. It’s been 2 days and 13 hours with just a single UDP packet lost and no resets, which looks pretty promising.

5 Likes

I have also noticed that stability has greatly improved, though TCPClient::connect still seems to cause a hard fault on the Photons in some cases. It sometimes works flawlessly for multiple days, but when the network is under load it panics repeatedly, and this panic cycle sometimes lasts for hours before the device starts working as usual again.

1 Like