OK - got some hard data.
One of my photons is running a prototype application that reports thermocouple readings every 5 minutes. My dashboard clearly shows that it made its last report at 00:29:07 Aug 4 (local time), after running for over two days without incident.
I had previously started capturing all network traffic to/from its MAC address.
I have edited the resulting pcap file to remove the 2+ days of normal activity and the traffic from an additional photon (I still have the full capture and can provide it if it would be useful). The trimmed pcap file is available for download here
I have not analysed it in detail, because I wanted to get as many eyes on this as possible, quickly, but here is my preliminary analysis:
- We see normal activity up to about packet 53.
- Packets 54/55/56 are TCP retransmissions; it looks like either the cloud service has gone out to lunch, or my ISP or firewall is having a hiccup.
- Regardless, after only 12 seconds, the photon gives up on the TCP connection to the cloud and closes it (packet 57).
- Then it looks like it cycles round to open a new socket to the cloud, and starts issuing DNS requests to my default gateway (192.168.121.1) [which announces itself as the DNS server during DHCP] (packets 58, 59, 62, etc.)
- These all go unanswered, probably for exactly the same reason that the initial TCP connection was dropped: my ISP/firewall is busy having a fit, or rebooting a core router, or similar.
- Regardless, the last DNS request (packet 72) is made at 00:29:33, approximately 71 seconds after the first sign of trouble. After that we see the photon is clearly alive and refreshing its DHCP lease, but it never attempts to re-establish the TCP connection to the cloud (as witnessed by the absence of any further DNS requests).
- That is, until I hit the reset button at 21:09:42 Aug 5 (packet 213); then everything takes off like normal.
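For anyone who wants to check their own capture for the same pattern (one protocol going quiet while another keeps running), here is a rough Python sketch. It assumes you have already exported the packet list into (timestamp in seconds, protocol name) pairs; the function name and that input shape are my own invention for illustration, not something any capture tool produces directly.

```python
def silent_protocols(packets, quiet_after):
    """Return {protocol: last_seen} for protocols not seen at or after quiet_after.

    packets     -- iterable of (timestamp_seconds, protocol_name) pairs
                   (hypothetical format; export it from your capture tool)
    quiet_after -- timestamp; protocols last seen before this count as silent
    """
    last_seen = {}
    for timestamp, protocol in packets:
        # Track the newest timestamp per protocol, regardless of input order.
        if protocol not in last_seen or timestamp > last_seen[protocol]:
            last_seen[protocol] = timestamp
    return {
        protocol: timestamp
        for protocol, timestamp in last_seen.items()
        if timestamp < quiet_after
    }
```

In a trace like the one above, you would expect DNS to show up as silent after the outage while DHCP does not.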
There must be a failure scenario involving DNS that results in the connection attempt never being retried (maybe it thinks the error returned is permanent).
I think the TCP connection is being dropped a little aggressively (12 secs ?!?), and DNS gives up pretty quickly too (~1min) - although I recognise these may be a conscious trade-off vs. responsiveness.
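To illustrate the behaviour I would expect instead, here is a minimal Python sketch of a reconnect loop that treats every DNS or TCP failure as transient and keeps retrying with a capped, jittered backoff. To be clear, this is not the photon firmware and not how the cloud connection is actually implemented; the function name, parameters and default delays are all assumptions made up for the example.

```python
import random
import socket
import time


def connect_with_retry(host, port, resolve=socket.gethostbyname,
                       connect=socket.create_connection, sleep=time.sleep,
                       base_delay=1.0, max_delay=60.0, max_attempts=None):
    """Resolve and connect, retrying on any failure (hypothetical helper).

    With max_attempts=None the loop never gives up, so a DNS outage of any
    length is ridden out rather than treated as a permanent error.
    """
    attempt = 0
    delay = base_delay
    while max_attempts is None or attempt < max_attempts:
        attempt += 1
        try:
            address = resolve(host)                 # DNS lookup
            return connect((address, port), timeout=30)
        except OSError:
            # socket.gaierror (DNS failure) and socket.timeout are both
            # subclasses of OSError, so both end up here and are retried.
            sleep(delay + random.uniform(0, delay / 2))  # add jitter
            delay = min(delay * 2, max_delay)            # capped backoff
    raise ConnectionError("gave up after %d attempts" % attempt)
```

The point of the design is that individual attempts can still time out quickly (keeping things responsive), as long as the outer loop never concludes that the failure is permanent.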
Needless to say, everything else on my network rides through these ISP/firewall events without a visible problem.
This trace shows a single photon, but there was a second photon that failed in exactly the same way at or about exactly the same time (I can provide pcap of both, if that is useful to anyone.)
Additional analysis/comments welcome. I may not have time to look too deeply into the firmware tonight. If anyone can save me time and provide pointers to where the cloud connection is initiated and maintained, that would be helpful - I used to know in the days of the core, but have not had reason to dig deeply into that part of the new firmware.