[SOLVED] Firmware losing connection to local cloud every 1-2 mins (due to UDP/TCP "collision")

Hey everyone,

Some of you may be familiar with the work I've been doing over the past few months to get the Spark firmware running on the STM32F4-Discovery board, and ultimately my STM32F4-based "smart" power strip.

Currently I am running a local cloud on my laptop, and things have been going well. I've been working on the public API that our Android app will use to communicate with the power strip. Just last night I started having issues keeping my firmware connected to my local cloud: about every 1-2 minutes it disconnects.

The server log is showing this:

Connection from: 192.168.1.147, connId: 14
recovering from early data!  { step: 5, data: '4101046600b174', cache_key: '_13' }
on ready { coreID: '340033001947313436323436',
  ip: '192.168.1.147',
  product_id: 65535,
  firmware_version: 65535,
  cache_key: '_13' }
Core online!
1: Core disconnected: socket error Error: read ETIMEDOUT { coreID: '340033001947313436323436',
  cache_key: '_6',
  duration: 691.962 }
Session ended for _6
onSocketData called, but no data sent.
1: Core disconnected: socket close false { coreID: '340033001947313436323436',
  cache_key: '_13',
  duration: 91.676 }
Session ended for _13
Connection from: 192.168.1.147, connId: 15
recovering from early data!  { step: 5, data: '41017cdcf6b174', cache_key: '_14' }
on ready { coreID: '340033001947313436323436',
  ip: '192.168.1.147',
  product_id: 65535,
  firmware_version: 65535,
  cache_key: '_14' }
Core online!

I've seen various posts on here about similar issues, but the solutions didn't seem overly relevant to my situation. Does this look like a cloud-server issue or a firmware issue? I didn't have this issue before, and nothing should have changed to cause it; it was staying connected for hours/days before.

I've heard the local cloud blamed for causing issues because its latency is potentially "too low". Could that be a factor here? I'm hoping someone has an idea of what the problem could be, or that someone more familiar with the networking code can help me figure out what is going on. Thanks!

I believe I've figured out why the firmware is disconnecting. This code in the spark_protocol.cpp file seems to be the source:

system_tick_t millis_since_last_message = callback_millis() - last_message_millis;
if (expecting_ping_ack)
{
  if (10000 < millis_since_last_message)
  {
    // timed out, disconnect
    expecting_ping_ack = false;
    last_message_millis = callback_millis();
    return false;
  }
}

Basically, if the firmware hasn't received a message from the server in 10 seconds, it voluntarily disconnects because it suspects a timeout. Does this point the finger at the cloud server as the source of the issue?

@lanteau, since you don't give a lot of insight into your code, it's hard to tell.

But for Spark Core users it has become sort of common knowledge that the firmware should be written in a way that allows for a regular cloud "ping" to stay online. Another way to get around this is to actively deactivate the connection temporarily (e.g. for a long-running task that must not be interrupted).
A third way is to call Spark.process() from time to time within your code, to let the Core do its cloud housekeeping.
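A minimal sketch of that third approach, assuming nothing about your actual firmware (the helper doOneChunkOfWork() is made up for the example), might look like this:

// Stand-in for whatever long-running work would otherwise block loop()
// for more than the ~10 second keep-alive window.
void doOneChunkOfWork()
{
  delay(100);
}

void loop()
{
  unsigned long start = millis();
  while (millis() - start < 60000)   // a task that takes about a minute overall
  {
    doOneChunkOfWork();
    Spark.process();                 // let the Core do its cloud housekeeping in between
  }
}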

This answer might have popped up sooner, if it wasn't for this part of your first post :wink:

For my part, when I first read your post, I thought you had already found out about the 10 sec timeout and that it might be a problem I couldn't help with anyway - especially without a clue about your code.

@ScruffR None of the communications or spark protocol related code has changed in my firmware. Currently my application.cpp isn't even doing anything - the loop() is empty - so it can't be a matter of the firmware being too busy to check in.

Based on my above reply, it seems like the firmware expects the server to send it a message every so often and that’s not happening. But I could be wrong…

Could you try connecting to the remote Spark cloud?
That way you might be able to distinguish between a Core issue and a local cloud issue.
You could also try forcing a local cloud server response from your firmware by doing a Spark.publish() or so.
Normally the "ping" in SPARK_WLAN_LOOP should do this anyway, but a test publish doesn't hurt.

Being a custom-built "core" (an STM32F4-Discovery), I don't have access to the official Spark Cloud, nor have I attempted to get it.

As far as the code goes, here are my repositories. But like I said, the WiFi and communications code is identical to the original Spark firmware.



Sorry, I forgot that Spark needs to know your "Core" before you can use the cloud.
Have you got an original Core to do some parallel testing?
With the original you could then test the local against the remote cloud and maybe even (whisper it) spoof the Core ID on your own device :wink:

While you were writing your previous post, I edited my earlier post to suggest forcing communication by just doing a Spark.publish() or so, to test the cloud server.

Silly question, since you said nothing changed, but just to make sure: Did you maybe update the cloud server recently?

I did throw a Spark.publish() in the setup() function. It didn’t seem to make a difference.

Nothing changed with the cloud server. When this started happening I moved my existing server folder to a backup and reinstalled the server from the GitHub repo. The same issue persists, so I suppose that rules out any corruption or files getting unexpectedly changed.

I do have a Core; unfortunately, the member of my group who is developing our Android app has it so he can develop without our hardware. I may have to buy another Core if testing with one turns out to be necessary. This is just so weird, since I didn't have this issue originally - it stayed connected for hours.

That’s odd then! Nothing has changed, but it suddenly stops working :thought_balloon: !? :question:

I'm slowly running out of ideas then :blush:
One final thing to check:
Has there been any change in the WiFi infrastructure (routers, APs, firewall, EMF pollution, …)?
And, most unlikely, could your hardware have developed a problem (EEPROM/flash wear-out, antenna, …)?

If all this doesn’t lead anywhere it’s maybe time to beg the gurus @satishgn and @mdma for more insight.

I missed the "setup()" bit. Try putting it into loop() together with a delay(1000) to do it regularly, but not too often.
And see if your published event turns up in the :cloud:
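Something like this minimal test sketch would do (the "heartbeat" event name is just an arbitrary example):

void setup()
{
}

void loop()
{
  Spark.publish("heartbeat");   // arbitrary event, just to generate regular cloud traffic
  delay(1000);                  // don't publish more often than roughly once per second
}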

Strange that this was working and then no longer is. :question: :worried: Are you able to revert your code to the state when it was working?

The first message in the log - "recovering from early data" - makes me wonder if there's a race condition. Maybe inject a delay of a second after the initial handshake in the communication lib; maybe that will help.

Does your code have this fix?

The repo linked in post #4 seems to have the fix incorporated - I guess the local copy will have too :wink:

@kennethlimcp as @ScruffR just mentioned, yes, it has that fix. Has @mdma or anyone else ever packet-sniffed the connection between the Core and the cloud server? I've just observed an interesting phenomenon. I pointed out this code earlier:

else
{
  system_tick_t millis_since_last_message = callback_millis() - last_message_millis;
  if (expecting_ping_ack)
  {
    if (10000 < millis_since_last_message)
    {
      // timed out, disconnect
      expecting_ping_ack = false;
      last_message_millis = callback_millis();
      return false;
    }
  }

If I packet sniff, then right before the connection is lost there are about 10 seconds of packets sent from the device to the Spark server, each spaced roughly 0.6 seconds apart. This is a stark contrast to the relatively quiet traffic when everything is working right. I experimented with changing it to:

 if (30000 < millis_since_last_message)

and I then observed 30 seconds of this packet “flood” from the device to the server. So it seems that this stream of packets is what leads to the disconnect. Any thoughts on these findings?

Not sure if this provides more information, but if I halt my debugger (processor halted) when this packet flood starts, the flood keeps going regardless. It seems the CC3000 is continuing to send packets on its own.

EDIT: They appear to be TCP Retransmissions (if that helps).

EDIT2: Looks like I found the problem. I created an NTP class to update the time/date using the Network Time Protocol, and I was calling udp.begin() in the main loop. Apparently that has this effect; removing it returned my connection to being stable. Sorry for wasting anyone's time, and thanks for your help!
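For anyone finding this thread later, the fix boiled down to something like the following sketch (ntpLocalPort is a made-up name here, since my actual NTP class isn't shown): bind the UDP socket once in setup() instead of calling udp.begin() on every pass through loop().

UDP udp;
const unsigned int ntpLocalPort = 8888;   // hypothetical local port for NTP replies

void setup()
{
  udp.begin(ntpLocalPort);                // bind the socket once, up front
}

void loop()
{
  // send/receive NTP packets here as needed, reusing the already-open socket
  // (previously udp.begin() was called here on every pass, which broke the cloud link)
}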

So much for "nothing has changed" :wink: It's often said quickly, but then it turns out that minor changes which shouldn't have any influence do have side effects.
That's often how it goes for me too: I ponder and search and try until I decide to ask for a second opinion, and then end up finding it myself anyway, prompted by one question or another from outside.
If this is how it worked for you, our time wasn't wasted.

Glad to hear that you’re back on track again :+1:

Could you please edit your title to "[SOLVED] …"?

EDIT: I have altered the title for you already.

@ScruffR Would it be possible for you to explain why the repeated call to UDP.begin() was causing this? I’d at least like to take away a lesson from it.

@lanteau, I'm not completely sure about the actual cause, since I also have no idea about the rest of your code, but there are some possible explanations.
As you said, the problem occurred after one to two minutes, so we shouldn't assume that UDP.begin() alone was the actual cause; I'd expect that in the meantime there was some UDP setup and packet sending/receiving going on too.
So any of these might have contributed.

Since in spark_wiring_udp.h we see this

#define RX_BUF_MAX_SIZE 512

class UDP : public Stream {
private:
  uint8_t _sock;
  uint16_t _port;
  IPAddress _remoteIP;
  uint16_t _remotePort;
  sockaddr _remoteSockAddr;
  socklen_t _remoteSockAddrLen;
  uint8_t _buffer[RX_BUF_MAX_SIZE];
  uint16_t _offset;
  uint16_t _total;
  [...]
};

and in spark_wiring_udp.cpp we see for example this

uint8_t UDP::begin(uint16_t port)
{
  int bound = 0;
  sockaddr tUDPAddr;

  if (WiFi.ready())
  {
    _sock = socket(AF_INET, SOCK_DGRAM, IPPROTO_UDP);
    DEBUG("socket=%d", _sock);
    if (_sock >= 0)
    {
      flush();
      _port = port;

      memset(&tUDPAddr, 0, sizeof(tUDPAddr));
      tUDPAddr.sa_family = AF_INET;
      tUDPAddr.sa_data[0] = (_port & 0xFF00) >> 8;
      tUDPAddr.sa_data[1] = (_port & 0x00FF);

      DEBUG("bind socket=%d", _sock);
      bound = bind(_sock, (sockaddr*)&tUDPAddr, sizeof(tUDPAddr)) >= 0;
      DEBUG("socket=%d bound=%d", _sock, bound);

      if (!bound)
      {
        stop();
      }
    }
  }
  return bound;
}

we have to take into consideration that by default we only have one UDP object instantiated, and thus any action on the UDP object meant for NTP might interfere with any other use of UDP and vice versa.
And if you have any kind of asynchronism (e.g. interrupts, timers) in your use of UDP, you might also face the risk of race conditions.

Sorry if this is not helping or does not hit your actual point, but that’s the best I can think of at the moment.

Just a tip: if you’re not already using version control locally, I can highly recommend it.

Being able to quickly see the differences between a known working version and one that doesn’t work helps you hone in directly on the problem, and can save hours of guessing what might have changed. As you have found, it’s easy to overlook seemingly trivial changes. :wink: Very glad you found the cause. :thumbsup:

When you say "repeated calls to UDP.begin()", do you mean without a corresponding UDP.stop()?

If so, on each call you are using up one of the seven sockets available on the TI CC3000 (one of which is used for the cloud connection, if you are using it), so you will rapidly run out of sockets.
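As a rough illustration (pollNtpOnce() and the port number are made up, not taken from @lanteau's code), pairing the calls would look like this:

UDP udp;

// Hypothetical helper: open the socket, do the UDP work, then release the socket.
void pollNtpOnce()
{
  if (udp.begin(8888))   // begin() returns non-zero when the bind succeeded
  {
    // ... send the NTP request and read the reply here ...
    udp.stop();          // give the socket back so the CC3000 pool isn't exhausted
  }
}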