Hard fault caused by the TCPServer example from the Spark docs and a simple Python client

Hi @Dave,
All my client.write calls have the correct size specified. I’m not sure what to do at this time other than rewrite with my own socket calls.

Unfortunately this has stalled my development; until I find a workaround, or wait for the Photon, I’m not sure what else to do. I’ve thought about adding a BTLE micro and feeding I2C data to the Spark from other sensors, but that’s a fair bit of work.

What’s confusing to me is that other people don’t seem to be having problems. I have an app that wakes up, turns on the CC3000, writes a chunk of data, then closes the socket and shuts the CC3000 down; that has been working for over 20 days. I guess keeping WiFi on all the time with frequent socket connections just isn’t stable.

1 Like

You are not alone. Other people are having problems.

Sometimes, I grant, it will be their own fault. The example cited above, where a TCP or UDP write reads memory it shouldn’t because of an improperly terminated string, is just one of the many ways to screw up any C program.

But often the fault is with the Spark Core. I too am in your position. The Spark Core does not work as advertised, as intended, or as documented in precisely the area which distinguishes the Core from the Arduino: the networking and the Cloud. I cannot use the Cloud because if it goes away (for whatever reason) my user code blocks. I use the so-called “UDP” instead, but that is unstable, and so I have to force my Cores to reboot every 15 minutes. Other people report similar problems.

When pushed, Spark HQ reports that it has stopped actively addressing problems with the Core while it focuses on the Photon. That is unreasonable, in my view.

1 Like

@psb777 You may just want to shut down the WiFi every 10 minutes or so and restart it. I’ve had an app running for over 20 days that does that, only in reverse: it monitors a bunch of sensors, and every 10 minutes it starts up the WiFi, posts its data, and shuts the WiFi down again. The loop keeps running the whole time, so instead of rebooting the Core you just reboot the WiFi. (A sketch of the pattern is below.) I’m not sure it will help in my case, though, since I’ve found that if the client tries to connect while the server isn’t responding, the client panics. :frowning:
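
Roughly, the pattern looks like this. This is a minimal sketch rather than my actual code: it assumes the documented WiFi.on() / WiFi.connect() / WiFi.ready() / WiFi.off() behavior in one of the non-automatic system modes, and postData() is just a stand-in for your own send routine:

#include <application.h>

#define POST_INTERVAL_MS (10UL * 60UL * 1000UL)   // ~10 minutes between posts

unsigned long lastPost = 0;

void postData()
{
  // open a socket, write the sensor readings, close the socket
}

void setup()
{
  // sensor setup goes here
}

void loop()
{
  // read sensors here; the loop keeps running the whole time

  if (millis() - lastPost >= POST_INTERVAL_MS) {
    WiFi.on();                                   // power up the CC3000
    WiFi.connect();
    while (!WiFi.ready()) SPARK_WLAN_Loop();     // wait for an IP address
    postData();
    WiFi.off();                                  // power the CC3000 back down
    lastPost = millis();
  }
}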

1 Like

Hey Guys,

I spent a few hours looking into this yesterday (which is why I posted), so I would definitely say we’re actively supporting and helping resolve issues whenever we can. :slight_smile: The firmware we’re writing for the Photon uses a hardware abstraction layer, so any improvements we make there will also apply to the Core. We haven’t released that yet because it’s still under active development, but I personally am very interested in seeing this issue fixed as well. I’m going to look into offering a bug bounty to see if we can speed up the resolution.

Thanks,
David

Hi @mtnscott,

We’re going to go ahead with writing up and posting a bug bounty. I think you already sent me your code; did you have any luck producing a minimal failing test case? If you guys have specific failing test examples, can you share them with me?

I’m going to write up guidelines, test cases, and examples, and we’re going to post a bug bounty. Even if we don’t have the time ourselves this week, we still care deeply about the experience and are willing to pay to improve it for everyone. :slight_smile:

Thanks,
David

Great to hear; we need more blog posts about what is happening with the Photon. I’m putting my Christmas light project on hold for the time being, as the Core just isn’t reliable enough for my needs on that project (and it’s only for next Christmas anyway).

@Dave Use this code example:


// SYNTAX: TCPServer server = TCPServer(port);
// Parameters: port - the port to listen on (int)
// EXAMPLE USAGE
#include <application.h>

// listen on port 8888 (telnet's default is 23, but any free port works)
TCPServer server = TCPServer(8888);
TCPClient client;

/**************************************************************************/
/*!
    Heartbeat variables and defines
*/
/**************************************************************************/
#define HEARTBEAT_DELAY 1000
int heartbeat_led = D7;
bool heartbeat_state = false;
unsigned long SparkHeartBeatNext;          // next time to toggle heartbeat led

void setup()
{
  // start listening for clients
  server.begin();

  // Configure the LED heartbeat pin
  pinMode(heartbeat_led, OUTPUT);
  digitalWrite(heartbeat_led, LOW);
  heartbeat_state = false;
  SparkHeartBeatNext = millis() + HEARTBEAT_DELAY;

  // Make sure your Serial Terminal app is closed before powering your Core
  Serial.begin(9600);
  // Now open your Serial Terminal, and hit any key to continue!
  while(!Serial.available()) SPARK_WLAN_Loop();

  Serial.println(WiFi.localIP());
  Serial.println(WiFi.subnetMask());
  Serial.println(WiFi.gatewayIP());
  Serial.println(WiFi.SSID());
}

void loop()
{
  unsigned long _curTime = millis();

  /*
   * Flash the heartbeat led
   */
  if (_curTime > SparkHeartBeatNext)
  {
    digitalWrite(heartbeat_led, heartbeat_state ? LOW : HIGH);   // pulse the LED to its opposite state
    delay(20);
    digitalWrite(heartbeat_led, heartbeat_state ? HIGH : LOW);   // and 20 ms later restore it
    SparkHeartBeatNext = millis() + HEARTBEAT_DELAY;
  }

  client = server.available();
  if (client.connected()) {
    // echo all available bytes back to the client
    while (client.available()) {
      server.write(client.read());
    }
    server.begin();   // start listening again for the next connection
  }
}

It’s what I was using with @bko and @guppy. Simply put: if you write data to the Spark Core, it can’t seem to handle it and gets into a state where it appears to hang; it doesn’t seem to recover.

Here is some really simple Python code to push and pull data from the Spark:

#!/usr/bin/python

import socket
from time import sleep

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect(('192.168.1.105', 8888))

message = "0123456789\n"

while True:
#   s.send(message)             # tried plain send() as well
    s.sendall(message)
    print s.recv(len(message))
#   sleep(1)                    # adding a delay only postpones the failure

The above example will fail after the very first send/receive. You could say it hammers the Spark; I’ve tried putting in delays, but it will eventually fail anyway.
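
One more data point: a variant I’ve been meaning to try paces itself on the echo instead of a fixed sleep, treating the echoed bytes as an application-level ACK before sending again. Just a sketch; the 5-second timeout is an arbitrary choice:

#!/usr/bin/python
# Sketch: wait for the complete echo before sending again, so the echo
# acts as an application-level ACK that throttles the sender.

import socket

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.settimeout(5.0)                        # give up if the Core stops echoing
s.connect(('192.168.1.105', 8888))

message = "0123456789\n"

while True:
    s.sendall(message)
    received = ''
    while len(received) < len(message):  # recv() may return partial data
        chunk = s.recv(len(message) - len(received))
        if not chunk:
            raise RuntimeError('connection closed by the Core')
        received += chunk
    print received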

The code I sent you earlier is a more complicated example that sends a JSON message (maybe a few hundred bytes) every 5 seconds; in that case the Core fails after about 10-15 minutes (once it ran for 45 minutes). When the Core goes into a failed state, it no longer accepts incoming TCP connections and runs the user loop only once every 20 seconds. I have D7 blink when the user loop runs; otherwise everything looks normal (breathing LED).

UPDATE: Just updated the Spark example above to blink an LED while the loop runs.

UPDATE: 12:26 PM - Just ran the test again; the Core goes into a panic SOS. I’m not set up for JTAG debugging, so I can’t really dig any deeper. Maybe a custom firmware build will get us more debugging info.

2 Likes

I’m happy to read this. In my opinion, one of the bugs that needs addressing is the one that, as I understand it, cannot be resolved by trickle-down from the Photon development. This is the design bug where an evaporating Cloud causes user code to stop. Once said to be a necessary consequence of blocking calls to the TI chip, a fix for this accidental/deliberate design bug is now conceded to be possible by Spark HQ and is said to be well within the skill set of a reasonably clued-up user. I am not that user.

@psb777 Not sure the Cloud has anything to do with this. This, IMO, is in the WiFi layer; I can get it to fail the same way without the Cloud active. So is it the evaporating WiFi? :wink:

2 Likes

You’re right. Perhaps a topic change was needed earlier. But @Dave, I thought, was responding to us on the general point of a lack of support for Core issues. I agree this particular issue is not Cloud related.

However, I would say we’re complaining about (oops, discussing) the same thing, fundamentally: there has seemed to be a family of related Core problems not getting much attention. Your particular problem does at least potentially benefit from the trickle-down effect of the Photon development. But that the Core cannot use the Cloud and still allow user code to run should the Cloud fail is an issue that can only be addressed directly. Spark HQ admit a solution is possible and relatively simple. They have said they will not be working on it.

I did not realize they know how to fix it but have not prioritized it. That means all of the inventory I currently have may never be reliable, and I will have to either return it or write it off as a loss. :frowning: My applications need to retain state, so rebooting is not an option, and writing state to flash every 10 minutes would exceed the flash write-cycle limit in very short order.

1 Like

Is there something I missed here? I don’t think anybody knows how to fix this right now. When did “Spark HQ admit a solution is possible and relatively simple”?

@Dave said he would offer up a bug bounty to a member of the community who supplies a pull request that fixes this; that is all I see here.

1 Like

Hey Guys,

I’d appreciate it if we could keep the discourse a little less accusatory. We’re here to help: I’m a Spark employee, and I’m here trying to help. We’re committed to supporting the Core and the community, and we’re making every effort to make the Photon as compatible as possible.

This isn’t a solved issue yet; if it were, it’d be fixed already :slight_smile: The issue I’m seeing, based on this conversation, is that the current flow control when sending data over the CC3000 is insufficient for long-running / high-traffic use. I believe the 1.14 patch TI provided in the last few weeks exposes ACKs, which could be used to throttle requests to the device and avoid overwhelming it. It’s an embedded device, so we shouldn’t expect to be able to stream video or anything, but I’d like a stable transmit rate we can count on when developing, one that doesn’t cause the CC3000 to freeze up. I’d also add that I don’t think this is necessarily a simple fix; well, not simple for me, anyway. :slight_smile:
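
In the meantime, one application-level workaround to experiment with is breaking large writes into small chunks and pausing between them, so the CC3000’s buffers get a chance to drain. This is only a sketch of mine, not a verified fix; the chunk size and delay are guesses to tune:

#include <application.h>

// Sketch: pace outgoing writes so the CC3000 isn't overwhelmed.
// CHUNK and the 5 ms delay are guesses, not verified values.
void writeThrottled(TCPServer &srv, const uint8_t *buf, size_t len)
{
  const size_t CHUNK = 64;
  for (size_t i = 0; i < len; i += CHUNK) {
    size_t n = (len - i < CHUNK) ? (len - i) : CHUNK;
    srv.write(buf + i, n);   // documented write(buffer, size) overload
    delay(5);                // let the CC3000 drain its buffers
  }
}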

There is a big community around the cc3000, and our community is one of the biggest users, so I wouldn’t worry about not getting the support you need.

Thanks,
David

1 Like

As a side note, I don’t want to hijack this thread, but

@psb777 I think maybe you missed the release and documentation for the semi-automatic and manual modes that address your concerns here:

http://docs.spark.io/firmware/#advanced-system-modes
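
For example, here is a minimal sketch from my reading of those docs; the reconnect check in loop() is just illustrative:

#include <application.h>

// Semi-automatic mode: user code starts immediately and keeps running
// even if the Cloud connection drops.
SYSTEM_MODE(SEMI_AUTOMATIC);

void setup()
{
  Spark.connect();   // connect to the Cloud when *we* decide to
}

void loop()
{
  // application work continues regardless of Cloud state
  if (!Spark.connected()) {
    // optionally retry here, e.g. Spark.connect() on a timer
  }
}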

Thanks,
David

AMEN! :smile:

All I need is a stable platform for a few hundred bytes of data every so often. Currently I can't get 200 bytes of data every 5 seconds to stay stable for longer than 10-15 minutes. I don't think that is an unreasonable ask for the Spark.

1 Like

Agreed. I want to love the Spark Core, but on something that is such a basic feature of it, it’s failing terribly.

3 Likes

I moved 5 posts to a new topic: User code sometimes blocked by cc3000 in manual mode?

I feel the same way. Bringing some robustness to the TCP stack would be a wonderful thing; with the code opaquely locked inside the CC3000, our options are limited.

But try we must! I have some free time over the holiday period to take a look at the TCP stack, in particular the ACK/NAK application handling that the new TI driver patch offers, in the hope that we can fix these issues. Fingers crossed! :slight_smile:

3 Likes

@mdma Hey, I’m happy to pitch in too. Let me know how I can help. I have a JTAG shield and a Segger J-Link, and I’m not afraid to use them :smile:

OK, well, I have it, but I haven’t yet figured out how to use it.

Thanks for your offer of help! There’s a great tutorial by Elco on how to set up JTAG debugging. Familiarize yourself with debugging first, and then we can take it from there!

1 Like