Spurious Subscribe failures

I have two cores, both connected to my home Wi-Fi. Core 2 is indoors - it Publishes the temperature / humidity as well as the state of two toggling push buttons. Core 1 is outdoors - it Publishes the temperature / humidity and Subscribes to the state of the push buttons to turn on / off two relays controlling two lights. The relays can also be controlled by a Function.
Everything works fine for days at a time; then, spuriously, Core 1 fails to respond to the Published push button states. I have verified that Core 2 is still Publishing the states. Core 1 continues Publishing the temperature / humidity and responds to the Function controlling the relays.
The only way to remedy the problem is to power off both cores and then power them back on. Doing one at a time does not remedy the problem.
Any idea what could be happening and how to deal with it?

Hey @drbrodie, most likely you're seeing this known issue: https://github.com/spark/firmware/issues/278

No timeline on a fix, but I'd love to get it into the next release. I'll add a comment to the issue linking here.


I implemented the fix recommended by FRAGMA (Sep '14) and the problem was solved.

However, a few days ago I flashed my existing publish / subscribe code to a new third core. At first it worked just fine, but after a few hours neither Core2 nor Core3 subscribes to the data published by Core1 (which I have verified is still publishing). Furthermore, until I commented out the "FRAGMA fix" I would periodically get the SOS Code 1.
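
For reference, the fix in question is roughly this: expose the internal protocol object, then re-send the subscription when it appears dead:

extern SparkProtocol spark_protocol;

// when no subscribed data has arrived for a while,
// re-register the subscription with the cloud:
spark_protocol.send_subscription("temphumpool", SubscriptionScope::MY_DEVICES);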

Any suggestions?

Hi @drbrodie,

Hmm, I think the firmware team is hoping to address this problem this sprint, but in the meantime you could do a 'scorched earth' type workaround: say, publish a 'heartbeat' event once a minute, and if your listeners don't receive it for a few consecutive minutes, have them call System.reset(). It's a bit brute force, but it should help things fix themselves if messages stop coming through:

unsigned long last_heartbeat = 0;   // millis() timestamp of the most recent heartbeat
#define HEARTBEAT_PERIOD_SECONDS 60
#define MAX_MISSED_HEARTBEATS 3

void setup() {
    Serial.begin(115200);
    Spark.subscribe("heartbeat", heartbeat_handler);
    //Spark.subscribe("heartbeat", heartbeat_handler, MY_DEVICES);
    last_heartbeat = millis();
}

void loop() {
    // Grace period: for the first minute after boot, keep the timestamp
    // current so we don't reset before the first heartbeat has a chance to arrive.
    if (millis() < 60000) {
        last_heartbeat = millis();
    }

    
    double elapsedSeconds = (millis() - last_heartbeat) / 1000.0;
    if (elapsedSeconds > (MAX_MISSED_HEARTBEATS * HEARTBEAT_PERIOD_SECONDS)) {
        Serial.println("Subscribe is dead, long live subscribe!");
        delay(500);
        System.reset();
    }
    else {
        Serial.println("things are okay... but it's been " + String(elapsedSeconds) + " since last heartbeat");
        delay(1000);
    }
    
}


void heartbeat_handler(const char *topic, const char *data) {
    last_heartbeat = millis();
    Serial.println("Heartbeat... at " + Time.timeStr());
}
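
For completeness, the publishing side just needs some device emitting the heartbeat once a minute. A minimal sketch (matching the public subscribe above; if you use the MY_DEVICES variant, publish with PRIVATE instead) might be:

#define HEARTBEAT_PERIOD_SECONDS 60

void setup() {
}

void loop() {
    // publish the event the listeners are subscribed to
    Spark.publish("heartbeat", "beat");
    // or, for a private event: Spark.publish("heartbeat", "beat", 60, PRIVATE);
    delay(HEARTBEAT_PERIOD_SECONDS * 1000);
}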

Dave,
I don't know if that is a solution. I have manually reset the cores numerous times but have never been able to re-establish a subscription, not once, not even for a second, since the initial failure. Does System.reset() do anything different than a manual reset? If not, I think we need to find out why the subscribe failure, on hardware / firmware which had been operating well for weeks, coincided so closely with the addition of the third core.

Dean

Hi @drbrodie,

I'm just seeing your message now; looks like I didn't get tagged, sorry.

Hmm, that's weird. Would you want to share your code and we can take a look?

Thanks,
David

Dave,

Here is the code which, prior to this problem, ran flawlessly for months. There are numerous other things going on here, but the key point is that it does not seem to subscribe to the event "temphumpool" sent by another core. If the spark_protocol.send_subscription("temphumpool", SubscriptionScope::MY_DEVICES) line previously added to correct the spurious failures is not commented out, the core periodically goes into SOS Code 1 and resets.

// This #include statement was automatically added by the Spark IDE.
#include "LiquidCrystal.h"

// This #include statement was automatically added by the Spark IDE.
#include "dht.h"
#define DHTPIN D4
#define DHTTYPE DHT22
DHT dht(DHTPIN, DHTTYPE);

extern SparkProtocol spark_protocol;  // the "FRAGMA fix": exposes the firmware's internal protocol object

char eventinfo[64];
unsigned int ms;
int publishdelay = 5 * 60 * 1000;

#define ONE_DAY_MILLIS (24 * 60 * 60 * 1000)
unsigned long lastSync = millis();

void displayData(const char *event, const char *poolData);
LiquidCrystal lcd(A5, A4, A3, A2, A1, A0);

int inputPin1 = D0;  //local button
int inputPin2 = D1;  //local button

int sendLedPin1 = D5;//local LED
int sendLedPin2 = D6;//local LED

volatile int sendLedVal1 = 0; //local LED status (volatile: toggled in an ISR)
volatile int sendLedVal2 = 0; //local LED status (volatile: toggled in an ISR)
int sendLedVal1Old = 0; //local LED status
int sendLedVal2Old = 0; //local LED status

unsigned long lastPub = millis();
unsigned long lastSub = millis();
unsigned long elapsedSub;
unsigned long lastLcd = millis();


void PublishDHTInfo(){
    float h = dht.readHumidity();
    float t = dht.readTemperature();
    float d = dht.dewPoint(t, h);
    t = (t*1.8) +32;   // convert C to F
    d = (d*1.8) +32;

    sprintf(eventinfo, "T=%.0f H=%.0f%% DP=%.0f", t, h, d);

    Publish(eventinfo);
}


void setup(){
    

    dht.begin();
    
    Spark.subscribe("temphumpool", displayData, MY_DEVICES);

    lcd.begin(20, 4);

    lcd.print("Out:");
    lcd.setCursor(0, 1);
    lcd.print("Waiting for data");
    lcd.setCursor(0, 2);
    lcd.print("In:");
    lcd.setCursor(0, 3);
    lcd.print("Waiting for data");
  
    pinMode(sendLedPin1, OUTPUT);
    pinMode(sendLedPin2, OUTPUT);

    digitalWrite(sendLedPin1, LOW);
    digitalWrite(sendLedPin2, LOW);
    
    Spark.publish("pToggle1", "State", 0, PRIVATE);
    Spark.publish("pToggle2", "State", 0, PRIVATE);
    
    Spark.function("fToggle1", netToggle1);
    Spark.function("fToggle2", netToggle2);
    
    attachInterrupt(inputPin1, L1, RISING);
    attachInterrupt(inputPin2, L2, RISING);
    
}

void displayData(const char *event, const char *poolData){
    lcd.setCursor(0, 1);
    lcd.print(poolData);
    lastSub = millis();   // record when subscribed data last arrived
}

void Publish(char* szEventInfo){
    Spark.publish("temphumhse", szEventInfo);
}

   
void loop() {

    if (millis() - lastPub > 60000) {
        PublishDHTInfo();
        lastPub = millis();
    }

    if (millis() - lastLcd > 60000) {
        float h = dht.readHumidity();
        float t = dht.readTemperature();
        float d = dht.dewPoint(t, h);
        t = (t*1.8) +32;
        d = (d*1.8) +32;
        sprintf(eventinfo, "T=%.0f H=%.0f%% DP=%.0f ", t, h, d);
        lcd.setCursor(0, 3);
        lcd.print("                    ");   // clear the line
        lcd.setCursor(0, 3);
        lcd.print(eventinfo);
        lastLcd = millis();
    }


    lcd.setCursor(0,0);
    lcd.print("Out:");
    
     
    elapsedSub = (millis() - lastSub)/1000;
    if (elapsedSub > 100)
        {
            // no subscribed data for 100s: re-send the subscription (the "FRAGMA fix")
            // and reset the timer so we don't re-send on every pass through loop()
            spark_protocol.send_subscription("temphumpool", SubscriptionScope::MY_DEVICES);
            lastSub = millis();
        }
        
        
    if (sendLedVal1 != sendLedVal1Old)
        {
            digitalWrite(sendLedPin1, sendLedVal1 ? HIGH : LOW);
            Spark.publish("pToggle1", sendLedVal1 ? "ON" : "OFF");
            sendLedVal1Old = sendLedVal1;
        }

    if (sendLedVal2 != sendLedVal2Old)
        {
            digitalWrite(sendLedPin2, sendLedVal2 ? HIGH : LOW);
            Spark.publish("pToggle2", sendLedVal2 ? "ON" : "OFF");
            sendLedVal2Old = sendLedVal2;
        }
}


int netToggle1(String command)
{
   if(command.substring(3,6) == "tgl")
        {
        sendLedVal1 = !sendLedVal1;
        digitalWrite(sendLedPin1, sendLedVal1 ? HIGH : LOW);
        Spark.publish("pToggle1", sendLedVal1 ? "ON" : "OFF"); 
        }
   return 1;
}

int netToggle2(String command)
{
   if(command.substring(3,6) == "tgl")
        {
        sendLedVal2 = !sendLedVal2;
        digitalWrite(sendLedPin2, sendLedVal2 ? HIGH : LOW);
        Spark.publish("pToggle2", sendLedVal2 ? "ON" : "OFF"); 
        }
   return 1;
}

void L1()
{
    sendLedVal1 = !sendLedVal1;   // ISR for button 1 (RISING edge)
}

void L2()
{
    sendLedVal2 = !sendLedVal2;   // ISR for button 2 (RISING edge)
}

I have run a stripped-down version of this code which eliminates everything but the portions related to displaying the subscribed data, with no luck.
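The test was essentially just the subscribe and the display handler:

#include "LiquidCrystal.h"

LiquidCrystal lcd(A5, A4, A3, A2, A1, A0);

void displayData(const char *event, const char *poolData){
    lcd.setCursor(0, 1);
    lcd.print(poolData);
}

void setup(){
    lcd.begin(20, 4);
    lcd.print("Waiting for data");
    Spark.subscribe("temphumpool", displayData, MY_DEVICES);
}

void loop(){
}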
Dean

@Dave can you update us on the status of this fix? Any idea of a time frame for it? It is seriously holding up my projects. Thanks.


Hi @Muskie,

I think the firmware team hasn't had a chance to address this yet, so maybe I can add a workaround on the cloud. I think I'll build and expose a feature so you can ask the cloud to remember your subscriptions until you flash a new app, and set them back up when the device reconnects. This could be a workaround until the firmware can be fixed. I probably won't get a chance to look at this for at least a few days, but I'll bump this thread when I do.

Thanks,
David

Ahh, I've totally had this problem too! @Dave, curious if there is any update on your cloud-based workaround? Thanks!

Any news on the fix?

I'm subscribing to published events via node-red on a Pi, so I cannot implement the @Dave workaround...

Heya @achronite,

Hmm, if you're hitting the API for server-sent events, are you seeing that you're losing the connection, or that events aren't coming through?

It'll probably always be the case that network connections disconnect eventually, so it's good to have your SSE code reconnect automatically; we have a few examples of that in SparkJS: https://github.com/spark/sparkjs

I hope that helps! :slight_smile:
David

It is a problem on the node-red side, which I suspect materialises when my spark-core loses its internet connection. Redeploying the nodes in node-red forced the subscription to restart. I'll submit a bug report on the node-red-node-spark code to see if your sparkjs SSE example fix can be incorporated.

Thanks.


The bug for this on GitHub seems to be closed, and a fix of sorts is apparently available, but has it made it into the production code?

Hi @daneboomer,

edit: oops! sent too soon. :slight_smile:

It's normal for clients listening to SSE events to disconnect periodically, and I think the example I wrote in Spark-JS to resubscribe after a disconnect is out in the wild. I'm not sure if you mean something else though?

Thanks,
David


Thanks for your quick reply! :smile:
I did. I wondered if the Particle firmware had been fixed yet so as not to need any software workarounds? Looks like it's been on the cards for a while.

In the meantime, I'll try to use your code, Dave, but will this method only work four times? Thanks


Hi @Dave, sorry, looks like my last reply won't have reached you because I didn't reference you using the @ symbol. :slight_smile:


Hi @daneboomer,

Ahh, thanks for the @ :slight_smile:

As far as I know, this issue was fixed in the most recent firmware, which is what's being used by the Photon. We'll be making that available to the Core as well in the coming weeks.

Thanks!
David


Thanks, @Dave. So if I wait a few more weeks then I won't need to use the heartbeat/reset workaround.

In the meantime, is there an alternative I'm overlooking? Could the Cores "ping" each other more directly, bypassing the cloud but obviously still over WiFi? Is there any other, more reliable way they can send each other the most basic of messages?

In essence, if a switch connected to Core A is HIGH, I would like an LED on Core B to go HIGH more or less instantly (that's instant in human, not electronics, terms). Obviously there's a bit more to my project than that, but if an alternative can do that, it can do everything else I would ask of it, too.
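
One cloud-free option would be raw UDP between the two Cores on the LAN, using the Core's built-in UDP class. A rough, untested sketch (the address, port, and pins here are made-up placeholders): on Core A,

UDP udp;
IPAddress coreB(192, 168, 1, 50);   // placeholder: Core B's LAN address
const int PORT = 8888;              // arbitrary port, same on both ends
int switchPin = D0;
int lastState = LOW;

void setup() {
    pinMode(switchPin, INPUT_PULLDOWN);
    udp.begin(PORT);
}

void loop() {
    int state = digitalRead(switchPin);
    if (state != lastState) {
        // send the new switch state as a single byte
        udp.beginPacket(coreB, PORT);
        udp.write(state == HIGH ? '1' : '0');
        udp.endPacket();
        lastState = state;
    }
}

and on Core B,

UDP udp;
int ledPin = D7;

void setup() {
    pinMode(ledPin, OUTPUT);
    udp.begin(8888);
}

void loop() {
    // drive the LED from the most recently received packet, if any
    if (udp.parsePacket() > 0) {
        char c = udp.read();
        digitalWrite(ledPin, c == '1' ? HIGH : LOW);
    }
}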

@Dave, just a thought, there isn't an alpha or a beta of the firmware available I could use at my own risk to potentially get those bug fixes now, is there?
