Issues with firmware 0.7.0


#21

@ScruffR, I jumped from 0.6.0 to 0.7.0, and the behaviour I see now is not what I saw before. Previously, only devices with a poor connection were disconnected from the Cloud.

I can’t fully rule out firmware issues. I started testing a few devices using the bare minimum of functions/variables to see what happens. The problem is that testing takes a long time, since the issue takes a while to surface.

@seulater, be aware that even if the code works fine with 0.6.X, that doesn’t mean there are no code issues. A new system firmware version may run the same code differently, which can surface an old code problem that did not affect the old system version, and vice versa.


#22

We do see more reports of connection issues with 0.7.0, but in order to get to the bottom of these instances we need to make sure none of the other issues we already know to lead to connection issues are at play - so I’m not denying that it might be a 0.7.0 issue, but we need to make sure we are not chasing a red herring.

In that case we may need to establish which of the intermediate versions introduced the issue.
Also, new Device OS versions usually come with slight differences in the RAM available to the user app, so the possible impact of that needs to be considered too.


#23

I’m thinking about releasing a new update to all my online devices to keep a better track of their RAM usage, but in general, free memory is steady around 40k even when the problem surfaces and that’s more than enough.


#24

One thing System.freeMemory() doesn’t give away tho’ is how the 40k are distributed. To tick off heap fragmentation as possible cause, we’d need to know how big the biggest available consecutive chunks of memory are and whether there are enough of them.

However, with 40k the likelihood of heap fragmentation being the root cause is slim.
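To make the "biggest available consecutive chunk" idea concrete, here is a minimal sketch of one way to probe for it. `largestAllocatableBlock` is a hypothetical helper, not a Device OS call. It assumes that if `malloc()` can serve a block of n bytes it can also serve any smaller one, and that briefly allocating and freeing probe blocks is acceptable in your app.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// Hypothetical helper (not a Device OS API): binary-search for the largest
// single block malloc() will currently hand out, up to upperBound bytes.
// Assumption: if n bytes can be allocated, any smaller size can be too.
std::size_t largestAllocatableBlock(std::size_t upperBound) {
    std::size_t lo = 0;           // largest size known to succeed
    std::size_t hi = upperBound;  // largest size still worth trying
    while (lo < hi) {
        std::size_t mid = lo + (hi - lo + 1) / 2;  // round up so the loop terminates
        void* p = std::malloc(mid);
        if (p != nullptr) {
            std::free(p);  // release at once; we only wanted to know it fits
            lo = mid;
        } else {
            hi = mid - 1;
        }
    }
    assert(lo <= upperBound);
    return lo;
}
```

Comparing this number with the total reported by System.freeMemory() would give a rough feel for fragmentation: a total of 40k with a largest block of only a few hundred bytes would point to a badly fragmented heap.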


#25

I agree 100%. In my case, though, my app is so simple it’s hard to believe (though not impossible) that my code has an issue. In my loop all I am doing is checking the time: when it hits 6AM it turns a port pin HIGH, and at 9PM it turns it back LOW. The only other thing I am doing in the loop is if(!WiFi.ready()), which will flash the plant light once a second until it’s true.


#26

While I don’t have a large sampling of field devices, I do have some Photons that are running 0.6.3 firmware. They appear (although I don’t have an accurate means to track this) to have exceptional cloud connectivity.

I also have a pair of P1’s at the same facility running 0.7.0 firmware, and there are what I’d best describe as periods of cloud connectivity “blackouts” that always seem to resolve themselves. I record a last reset time in a cloud var instance so I can rule out power outages that might eventually be resetting these P1’s.

I’m also running an external watchdog circuit with these P1’s (reset in my foreground loop) and the ApplicationWatchdog() FreeRTOS thread. Perhaps these eventually kick-in, but without any persisted cloud events I can’t go back and see what actually happened (I need to trap and persist these events down the road).

What I can definitively say is that running SYSTEM_THREAD(ENABLED) in my P1/0.7.0 tandems is an almost sure-fire recipe for memory depletion, connectivity woes and SOS codes, whereas foreground particle processing nets much more stable cloud connectivity and operation.

I’ll also note that I have rock-solid System.freeMemory() numbers (verified through yet another cloud var instance) with foreground particle processing whereas I can observe the systematic depletion of memory with identical project code running with SYSTEM_THREAD(ENABLED)

YMMV, but that’s what I’ve observed with my small sample of 14 units.

(p.s. SoftAP mode fails in 0.8.0-rc.4 as we have provisional support for our particle devices)


#27

Hmmmm… SYSTEM_THREAD(ENABLED). That is the one thing that I have not disabled. In fact, when creating a new app it’s the first thing I do / enable.


#28

@jimini, @rwb, @bacichetti, @seulater, @italex, if any of you could provide some “simple” test code that exhibits this connectivity problem sort of reliably I’ll forward that to Particle engineers.
I’m currently discussing this with them and they want to dedicate some extra time to getting to the bottom of it, but need some test scenarios to work with.

The more samples the better to see any common cause.


#29

Impressive!

If you’re able to share, I would love to hear what the application is. Always interesting to see how this tech is being deployed.


#30

I’m running 0.7.0 on LED Matrix Panel with a Modified version of Peekay RGB Clock Panel code which proved to be stable before 0.7.0 was released.

That code has been running for at least a few months now without any lockups. I can’t tell if the cloud connectivity is coming and going out often since I do not watch the clock all the time, but I can say it never requires a manual reset.

My wifi source is my phone, so the Wifi is coming and going often, and on this 0.7.0 firmware, it gracefully reconnects when the Wifi is present again. The constant coming and going of the Wifi does not seem to make the automatic connection process any less reliable.

I can confirm that this code is running in the standard Automatic mode since no mode is defined at the top of the code.

Maybe some of that is helpful to troubleshooting :thinking:


#31

@ScruffR, I’m putting together an example code with samples of most features that could affect background processes. I’ll post it here when I see any sign of the problem with my test devices.
Like RWB, I’m not using system thread enable and we run the system on automatic mode.

@RWB, here is where we use it: Smart Controller


#32

@ScruffR, @jimini, @rwb, @bacichetti, @seulater, @italex

New day, I have good news (good??)!
I flashed 6 devices yesterday with a test firmware and 2 of them are presenting the issue today. It really takes several hours before the behaviour surfaces.

I can see memory degradation. The healthy devices keep a steady 48k of free memory.
The 2 devices losing Cloud connection show an oscillating total of free memory. They lose 1-4k every time they go offline/online, up to a point where the memory rises back to 48k and the cycle starts again.
I can see the RGB blinking cyan during the Cloud reconnections and once every several cycles the RGB will also blink red.

Here is the test code I’m using; it’s just a snippet of the code from my production devices.

const int NUM_SCHEDULES = 25;
const int SCHEDULE_LENGTH = 27;

// FUNCTION DECLARATIONS ********************************
int transmit(String command);
int setSchedule(String command);
int setVariable(String command);
int callTest(String command);

// Variables
char scheduleString[NUM_SCHEDULES * SCHEDULE_LENGTH];
char strcountTcpUsage[20];
char statusVariables[400];
char networkVariables[200];
char lastSchedule[100];     //schedule + time
char lastTransmit[25];
char locationVar[200];
char lastLostSubscription[26];

String  myLocalIp;
String  myPublicIP;
int     rssi;
String  ssid;
String  serialName;

unsigned long previousMillis;
unsigned long cycleTimer_ms;

// callback handler for Particle subscribe public ip and device name,
void handler(const char *topic, const char *data);

void setup() {
    Particle.function("transmit", transmit);
    Particle.function("setschedule", setSchedule);
    Particle.function("setvariable", setVariable);
    Particle.function("calltest", callTest);

    // Exposed Variables - max 20 (max name length is 12 char)
    Particle.variable("status", statusVariables, STRING);
    Particle.variable("network", networkVariables, STRING);
    Particle.variable("schedulelist", scheduleString, STRING);
    Particle.variable("lastschedule", lastSchedule, STRING);
    Particle.variable("lastLostSubs", lastLostSubscription, STRING);
    Particle.variable("lasttransmit", lastTransmit, STRING);
    Particle.variable("location", locationVar, STRING);
    Particle.variable("countTcpCall", strcountTcpUsage , STRING);
    
    // Subscriptions
    Particle.subscribe("particle/", handler);    // handles Spark web responses, such as the public IP
    
    updateWiFiInfo();
}

void loop() {
    cycleTimer_ms = millis() - previousMillis;
    
    if(millis() - previousMillis > 250){
        Particle.publish("debug/loop_delay", String(cycleTimer_ms), 60, PRIVATE );
        Particle.publish("debug/free_memory", String(System.freeMemory()), 60, PRIVATE );
    }
    
    previousMillis = millis();
}


int transmit(String command){
    Particle.publish("debug/transmit", command, 60, PRIVATE );
}

int setSchedule(String command){
}

int setVariable(String command){
    updateWiFiInfo();
}

int callTest(String command){
    
    if(command == "reset"){
        System.reset();
    }
    else if(command == "rssi"){
        Particle.publish("debug/rssi", String(WiFi.RSSI()), 60, PRIVATE );
    }
    else if(command == "memory"){
        Particle.publish("debug/free_memory", String(System.freeMemory()), 60, PRIVATE );
    }
}

void updateWiFiInfo(){
   rssi = WiFi.RSSI();
   ssid = WiFi.SSID();
   myLocalIp = String(WiFi.localIP());
   if (Particle.connected()){
       Particle.publish("particle/device/ip");
       Particle.publish("particle/device/name");
   }
}

void handler(const char *topic, const char *data) {

    //Spark.publish("received " + String(topic) + ": " + String(data));
    String header = String(topic);
    header.trim();
    String message = String(data);

    if(header.endsWith("ip")){
        myPublicIP = String(data);
    }
    else if(header.endsWith("name")){
        serialName = String(data);
    }
}

#33

Thanks @bacichetti for that.
The behaviour you are seeing there may well be in connection with what I said earlier

The “extensive” use of String objects is known to have this side effect. With previous versions it might have been simmering just below the surface.
It would be good - for both, your project and finding the 0.7.0 issue - to reduce the use of String to the bare minimum and only use it where you can’t help it (e.g. Particle.function() callbacks).

BTW, the preferred syntax for Particle.variable() would be

Particle.variable("status", statusVariables);

The old three parameter syntax may become deprecated some time from now.

Also all your Particle.function() callbacks should return an integer.

Additionally, to prevent inadvertent rate limiting of the event publishes, you should add a delay(x*1000) where x is the number of consecutive Particle.publish() calls.
Your loop() with two consecutive publishes every 250ms is definitely running into the rate limit.
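If a blocking delay() is not an option, a non-blocking guard is another way to stay under the limit. The sketch below is only an illustration: `PublishLimiter` is a hypothetical class, not part of the Particle API, and the budget you construct it with (for example a burst of four, regaining one per second) is an assumption you should check against the current documentation.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper, not part of the Particle API.
// Token bucket: holds up to `capacity` tokens and regains one every `refillMs`.
class PublishLimiter {
public:
    PublishLimiter(uint32_t capacity, uint32_t refillMs)
        : capacity_(capacity), refillMs_(refillMs), tokens_(capacity), lastRefill_(0) {
        assert(capacity > 0 && refillMs > 0);
    }

    // Returns true if a publish may be sent at time nowMs (e.g. millis()).
    bool tryAcquire(uint32_t nowMs) {
        uint32_t elapsed = nowMs - lastRefill_;  // unsigned math tolerates rollover
        uint32_t earned = elapsed / refillMs_;   // whole tokens regained so far
        if (earned >= capacity_ - tokens_) {
            tokens_ = capacity_;                 // bucket is full; restart the clock
            lastRefill_ = nowMs;
        } else {
            tokens_ += earned;
            lastRefill_ += earned * refillMs_;   // keep the unused remainder
        }
        if (tokens_ == 0) {
            return false;                        // over budget, skip this publish
        }
        --tokens_;
        return true;
    }

private:
    uint32_t capacity_;
    uint32_t refillMs_;
    uint32_t tokens_;
    uint32_t lastRefill_;
};
```

In loop() each publish would then be wrapped in something like `if (limiter.tryAcquire(millis())) { ... }`, so an over-budget message is dropped or deferred instead of stalling the loop.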


#34

@ScruffR, thanks for pointing out the problem with the Strings and Variable declaration!

My original program has all the returns for the functions; I forgot to add them to my test example. I’m going to make all the changes, re-flash my devices and wait several hours again to see if there is any change.

My test program does not publish unless the device loses the Cloud connection and comes back online, so that can be considered as a problem. In any case, I’ll merge the messages into one Publish. The Publish limit is (or at least was) 4 Publishes every 4 seconds.
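A rough sketch of what merging the two debug messages could look like, using a fixed char buffer instead of temporary String objects. `formatDebugMessage` and the field names in the format string are made up for illustration; only the two values (loop delay and free memory) come from the test code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <cstring>  // std::strcmp, handy when checking the result

// Hypothetical helper: writes both debug values into a caller-owned buffer.
// Returns true if the whole message fit, false if it was truncated or failed.
bool formatDebugMessage(char* buf, std::size_t len,
                        unsigned long loopDelayMs, unsigned long freeMemory) {
    assert(buf != nullptr && len > 0);
    int n = std::snprintf(buf, len, "loop_delay=%lu,free_memory=%lu",
                          loopDelayMs, freeMemory);
    // snprintf reports the length it wanted to write; n >= len means truncation.
    return n >= 0 && static_cast<std::size_t>(n) < len;
}
```

On the device, the buffer would be a global or static array and would be handed to a single Particle.publish() call, so nothing is allocated on the heap per message.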

We need the simplest and shortest test example to isolate the problem origin.

However, I need to say that I don’t believe the use of String or the missing return could cause such problems.

Thanks again!


#35

I agree with you on this

but as I said

it might also only be the combination of factors that taken by themselves wouldn’t cause the issue.

It’s like having a patient with inconclusive symptoms and going through iterations of differential diagnoses.


#36

I agree 100%! I already started making the changes.

I just meant that if using String ONCE in a program is causing the disconnection issue, that needs to be fixed by Particle or String needs to be deprecated.

If anyone has more change suggestions or a full test firmware, please send it to me so I can have the test running this morning. I’ll try to get at least another 6 devices running so I can test 2 versions of the program at the same time.


#37

I just read your test program above and it seems to create two small strings every 250ms. That is enough to cause heap fragmentation in my experience. The trick is to reuse storage and not allocate over and over again. String objects can do that, but the code is non-intuitive, generally involving clearing the string and appending to it. Char arrays are easier to understand and give good results. You can get good memory management with String objects, but the burden is on you, the programmer, to code for it.
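As an illustration of the clear-and-append pattern, here is a small sketch that uses std::string as a stand-in for the Wiring String class. That substitution is an assumption: the two classes are not identical, and how String behaves on reallocation should be checked separately. `buildMessage` is a hypothetical helper.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Hypothetical helper: rebuilds a message inside an already-reserved string
// instead of constructing a fresh temporary on every call.
void buildMessage(std::string& out, const char* label, unsigned long value) {
    assert(label != nullptr);
    char digits[24];  // stack buffer, large enough for any unsigned long
    std::snprintf(digits, sizeof(digits), "%lu", value);
    out.clear();      // length drops to 0; in practice the capacity is kept
    out.append(label);
    out.append("=");
    out.append(digits);
}
```

The idea is to call reserve() once on a long-lived string (in setup(), say) with room for the longest message, then only ever clear and append. As long as each message fits in the reserved capacity, the same storage is reused on every pass.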


#38

@bko, take a closer look. There is no delay on the loop and previousMillis is refreshed every cycle.

It only publishes IF the program spends more than 250ms between loop cycles. In this example, that only happens when the device loses Cloud connection.
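Pulled out of the sketch, the timing check works roughly like this. `LoopStallDetector` is a hypothetical name used only for illustration, and the timestamp is passed in so the logic can be tried off-device; on the device it would be millis().

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper mirroring the timing check in loop() of the test code.
class LoopStallDetector {
public:
    explicit LoopStallDetector(uint32_t thresholdMs)
        : thresholdMs_(thresholdMs), previousMs_(0) {
        assert(thresholdMs > 0);
    }

    // Call once per loop() pass with the current millis() value.
    // Returns the gap since the previous pass if it exceeded the threshold, else 0.
    uint32_t update(uint32_t nowMs) {
        uint32_t gap = nowMs - previousMs_;  // unsigned subtraction tolerates rollover
        previousMs_ = nowMs;                 // refreshed on every pass, stalled or not
        return (gap > thresholdMs_) ? gap : 0;
    }

private:
    uint32_t thresholdMs_;
    uint32_t previousMs_;
};
```

With a threshold of 250, a free-running loop() always gets 0 back. A non-zero result only shows up when something, such as a Cloud reconnect, held up the loop for longer than that.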

Thanks for checking the code! Please let me know if you have any suggestions for the test.


#39

Sorry for not reading more closely. My suggestions and advice are the same as @ScruffR. Start by removing any String objects you reasonably can and use a non-dynamic memory allocation strategy.


#40

I third the suggestion of utilizing non-dynamic memory allocation and ditching the string class. A bit more grunt and clunk, but you’ll eliminate frag potential…