Lost memory after publish?

Hi,

I have a question about hidden memory usage - following publish operations

I have a system set for threaded operation and MANUAL mode. The system runs on a 1 second / 10 state loop, with a manual 'Particle.process()' at the end of 10 seconds.

The system also prints stack and free memory every second (as I was experiencing crashes especially if the Wifi router was unplugged from the cloud - separate as yet unanswered post.....).

At startup I see

SP: 0x20007CC8 Free: 0x7BC4 5>Inputs: 00000000

At this point I publish two messages

Alnwick.MBA1,temps:0123,Housetemp:26.18,HPflowtemp:22.62,Outsidetemp:22.68,DHWTemp:22.81 send OK

Alnwick.MBA1,pins:01234567,immersionhtr:0,ValveOn4Hot:2,HotWaterPump:4,HeatPumpPump:6,HeatingPump:8,ThermistorOn:10,thermostat:12,HWdemand:14 send OK

Next FreeMem shows that 2612 bytes have been allocated to something......My guess is a couple of TCP packets to handle the publish.

SP: 0x20007CC8 Free: 0x7190 6> read IO...run programs
SP: 0x20007CC8 Free: 0x7190 7> run rules
SP: 0x20007CC8 Free: 0x7190 8> rules/diags to UDP >r
processing system...

So we go round the state loop again.....

SP: 0x20007CC8 Free: 0x7190 0>21:37:31
SP: 0x20007CC8 Free: 0x7190 1>t1=26.12,
SP: 0x20007CC8 Free: 0x7190 2>t2=22.62,
SP: 0x20007CC8 Free: 0x7190 3>t3=22.68,
SP: 0x20007CC8 Free: 0x7190 4>t4=22.81,
SP: 0x20007CC8 Free: 0x7190 5>Inputs: 00000000

At this point I publish two messages again - basically the same as before.

Alnwick.MBA1,temps:0123,Housetemp:26.12,HPflowtemp:22.62,Outsidetemp:22.68,DHWTemp:22.81 send OK

Alnwick.MBA1,pins:01234567,immersionhtr:0,ValveOn4Hot:3,HotWaterPump:5,HeatPumpPump:7,HeatingPump:8,ThermistorOn:11,thermostat:12,HWdemand:14 send OK

OK so now we have lost another 1264 bytes.
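For anyone checking the numbers, both drops follow directly from the hex Free values in the log. A minimal C++ sketch of the arithmetic (the helper name `bytesLost` and the constant names are made up for illustration; only the three readings come from the log):

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: difference between two "Free:" readings from the log.
constexpr std::uint32_t bytesLost(std::uint32_t before, std::uint32_t after) {
    return before - after;
}

// Free readings copied from the log lines above.
constexpr std::uint32_t kFreeAtStartup      = 0x7BC4; // 31684 decimal
constexpr std::uint32_t kFreeAfterFirstPub  = 0x7190; // 29072 decimal
constexpr std::uint32_t kFreeAfterSecondPub = 0x6CA0; // 27808 decimal

// Checked at compile time, so a wrong figure would fail the build.
static_assert(bytesLost(kFreeAtStartup, kFreeAfterFirstPub) == 2612,
              "drop after the first pair of publishes");
static_assert(bytesLost(kFreeAfterFirstPub, kFreeAfterSecondPub) == 1264,
              "drop after the second pair of publishes");
```

That is 3876 bytes in total across the two publish rounds.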

SP: 0x20007CC8 Free: 0x6CA0 6> read IO...run programs

So I just 'LOST'
SP: 0x20007CC8 Free: 0x6CA0 7> run rules
SP: 0x20007CC8 Free: 0x6CA0 8> rules/diags to UDP >r

Now - we continue to run at this level, so my memory blocks NEVER come back, but remain stable, continuing to publish two reports every 10 seconds.

The system now stays like this until such time as there is a cloud problem (EG pull the ethernet out of the WiFi router). Then we lose more memory which again never comes back - until my system bombs - probably when the heap trashes the stack :-O.

NB I would appreciate a response from the dev team - as I have now posted a number of reports of this type with no response :-((. I am hoping that these have been investigated and are being fixed for the next release :open_mouth: - as I can't see how pulling the cloud is my fault :-O.

The product is now rapidly becoming unusable - as I cannot deploy into a user environment where I have no control over Internet access, there is no official watchdog (another unanswered post), and the inability to reload firmware remotely - doesn't reboot afterwards - again unanswered posts (not just from me).

Maybe its me not checking the cloud for every action which needs wifi (local or cloud), but if its going to fail it really should fail gracefully ????.

Hopefully I may get a response this time - PLEASE ???.

Thanks

Graham

Not from the dev team but we already had some discussions :wink:
If you link your “unanswered” posts it might be easier for others to follow up on them reading this.

About the problem when a publish happens without Wifi/cloud there is an open issue milestoned for 0.4.9
https://github.com/spark/firmware/issues/761
Having to do so may not be desirable, but defensive programming is one of a programmer's responsibilities. So if there is no cloud/WiFi connection up and running (which can be checked), any attempt to use features that rely on it should be avoided.
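A rough sketch of that kind of guard, written so it compiles off-device. `FakeCloud` and `publishIfConnected` are hypothetical names for illustration; on a Photon the corresponding calls would be `Particle.connected()` and `Particle.publish()`. This is only one way to structure the check, not a fix for the firmware issue itself.

```cpp
#include <cassert>
#include <string>

// Stand-in for the cloud object so the sketch builds on a desktop.
// Purely illustrative; it is not part of any Particle API.
struct FakeCloud {
    bool online = false;  // what connected() will report
    int  published = 0;   // how many publish() calls actually happened

    bool connected() const { return online; }
    bool publish(const std::string& /*event*/, const std::string& /*data*/) {
        ++published;
        return true;
    }
};

// Publish only when the connection is reported as up. When it is down,
// skip the call entirely and tell the caller nothing was sent.
template <typename Cloud>
bool publishIfConnected(Cloud& cloud,
                        const std::string& event,
                        const std::string& data) {
    if (!cloud.connected()) {
        return false;
    }
    return cloud.publish(event, data);
}
```

The caller can then decide what to do with a `false` result (drop the report, or queue it in a fixed-size buffer) rather than letting the publish attempt run against a dead link.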

The delayed reboot after flash in multithreaded mode is also known and being looked into, but do you actually need to reset manually because the device definitely does not reboot, even after a few seconds, as pointed out in the thread you are referring to?
@mdma from the dev team had posted once on that thread too - but granted, the follow up is still open :wink:

When you say you can’t use AUTOMATIC anymore, what do you mean by that?

About the ever decreasing heap, have you got some code to look at? (Never mind if the code is posted in another thread; just link to it.)
It’s not uncommon that the free heap shrinks at first as objects start their duty but this should sure stabilize.
But if you e.g. use String a lot, there is the risk of continuously losing heap, as each instance allocates at least 16 bytes (possibly more with longer contents) and frees them later on (once destroyed). Since there is no heap defragmentation implemented in firmware (at least that I know of), these fragments might not be usable for subsequent objects.
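One common way to sidestep that is to format reports into a fixed buffer instead of building them from String objects. A minimal sketch, assuming the report layout shown earlier in the thread; the function name `formatTemps` and the buffer sizes are made up, and whether `%f` is supported by the C library on the device is an assumption worth checking.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <cstring>

// Format a temperature report into a caller-supplied buffer, so no heap
// allocation is involved. Returns true only if the whole message fit.
// Field names mirror the report earlier in the thread.
bool formatTemps(char* out, std::size_t outSize,
                 double house, double flow, double outside, double dhw) {
    int n = std::snprintf(
        out, outSize,
        "Housetemp:%.2f,HPflowtemp:%.2f,Outsidetemp:%.2f,DHWTemp:%.2f",
        house, flow, outside, dhw);
    // snprintf returns the length it wanted to write; anything >= outSize
    // means the output was truncated.
    return n >= 0 && static_cast<std::size_t>(n) < outSize;
}
```

A static or stack `char buf[96]` reused on every cycle keeps the heap out of the picture for this part of the code.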

System.freeMemory() returns the high-water mark of the heap in relation to the bottom of the stack - it’s the minimum amount of free memory guaranteed to be available - not necessarily the actual amount. Once blocks of memory are freed, the high-water mark does not move downwards.
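To illustrate why the reported figure can stay low even after memory is freed, here is a toy model of a high-water mark. It is not the firmware's allocator: `HeapModel` and its members are hypothetical, and it tracks byte counts only, ignoring addresses and fragmentation.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Toy model only: counts bytes, not addresses.
class HeapModel {
public:
    explicit HeapModel(std::size_t capacity) : capacity_(capacity) {}

    void allocate(std::size_t n) {
        inUse_ += n;
        highWater_ = std::max(highWater_, inUse_);  // only ever moves up
    }
    void release(std::size_t n) { inUse_ -= n; }    // high-water mark untouched

    // What a high-water-mark based report would show.
    std::size_t reportedFree() const { return capacity_ - highWater_; }
    // What is really available in this simplified model.
    std::size_t actualFree() const { return capacity_ - inUse_; }

private:
    std::size_t capacity_;
    std::size_t inUse_ = 0;
    std::size_t highWater_ = 0;
};
```

In this model, allocating 2612 bytes and releasing them again leaves `reportedFree()` 2612 bytes lower than at the start, while `actualFree()` is back to the full capacity. A figure that drops once and then holds steady is therefore consistent with memory being reused, whereas a figure that keeps falling points to a real leak or to fragmentation.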


Hi,

Thanks for responding ;-)).

Yes - I have seen this referred to elsewhere, but its not really very helpful is it ???. We really need the ability (built-in) to get heap free and even walk the heap so we can see whats happening.

Maybe a better heap manager could be used - FreeRTOS actually ships with a range of variants as I recall…well it does for the PIC series.

The memory does stabilise and runs VERY well - until a network error occurs (like unplugging the router - ie simulating an Internet dropout). Then we simply lose memory like there’s no tomorrow, until eventually the app crashes - we have noted this but not retested recently.

BR

Graham

We are currently using heap3 - malloc/free, since this is what is built into WICED, and also this is what is instantiated by default for the STM32 runtime. I don’t know of any functions for walking the heap, do they exist?

I realise that System.freeMemory() could be better, but it’s built with the tools we have available - its limitations are documented. The main reason for its existence is to help detect runaway memory leaks.

Hi,

Again thanks for responding, sorry I didn’t see your response until after responding below :-O.

I will dig out the unanswered posts after this…

So - you mention ‘milestoned for 0.4.9’. Is this documented somewhere that we mere mortals can find ???, and what happened to 0.4.8 ???.

It would be GREAT if there was a list of outstanding issues which are accepted and scheduled to be resolved :-)). I have never seen anything like this and would not know where to look :-O.

As for the reboot issue - I raised this a few months back, and as far as I am concerned - it has never been fixed. A few people have responded that its NOT an issue - but I can assure you that it IS.

My system now never reboots after a reload. It just sits there happily running its old code. Prior investigation (some time back) showed a loss of FREE memory after a reload - suggesting to me a lost incoming packet but as has been pointed out by others, FreeMemory cannot be relied upon to tell us anything that meaningful :-((.

My system is using SYSTEM_THREAD(ENABLED) and SYSTEM_MODE(MANUAL), with 'Particle.process()' being called every 10 seconds. I also check for Particle.connected() and report this to Serial1 (my debug).

NB I am using MANUAL mode as I need better control of when system processes are stealing CPU time :open_mouth: - as I have some bit-banging code to drive 1Wire temperature sensors, and was exhibiting CRC failures due to poor reading. I also turn interrupts off/on whilst reading/writing bytes and it’s now more stable but not perfect.

Tried to look at your reference to issue 761 but it didn’t really tell me a lot :-O. Seemed to be following something else ???. Sorry but I was confused by what was being said - was it a comment that users should call connected - OR that this should be built into publish ???. (NB My code checks if connected before calling publish anyway.)

I am more than willing to send some code, but as it belongs to my client, I would have to do that privately, and its not quite a big system…It also has custom hardware as I store operational details in an external (IIC) FRAM chip.

Thanks again for the response though. The biggest issues are :smile:

  1. Fail to reboot after reload of code.

  2. Problems seen after ‘Internet’ failure making it worryingly unreliable.

NB Don’t get me wrong - its BECAUSE this is such a great product that we are developing stuff for it. What does worry me is the lack of progress since 0.4.7 - even though issues are accepted and scheduled :-O.

I have a system deployed 5,000 miles away from my home base - so remote upload and reboot are ESSENTIAL to me ;-)).

BR

Graham

Hi,

As for heap walker - I guess not - I wrote my own many years back (with my own heap manager), so I will have to review the FreeRTOS code to see what they have…Problem is - how do I include access to the FreeRTOS code ???. Seems to be no doc on doing that …

Thanks again

Graham

@ScruffR linked to the issue on Github because that is where issues and milestones are tracked for public consumption
Here is the issue again

The fact that it is slated for 0.4.9 and you haven't gotten 0.4.8 means that it will not be in the next release but the one after :wink:

Here is the list of milestones (you can click around and see what's in them)

I don't believe it's necessary to call Particle.process() at all if you are using SYSTEM_THREAD(ENABLED). I certainly don't, but my loop is really short too. But again, I don't think that matters.


Hi,

Thanks for the reply. Now I am VERY worried about our ability to complete this project :-O.

There seem to be known issues, some slated for fix in 0.4.9 - with no date as yet - BUT 0.4.8 is not scheduled until OCTOBER 2016 ???. So 0.4.9 will obviously be after that date :-O.

We have considerable development investment in this Photon project, modules are being shipped by large vendors and firmware development appears to have all but ceased - with the code still at pre-release stage ???.

PLEASE tell me this is not the case :-O.

As for the process issue - I will give that a go - ie remove the call…

OK - so I removed this call and you are correct - it continues to function. It EVEN rebooted - about a MINUTE after the download had finished. So I tried this again and same ol same ol - a MINUTE after download finished the board decides to RESET.

As I recall - when using this WITHOUT threading - the reboot used to be immediately after the download completes.

NB Even my own client has said that it no longer resets after downloading new app code :-O.

So…IF this is normal behaviour then I will accept it, but with no docs to tell me that - it is extremely worrying that the IDE tells me flashing is complete, then the board just sits there for a further minute (or maybe more) before rebooting :-O.

NB Its finished Magenta flashes and simply returns to ‘breathing cyan’ with my debug code still spitting out on Serial1 during this ‘dead period’. So I have no idea what it might actually be doing :-O.

It’s just this sort of ‘effect’ which concerns me about whether the code is working correctly - as expected - or not ;-).

Thanks again

BR

Graham

@GrahamS, I haven’t flashed firmware via OTA in a while as I tend to develop locally and flash via USB. I do have a Photon running hard (RGBPongClock) with threading enabled, cloud is active and running SparkIntervalTimer with fast (under 200us) interrupts. I’ll compile and flash using CLI to see how the Photon responds during and after OTA.


Mine reboots immediately. I wonder if your thread is starving the system thread? One of your previous comments seemed to indicate that you do this intentionally to some degree when WiFi is not needed? Maybe you could back down on that a bit?

That behavior is not normal for me, but I use SYSTEM_THREAD(ENABLED) with SYSTEM_MODE(SEMI_AUTOMATIC). My guess is that it's not normal behavior for MANUAL either. I would see if the above suggestion applies and if so see if it helps.

I just stumbled across this post from @mdma:

So, if 0.5.0 is slated for end of Q1 at the latest, then 0.4.8 and 0.4.9 shouldn't be too far off :wink:

@GrahamS, as @tjp points out, disabling interrupts system wide is most likely causing some havoc with system firmware interrupts. Bit-banging 1-Wire is a pain due to the timing sensitivity and CRC errors are common. I'm not sure how often you get those errors.

In theory, it could be combined with DMA further reducing processor overhead. Maxim wrote an app-note on this.

I also recall, a member used the USART for doing 1-wire communications.


Another "background" info, which was mentioned publicly but obviously only got picked up by sad beings like myself (permanently online here :blush: ;-)), but I can't recall where it was said.
You won't need to wait for 0.4.8 to be released before 0.4.9 goes public, since 0.4.8 will only be an internal release for the Photon, but is the current RC version for the Electron. So the next public release for the Photon will be 0.4.9 (just as, for the Core, 0.3.4 was followed by 0.4.5, and the Photon skipped 0.4.2).

This might be the impression you get, since you don't publicly see the releases, but keeping an eye on the open source repo shows that a lot of things are getting fixed and get incorporated but not released due to pending tests (against regressions or adding new issues) and man-power being bound with the Electron - but effort invested there will also contribute to the Photon's stability.

Also SYSTEM_THREAD(ENABLED) is still a beta feature, so it might not be too wise to rely on it yet without a failsafe in place. It works well enough to let people "play" with it and report issues, but be prepared to see them too.
I'd second @tjp's view that your code must - at least in parts - be contributing to the excessive reboot delay, since - as said in the other thread - 10-15sec delays are not uncommon, but I never managed to get it going beyond that.

If you are so courageous to take on this beta feature, you might also like to give local building a go with the open source latest branch that already sports some of the bugfixes milestoned for the next releases.


@GrahamS; One possible remedy - or at least workaround - for your don't reboot issue might be to add some means of pushing your device into a tight Particle.process() loop - at least not doing any cycle hogging/interrupt-blocking tasks (or even System.enterSafeMode() once it works ;-)), before triggering an OTA update (e.g. via Particle.function() or Particle.subscribe()).
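A rough sketch of that "parking" idea, written so it compiles off-device. All names here (`OtaParking`, `requestPark`, `nextAction`) are hypothetical. On a Photon the flag would be set from a handler registered with Particle.function(), which takes a String rather than the `const char*` used below, and the parked branch of loop() would do nothing except call Particle.process().

```cpp
#include <cassert>

// Illustrative state only; not part of any Particle API.
struct OtaParking {
    bool parked = false;

    // Shaped like a cloud-function handler: takes a command, returns an int.
    int requestPark(const char* /*arg*/) {
        parked = true;
        return 1;
    }
};

enum class LoopAction {
    RunApplication,   // normal state machine: sensors, rules, publishes
    ServiceCloudOnly  // parked: no bit-banging, no interrupt masking
};

// Decide what the next pass through loop() should do.
LoopAction nextAction(const OtaParking& ota) {
    return ota.parked ? LoopAction::ServiceCloudOnly
                      : LoopAction::RunApplication;
}
```

Once parked, the device stays that way until the update resets it, so a real implementation would probably also want a timeout that un-parks it if no update arrives.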


Hi ALL…

Sorry if I have stirred up a hornets nest here :-O.

I will try to reply to everyone’s points in this one reply …

peekay123: yup I also use USB for some use - but thats not so easy when your Photon is 5,000 miles away ;-)) - ie IT is in Florida - when I am in the UK. Not seen any USB cables that long ;-)). I also have a client installing trial units in his clients homes :-O. So we NEED OTA firmware - which is one of the biggest appeals for the Photon ;-).

tjp: as for starving the System Thread - the docs I have seen say that it uses pre-emptive threading on FreeRTOS ??? - this means that I can’t starve it :-O. My code is in a 1 second loop, with a state machine counting 1-10 and deciding what to do on each second tick. Most of these take very little time and mainly end up in a simple delay(xxx). There are 4 separate slots (one for each 1Wire sensor) as these can take up to 750 mS to do a conversion (during which time my code simply sits in a delay anyway !!). As for interrupts, I ONLY disable for the period of a single 1wire reset, readbyte or writebyte - for just that reason ;-)).

As for version 0.5.0 - thats simply confusing :open_mouth: - the previous doc definitely says October 3rd 2016 for 0.4.8 - so this must be a typo then :-O.

peekay123: see comments above on disabling interrupts. I am used to real-time programming so my interrupts are as tight as they can be ;-)). If I can’t use bitbanged IO - then I might as well add a Maxim IIC 1Wire controller (which I have done on most of my PIC-based boards for many years now…)

ScruffR: the whole .048, .049, .050 issue is a mix-up ;-)). My BIG worry was that date October 2016 for the ‘next’ (ie 048) release :open_mouth: - that definitely sounded a panic alarm !!!. Thanks for allaying those fears!!!
As for a ‘failsafe’ in place - I DID want to enable the hardware watchdog (as that would be the best failsafe !!), but that post never did get answered (months ago now - I’ll try to find it again) :-O.
OOOHHHH local building of the whole source code - I am NOT that brave ;-)). I don’t even have the compiler installed locally and rely on the Atom IDE or the CLI to build on the server :-O. So this would be a whole new ‘can of worms’ :wink:

As for system_threading - I ONLY enabled that as I have no control over when the system does its processing. Maybe I should disable that and run Particle.process() - which I did try and it made no noticeable difference. Good point about its being Beta though ;-)) Maybe I need to review that decision…
It all comes down to these damned 1wire sensors !!! - maybe safest bet is to redo the pcb and add a Maxim controller to ‘talk’ to the sensors - as I know this works with my PIC-based boards…

Thanks again guys for all your support…

BR

Graham


OOPS.

Forgot to answer the last point, re local control of OTA timing. Problem is that when the product is released and in the field, my own client will have control of updates etc. so it needs to be SIMPLE ;-).

If I KNOW that the system will reboot after a minute (or other period) I can accept that. Its just that it APPEARS flaky - so I have limited confidence. A minute delay makes no difference at all - SO LONG AS it always works. Florida from the UK is a bit far to press a reset button ;-). Hence why I really wanted to enable the hardware watchdog, then simply refresh this in my code - as I would any other embedded system…Might be a sledgehammer but if it goes wrong it WILL recover…
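The refresh pattern itself can be sketched in software. This is only a model of the idea: `WatchdogModel` is a hypothetical class, and unlike a real hardware watchdog it cannot rescue a loop that has stopped running altogether, because the check lives in the same code it is guarding.

```cpp
#include <cassert>
#include <cstdint>

// Software model of "refresh it or get reset". Timestamps are passed in
// (e.g. from millis() on a device) so the logic can be tested off-device.
class WatchdogModel {
public:
    WatchdogModel(std::uint32_t timeoutMs, std::uint32_t nowMs)
        : timeoutMs_(timeoutMs), lastKickMs_(nowMs) {}

    // Call from the healthy path of the main loop.
    void kick(std::uint32_t nowMs) { lastKickMs_ = nowMs; }

    // True once the loop has gone too long without a kick. Unsigned
    // subtraction keeps this correct when the millisecond counter wraps.
    bool expired(std::uint32_t nowMs) const {
        return static_cast<std::uint32_t>(nowMs - lastKickMs_) >= timeoutMs_;
    }

private:
    std::uint32_t timeoutMs_;
    std::uint32_t lastKickMs_;
};
```

On a device, whatever notices `expired()` (a timer, say) would then trigger a reset such as System.reset(); where that check runs is the design decision that determines how much of a hang it can actually catch.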

G

Maybe not so much a typo, but when the milestones were set, they did not have the information they have now on their timelines, they simply have not been updated. They are REALLY busy releasing the Electron as I'm sure you can relate. :wink:

@GrahamS, I feel with you, pressing time lines can easily make you feel uneasy :wink:

About Oct 2016, I think this was just a place holder for the next release once 0.4.7 got released (round about Oct 2015).

About the HW watchdog, I think it once was activated, but caused too many other issues (as I recall with wake after sleep and things like that) - but I’m not sure if I didn’t reply on that thread back then already (although just being a customer and not a Particloyee ;-))

But for the sake of your product, if you don’t need SYSTEM_THREAD(ENABLED) for things only possible with it (as there are a few I can think of), it might be good to go with the conservative “I know what I do when I do it myself” way.

@GrahamS, also, instead of delay(xxx), you may want to run a millis() delay and run Particle.process() while waiting or simply add another state to your FSM that does that.
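Something along these lines, as a sketch. `intervalElapsed` and `WaitState` are made-up names; `millis()` and `Particle.process()` are the only device calls involved and appear in comments so the snippet also builds on a desktop.

```cpp
#include <cassert>
#include <cstdint>

// Wraparound-safe "has intervalMs passed since startMs?" check.
// Unsigned subtraction gives the right answer across a millis() rollover.
inline bool intervalElapsed(std::uint32_t nowMs,
                            std::uint32_t startMs,
                            std::uint32_t intervalMs) {
    return static_cast<std::uint32_t>(nowMs - startMs) >= intervalMs;
}

// A wait expressed as FSM state: remember when it started, poll until done.
struct WaitState {
    std::uint32_t startMs;
    std::uint32_t intervalMs;

    bool done(std::uint32_t nowMs) const {
        return intervalElapsed(nowMs, startMs, intervalMs);
    }
};

// On a device, the 750 ms conversion wait could then look like:
//
//   WaitState wait{millis(), 750};
//   while (!wait.done(millis())) {
//       Particle.process();   // keep the cloud serviced while waiting
//   }
```

As an extra FSM state, the same `done()` check would simply be polled once per pass through loop() instead of spinning in a while.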

Playing with global interrupts is always a gamble but in your case shouldn’t be causing the havoc you are observing. I do like the maxim chip idea also.

One thing that (exists but not released) will be coming are system events where you will be able to know when an OTA has been started so you can “park” your software accordingly. If there is one thing that Particle is not short on is ideas. It’s just the people to implement them are in short supply!


I thought, Particle.process() was a noop in 0.4.7. I just told him it wasn’t necessary to call it. lol