SYSTEM_THREAD - full device hang

I’m making a device that controls a gas valve and keeps the gas ignited when it needs to be. This works great, but I need to be certain that the code is always running, or something could go horribly wrong. My problem is that the Photon (or, in my case, a P1 on a custom PCB) blocks all user code if the network connection is lost.

I have looked into several ways of solving this. SYSTEM_MODE(MANUAL) will not do the trick since it still blocks user code when reconnecting. The trick is apparently to use SYSTEM_THREAD(ENABLED) like @armor suggests in this thread.

I do however have a problem - if I’m using SYSTEM_MODE(SEMI_AUTOMATIC) with SYSTEM_THREAD(ENABLED), I experience full-on hardware crashes at random intervals. I cannot reproduce them on demand and they happen very rarely (3 times in 8 hours today), but when one happens it can (in my case) burn someone’s house down. If I turn off SYSTEM_THREAD, there is never a crash. I do not use any of the “Synchronous system functions”, so that should not be an issue. I have a vague idea that, since I use I2C and SPI extensively, something related to them may be causing the problem, but there is no documentation indicating that.

Is there a guide for debugging hardware hangs? (as in full freeze)
Are there any known bugs related to this? (Like @SadE54 mentions here.)
Do I need to use SINGLE_THREADED_BLOCK or ATOMIC_BLOCK when writing data (to SPI as in the RA8875 based LCD screen or one of the 6 I2C devices on the board)?

You shouldn't rely on your code alone in a situation that is this potentially dangerous. At the very least, you should have an external watchdog; something that will shut down the valve if your code stops running, or there's a complete failure of the device.

I agree @Ric. Implementing a Watchdog is something I have to do. The STM32F2 has a separate circuit on the chip for exactly that. It is completely unaffected by the main MCU, but right now I’m trying to solve this in a better manner than restarting the device.

To me, this seems to be either a bug in the RTOS or me doing something I shouldn’t (but I don’t know what, since I don’t know it…). Maybe @mdma or someone from the Particle firmware team has an idea what’s happening here? It’s not happening unless SYSTEM_THREAD is enabled.

Which version of system firmware are you using?

I would not rely on any single program running in a safety-critical application. Either more than one CPU or a hardware-based safety circuit would be better. The second CPU can be very simple, with a program like "if I see that the gas valve is on for more than x time without a sign of life from the main CPU, shut the valve and reset the main CPU." You can't get that with the built-in watchdog.
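As a rough illustration of the second-CPU rule quoted above, here is a minimal sketch in portable C++. Everything in it is an assumption made for illustration: the `GasSupervisor` class, the `SupervisorOutputs` struct, the heartbeat mechanism and the timeout value are hypothetical names, not part of any Particle or STM API. Time is passed in as a millisecond counter so the logic can be checked off-device.

```cpp
#include <cstdint>

// Hypothetical decision outputs of the supervisor MCU (illustrative only).
struct SupervisorOutputs {
    bool closeValve;  // drive the valve to its safe (closed) state
    bool resetMain;   // pulse the main CPU's reset line
};

// Hypothetical supervisor: trips when the valve is open and the main CPU
// has shown no sign of life for longer than timeoutMs.
class GasSupervisor {
public:
    explicit GasSupervisor(std::uint32_t timeoutMs) : timeoutMs_(timeoutMs) {}

    // Call whenever a heartbeat from the main CPU is observed.
    void onHeartbeat(std::uint32_t nowMs) { lastHeartbeatMs_ = nowMs; }

    // Call periodically from the supervisor's own main loop.
    SupervisorOutputs poll(std::uint32_t nowMs, bool valveOpen) const {
        // Unsigned subtraction stays correct when the counter wraps around.
        const bool stale = (nowMs - lastHeartbeatMs_) > timeoutMs_;
        const bool trip = valveOpen && stale;
        return SupervisorOutputs{trip, trip};
    }

private:
    std::uint32_t timeoutMs_;
    std::uint32_t lastHeartbeatMs_ = 0;
};
```

On real hardware the supervisor would read the valve state and heartbeat from pins and map `closeValve` / `resetMain` to outputs; those details depend on the board and are left out here.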

@mdma I’m using 0.6.2

@bko according to the STM datasheet, the watchdog is implemented with exactly this in mind. It has its own resonator and will work independently of the main CPU as long as it has power. I do agree that a second MCU is a good idea, though. It can probe more devices than just the fire, so yeah. It’s a question of price, but for that purpose I could probably get away with one of those 8-bit STMs that are just 50 cents each.

@jenschr, I think the point @bko has in mind is that the Particle firmware deactivates/overrides the STM32F HW watchdog for historical reasons.
It would be good to have it back (which some of us have requested frequently), but we don’t know if or when.

@ScruffR Oh… I didn’t know that. That’s rather important to know. Thanks! I will implement a second MCU to ensure that we have a Watchdog running.

I am however not any closer to resolving my core problem. No ideas as to what causes problems specific to SYSTEM_THREAD?

I would try to hook up USB serial and dump data, then you will know.

I would try adding the atomic macros around I2C and SPI calls and see if it helps.

I would also look at your code with a very critical eye for dynamic memory allocation since it is very easy to write code that has memory leaks or heap fragmentation, particularly when using the Arduino String class.

@jenschr

A couple of points to look for in your code (just reinforcing advice originally provided by @bko and @ScruffR to me).

I use SYSTEM_THREAD(ENABLED); and SYSTEM_MODE(SEMI_AUTOMATIC);

  1. I am very careful to not make multiple Particle.connect() calls. Multiple calls use up free memory and eventually lead to a hang! A simple flag to confirm that this call has been made once seems to work.

  2. Interrupt Service Routines - do you have any? If so, follow the advice in the documentation: keep them short and do not make any calls to SPI or I2C within them. Bus calls inside an ISR are a likely cause of crashes. Ensure that you only set a flag/state value in the ISR and then deal with the flag or state change in the main loop.
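Both points can be sketched in a few lines. This is a portable C++ illustration, not device code: `requestCloudConnect()` stands in for the real `Particle.connect()` call, `serviceSensor()` stands in for whatever I2C/SPI transaction the interrupt should trigger, and all names and counters are hypothetical so the pattern can be checked off-device.

```cpp
#include <atomic>

// Stand-in for Particle.connect(); counts calls so the guard can be verified.
static int connectCalls = 0;
void requestCloudConnect() { ++connectCalls; }

// Point 1: remember that the connect request has already been made.
static bool connectRequested = false;
void ensureConnectRequestedOnce() {
    if (!connectRequested) {
        connectRequested = true;
        requestCloudConnect();
    }
}

// Point 2: the ISR only records that something happened.
static std::atomic<bool> sensorEvent{false};
void onSensorInterrupt() {
    sensorEvent.store(true);  // no SPI or I2C access in interrupt context
}

// Stand-in for the real bus transaction (I2C/SPI read, display update, ...).
static int busTransactions = 0;
void serviceSensor() { ++busTransactions; }

// One pass of the main loop: connect at most once, then handle any
// pending event outside interrupt context.
void loopOnce() {
    ensureConnectRequestedOnce();
    if (sensorEvent.exchange(false)) {  // read and clear the flag in one step
        serviceSensor();
    }
}
```

Several interrupts arriving between two loop passes collapse into a single `serviceSensor()` call here; if every event matters, a counter would be needed instead of a flag. A real application would also have to decide when to clear `connectRequested` again, for example after a deliberate disconnect.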

Hope this helps.

Aside from your issue with SYSTEM_THREAD, I want to advise you to consider your design philosophy here. I had to wait until the weekend to post, and @Ric already hinted at it:

The assumption that the code can be 100% is just NOT achievable. The issue with SYSTEM_THREAD is an indication the code is NOT 100%, whether it is your code, the Particle code, or the combination of the two.

The statement "or something could go horribly wrong" leads me to think that you are designing a Full Authority Digital Electronic Control (FADEC). If that is the case, then most likely there are standards in your country based on UL.com for gas control that you would be required to follow.

One suggestion for dealing with safety is to use a gas controller (gas valve plus safety systems) that is already appropriately certified to handle the FADEC part. You can then use your device as a secondary control to signal the demand to the gas controller. This might actually be your case (and I hope so), but I recommend your safety considerations take this into account.

In the case that your device is a secondary control, the risk for your device drops from 100% down to High Reliability, because you want a good product after all.

If you do need everything at 100%, then you probably need the software (yours and Particle's) designed to a standard such as RTCA/DO-178 Design Assurance Level (DAL) "A", or similar. Likewise the hardware standard is RTCA/DO-160 (again, DAL A) and for certain embedded devices RTCA/DO-254 (DAL A) might apply. Even then, that might only get you to 99.999% reliable code. However, I really doubt that Particle could certify to standards like that, at least not without a huge investment.

I hope this contributes to a safe and successful design.

I think you have hit the nail on the head with your reply. The only thing that frightens me is the fact that this wasn’t brought up by Particle themselves. I’m pretty sure they would have a few clauses regarding prohibited uses and liability.

I think that has to do with the Particle philosophy. My understanding is they want people to use their devices in products, but this stuff is beyond their scope.
(Also, I noticed they are not as active on the forums lately, so I assume they are doing a big sprint to bring something new into the Particle ecosystem.)

I was going to say that Particle does have some information on Certification.
However, when I took a look to confirm, the documentation does not deal with Certification and limitations in general, but focuses only on the wireless (RF) Certification.

https://docs.particle.io/guide/how-to-build-a-product/certification/

So, I think Particle intentionally does not prohibit anything, but expects the end product developer to do the certifications that are applicable to the end product. Realistically, Particle has no idea what the end product might be; that is left to the imagination of the end product designer. However, I was surprised that their information on liability appeared to deal mainly with using their web services, and not with the use of the hardware (unless I did not find it).

I agree it would be useful for Particle to have a summary explaining their position on this subject, but I have no idea what their legal advice was on this.

I’m not sure they should have to make any statement regarding use of their hardware or API. It’s based on an STM32F2 microcontroller that is made for any kind of product. The API is based on the Cypress WICED SDK, an industry standard used in all sorts of applications.

There is no need to worry regarding my application. My device will be certified and tested by two certification agencies for all applicable EU directives (electricity & gas) before it enters the market. Anything else would be illegal and it would hurt our business. I completely agree that a FADEC design of such a system isn’t a good approach. Having a separate controller for gas/ignition is a very good idea, but it will go through certification anyway, so it’s better (and cheaper per product) for us to design our own.

Certification can be pretty complex, and often times the whole product is greater than its parts.

Yes, the STM32F2 is made for many kinds of products. And the Cypress WICED SDK is used in all sorts of applications.

I am sure many products have received certifications using these components.

However, I can think of a certain certification that has an impact test that a product must survive, but not necessarily continue to operate during the impact. The impact forces are 52 G’s in the test. A Particle Photon with headers or an Electron (which also has headers) is not likely to pass this impact test as the headers are very likely to break during the test. However, if the product is implemented on a single circuit board, then maybe it could pass the test.

It is just an example that shows everything has to be considered for certification. In general, I recommend knowing the certification requirements before designing a product so that all the requirements can be considered during design. The worst gotcha in certification is the requirement that was forgotten, and then only discovered in the final testing.

All that to say, certification is hard work; it will be difficult; it will take perseverance, but it can be done.
