Lost server address (particle keys server) - corrupted flash?

Hello,

We have around 500 Photons deployed in our products. The vast majority work flawlessly, many of them for 3+ years with multiple OTA updates. Occasionally, though, there is a device whose communication with our cloud keeps failing, and in that case we have to replace the product. I am currently looking for ways to prevent this from happening…

Latest case: the customer reported the Photon flashing cyan with one red flash, and there was no communication from that particular Photon with our cloud (I can see the unsuccessful connections in the log). We replaced the product with a new one and got the faulty one back. The faulty Photon tries to contact Particle's cloud instead of ours: it has lost the address and falls back to Particle's default.

The new product arrived at the customer and shows exactly the same behaviour; the customer provided a DNS log from his router showing the second Photon again trying to reach Particle's servers. We shipped a third product and are waiting for the second faulty unit to come back for analysis.

The only common element on the customer's side is the power adaptor (he even switched between several Wi-Fi networks), so I suspect it might be the problem…
@Dave suggested that “When you undervolt the photon you can run the risk of corrupting the flash” - does that also apply to the place in memory where the server key and server address are saved?

We already experienced loss of server keys in the past - the Photon was trying to connect to the cloud (I saw it in the log). I will soon test 0.7.0, which should recover the public key automatically. However, I don't know how to prevent loss of the cloud server address…

Maybe @justicefreed_amper has experienced something similar?

We are using 0.6.3 on all Photons, and our cloud runs the Brewskey fork.

Output from @rickkas7's clouddebug firmware:

0000304420 WARN: Read Server Address = type:255,defaulting to device.spark.io
0000304437 INFO: Resolved host device.spark.io to 34.237.141.248
0000304567 INFO: connected to cloud 34.237.141.248:5683
0000304567 INFO: Cloud socket connected
0000304567 INFO: Starting handshake: presense_announce=1
0000304567 INFO: Started: Receive nonce
0000304693 INFO: Encrypting handshake nonce
0000304729 INFO: Sending encrypted nonce
0000304730 INFO: Receive key
0000304866 ERROR: Unable to receive key -19
0000304866 WARN: Cloud handshake failed, code=-19
0000305117 INFO: Cloud: disconnecting
0000305117 INFO: Cloud: disconnected
0000336866 INFO: Cloud: connecting
0000336866 WARN: Read Server Address = type:255,defaulting to device.spark.io
0000336932 INFO: Resolved host device.spark.io to 54.159.170.67
0000337057 INFO: connected to cloud 54.159.170.67:5683
0000337057 INFO: Cloud socket connected
0000337057 INFO: Starting handshake: presense_announce=1
0000337058 INFO: Started: Receive nonce
0000337250 INFO: Encrypting handshake nonce
0000337287 INFO: Sending encrypted nonce
0000337287 INFO: Receive key
0000337414 ERROR: Unable to receive key -19
0000337414 WARN: Cloud handshake failed, code=-19
0000337665 INFO: Cloud: disconnecting
0000337665 INFO: Cloud: disconnected

Thanks for your replies…

However, I don't know how to prevent loss of the cloud server address…

Have you ever actually verified that you have lost the cloud server address? The problem with the keys is largely that they get regenerated to a new (and unregistered/unauthorized) value. I don’t believe the same is true for the server address. When you were looking at the DNS log, did you see the correct server location request?

Edit #1: You can very easily back this up or restore it from firmware, even from your application firmware without needing a system firmware setting for it. It's the same process as with the keys.

We already experienced loss of server keys in the past

So the solution to this problem depends largely on what other services you are using and how you are using device resources. There are two high-level paths you can take:

  1. Maintain a local backup of both the device key and server key (or just the device key on v0.7.0+). You can do this by:
  • Using retained memory (if VBAT is powered properly, so the data survives a power loss)
  • Using EEPROM (not the best, since it is on the same storage medium as the keys themselves, so it doesn't cover all edge cases)
  • Using external memory (an SD card, if your board has one): the best solution if available
  • Using program flash for the server key (if below v0.7.0)
  2. Maintain a local backup of the server key (if below v0.7.0), and upload your new device key if you detect a connection issue. (This is what I do.)
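Whichever medium you choose for option 1, it helps to be able to tell whether the backup itself survived. A minimal sketch, assuming a hypothetical KeyBackup record (a marker word, a CRC-32 and a raw copy of the key block); the names, the marker value and the 768-byte size are illustrative assumptions, not anything defined by Particle's firmware:

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical backup record for retained memory or EEPROM.
// kKeyBlockSize is an assumption: size it to the block you actually back up.
constexpr std::size_t kKeyBlockSize = 768;
constexpr uint32_t kBackupMagic = 0x4B455942;  // arbitrary marker ("KEYB")

struct KeyBackup {
    uint32_t magic;
    uint32_t crc;
    std::array<uint8_t, kKeyBlockSize> data;
};

// Standard CRC-32 (reflected, polynomial 0xEDB88320), bitwise to keep code small.
uint32_t crc32(const uint8_t* buf, std::size_t len) {
    uint32_t crc = 0xFFFFFFFFu;
    for (std::size_t i = 0; i < len; ++i) {
        crc ^= buf[i];
        for (int bit = 0; bit < 8; ++bit) {
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
        }
    }
    return ~crc;
}

// Fill a backup record from a known-good copy of the key block.
void makeBackup(KeyBackup& out, const uint8_t* block) {
    out.magic = kBackupMagic;
    std::memcpy(out.data.data(), block, kKeyBlockSize);
    out.crc = crc32(out.data.data(), kKeyBlockSize);
}

// Trust the backup only if the marker is present and the CRC still matches;
// an uninitialised or half-written record fails this check.
bool backupIsValid(const KeyBackup& b) {
    return b.magic == kBackupMagic && b.crc == crc32(b.data.data(), kKeyBlockSize);
}
```

The idea is to call makeBackup() only while the device is known to connect fine, and to restore from the record only when backupIsValid() returns true, so a half-written backup is never copied over the keys.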

The second approach requires some kind of additional connection besides the Particle cloud. In my case, I have an MQTT-TLS connection to AWS-IOT. You just need to be able to send outbound packets of up to around 1800 bytes, IIRC.

I count the number of consecutive failed cloud connection attempts, and if it exceeds my threshold (2 in my use case), I read the current key from memory and send it to AWS-IOT.
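The retry counting described above can be sketched as a small state machine. CloudFailureTracker is a hypothetical name, and the threshold of 2 is simply the value from this post; this version reports once per failure streak so the key is not re-sent on every failed attempt:

```cpp
#include <cstdint>

// Hypothetical helper that tracks consecutive failed cloud connection attempts.
class CloudFailureTracker {
public:
    explicit CloudFailureTracker(uint8_t threshold) : threshold_(threshold) {}

    // Call after each connection attempt. Returns true once per failure streak,
    // on the attempt that pushes the count past the threshold.
    bool recordAttempt(bool connected) {
        if (connected) {
            failures_ = 0;
            reported_ = false;
            return false;
        }
        if (failures_ < UINT8_MAX) ++failures_;
        if (failures_ > threshold_ && !reported_) {
            reported_ = true;
            return true;
        }
        return false;
    }

    uint8_t failures() const { return failures_; }

private:
    uint8_t threshold_;
    uint8_t failures_ = 0;
    bool reported_ = false;
};
```

You would feed it the outcome of each connection attempt from your own connection-handling code, and read and upload the key when recordAttempt() returns true.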

In AWS-IOT, I have recreated the Particle CLI in a Node.js Lambda function, with some changes to make it work there and to accept a payload containing the new key. The function formats the key, generates the matching public key, and then registers that key for the device in the Particle Cloud.

You could use the API directly with OpenSSL, but I found it faster just to adapt the CLI code since it already had all the formatting and error handling.

“When you undervolt the photon you can run the risk of corrupting the flash” - does that also apply to the place in memory where the server key and server address are saved?

In my experience, the devices with flash corruption issues are almost exclusively the ones plugged into outlets that are switched on and off together with the equipment they are measuring. For me at least, the cause seems to be power being cut suddenly while the device is performing some critical flash operation. I am on firmware v0.6.4; in theory some preventative measures were added in later releases, but I don't want to take the RAM usage hit of v0.7.0+.

If you want me to show you how to do any of the above, just reply with a bit more detail on your constraints, and I can help you adapt the work @rickkas7 did for his device key backup library. (I adapted it for v0.6.4 and added a server key backup; instead of using an SD card, I upload the key via MQTT.)

Well, ideally you would have a supercap or a backup battery so that power doesn't fail in an uncontrolled way, but post-facto hardware design is always so easy. Avoiding poor adapters may help. However, if “pulling the plug” rather than an orderly shutdown (assuming one is implemented) is the reality, then a better adapter won't help. This is where @justicefreed_amper's approach comes in.


We use high-quality power adaptors rated for 2-4x more current than the product can actually draw (the product draws 0.5-1 A; we use a 2 A adaptor).

Good point - what is the correct/safe way to shut down the Photon? Currently the 12V power adaptor feeds a 7805 voltage regulator directly, with smoothing/filtering caps and overvoltage and overcurrent protection. The Photon itself has another set of caps right at the VIN pin to supply enough power during current peaks (starting DHCP etc.).

We could add hardware detection of the supply voltage: when it drops below a certain level (meaning the power adaptor was disconnected), there might still be enough energy in the caps to put the Photon into deep sleep mode. Would that help prevent flash corruption?
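Whether deep sleep actually protects the flash isn't settled in this thread; what a voltage detector can plausibly buy you is a short window in which to avoid starting any new flash write. A rough way to size that window for the 7805-based supply, as a sketch assuming the regulator's input current is about equal to its load current (quiescent current ignored), that regulation is lost at roughly 7 V (5 V output plus the 7805's typical ~2 V dropout), and ignoring the extra caps at the Photon's VIN pin:

```cpp
// Rough hold-up time of the 12 V-side capacitors feeding a linear regulator.
// Linear regulator => roughly constant input current, so t = C * dV / I.
// All values passed in are assumptions to be replaced with measured ones.
double holdUpSeconds(double capacitanceF, double vDetect, double vMin, double loadA) {
    if (vDetect <= vMin || loadA <= 0.0) return 0.0;
    return capacitanceF * (vDetect - vMin) / loadA;
}
```

For example, a hypothetical 1000 µF on the 12 V side with a 0.5 A load gives holdUpSeconds(1000e-6, 12.0, 7.0, 0.5) of about 0.01 s, i.e. roughly 10 ms (about 5 ms at 1 A), so the detection threshold and the capacitor values decide whether there is time to react at all.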

Thanks!

Hi @justicefreed_amper, thank you so much for taking the time to write that reply! It's really appreciated!

Lost server address:
I can confirm the Photon was trying to connect back to Particle's servers. The router was resolving its DNS requests to several different AWS addresses. The default cloud host (device.spark.io, according to the clouddebug log) resolves to changing AWS addresses, and the requests look exactly like that…

If you look at the log I posted (from @rickkas7's clouddebug binary), there is this line:

0000336866 WARN: Read Server Address = type:255,defaulting to device.spark.io

It seems the device did read the server address, but the stored value is corrupted, so it falls back to the default one. I need to run the same binary on a working Photon to see what the correct output looks like…
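To check what a device actually has stored, the address record can be decoded directly. A sketch, assuming the record at offset 384 of the server public key block is [type][length][payload], with type 0 for a 4-byte IPv4 address and type 1 for a hostname; that is my reading of how the Particle CLI writes it, so verify it against the CLI and dct.h before relying on it. Note that 255 (0xFF) is also the value of erased flash, which would fit the type:255 in the log:

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Assumed layout (verify against the Particle CLI / dct.h): the server address
// record sits at offset 384 of the server public key block as
// [type][length][payload], type 0 = IPv4 (4 bytes), type 1 = hostname.
constexpr std::size_t kServerAddressOffset = 384;

enum class AddressKind { Ip, Hostname, Invalid };

struct ServerAddress {
    AddressKind kind = AddressKind::Invalid;
    std::string text;  // dotted IPv4 address or hostname
};

ServerAddress parseServerAddress(const uint8_t* block, std::size_t blockSize) {
    ServerAddress result;
    if (blockSize < kServerAddressOffset + 2) return result;

    const uint8_t type = block[kServerAddressOffset];
    const std::size_t len = block[kServerAddressOffset + 1];
    const uint8_t* payload = block + kServerAddressOffset + 2;
    if (kServerAddressOffset + 2 + len > blockSize) return result;

    if (type == 0 && len == 4) {
        result.kind = AddressKind::Ip;
        for (std::size_t i = 0; i < 4; ++i) {
            if (i > 0) result.text += '.';
            result.text += std::to_string(static_cast<unsigned>(payload[i]));
        }
    } else if (type == 1 && len > 0) {
        result.kind = AddressKind::Hostname;
        result.text.assign(reinterpret_cast<const char*>(payload), len);
    }
    // Any other type, including 0xFF (the erased-flash value), is reported as
    // Invalid; the firmware would then fall back to its default host.
    return result;
}
```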

I have seen a few Photons regenerate their keys so that they no longer matched what our cloud stores, and I can distinguish that situation from a lost server address: a Photon with different keys still appears in our cloud's log, and I can see the cloud throwing errors about a wrong device public key…
We run our own cloud server (Brewskey fork); this is the benefit of actually seeing its output.

I will try to save the information in program flash, since we don't have additional memory or an SD card I could use. I also need to stay compatible with the products already deployed.

I see how I can change the keys (SYSTEM_CONFIG_DEVICE_KEY), but I don't see a configuration option for the server address. Am I missing something?

I really like your AWS-IOT solution; being able to use the CLI against a device remotely is a dream…

If you are using a custom server, you'll need to restore the server public key and server address (as set by particle keys server) manually, as the automatic restore in 0.7.0 can only restore the official Particle one. Both are stored in the server public key block, with the address/DNS hostname at offset 384.

Here’s the code from the CLI for how to generate the block of data:

As the server public key and address are the same for all of your devices, as long as your code does not fill all of the user flash space, you can store them as an array of bytes in your code; you do not need separate flash, EEPROM, etc. to store the backup server public key and address. It should only take 384 bytes for the server public key and a little more for the address/DNS hostname.
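Building the expected block from a compiled-in key and your hostname might look like the sketch below. It assumes a 768-byte server key block, the DER public key at offset 0, unused bytes left at 0xFF, and a hypothetical [type][length][hostname] record at offset 384 with type 1 meaning "hostname". The real CLI output may differ (for example if a port is also written), so compare the result against a dump from a known-good Photon, or simply compile that dump in verbatim:

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <string>

// Assumed sizes: 768-byte server public key block, address record at offset 384.
constexpr std::size_t kServerKeyBlockSize = 768;
constexpr std::size_t kAddressOffset = 384;

using ServerKeyBlock = std::array<uint8_t, kServerKeyBlockSize>;

// Build the block you expect a device to hold for your own server.
// Unused bytes stay at 0xFF (the erased-flash value). Returns false if the
// inputs don't fit the assumed layout.
bool buildServerKeyBlock(const uint8_t* derKey, std::size_t derLen,
                         const std::string& hostname, ServerKeyBlock& out) {
    if (derLen > kAddressOffset || hostname.empty() || hostname.size() > 255 ||
        kAddressOffset + 2 + hostname.size() > kServerKeyBlockSize) {
        return false;
    }
    out.fill(0xFF);
    for (std::size_t i = 0; i < derLen; ++i) out[i] = derKey[i];
    out[kAddressOffset] = 1;  // assumed type code for "hostname"
    out[kAddressOffset + 1] = static_cast<uint8_t>(hostname.size());
    for (std::size_t i = 0; i < hostname.size(); ++i) {
        out[kAddressOffset + 2 + i] = static_cast<uint8_t>(hostname[i]);
    }
    return true;
}
```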


Yeah, unfortunately you will have to go beyond the official documentation, because System.set only works for a couple of things. Also, if you ever made an Electron port, you wouldn't be able to use System.set there either. You will need to write to those blocks of memory fairly directly, combining what rickkas7 posted above with his example code for doing backups here.

If you need help adapting his code to v0.6.4, let me know what you are trying to do and I can provide a sample implementation. Ultimately, everything comes down to the system firmware header dct.h.
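Once you have a known-good copy, the restore step reduces to compare-then-write. A sketch with the storage access injected through a hypothetical DctHooks struct, so it can be tested off-device; on a Photon the two hooks would wrap the DCT read/write functions from dct.h, whose exact names, offsets and sizes you should take from that header for your firmware version:

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <functional>

// Hypothetical storage hooks; on the device these would wrap the DCT
// read/write functions declared in the system firmware's dct.h.
struct DctHooks {
    std::function<void(std::size_t offset, uint8_t* dst, std::size_t len)> read;
    std::function<void(std::size_t offset, const uint8_t* src, std::size_t len)> write;
};

// Compare the stored block with the known-good copy and rewrite it only when
// it differs. `scratch` must hold `len` bytes. Returns true if it restored.
bool restoreIfCorrupted(const DctHooks& dct, std::size_t offset,
                        const uint8_t* expected, std::size_t len, uint8_t* scratch) {
    dct.read(offset, scratch, len);
    if (std::memcmp(scratch, expected, len) == 0) return false;
    dct.write(offset, expected, len);
    return true;
}
```

Writing only on a mismatch keeps healthy devices from touching flash at all. The restore is still a flash write with the same power-loss risk discussed above, so running it once at startup, after power has settled, seems like the safer place for it.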