Long Range IoT Networks - Chapter 2 | Particle + LoRa Better together

As @Rftop mentioned, in one word, stability.

I have recently developed a non-mesh LoRa-based system, and agree with the LoRaWAN vs. LoRa considerations. However, the following relates to a non-LoRa, battery-powered mesh network.

Main issues: service disruptions from dynamically changing mesh networks that need to sync and settle as nodes are added, removed, drop out, or move. A significant battery power overhead for the mesh. Many interdependent situation- and application-specific parameters to tune. And meaningful error logging depends on the very network integrity that was lost.

I got the system running stable with a specific set of parameters for all situations within that case … well, as far as I could know and demonstrate. It was challenging to keep the (perceived) non deterministic nature of the beast away from users, and the system was never launched.

It looks like you have a better case. Fewer stationary nodes with big batteries + solar, in rural areas with less disturbance on the frequency, and on reachable sites. I guess it would be ok for the system to regularly “take five” to re-group?

If I were to work with mesh again, I would go far to put some kind of lightweight offline log with timestamps on each node, to remotely retrieve and spot what happened up to the point where the system decided to re-group or had to be reset.

Yes, generally speaking, even if a node dropped several messages or even dropped offline for, say, 1 hour, it’s not the end of the world, as long as it’s not all the time and recovery is “hands off”. And as you said, it can’t take custom parameterization for each scenario/customer. Customers are non-tech and I’m striving for simplicity. The only config right now is a DIP switch that sets whether a device operates as a LoRa mesh device or an End Device.

In reading your post, it makes me prioritize some ideas for self recovery such as:

  • Reset/power down the end LoRa device if it didn’t hear a response from the Particle LoRa gateway for something like 12 times in a row (i.e. 1 hour). Possibly…
  • When it resets, it attempts a join request, trying different time slots and/or waiting until it hears a LoRa message from a neighbor, so join requests can be robust.
  • Enter “orphan rescue mode” for one 5 minute period every hour (or some other duration) without a customer needing to put it in that mode manually. Maybe enter it if the Particle LoRa gateway didn’t hear from all expected LoRa nodes. Orphan rescue mode would have all mesh nodes stay awake and able to service the mesh for the entire 5 minute window, and any orphaned node would publish every 4 minutes, so the orphan should be rescued. Something like that.
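
The first idea in the list could be sketched roughly like this. It is only an illustration in plain C++: `MissedAckMonitor` is a hypothetical name, the limit of 12 is taken from the number mentioned above, and the actual reset/power-down call is left to the caller.

```cpp
#include <cstdint>

// Hypothetical helper: counts consecutive reporting periods in which the
// end node heard no response from the gateway.
class MissedAckMonitor {
public:
    explicit MissedAckMonitor(uint8_t limit) : limit_(limit) {}

    // Call once per reporting period. Returns true when the node should
    // reset / power-cycle and attempt to re-join.
    bool recordPeriod(bool gatewayResponded) {
        if (gatewayResponded) {
            missed_ = 0;          // any response clears the streak
            return false;
        }
        if (missed_ < limit_) {
            ++missed_;            // saturate at the limit
        }
        return missed_ >= limit_;
    }

    uint8_t missed() const { return missed_; }

private:
    uint8_t limit_;
    uint8_t missed_ = 0;
};
```

With a 5 minute reporting period, `MissedAckMonitor monitor(12);` would trip after about one hour of silence, and a single response from the gateway clears the count.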

I like the idea of adding logging to each end node device. Just not sure how practical it’ll be. It seems that once the device is in the field, access to logs is lost. Logging is useful for the initial bench and initial field tests, but after that it’s not really accessible. I’m thinking that rather than preventing every issue/scenario that could happen, I should focus on the functionality of auto recovery and test the features/scenarios of auto recovery on the bench.

Did you use LoRa radios and the RadioHead library, or different sub-GHz radios and libraries?

I like this idea !
My Mesh Networks that have 24/7 footprint coverage (Mesh routers always ON) are essentially hands off.
The battery powered sleeping nodes need new batteries every 3-5 years. They start to send warning messages months in advance, and I report the voltage anyway.
When the Gateways were previously Electrons, they would require a manual reset about once a year.
The “Gateways” are now Borons and reboot once a week on a schedule.
It’s very close to 100% reliability.

This is why I like your Solar Enclosure so much. You should be able to approach a similar reliability with a sleeping Mesh. When a Solar Router’s battery is charged and the sun is shining, you might as well have the radio spend some extra time listening… it’s Free “extra” Airtime for your Mesh Footprint to recover any orphan.

Yeah, it seems sleepy mesh is doable… I really like the idea of just stay awake longer if it has the battery power and solar to support it. Now that LoRa listening is 10mA I should be able to support more time in that mode. For simplicity, I think I’ll start with one 5 minute period per hour and see how that goes. The gateway “coordinates the show” so should be easy to enter/leave without making any updates to the nodes. If need be I could try a few variations of duration and how frequent to enter an orphan rescue mode.
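
As a rough sketch of that gateway-side scheduling (plain C++, not the actual firmware): the 5 minute window per hour comes from the plan above, while the extended window length, the 80% battery threshold and the function names are purely illustrative assumptions.

```cpp
#include <cstdint>

// The post fixes only "one 5 minute period per hour" as a starting point.
// The extended window is an assumed value for illustration.
constexpr uint32_t kPeriodSec = 3600;
constexpr uint32_t kBaseWindowSec = 300;
constexpr uint32_t kExtendedWindowSec = 900;

// Picks the rescue window length. The window is extended only when power
// is plentiful; 80% is an illustrative threshold, not a tested one.
uint32_t rescueWindowSec(bool solarCharging, uint8_t batteryPercent) {
    return (solarCharging && batteryPercent >= 80) ? kExtendedWindowSec
                                                   : kBaseWindowSec;
}

// True while the mesh should stay awake for orphan rescue. The window is
// placed at the start of each period, measured in seconds since the epoch.
bool inRescueWindow(uint32_t unixTime, uint32_t windowSec) {
    return (unixTime % kPeriodSec) < windowSec;
}
```

Because the decision lives in two small functions on the gateway, trying a different duration or frequency only means changing the constants there.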

I assume this was more of a preventative measure, right? I currently don’t have any type of automatic periodic reset. I do use the AB1805 as an external watchdog and do a deep power down for 30 seconds when it is unable to connect. I also have the out of memory handler do a reset, but other than that, unless something hangs up, it won’t auto reset itself. Maybe this is worth adding as a preventative measure. Or what was the reason you added it, just as a catch-all?

I only have 3 PCBs right now but will have another 10, hopefully by the end of the week. I just got permission from the local county park to set up whatever I want, so in 7-10 days I’ll set up a network throughout that park for a long-term test. I’ll have 3-4 mesh nodes with the rest as end-of-line nodes and see if everything can stay awake/stay in sync in a near real world scenario.

@Rftop what are you using for batteries in the end nodes? I currently use a single 18650 battery holder on the PCB itself. The PCB also has a USB port and a separate JST port for a solar panel to be plugged in, and the PMIC can charge from either. If it has the solar panel then it should keep the battery topped off, except below 32 degrees for multiple week-long durations. If it’s an end node without a solar panel, then maybe once a year they would pop the cover off and plug in a USB cable for 1-2 hours, or if they wanted they could swap in a fully charged 18650.

My devices are used 8-10 weeks out of the year. I would like to eventually get to the point where a person could enter “off-season” mode and the device could sleep for 6 or even 12 hours at a time, just to “stay in sync”, use very little battery and get multiple years from a LoRa end node. This isn’t really required, but it would be a nice feature if it made the beginning and end of a season “hands off”. For now, most people power them off and take them inside in the off-season. That’s fine for 5-10 devices but gets a bit out of hand as someone has more devices out there. They could always turn them off and back on while leaving them in their location, but that is still a lot of walking around.

The Gateway’s once a week reset was just a PM, as I’m a complete newbie to Coding.

Most versions of the (Non-Rechargeable) sleeping nodes use (2) L91 Lithium Primary AA batteries in Series for ~3.2V Nominal.

That will be challenging with Li-ion chemistry due to self discharge, especially with in-field conditions.
In my personal experience you’d want to decide (now, during design) if the end nodes will ever use their Solar recharging capability. Because if that answer is NO, then there are better long term battery chemistries available for such applications (versus 18650 Li-ion). But my vote is to add the $10 Solar Panel to all the nodes and never look back. Solar eliminates so many problems when it’s an option.

Yeah, I think you are right… All along, one consideration was to make them all solar charging capable, since they all have the PMIC for it anyhow, but simply not install the solar panel on end nodes, solely to reduce the cost of goods sold. That said, now that I’m learning more about self discharge of 18650s, it would likely require re-charging each node every season anyhow with 18650s, OR switching to, say, the L91s or some other option. Seems much simpler to just make them all solar charging vs trying to manage yet another battery chemistry/type and a slightly different PCB design just for end nodes vs mesh nodes. The solar panel cost in the quantities I’m considering is actually only $6.75. Maybe at most $10 if I count some consumables: a JST connector, wire, the gasket material and the time on the CNC to make it. Seems like an easy decision… Thanks for the guidance/encouragement of going all solar!

It was not LoRa and the RadioHead library, but the mesh was not that different. The challenge was different: running up to 100 sensors on 1/2 AA’s for years. Listening mode had to be used very sparingly.

Mesh is definitely doable, when it can suit the application, and it is super cool to have it running.

I also like this idea, when there is power for it, where the network essentially presses a reset button on itself when needed, or as regular housekeeping. It is an example of keeping the network challenges away from the operator.

Yeah… from what I can tell… attempting Mesh without energy harvesting (or using much larger batteries) would be challenging. As @Rftop mentioned using Solar sure does eliminate many of the obstacles but maybe that wasn’t available in your case.

@thrmttnw - As always, I appreciate the info and guidance!

@all,

I have been silent on this topic for the past few weeks as I struggled with an elusive issue that would prevent the gateway from receiving new messages. @jgskarda has been giving me troubleshooting tips all along and today made a suggestion that - it seems - was my problem. I want to mention it here as it could easily be an issue for others.

First, my use case is a little different, so my implementation is different than Jeff’s. In my case, I want to connect nodes that cannot get cellular service in parks to a “gateway” that can get service. While the nodes often have to be placed in a specific spot, this gateway can be positioned anywhere it can get both cellular and LoRa connections to the nodes. I have tested this on a couple of trouble spots in our mountainous NC parks and I believe it will work. This is a sleepy mesh like Jeff’s but, in my case, the nodes and gateway all have solar panels, so power is not as big a deal and therefore I am not going for sub-second scheduling. My gateways will also service fewer nodes.

My problem with stability arose when I introduced sleep into the operation of the nodes. The insight Jeff had was that while the Boron was sleeping, the LoRa radio was not. There are many LoRa devices out and about today not associated with my project. When the Boron went to sleep, the LoRa radio might still receive a message (not for us, but a message nonetheless) and it would raise the interrupt, which would not get serviced as the Boron was asleep. Then, when I went into the LoRa state and started listening, new messages were ignored as the current interrupt was not yet serviced. Clearing the LoRa message queue before listening and putting the LoRa radio to sleep at the end of the LoRa state seem to have fixed the issue.
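
A minimal sketch of the fix described above, not the actual project code. It assumes a radio object with `available()`, `recv()` and `sleep()` methods modelled on the RadioHead driver interface; `drainStaleMessages`, `endLoRaWindow` and `FakeRadio` are hypothetical names, and `FakeRadio` exists only so the helpers can be exercised off-device.

```cpp
#include <cstddef>
#include <cstdint>

// Discards any frames the radio buffered while the MCU slept, so the next
// receive starts clean. The loop is bounded so it can never spin forever.
template <typename Radio>
size_t drainStaleMessages(Radio &radio, size_t maxFrames = 8) {
    uint8_t buf[255];
    size_t dropped = 0;
    while (dropped < maxFrames && radio.available()) {
        uint8_t len = sizeof(buf);
        radio.recv(buf, &len);   // read the frame and throw it away
        ++dropped;
    }
    return dropped;
}

// Call at the end of the LoRa state, before the MCU sleeps, so foreign
// traffic cannot raise an interrupt that nobody services.
template <typename Radio>
void endLoRaWindow(Radio &radio) {
    radio.sleep();
}

// Stand-in radio for bench testing the helpers without hardware.
struct FakeRadio {
    int pending = 0;
    bool asleep = false;
    bool available() { return pending > 0; }
    bool recv(uint8_t *, uint8_t *len) { *len = 0; --pending; return true; }
    bool sleep() { asleep = true; return true; }
};
```

On the device, the LoRa state would call `drainStaleMessages()` on entry and `endLoRaWindow()` on exit, with the real radio driver object in place of `FakeRadio`.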

Here is the code - still under development - in case you are interested:

I also updated my Particle Carrier Board to include a LoRa radio footprint. This way, I can have the carrier boards made at scale and easily add the LoRa module to those carriers where it is needed. The module fits under the Boron for a very neat appearance. I have a small run of 20 of these at MacroFab now.

Now that things are stable (knock on wood), I need to start adding some of the features needed to support a field deployment. At least I am not stuck anymore.

Comments, suggestions welcome,

Chip

So how many minutes after making this post did you jinx yourself and cause the gateway to hang up again? :rofl:

Hopefully that’s the main and only issue! It would at least explain what you were seeing.

By the way… very nice PCB Carrier board update! It’s convenient for Gen 3 Devices how the Hope RF RFM95 module fits right in-between the headers. Looks Nice!

Any initial results from that LoRa antenna vs a wire vs a whip style antenna, or is it too early to tell without some longer term field tests?

@jgskarda ,

Ha! I was thinking that same thing when I wrote it - now 10 hours without missing a single 10 minute period. Oh no, I hope that does not jinx it!

There are a few next steps:

  • Testing the different antennae - I can say the cheap patch antennae are better than the wire whips. I do need to test the fancy Taoglas versions to see if they are worth 5x the price!

I will let you know what I find out. I also need to add some more features before I can deploy these:

  • Rescue mode (node)
  • Reset criteria (Gateway)
  • Ability to accommodate more than one node.
  • Ability to end “LoRa Mode” early if every node checks in
  • Webhooks for Alerts and for Data Reports
  • Reporting for the Gateway not just the node
  • Low power protocols
  • Invalid time protocols

To name a few. My hope is to have something I can deploy by the end of October.

Thanks,

Chip

10 hrs + - That’s great to hear!

I’m on a similar trajectory… mostly focused on the Particle LoRa gateway functionality/cleanup that is:

  • Finish cleanup of singleton classes for better code management/organization
  • Add millisecond timestamp when LoRa messages are received and add to backend database to track long term
  • Attempt a separate thread to handle LoRa messages so messages are not missed. Maybe just add something to softDelay().
  • Small 100-250 ms corrections to AB1805 time when AB1805 time does not match cloud time. Repeat every 2 hours. AB1805 is the time master of all nodes so only make small corrections to prevent a disruption to the nodes.
  • Auto rescue mode (Automatically enter rescue mode for 10 minutes every 2 hours)
  • Parameterize the LoRa node and add a msgType of parameter update to allow the Particle Gateway to update parameters on LoRa nodes.
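
The small-correction item in the list could look something like the following sketch. The limits mirror the “100-250 ms” range mentioned there, but the exact interpretation (ignore errors under 100 ms, cap each step at 250 ms) and the function name are assumptions for illustration.

```cpp
#include <cstdint>

// Assumed limits drawn from the "100-250 ms" range in the list above.
constexpr int32_t kMinStepMs = 100;
constexpr int32_t kMaxStepMs = 250;

// errorMs = cloud time minus RTC time, in milliseconds.
// Returns the signed adjustment to apply to the RTC on this pass.
int32_t rtcCorrectionStepMs(int32_t errorMs) {
    if (errorMs > -kMinStepMs && errorMs < kMinStepMs) {
        return 0;                       // close enough, leave the RTC alone
    }
    if (errorMs > kMaxStepMs) {
        return kMaxStepMs;              // cap a large positive error
    }
    if (errorMs < -kMaxStepMs) {
        return -kMaxStepMs;             // cap a large negative error
    }
    return errorMs;                     // within range, correct fully
}
```

A large error is then walked in over several passes rather than applied at once, which is the point of keeping the time master from jumping under the nodes.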

From there it’s on to some backend/web app work to make claiming & assigning LoRa nodes to a specific Particle Gateway automatic instead of it being Wizard of Oz (i.e. me manually doing it).

My PCBA boards are shipping tomorrow, so in the next week or so I should be able to set up a network of 10+ nodes in the local county park for easy access and more real world endurance testing.

So are you still going without any issues @chipmc? Are you considering it resolved?

As for me, it’s been a fun night experimenting with threads. I am not sure if I did this right but tonight’s challenge was breaking out the servicing of LoRa messages into a dedicated thread on the LoRa Particle Gateway. I basically made my existing LoRa.h/.CPP file support threads per this very nice guide:

And then I split out my existing finite state machine (FSM) into a simple LoRa FSM and main FSM. Like this:

I did this for 2 main reasons:

  • Prior to this, the main FSM had to be in the proper state to service any LoRa message it received. If a LoRa message arrived late (as the Particle Gateway was taking its own sensor readings, connecting, publishing data out, etc.), the message was just ignored. Using a separate thread, the LoRa Particle gateway can always service any LoRa message that comes in, as long as it is on and not sleeping.
  • It simplifies the main FSM and makes things a bit cleaner/easier to manage.

This is now fully functional and seems to be working how I want it to. However, I am new to threads so I had a question or two for anyone more familiar with threading than I am:
My worker thread (Lora.cpp) is fairly isolated from everything else since it handles all things LoRa. However it does still have some interaction.

  • Read the time from the AB1805 RTC. The main thread also calls methods to this class to read/set the time.
  • Read common configuration variables from memory. The main thread also reads/writes to these variables in memory.
  • Blink the status LED indicating we received a message. The main thread doesn’t do anything besides declare it in setup.
  • Call a method from a different singleton class to append/add a JSON object to an existing JSON (used to send data from LoRa nodes to the cloud), but we accumulate data via a single JSON object first to minimize data operations. The main thread can also add its own JSON object and then closes it before publishing.

Most variables and methods are accessed via a singleton class, with a few exceptions. Do I need to put lock() and unlock() only in my worker thread, or is this also required wherever the main thread touches the same things? I.e. I am doing something like the following in the LoRa thread anytime there is overlap between the two threads. Is this right, or is more needed in the main thread as well? Is multi-threading fraught with danger, or appropriate for something like this?

        lock();
        ab1805.getRtcAsTime(RTC_Time, RTC_Hundrths);
        unlock();

        // (Some other code in between here)

        lock();
        // Add data to JSON object and publish if needed
        particle_fn::instance().addJsonMsg(jw.getBuffer(), jw.getOffset());

        // Turn on the RGB light indicating we received a message
        LED.on();
        unlock();
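
As a general point on the locking question: a lock only protects data if every thread that touches that data takes the same lock, so guarding only the worker side leaves the main thread free to collide with it. A minimal sketch of that idea follows, using `std::mutex` as a stand-in for whatever lock() and unlock() wrap on the device. `SharedConfig`, `noteMessage` and `readCount` are hypothetical names.

```cpp
#include <cstdint>
#include <mutex>

// Shared state that both the worker thread and the main thread touch.
struct SharedConfig {
    std::mutex mtx;
    uint32_t messagesSeen = 0;
};

// Writer side (e.g. the LoRa thread). Takes the mutex for the update.
void noteMessage(SharedConfig &cfg) {
    std::lock_guard<std::mutex> guard(cfg.mtx);
    ++cfg.messagesSeen;
}

// Reader side (e.g. the main thread). Takes the SAME mutex, otherwise the
// lock in noteMessage() would protect nothing.
uint32_t readCount(SharedConfig &cfg) {
    std::lock_guard<std::mutex> guard(cfg.mtx);
    return cfg.messagesSeen;
}
```

Putting the lock inside the accessor functions, rather than at each call site, means neither thread can forget it.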

@jgskarda ,

Yes, I am running quite reliably now so thank you again for the tip on cache and sleep.

I understand your point on not wanting to miss a message while connecting but I wonder if there is an easier way to do this. I seem to remember some cautionary tales about using threads though your approach does seem well thought out.

Thanks, Chip

@chipmc Yeah, I think what I am attempting to accomplish is necessary functionality but as you said, maybe there is an easier or at least more robust way to accomplish it without having the complexity/concerns of threads.

From the threading explainer:

Used incorrectly they can introduce new and novel issues into your code that are often more difficult to debug than single-threaded code. Exercise caution and consider alternate designs such as finite state machines when your requirements allow.

There is even harsher wording here.

My vague concern is I’m new to threads so honestly I don’t know how much of what I’m doing is incorrect vs correct. I just know it works (at least initially. :slight_smile: ).

What I am considering trying now is keeping the LoRa finite state machine (FSM) separate as shown, but instead of executing it from a thread function, simply locating the FSM within a LoRa.process() method in the LoRa singleton class. Then within the main thread I can just execute LoRa.process() on each call to loop(). Likewise I could add it to softDelay() just like we do for Particle.process(). I.e. kind of like this:

/*************************************************************************************/
//                            LOOP
/*************************************************************************************/
void loop()
{
  ab1805.loop();
  Particle.process();
  LoRa.process(); 
  PublishQueuePosix::instance().loop();

  //Run the main state machine
  stateMachine();
}

As well, anywhere I need to use a delay() I use softDelay(), and I can include LoRa.process() there as well.

/*******************************************************************************
 * Function Name  : softDelay()
 *******************************************************************************/
inline void softDelay(uint32_t t) {
  uint32_t ms = millis();
  while(millis() - ms < t){
    Particle.process(); 
    LoRa.process();
    PublishQueuePosix::instance().loop();
    ab1805.loop();
  }   
}

If I did this, the LoRa FSM would be executed all the time in order to service messages all the time. I think this would accomplish the desired functionality without introducing a dedicated thread. However, my understanding is that LoRa.process() can now block the main thread, as several functions within RadioHead are blocking, manager.sendtoWait() for example. I believe this blocks until it receives an ACK back or times out. Most times it will only block for 150 ms or so, but at times it can block the main thread for 3-4 seconds depending on whether it needs to perform a route discovery, the number of retries, and the timeouts due to retries. I assume this would still be more robust than multi-threading though, right?

I’d be curious to hear what other people think of this method vs using a separate thread. Which way is better/more robust?

Running the LoRa code in a thread is probably fine. The main thing to be careful of is how you enqueue stuff to be processed. Ideally, if you can create a queue that holds every operation to be handled by the LoRa library, and the LoRa thread both dequeues from it and calls LoRa.process(), it will be safe and will not affect the main thread of your application.

The danger is when you call into the LoRa library from multiple threads and the code is not designed to be thread safe. That’s when random hard-to-debug disasters occur.

Thank you very much for the guidance @rickkas7! If I’m understanding you right, as long as I only call RF95 and other RadioHead methods from the dedicated LoRa thread then it should be fine. To accommodate that, I changed up the FSM slightly. I now have this running in the LoRa thread:

This ensures all calls into the RadioHead library are initiated from the dedicated LoRa thread. From the main thread I just set a slpRqst variable = true. This transitions the LoRa FSM running in the separate thread to the sleep prep state, where it can call rf95.sleep().

I.e. from the main thread, I’ll set the variable to true or false if I want to sleep or not:
lora::instance().slpRqst = true;

and then from the LoRa threadFunction() I’ll evaluate the same Boolean variable to advance the FSM.

        case st_LoraMsgRx:
        {     
            // If the main thread is telling us to sleep, then enter the sleep prep state where we turn the radio off using RF95.sleep()
            if (slpRqst){
                st_newLoRaState = st_LoRaSlpPrep;
            }

Is this an acceptable way to handshake between the main thread and a separate worker thread?
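
One way to make a flag handshake like this well defined is to use an atomic flag, so that each read and write is indivisible and visible to the other thread. The sketch below is standard C++ and only an illustration: `SleepHandshake` and the acknowledgement flag `radioAsleep_` are hypothetical additions, not part of the code above.

```cpp
#include <atomic>

// Stand-in for the sleep-request flag shared between the two threads.
class SleepHandshake {
public:
    // Main thread: ask the LoRa thread to put the radio to sleep.
    void requestSleep() { slpRqst_.store(true); }

    // LoRa thread: returns true once per request and clears the flag.
    bool consumeSleepRequest() { return slpRqst_.exchange(false); }

    // LoRa thread: report that the radio is now asleep.
    void markRadioAsleep() { radioAsleep_.store(true); }

    // Main thread: check this before sleeping the MCU.
    bool radioAsleep() const { return radioAsleep_.load(); }

private:
    std::atomic<bool> slpRqst_{false};
    std::atomic<bool> radioAsleep_{false};
};
```

The acknowledgement flag is there so the main thread can wait until the LoRa thread has actually called rf95.sleep() before it puts the MCU to sleep, instead of assuming the request was handled.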

What about adding the data that came from a LoRa message to a JSON Array? Is there any concern in writing to and reading from a public JsonWriterStatic object concurrently from two threads? Does that happen so fast that threads won’t collide? Do I need to protect that using lock() and unlock() anytime I use the JsonWriterStatic object from either thread?

For example: When a LoRa message is received, I first construct a JSON object of that LoRa message payload. I then call a method from a different singleton class to add it to an aggregated JSON Array.

This is located in the LoRa thread to first construct the JSON object:

 JsonWriterStatic<256> jw;
  {
  JsonWriterAutoObject obj(&jw);
    jw.insertKeyValue(String(cfg.dataPtID.DevReadUTC), Time.now());
    jw.insertKeyValue(String(cfg.dataPtID.LoraBase64), encoded);
    jw.insertKeyValue(String(cfg.dataPtID.rssiRx), rf95.lastRssi());
    jw.insertKeyValue(String(cfg.dataPtID.snrRx), rf95.lastSNR());
  }

particle_fn::instance().addJsonMsg(jw.getBuffer(), jw.getOffset());

Which then calls this method from a singleton class that has the JSONArray declared as public member of the class:

    //Define Buffer for JSON Array and Initialize it
    JsonWriterStatic<1024> JSONArray;

And this is my addJsonMsg() method within that class:

 /*******************************************************************************
 * Function Name  : addJsonMsg()
 *******************************************************************************/
void particle_fn::addJsonMsg(const char *json, uint16_t json_len){
    //If adding the new JSON object extends the array beyond the max bytes of a Publish event, close the array, publish it, clear it and start another array 
    if ((cfg.maxEventDataSize - int(JSONArray.getOffset()) - int(json_len)) < 8 ){
        JSONArray.finishObjectOrArray();
        PublishQueuePosix::instance().publish("DataArray", JSONArray.getBuffer(), PRIVATE | WITH_ACK);
        JSONArray.init();
        JSONArray.startArray();
    }

    //Add new data to the JSON Array
    JSONArray.insertCheckSeparator();
    JSONArray.insertJson(json);
}

Do I need to add lock() and unlock() around this somewhere or any thread protection needed for this example? The main thread can also call addJsonMsg(). Any concerns on the side of publishing data from LoRa messages or is this fairly safe as well?

At the moment I’m waking up, listening to LoRa messages in the separate thread, publishing data and falling back to sleep successfully with no obvious issues. Just trying to avoid any unusual/difficult scenarios by doing the due diligence now on the best way to structure this to be robust.

That seems good.

Whether you need to lock or not depends on how the queue is implemented between the threads. If you use a thread-safe queue you don’t need a lock.

However, if you are using a hand-built one or std::deque, you need to mutex lock around it so the queue does not get corrupted when accessing it from two different threads.

Thanks for the clarification @rickkas7. I attempted to use a thread safe queue per the link but wasn’t successful in making it robust. Can you provide a little guidance on how this would look if I’m trying to pass a JSON object via the queue? I need to handshake both ways: from the LoRa thread to the normal loop() to send out sensor data, as well as from the normal loop() back into the LoRa thread to pass in LoRa node configuration data (updated via a Particle Function). Without trying to use the queue or mutex lock/unlock, it seems to work quite well. I think I’m just lucky in the sequencing (i.e. based on timing, the main loop doesn’t update anything while the LoRa thread uses it, and vice versa). I’d still prefer to have something to guard against it though, to prevent weird things happening.

So what’s the best way to pass the JSON in/out of the thread using a queue? Or what data type would you recommend for the queue? I tried a few variations of this as well as something similar with char msgbuf[256]. Anytime I tried to use the queue, the published message seemed to be partially corrupted. For now, I just went back to still using the thread but not using the queue or mutex locks. Any additional guidance?

os_queue_t queue;

//Located in Setup()
 os_queue_create(&queue, sizeof(JsonWriterStatic<256>), 5, 0);

In the Thread:

JsonWriterStatic<256> jw;
{
JsonWriterAutoObject obj(&jw);
jw.insertKeyValue(String(cfg.dataPtID.DevReadUTC), Time.now());
jw.insertKeyValue(String(cfg.dataPtID.LoraBase64), encoded);
jw.insertKeyValue(String(cfg.dataPtID.rssiRx), rf95.lastRssi());
jw.insertKeyValue(String(cfg.dataPtID.snrRx), rf95.lastSNR());
jw.insertKeyObject("metaData");
    jw.insertKeyValue(String("time"), Rx_Time_Comp);
    jw.insertKeyValue(String("hund"), Rx_Hundrths_Comp);
    jw.insertKeyValue(String("hops"), hops);
    jw.insertKeyValue(String("reTX"), Rx_Retransmit_cnt);
    jw.insertKeyValue(String("tmOut"), RX_TotalTXDelay_hundreths*10);
jw.finishObjectOrArray();
}

os_queue_put(queue, (void *)&jw, 0, 0);

And in Loop:

  JsonWriterStatic<256> jw_take;
  if (os_queue_take(queue, &jw_take, 0, 0) == 0) {
          // Add data to JSON object and publish as needed
          prtF.addJsonMsg(jw_take.getBuffer(), jw_take.getOffset());
   }
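
One possible explanation for the partial corruption, offered as an assumption rather than a confirmed diagnosis: a queue of fixed-size items copies each item byte for byte. If the JSON writer object keeps an internal pointer to its own buffer, the copy that comes out of the queue still points at the sender’s original object, which may already have gone out of scope. Passing a plain struct that holds only the characters avoids that. The sketch below uses `std::deque` with `std::mutex` as a stand-in for the RTOS queue so it can run anywhere; `LoRaJsonMsg`, `makeMsg` and `MsgQueue` are hypothetical names.

```cpp
#include <cstddef>
#include <cstring>
#include <deque>
#include <mutex>

// Plain message: only bytes, no pointers or virtual functions, so a
// byte-wise copy into a queue stays valid.
struct LoRaJsonMsg {
    char json[256];
    size_t len;
};

// Fills a message from a finished JSON buffer. Returns false if too large.
bool makeMsg(const char *json, size_t len, LoRaJsonMsg &out) {
    if (len >= sizeof(out.json)) return false;
    std::memcpy(out.json, json, len);
    out.json[len] = '\0';
    out.len = len;
    return true;
}

// Small mutex-guarded FIFO standing in for an RTOS queue.
class MsgQueue {
public:
    explicit MsgQueue(size_t capacity) : capacity_(capacity) {}

    // Producer side. Returns false when full so the caller can decide.
    bool put(const LoRaJsonMsg &msg) {
        std::lock_guard<std::mutex> guard(mtx_);
        if (items_.size() >= capacity_) return false;
        items_.push_back(msg);
        return true;
    }

    // Consumer side. Returns false when there is nothing to take.
    bool take(LoRaJsonMsg &out) {
        std::lock_guard<std::mutex> guard(mtx_);
        if (items_.empty()) return false;
        out = items_.front();
        items_.pop_front();
        return true;
    }

private:
    std::mutex mtx_;
    std::deque<LoRaJsonMsg> items_;
    size_t capacity_;
};
```

On the device, the same idea would mean creating the queue with sizeof(LoRaJsonMsg) as the item size, filling a LoRaJsonMsg from jw.getBuffer() and jw.getOffset() in the LoRa thread, and calling addJsonMsg() with the json and len fields in loop().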