Long Range IoT Networks - Chapter 2 | Particle + LoRa Better together

As a quick update… I received more PCBs yesterday and was able to add 10 more nodes to my bench test (13 nodes total for now). Here is what the sequencing now looks like. Notice that the Boron (bottom right) wakes up first and then starts listening to the nodes, which each wake up and transmit data in sequence. Each node is currently configured to send its data and listen for a response from the gateway 1.5 seconds after the previous one. Also note that the first two nodes are set up as “mesh nodes” (i.e. they listen for messages and route them if needed). Since mesh nodes are configured to wake up with every message heard during the report window, their LED blinks each time they hear a message (one blink for the message from a neighboring node and a second blink for the gateway sending a message back to that node).
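The sequencing described above amounts to giving every node its own time slot inside the report window. Here is a minimal sketch of that idea, assuming a zero-based node index and a fixed 1.5 second slot; `kSlotWidthMs`, `slotOffsetMs` and `windowLengthMs` are made-up names for illustration, not from my firmware.

```cpp
#include <cstdint>

// Assumed slot width: nodes report 1.5 seconds apart.
constexpr uint32_t kSlotWidthMs = 1500;

// Offset (ms from the start of the report window) at which a node transmits.
// nodeIndex is a hypothetical zero-based position in the reporting sequence.
constexpr uint32_t slotOffsetMs(uint32_t nodeIndex) {
    return nodeIndex * kSlotWidthMs;
}

// How long the gateway has to listen to cover nodeCount slots.
constexpr uint32_t windowLengthMs(uint32_t nodeCount) {
    return nodeCount * kSlotWidthMs;
}
```

With 13 nodes the last node (index 12) would start at 18.0 seconds and the whole window would last 19.5 seconds.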

I also started logging the timestamp when a message was received by the Particle LoRa Gateway and plotted some of that in Power BI to visualize it. Each dot represents the time when a message was received by the gateway. The X axis is the macro time scale (i.e. time of day), whereas the Y axis is the micro time scale (i.e. seconds since the beginning of the LoRa reporting period, with hundredths-of-a-second timestamp precision). Each color is a particular node. What I look for is that each node “stays in its lane” so messages don’t collide. Overall it seems fairly consistent, with some exceptions. I’m not 100% sure if the exceptions are real or an artifact of how I’m calculating the timestamp. All in all, maybe this helps illustrate the concept:
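For anyone wanting to reproduce the Y axis value, a rough sketch of the calculation is below. It assumes report windows start on whole multiples of the reporting period in Unix time, which may not match how a given gateway schedules its windows; `secondsIntoWindow` is a hypothetical helper name and the 300 second period in the note is only an example value.

```cpp
#include <cstdint>

// Seconds elapsed since the start of the current report window.
// Assumption: windows start on whole multiples of periodSec in Unix time.
// hundredths is the RTC hundredths-of-a-second value (0..99).
constexpr double secondsIntoWindow(uint32_t epochSec, uint8_t hundredths, uint32_t periodSec) {
    return static_cast<double>(epochSec % periodSec) + hundredths / 100.0;
}
```

For example, a message stamped 19 seconds and 50 hundredths into a 300 second window plots at 19.5 on the Y axis.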

Next steps are to find what causes the outliers here, or at least understand whether those outliers are real. Then it’s time to put them all in enclosures and set the system up with some distance between them.


Well, I found 2 things that caused the outliers in the earlier chart:

  • I was not compensating correctly for retransmissions, route discovery messages, etc. when capturing the timestamp of an arriving LoRa message: I was adding the compensation instead of subtracting it. With the compensation applied correctly, the timing is nearly perfect. This just made the data look messy when it really wasn't.
  • I originally was resetting the hundredths value within the AB1805 every 2 hours (every two hours, I perform a Particle.syncTime()), even if the Unix epoch time already matched on both. I figured I only really need to adjust the AB1805 time if it doesn't match system time.
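To make the sign of that first fix concrete, here is a minimal sketch. It assumes each retransmission delays arrival by one fixed retry timeout, which is a simplification (route discovery would add its own delay); `firstTxTimeMs` and its parameters are hypothetical names, not my actual gateway code.

```cpp
#include <cstdint>

// Estimate when a message was first transmitted, given when it finally
// arrived at the gateway. Assumption: every retry cost exactly one
// retryTimeoutMs. The delay is SUBTRACTED from the arrival time; adding
// it pushes the point even further from the node's real slot.
constexpr int64_t firstTxTimeMs(int64_t arrivalMs, uint32_t retries, uint32_t retryTimeoutMs) {
    return arrivalMs - static_cast<int64_t>(retries) * retryTimeoutMs;
}
```

So a message that arrives at 19.900 seconds after two retries of 200 ms each maps back to the 19.500 second slot.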

After making these two updates last night, it seems it's almost perfect:

I still want to stay in sync with the cloud time, so I just check Time.now() vs ab1805.getRTCastime(). If the two are not equal, I walk the AB1805 towards Time.now() by 250 ms. That one stair step you see occurred when I did a cloud sync and then walked the AB1805 forward by 250 ms.

I currently check, and then walk forward or back by 250 ms if needed, every 2 hours, each time I sync time with the cloud. What I'm thinking now is to still sync cloud time every 2 hours but walk the AB1805 forward/back by 50 ms every 20 minutes, so the disruption is less pronounced. I may also reduce the deadband for when a node adjusts its own clock, i.e. don't wait until you are off by 100 ms before setting the time; just make a small correction if you are off even by 20 ms.
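The deadband plus small step idea can be sketched like this. It is an illustration under my own assumptions, not the firmware itself: `clockStepMs` is a hypothetical helper, and the 20 ms deadband and 50 ms step in the note are just the values I'm considering above.

```cpp
#include <cstdint>

// Returns the correction (ms) to apply to the local RTC this round.
// errorMs = reference time - RTC time (positive means the RTC is behind).
constexpr int32_t clockStepMs(int32_t errorMs, int32_t deadbandMs, int32_t maxStepMs) {
    const int32_t magnitude = errorMs < 0 ? -errorMs : errorMs;
    if (magnitude <= deadbandMs) {
        return 0;                                   // inside the deadband: leave the clock alone
    }
    if (magnitude > maxStepMs) {
        return errorMs < 0 ? -maxStepMs : maxStepMs; // walk towards the reference, don't jump
    }
    return errorMs;                                  // small enough to fix in one step
}
```

With a 20 ms deadband and a 50 ms step, an error of 10 ms is ignored, 30 ms is corrected in full, and 400 ms is walked down 50 ms at a time.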

@Rftop - I just put one of the nodes in the refrigerator, another one in the freezer and another one on the food dehydrator set to 125 degrees F. We will see what happens to the timing. :slight_smile: Then this weekend it's setting at least 10 of these up outside spread apart to truly be mesh for some endurance testing.

UPDATE: After just a little time in the freezer at 8 degrees F as well as another node at the other extreme of the Food Dehydrator at 135 degrees F, two things were apparent/expected:

  1. The RTC ran slower for both (as we'd expect), and each node's timestamp started making a sawtooth shape. It looks like about a 50 ms adjustment every 25-30 minutes or so, i.e. roughly 2.4 seconds/day slow (about 30 PPM error). This matches a typical crystal oscillator chart, as @chipmc helped me understand. The system continually corrects for this once outside the 50 ms deadband, which produces the sawtooth shape. I think this is quite reasonable. Besides, in my use case this is a pretty extreme temperature difference; more typically all devices will get colder/warmer together. For now, I'm not planning on using any temperature compensation on the RTC (although I suppose I could). Maybe...

  2. With the corrections that are occurring, occasionally a timestamp is not even close... but why? Plotting the data made it apparent that those outliers are exactly 1 second further out than we'd expect. Then I remembered reading something about this in the AB1805 datasheet. I'm guessing I was occasionally reading the time right at the rollover of the hundredths register, causing the wrong seconds value to be used. I updated the AB1805 library to use the algorithm below and so far so good. I'm throwing more nodes in the freezer overnight and we will see what happens.

5.6 Hundredths Synchronization
If the Hundredths Counter is read as part of the burst read from the counter registers, the following algorithm must be used to guarantee correct read information.

  1. Read the Counters, using a burst read. If the Hundredths Counter is neither 00 nor 99, the read is correct.
  2. If the Hundredths Counter was 00, perform the read again. The resulting value from this second read is guaranteed to be correct.
  3. If the Hundredths Counter was 99, perform the read again.
    A. If the Hundredths Counter is still 99, the results of the first read are guaranteed to be correct. Note that it is possible that the second read is not correct.
    B. If the Hundredths Counter has rolled over to 00, and the Seconds Counter value from the second read is equal to the Seconds Counter value from the first read plus 1, both reads produced correct values. Alternatively, perform the read again. The resulting value from this third read is guaranteed to be correct.
    C. If the Hundredths Counter has rolled over to 00, and the Seconds Counter value from the second read is equal to the Seconds Counter value from the first read, perform the read again. The resulting value from this third read is guaranteed to be correct.
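The datasheet steps above can be sketched roughly as follows. This is not the code from the AB1805 library: `RtcReading`, `readCountersSafely` and the `burstRead` callback are hypothetical stand-ins for the real I2C burst read, and the seconds value is simplified to a single field. For cases 3B and 3C the sketch always takes the third read, which the excerpt says is guaranteed correct in both cases. The final fallthrough (second read neither 00 nor 99) is not covered by the excerpt, so treating it like step 1 is my assumption.

```cpp
#include <cstdint>
#include <functional>

struct RtcReading {
    uint32_t seconds;    // seconds counter (simplified to one field for this sketch)
    uint8_t  hundredths; // hundredths counter, 0..99
};

// burstRead is a hypothetical callback that performs one burst read of the counters.
RtcReading readCountersSafely(const std::function<RtcReading()>& burstRead) {
    const RtcReading first = burstRead();
    if (first.hundredths != 0 && first.hundredths != 99) {
        return first;            // Step 1: read is correct
    }
    if (first.hundredths == 0) {
        return burstRead();      // Step 2: second read is guaranteed correct
    }
    const RtcReading second = burstRead(); // Step 3: first read was 99
    if (second.hundredths == 99) {
        return first;            // 3A: first read is guaranteed correct
    }
    if (second.hundredths == 0) {
        return burstRead();      // 3B/3C: third read is guaranteed correct
    }
    return second;               // Assumption: handled like step 1 (not in the excerpt)
}
```

The callback keeps the logic testable off-device; on real hardware it would wrap whatever register read the RTC driver already does.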


I’m a bit ashamed how long this last RTC issue took me… I’m sure learning a lot about that AB1805 RTC. After moving nodes to the freezer and food dehydrator, I observed more cases of the timing “jumping” by ~250 ms. After plotting the data a bit more in Power BI, I realized this always occurred after a time correction was made. I then forced it to make a time correction every time, instead of only when outside of a deadband, and it ended up looking like this, i.e. it would alternate between adding an additional 250 ms and not.

It also always bothered me that a node should report at an exact time, e.g. 08.500 (8.5 seconds after the hour), but it always seemed to be a little late (e.g. 8.650 seconds after the hour).

After a lot of time digging into this, I finally realized the AB1805 wasn’t actually using the hundredths register to wake up on like I assumed it was. In fact, it would wake up somewhere between 200 and 400 ms “late” compared to what the alarm was set for. After many, many hours of digging, I FINALLY found it. This little nugget in the datasheet called “Interrupt Mode” isn’t very clear. Basically, it doesn’t just set the on duration of the interrupt, it also sets the precision of the interrupt. I’m guessing it waits for the alarm registers to match the current time registers but then still waits for the next 1/4 second signal. So in reality, my interrupt was only precise to 1/4 second.

I changed this to be 01: (1/8192 seconds) and now whatever alarm I set it to, it wakes up at nearly that precise time. This should remove additional variability in when each LoRa node reports. What I really like about finding this issue, is the device now reports back exactly when the LoRa gateway told it to. I.e. tell a node to report back at 18.000 seconds and it’ll report back at almost exactly 18.000. So far it’s been only +/- 10 ms.

Little by little finding and correcting all the issues/config with the RTC. Hopefully all this detail and time precision pays off!


That looks like that might be the end of the RTC fun. It seems pretty darn rock solid now at least for the bench test. The 3 “outliers” I had seem to be duplicate messages but from what I can tell, the rest of the time everything stayed nearly perfectly in sync.

First graph: Color by Device ID (confirms all devices are staying in their lane, equally spaced and generally very consistent). The little “humps” you see are where the Particle LoRa Gateway syncs time with the cloud and then slowly adjusts the time on the local AB1805. Each dot is a LoRa message received by the Particle LoRa gateway. X axis is time (messages spaced 5 minutes apart), Y axis is the seconds value when the message arrived.

Color by re-transmission/timeout (This indicates it maintained near perfect time even with re-transmissions and new route discovery messages).

Zooming into a particular device, we can see how accurate a LoRa node is. Here the target reporting window is 19.5 seconds. Variability is 19.45 → 19.52. Seems to be within +/- 50 ms or so.

This is all bench tests so far… getting ready for the field endurance test here likely later this week. Just waiting on a few more solar panels from Voltaic and in the field they go.

I hope no one is getting bored or annoyed with the posts and updates… I’m taking a “work out loud” mindset with this project and always appreciate the advice and guidance provided by this community!


maaan it is awesome to be able to follow and read the updates, congrats and keep it up!
Thanks a lot for sharing this with us.


I learn something every time I read a new post in this thread…and from everybody that’s posted.


Thank you for this open way of working. I can't say I understand everything, but I am learning.
I am wondering, once you get to a good point, if you intend to share the project for anyone to replicate.
If you do, I am not sure where to start. I see hardware sprinkled in the thread, but I don't see the code. Perhaps a few coding examples are sufficient.

@Pescatore Great question…

I’ve thought about making this project fully “open source”, including the PCB design, firmware code and all supporting documents/diagrams, on either GitHub or Hackster or something like that. I mostly haven’t done that yet due to lack of time to put it all together, and feeling it wasn’t “good enough” yet to share openly. It’s also continuing to evolve. When I started this several years ago it definitely wasn’t worthy of sharing, but as I’ve learned from this community and built my own skills, it likely will end up there eventually. In the meantime… if you are looking for a few coding examples, just let me know which parts/pieces. I also know @chipmc shared his GitHub earlier in this thread. Mine is a similar code structure. I see us continuing to converge the code base as we can/learn. Maybe an end state in all of this someday is an open source structure, library, and how-to guide for deploying LoRa with Particle.


I appreciate the intention to share your work.
A few years ago I had wired up two RF95 modules to Photons. I could never get them to work and gave up. I didn’t have Rick’s tutorial, so I didn’t understand what I was doing.
I saw your “Chapter 2” post and started following, thinking about restarting my project. To be fair I should start like you did, by getting Rick’s tutorial working.
Combing through Rick’s thread I also found that someone else posted these examples.
So looks like there are plenty of crumbs to follow nowadays. Adding multi node features will be the next level up. Hopefully someday I will read your how-to guide.
Thanks,
Pescatore

Thank you for sharing the many considerations needed, including dealing with the timing challenges that are essential to a battery-driven mesh.

Of particular interest is learning the strengths and weaknesses of mesh versus traditional point-to-point, depending on the application.

That includes setup/config of a system, and how to securely connect, add, remove and update devices, as this often makes or breaks a new system.


@thrmttnw - Yeah, I’ll keep posting here with my observations, frustrations and victories so we can all learn together. So far, it seems most challenges can be overcome including joining new devices as well as rescuing orphans. My biggest vague concern at this point is the inability to perform FOTA (Firmware over the air). You always think what you have is fairly robust, final but in honesty it rarely ever is. That’s partly why I’m being so stringent and striving for near perfection with this LoRa timing thing. My best attempt at reducing the impact of lacking FOTA capability is using UF2 file format for the bootloader for LoRa nodes. Basically, allows the LoRa node to be recognized as a standard USB flash drive when plugged into a PC. A user can “drag/drop” a .UF2 file onto the drive to load new firmware. It’s about as easy as moving a file to a USB flash drive. It should make it really simple for a non-technical end user to manually load new firmware. It’s certainly not FOTA but if I’m in a pinch or someone really wants a new feature, it’s a way to make that happen without having to ship devices around. That all said, it’ll be a telling 4-6 months for me as we start deploying the LoRa mesh systems in the wild. I’m sure there will be some bumps in the road ahead yet…

As for the time synchronization work… although it doesn’t seem to be causing a problem, I still wanted to eliminate the duplicate message that occasionally occurred. It happens to collide/nearly collide with the neighboring LoRa node also trying to transmit at that same time, so it would be nice to eliminate. I.e. this:

The last two days, I spent a little time digging into the messages themselves that I thought were duplicates. I have a message counter from the LoRa node that goes 0-255 and rolls over. It counts up each time the node takes readings and assembles a new message. I was expecting the two messages to contain the same message count, as they would if the Particle Gateway heard the same message twice or the same message was sent twice. But in fact, the LoRa node sent messages with two different message counts. The only real way this could happen, given my finite state machine, is that the device must have completed a normal wake/send data/sleep cycle but then, when it went to sleep, immediately woke up again. It’s either waking up with a sleep duration of just a few ms or from the AB1805 interrupt.
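The check I was effectively doing by eye can be sketched like this. It is a minimal illustration assuming a single byte counter per node; `counterGap` and `isSameMessage` are hypothetical names, not from my gateway code.

```cpp
#include <cstdint>

// Number of messages between the last counter seen and the current one.
// Modulo-256 arithmetic handles the 255 -> 0 rollover.
constexpr uint8_t counterGap(uint8_t lastCount, uint8_t currentCount) {
    return static_cast<uint8_t>(currentCount - lastCount);
}

// Gap of 0: the same message was heard (or sent) twice.
// Gap of 1: a genuinely new message, which is what my "duplicates" turned out to be.
constexpr bool isSameMessage(uint8_t lastCount, uint8_t currentCount) {
    return counterGap(lastCount, currentCount) == 0;
}
```

A gap larger than 1 would mean the gateway missed one or more messages in between.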

  • Maybe we calculate the sleep duration wrong causing it not to actually sleep? Let’s constrain the sleep duration to a minimum of 5 seconds - If it still happens it must be due to the interrupt. - It still happened. So something with the interrupt wakes it up immediately.

    LowPower.deepSleep(constrain(slpTimems,5000,7400000)); 
    
  • Maybe the AB1805 set interrupt is occasionally erroring out/not being set properly? Let’s keep calling interruptAtTime() until it returns true. Nope… still occurred.

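    int i = 0; // Retry counter (declared here so the snippet stands on its own)
    // Keep trying to set the alarm until it succeeds; the i <= 10 check caps the retries.
    while (!ab1805.interruptAtTime(nxtRptStrt_time, nxtRptStrt_hund) && i <= 10) {
      delay(1);
      i++;
    }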
    
  • Maybe something with the interrupt bit is still on/transitioning when we set the interrupt so when we call LowPower.deepSleep() it immediately wakes up again? Not really sure… I am now clearing the interrupt right when we wake up and not resetting it until we are ready to go back to sleep 500-1000 ms later. So hopefully by the time we set a brand new interrupt and go back to sleep everything is happy.

    case st_AwokeFromSlp:
    {
      LED.pwm(20);
      tm.clearRepeatingInterrupt(); // Let's make sure the interrupt we woke up from is cleared right when we wake up. Keeps things clean and hopefully prevents double messages. 

So far so good on using clearRepeatingInterrupt() right when we wake up to prevent the duplicate. It normally happened 1-2 times on one of the 13 nodes over a 12 hour period. So far I’ve gone 20 hours or so without any duplicates. Guessing something funky happens when clearing an existing interrupt and setting a new one all at the same time and then immediately going to sleep.

Darn solar panels are not coming until Wednesday but for now the 4 that have them are spread around my backyard and another 2 are in the freezer.


Alright, it’s been 2 days with zero occurrences of that mysterious duplicate LoRa message I had previously. Pretty sure the clearRepeatingInterrupt() right when we wake up took care of that. I’m at 2 full days with consistent message timing just like this. Everything is staying in sync with no mysterious double messages.

EXCEPT: I had another occurrence of a single LoRa node falling asleep and seemingly never waking back up. This is only the second time it has happened in over a month across 13 nodes, but it happened again yesterday. I chalked the first occurrence up to something random or a dead battery, but the second occurrence today got me thinking… I am a bit perplexed about how it could sleep forever. Here is the logic I had going into/out of sleep.

      // Constrain the time to 5 seconds to 21 minutes so we shall always wake up. 
      uint32_t slpTimems_cons = constrain(slpTimems,5000,1260000);

      tm.stopWDT();
      LowPower.deepSleep(slpTimems_cons); // We should always wake up via interrupt but in case we don't still set a max time.  
      tm.resumeWDT();

Even if something funky happened with the AB1805 interrupt (i.e. it was set too far into the future or set in the past due to some calc error/rollover), the device should still have woken up when the sleep duration timer elapsed. If it got stuck in code execution, then the watchdog should have reset it. But nope…

As soon as I hit reset on it, it came right back to life, but it was a bit concerning that it took intervention for it to wake up. Since it was seemingly in an infinite sleep, orphan rescue mode or anything like that wouldn’t have helped anyhow, since it was never awake.

I made two changes today to help combat this unusual scenario. Appreciate any ideas, suggestions, thoughts behind my approach.

Possible point fix: Use a repeating alarm interrupt on the AB1805 that repeats every hour instead of every month. Not sure if this will fix the scenario or not; I’m just grasping at straws thinking of what could be done to make it more robust. This assumes an interrupt is set.
I.e. use REG_TIMER_CTRL_RPT_MIN instead of REG_TIMER_CTRL_RPT_DATE when setting the wake-up alarm. This way, if for some reason the alarm is set wrong, some sort of alarm should still go off and wake it up once an hour.

REG_TIMER_CTRL_RPT_MIN   = 0x14;      //!< Countdown timer control, repeat hundredths, seconds, minutes match (once per hour) (5)
REG_TIMER_CTRL_RPT_DATE  = 0x08;      //!< Countdown timer control, repeat hundredths, seconds, minutes, hours, date match (once per month) (2)

Within the AB1805 code, select which type of repeating interrupt to set. Change this to be REG_TIMER_CTRL_RPT_MIN:

bool AB1805::interruptAtTm(struct tm *timeptr, uint8_t hundredths) {
    return repeatingInterrupt(timeptr, REG_TIMER_CTRL_RPT_MIN, hundredths);
}

Universal catch-all fix: Keep the AB1805 external hardware watchdog active all the time, even when sleeping. If for any reason at all the watchdog isn’t getting pet, it resets the board. However, here’s the catch… the AB1805 has a max watchdog period of 124 seconds. So the only way to pet the watchdog is to wake the MCU at least every 2 minutes, for just long enough to pet the watchdog, and then go back to sleep. This was easier than I thought to accomplish programmatically:

    case st_Slp:
    {
      
      LED.off();

      IRQ_Reason = IRQ_Invalid;    // Reset the interrupt reason (this is set within the interrupt routine)
      LowPower.deepSleep(60000); // Sleep a max duration of 1 minute to pet the watchdog

      // We just woke up, if it was from AB1805 interrupt we are done sleeping,
      if(IRQ_Reason == IRQ_AB1805){
        newState = st_AwokeFromSlp;
      }
      else{
        //We woke up but not due to the AB1805 interrupt. Must be time to pet the watchdog. 
        tm.setWDT();

        // Stay in this state to immediately go back to sleep for another 60 seconds. 
      }

    break;
    }

However, the disadvantage is now it’s waking up/sleeping every minute (or at least every 120 seconds).

So question for the group… is it reasonable to keep a watchdog active like this even when sleeping and wake up/sleep on some cadence just to pet the watchdog? Besides consuming some finite amount of battery power each time we need to pet the watchdog, what’s the disadvantage of this?

@jgskarda ,

Given the very short time the device is awake, I don’t think there is a major downside to this.

BTW, got my boards back from MacroFab today so I can continue development. Will share here as I make progress.

Chip


Hey Jeff, I'm guessing each time it happened on a different module. Just thinking about the possibility of hardware issue/difference/repeatability.


I don't know for sure, but I'm almost certain it happened to 2 different LoRa nodes. One was operating as a LoRa end node and the other was acting as a LoRa mesh node at the time. With only 2 occurrences in the last month+, it's hard to pinpoint anything. I typically try not to overreact until something happens more than once. It's this uncertainty or lack of repeatability that leads me to the catch-all band-aid of just keeping the hardware watchdog on all the time. So far I'm 2 days in with the hardware watchdog active when sleeping, with no issues and near perfect report timing. That said, I could go weeks before the scenario happens again to test the countermeasure.

I agree with Chip. Since you have the test equipment, modify the code to not wake up (the Lora Radio, etc) and test just the sleeping and petting for a few hours. That way, you capture the power required for the 2 minute "pet" cycle and it would be included in the Average Sleeping Current. I'm guessing it's insignificant to your power budget, especially considering the increase in reliability.
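As a back-of-the-envelope companion to that measurement, the effective sleep current can be estimated as a time-weighted average. This is only a sketch: `averageCurrentUa` is a hypothetical helper, and the numbers in the note are placeholders, not measurements from this thread.

```cpp
// Time-weighted average current (in microamps) for a device that is awake
// for awakeMs out of every periodMs and asleep for the rest of the period.
constexpr double averageCurrentUa(double sleepUa, double awakeUa, double awakeMs, double periodMs) {
    return (awakeUa * awakeMs + sleepUa * (periodMs - awakeMs)) / periodMs;
}
```

For example, with placeholder values of 10 uA asleep and 5000 uA awake for 10 ms once every 60 seconds, the average works out to a little under 11 uA, so the pet cycle adds under 1 uA on top of the sleep current.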


That's a good idea! I can try and capture that average of just sleeping vs sleeping and only petting the watchdog. I like it! With the solar panel on every single node it really opens up the power budget so this should be a non-issue. Hoping to provide a more robust solution. Thanks for the idea! I'll report back with the average sleep currents assuming my little "current ranger" can capture average current adequately.


If you're not cranking up the Radio (for the test), then the Ave Current that's being reported on the Top Right of your graphs will be your "effective" sleeping current...... correct ?

Yeah, I think so... or that's what I would expect it to be. What I don't fully understand is the sampling period of the Current Ranger and where the averaging happens. For example, does the Current Ranger write out a sample to USB for the Python script to capture every 1 ms, 10 ms, 100 ms, 500 ms, etc.? If my awake time is only 10 ms, are we sampling quickly enough? In my plot it always captured the spike, so I would think so, but I wasn't sure. It's a $130 open source/maker/hacker type of sleep current measurement hardware, so I wasn't sure how accurate it would be. As an alternative, if I had an oscilloscope, I suspect I could hook that up to the output pins on the Current Ranger to get a much more precise on duration and mA value (see “Noise, DMM & scope measurements | Current Ranger | LowPowerLab”). I'll have to add an oscilloscope to my Christmas wish list for Santa. :slight_smile: Either that or some more commercial hardware to measure sleeping currents and overall power consumption.


It’s been a little bit since I last posted… thought I’d provide a quick update. The most recent learning is that special attention is needed to get a sleepy mesh node to properly service/route messages when the MCU is sleeping. As discussed earlier in this thread, I was putting the MCU to sleep but keeping the radio listening when servicing the mesh network, which further conserves the battery. I use an interrupt to wake the MCU up on any received message, and that worked fine. However, my mistake was assuming that since the MCU woke up when a neighboring node sent a message, it was actually routing the message before falling back asleep. In fact, the MCU was waking up but the mesh message was never actually routed/serviced properly.

All of the mesh routing happens “under the hood” of the RadioHead library, so unless the message is for that mesh node, all calls return “false”. That made it hard to determine if/when the node is busy servicing/routing a mesh message.

I tried various things and different approaches, what I ended up doing is capturing the value of millis() anytime a LoRa Message was sent or received. I can then compare that millis() to determine elapsed time of no LoRa traffic and only put the MCU back to sleep after a pre-defined elapsed “idle” time.

Here is bits/pieces of the code to make this happen.

First… capture the millis() value of any LoRa activity, whether sending or receiving. These are edits made within the RadioHead library:

bool RHDatagram::sendto(uint8_t* buf, uint8_t len, uint8_t address)
{
    activityMillis = millis(); // Capture millis of any message we send to determine when we are busy. 
    setHeaderTo(address);
    return _driver.send(buf, len);
}

bool RHDatagram::recvfrom(uint8_t* buf, uint8_t* len, uint8_t* from, uint8_t* to, uint8_t* id, uint8_t* flags)
{
    if (_driver.recv(buf, len))
    {
        activityMillis = millis(); // Capture millis of any message we receive to determine when we are busy
        if (from)  *from =  headerFrom();
        if (to)    *to =    headerTo();
        if (id)    *id =    headerId();
        if (flags) *flags = headerFlags();
        return true;
    }
    return false;
}

Then create a method to determine when things are busy or not. In this case, the activityTimeout_ms is manually set to 2X the reliable datagram timeout (i.e. if I don’t receive an ACK back after sending a message, attempt to send it again). I arbitrarily made it 2X thinking that should be good and it should send a re-try message thereby resetting the activity Millis value.

/*******************************************************************************
 * Method Name: isBusy()
 *******************************************************************************/
bool lora::isBusy(){
 uint32_t idleDuration = millis() - manager.activityMillis;
 if(idleDuration < cfg_ActivityTimeout_ms){
  return true; 
 }
 return false;
}
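One detail worth spelling out: because both values are unsigned 32-bit, the subtraction in isBusy() keeps working when millis() rolls over (roughly every 49.7 days). The standalone sketch below shows the same arithmetic with the clock passed in as a parameter so it can be checked off-device; `elapsedMs` and `isBusyAt` are hypothetical names, not part of my lora class.

```cpp
#include <cstdint>

// Elapsed time between two millis() readings. Unsigned subtraction wraps
// modulo 2^32, so the result is still right if nowMs has rolled over past zero.
constexpr uint32_t elapsedMs(uint32_t nowMs, uint32_t thenMs) {
    return nowMs - thenMs;
}

// Same test as isBusy(), with the clock injected: busy while the time since
// the last LoRa activity is below the timeout.
constexpr bool isBusyAt(uint32_t nowMs, uint32_t lastActivityMs, uint32_t timeoutMs) {
    return elapsedMs(nowMs, lastActivityMs) < timeoutMs;
}
```

For instance, activity captured 16 ms before the rollover and checked 5 ms after it gives an elapsed time of 21 ms, not a huge bogus value.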

Then within the main state machine. Once you wake up wait until LoRa is not busy before falling asleep again.

      LowPower.deepSleep(tm.WDT_MaxSleepDuration); // We should always wake up via interrupt but in case we don't still set a max time.  

      // If we woke up from the AB1805, our sleep cycle must be over. Go to AwokeFromSlp state
      // Otherwise we must have woken up from LoRa traffic. If that's the case, then stay awake and service it until LoRa is no longer busy and then fall back asleep. 
      if(IRQ_Reason == IRQ_AB1805){
        newState = st_AwokeFromSlp;
      }
      else{
        LED.pwm(20); 

        // Call LoRa.Rx() once to process the received message and update the initial millis before we can compare to it. 
        LoRa.Rx();
        while(LoRa.isBusy()){
          LoRa.Rx();
        }
        // Stay in this state to go back to sleep
      }

So far so good, but we will see how robust this is over time.

UPDATE: Well, it’s probably too early to know for sure, but it seems this approach worked well. I have 3 nodes + 1 gateway right now and the farthest node is doing 2 hops to get to the gateway. The intermediary LoRa sleepy mesh nodes haven’t missed a beat since this change. I have also been bouncing back and forth between a 5 minute and a 1 minute reporting window, with it mostly stable. So it looks like that took care of it.
