Not to derail the conversation, but @jgskarda how about you make a fork of the library and Rick can update the main library’s readme to mention/link to your fork? Your library is fantastic but we currently don’t have the bandwidth for the testing we would need to do to accept the pull request.
Agreed, if you assume the message for time correction (sent by the Boron) requires the same # of Hops/Retries as the message that the Boron Received from a remote node (that initiated the correction).
My testing was with a Cyclic Sleeping Network where the goal was to have All devices awake at the same time. At the beginning of the Wake Event, a timing message was sent throughout the network. That happened on each Wake Event (5 minutes). Then nodes would begin sending their payloads across the network to their destination (particle gateway). The Sleep Coordinator wasn’t the Particle Device in my Project. I selected a “Router” node that was centrally located in the Footprint to minimize the # of Hops for the Timing Message sent to all remote nodes each cycle.
Am I correct in my assumption that you are looking for an Asynchronous Sleeping Network ?
IE: With Lora, you want the remote nodes waking at different times to prevent one from “talking over” a node that’s located farther away ?
I’m afraid that I’ve forgotten what little I did learn years ago about Lora. My thoughts may not be valid for your Project…but it costs nothing to share them
Not necessarily, I’m more after a Cyclic sleeping network as you described. The LoRa Particle Gateway and all LoRa Mesh nodes would wake up at the same time. Then each LoRa end node would wake up 2-4 seconds apart from each other, send its data, wait for a “setTime” response, and then all devices would fall back asleep. This cycle is repeated every 5 minutes. This seems similar except the “timing message” is sent back to each particular node just before it falls back asleep. This informs that particular device of the time, and when it should wake up again.
I already do the cyclic sleeping/waking and coordinating the show of when each LoRa end node should wake up and send data. But so far this was not in a “mesh” it was just a hub/spoke model where all devices were in range to the LoRa Particle Gateway. That said, I did have the LoRa Particle Gateway also awake/sleeping and they seemed to stay in sync.
Yeah that’ll work for now… maybe a topic to discuss during the next MVP to see how to best handle forking different/existing libraries. I’ll also be making a few updates to AB1805_RK library (Calibrate the RTC, Allow deepsleep() to be much longer than 255 seconds). I’d like to continue to share the work back to the broader community but not sure the easiest way to make that happen. I may have a few additional “tweaks” but for now I’ll work under the understanding of GitHub will show it’s been forked and maybe we could add a note/link to it. Thanks for this guidance!
Great idea, let’s talk about it then!
Just to clarify… there should be no “assumptions” needed. The LoRa Particle Gateway would send the message to the LoRa node telling it what the time is. The last byte of the message would be set to 0 by the application firmware. Then before it transmits it, this edit to the RHReliableDatagram.cpp code would increment it by 1. As the LoRa message is routed along, each mesh node would also increment it by one each time it gets transmitted, including consecutive increments on a re-try. So when the LoRa end node receives the message to set its time, it would receive the time + the number of times that exact message was transmitted to get to it. The LoRa end node would then do the math on how “delayed” from actual time the message was when it arrived, based on the total number of transmissions it took to get there. At this point, it’s just math/logic to determine the near exact time instead of assumptions.
That said… looking at the RadioHead library more, they do add a random delay on retries. So instead of 1 byte, I may need to use 2 bytes: the first to count the number of re-transmissions and the second to hold the running total of the random delay that is added on each re-try attempt. Maybe just precise down to 10 or even 100 ms so we can fit it in 1 byte. We will see what it takes.
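To make the two-byte idea concrete, here is a minimal sketch. The `TimingTrailer` struct and both helper functions are hypothetical names for illustration; none of this is part of the RadioHead library, and the 10 ms resolution is just one of the two options mentioned above.

```cpp
#include <cstdint>

// Hypothetical two-byte trailer carried at the end of the application payload.
struct TimingTrailer {
    uint8_t txCount;          // number of times this message has been transmitted
    uint8_t delayHundredths;  // running total of retry delay, in 10 ms units
};

constexpr uint32_t kByteMax = 255;

// Call just before every transmission (first send, retry, or forward by a mesh node).
inline void noteTransmission(TimingTrailer &t) {
    if (t.txCount < kByteMax) {
        ++t.txCount;  // saturate instead of wrapping back to 0
    }
}

// Add one retry delay (in ms) to the running total, rounded to the nearest 10 ms.
inline void addDelayMs(TimingTrailer &t, uint32_t delayMs) {
    const uint32_t total = t.delayHundredths + (delayMs + 5) / 10;
    t.delayHundredths = static_cast<uint8_t>(total > kByteMax ? kByteMax : total);
}
```

At 10 ms resolution a single byte tops out at 2.55 s of accumulated delay, so the sketch saturates rather than wraps; a receiver seeing 255 should probably treat the time as untrustworthy.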
So you are only concerned with the Hops/Retries on the correction message leaving the Boron headed to the remote node. I incorrectly assumed the Boron was including the Hops/Retries based on the real payload it received- to be used in the calc for the time correction. You are doing it on the receiving end of the time correction message, which eliminates that concern. That sounds great, thanks for the explanation.
I think it’s great that you are able to share all of this with the community.
I’ll probably include the information in both directions. To and From the LoRa node. Right now, I just send it a setTime response whether it needs it or not as part of the end to end “acknowledgement” application message. I could consider looking at when a message was received at the Particle LoRa gateway and how many transmissions it took to get to me vs when I expected to receive that message from that device to determine IF the LoRa end node needs to have its time updated and then selectively send a “setTime” message response. This could cut down on the traffic a bit and reduce the number of bytes a Mesh node would have to re-transmit each time, further reducing battery drain. That might be worth it, especially if I get the RTC calibrated accurately to where it would only really “need” a set time message a few times a day vs every 5 minutes. I’d still likely send an acknowledgement message either way but this could be 1-2 bytes vs 20+ bytes.
Alright so I figured out how to track the total number of hops taken, the total count of transmissions and the total transmission timeout delay, and add that to the LoRa message while it routes to its destination.
I used some initial napkin math to calculate a TX Time Compensation. If I used the calculated Tx time compensation to determine the delta in time, I was off by ~50 ms but sure better than being wrong by 480 ms when I didn’t try and compensate.
Next step is to sharpen the pencil on the math that uses these values to figure out the compensation needed. I.e. how long did the message take to get to me with the inputs:
- Number of payload bytes
- Number of hops
- Number of re-transmissions
- Total Timeout time for each transmission
I may just “tweak” the numbers I assumed a little bit then setup a 3-4 hop network and somehow induce noise/put them far away/reduce TX power settings in order to force some more retransmissions and tweak it. Won’t be perfect but should get pretty darn close to perfect.
I’ll keep sharing my observations/results here as I go. Any guidance/feedback/suggestions is always appreciated!
After digging into this precision time setting of the RTC over the weekend, I think I finally have it nearly complete. A few activities:
Calibrate the AB1805 on LoRa nodes and Particle + LoRa Gateway using a GPS:
Using the PPS signal from the Ultimate GPS Breakout I was able to calibrate the AB1805 RTC on both the LoRa Particle Gateway as well as the LoRa End nodes. It’s calibrated down to ~4 PPM (~0.35 seconds per day at room temperature). The shocking part is the 4 PCBs with AB1805s I had were all originally off by ~350-400ish PPM to start. There is nothing specific in the datasheet but as I stated earlier there are a few bread crumbs in other forums that indicated it clocks fast on purpose/by design of the AB1805. The calibration registers are also biased to allow for slowing the clock down more than speeding it up. What I mean is that you can calibrate it from -610 PPM to 244 PPM. (I.e. slow it down by up to 610 PPM or speed up by up to 244 PPM). So the midpoint of the calibration range is -183 PPM, i.e. it is skewed toward slowing the clock down, which is the direction I needed to calibrate it in. So maybe this is normal but certainly wasn’t expected.
I just plug the GPS module into an existing JST connector I had on the board that I used for a sensor/relay. I load a separate program that calibrates it in 30 seconds or so. It sets the AB1805 interrupt output to be an 8192 Hz output on the interrupt pin and then uses 30 PPS signals from the GPS to count the 8192 pulses/second. It then calculates the PPM error, updates the AB1805 calibration registers, and writes the error value to EEPROM. After it writes it, it measures again for another 30 seconds using the new calibrated value. The second time it is nearly spot on (0-4 PPM). I’ll eventually post the repos to GitHub if anyone is interested. In case anyone else does this, in my setup, I had to use the 8192 Hz clock vs the 32768 Hz clock. For some reason, the number of pulses fluctuated more than it should with the 32768 Hz clock. Either too fast to count the pulses, some sort of noise, or maybe the AB1805 was doing its thing adjusting the pulses based on calibration which messed up the interrupts. Not sure. Just an FYI…
After calibration… the clock between the Lora Particle Gateway and the Lora end node drifted apart only ~50 ms over 9 hours. Over a 20 hour period the LoRa Particle Gateway AB1805 drifted 1 second relative to Particle Cloud time (calibrated to 1 second accuracy). Overall, I’m happy with it!
Compensate errors in setting the time due to various message routing delays
This one was a bit more complicated than I originally thought. In short… here is the compensation I came up with:
// Calculate the total time this message took to arrive to us. This is used to subtract from the time received in order to account for transport delay and still set time correctly even with multiple retransmissions and/or hops.
//TxTimeComp_ms = (Total Number of TX) * (Time to TX Full Message) + (Total Number of Acks) * (Time to TX Ack Message) + Timeout
TxTimeComp_ms = (1+Rx_Hops+Rx_Retransmit_cnt) * (TX_HeaderAndACK_ms+TX_PayloadTimePerByteX100_ms*Rx_TotalPayloadBytes/100) + (1+Rx_Hops)*TX_HeaderAndACK_ms + Rx_TimeOut_hundreths * 10;
Where at the Default modem settings:
TX_HeaderAndACK_ms = 61; //It takes 61 ms to transmit the header and/or an Ack message
TX_PayloadTimePerByteX100_ms = 142; //It takes another 1.42 ms to transmit every byte of the user Payload for any application message (i.e. not ACK messages).
This compensation accounts for the number of hops, the number of re-tries, the size of the payload, and a random timeout delay that gets added with each retry. It was working very well with very little if any error despite multiple retransmissions and/or hops. I tested 1-2 hops and 2-3 transmissions and was always within +/- 10-20 ms. That is except when it had to discover a new route.
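Wrapped up as a function, the formula above could look like the sketch below. The two constants are the values quoted for the default modem settings; the function name `txTimeCompMs` and its parameter names are just illustrative.

```cpp
#include <cstdint>

// Measured at the default modem settings (values from the post).
constexpr uint32_t TX_HeaderAndACK_ms = 61;             // header and/or ACK airtime
constexpr uint32_t TX_PayloadTimePerByteX100_ms = 142;  // 1.42 ms per payload byte, x100

// Estimated transport delay in ms for a message that took `hops` hops and
// `retransmits` re-transmissions, with `timeoutHundredths` of accumulated
// retry delay (10 ms units).
inline uint32_t txTimeCompMs(uint32_t hops, uint32_t retransmits,
                             uint32_t payloadBytes, uint32_t timeoutHundredths) {
    // Airtime of one full application message (integer math, truncates).
    const uint32_t perMessageMs =
        TX_HeaderAndACK_ms + TX_PayloadTimePerByteX100_ms * payloadBytes / 100;
    // Every hop and every retry sends the full message again.
    const uint32_t messagesMs = (1 + hops + retransmits) * perMessageMs;
    // One ACK per successful link.
    const uint32_t acksMs = (1 + hops) * TX_HeaderAndACK_ms;
    return messagesMs + acksMs + timeoutHundredths * 10;
}
```

For example, a 20-byte payload sent direct with no retries comes out to 150 ms, and the same payload over 1 hop with 1 re-transmission comes out to 389 ms before any timeout delay is added.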
When a LoRa device says “I want to send this message to LoRa node address 11 over Mesh” and there isn’t a route to 11 in its routing table, it first sends a “route discovery” message and waits for a response. Once it gets a response, it adds the route to the routing table and then sends the application message. This all happens in the RadioHead library. However, the time it takes to find a route also needs to be tracked. For this, I keep track of the millis() when route discovery started and when it ends and then “hijack” the same byte of the payload I used earlier for retransmission timeout delay. In my initial test it was able to go from 0 hops to 1 hop to 2 hops and it kept time as expected with each route discovery message. So far so good.
Once I have a few days of data, I’ll try and post a graph again of accuracy of time. I hope to break/re-construct the mesh a few times in this time to see what happens. Overall I’m hopeful. I’ll update the Github repos for the RadioHead and AB1805 libraries in the coming days accounting for these necessary changes.
UPDATE - Also have to consider CADTimeout:
The next “gotcha” is the CADTimeout. Either need to turn it off by setting it to 0 or account for the delay in a similar fashion as the Route Discovery delay.
/// Use the radio's Channel Activity Detect (CAD) function to detect channel activity.
/// Sets the RF95 radio into CAD mode and waits until CAD detection is complete.
/// To be used in a listen-before-talk mechanism (Collision Avoidance)
/// with a reasonable time backoff algorithm.
By default CAD is “off” and will not add any delay, but I was testing with setCADTimeout(100), i.e. 100 ms. I would have thought the max delay would be 100 ms, but currently in the library IF it observes CAD… it waits 100 ms to check again and then still injects a random delay between 100 - 1000 ms. I think the thought is it needs to wait until the prior message has stopped sending before it sends the queued up message. I’m debating whether to just turn CAD off altogether OR set it to 70 ms, make the random delay 50-200 ms and then tack on the amount of delay just like I did for Route discovery.
Generally speaking… I think it’s all of these “random” extended delays that caused some of my issues. By eliminating/reducing/accounting for them, it can only make things better. At least that’s the hope!
Any questions or any words of advice/things I should also consider when setting time?
Just a quick update… after adding compensation for the number of hops and the number of re-transmissions, and for now disabling the “wait to talk” CAD timeout random delay, there seems to be very little error/randomness remaining. The main remaining factor in holding sub-second time synchronization to, say, ~100 ms is “drift”. In this case, the node drifted from the hub by ~300 ms over 24 hours. This seems very acceptable to me. The major “reset” from -500 → 0 that you see was the device re-setting its own RTC based on the absolute value of the error being greater than 500. I’m not sure what caused the little 80 ms hiccup in the middle. I might have loaded a new program during that time, in which case it synced up with the cloud. Not 100% sure. This at least accounts for the main points of error in setting time on remote LoRa Nodes.
Now to narrow up the correction deadband from +/- 500 ms to maybe +/- 100 ms. Then add on a few extra LoRa nodes, start tracking this data in the back end instead of serial monitor and see what happens.
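The deadband check itself is tiny. A sketch, with a hypothetical function name and assuming the error is tracked in signed milliseconds:

```cpp
#include <cstdint>
#include <cstdlib>

// True when the measured clock error is outside the +/- deadband and the
// node should re-set its RTC. Errors exactly on the edge are left alone.
inline bool needsTimeCorrection(int32_t errorMs, int32_t deadbandMs) {
    return std::abs(errorMs) > deadbandMs;
}
```

With the current 500 ms deadband a 300 ms error is left alone; tightening the deadband to 100 ms would trigger a correction for that same error.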
Today’s fun… I finally got around to updating the LoRa nodes to set an alarm and use an interrupt from the AB1805 to wake up. The alarm includes the hundredths register on the AB1805, and since the AB1805s are now synchronized sub-second between each other, they all wake up/sleep at near identical times and each one has its own offset for when its turn is to transmit. The blink you see is that particular device’s “window” when it is transmitting. For the bench test you see here, I space each node apart by 1 second. Hard to describe in words so here’s a brief video.
In this particular case, the 3 are configured to each be a LoRa Mesh Node so they all wake up at the beginning of the reporting period (9:35:00.000 PM), each one takes its turn to send out its data, and then they all fall back asleep.
From experience with a project on a mesh setup some years ago, I can recommend to simultaneously place one of those units in the fridge/freezer, and one in the sun/oven, while keeping one on the desk, and see if timing holds up.
In nature, a unit could end up in the shadow in freezing wind, and another in the hot baking sun shielded from winds.
@thrmttnw Yeah… that’s a good idea! Once I finish up some code refactoring work, I’ll put a few outside spread out, another in the freezer and another in a food dehydrator (~120F). I also need to add some code to the Particle LoRa Gateway so it time stamps when it received the packet down to the hundredths of a second and publishes that to the cloud as well. That way I can easily track each device’s accuracy in when it is transmitting the data so we can get a feel for the drift/accuracy between nodes.
To minimize the RTC accuracy issue:
- I’m initially calibrating each PCB using a GPS PPS signal, saving that to EEPROM and then loading it to the AB1805 RTC. This gives everyone the same baseline to within a few PPM.
- With each data packet it sends, the Particle LoRa Gateway responds with the Time including the Hundredths second register from the real time clock. Typically this is every 5 minutes. Occasionally we make it every 2 hours. So each node should get a new time reference at least every 2 hours, typical would be every 5 minutes.
If I’m doing the math right, if one node was 60 PPM difference vs another due to temperature swings that would be 5 seconds error over 24 hours or ~ 400 ms difference in a 2 hour period. I’d think this would be near worst case.
If we really really had to, I do have a temperature sensor on the PCB itself. I suppose could do some temperature compensation to the RTC but trying to maintain the balance between keeping it simple yet functional/robust.
What other challenges did you run into when exploring Mesh? Was it mostly timing or other things?
I spent a little time refining the sleep with the LoRa radio on. This conserves another 10-11 mA for a LoRa mesh node during the period of time when it’s servicing the mesh. Basically, at the start of a reporting period, the MCU wakes up, turns the LoRa radio on and then the MCU goes right back to sleep. Anytime the radio receives a message, it wakes up the MCU via interrupt. The MCU processes it and then goes back to sleep. This repeats until either another radio interrupt, an interrupt from the AB1805 indicating it’s time for the node to report its own data, or an interrupt from the AB1805 indicating it’s the end of the reporting period.
Earlier efforts were to reduce the total duration of the reporting window by allowing nodes to report 1-2 seconds apart. This effort reduces the power consumed during that reporting window by ~50%.
In my current configuration, this cycle repeats every 5 minutes.
@jgskarda , thanks for sharing the link to the Current Ranger. I didn’t know that “upgrade” existed.
What was the Voltage during your test in the chart you posted above?
Would a 10 second reporting window be a conservative guess for your Mesh Repeaters (@11 mA) ?
The graph appears to be a 7 second reporting window.
I looked at the AB1805 data sheet and it includes dozens of temperature compensation curves.
But thinking about this, your timing correction already takes temperature drift into account, along with all the other factors… since it’s an end-to-end calculation.
I like where you are headed.
Thanks for the updates.
Voltage was 3.76 Volts.
This all depends on how many LoRa nodes are in the system. I currently calculate this as:
(LoRa Time Per Device) * (Number of Devices + 1) + Start of Cycle Offset.
The start of cycle offset makes sure we have the mesh nodes up/awake before any node begins reporting. The +1 ensures we keep the mesh up to the end of the window when the last node finishes reporting before it falls back asleep.
In my example where n = number of devices.
1.5s * (n + 1) + 1s = Total reporting window
1.5s * (3 + 1) + 1s = 7 seconds
Several use cases would have 2-4 nodes and could very easily last quite a long time without a solar panel even. Other use cases may have 20+ nodes. If we had 20 nodes, a reporting window would be 32.5 seconds in duration.
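The same calculation as a small sketch, working in milliseconds to avoid floating point. The function and parameter names are made up for illustration:

```cpp
#include <cstdint>

// Total reporting window in ms:
// (LoRa time per device) * (number of devices + 1) + start-of-cycle offset.
constexpr uint32_t reportingWindowMs(uint32_t timePerDeviceMs,
                                     uint32_t numDevices,
                                     uint32_t startOffsetMs) {
    return timePerDeviceMs * (numDevices + 1) + startOffsetMs;
}

// Compile-time check against the two cases above.
static_assert(reportingWindowMs(1500, 3, 1000) == 7000, "3 nodes -> 7 s");
static_assert(reportingWindowMs(1500, 20, 1000) == 32500, "20 nodes -> 32.5 s");
```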
It in fact does… however none of it is used to compensate for the XT oscillator. Anything regarding compensation is compensating the RC oscillator, for when you really want minuscule power. Most of them seem to be more of a performance curve than a true temperature compensation curve. It does have the PPM adjustment registers so I can speed up/slow down the clock, so conceptually adding some rough temperature compensation is doable; I’m just hoping that complexity is not needed. I think it would only become necessary if I wanted to deep sleep for 12+ hours at a time or had major temperature differences between nodes in a system.
As for new devices/rescuing an orphaned node, I recently added a “sleep mode” byte as part of the message the gateway sends back to a node with every response to set the time and mode. This tells the device if it should do a normal sleep (MCU sleep + LoRa Radio Sleep - 150 uA), a RxSleep (MCU Sleeps + LoRa Radio on 10mA) or DeepSleep (AB1805 driven sleep for minimal power consumption - ~20 uA).
This way, a user can put the Particle LoRa gateway into this specific mode when needing to rescue an orphaned node, when bringing on new devices or when doing a site survey. When this mode is entered, any node configured to be a mesh node enters RxSleep, allowing it to service the mesh the entire time at a cost of 10 mA. When the orphaned node is rescued and/or after the new device is joined to the mesh network, the system can be put back into normal sleep mode. I currently do something very similar today so this was an easy add. Currently, it sets itself back to “normal” mode after 1 hour of operation.
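A rough sketch of what a “sleep mode” byte could look like on the node side. The enum name, the numeric byte values and the fallback behaviour are all assumptions for illustration; only the three modes and their approximate currents come from the description above.

```cpp
#include <cstdint>

// Hypothetical encoding of the sleep mode byte sent by the gateway.
enum class SleepMode : uint8_t {
    Normal = 0,     // MCU sleep + LoRa radio sleep, ~150 uA
    RxSleep = 1,    // MCU sleeps, LoRa radio on, ~10 mA
    DeepSleep = 2   // AB1805-driven sleep, ~20 uA
};

// Decode the byte from the gateway; unknown values fall back to Normal
// (an assumption, chosen so a corrupted byte cannot strand a node).
inline SleepMode decodeSleepMode(uint8_t raw) {
    switch (raw) {
        case 1:  return SleepMode::RxSleep;
        case 2:  return SleepMode::DeepSleep;
        default: return SleepMode::Normal;
    }
}

// Approximate sleep current in microamps for each mode.
inline uint32_t approxSleepCurrentUa(SleepMode mode) {
    switch (mode) {
        case SleepMode::RxSleep:   return 10000;
        case SleepMode::DeepSleep: return 20;
        default:                   return 150;
    }
}
```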
Drilling down into recovering an orphan:
Does only the Boron Gateway have access to the Mesh Routing Table, or will each Mesh Router keep a list of its personal child nodes within range?
On a particular Router, let’s say we lost a remote node due to a closer node talking over it when transmitting (timing, or any other cause). Will your Router decide that it lost a child, or will the Boron decide this and send recovery instructions to that particular Router (and/or neighbors) to stay awake to get the time correction pushed down to the orphan ?
Agreed, but your power budget looks great for the Solar Panel Enclosure… plenty of Power for finding orphans or expanding the reporting window if needed. That should really take the pressure off of any critical timing issues in the field (should they appear). I really like that Solar Enclosure BTW
How the RadioHead mesh library works is each device in the Mesh starts out with an empty routing table. I.e. it doesn’t know how to reach any one node. When you publish a message, it first looks in the routing table to see if it knows how to route a message to the destination. If it has a route, it just sends it to the next hop in that route. If it doesn’t have a route, it first sends out a “route discovery” message. When it gets a response from the route discovery, it adds that route to its own routing table and then transmits the application message. This is all handled by the library.
However, when it attempts to send a message and the message fails (i.e. never receives an ack from the next hop), then that route is removed from the routing table. On the next attempt… it then performs another route discovery to find a new route. A lot goes into it… here is a pretty good description: RadioHead: RHMesh Class Reference
My key takeaways:
- The routing table can be dynamic and self updating.
- Any LoRa node as well as the Gateway only knows who the next hop is to get to a destination. It doesn’t know the full route. For example if it’s 1 → 2 → 3 → 4 and I am 1 sending a message to 4. The routing table is [Dest], [Next Hop]. In this example a routing table entry in node 1 would be: 4, 2. Meaning to get to the destination of “4”, I must first send this message to “2”. But I don’t know how 2 gets to 4.
- Once it has a route - it will not “drop” the route and find a more optimal route until a message fails. If I walk a node away from the LoRa gateway so it has to take a hop and then walk it back to within 5’ of the LoRa gateway again, it will continue to take the extra hop until a message delivery fails, the route is removed and a new route discovery message is sent.
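To illustrate the takeaways above, here is a simplified model of a next-hop routing table. This is not RadioHead’s implementation or API; the class and method names are hypothetical and it uses a `std::map` purely for brevity.

```cpp
#include <cstdint>
#include <map>

// Simplified model: each node only stores [Dest] -> [Next Hop].
class NextHopTable {
public:
    // Record a route learned from a route discovery response.
    void addRoute(uint8_t dest, uint8_t nextHop) { routes_[dest] = nextHop; }

    // Returns true and fills nextHop if a route is known.
    // False means a route discovery is needed before sending.
    bool lookup(uint8_t dest, uint8_t &nextHop) const {
        const auto it = routes_.find(dest);
        if (it == routes_.end()) {
            return false;
        }
        nextHop = it->second;
        return true;
    }

    // A send was never acknowledged by the next hop: forget the route so
    // the next attempt triggers a fresh route discovery.
    void onDeliveryFailed(uint8_t dest) { routes_.erase(dest); }

private:
    std::map<uint8_t, uint8_t> routes_;
};
```

In the 1 → 2 → 3 → 4 example, node 1 would hold the single entry `addRoute(4, 2)`, and it keeps using that entry until `onDeliveryFailed(4)` removes it.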
I’m sure there is a way to “inspect” the route discovery message so you can view and keep track of the entire route. For now that’s all done behind the scenes in the RadioHead library and nothing is exposed. On a related note, I’ve also thought about allowing each hop in an application message to add its own node number to the application message. So with every message we receive, we know the route the message took to get to us. This would be handy for visualization/information but comes at a cost of additional bytes. Although minimal… Huh… maybe I’ll look at doing that with a similar technique to the one I use to keep track of the number of hops and re-transmissions: just keep on growing/adding onto the end of an Application message with each hop. Seems doable.
That all said, my current thought is not to be so prescriptive in rescuing an orphaned node. Rather, just put the entire system into “orphan rescue” mode. This causes all mesh nodes to service all messages they receive for the entire duration in that mode. Maybe the node moved and it needs to find a new mesh node to talk to anyhow, so might as well put the whole system into that mode.
Simple. I love it.