OneWire Library Bug / Proposed FIX (June 2019) Affects DS18B20 and other OneWire devices


#21

Ahhh…I think I misread both yours and @Fragma statements about xenon. I don’t know why I thought you both had tried xenon with mesh AND no mesh… are you not able to try the experiment you propose?

Sorry, these questions come from my general inexperience with these products and basically all of these topics.


#22

I tried with the Mesh off and it was perfect, no errors :smiley:
By the way, you don’t need USB to flash, just put it back in safe mode when done testing. Or wire up the mode button to toggle mesh on/off.


#23

NICE JOB! We seem to have a consensus. Here is a stretch-the-truth summary so far:

For these devices (including ones with mesh disabled: ARGON, BORON, PHOTON, XENON), the ATOMIC_BLOCK “fix” cures ALL OneWire errors that are not a direct result of capacitances/resistances/timing/setup issues. If one can successfully get one temperature conversion, they will get unlimited conversions with NO bus CRC errors. This applies to single and multiple sensors on a OneWire implementation.

For these mesh devices, that have mesh enabled (ARGON, BORON, XENON)…currently there is no way to achieve this OneWire error-free status.

The issues remaining:

  1. Why doesn’t the Atomic Blocks fix work for mesh enabled devices?
  2. Should timings changes be considered as well

My vote for #2…no keep the current timings unless it they are shown to be very out-of-spec (especially Gen3 devices as remchon has suggested)

I feel like we really need an official Particle response for these…it is very clear to me that 10s (probably into the 100s for those not posting) of people (including the Particle.io developers) have spent literally 1000s of hours in total dealing with this OneWire failure unique to particle.io devices.


#24

@Jonpcar has pinged me via PM and asked me to respond here too - so I’m just lazy and copy that here :blush:

The DS18B20 library isn’t really by me but I have taken it over from the original contributor back in the days, so I’ve never really gone down to the hardware implementation of it - especially since that’s done via the Particle (I’m not an employee) owned OneWire library.

Although I haven’t invested much time proving it, I think the OneWire library would need some tweaks to make the DS18B20 library more reliable (esp. on Gen3) - although there are some issues specificly DS18 related (e.g. resolution setting).

However, I have followed that thread and appreciate the effort that’s gone into finding cause and solutions for some of the issues.
If anybody want’s to file a PR on the GitHub repo of the DS18B20 and/or the OneWire library (including the DS18 sampel therein) I can merge to DS18B20 and get Particle to consider merging to OneWire.


#25

Hey Scruff, thanks for joining, haha.

Is there any possible solution to LOCK OUT ALL INTERRUPTS (for about 60us) for mesh devices in the critical OneWire areas? I think the posters would try it immediately so that we can do the PRs.


#26

This is (sort of) what ATOMIC_BLOCK does - not all but most IIRC.

However, with OneWire being the actual “target” of that @rickkas7 or @jvanier (as the original contributor) may be the right people to ask - unfortunately the official OneWire library doesn’t sport a Repository nor a URL entry in the library.properties file, hence there is no GitHub link in Web IDE for it to directly navigate there to file an issue or PR (a fact I’ve previously mentioned to Particle already :wink: )


#27

yes, this is going to be the problem for mesh devices…thanks for getting the right guys involved.


#28

The Particle OneWire library Github is:

It is most likely a timing issue caused by interrupts.

This modification works fairly well, about 86% of the time on Gen 3, better than current:

ATOMIC_BLOCK() {
    error = dht.read();
}

This works 100% of the time:

ATOMIC_BLOCK() {
    __disable_irq();
    error = dht.read();
    __enable_irq();
}

However, we really don’t want to do the second one, because it will likely impact the performance of the mesh networking radio. It could be an opt-in feature for users who are not using mesh.

(That was just an example for testing. It would be better to disable interrupts farther down, and for shorter periods of time where it really matters. Some of the delays in OneWire are not critical.)


#29

rick…thanks! I don’t know if you had the chance to look at the exact changes (ATOMIC BLOCK additions) I made to the OneWire file, I feel that almost all of them were critical to this issue with possibly the exception of the onewire.reset.

I made the changes rather quickly and so could be wrong there. Also, all this is pretty new to me. I will look at them a bit closer. Unfortunately, I can’t test any of this personally, I only have a Photon.

The issue with your first fix I think is mostly for those “systems” that have multiple sensors on the OneWire. The algorithm to find the addresses seems to be very transaction intensive and currently fails to consistently find all sensors when there are more than a few (4+) tied to the OneWire bus. To fix that, that algorithm could possibly be re-written to be more robust (I haven’t looked at this algorithm at all, so this may be a FALSE statement). I’m sure there are users that could accept a low percentage of CRC errors on the OneWire bus if THAT particular issue is addressed.

The 2nd fix seems nice as in opt-in for those not using mesh. EDITED: actual doesn’t really help anyone. I am assuming the ATOMIC_BLOCK change will be made in all cases. That already fixes the issue for those mesh devices that are not using Mesh…so no need for that 2nd fix unless we can find a way to apply it for those devices with mesh enabled…Rick doesn’t seem comfortable with that.

Also…I am not familiar with the particle.io family, I don’t think we have samples of the entire spectrum…or do we.

What do others think?

I’m planning on looking at the changes Rick proposed. I’m gonna look at the timings one more time, and a closer look at the Atomic Blocks and their criticality. I’ll create a file(s) to post here, unfortunately, I wont be able to get to it until probably tomorrow.


#30

@rickkas7 I thought (at a glance and out of ignorance…I am new to all this) that you were providing two methods to disable interrupts for the “deep down code” inside of OneWire. However, when I went to try it out, it’s obviously not that.

We have made changes to the lowest bit level of the OneWire protocol. Currently the largest “interrupt blockout time” needed (to make OneWire error-free) is about 60us+. If that is an acceptable level for mesh operation then all we need is the blockout method (REALLY disable interrupts)…so that we can put that into the OneWire.cpp file and test it.

If 60us is NOT acceptable, there is perhaps one other possibility that does not eliminate all OneWire errors (in my testing about 0.1% still occur). That method requires a blockout method for interrupts that is about 15us+ long. Is that acceptable?

In either case, we can’t move forward to test it unless we know how to TOTALLY blockout the interrupts for mesh enabled devices.

There is already a workaround for devices that don’t have mesh (or don’t enable it), and it passes 100% of the time for all that have tested it.


#31

The __disable_irq() should block the high priority mesh interrupts. I think 15 uS should be OK. At least it would be worth a try to see.


#32

@rickkas7 I don’t know how to compile that…what header would I need? It shows as an error when I enter that code…I didn’t actually try to compile it though

identifier “__disable_irq” is undefined

sorry…all this is new to me…I am compiling locally on my windows computer


#33

Hmm… I’ll take a look tomorrow. I didn’t run the test with __disable_irq and there is a possibility that the tests were done from a monolithic build so system firmware functions were available. I’ll see if there’s another option.


#34

Ok sounds good, thanks for your help…

FYI: The 15us delay option doesn’t wrap either the OneWire.reset or the OneWire.write_bit (0) with the high interrupt protection. They are the 60us operations but turns out are much less critical than the OneWire.read_bit op and the OneWire.write_bit (1).

EDIT: I have updated the first post of this thread with a status and “clearer problem statement” that I will try to keep updated.


#35

Update: I didn’t hear from @rickkas7 today. He’s either very busy or an “interrupt lockout” for the mesh enabled devices may not be very straightforward. I am sure we will hear an update when he figures it out.

I went ahead and updated the OneWire.cpp file in the first post of this thread. I took out the redundant interrupt disable/enables which were replaced by the Atomic Blocks. I also made a judgement call to change a couple timings. I did see a thread where someone had hooked up a scope to the OneWire bus and it seemed clear that 1us to drive the bus low by a particle.io device was plenty of time…and it meets the spec.

Changes in that file (from the original) have a comment next to them with //JC in it for now.


#36

Well, no word from @rickkas7, I am continuing to watch this, but obviously moving on. I’m done with my project and it works great on the Photon, especially with library FIX.

I attempted to file a PR on the OneWire bus protocol that was listed about 5 or 6 posts ago. Don’t know if I was successful filing the PR on the Particle Library, my own version of it, or an entirely different version of it. Github is not intuitive to me and I don’t have a lot of time to spend on it.

It does surprise me though (having worked in the industry), that issues/bugs like this don’t seem to have a central repository, I probably just can’t find it. How do new/existing customers not keep chasing the same problems over and over…like this thread a few months ago that seems to me obviously linked to this bug:

I also have concerns about the OneWire “Library” as well since all of the various DS18xxx libraries seem to have the OneWire files inside of them, and no LINK to the library pointed out by rickkas7. Even if that library gets changed, I don’t understand how it would propagate to the DS18xxx libraries that everyone seems to grab.

In any case, I don’t use MESH but here is what I would suggest based on user comments previously: IF you are doing a OneWire address scan to discover OneWire devices, disable your mesh during that operations. Once the addresses are successfully recorded, there doesn’t seem to be any other choice but to live with the high CRC error rate associated with transfers on the bus.

As I said, I’ll continue to watch this thread and others regarding OneWire and (I think) awesome DS18B20s.


#37

Unfortunately not on the particle-iot repo but on the upstream repo by Hotaman.
However, he has already merged that PR into his repo and Rick and Julien (both Particle) have also commented/contributed on that

Actually no
This is a “link” but to a specific version of that library

While this would pose a problem if the dependency-library got a fix and the dependent library wouldn’t pull that in - on the other hand, that same strategy should prevent a “bad fix” of the dependency from also killing the dependent library.
This is a canundrum that may want to be addressed - e.g. via an “override” feature where the dependency gets ignored when a project explicitly imports another version of the dependency-library.

BTW, @rickkas7 - could you activate the Issues feature in the this repo, please?


#38

Scruff, I knew my attempt with that PR did not go well, I thought about trying again but worried I would make it worse, haha. And it’s a good thing that my “lack of knowledge” about library referencing was wrong, too.

Aaahhh…that’s actually what I was looking for and couldn’t seem to find a way to do it, but I did see that it was an option in other repositories.

@peekay123 you said in another thread: "The challenge with any OneWire library is that they use bit-banging with delayMicroseconds() for timing and disable interrupts to ensure that FreeRTOS doesn’t preempt use code while OneWire is doing its thing. The problem is that’s really not DeviceOS
friendly "

One of the first things that crossed my mind when the cause of OneWire’s failures was discovered …are there OTHER protocols/operations where disabling/enabling interrupts is so key to success: other hardware protocols implemented like OneWire? Are there any obvious ones? I could imagine there may be some software protocols that use these techniques as well…

This OneWire implementation was flawed from the day it was ported to the particle.io platform because of the introduction of those higher level system interrupts that can break into an assumed safe, uninterruptible window. Atomic Blocks “fixes it” for non-Mesh devices, but at some cost to the system.

With the introduction of Mesh SUPER-interrupts that can interrupt Atomic_Blocks (yes I am making up terms), I can only imagine the scope of the validation effort. Not my job though, haha.


#39

It is great news that a solution seems to have been found. Thank you for all your efforts. I’d almost given up using OneWire with Particle.

So now, can anyone tell me what example software I could I use to let me reliably use a Xenon to poll 2 DS18B20 sensors?

If I have to copy and paste the atomic block modified code into a cpp file, which version should I use and how would I go about doing it?
Any help on this would be greatly appreciated.


#40

Hey Colin, I would use the version of the onewire.cpp file in the first post of this thread…it works great for me, and for those that I have received feedback from. There doesn’t seem to be any real push to fix this in the official library, or to officially document the problem.

As far as coding examples, I am currently sampling six DS18B20 sensors in my Photon project. It uses the code that I posted in my thread linked (also in the first post of this thread). That code can be modified fairly easily but it is pretty different than other code that I have seen.