When (3 > 120000) resolves to TRUE (with unsigned longs, millis() and dark magic)

Hi all,

Let me describe an issue that I’m having and can’t figure out:
Every now and then (in the course of less than a minute) a comparison that is NOT supposed to resolve to true, does.

I have this code as part of the firmware that is running on a Boron:

Global scope:

unsigned long receivePeriod = 0;
unsigned long receiveTimeout = 120000;

on a function that runs from loop():

  receivePeriod = millis();

on another function that runs from loop() too (I’ve added a few new lines for easier reading):

  if ((millis() - receivePeriod) > receiveTimeout)
  {
    Log.error("%s() - timeout waiting for messages after %lu msec %lu %lu %lu %d", 
     __FUNCTION__, millis() - receivePeriod, millis(), receivePeriod, receiveTimeout, 
     ((millis() - receivePeriod) > receiveTimeout));
    timeouts++;
  }

(original code without newlines):

  if ((millis() - receivePeriod) > receiveTimeout)
  {
    Log.error("%s() - timeout waiting for messages after %lu msec %lu %lu %lu %d", __FUNCTION__, millis() - receivePeriod, millis(), receivePeriod, receiveTimeout, ((millis() - receivePeriod) > receiveTimeout));
    timeouts++;
  }

Often times this code works fine, guestimate 8 out of 10 times.

But a couple of times out of ten, I observe this log (I’ve added a new line for easier reading):

0000129170 [app] ERROR: receiveUpdateFunction() - timeout waiting for messages 
after 3 msec 129170 129167 120000 0

How could this if condition resolve as 1 (or true) at 3 msec when the comparison calls for 120000 msec?

I think I see this happening in the first log:
((millis() - receivePeriod) > receiveTimeout)
(129170 - 129167) > 120000
3 > 120000 is true

But even then, when the log inside the if prints ((millis() - receivePeriod) > receiveTimeout) it resolves to 0 (zero or false), as the last value printed shows.


Another instance of the error:

0000216593 [app] ERROR: receiveUpdateFunction() - timeout waiting for messages 
after 1 msec 216593 216592 120000 0

What kind of bit manipulation, internals, or something I can’t figure out yet is hitting me here?
Thanks!

2 Likes

At first glance I can’t see the issue either, but could you replace the individual instances of millis() used in that entire block with a variable to make sure all calculations and logs are actually using the same number?

BTW, when I tried to follow the numbers in your examples I found myself needing to repeatedly look back which number was representing what - memory isn’t what it was back in the days :see_no_evil:

Maybe this would make it easier to follow

  uint32_t now = millis();
  if ((now - receivePeriod) > receiveTimeout) {
    Log.error("%s() - timeout waiting for messages after %lu msec"
             , __FUNCTION__, 
             , (now - receivePeriod)
             );
    Log.trace("if ((%lu - %lu) [=%lu] >  %lu) -> %s"
             , now
             , receivePeriod
             , (now - receivePeriod)
             , receiveTimeout
             , ((now - receivePeriod) > receiveTimeout) ? "true" : "false"
             );
    timeouts++;
  }

(I’d always write long code lines broken up this way - not only for posting in the forum. This also helps maintainability when you need to come back to it some time later)

If the above doesn’t provide some indication where the issue may come from, you could also introduce another variable for the difference (e.g. uint32_t diff = (now - receivePeriod);) and check/log that instead of recalculating the difference four times.

Possible reasons for such behaviour may also stem from non-atomic updates of the millis() counter, memory corruption due to buffer overflow or such, inadvertent assignments where you only intend to do equality checks, variable shadowing, …

3 Likes

@gusgonnet, where do you reset receivePeriod to millis() once the timeout occurs?

Hey Paul,
on the enter function to this state for the FSM I have created, like so:

void receiveEnterFunction()
{
  //reset the receive period
  receivePeriod = millis();
}

void receiveUpdateFunction()
{
  if ((millis() - receivePeriod) > receiveTimeout)
  {
    Log.error("%s() - timeout waiting for messages after %lu msec %lu %lu %lu %d", __FUNCTION__, millis() - receivePeriod, millis(), receivePeriod, receiveTimeout, ((millis() - receivePeriod) > receiveTimeout));
    timeouts++;
  }
}

the Enter function gets executed from the loop() when the state enters, and the Update function gets called all the time from loop() while the FSM is in this state.

thanks!

EDIT: this is the library I’m using

1 Like

thank you for this tip and the others.
I will test your ideas out and come back.

Perhaps what raises a flag is this:

I will look into other parts of the code as well, it could very well be that I’m doing something wrong somewhere else.

Thank you,
Gustavo.

1 Like

that was genius!
I changed it to info level and this is what I got in normal times:

0000597085 [app] ERROR: receiveUpdateFunction() - timeout waiting for messages after 120001 msec
0000597085 [app] INFO: if ((597085 - 477084) [=120001] >  120000) -> true

However, in voodoo times, I got this:

0000470703 [app] ERROR: receiveUpdateFunction() - timeout waiting for messages after 4294967295 msec
0000470704 [app] INFO: if ((470701 - 470702) [=4294967295] >  120000) -> true

or this:

0000969645 [app] ERROR: receiveUpdateFunction() - timeout waiting for messages after 4294967295 msec
0000969645 [app] INFO: if ((969642 - 969643) [=4294967295] >  120000) -> true

I went back and double-checked all the code again, and I found the culprit:

receivePeriod = millis();

that line was present in an asynchronous callback function that would execute when a message is received. I added it there so the timeout will reset every time the device receives a message, but doing it async seems NOT a good idea.
Removing that line fixed this issue.

That’s when the question from Paul hit me:

In this case, it seems I was resetting it in the wrong place (an async callback function), and every now and then it was messing up the logic in the loop() flow.

I still have some dots to connect, but for now I’ll be ok.
Thank you both for the tips and posts!

4 Likes

Using C++ std::atomic is a good way to prevent accessing a variable in the middle of its modification. There’s a brief example here:

And since it’s a standard C++ library feature, there’s a lot written about it online. It’s a little confusing at first, but very powerful.

2 Likes

oh, this is great. That could explain the other instances of my issue where I hadn’t found an explanation yet.
Your post makes it easy to get an idea about it (and I see an interesting link from PK also).
I will study this further.

Thank you!
Gustavo.

1 Like

This topic was automatically closed 30 days after the last reply. New replies are no longer allowed.