[Photon] v0.4.7 hanging in UDP::stop?

( Edit: moved from Troubleshooting to Firmware as seems more appropriate )

Hey guys - back with another hang and was wondering if anyone could run a friendly eye over it to see if they can help me in debugging this further.

We upgraded to v0.4.7 a while back to bring in a few of the fixes that were addressed. We then took a few units into the field for testing and in one situation we had bad WiFi conditions. In this situation, we saw our application hang up a lot.

Bringing it back into the office (our test location has ok WiFi), we started to see the same thing.

I started to force disconnects and reconnects on the desk while on the debugger with openOCD and managed to cause this again and capture a stack trace that indicated it was in the UDP::stop function.

(Note: Before I go further, the reason we are calling this function; we have a UDP socket multicasting sensor data once a second. A few releases ago we had some problems with it dropping out so we had our code re-build the socket every 30 seconds. This is probably too quickly and a bit of a hack but besides the point for now)

Program received signal SIGINT, Interrupt.
0x0803d5d6 in remove (this=<optimized out>, item=0x200120c0) at src/photon/socket_hal.cpp:522
522	                if (current->next==item) {
(gdb) info threads
  Id   Target Id         Frame
  9    Thread 536927480 (worker thread) 0x080258c4 in vPortYield () at WICED/RTOS/FreeRTOS/ver7.5.2/Source/portable/GCC/ARM_CM3/port.c:332
  8    Thread 536928304 (worker thread) 0x080258c4 in vPortYield () at WICED/RTOS/FreeRTOS/ver7.5.2/Source/portable/GCC/ARM_CM3/port.c:332
  7    Thread 536941432 (WWD) 0x080258c4 in vPortYield () at WICED/RTOS/FreeRTOS/ver7.5.2/Source/portable/GCC/ARM_CM3/port.c:332
  6    Thread 536926192 (Tmr Svc) 0x080258c4 in vPortYield () at WICED/RTOS/FreeRTOS/ver7.5.2/Source/portable/GCC/ARM_CM3/port.c:332
  5    Thread 536935856 (std::thread) 0x080258c4 in vPortYield () at WICED/RTOS/FreeRTOS/ver7.5.2/Source/portable/GCC/ARM_CM3/port.c:332
  4    Thread 536939792 (tcpip_thread) 0x080258c4 in vPortYield () at WICED/RTOS/FreeRTOS/ver7.5.2/Source/portable/GCC/ARM_CM3/port.c:332
  3    Thread 536918768 (system monitor) 0x080258c4 in vPortYield () at WICED/RTOS/FreeRTOS/ver7.5.2/Source/portable/GCC/ARM_CM3/port.c:332
  2    Thread 536925696 (IDLE) 0x08024c54 in prvIdleTask (pvParameters=<optimized out>) at WICED/RTOS/FreeRTOS/ver7.5.2/Source/tasks.c:2261
* 1    Thread 536919376 (app_thread :  : Running) 0x0803d5d6 in remove (this=<optimized out>, item=0x200120c0) at src/photon/socket_hal.cpp:522
(gdb) where
#0  0x0803d5d6 in remove (this=<optimized out>, item=0x200120c0) at src/photon/socket_hal.cpp:522
#1  socket_dispose (handle=536944832) at src/photon/socket_hal.cpp:673
#2  0x0803d7be in socket_close (sock=<optimized out>) at src/photon/socket_hal.cpp:935
#3  0x0803b718 in UDP::stop (this=0x20014140) at src/spark_wiring_udp.cpp:105
#4  0x08032068 in Coap::loop (this=0x20014140) at ../../../src/Coap.cpp:198
#5  0x080354dc in NetworkController::process (this=<optimized out>) at ../../../src/NetworkController.cpp:19
#6  0x08035b52 in PublishController::process (this=<optimized out>) at ../../../src/PublishController.cpp:32
#7  0x0803a66c in loop () at ../../../src/application.cpp:165
#8  0x0803fca0 in app_loop (threaded=<optimized out>) at src/main.cpp:231
#9  0x0803f954 in ActiveObjectBase::run (this=this@entry=0x20009490 <ApplicationThread>) at src/active_object.cpp:49
#10 0x0803fd36 in start (this=0x20009490 <ApplicationThread>) at ./inc/active_object.h:389
#11 app_setup_and_loop () at src/main.cpp:297
#12 0x0803bffc in application_start () at src/stm32f2xx/core_hal_stm32f2xx.c:502
#13 0x0802c832 in application_thread_main (arg=<optimized out>) at WICED/RTOS/FreeRTOS/WICED/wiced_rtos.c:177
#14 0x00000000 in ?? ()

Looking at this, it seems to be stuck inside a while() loop in that socket_hal.cpp:

/**
 * Removes an item from the linked list.
 * @param item
 * @param list
 */
bool remove(socket_t* item)
{
    bool removed = false;
        if (items==item) {
            items = item->next;
        removed = true;
    }
    else
    {
            socket_t* current = items;
        while (current) {
            if (current->next==item) {
                current->next = item->next;
                removed = true;
                break;
            }
        }
    }
    return removed;
}

GDB is only reporting a few of the objects in play though…

(gdb) p current
$1 = <optimized out>
(gdb) p item
$3 = (socket_t *) 0x200120c0
(gdb) p item->next
$4 = (socket_t *) 0x1
(gdb) p items
value has been optimized out

Anyone help me any further?

I’m going to remove the socket rebuild and see if it helps but would be good to debug this further.

May or may not be related to the above…

The remove() function in socket_hal.cpp looks weird to me…

bool remove(socket_t* item)
{
   bool removed = false;
   if (items==item) {
       items = item->next;
       removed = true;
   }
   else
   {
       socket_t* current = items;
       while (current) {
           if (current->next==item) {
               current->next = item->next;
               removed = true;
               break;
           }
       }
   }
   return removed;
}

This will never progress past the second item on the list…

If this:

if (current->next==item)

fails then it will get stuck in that while loop.

Should it not be:

bool remove(socket_t* item)
{
   bool removed = false;
   if (items==item) {
       items = item->next;
       removed = true;
   }
   else
   {
       socket_t* current = items;
       while (current) {
           if (current->next==item) {
               current->next = item->next;
               removed = true;
               break;
           }
           current = current->next;
       }
   }
   return removed;
}

Also - noticed this

/**
 * Maintains a singly linked list of sockets. Access to the list is not
 * made thread-safe - callers should use SocketListLock to correctly serialize
 * access to the list.
 * /

Wondering if this could maybe also be a race condition in the new threading stuff?

Do you use SEMI AUTOMATIC or MANUAL modes? I’ve been seeing my devices hang in poor wifi conditions as well. I can’t find the cause or a way to prevent it, but I can recreate the condition very consistently.
My issue is documented in the github here.

I’ve been seeing this issue since 0.4.6 and I don’t believe it’s related to threading.

I’ve been trying to find a fix to varying degrees of success. A main loop watchdog timer catches the issue most of the time. ( look in the link above to see how I implement that )
I still have cases where the device won’t connect to the spark cloud after a soft reset, but funny enough in these cases it runs super stable in my “WiFi router from Hell” setup. Or occasionally it hangs at the system level and the WDT can’t recover.

1 Like

Interesting about the WDT, I had looked into doing my own WD using SparkIntervalTimer interrupts to handle any issues so will have a look at yours.

I am indeed using Manual Mode and the threading.

Interesting your bug seems to be in UDP code too - can you run GDB and recreate my stack trace? I can run your code locally if you can’t to see if I can recreate…

Sorry, mhazley. My gdb-foo is weak, so can’t help you there at the moment.

I don’t know if it’s UDP specific. My guess is that the interface with the WiFi module can get in a funny state if the WiFi drops and ends up blocking a call somewhere that attempts to use that interface. Considering running the test code in automatic seems to hang in there just fine, a fix should be possible. Unfortunately I can’t find the offending code.

I would prefer to use hardware timer interrupts for the WDT, but I’m not sure if they’re available to be exposed. I’ve found that the soft timers sometimes don’t run, which is undesirable for a WDT. Looking more into that as well.

Just interesting that you are using UDP and seeing a hang, just like me. Would have been good to see where you are hanging to compare…

Try @peekay123’s amazing SparkIntervalTimer library for the hardware timers - so easy to use, I use it for background timers.

1 Like

Awesome, thanks for that library.

It might be worth a shot to try to reproduce the issue with a TCP pipe instead.
Still hoping @mdma weighs in on this one.

Yeah, we added a PR here to see what happened

Any more findings on this one?
Do the changes in your PR seem to fix the hangs?

Yeah @jakeypoo@mdma noted he thought he’d fixed it already but he’s included it in next release I think. It fixed our hangs so worth testing the fix locally in your stuff I’d say!

Hi @jakeypoo,
It seems like I have exactly the same issues. But I am running on 0.6.1 already.
Since this discussion is rather old I am interested if the was somehow any new ides to solve this.

Ping @mdma

Can someone please post an example app demonstrating the problem and the steps to reproduce it?