BLE.scan() crashing/blocking

I am working on a Neopixel project that needs to react to music. I have one primary Argon with a mic that does some computation on the audio and uses BLE to advertise the resulting data. It updates its advertisement every 25ms or so, and the advertising rate is set to the shortest possible setting. This seems to be working fine.

However, the receiving Argons are flaky. The loop renders every 33ms (for 30FPS): it first scans for nearby BLE devices for 20ms using the updateData() method below, then renders the audio level effect on the strip. At random, sometimes after as little as 30 seconds and sometimes after 10 minutes, the BLE.scan() call seems to cause a crash. The loop effectively stops and I cannot flash the firmware, yet the device still shows the light blue pulsating indicator. It is strange.

There is nothing magical about the code below. I have added logging to show me where it stops, and every time the last log entry is “Begin Scan”.

void updateData()
{
    // scanResults, blebuf, len, cmd and hitCount are declared elsewhere in the sketch
    Log.info("Begin Scan");
    int count = BLE.scan(scanResults, SCAN_RESULT_MAX);
    Log.info("Finished Scan");
    for (int i = 0; i < count; i++)
    {
        // Copy the custom (manufacturer-specific) data of this result into blebuf
        len = scanResults[i].advertisingData.customData(blebuf, BLE_MAX_ADV_DATA_LEN);
        Log.info("Got Results: %d", len);
        if (len == 6)
        {
            // This is the dummy company id and data type we set from the advertiser
            if (blebuf[0] == 0xff && blebuf[1] == 0xff && blebuf[2] == 0x55)
            {
                Log.info("Matched Signature");
                uint16_t level;
                memcpy(&level, &blebuf[3], 2);

                uint8_t hue;
                memcpy(&hue, &blebuf[5], 1);

                Log.info("Computed Integers");

                cmd.audioLevel = level;
                cmd.globalhue = hue;

                Log.info("Set Props");

                hitCount++;
                Log.info("Hit Count++");
                break; // Stop at the first matching advertiser
            }
        }
    }
}
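
To rule out the parsing step as the cause, the payload handling can be separated from the BLE call and checked off-device. The following is only a minimal sketch, assuming the 6-byte layout implied by the code above (0xff 0xff dummy company id, 0x55 data type, a 16-bit level in the sender's native byte order, an 8-bit hue). The names AudioFrame, kFrameLen, encodeFrame and decodeFrame are hypothetical and not part of the Particle API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical container for the two values carried in the advertisement.
struct AudioFrame {
    uint16_t level;
    uint8_t hue;
};

// Assumed payload size: 2 bytes company id + 1 byte type + 2 bytes level + 1 byte hue.
constexpr size_t kFrameLen = 6;

// Advertiser side: write the frame into out, returning the bytes written (0 if cap is too small).
inline size_t encodeFrame(const AudioFrame& f, uint8_t* out, size_t cap) {
    assert(out != nullptr);
    if (cap < kFrameLen) return 0;
    out[0] = 0xff;  // dummy company id, low byte
    out[1] = 0xff;  // dummy company id, high byte
    out[2] = 0x55;  // data type marker
    std::memcpy(&out[3], &f.level, 2);  // native byte order, mirroring the receiver's memcpy
    out[5] = f.hue;
    return kFrameLen;
}

// Scanner side: returns true and fills out only if the length and signature match.
inline bool decodeFrame(const uint8_t* buf, size_t len, AudioFrame& out) {
    if (buf == nullptr || len != kFrameLen) return false;
    if (buf[0] != 0xff || buf[1] != 0xff || buf[2] != 0x55) return false;
    std::memcpy(&out.level, &buf[3], 2);
    out.hue = buf[5];
    return true;
}
```

With something like this in place, updateData() would only need to call decodeFrame(blebuf, len, frame) for each scan result, which keeps the code around BLE.scan() as small as possible while hunting the hang.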

Any thoughts? While the loop is working, it is quite good.

I would prefer that the secondary Argons driving the LED strips just make a BLE connection to the primary, but I have about 5 devices in the room and I hit that 3 connection limit quickly, unfortunately. I am also open to other thoughts on how to broadcast data in realtime. I tried UDP, but I think I am running into that Argon UDP lag issue: I cannot get a consistently low enough latency, and the effect is visibly out of sync every few seconds.

Well, I’ve worked with BLE for several years, and I think you’d be very lucky to get an update rate of 30 FPS.

Have you tried setting the active field of BleScanParams to 0 (false)? When active scanning is on, the scanning device requests more information from the advertiser, and this exchange ties up the advertiser for a few milliseconds or more. If multiple scanning devices make this request at once, the result could be flaky.

I also experienced “lockups” when calling BLE.scan in some other work of mine. Device OS does not report error codes from a failed call to the nRF SDK for scanning. Look for the code below in ble_hal.cpp in the Device OS toolchain. Log the value of ret and, if it is nonzero, look it up in the nRF SDK docs. You will have to rebuild the app and Device OS as an integrated debug build for this to work. Hope this helps and good luck!

int BleObject::Observer::startScanning(hal_ble_on_scan_result_cb_t callback, void* context) {
    CHECK_FALSE(isScanning_, SYSTEM_ERROR_INVALID_STATE);
    SCOPE_GUARD ({
        clearCachedDevice();
        clearPendingResult();
    });
    ble_gap_scan_params_t bleGapScanParams = toPlatformScanParams();
    LOG_DEBUG(TRACE, "| interval(ms)   window(ms)   timeout(ms) |");
    LOG_DEBUG(TRACE, "  %d*0.625        %d*0.625      %d",
            bleGapScanParams.interval, bleGapScanParams.window, bleGapScanParams.timeout*10);
    scanResultCallback_ = callback;
    context_ = context;
    int ret = sd_ble_gap_scan_start(&bleGapScanParams, &bleScanData_);
    CHECK_NRF_RETURN(ret, nrf_system_error(ret));
    isScanning_ = true;
    // If timeout is set to 0, it should scan indefinitely
    if (bleGapScanParams.timeout != BLE_GAP_SCAN_TIMEOUT_UNLIMITED) {
        if (os_timer_change(scanGuardTimer_, OS_TIMER_CHANGE_START, HAL_IsISR() ? true : false, 0, 0, nullptr)) {
            LOG(ERROR, "Failed to start the timer for guard of scanning timeout.");
            // We don't return here, as scanning may still timeout by Softdevice as expected.
        }
    }
    if (os_semaphore_take(scanSemaphore_, CONCURRENT_WAIT_FOREVER, false)) {
        SPARK_ASSERT(false);
        return SYSTEM_ERROR_TIMEOUT;
    }
    return SYSTEM_ERROR_NONE;
}
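
A note on the units in the trace line of startScanning() above: it prints the interval and window as multiples of 0.625 ms and the timeout as a multiple of 10 ms. Here is a small sketch of converting milliseconds to those raw units, useful when comparing the logged values against a 20 ms scan. The helper names msToScanUnits and msToTimeoutUnits are hypothetical and do not exist in Device OS.

```cpp
#include <cstdint>

// Interval/window: 1 unit = 0.625 ms, so units = ms * 1000 / 625 (rounded down).
constexpr uint16_t msToScanUnits(uint32_t ms) {
    return static_cast<uint16_t>((ms * 1000u) / 625u);
}

// Timeout: 1 unit = 10 ms, so units = ms / 10 (rounded down).
constexpr uint16_t msToTimeoutUnits(uint32_t ms) {
    return static_cast<uint16_t>(ms / 10u);
}

// A 20 ms scan corresponds to a raw timeout value of 2 in the trace output.
static_assert(msToTimeoutUnits(20) == 2, "20 ms is two 10 ms timeout units");
```

Because both conversions round down, a requested duration that is not a whole multiple of the unit ends up slightly shorter than asked for.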

Thank you. I just tried setting active to 0, but no luck preventing the crash. I will try your suggestion to look at the error code, but how do I get that far? I have never dived that deep. Basically, how do I get to ble_hal.cpp? I do have the integrated debugging kit.

Also, I do not get an update rate of 30 FPS; I get about 24, which is actually quite fine for my needs. It goes as low as 16 and as high as 28 or so per second.

Lastly, is it possible to set a scan filter? I saw this article: How to use the BLE Scanning filters on the Connection Request – jimmywongiot

I am not sure BLE is exposed enough to set all of that. Do you know?

Thank you!

Update: I am thinking this is a deadlock situation.

I tried seeing if I could start a thread that just did this work, and if it failed, kill it, and start it up again. No matter what I have tried, if it locks, anything BLE is stuck.

In fact, what doesn’t make sense is this: if I start a thread that simply scans, kill it BEFORE it locks up, and then start a new thread doing the same thing, the scan never completes even once and just blocks. I am canceling, disposing and deleting the new Thread that I started.

This seems all so strange.

FreeRTOS does not allow threads to be discarded. A thread started once has to run forever.

Huh, odd, it seems to sorta work. What is the purpose of the Cancel/Dispose public methods?

Back to my original issue, any advice ScruffR? It feels very much like a deadlock is happening in the BLE HAL. I am just continually scanning in a quick loop for 20ms. I have tried all sorts of permutations of turning off BLE, calling End() before/after my scan call, etc. No luck, eventually it just dies.

Not sure why these public functions exist when the common narrative used to be “threads need to be running eternally” - @rickkas7 is this not valid anymore?

When you want to modify the current implementation of the device OS you can use Particle Workbench (VS Code), modify the device OS code and rebuild the binaries via the dedicated task.

Specifically some instructions in this doc:

Talks about how to use the methods toward the end.

Update (sorta),

I just found a reference to a deadlock that is fixed in the 3.0 firmware: Fixes the issue that BLE.scan() might hang the device. by XuGuohui · Pull Request #2220 · particle-iot/device-os · GitHub

That certainly feels like what is going on, so I am trying the RC release!

I can confirm, the new 3.0 DeviceOS solves the issue.