Reliability analysis from Electron -> Particle Cloud -> Azure IoT Hub

Hi Particle Cloud Community

First of all - I really like the products! I have spent the last few days investigating how reliable the data transmission from Electron -> Particle Cloud -> Azure IoT Hub is.

The Particle Electron publishes the RSSI and a sample counter every 10 s and increments the sample count inside the loop function. However, it looks like some events are missing - and I can’t explain what is actually going wrong.

Problem:
In the middle of the night, 8 samples are missing. HOWEVER, the timestamps do not match the sample counter.
8 missing samples should mean a difference of roughly 80 s between events. The log file, however, tells me that the delta was 385 s between sampleCount 4046 and 4054. Any idea what’s going wrong?

"published_at":"2018-02-08T00:04:30.705Z","sampleCount":"4045","rssi":"-87"
"published_at":"2018-02-08T00:04:40.946Z","sampleCount":"4046","rssi":"-87"
"published_at":"2018-02-08T00:11:05.016Z","sampleCount":"4054","rssi":"-89"
"published_at":"2018-02-08T00:11:15.240Z","sampleCount":"4055","rssi":"-89"
"published_at":"2018-02-08T00:11:25.682Z","sampleCount":"4056","rssi":"-89"

  1. Electron App
    Firmware: 0.6.4
    Application: Publish RSSI and a sample counter every 10 s
    Code:
    #include "cellular_hal.h"
    uint32_t sampleCount = 0;
    
    void setup() {
    }
    
    void loop() {
        CellularSignal sig = Cellular.RSSI();
        String data = "{\"rssi\":\"" + String(sig.rssi) + "\",\"sampleCount\":\"" + String(sampleCount) + "\"}";
        // Trigger the integration (limited to 255 bytes)
        Particle.publish("v1-dev-stability", data, PRIVATE);
        sampleCount++;
        // Wait 10 seconds
        delay(10000);
    }
  2. Particle IoT Hub Integration
    I can’t see any errors in the Particle console, so I assume all events have been forwarded from the Particle Cloud to the IoT Hub.

Looking forward to suggestions on how to troubleshoot such problems…

1 Like

One factor here to take into account is that RSSI acquisition for cellular networks might take a considerable amount of time (IIRC up to one minute) and the call is blocking.

Also, the use of String objects can introduce trouble in any long-running program due to the risk of heap fragmentation.

As a rule of thumb: for functions that return a fail/success response, checking the outcome is good practice to adopt.

Another point - maybe not applicable here as I assume SYSTEM_MODE(AUTOMATIC) and no SYSTEM_THREAD(ENABLED) - is to check Particle.connected() before executing Particle.publish().
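
For illustration, a rough sketch of the last two points (hypothetical event name and payload, assuming the default AUTOMATIC mode):

#include "Particle.h"

void setup() {
    Serial.begin(115200);
}

void loop() {
    // only publish while the cloud connection is up, and check the outcome
    if (Particle.connected()) {
        bool ok = Particle.publish("my-event", "payload", PRIVATE);
        if (!ok) {
            // enqueueing the event failed - log it and try again on the next pass
            Serial.println("publish failed");
        }
    }
    delay(10000);
}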

Thanks for the hint regarding the blocking call of Cellular.RSSI().
Since I don’t have a debugger set up, I tried to follow the source code in the Git repo:

  1. firmware/wiring/src/spark_wiring_cellular.cpp
    CellularSignal CellularClass::RSSI() calls cellular_signal
  2. firmware/hal/src/electron/cellular_hal.cpp
    cellular_signal calls electronMDM.getSignalStrength
  3. firmware/hal/src/electron/modem/mdm_hal.cpp
    This is the place which actually makes the AT-Command to the U-Blox module: "AT+CSQ"
    bool MDMParser::getSignalStrength(NetStatus &status)

I didn’t find any information about the max duration of the AT+CSQ command in the SARA-U270 data sheet:

The u-blox AT commands reference manual is quite extensive; however, I couldn’t find a max duration there either (chapter 7.2 Signal quality +CSQ):

There doesn’t seem to be an async function for the Electron to query the signal strength, so what’s the proposed way to periodically log the RSSI of the device to the cloud?
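
Would a millis()-based scheduler instead of delay() be the way to go? A rough sketch of what I have in mind (the RSSI call itself would still block):

#include "Particle.h"

const unsigned long PUBLISH_INTERVAL_MS = 10000;
unsigned long lastPublishMs = 0;
uint32_t sampleCount = 0;

void setup() {
}

void loop() {
    // publish every 10 s without parking loop() in a delay()
    if (millis() - lastPublishMs >= PUBLISH_INTERVAL_MS) {
        lastPublishMs = millis();
        CellularSignal sig = Cellular.RSSI();   // still a blocking call
        char data[64];
        snprintf(data, sizeof(data), "{\"rssi\":\"%d\",\"sampleCount\":\"%lu\"}", sig.rssi, (unsigned long)sampleCount);
        Particle.publish("v1-dev-stability", data, PRIVATE);
        sampleCount++;
    }
}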

I have now extended the logging of the device and will start another test tonight.
Let’s hope the problem can be tracked down with that test.

FYI: The current serial console outputs tell me it always takes about 20ms to query the RSSI.

#include "cellular_hal.h"

uint32_t sampleCount = 0;

// setup() runs once, when the device is first turned on.
void setup() {
    Serial.begin(115200);
    Serial.println("==== Setup for v2-dev-stability ====");
}

// loop() runs over and over again, as quickly as it can execute.
void loop() {
    // Cellular.ready() will return true once the device is connected to the cellular network and has been assigned an IP address, which means it has an activated PDP context and is ready to open TCP/UDP sockets. Otherwise it will return false.
    bool cellularReady = Cellular.ready();
    bool eventPublished = false;
    unsigned long rssiDurationMs = 0;
    CellularSignal sig;
    if (cellularReady) {
      // attention: RSSI could potentially block for a long duration (?)
      unsigned long startTimeMs = millis();
      sig = Cellular.RSSI();
      rssiDurationMs = millis() - startTimeMs;
      String data = "{\"rssi\":\"" + String(sig.rssi) + "\",\"sampleCount\":\"" + String(sampleCount) + "\",\"rssiDurationMs\":\"" + String(rssiDurationMs) + "\"}";
      // Trigger the integration (limited to 255 bytes)
      eventPublished = Particle.publish("v2-dev-stability", data, PRIVATE);
    }
    sampleCount++;
    String state = "rssi=" + String(sig.rssi) + ", sampleCount=" + String(sampleCount) + ", rssiDurationMs=" + String(rssiDurationMs) + ", cellularReady=" + String(cellularReady) + ", eventPublished=" + String(eventPublished);
    Serial.println(state);
    // Wait 10 seconds
    delay(10000);
}

One thing I definitely advise: drop String and adopt character arrays in combination with snprintf() to rule out the heap fragmentation issue I mentioned above.
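
Inside loop(), that could look roughly like this (buffer size chosen arbitrarily; sig and sampleCount as in your code above):

// fixed-size buffer + snprintf() instead of String concatenation
char data[128];
snprintf(data, sizeof(data), "{\"rssi\":\"%d\",\"sampleCount\":\"%lu\"}", sig.rssi, (unsigned long)sampleCount);
Particle.publish("v2-dev-stability", data, PRIVATE);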

2 Likes

If heap fragmentation due to some simple strings can cause an application freeze of several minutes, then there is something deeply wrong with the system setup.

Anyway, I have now ported it to the “good old C style” to eliminate this factor as well.
Any chance you can check the logs on the Particle Cloud, since the Azure IoT Hub integration is still marked as “beta”?

#include "cellular_hal.h"

uint32_t sampleCount = 0;

// setup() runs once, when the device is first turned on.
void setup() {
    Serial.begin();
    Serial.println("==== Setup for v2-dev-stability ====");
}

// loop() runs over and over again, as quickly as it can execute.
void loop() {
    // Cellular.ready() will return true once the device is connected to the cellular network and has been assigned an IP address, which means it has an activated PDP context and is ready to open TCP/UDP sockets. Otherwise it will return false.
    bool cellularReady = Cellular.ready();
    bool eventPublished = false;
    unsigned long rssiDurationMs = 0;
    CellularSignal sig;
    if (cellularReady) {
      // attention: RSSI could potentially block for a long duration (?)
      unsigned long startTimeMs = millis();
      sig = Cellular.RSSI();
      rssiDurationMs = millis() - startTimeMs;
      char data[255];
      memset(data, 0, sizeof(data));   // clear the buffer before use

      snprintf(data, sizeof(data), "{\"rssi\":%d,\"sampleCount\":%lu,\"rssiDurationMs\":%lu}", sig.rssi, (unsigned long)sampleCount, rssiDurationMs);
      // Trigger the integration (limited to 255 bytes)
      eventPublished = Particle.publish("v2-dev-stability", data, PRIVATE);
    }
    sampleCount++;

    char state[255];
    memset(state, 0, sizeof(state));   // clear the buffer before use
    snprintf(state, sizeof(state), "rssi=%d, sampleCount=%lu, rssiDurationMs=%lu, cellularReady=%d, eventPublished=%d", sig.rssi, (unsigned long)sampleCount, rssiDurationMs, cellularReady, eventPublished);
    Serial.println(state);
    // Wait 10 seconds
    delay(10000);
}

Test is running - let’s generate some data!

Nope. You are dealing with a system of limited resources and neglecting that fact is the systematic problem.

Garbage collection and heap defragmentation are demanding tasks that can easily be accomplished when there are resources to spare, but with an expressly finite amount you’d just pull those resources away from where they could be put to better use - when investing a little extra time would prevent the garbage/fragmentation in the first place.

BTW, "simplicity" of String is not the point here. Plastic bags are a simple thing, but used by the tonne and dumped carelessly are still causing problems.

1 Like

Just wanted to chime in to mention that the "Elite" are not employees, and are 'just' community members like yourself. As such, we don't have access to those logs (or any back-end systems for that matter).

2 Likes

Could it be that Particle.publish() blocks for a very long time when there is no internet connection?
I'm using the latest official 0.6.4, and the issue seems to be fixed in 0.7.0-rc.1:
https://github.com/particle-iot/firmware/issues/1281

@avtolstoy
I assume the docs don't match the code either? 30 s, 60 s?
https://docs.particle.io/reference/firmware/photon/#particle-publish-

ttl (time to live, 0–16777215 seconds, default 60) !! NOTE: The user-specified ttl value is not yet implemented, so changing this property will not currently have any impact.

The observed output looks similar to the issue I am facing:

Observed Output
126(s): publish passed: 63
128(s): publish passed: 64
130(s): publish passed: 65
152(s): publish failed: 66 // internet disabled here
316(s): publish failed: 67 // internet restored here
317(s): publish passed: 68
318(s): publish passed: 69
320(s): publish passed: 70

Actually, Particle.publish() doesn’t block at all, as it only enqueues the message but doesn’t wait till it’s sent - unless explicitly instructed to with the extra parameter, which I don’t see in your code.

But since you are running SYSTEM_MODE(AUTOMATIC) (default setting when not explicitly changed) without SYSTEM_THREAD(ENABLED) the whole program would block once a connection loss is detected.

But given your earlier log, which showed that the counter kept being incremented, it can’t have been a full block.

BTW, TTL really has no effect, but even if it had, it wouldn’t play any role in connection with the timeout for the command.
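
For illustration, a rough sketch of both points - assuming WITH_ACK (available since 0.6.1) is the extra parameter meant above:

#include "Particle.h"

SYSTEM_MODE(AUTOMATIC);
SYSTEM_THREAD(ENABLED);     // application thread keeps running while the system reconnects

uint32_t sampleCount = 0;

void setup() {
    Serial.begin(115200);
}

void loop() {
    if (Particle.connected()) {
        char data[64];
        snprintf(data, sizeof(data), "{\"sampleCount\":%lu}", (unsigned long)sampleCount);
        // with WITH_ACK the call waits for the cloud acknowledgement (or times out)
        bool acked = Particle.publish("v2-dev-stability", data, PRIVATE | WITH_ACK);
        Serial.printlnf("sample %lu acked=%d", (unsigned long)sampleCount, acked);
    }
    sampleCount++;
    delay(10000);
}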

2 Likes

@ScruffR thanks for the hint - wasn’t aware that the entire program stops running if there is no connection.

Test results from 9th of February:
This night it was the other way around: I received more messages in the cloud than I sent from the device.
My first guess was that my code on the Azure cloud went crazy. However, I ran the test with two Electron devices in parallel. The error happened to both, but not at the same time…

public Task ProcessEventsAsync(PartitionContext context, IEnumerable<EventData> messages) {
    foreach (var eventData in messages) {
        // Decode the raw event payload forwarded by the Particle -> IoT Hub integration
        var data = Encoding.UTF8.GetString(eventData.Body.Array, eventData.Body.Offset, eventData.Body.Count);
        Console.WriteLine($"Message received. Partition: '{context.PartitionId}', Data: '{data}'");

        // Deserialize and append one line per sample to a per-device log file
        v2_dev_stability msg = JsonConvert.DeserializeObject<v2_dev_stability>(data);
        using (StreamWriter w = System.IO.File.AppendText("azureEvents_" + msg.DeviceId + ".log")) {
            w.WriteLine(String.Format("[{0}] rssi={1}, sampleCount={2}, rssiDurationMs={3}", msg.PublishedTimestamp.ToLocalTime().ToString(), msg.CellularRSSI, msg.SampleCount, msg.CellularRSSIDurationMs));
        }
    }
    return context.CheckpointAsync();
}

The Azure Metrics do not show any error:


However, two spikes around 3:45 AM and 4:55 AM are visible:

Particle Cloud Logs (from Console)
There are 14 errors in the history but I can’t see what the error actually was. @Particle: please help :worried:

A) Electron-0001
Errors happened at 3:37 AM
A1) Device Event Logs (from Serial console)
Timestamp is the local PC time from Teraterm

[03:37:21] rssi=-91, sampleCount=4314, rssiDurationMs=20, cellularReady=1, eventPublished=1
[03:37:31] rssi=-91, sampleCount=4315, rssiDurationMs=20, cellularReady=1, eventPublished=1
[03:37:42] rssi=-91, sampleCount=4316, rssiDurationMs=20, cellularReady=1, eventPublished=1
[03:37:52] rssi=-91, sampleCount=4317, rssiDurationMs=20, cellularReady=1, eventPublished=1
[03:38:02] rssi=-91, sampleCount=4318, rssiDurationMs=20, cellularReady=1, eventPublished=1
[03:38:12] rssi=-91, sampleCount=4319, rssiDurationMs=20, cellularReady=1, eventPublished=1
[03:38:23] rssi=-91, sampleCount=4320, rssiDurationMs=20, cellularReady=1, eventPublished=1
[03:38:33] rssi=-91, sampleCount=4321, rssiDurationMs=20, cellularReady=1, eventPublished=1
[03:38:43] rssi=-91, sampleCount=4322, rssiDurationMs=20, cellularReady=1, eventPublished=1
[03:38:53] rssi=-91, sampleCount=4323, rssiDurationMs=20, cellularReady=1, eventPublished=1
[03:39:04] rssi=-91, sampleCount=4324, rssiDurationMs=20, cellularReady=1, eventPublished=1
[03:39:14] rssi=-91, sampleCount=4325, rssiDurationMs=20, cellularReady=1, eventPublished=1
[03:39:24] rssi=-91, sampleCount=4326, rssiDurationMs=20, cellularReady=1, eventPublished=1
[03:39:34] rssi=-91, sampleCount=4327, rssiDurationMs=20, cellularReady=1, eventPublished=1
[03:39:44] rssi=-91, sampleCount=4328, rssiDurationMs=20, cellularReady=1, eventPublished=1
[03:39:55] rssi=-91, sampleCount=4329, rssiDurationMs=20, cellularReady=1, eventPublished=1
[03:40:05] rssi=-91, sampleCount=4330, rssiDurationMs=20, cellularReady=1, eventPublished=1

A2) Azure Events Logs (Received messages from IoT Hub)
Some messages are received multiple times in the Azure IoT Hub. Timestamp is “published_at” from Particle Cloud IoT Hub integration.

[03:37:32] rssi=-91, sampleCount=4314, rssiDurationMs=20
[03:37:42] rssi=-91, sampleCount=4315, rssiDurationMs=20
[03:37:52] rssi=-91, sampleCount=4316, rssiDurationMs=20
[03:38:02] rssi=-91, sampleCount=4317, rssiDurationMs=20
[03:37:42] rssi=-91, sampleCount=4315, rssiDurationMs=20
[03:38:13] rssi=-91, sampleCount=4318, rssiDurationMs=20
[03:37:52] rssi=-91, sampleCount=4316, rssiDurationMs=20
[03:38:23] rssi=-91, sampleCount=4319, rssiDurationMs=20
[03:38:33] rssi=-91, sampleCount=4320, rssiDurationMs=20
[03:38:43] rssi=-91, sampleCount=4321, rssiDurationMs=20
[03:38:54] rssi=-91, sampleCount=4322, rssiDurationMs=20
[03:39:04] rssi=-91, sampleCount=4323, rssiDurationMs=20
[03:39:14] rssi=-91, sampleCount=4324, rssiDurationMs=20
[03:38:43] rssi=-91, sampleCount=4321, rssiDurationMs=20
[03:38:54] rssi=-91, sampleCount=4322, rssiDurationMs=20
[03:39:24] rssi=-91, sampleCount=4325, rssiDurationMs=20
[03:39:04] rssi=-91, sampleCount=4323, rssiDurationMs=20
[03:39:35] rssi=-91, sampleCount=4326, rssiDurationMs=20
[03:38:43] rssi=-91, sampleCount=4321, rssiDurationMs=20
[03:39:14] rssi=-91, sampleCount=4324, rssiDurationMs=20
[03:39:45] rssi=-91, sampleCount=4327, rssiDurationMs=20
[03:39:55] rssi=-91, sampleCount=4328, rssiDurationMs=20
[03:40:05] rssi=-91, sampleCount=4329, rssiDurationMs=20
[03:40:16] rssi=-91, sampleCount=4330, rssiDurationMs=20

B) Electron-0004
Errors happened at 4:57 AM
B1) Device Event Logs (from Serial console)
Timestamp is the local PC time from Teraterm
[04:56:28] rssi=-83, sampleCount=4766, rssiDurationMs=20, cellularReady=1, eventPublished=1
[04:56:39] rssi=-83, sampleCount=4767, rssiDurationMs=20, cellularReady=1, eventPublished=1
[04:56:49] rssi=-83, sampleCount=4768, rssiDurationMs=20, cellularReady=1, eventPublished=1
[04:56:59] rssi=-83, sampleCount=4769, rssiDurationMs=20, cellularReady=1, eventPublished=1
[04:57:09] rssi=-83, sampleCount=4770, rssiDurationMs=20, cellularReady=1, eventPublished=1
[04:57:20] rssi=-83, sampleCount=4771, rssiDurationMs=20, cellularReady=1, eventPublished=1
[04:57:30] rssi=-83, sampleCount=4772, rssiDurationMs=20, cellularReady=1, eventPublished=1
[04:57:40] rssi=-83, sampleCount=4773, rssiDurationMs=20, cellularReady=1, eventPublished=1
[04:57:50] rssi=-83, sampleCount=4774, rssiDurationMs=20, cellularReady=1, eventPublished=1

B2) Azure Events Logs (Received messages from IoT Hub)
[04:56:39] rssi=-83, sampleCount=4766, rssiDurationMs=20
[04:56:49] rssi=-83, sampleCount=4767, rssiDurationMs=20
[04:57:00] rssi=-83, sampleCount=4768, rssiDurationMs=20
[04:57:10] rssi=-83, sampleCount=4769, rssiDurationMs=20
[04:56:49] rssi=-83, sampleCount=4767, rssiDurationMs=20
[04:57:20] rssi=-83, sampleCount=4770, rssiDurationMs=20
[04:57:30] rssi=-83, sampleCount=4771, rssiDurationMs=20
[04:57:41] rssi=-83, sampleCount=4772, rssiDurationMs=20
[04:58:01] rssi=-83, sampleCount=4774, rssiDurationMs=20
[04:57:51] rssi=-83, sampleCount=4773, rssiDurationMs=20
[04:58:11] rssi=-83, sampleCount=4775, rssiDurationMs=20
[04:58:21] rssi=-83, sampleCount=4776, rssiDurationMs=20
[04:57:51] rssi=-83, sampleCount=4773, rssiDurationMs=20
[04:58:01] rssi=-83, sampleCount=4774, rssiDurationMs=20
[04:58:32] rssi=-83, sampleCount=4777, rssiDurationMs=20
[04:58:11] rssi=-83, sampleCount=4775, rssiDurationMs=20
[04:58:42] rssi=-83, sampleCount=4778, rssiDurationMs=20
[04:58:52] rssi=-83, sampleCount=4779, rssiDurationMs=20
[04:59:02] rssi=-83, sampleCount=4780, rssiDurationMs=20

Another 66 errors, and I have no clue why because the particle console does not provide the error logs… :disappointed_relieved:

Quick update from my side:
Particle support told me that the Particle WebHooks seems to have a 5s timeout.
However, as far as I know, Azure does not guarantee that they can fulfill the 5s latency.

Proposal:
Can Particle increase the Azure IoT Hub webhook timeout to 10 s?

Microsoft has advised that we cannot count on a specific latency as it is not included in their SLA. The point here is that if you are designing an IoT solution, make sure that you do not assume any set latency as this time could fluctuate between hundreds of milliseconds and 10 sec.

This supports my assumption that the Particle webhook retries sending the message before it has actually received a response from Azure, which results in the same message being dispatched twice.

B2) Azure Events Logs (Received messages from IoT Hub)
[04:56:39] rssi=-83, sampleCount=4766, rssiDurationMs=20
> [04:56:49] rssi=-83, sampleCount=4767, rssiDurationMs=20
[04:57:00] rssi=-83, sampleCount=4768, rssiDurationMs=20
[04:57:10] rssi=-83, sampleCount=4769, rssiDurationMs=20
> [04:56:49] rssi=-83, sampleCount=4767, rssiDurationMs=20
[04:57:20] rssi=-83, sampleCount=4770, rssiDurationMs=20

3 Likes

The logs now prove that the double dispatching is caused by webhook timeouts.
I streamed all product events to my desktop, so I could analyze the failures with curl:
curl https://api.particle.io/v1/devices/events?access_token=xyz

Having 123 webhook failures clearly shows that it needs to be investigated:

Test setup:
The devices publish the v2-dev-stability event every 10 s with an incrementing sample counter (see code above). A console application consumes the events from the Azure IoT Hub.
The interesting part is that while one device was affected by double dispatching, another device was totally fine during that timeframe.

  • two Electrons in Europe (3e0054000d51353432393339 and 3a004a000e51353432393339)
  • one Electron in Singapore (31005e000551353431383736)

A) Particle Console Event Log:

event: v2-dev-stability
data: {"data":"{"rssi":-83,"sampleCount":12215,"rssiDurationMs":20}","ttl":60,"published_at":"2018-02-14T17:18:40.315Z","coreid":"3a004a000e51353432393339"}

event: hook-sent/v2-dev-stability
data: {"data":"","ttl":60,"published_at":"2018-02-14T17:18:40.350Z","coreid":"particle-internal"}

event: v2-dev-stability
data: {"data":"{"rssi":-75,"sampleCount":12193,"rssiDurationMs":20}","ttl":60,"published_at":"2018-02-14T17:18:43.171Z","coreid":"3e0054000d51353432393339"}

event: hook-sent/v2-dev-stability
data: {"data":"","ttl":60,"published_at":"2018-02-14T17:18:43.239Z","coreid":"particle-internal"}

event: hook-response/v2-dev-stability/0
data: {"data":"success","ttl":60,"published_at":"2018-02-14T17:18:43.903Z","coreid":"particle-internal"}

event: hook-error/v2-dev-stability/0
data: {"data":"ESOCKETTIMEDOUT","ttl":60,"published_at":"2018-02-14T17:18:45.693Z","coreid":"particle-internal"}

event: v2-dev-stability
data: {"data":"{"rssi":-69,"sampleCount":4219,"rssiDurationMs":20}","ttl":60,"published_at":"2018-02-14T17:18:48.734Z","coreid":"31005e000551353431383736"}

event: hook-sent/v2-dev-stability
data: {"data":"","ttl":60,"published_at":"2018-02-14T17:18:48.769Z","coreid":"particle-internal"}

event: hook-response/v2-dev-stability/0
data: {"data":"success","ttl":60,"published_at":"2018-02-14T17:18:49.144Z","coreid":"particle-internal"}

event: v2-dev-stability
data: {"data":"{"rssi":-83,"sampleCount":12216,"rssiDurationMs":20}","ttl":60,"published_at":"2018-02-14T17:18:50.550Z","coreid":"3a004a000e51353432393339"}

event: hook-sent/v2-dev-stability
data: {"data":"","ttl":60,"published_at":"2018-02-14T17:18:50.579Z","coreid":"particle-internal"}

event: v2-dev-stability
data: {"data":"{"rssi":-75,"sampleCount":12194,"rssiDurationMs":20}","ttl":60,"published_at":"2018-02-14T17:18:53.438Z","coreid":"3e0054000d51353432393339"}

event: hook-sent/v2-dev-stability
data: {"data":"","ttl":60,"published_at":"2018-02-14T17:18:53.460Z","coreid":"particle-internal"}

event: hook-response/v2-dev-stability/0
data: {"data":"success","ttl":60,"published_at":"2018-02-14T17:18:54.187Z","coreid":"particle-internal"}

event: hook-error/v2-dev-stability/0
data: {"data":"ESOCKETTIMEDOUT","ttl":60,"published_at":"2018-02-14T17:18:55.831Z","coreid":"particle-internal"}

event: v2-dev-stability
data: {"data":"{"rssi":-69,"sampleCount":4220,"rssiDurationMs":20}","ttl":60,"published_at":"2018-02-14T17:18:58.944Z","coreid":"31005e000551353431383736"}

event: hook-sent/v2-dev-stability
data: {"data":"","ttl":60,"published_at":"2018-02-14T17:18:58.983Z","coreid":"particle-internal"}

event: hook-response/v2-dev-stability/0
data: {"data":"success","ttl":60,"published_at":"2018-02-14T17:18:59.427Z","coreid":"particle-internal"}

event: v2-dev-stability
data: {"data":"{"rssi":-83,"sampleCount":12217,"rssiDurationMs":20}","ttl":60,"published_at":"2018-02-14T17:19:00.791Z","coreid":"3a004a000e51353432393339"}

event: hook-sent/v2-dev-stability
data: {"data":"","ttl":60,"published_at":"2018-02-14T17:19:00.817Z","coreid":"particle-internal"}

event: v2-dev-stability
data: {"data":"{"rssi":-75,"sampleCount":12195,"rssiDurationMs":20}","ttl":60,"published_at":"2018-02-14T17:19:03.674Z","coreid":"3e0054000d51353432393339"}

event: hook-sent/v2-dev-stability
data: {"data":"","ttl":60,"published_at":"2018-02-14T17:19:03.786Z","coreid":"particle-internal"}

event: hook-response/v2-dev-stability/0
data: {"data":"success","ttl":60,"published_at":"2018-02-14T17:19:04.140Z","coreid":"particle-internal"}

event: hook-error/v2-dev-stability/0
data: {"data":"ESOCKETTIMEDOUT","ttl":60,"published_at":"2018-02-14T17:19:06.159Z","coreid":"particle-internal"}

event: v2-dev-stability
data: {"data":"{"rssi":-69,"sampleCount":4221,"rssiDurationMs":20}","ttl":60,"published_at":"2018-02-14T17:19:09.215Z","coreid":"31005e000551353431383736"}

event: hook-sent/v2-dev-stability
data: {"data":"","ttl":60,"published_at":"2018-02-14T17:19:09.264Z","coreid":"particle-internal"}

event: hook-response/v2-dev-stability/0
data: {"data":"success","ttl":60,"published_at":"2018-02-14T17:19:09.828Z","coreid":"particle-internal"}

B) Azure Event Log for Device ID 3a004a000e51353432393339
Let's have a look at sampleCount 12215, 12216, 12217, 12218, 12219 of device ID 3a004a000e51353432393339 which are obviously double dispatched due to the ESOCKETTIMEDOUT above:

[14.02.2018 18:17:49] rssi=-81, sampleCount=12210, rssiDurationMs=20
[14.02.2018 18:17:59] rssi=-83, sampleCount=12211, rssiDurationMs=20
[14.02.2018 18:18:09] rssi=-83, sampleCount=12212, rssiDurationMs=20
[14.02.2018 18:18:19] rssi=-81, sampleCount=12213, rssiDurationMs=20
[14.02.2018 18:18:30] rssi=-83, sampleCount=12214, rssiDurationMs=20
[14.02.2018 18:18:50] rssi=-83, sampleCount=12216, rssiDurationMs=20
[14.02.2018 18:18:40] rssi=-83, sampleCount=12215, rssiDurationMs=20
[14.02.2018 18:18:50] rssi=-83, sampleCount=12216, rssiDurationMs=20
[14.02.2018 18:19:00] rssi=-83, sampleCount=12217, rssiDurationMs=20
[14.02.2018 18:19:11] rssi=-83, sampleCount=12218, rssiDurationMs=20
[14.02.2018 18:19:21] rssi=-83, sampleCount=12219, rssiDurationMs=20
[14.02.2018 18:19:31] rssi=-83, sampleCount=12220, rssiDurationMs=20
[14.02.2018 18:18:40] rssi=-83, sampleCount=12215, rssiDurationMs=20
[14.02.2018 18:19:11] rssi=-83, sampleCount=12218, rssiDurationMs=20
[14.02.2018 18:19:41] rssi=-83, sampleCount=12221, rssiDurationMs=20
[14.02.2018 18:19:21] rssi=-83, sampleCount=12219, rssiDurationMs=20
[14.02.2018 18:19:52] rssi=-81, sampleCount=12222, rssiDurationMs=20
[14.02.2018 18:19:00] rssi=-83, sampleCount=12217, rssiDurationMs=20
[14.02.2018 18:20:02] rssi=-83, sampleCount=12223, rssiDurationMs=20
[14.02.2018 18:20:12] rssi=-83, sampleCount=12224, rssiDurationMs=20
[14.02.2018 18:20:22] rssi=-83, sampleCount=12225, rssiDurationMs=20
[14.02.2018 18:20:32] rssi=-83, sampleCount=12226, rssiDurationMs=20
[14.02.2018 18:20:43] rssi=-83, sampleCount=12227, rssiDurationMs=20
[14.02.2018 18:20:53] rssi=-83, sampleCount=12228, rssiDurationMs=20
[14.02.2018 18:21:03] rssi=-83, sampleCount=12229, rssiDurationMs=20
[14.02.2018 18:21:13] rssi=-83, sampleCount=12230, rssiDurationMs=20
[14.02.2018 18:21:24] rssi=-83, sampleCount=12231, rssiDurationMs=20

C) Azure Event Log for Device ID 3e0054000d51353432393339
Everything is fine and no double dispatching of messages.

[14.02.2018 18:17:41] rssi=-75, sampleCount=12187, rssiDurationMs=20
[14.02.2018 18:17:51] rssi=-75, sampleCount=12188, rssiDurationMs=20
[14.02.2018 18:18:02] rssi=-75, sampleCount=12189, rssiDurationMs=20
[14.02.2018 18:18:12] rssi=-75, sampleCount=12190, rssiDurationMs=20
[14.02.2018 18:18:22] rssi=-75, sampleCount=12191, rssiDurationMs=20
[14.02.2018 18:18:32] rssi=-75, sampleCount=12192, rssiDurationMs=20
[14.02.2018 18:18:43] rssi=-75, sampleCount=12193, rssiDurationMs=20
[14.02.2018 18:18:53] rssi=-75, sampleCount=12194, rssiDurationMs=20
[14.02.2018 18:19:03] rssi=-75, sampleCount=12195, rssiDurationMs=20
[14.02.2018 18:19:13] rssi=-75, sampleCount=12196, rssiDurationMs=20
[14.02.2018 18:19:24] rssi=-75, sampleCount=12197, rssiDurationMs=20
[14.02.2018 18:19:34] rssi=-75, sampleCount=12198, rssiDurationMs=20
[14.02.2018 18:19:44] rssi=-75, sampleCount=12199, rssiDurationMs=20
[14.02.2018 18:19:54] rssi=-75, sampleCount=12200, rssiDurationMs=20
[14.02.2018 18:20:05] rssi=-75, sampleCount=12201, rssiDurationMs=20
[14.02.2018 18:20:15] rssi=-75, sampleCount=12202, rssiDurationMs=20

@Dave Please help! Can you attempt to increase the Azure webhook timeout to 10 s?

2 Likes

I know @jeiden was part of this integration so he may have some input on what you’re seeing.

Heya @mguntli,

Thanks for the ping and the research you’ve put in here! We’re talking about this internally. While the webhook system isn’t trying to avoid double delivery in this scenario, I think a longer timeout would be a simple and helpful way to cut down on this sort of thing, but there are pros and cons for us in changing the timeout in terms of how that impacts capacity. We’ll post here when we know more! :slight_smile:

Thanks,
David

6 Likes

@mguntli – Jeff here, a product manager at Particle. First off, thank you very much for the detailed report and investigation into the issue. It’s so awesome to get this depth of information about a concern with our platform, together with a clear, reasonable ask of our team. That’s the way to get us to act swiftly!

I wanted to let you know that there’s already a fix in the works to bump up timeouts. The specifics are still being hammered out, but it’s likely to be shipped sometime over the next few weeks.

Thanks again, and we will follow up with this thread to communicate progress.

Best,

Jeff

4 Likes

News update after the weekend test:
We are now again dealing with the problem that we lose certain data samples.
I am mostly worried about the “sleep” errors of the Particle Azure webhook :thinking:

The corresponding chart of IoT Hub events looks like this, with three major dips (24h between first and second, 12h between second and third).

Unfortunately my curl logging process terminated before grabbing any useful log from the Particle Cloud, so we have now extended our stability test logging setup (each sample is identified by a sampleCount):

  1. Log events from the Particle firmware through USB serial to electron.log
  2. Log events from the Particle Cloud (including webhook status) to particleEvents.log
  3. Log events received on the Azure IoT Hub to iotHubEvents.log
  4. Detect duplicated and missing samples on the end-user web site (TypeScript debugger output)

Hint regarding double dispatching:
We solved the “double dispatching” problem in Azure with Service Bus “duplicate detection”.
If the same message is posted twice (identified by setting the Message.MessageId field to a hash of the payload), Service Bus will filter it out.

Next steps:
The problem is now reduced to finding the root cause of the missing packets (which I also believe is due to timeouts from Particle -> Azure IoT Hub).

1 Like

Hey all

I'm working together with @mguntli, and here are a few more logs from our stability test. Maybe they can help.
The test setup is described above.

Having 1 webhook failure

We lost a message from deviceId 3a004a000e51353432393339 with sampleCount 12218. Unfortunately there are no hook-error logs from curl on the Particle Cloud. Message 12217 was published at 01:45:23 and the next one (12219) at 01:47:57, but there is no 12218 message.

A) Particle Console Event Log

event: v2-dev-stability
data: {"data":"{"rssi":-77,"sampleCount":12217,"rssiDurationMs":20}","ttl":60,"published_at":"2018-02-21T01:45:23.292Z","coreid":"3a004a000e51353432393339"}

event: hook-sent/v2-dev-stability
data: {"data":"","ttl":60,"published_at":"2018-02-21T01:45:23.321Z","coreid":"particle-internal"}

event: hook-response/v2-dev-stability/0
data: {"data":"success","ttl":60,"published_at":"2018-02-21T01:45:23.741Z","coreid":"particle-internal"}

event: v2-dev-stability
data: {"data":"{"rssi":-75,"sampleCount":12233,"rssiDurationMs":20}","ttl":60,"published_at":"2018-02-21T01:45:24.308Z","coreid":"3e0054000d51353432393339"}

event: hook-sent/v2-dev-stability
data: {"data":"","ttl":60,"published_at":"2018-02-21T01:45:24.335Z","coreid":"particle-internal"}

event: hook-response/v2-dev-stability/0
data: {"data":"success","ttl":60,"published_at":"2018-02-21T01:45:24.904Z","coreid":"particle-internal"}

event: v2-dev-stability
data: {"data":"{"rssi":-75,"sampleCount":12234,"rssiDurationMs":20}","ttl":60,"published_at":"2018-02-21T01:45:34.528Z","coreid":"3e0054000d51353432393339"}

event: hook-sent/v2-dev-stability
data: {"data":"","ttl":60,"published_at":"2018-02-21T01:45:34.569Z","coreid":"particle-internal"}

event: hook-response/v2-dev-stability/0
data: {"data":"success","ttl":60,"published_at":"2018-02-21T01:45:35.361Z","coreid":"particle-internal"}

event: v2-dev-stability
data: {"data":"{"rssi":-75,"sampleCount":12235,"rssiDurationMs":20}","ttl":60,"published_at":"2018-02-21T01:45:44.817Z","coreid":"3e0054000d51353432393339"}

event: hook-sent/v2-dev-stability
data: {"data":"","ttl":60,"published_at":"2018-02-21T01:45:44.840Z","coreid":"particle-internal"}

event: hook-response/v2-dev-stability/0
data: {"data":"success","ttl":60,"published_at":"2018-02-21T01:45:45.465Z","coreid":"particle-internal"}

event: v2-dev-stability
data: {"data":"{"rssi":-75,"sampleCount":12236,"rssiDurationMs":20}","ttl":60,"published_at":"2018-02-21T01:45:55.032Z","coreid":"3e0054000d51353432393339"}

event: hook-sent/v2-dev-stability
data: {"data":"","ttl":60,"published_at":"2018-02-21T01:45:55.082Z","coreid":"particle-internal"}

event: hook-response/v2-dev-stability/0
data: {"data":"success","ttl":60,"published_at":"2018-02-21T01:45:55.473Z","coreid":"particle-internal"}

event: v2-dev-stability
data: {"data":"{"rssi":-75,"sampleCount":12237,"rssiDurationMs":20}","ttl":60,"published_at":"2018-02-21T01:46:05.272Z","coreid":"3e0054000d51353432393339"}

event: hook-sent/v2-dev-stability
data: {"data":"","ttl":60,"published_at":"2018-02-21T01:46:05.357Z","coreid":"particle-internal"}

event: hook-response/v2-dev-stability/0
data: {"data":"success","ttl":60,"published_at":"2018-02-21T01:46:05.832Z","coreid":"particle-internal"}

event: v2-dev-stability
data: {"data":"{"rssi":-75,"sampleCount":12238,"rssiDurationMs":20}","ttl":60,"published_at":"2018-02-21T01:46:15.507Z","coreid":"3e0054000d51353432393339"}

event: hook-sent/v2-dev-stability
data: {"data":"","ttl":60,"published_at":"2018-02-21T01:46:15.610Z","coreid":"particle-internal"}

event: hook-response/v2-dev-stability/0
data: {"data":"success","ttl":60,"published_at":"2018-02-21T01:46:16.108Z","coreid":"particle-internal"}

event: v2-dev-stability
data: {"data":"{"rssi":-75,"sampleCount":12239,"rssiDurationMs":20}","ttl":60,"published_at":"2018-02-21T01:46:25.755Z","coreid":"3e0054000d51353432393339"}

event: hook-sent/v2-dev-stability
data: {"data":"","ttl":60,"published_at":"2018-02-21T01:46:25.782Z","coreid":"particle-internal"}

event: hook-response/v2-dev-stability/0
data: {"data":"success","ttl":60,"published_at":"2018-02-21T01:46:26.373Z","coreid":"particle-internal"}

event: v2-dev-stability
data: {"data":"{"rssi":-75,"sampleCount":12240,"rssiDurationMs":20}","ttl":60,"published_at":"2018-02-21T01:46:35.993Z","coreid":"3e0054000d51353432393339"}

event: hook-sent/v2-dev-stability
data: {"data":"","ttl":60,"published_at":"2018-02-21T01:46:36.117Z","coreid":"particle-internal"}

event: hook-response/v2-dev-stability/0
data: {"data":"success","ttl":60,"published_at":"2018-02-21T01:46:36.695Z","coreid":"particle-internal"}

event: v2-dev-stability
data: {"data":"{"rssi":-75,"sampleCount":12241,"rssiDurationMs":20}","ttl":60,"published_at":"2018-02-21T01:46:46.257Z","coreid":"3e0054000d51353432393339"}

event: hook-sent/v2-dev-stability
data: {"data":"","ttl":60,"published_at":"2018-02-21T01:46:46.287Z","coreid":"particle-internal"}

event: hook-response/v2-dev-stability/0
data: {"data":"success","ttl":60,"published_at":"2018-02-21T01:46:46.885Z","coreid":"particle-internal"}

event: v2-dev-stability
data: {"data":"{"rssi":-75,"sampleCount":12242,"rssiDurationMs":20}","ttl":60,"published_at":"2018-02-21T01:46:56.496Z","coreid":"3e0054000d51353432393339"}

event: hook-sent/v2-dev-stability
data: {"data":"","ttl":60,"published_at":"2018-02-21T01:46:56.618Z","coreid":"particle-internal"}

event: hook-response/v2-dev-stability/0
data: {"data":"success","ttl":60,"published_at":"2018-02-21T01:46:57.262Z","coreid":"particle-internal"}

event: v2-dev-stability
data: {"data":"{"rssi":-75,"sampleCount":12243,"rssiDurationMs":20}","ttl":60,"published_at":"2018-02-21T01:47:06.747Z","coreid":"3e0054000d51353432393339"}

event: hook-sent/v2-dev-stability
data: {"data":"","ttl":60,"published_at":"2018-02-21T01:47:06.772Z","coreid":"particle-internal"}

event: hook-response/v2-dev-stability/0
data: {"data":"success","ttl":60,"published_at":"2018-02-21T01:47:07.144Z","coreid":"particle-internal"}

event: v2-dev-stability
data: {"data":"{"rssi":-75,"sampleCount":12244,"rssiDurationMs":20}","ttl":60,"published_at":"2018-02-21T01:47:16.977Z","coreid":"3e0054000d51353432393339"}

event: hook-sent/v2-dev-stability
data: {"data":"","ttl":60,"published_at":"2018-02-21T01:47:17.009Z","coreid":"particle-internal"}

event: hook-response/v2-dev-stability/0
data: {"data":"success","ttl":60,"published_at":"2018-02-21T01:47:17.467Z","coreid":"particle-internal"}

event: v2-dev-stability
data: {"data":"{"rssi":-75,"sampleCount":12245,"rssiDurationMs":20}","ttl":60,"published_at":"2018-02-21T01:47:27.209Z","coreid":"3e0054000d51353432393339"}

event: hook-sent/v2-dev-stability
data: {"data":"","ttl":60,"published_at":"2018-02-21T01:47:27.234Z","coreid":"particle-internal"}

event: hook-response/v2-dev-stability/0
data: {"data":"success","ttl":60,"published_at":"2018-02-21T01:47:28.074Z","coreid":"particle-internal"}

event: v2-dev-stability
data: {"data":"{"rssi":-75,"sampleCount":12246,"rssiDurationMs":20}","ttl":60,"published_at":"2018-02-21T01:47:37.428Z","coreid":"3e0054000d51353432393339"}

event: hook-sent/v2-dev-stability
data: {"data":"","ttl":60,"published_at":"2018-02-21T01:47:37.458Z","coreid":"particle-internal"}

event: hook-response/v2-dev-stability/0
data: {"data":"success","ttl":60,"published_at":"2018-02-21T01:47:37.941Z","coreid":"particle-internal"}

event: v2-dev-stability
data: {"data":"{"rssi":-75,"sampleCount":12247,"rssiDurationMs":20}","ttl":60,"published_at":"2018-02-21T01:47:47.709Z","coreid":"3e0054000d51353432393339"}

event: hook-sent/v2-dev-stability
data: {"data":"","ttl":60,"published_at":"2018-02-21T01:47:47.734Z","coreid":"particle-internal"}

event: hook-response/v2-dev-stability/0
data: {"data":"success","ttl":60,"published_at":"2018-02-21T01:47:48.246Z","coreid":"particle-internal"}

event: spark/status
data: {"data":"online","ttl":60,"published_at":"2018-02-21T01:47:55.823Z","coreid":"3a004a000e51353432393339"}

event: v2-dev-stability
data: {"data":"{"rssi":-69,"sampleCount":12219,"rssiDurationMs":30}","ttl":60,"published_at":"2018-02-21T01:47:57.289Z","coreid":"3a004a000e51353432393339"}

event: hook-sent/v2-dev-stability
data: {"data":"","ttl":60,"published_at":"2018-02-21T01:47:57.317Z","coreid":"particle-internal"}

event: hook-response/v2-dev-stability/0
data: {"data":"success","ttl":60,"published_at":"2018-02-21T01:47:57.735Z","coreid":"particle-internal"}

B) Device Log for 3a004a000e51353432393339
Here we can see that message 12218 was sent to the Particle Cloud.

[2018-02-21 02:44:42.150] rssi=-77, sampleCount=12214, rssiDurationMs=20, eventPublished=1
[2018-02-21 02:44:52.400] rssi=-77, sampleCount=12215, rssiDurationMs=20, eventPublished=1
[2018-02-21 02:45:02.625] rssi=-77, sampleCount=12216, rssiDurationMs=20, eventPublished=1
[2018-02-21 02:45:12.873] rssi=-77, sampleCount=12217, rssiDurationMs=20, eventPublished=1
[2018-02-21 02:45:23.108] rssi=-77, sampleCount=12218, rssiDurationMs=20, eventPublished=1
[2018-02-21 02:45:33.346] rssi=-77, sampleCount=12219, rssiDurationMs=20, eventPublished=1
[2018-02-21 02:47:57.398] rssi=-69, sampleCount=12220, rssiDurationMs=30, eventPublished=1
[2018-02-21 02:48:07.676] rssi=-69, sampleCount=12221, rssiDurationMs=30, eventPublished=1
[2018-02-21 02:48:17.947] rssi=-69, sampleCount=12222, rssiDurationMs=30, eventPublished=1
[2018-02-21 02:48:28.212] rssi=-69, sampleCount=12223, rssiDurationMs=30, eventPublished=1
[2018-02-21 02:48:38.481] rssi=-69, sampleCount=12224, rssiDurationMs=30, eventPublished=1

C) Azure Event Log for Device ID 3a004a000e51353432393339 and 3e0054000d51353432393339

[2018-02-21T01:45:23.292Z] device_id=3a004a000e51353432393339, rssiDurationMs=20, sampleCount=12217, rssi=-77
[2018-02-21T01:45:24.308Z] device_id=3e0054000d51353432393339, rssiDurationMs=20, sampleCount=12233, rssi=-75
[2018-02-21T01:45:34.528Z] device_id=3e0054000d51353432393339, rssiDurationMs=20, sampleCount=12234, rssi=-75
[2018-02-21T01:45:44.817Z] device_id=3e0054000d51353432393339, rssiDurationMs=20, sampleCount=12235, rssi=-75
[2018-02-21T01:45:55.032Z] device_id=3e0054000d51353432393339, rssiDurationMs=20, sampleCount=12236, rssi=-75
[2018-02-21T01:46:05.272Z] device_id=3e0054000d51353432393339, rssiDurationMs=20, sampleCount=12237, rssi=-75
[2018-02-21T01:46:15.507Z] device_id=3e0054000d51353432393339, rssiDurationMs=20, sampleCount=12238, rssi=-75
[2018-02-21T01:46:25.755Z] device_id=3e0054000d51353432393339, rssiDurationMs=20, sampleCount=12239, rssi=-75
[2018-02-21T01:46:35.993Z] device_id=3e0054000d51353432393339, rssiDurationMs=20, sampleCount=12240, rssi=-75
[2018-02-21T01:46:46.257Z] device_id=3e0054000d51353432393339, rssiDurationMs=20, sampleCount=12241, rssi=-75
[2018-02-21T01:46:56.496Z] device_id=3e0054000d51353432393339, rssiDurationMs=20, sampleCount=12242, rssi=-75
[2018-02-21T01:47:06.747Z] device_id=3e0054000d51353432393339, rssiDurationMs=20, sampleCount=12243, rssi=-75
[2018-02-21T01:47:16.977Z] device_id=3e0054000d51353432393339, rssiDurationMs=20, sampleCount=12244, rssi=-75
[2018-02-21T01:47:27.209Z] device_id=3e0054000d51353432393339, rssiDurationMs=20, sampleCount=12245, rssi=-75
[2018-02-21T01:47:37.428Z] device_id=3e0054000d51353432393339, rssiDurationMs=20, sampleCount=12246, rssi=-75
[2018-02-21T01:47:47.709Z] device_id=3e0054000d51353432393339, rssiDurationMs=20, sampleCount=12247, rssi=-75
[2018-02-21T01:47:57.289Z] device_id=3a004a000e51353432393339, rssiDurationMs=30, sampleCount=12219, rssi=-69

@jeiden any updates regarding the Azure IoT Hub webhook timeouts?
Can you maybe ping your Microsoft contact to find out which maximum latency we should expect?

We are particularly worried about the missing packets.

1 Like