Our team is building a better dashboard for monitoring device connections so that we can spot connectivity issues before they affect end users. We’ve recently flagged two devices that had interesting connections and would love to learn more about what other Particle community members have seen.
What are the main factors that would result in an extended response time while the signal strength and quality are excellent?
What factors would cause the signal quality to be so poor while maintaining a relatively quick response time?
In both cases, the connections shown in the screenshots above resulted in a failed operation for our end users. Any context or information about what affects these values would be appreciated.
First, a quick note: it is best practice to blur out your device number in the forums.
This is an area I am also very interested in. In my approach, I collect connection issues in my code and put an “alerts” webhook into the queue to send on next connection. This way, I can track failures very granularly.
I keep track of:
- cellular connection timed out
- cellular connection time in seconds
- cellular signal strength and quality
- connected to cellular but did not successfully connect to Particle
- connected to Particle but could not send webhook
- sent webhook but no response
- failed to disconnect from Particle
- failed to turn off the cellular modem
I would also like to add the network service provider.
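For anyone who wants a starting point, here is a minimal sketch of how the items in the list above could be represented and serialized before being queued. It is plain C++ with no Device OS calls. The `AlertType` values, the `Alert` fields, and the JSON key names are all made up for illustration and are not taken from my actual firmware.

```cpp
#include <cstdint>
#include <string>

// Hypothetical failure categories, mirroring the list above.
enum class AlertType : uint8_t {
    CellularConnectTimeout,
    CloudConnectFailed,
    WebhookSendFailed,
    WebhookNoResponse,
    CloudDisconnectFailed,
    ModemPowerOffFailed
};

// Short, stable names that a back end could match on.
const char* alertName(AlertType t) {
    switch (t) {
        case AlertType::CellularConnectTimeout: return "cell_connect_timeout";
        case AlertType::CloudConnectFailed:     return "cloud_connect_failed";
        case AlertType::WebhookSendFailed:      return "webhook_send_failed";
        case AlertType::WebhookNoResponse:      return "webhook_no_response";
        case AlertType::CloudDisconnectFailed:  return "cloud_disconnect_failed";
        case AlertType::ModemPowerOffFailed:    return "modem_off_failed";
    }
    return "unknown";
}

// One alert record (field names are assumptions for this sketch).
struct Alert {
    AlertType type;
    uint32_t  timestamp;       // seconds since epoch when the failure occurred
    int       connectSeconds;  // time spent trying to connect to cellular
    int       signalStrength;  // percent, as last measured
    int       signalQuality;   // percent, as last measured
};

// Serialize an alert to a compact JSON string suitable for a webhook payload.
std::string toJson(const Alert& a) {
    return std::string("{\"alert\":\"") + alertName(a.type) +
           "\",\"ts\":" + std::to_string(a.timestamp) +
           ",\"connect_s\":" + std::to_string(a.connectSeconds) +
           ",\"strength\":" + std::to_string(a.signalStrength) +
           ",\"quality\":" + std::to_string(a.signalQuality) + "}";
}
```

Recording the signal strength and quality alongside the failure type is what makes it possible to compare "good signal, slow response" against "poor signal, quick response" cases later on.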
These alerts trigger a message in a Slack channel I monitor. With this approach, I am generally very quick to notice any connectivity errors Particle or its cellular partners might be having.
Hope this helps and please share what you come up with.
I’ve updated the images to protect our devices. Thank you.
We’re moving in the direction of alerts via Slack, but we’re not there yet. How much continuity do you have with your alerts? Are there nearly constant alerts? Or does your firmware only publish when one of your failure conditions is met?
My devices report alert messages via webhook, and my back-end service forwards them to Slack. Despite this two-step process, notifications are received in Slack almost immediately.
My devices are in outdoor locations, so they only connect once an hour to deliver the messages in their queue. If a device fails to connect, its alerts are delivered on the next successful connection. In general, connectivity is fairly reliable (despite remote devices having poor cellular connectivity), so this works well.
Key to this working is the wonderful publishQueuePOSIX-RK library, one of @rickkas7’s many contributions to this community.
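To illustrate the "hold alerts until the next successful connection" idea, here is a rough in-memory sketch in plain C++. It is not the API of the library mentioned above, and it does not survive a reset or power loss. The class name `PendingAlertQueue`, the capacity limit, and the `send` callback are all assumptions for the example.

```cpp
#include <cstddef>
#include <deque>
#include <functional>
#include <string>

// Hypothetical bounded queue of alert payloads waiting for a connection.
class PendingAlertQueue {
public:
    explicit PendingAlertQueue(std::size_t capacity) : capacity_(capacity) {}

    // Store a payload; when the queue is full, the oldest entry is dropped.
    void enqueue(const std::string& payload) {
        if (capacity_ == 0) return;
        if (pending_.size() >= capacity_) pending_.pop_front();
        pending_.push_back(payload);
    }

    // Try to send queued payloads in order. Stops at the first failure so the
    // remaining payloads stay queued for the next connection attempt.
    // Returns the number of payloads sent successfully.
    std::size_t flush(const std::function<bool(const std::string&)>& send) {
        std::size_t sent = 0;
        while (!pending_.empty() && send(pending_.front())) {
            pending_.pop_front();
            ++sent;
        }
        return sent;
    }

    std::size_t size() const { return pending_.size(); }

private:
    std::size_t capacity_;
    std::deque<std::string> pending_;
};
```

On a device, `flush` would be called once the cloud connection is up, with `send` wrapping whatever publish call the firmware uses. Removing an entry only after `send` reports success is what keeps alerts from being lost when a connection drops partway through.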
I was planning to open a thread on this too. I have devices in the US and Canada, and periodically a device will have some very odd connectivity issues. It runs fine for weeks at a time, and then out of the blue there is a period where calling any function or variable fails, or the device only responds to every 10th call or so. I initially thought it was my code, but I have since cleaned it up to the bare minimum, and it sits idle until a function is called via the API. I keep track of uptime, and it never reset during these periods of connection issues. I would really like to better understand what happens in the background that can cause this and what can be done to keep these issues to a bare minimum. The devices are in a static location with grid power. Also, I have been monitoring this for over 2 years and have not seen any patterns.
Any chance someone from the Particle support team can chime in on this? I think it would benefit the community if we could better understand what is happening on the backend that would cause this behavior. It would be awesome to have an error code returned when the issue is caused by cell service or backend services, rather than just a timeout or "failed to return data" message.