WiFi reconnection issue - what does this trace mean?

Currently in the middle of investigating a long-standing issue that others have also found:

Once a WiFi connection is lost, it can only be re-instigated by a reset.

This is bad.

I note that this issue was seemingly solved here: [SOLVED] WiFi Reconnection Issue - Is there a solution? (based on @rickkas7's WiFi State Machine technique).

So, why has it not remained solved for me?

THEORY is that it has something to do with the SPI interface. Am now using the SDFAT 0.7 library, but before was not… More to follow on this as I progress.

In the meantime, to assist with the investigation, can anyone explain what causes the following [hal] and [hal.wlan] traces?

====== ETH cable pulled ==============
0000379927 [hal.wlan] TRACE: connect cancel
0000379928 [hal] TRACE: 20015b1c socket list: 0 active sockets closed
0000379928 [hal] TRACE: 20015b24 socket list: 0 active sockets closed
0000390242 [app] WARN: WiFi.ready()=0 WiFi.localIP()=0  

Thanks!

Which device OS are you using that exhibits this behaviour?

What do you mean by WiFi Connection is lost - AP turned off or out of range or internet connection lost?

It would appear that you have an Argon in an Ethernet FeatherWing. Is that the case? And have you unplugged the cable?

@armor, I should have been more specific!

DeviceOS 1.5.0 on a P1.

The issue is when a connection drops for whatever reason.

I am simulating this by removing the Ethernet cable from the access point (i.e. loss of internet) and also by turning off the access point (loss of WiFi), then reinstating each. The problem for me is that the cloud connection does not come back, even though the access point is operational again.

Am now circling in on my code being at issue (isn't that the case 90+% of the time!) because I resurrected this solution, which works with the tests above:

I will report back once I have more to say. Any insights are always appreciated!

This last weekend I tried 1.5.0 on my devices (Photons with SPI to a TFT and an SD card) and had a number of SPI-related issues, so I have reverted to 1.4.4. I will sit out 1.5.0 until the SPI and connection issues have been solved, or until it is at least clear what changed to invalidate program models that were working and how they need to be remediated.

@armor, I did say that the SPI / SD CARD was a theory in relation to my issue, but too early to be definitive. I note that my issue pre-dates DeviceOS 1.5.0.

Wrt your SPI issues, refer to Updating from DeviceOS 1.4.4 to 1.5.0-rc.2 -> Broken SPI Functionality. There was an issue of Display vs SD CARD in the release candidates, but with DeviceOS 1.5.0 I have not had an issue with these two peripherals, so I am not sure what is going on. I suggest that you raise a separate topic for this.

Happy to report that I have found the reason for the issue:

Once a WiFi connection is lost, it can only be re-instigated by a reset.

It was lack of enough free memory.

It had nothing to do with SPI or erroneous code (although it was solved in code, of course, by reducing memory usage).

This also explains why what once worked no longer did.

I think the interaction with SPI is that the SPI class uses a recursive os mutex. If there is no free memory, allocation fails, the mutex cannot get a lock and a deadlock occurs.

I have seen it before when hardware debugging. The code hangs on a mutex in the app thread, but the real issue is running out of memory, which makes the mutex deadlock.

I have removed any use of the SPI class now and use raw HAL functions instead. If you know how you use SPI, you don’t need all these mutexes in between on every SPI call.

The particle classes are very defensive and oriented at beginners, so in many places in my application I have just looked up what they do and implemented the same using HAL functions without all the bloat.

I prefer managing my own mutex using std::mutex and std::unique_lock<std::mutex> if I need one, to manage ownership and automatic unlocking on destruction.
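For readers unfamiliar with that pattern, here is a minimal sketch of what it can look like. It is only an illustration: `SharedBus`, `transfer` and `tryTransfer` are hypothetical names, not Particle or HAL APIs, and the "bus" is just a counter standing in for real SPI work.

```cpp
#include <cassert>
#include <cstdint>
#include <mutex>

// Hypothetical stand-in for a shared SPI bus; not a Particle API.
struct SharedBus {
    std::mutex mtx;          // guards every field below
    int transfersDone = 0;   // number of completed transfers
    bool busy = false;       // true only while a transfer is in flight
};

// Performs one "transfer" while holding the lock. The unique_lock releases
// the mutex in its destructor, so every return path unlocks automatically.
bool transfer(SharedBus& bus, uint8_t byte) {
    std::unique_lock<std::mutex> lock(bus.mtx);
    assert(!bus.busy);       // nobody else can be mid-transfer here
    bus.busy = true;
    if (byte == 0xFF) {      // hypothetical "invalid byte" early exit
        bus.busy = false;
        return false;        // lock released here
    }
    ++bus.transfersDone;
    bus.busy = false;
    return true;             // and here
}

// Non-blocking variant: gives up instead of waiting if the bus is taken.
bool tryTransfer(SharedBus& bus, uint8_t byte) {
    std::unique_lock<std::mutex> lock(bus.mtx, std::try_to_lock);
    if (!lock.owns_lock()) {
        return false;        // someone else holds the bus right now
    }
    (void)byte;              // payload is unused in this sketch
    ++bus.transfersDone;
    return true;
}
```

The point of `std::unique_lock` is that ownership is tied to scope, so an early return cannot leave the mutex locked. The `std::try_to_lock` variant is one way to avoid blocking forever when the lock cannot be taken.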

@Elco, now that is very interesting indeed! Makes sense.

Have you raised a support ticket for your finding as per @ScruffR comments here: WiFi.ready() == FALSE but WiFi is connected - #3 by ScruffR ?

@avtolstoy, any comment on @Elco's statement:

I think the interaction with SPI is that the SPI class uses a recursive os mutex. If there is no free memory, allocation fails, the mutex cannot get a lock and a deadlock occurs.

Re accessing the HAL functions directly, how is this done? I did not think that this could be done from the application. If so, great!

Yes, I also sent in a ticket and got a reply.

Well, I don’t know how you develop, but I use vscode and makefiles. So I have the entire particle device-os repo as a dependency in the same directory. So I just browse the device-os files to see how they implemented the SPI class (“spark_wiring_spi.h” / “spark_wiring_spi.cpp”) and re-used code from it.

But this is only for experienced embedded software developers that can fully understand what the particle code is doing. This is not documented and particle doesn’t expect you to use it this way.

The documentation often leaves out crucial details, like the exact function prototypes with the type of arguments and possible overloads. So I end up browsing the code anyway.

I think my next hardware revision will not use a Particle device anymore, but a bare ESP32 instead. I don’t use their cloud and the easy-to-use framework often gets in my way more than it helps me. I’m just not their target market. I’m an early adopter that got in during the Spark Core days, when ESP32s were not even a thing. I have the skills to develop on bare metal and don’t need Arduino style libs.

@Elco, I now understand how you are linking to the HAL layers - you are compiling the whole code base, i.e. DeviceOS and Application, and hence have access to all the DeviceOS functions.

I use Particle Dev as I don’t want the overhead of maintaining the environment, so this avenue is not available to me.

I understand your frustration…

I still use the system layers, not a monolithic build. But yes, I go deep into the framework.
Our code is also built in the cloud by Azure and automatically tested and deployed, so that’s why we use makefiles too.

Can you elaborate on what kind of maintenance effort you are anticipating that keeps you from transitioning to Workbench?

@ScruffR, good question. The inertia is there due to convenience (you know, devil you know vs the devil you don’t).

I am by no means versed in using repositories, and so have been assuming that there was an overhead in keeping in sync with the latest DeviceOS incarnations.

From your question I take it that Workbench is the preferred option for serious development. I shall give it a go in a few months.

@no1089,

Thought it appropriate that I respond to the communications that we had under the now closed post:

The issue that I was complaining about in the above post, and this post:

has been (re) solved today.

In short:

  • Implement a WiFi connection state machine
  • Ensure that there is enough heap memory
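For anyone landing here, the two points above can be sketched roughly as follows. This is not the code from the linked threads, just a minimal illustration with the Device OS calls factored out so the logic can be exercised off-device. All names (`NetMachine`, `NetInputs`, `NetAction`) and the 10 KB / 30 s figures are assumptions for illustration.

```cpp
#include <cstdint>

// States of a hypothetical reconnection state machine.
enum class NetState { Disconnected, WifiConnecting, CloudConnecting, Connected };

// Inputs sampled on each pass. On a device these could come from
// WiFi.ready(), Particle.connected(), System.freeMemory() and millis().
struct NetInputs {
    bool wifiReady;
    bool cloudConnected;
    uint32_t freeHeap;   // bytes
    uint32_t nowMs;      // monotonic milliseconds
};

// What the caller should do after a step.
enum class NetAction { None, StartWifi, StartCloud, DropWifi };

struct NetMachine {
    NetState state = NetState::Disconnected;
    uint32_t enteredMs = 0;
    // Assumed tuning values, not figures from Particle documentation.
    uint32_t minFreeHeap = 10 * 1024;
    uint32_t timeoutMs = 30000;

    NetAction step(const NetInputs& in) {
        switch (state) {
        case NetState::Disconnected:
            // Do not start a connection attempt while heap is short.
            if (in.freeHeap < minFreeHeap) return NetAction::None;
            enter(NetState::WifiConnecting, in.nowMs);
            return NetAction::StartWifi;
        case NetState::WifiConnecting:
            if (in.wifiReady) {
                enter(NetState::CloudConnecting, in.nowMs);
                return NetAction::StartCloud;
            }
            return timedOut(in) ? giveUp(in) : NetAction::None;
        case NetState::CloudConnecting:
            if (in.cloudConnected) {
                enter(NetState::Connected, in.nowMs);
                return NetAction::None;
            }
            if (!in.wifiReady || timedOut(in)) return giveUp(in);
            return NetAction::None;
        case NetState::Connected:
            // Any loss of WiFi or cloud restarts the whole sequence.
            if (!in.wifiReady || !in.cloudConnected) return giveUp(in);
            return NetAction::None;
        }
        return NetAction::None;
    }

private:
    void enter(NetState s, uint32_t now) { state = s; enteredMs = now; }
    bool timedOut(const NetInputs& in) const {
        return in.nowMs - enteredMs >= timeoutMs;   // wrap-safe unsigned math
    }
    NetAction giveUp(const NetInputs& in) {
        enter(NetState::Disconnected, in.nowMs);
        return NetAction::DropWifi;
    }
};
```

On a device, `loop()` would fill `NetInputs` from the calls named in the comment and map `StartWifi` to `WiFi.connect()`, `StartCloud` to `Particle.connect()` and `DropWifi` to `WiFi.disconnect()`. That presumably needs `SYSTEM_MODE(SEMI_AUTOMATIC)` or `MANUAL` so that the system does not manage the connection itself.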

It seems that DeviceOS 1.5.2 has a larger memory requirement, and that is what unexpectedly threw up my recent woes (hence the frustration).

I believe that Particle are working on reducing DeviceOS memory requirements and if this can be improved, that would be great moving forward.

Thanks!

I have seen the same issue you noted, a Photon not reconnecting to an AP if the AP is power cycled, and I agree, it’s frustrating to deal with.
I will check internally if this is specifically being worked on.

Yes, the 2.0 LTS release thread memory reduction will hopefully address many issues.

From your post, you are unblocked for the moment?

@no1089, confirming that am all good.

The implemented “WiFi state machine” works a treat when I don’t run too low on memory.

Thanks for your results @UMD!
We have come across this problem recently, so I’m looking at implementing the WiFi state machine (from [SOLVED] TCPCLIENT intranet connection fails if no cloud connection right?) and I wondered if you came to a rough figure for the amount of heap memory to keep free?

@dan.s, good question re memory.

Memory is especially an issue with later versions of DeviceOS.

Here are some posts for you to ponder:

etc....

Mention was made here that 10K of free memory was required when using SoftAP:

My guess - at least 10K of free memory is required. You need to instrument your code using freeMemory() to test.
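One way to do that instrumentation is to track the lowest free-heap reading seen over time, since a single snapshot can miss short dips. A minimal sketch follows; `HeapWatermark` is a hypothetical helper, and the readings are passed in so the logic can be tried off-device. On a Particle device they would come from `System.freeMemory()`.

```cpp
#include <cstdint>

// Tracks the lowest free-heap reading seen so far and flags readings
// that fall below a caller-chosen warning level.
class HeapWatermark {
public:
    explicit HeapWatermark(uint32_t warnBelowBytes) : warnBelow_(warnBelowBytes) {}

    // Records one sample; returns true if it is below the warning level.
    bool sample(uint32_t freeBytes) {
        if (freeBytes < lowest_) lowest_ = freeBytes;
        return freeBytes < warnBelow_;
    }

    // Lowest reading recorded so far (UINT32_MAX before any sample).
    uint32_t lowest() const { return lowest_; }

private:
    uint32_t warnBelow_;
    uint32_t lowest_ = UINT32_MAX;
};
```

Calling `sample(System.freeMemory())` from `loop()` and around the larger allocations, then logging `lowest()` periodically, should show how close the application gets to the 10K guess above.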

IIRC you shouldn’t use SoftAP with less than 20KB free memory (before first time entry into Listening Mode).

Yes, the SoftAP memory issue is a particular thorn for us; you can see my easy fix at Listening mode on the Photon cannot work reliably in current implementation

Is this reconnection issue related to softAP?