Narrowing down issues with CAN bus

Hello hello, I believe this is my first post here!

I am having an issue with a complete system we are working on. We are still debugging everything, but without some test equipment it is hard to separate a possible firmware issue from a hardware issue, so I am hoping to get feedback on a few questions about the code first.

I am working with an MCP2515 CAN controller and TJA1050 CAN bus transceiver (using a pre-made board like this) and the Boron LTE, with the MCP_CAN_RK library (dependencies.MCP_CAN_RK=1.5.1).

For a while I was generally seeing CAN data arrive. Much like the library example, the code has this form:


#include <mcp_can.h>
#include <SPI.h>

//  CANBUS Shield pins
#define CAN0_INT D6           // INT is connected to pin D6
MCP_CAN CAN0(A5);             // CS is connected to pin A5

//  MCP_CAN DATA
long unsigned int rxId;
unsigned char len = 0;
unsigned char rxBuf[8];


void setup() {
  // ... other things
  // Only switch to normal mode if the MCP2515 initialized successfully
  if (CAN0.begin(MCP_ANY, CAN_250KBPS, MCP_8MHZ) == CAN_OK) {
    CAN0.setMode(MCP_NORMAL);
  }
  pinMode(CAN0_INT, INPUT);
  // ... other things
}

// ... etc

void loop() {
  // ...etc other things
  // INT is active-low: low means the MCP2515 has something to report
  if(!digitalRead(CAN0_INT)) {
    CAN0.readMsgBuf(&rxId, &len, rxBuf);
    // Call other function that does things with this data based on the rxId
  }
  // ...etc other things
}

// ... etc

The issue is that CAN data is more likely to be read right after flashing new firmware or power cycling the device, and then at some point it stops being received. In the meantime I have added a counter variable inside the if(!digitalRead(CAN0_INT)) block and print it along with our usual 1-second debug serial output. But of course, we have not been able to replicate the issue since we started observing it more closely, so I cannot say whether data is still coming in over CAN and is just “corrupt” or whether it is missing. And I have no real way to analyse this data until we sort out a PC-based adapter solution.

We do know that the device remains responsive, even when it is not responding to the data we assume is on the CAN bus from the BMS, and even when not connected to the USB serial monitor, because we see the Particle publish send out bogus data.

My basic question is: does it make sense that if the INT pin is an “interrupt” that we should just be monitoring for it in the loop? Or should we be attaching an interrupt to this pin?

Also, does anyone have any resources for debugging something like this WITHOUT using the microUSB cable. I do have the Particle JTAG debugger on my desk but have not yet been able to use it. It would be wonderful to be able to add breakpoints and see what is going on with the hardware, but I am a bit lost when it comes to how that integrates with the VS Code ecosystem, if at all. I have never done debugging outside of an IDE running code on my local machine.

Another related question for the Boron: the main difference now that we are not seeing the issue is that we are powering it over USB from our computers rather than through the voltage pin. Maybe power delivery is the issue here anyway. Is there any reason we could not snip the 5V wire in the USB and still be able to talk to the Boron without providing a different means of power?

Thanks for any assistance, I hope to be able to update the thread once we are again able to replicate the issue, or try to eliminate any potential hardware problems.

I know there is a lot going on in this post but I think a couple extra eyes particularly with experience either debugging with breakpoints or with specific experience in this MCP_CAN stuff would get us out of this black hole where we are having a hard time eliminating variables.

  • Mark Amber

Hi Mark-

Thanks for posting!

My basic question is: does it make sense that if the INT pin is an “interrupt” that we should just be monitoring for it in the loop? Or should we be attaching an interrupt to this pin?

Monitoring from loop is prone to errors and issues. Attach an interrupt!

Also, does anyone have any resources for debugging something like this WITHOUT using the microUSB cable.

I’d implement a proper interrupt handler to read the data and proceed from there.
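
A minimal sketch of that pattern, written as host-side C++ so it runs without hardware. FakeMcp2515 is a hypothetical stand-in for the controller, not part of MCP_CAN_RK; it models only one property of the real chip, namely that the active-low INT pin stays low while any received frame is still unread. On a Boron the flag would presumably be set by a handler registered with attachInterrupt(CAN0_INT, handler, FALLING), and the drain loop would call CAN0.readMsgBuf() for as long as CAN0.checkReceive() reports CAN_MSGAVAIL. Treat those calls as assumptions to check against the library version in use.

```cpp
#include <atomic>
#include <cstdint>
#include <deque>

// Hypothetical stand-in for the MCP2515: INT is active-low and stays low
// for as long as at least one received frame has not been read out.
struct FakeMcp2515 {
    std::deque<uint32_t> pending;              // IDs of unread frames
    bool intPinLow() const { return !pending.empty(); }
    uint32_t readFrame() {                     // caller checks intPinLow() first
        uint32_t id = pending.front();
        pending.pop_front();
        return id;
    }
};

std::atomic<bool> canIntReq{false};            // set by the ISR, cleared by loop()

// What a FALLING-edge handler would do: set a flag and return.
// No SPI traffic happens inside the interrupt handler.
void canIsr() { canIntReq = true; }

// Deliver a frame. The pin only produces a falling edge (so the ISR only
// fires) if INT was high beforehand.
void frameArrives(FakeMcp2515& can, uint32_t id) {
    bool wasLow = can.intPinLow();
    can.pending.push_back(id);
    if (!wasLow) canIsr();
}

// One pass of loop(): if the flag is set, clear it first, then read until
// INT goes high again. Returns the number of frames read.
int serviceCan(FakeMcp2515& can) {
    if (!canIntReq.exchange(false)) return 0;
    int n = 0;
    while (can.intPinLow()) {
        can.readFrame();
        ++n;
    }
    return n;
}

// Same pass, but reading only one frame per flag, for comparison.
int serviceCanOnce(FakeMcp2515& can) {
    if (!canIntReq.exchange(false)) return 0;
    if (!can.intPinLow()) return 0;
    can.readFrame();
    return 1;
}
```

The comparison function shows why the drain loop matters with an edge-triggered handler: if a second frame lands before the first is read, INT never returns high, no further falling edge occurs, and a one-frame-per-interrupt reader stops being called while unread frames remain.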

Is there any reason we could not snip the 5V wire in the USB and still be able to talk to the Boron without providing a different means of power?

Snipping the 5V USB wire turns off USB on the nRF. Proceed with caution.

Colleen


Hi and welcome to the community!
Are you sure the device is sending CAN data periodically?
Sometimes you have to query the remote device to get a response on the CAN bus.

When I had to debug my code, I bought a CAN bus J1939 simulator, like this one:

I talk a bit more about it in this topic.

Or you can use the CAN simulator that Particle has created; it helped me incredibly in my CAN adventures.
It’s in a zip file in the Simulator folder, and it is referenced here:

Edit: here’s a pic of the photon-based simulator:


Thanks Gustavo! It took a while to get back to this; as soon as I had started digging in a bit further, we were dealing with some EMI problems. I got two, but it turned out each kit came with 2, so now I have 4 CAN testers! We were able to start things up today and the interrupt handler seems to have made things more "stable", but we still need to hook up these testers to start understanding things better. I just wanted to pop in and say thanks even though there is no updated question yet.


gusgonnet

Thanks again for the help!

FINALLY getting back to working on this problem. We have narrowed down some issues with our board design, our circuit, etc., and everything seemed to work, but suddenly we are getting CAN_SENDMSGTIMEOUT back from CAN0.sendMsgBuf(0x055, CAN_STDID, 1, data). We just bought an oscilloscope to try to track down whether the MCP2515 is even spitting out data and what is happening on the CAN and SPI sides.

I am a bit lost on the next steps here, once we get the scope. For example, what does CAN_SENDMSGTIMEOUT mean in the MCP_CAN_RK library?

Uff, that’s a tough one. I haven’t seen that error, but this is what I would pursue:

  • ask around (you did this already).
  • check the MCP2515 datasheet: it has many buffers, registers, and modes to configure, so there are a lot of combinations to try out.
  • I would look into understanding what exactly CAN_SENDMSGTIMEOUT means and go from there.
  • if still in trouble, I would ask the developer of the MCP_CAN_RK lib whether they know what that error is.
  • check the other node: is it sending responses up until a certain point, did it send any responses at all for this particular message, are we “bombarding” it with so many messages per second that it overflows for whatever reason? Are we breaking it?
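
As one concrete starting point for the datasheet route, here is a small host-side sketch that names the bits of the MCP2515 EFLG (error flag) register, which reports receive-buffer overflows and bus error states. The bit order follows the MCP2515 datasheet. The decodeEflg helper is hypothetical and not part of MCP_CAN_RK; it assumes the raw byte is obtained from the chip some other way, for example the library's getError() call if the version in use provides it.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Bit names of the MCP2515 EFLG register, bit 0 first (per the datasheet):
// EWARN/RXWAR/TXWAR are warning levels, RXEP/TXEP are error-passive states,
// TXBO is bus-off, RX0OVR/RX1OVR are receive buffer overflows.
static const char* const kEflgBits[8] = {
    "EWARN", "RXWAR", "TXWAR", "RXEP", "TXEP", "TXBO", "RX0OVR", "RX1OVR"
};

// Hypothetical helper: returns the names of all bits set in an EFLG value.
std::vector<std::string> decodeEflg(uint8_t eflg) {
    std::vector<std::string> names;
    for (int bit = 0; bit < 8; ++bit) {
        if (eflg & (1u << bit)) {
            names.push_back(kEflgBits[bit]);
        }
    }
    return names;
}
```

Logging the decoded names alongside each failing sendMsgBuf() status would show whether the controller has gone error-passive or bus-off, or whether received frames are being overwritten before they are read.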

Best


Alrighty, we figured out the issues on the CAN bus itself: it was bad hardware (the CAN interface chip, not the MCP2515).

Now the original issue comes back up, which is either a bug/crash in the MCP_CAN_RK library or a breakdown of some sort at the electrical level of the SPI communication. This is the issue I described earlier.

The data remains on the CAN bus but it seems like we can't read the MCP2515 once this failure occurs. It takes a few days for the issue to crop up, which leads me to think it might be more likely hardware components or 'wiring' (pcb traces) than code issue. But it does happen regardless of which MCP2515 chip we have used.

The symptom on the Particle side is that we do see CAN interrupts, but new CAN messages are not being read from the MCP2515. The status we get back is CAN_GETTXBFTIMEOUT (#define CAN_GETTXBFTIMEOUT (6)), which is specifically what happens when this function in the library keeps returning MCP_ALLTXBUSY.
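
To make these numeric codes easier to spot in serial logs, a small lookup like the following can translate them to names. This is a sketch under assumptions: only CAN_GETTXBFTIMEOUT = 6 is confirmed in this thread, the other values are taken from the mcp_can_dfs.h header of the Arduino MCP_CAN library that MCP_CAN_RK derives from and should be verified against the header shipped with version 1.5.1. canStatusName is a hypothetical helper, not a library function.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper mapping an MCP_CAN status byte to its macro name.
// Values are assumed from mcp_can_dfs.h; verify them before relying on this.
std::string canStatusName(uint8_t status) {
    switch (status) {
        case 0:    return "CAN_OK";
        case 1:    return "CAN_FAILINIT";
        case 2:    return "CAN_FAILTX";
        case 3:    return "CAN_MSGAVAIL";
        case 4:    return "CAN_NOMSG";
        case 5:    return "CAN_CTRLERROR";
        case 6:    return "CAN_GETTXBFTIMEOUT";  // no free TX buffer before the timeout
        case 7:    return "CAN_SENDMSGTIMEOUT";  // frame queued, but not reported sent before the timeout
        case 0xFF: return "CAN_FAIL";
        default:   return "unknown";
    }
}
```

Both timeout codes come from the send path, so seeing them while receive also fails is consistent with the controller's register reads returning bad data rather than with a problem specific to transmission, though that reading is an inference and not something the thread has confirmed.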

THIS is the issue we were having at the start. We were having it even when using the "Amazon-style boards", before we even started sending CAN messages (just using them for listening).

Here is our scope on the SPI signals between the particle Boron and the MCP2515 when it is "broken"

and here is when it is working

It is evident that MOSI is fairly similar, but MISO is absent when it is 'broken'.

I did notice the clock showing up on MISO, but I am not an EE or SPI expert. It could have been my testing method or it could be normal; it didn't mean anything to me.

Hoping someone with a bit more EE or SPI background can chime in and say whether anything jumps out. My next steps will be to set up the scope to trigger on specific SPI commands and parse through the MCP2515 datasheet, then wait a few days for this to happen again and try to understand exactly what is happening. I assume the traces I took here using just a clock trigger could still be helpful to anyone with more background experience than I have.

Thanks so much for the help so far,

Mark

@MarkAmberSpl, can you elaborate on how you service the CAN interrupts (code if possible) and the frequency of these interrupts?

  // when the MCP2515 signals data (INT pin low)
  if(!digitalRead(CAN0_INT)) {
    // Read MCP2515 canbus data
    CAN0.readMsgBuf(&rxId, &len, rxBuf);
    // canIntReq = false;
    canIntCount++;
  }

Also sending CAN message

// If at least 150 ms have passed since lastPrintMillis was last updated
if (System.millis() > lastPrintMillis + 149) {

  lastPrintMillis = System.millis();

  byte sndStat;

  if (powerGlobalState) {
    sndStat = CAN0.sendMsgBuf(0x155, CAN_STDID, 1, mpee);
  } else {
    sndStat = CAN0.sendMsgBuf(0x155, CAN_STDID, 1, mped);
  }

  // Despite its name, canErrCountSnd holds the last non-OK status code, not a count
  if (sndStat != CAN_OK) {
    canErrCountSnd = sndStat;
  }

}

I can provide the entire file after I sanitize it tomorrow and I will put it on a gist. Mostly based on the fabulous work of this internet stranger https://github.com/tomstor82/arduino-Battery-Monitor (we are using the same BMS, CAN hardware, etc)

Friendly reminder about this! Code gist please.

Here is the code - I am seeing something more on the scope now that I have reduced the length of the leader wires and there is much less noise. It is basically still showing the same as the broken image above but without the noise on pink (MISO).
