CPU Hangs


#1

I’ve got a difficult debug problem with my Q2687 on a custom AVL board using ADL. The board is in its 3rd version and has worked well for the last two versions.

But recently I’ve been getting increasing cases of the application hanging. It doesn’t respond “OK” to any command, and only responds to wopen=0 (thank goodness). So one could expect a software bug, but it seems to be happening across SW versions and hardware versions.

WC has an 8 second WD timer that resets the CPU if something causes it to “hang”, but somehow, whatever is causing my problem is outside of the influence of this timer. It feels like a power supply problem, but I could also suspect a memory allocation problem, I2C hanging or even noisy CTS/RTS. Tracing doesn’t seem to produce any clues; it just stops wherever.

I’ve recently subscribed to the error handler to see if anything pops up there. All memGets seem to be properly released and I stopped I2C processing and still see the problem.

Any ideas on how to catch this thing? What else could cause the processor to “hang”?


#2

A what?

Power supply problems can certainly cause some very “interesting” and obscure effects!

Have you looked very carefully at the supply - very close to the module’s pins?
Get your scope to trigger on voltage droops.

Poor grounding of the can legs can also cause some very “interesting” and obscure effects…


#3

Automatic Vehicle Location, with GPS as well.

I checked the VBATT while transmitting, and it’s within limits below 100 kHz: 100 mV at 10 kHz and 40 mV at 100 kHz. With the scope I have, it’s difficult to say what it’s doing above 100 kHz.


#4

It’s not completely dead. I noticed that if given enough time, it will revive sometimes. Also I saw what looked like the Wavecom closing down the connection (bearer?) by itself after it had been hung for a time.

Oh well…


#5

What OpenAT version do you use? Do you use multitasking? How did you exclude I2C in your testing?

I’ve had my I2C hang after a bus read and freeze the entire CPU. With multitasking and OASIS it does not freeze the CPU, but the I2C still hangs. It was a non-periodic event, and I guess it was related to SCL stretching.

Remove the entire I2C module from your software and try again.
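
For anyone chasing a stretching-related hang on a bus they control themselves (for example a bit-banged master, or the PIC side), a minimal sketch of a bounded wait is shown here. It does not apply to the module's own bus driver, which the application cannot patch. The names `scl_read_fn`, `wait_scl_released` and `fake_scl` are made up for illustration and are not part of any Wavecom API.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical callback: returns true when SCL reads high (released). */
typedef bool (*scl_read_fn)(void *ctx);

/* Poll SCL at most max_polls times instead of waiting forever.
 * Returns true if the line was released, false if the slave kept
 * stretching the clock past the budget. */
bool wait_scl_released(scl_read_fn read_scl, void *ctx, uint32_t max_polls)
{
    for (uint32_t i = 0; i < max_polls; i++) {
        if (read_scl(ctx)) {
            return true;
        }
    }
    return false;
}

/* Stand-in for real hardware: SCL stays low for *ctx polls, then goes high. */
bool fake_scl(void *ctx)
{
    uint32_t *remaining = ctx;
    if (*remaining == 0) {
        return true;
    }
    (*remaining)--;
    return false;
}
```

When the function returns false, the caller can abandon the transaction and report an error rather than block the rest of the application.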


#6

I use version 4.22 with firmware 6.63 and ADL, which is multitasking to the extent that timers and interrupts manage the bulk of the work while the main app runs in a loop.

I do an I2C scan once per second, so to disable I2C, I bypassed the call to the scan. There’s no I2C activity outside of that scan. I verified that there was no I2C data activity. If there were a noise spike or something… although I only enable the bus for any transaction and disable it after. I don’t think it should respond or hang with noise when disabled.

Wouldn’t there be a way for Wavecom to bust out of an I2C hang through the WDT? Anybody from Wavecom that can answer that?

There are a few things that can hang the CPU, which shouldn’t happen, one is this with the I2C and another is CTS/RTS noise. If CTS/RTS aren’t jumpered, and the processor is emitting serial data (unsolicited commands or debug information), eventually the CPU will hang in the field. Also the processor won’t come back from an over the air download if the CTS/RTS signals aren’t jumpered.

Very annoying.


#7

That is not a good idea for an Open-AT application!


#8

I think that’s the problem discussed here: viewtopic.php?f=4&t=428&p=10232&hilit=credit+dota#p10234 :question:

And that is a really BIG Problem :!: :open_mouth: :angry:


#9

What the heck am I saying!! I guess I need a vacation…

I’m looping my PICs NOT my ADL app!!


#10

Hey BlackyBlack,

When I originally bypassed I2C, I did so after 20 seconds, so that the I2C could set up a few things (I/O) on the PICs. Given that the nature of the problem seems to be arbitrary and after “some time of operation” I figured that I could get away with initial initialization.

But yesterday, taking into account what you said about removing the I2C altogether, I blocked it right from the beginning. The code is there, but it never runs.

And guess what? It SEEMS as though it’s not hanging any more. Most of the time, the units that are affected hang after one hour or so. So far, two units that have the I2C blocked have not hung for nearly 15 to 20 hours.

Still I find it strange. It’s true that I’ve been doing a little touching up of the I2C code on the PICs (so I’ll have to go over the most recent changes). But if just 20 seconds of I2C can cause eventual problems an hour later…??? I’m still a little stumped.

The only thing I can think of is that I must not be unsubscribing the bus somehow and that leaves the bus open to hang when the first glitch of noise comes along. The bus lines are “high” when it’s hung, so, it doesn’t seem to be “stretching” related.

Go figure.

Anyways, I’ll keep y’all up to date once it’s straightened out, perhaps it will help someone else.

Manchine


#11

i also noticed that I2C on the Q26xx modules is a pretty delicate flower.
i hope you find the source of your troubles.


#12

Yes, that does sound like a reasonable hypothesis…

There might even be a bug in the unsubscribe that leaves something “hanging”…

Can you subscribe the pins as GPIOs and force them into some “safe” state…?

Are you leaving any timers “hanging”, or suchlike?


#13

Yes, although I was avoiding the thought, this will probably be inevitable.


#14

BlackyBlack, From another post you mentioned external WDT!!

Here I have identified the hanging as I2C related, and as we all know, certain I2C failures hang the CPU, so internal timers are useless in detecting failure.

I have an external PIC and could use it to detect a hung WC… I guess I haven’t used the external WC reset because of the strong advice throughout the WC documentation against doing hard resets. But I guess this would be a valid case.

Have you used external WD reset? Anyone else?

In general I don’t like to use WDTs, preferring to try to resolve the problem through logic. But on the other hand there wouldn’t be WDTs if there wasn’t a valid use… would there??..

In this case the solution REALLY rests with Wavecom, but I think the timeout period on that one is rather large.

Manchine


#15

Yes, we are using an external watchdog. By the way, it is I2C based. But with Wavecom it is still not enough to make a device that never hangs. In some cases you have to power cycle the Wavecom module to make it work; a reset is not enough. Also, if you expect to have a hard reset in less than an hour of work, you will definitely have problems with your device. There are several reasons:

  1. after reset you will reconnect GPRS and will lose money on it.
  2. after reset you may lose your GPS data and will have to wait for GPS reconnection.
  3. there is a possibility of the module not starting after reset.
  4. if you drive outputs you will have undesired effects during restart.
  5. if you store your data in a RAM cache you will lose your data.
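
The escalation described above (try a reset first, power cycle only if that does not bring the module back) could be sketched on the external supervisor roughly as follows. This is only an illustration under assumptions: the type and function names, the one-second tick, and the thresholds are all hypothetical, and the heartbeat could be any signal the module sends while healthy.

```c
#include <stdint.h>

typedef enum { WD_NONE, WD_RESET, WD_POWER_CYCLE } wd_action_t;

typedef struct {
    uint32_t timeout_s;   /* seconds of silence before acting */
    uint32_t max_resets;  /* resets to try before power cycling */
    uint32_t silent_s;    /* seconds since the last heartbeat or action */
    uint32_t resets_done; /* resets issued since the last heartbeat */
} ext_wd_t;

void ext_wd_init(ext_wd_t *wd, uint32_t timeout_s, uint32_t max_resets)
{
    wd->timeout_s = timeout_s;
    wd->max_resets = max_resets;
    wd->silent_s = 0;
    wd->resets_done = 0;
}

/* Call whenever the module proves it is alive. */
void ext_wd_heartbeat(ext_wd_t *wd)
{
    wd->silent_s = 0;
    wd->resets_done = 0;
}

/* Call once per second; returns the action the supervisor should take. */
wd_action_t ext_wd_tick(ext_wd_t *wd)
{
    wd->silent_s++;
    if (wd->silent_s < wd->timeout_s) {
        return WD_NONE;
    }
    wd->silent_s = 0; /* give the module a full timeout to recover */
    if (wd->resets_done < wd->max_resets) {
        wd->resets_done++;
        return WD_RESET;
    }
    wd->resets_done = 0;
    return WD_POWER_CYCLE;
}
```

Given the costs in the list above, the timeout would normally be generous, so that a restart stays a rare event rather than a routine one.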

#16

BlackyBlack,

Thanks for the info. Fortunately (I guess) I already have soft restart in a few cases, so I already suffer from, and have adapted to, most of the conditions you mention.

I would recycle power to the Wavecom only then, so no problem with point 2) and at least with soft restarts I haven’t had problems with point 3).
I wouldn’t have expected to see such a problem. Would it be entirely related to the power recycle? Is it possible to elaborate a little more on that item? Possible causes, for example (in your estimation, of course).

Thanks.

Manchine


#17

I didn’t expect it either, but I see that after a reset we often lose our GPS for a while. It could be a schematic issue, but you’d better try it yourself. Also it may be even worse for C-GPS…
About point 4. We have observed the outputs during soft reset and noticed a zero level for a long period of module initialization. We haven’t investigated it further; maybe pull-ups will help. But for stable outputs on a power cycled device you have to design it very carefully. Maybe using an external CPU with a different power source.
About point 5. Since a soft reset comes from the application, you can always store your cache to persistent storage first. With an external reset that is impossible.
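
A minimal sketch of the point 5 idea, flushing the RAM cache before an application-initiated reset, might look like this. The storage and reset calls are passed in as hypothetical callbacks (`persist_fn`, `reset_fn`) because the real ones depend on the platform; the `fake_*` helpers only stand in for hardware.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define CACHE_MAX 64

typedef struct {
    unsigned char data[CACHE_MAX];
    size_t len;
} ram_cache_t;

/* Hypothetical hooks: write to persistent storage, trigger the reset. */
typedef bool (*persist_fn)(const unsigned char *data, size_t len, void *ctx);
typedef void (*reset_fn)(void *ctx);

/* Flush the cache, then reset. Returns false without resetting if the
 * flush fails, so the caller can decide whether to reset anyway. */
bool soft_reset_with_flush(ram_cache_t *cache, persist_fn persist,
                           reset_fn reset, void *ctx)
{
    if (cache->len > 0) {
        if (!persist(cache->data, cache->len, ctx)) {
            return false;
        }
        cache->len = 0;
    }
    reset(ctx);
    return true;
}

/* Stand-ins for real storage and reset, for illustration only. */
typedef struct {
    unsigned char stored[CACHE_MAX];
    size_t stored_len;
    bool storage_ok;
    int resets;
} fake_board_t;

bool fake_persist(const unsigned char *data, size_t len, void *ctx)
{
    fake_board_t *b = ctx;
    if (!b->storage_ok) {
        return false;
    }
    memcpy(b->stored, data, len);
    b->stored_len = len;
    return true;
}

void fake_reset(void *ctx)
{
    fake_board_t *b = ctx;
    b->resets++;
}
```

An external watchdog reset bypasses this path entirely, which is why data still sitting in the cache at that moment is lost.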