Processing lockup

Hello all,

Processor: Q2686
OS Firmware 7.2 (or 7.2a)

Can anyone tell me the most common cause of this processor locking up (showing no sign of processing)?

My application runs for long periods, but on some devices this lock-up occurs at random intervals.

On some of these devices I have noticed that, for whatever reason, the loaded OS firmware reports a different processor type from the one marked on the processor, e.g. at+cgmr reports type H but the processor is type G, and vice versa. Could this be a cause of the lock-up?

Regards

Barry

Hello all,

Has anyone experienced lock-up or freezing issues?

Any constructive comment would be greatly appreciated.

Regards

Barry

Can you confirm these problems are also happening on a more recent firmware release?

Hello tomridl,

Not at present, no, but it is in the plan to see if this helps. We are also planning to develop/upgrade/check our application to work with the latest OS firmware. For now I would like to hear other people's experiences and comments.

Regards

Barry

Hi Barry,
I think other people have had similar issues. At least, I have had lock-up problems like the ones you describe. Some were related to the software logic of my Open AT application. Others could have been firmware bugs, which might be fixed by upgrading the firmware.
Currently, in our hardware design, we put an external watchdog on the PCB. Our Open AT application now triggers the watchdog pin. If the software gets stuck or locks up, it cannot trigger the watchdog, and the watchdog resets the module in order to perform a clean start.
I recommend using an external watchdog for as long as these lock-up problems continue.
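
To illustrate the idea, here is a minimal sketch in plain C. It only models the timing logic; all names (ext_wdt_t, ext_wdt_kick, app_periodic) are hypothetical and are not Open AT APIs. In a real design the kick would toggle the GPIO wired to the watchdog chip.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of an external watchdog chip: it resets the module
 * unless its input is kicked at least once every timeout_ticks. */
typedef struct {
    uint32_t timeout_ticks;
    uint32_t last_kick;
} ext_wdt_t;

/* Called by the application; real hardware would toggle a GPIO here. */
void ext_wdt_kick(ext_wdt_t *w, uint32_t now)
{
    w->last_kick = now;
}

/* The decision the watchdog chip makes on its own, independent of software. */
bool ext_wdt_would_reset(const ext_wdt_t *w, uint32_t now)
{
    return (uint32_t)(now - w->last_kick) > w->timeout_ticks;
}

/* Periodic application handler: kick only while the application is
 * demonstrably making progress, so a stuck application stops kicking. */
void app_periodic(ext_wdt_t *w, uint32_t now, bool app_healthy)
{
    if (app_healthy) {
        ext_wdt_kick(w, now);
    }
}
```

The important design choice is that the kick comes from the application's normal processing path, not from something that keeps running when the application is stuck.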

Hi zafer,

Thanks for your information.

As an update, we have a device here on which the application appeared to have stopped running (I am aware that multiple resets can do this), but I could still communicate with the device over the debug port. I know this is slightly different from my original question, but I can't help thinking that they may be related. It appears to have restarted all by itself days later, after the device was left switched on.

How do you confirm that your application has stopped running? If you decide just by looking at the serial port communication, it may be related to the UART baud rate/character settings. Do you change the UART settings or switch between AT and data mode in your application?
I know my application has stopped when the internal logs that I write to flash memory are missing. If the logs are not being written, the application is not running. You could use similar logic to make sure your application has really stopped.
I hope that is clear.
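
A rough sketch of that kind of check, in plain C. The names and the 60 second interval are assumptions for illustration; the struct stands in for a record that a real application would write with whatever flash storage API it already uses.

```c
#include <stdbool.h>
#include <stdint.h>

#define HEARTBEAT_PERIOD_S 60u   /* assumed logging interval */

/* Stand-in for a record kept in flash memory (hypothetical layout). */
typedef struct {
    uint32_t last_heartbeat_s;   /* time of the most recent entry */
    uint32_t count;              /* number of entries written */
} heartbeat_log_t;

/* Called periodically by the application while it is running. */
void heartbeat_write(heartbeat_log_t *log, uint32_t now_s)
{
    log->last_heartbeat_s = now_s;
    log->count++;
}

/* When inspecting the log: a gap of more than two periods means entries
 * were missed, so the application was not running during that time. */
bool heartbeat_stalled(const heartbeat_log_t *log, uint32_t now_s)
{
    return (uint32_t)(now_s - log->last_heartbeat_s) > 2u * HEARTBEAT_PERIOD_S;
}
```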

Hello zafer,

Sorry for the delay in response.

I confirm that the device has locked up by means of observation i.e. the ancillary devices connected to the CPU stop operating.

As a further update to this issue, I believe I have found one area where a device lock-up is occurring.

Our device connects to a host via a GSM/GPRS/TCP connection. If any one of these connections fails to connect, or connects and subsequently fails, it tries to reconnect. If it still fails to reconnect after x time period, the device is automatically restarted by means of the at+cfun=1 command. The timing of the reconnects and restarts is controlled via adl_tmrSubscribe timers. When the device is restarted via the at+cfun=1 command, a record of this is stored in persistent memory (i.e. information that survives a soft restart).

When the device has locked up (i.e. no ancillary device action and no communication with the device via the debug port), I have noticed that the persistent information tells me the device tried to perform a restart (at+cfun=1) but, instead of actually restarting, it remained locked up.
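
The decision logic described above can be sketched roughly as follows. This is a simplified illustration in plain C, not the actual application code; the type and function names are hypothetical, and restart_after_s stands for the "x time period".

```c
#include <stdbool.h>
#include <stdint.h>

typedef enum { LINK_OK, LINK_RETRY, LINK_RESTART } link_action_t;

typedef struct {
    uint32_t down_since_s;     /* when the link was first seen down */
    bool     is_down;
    uint32_t restart_after_s;  /* the "x time period" before a restart */
} link_supervisor_t;

/* Called from a periodic timer handler with the current link state. */
link_action_t link_supervise(link_supervisor_t *s, bool connected, uint32_t now_s)
{
    if (connected) {
        s->is_down = false;
        return LINK_OK;
    }
    if (!s->is_down) {
        /* First time we see the link down: start the clock. */
        s->is_down = true;
        s->down_since_s = now_s;
    }
    if ((uint32_t)(now_s - s->down_since_s) >= s->restart_after_s) {
        /* Caller records the restart in persistent memory,
         * then issues at+cfun=1. */
        return LINK_RESTART;
    }
    return LINK_RETRY;
}
```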

I do have further investigations to do but if anyone has any further suggestions please let me know. I may well end up using an external watchdog chip as zafer suggests.

I would first suggest updating to the latest firmware. Up until 7.45.1 we were seeing a number of network related problems in the field. We confirmed that they were mostly fixed with 7.45.1 and later firmware.

There are other things that I’ve seen. For example, sometimes when you send an AT command, your callback function is called immediately, instead of the send call returning first and the callback occurring afterwards. If the callback sends the next command, this can recurse indefinitely. Therefore, if you have a chain of AT commands to send, you sometimes need to put a timer with a 1 tick delay between calls to the send AT function to break the recursive loop.
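
Here is a small simulation of that workaround in plain C. Nothing in it is a real Open AT call: sim_send_at is a hypothetical stand-in that models the worst case, where the response handler runs before the send call returns, and chain_on_tick stands in for the 1 tick timer handler.

```c
#include <stdbool.h>
#include <stddef.h>

typedef struct {
    const char **cmds;   /* commands to send, in order */
    size_t count;
    size_t next;         /* index of the next command to send */
    bool   send_due;     /* set by the response handler, consumed by the tick */
    int    depth;        /* current nesting of send calls */
    int    max_depth;    /* deepest nesting observed */
} at_chain_t;

void chain_on_response(at_chain_t *c);

/* Stand-in for the real send call (hypothetical). Worst case: the
 * response handler is invoked before the send returns. */
void sim_send_at(at_chain_t *c, const char *cmd)
{
    (void)cmd;
    c->depth++;
    if (c->depth > c->max_depth) {
        c->max_depth = c->depth;
    }
    chain_on_response(c);
    c->depth--;
}

/* Response handler: never sends directly, only schedules the next send. */
void chain_on_response(at_chain_t *c)
{
    c->send_due = (c->next < c->count);
}

/* Timer handler, run one tick later: sends at most one command. */
void chain_on_tick(at_chain_t *c)
{
    if (c->send_due) {
        c->send_due = false;
        sim_send_at(c, c->cmds[c->next++]);
    }
}
```

Because the response handler only sets a flag, the send call never nests inside itself, even when the callback fires immediately.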

I have yet to see a lock-up that can’t be worked around. I’d suggest adding a whole lot of print statements and seeing if you can capture the failure. Add a print statement for every OpenAT API call that can return an error. You may not be handling an error properly. I have also noticed that most of the OpenAT samples don’t handle a tenth of the errors they should, and we are lured into a false sense of security when a sample works great at our desk but fails in the field or under more rigorous testing.
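
One way to make that less tedious is a small wrapper that logs any failing return code together with the call that produced it. This is a generic C sketch, assuming the APIs in question return negative values on error; fake_api_ok and fake_api_fail are made-up stand-ins, and on the module you would print through your trace or debug output instead of stderr.

```c
#include <stdio.h>

/* Running count of failed calls, useful as a quick health indicator. */
int g_error_count = 0;

/* Logs and counts a failing return code, then passes it through unchanged. */
int check_ret(int ret, const char *what, const char *file, int line)
{
    if (ret < 0) {
        g_error_count++;
        fprintf(stderr, "%s:%d: %s failed with %d\n", file, line, what, ret);
    }
    return ret;
}

/* Wrap any call that returns an int status; expr is evaluated once. */
#define CHECK_RET(expr) check_ret((expr), #expr, __FILE__, __LINE__)

/* Hypothetical stand-ins for API calls that can fail. */
int fake_api_ok(void)   { return 0; }
int fake_api_fail(void) { return -3; }
```

Usage would look like `if (CHECK_RET(fake_api_fail()) < 0) { /* recover */ }`, so the log line and the error handling sit at the same place in the code.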

Hope this helps your search for the root cause.