Crash after sending a large number of commands


#1

Hi,
We have an application that uses CMUX mode on UART1. We have been stress testing it with a Python script that sends a large number of commands on UART1 DLC1 through the Windows CMUX driver. During the test I found that our application was resetting.
While debugging the cause of the reset, I ran the same test script in two other configurations. With no application running on the modem, there were no resets or timeouts. With a modified version of the Hello World sample, the modem reset, always at about the same line in the log captured by the test script. Using the TMT I could see that the error back-trace string was ADL Get memory error…
The test script sends three commands: AT+WIND? and two custom commands. Each command is sent 30 times, and the sequence of all three commands is repeated 100 times, for a total of 9000 commands. After sending a command, the script waits up to 2.5 seconds for a response, checks anything received and sends the next command. Once a response to a command is received, there is no delay before sending the next command.
I tried inserting a delay of 10 ms, then 50 ms, between commands, with no improvement.
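
For reference, the test loop amounts to roughly the following. This is a minimal Python sketch rather than the actual script: `send` and `receive` are hypothetical callables standing in for the serial port access, the function and parameter names are made up for illustration, and treating a response without a final OK as a timeout is an assumption.

```python
import time


def run_stress_test(send, receive, commands, per_command=30, cycles=100,
                    timeout_s=2.5, delay_s=0.0):
    """Send each command `per_command` times; repeat the whole sequence `cycles` times.

    send(text) writes one command (hypothetical callable).
    receive(timeout_s) returns the response text, or None if nothing
    arrived within the timeout (hypothetical callable).
    Returns (commands_sent, timeouts).
    """
    sent = 0
    timeouts = 0
    for _ in range(cycles):
        for command in commands:
            for _ in range(per_command):
                send(command + "\r")
                sent += 1
                response = receive(timeout_s)
                # Assumption: a complete response contains a final "OK".
                if response is None or "OK" not in response:
                    timeouts += 1
                # Optional pause between commands (0.010 or 0.050 in my tests).
                if delay_s > 0:
                    time.sleep(delay_s)
    return sent, timeouts
```

With three commands and the default arguments this sends 3 x 30 x 100 = 9000 commands.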
I also tried increasing the application stack size from 3 kB in increments up to just under 64 kB. With a larger stack, and a correspondingly smaller heap, the reset and associated timeouts actually occurred earlier. With a smaller stack (1 kB), the reset happened a bit later and the error back-trace string was RTK Except 161 180ff748 30.
Finally, I repeated the test on UART1 directly, without the CMUX protocol, and got the same reset and timeouts at about the same point in the log.

This seems to suggest that the heap is being corrupted within the ADL layer, but I find that hard to believe, as surely this type of test has been run successfully many times before? I would appreciate any suggestions for further things to check, as I am still hoping that I am doing something stupid!


#2

The simple application I was using that demonstrated the reset is included below.

#include "adl_global.h"

// Custom stack size: 1023 + 63 * 1024 = 65535 bytes, the largest value a u16 can hold.
const u16 wm_apmCustomStackSize = 1023 + 1024 * 63;

//! @brief The GPIO handle returned by the Open AT ADL GPIO subscription
//! call.
s32 g_health_signal_h;

//! @brief The GPIO number to use for the health signal indicator.
//! Use GPIO21 as the health signal - this is connected to the CHARGER LED
//! on the Q26 Development Kit board.
const u16 g_gpio_number = 21;

#if 0
void HelloWorld_TimerHandler ( u8 ID, void * Context )
{
    /* Hello World */
    TRACE (( 1, "Embedded : Hello World Blink" ));

    // Read the current value of the GPIO and invert it.
    adl_ioDefs_t health_signal = g_gpio_number | ADL_IO_GPIO;
    s32 state = adl_ioReadSingle(g_health_signal_h, &health_signal);

    adl_ioWriteSingle(g_health_signal_h, &health_signal, !state);
}
#endif // 0

void adl_main ( adl_InitType_e  InitType )
{
#if 0
    TRACE (( 1, "Embedded : Application Init" ));

    /* Set 1s cyclic timer */
    adl_tmrSubscribe ( TRUE, 10, ADL_TMR_TYPE_100MS, HelloWorld_TimerHandler );

    adl_ioDefs_t health_signal;

    health_signal = ADL_IO_GPIO | g_gpio_number | ADL_IO_LEV_LOW | ADL_IO_DIR_OUT;
    g_health_signal_h = adl_ioSubscribe(1, &health_signal, ADL_TMR_TYPE_100MS, 0, 0);
#endif // 0
}

Excerpt from the test script log demonstrating the timeout and modem reset:

2010-03-22T17:55:24.978000: Sent: "AT+WIND?
"
2010-03-22T17:55:27.866000: Rx'ed in 2.888s: "AT+WIND?
    

"
2010-03-22T17:55:27.867000: Timeout!
2010-03-22T17:55:27.869000: Sent: "AT+WIND?
"
2010-03-22T17:55:27.883000: Rx'ed in 0.014s: "+WIND: 3



+CREG: 0

AT+WIND?


+WIND: 255



OK

"

WARNING: Spurious data received before response

2010-03-22T17:55:27.886000: Sent: "AT+WIND?
"
2010-03-22T17:55:27.899000: Rx'ed in 0.013s: "AT+WIND?


+WIND: 255



OK

"

All testing was done on a Q2687 CPU on a Q26 development kit.
UART1 and the CMUX channel were operating at 115k baud, 8 data bits, no parity, 1 stop bit, hardware flow control.
I was using a USB to serial adaptor to communicate with the dev. kit.
The reset - as evidenced by a +WIND: 3 message in the log and the associated error backtrace - seemed to occur after roughly 4500 commands were sent.
I would be interested to know if anyone else has conducted similar testing, and if so, what the result was.
I would also appreciate any details about limits on sending commands: is there a maximum rate at which commands should be sent, or a minimum back-off between successive commands?
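
In case anyone wants to compare results, the point of reset can be located in a log like the excerpt above by counting the Sent entries that precede the first +WIND: 3. The following is a rough Python sketch, not part of the actual test script; the function name is hypothetical and it assumes the log format shown above, where each command is logged with a `Sent: "` prefix.

```python
import re


def commands_before_reset(log_text):
    """Count 'Sent:' entries logged before the first '+WIND: 3' indication.

    Returns None if the log contains no '+WIND: 3'.
    """
    # Match "+WIND: 3" exactly, not longer values that merely start with 3.
    match = re.search(r"\+WIND: 3(?!\d)", log_text)
    if match is None:
        return None
    # Assumption: every command sent is logged as: <timestamp>: Sent: "<command>
    return log_text[:match.start()].count('Sent: "')
```

Run over the full log, this gives the number of commands sent before the modem restarted, which makes it easier to see whether the reset point moves when the stack size or the delay between commands is changed.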