TCP/IP data corruption

Hi,

I don’t know where to send bugreports, so here it is.

I noticed tcp/ip communication problems on our SL8082T modules (gprs, wip) and spent some time to create a minimal example. Didn’t take much - just send data and it gets corrupted. Well, maybe not that simple… The example uses two timers to periodically send some unique bytes to server. I think the error appears even with one timer, but much less frequently (don’t remember which code I had running overnight). With this code, I see about 10 errors per 30 minutes. Changing the timers from 3 seconds to 2 seconds showed no errors during the 30 minutes I tested. Scroll down the code a few lines for a log snippet.

Could be the same problem described here: https://forum.sierrawireless.com/t/lost-data-over-tcp-ip-ssl/7328/1 (lost data over TCP/IP (SSL))

If anybody knows a workaround, I’d very much like to try it. I must be missing something, or misusing something…

-“Open AT Application Framework package”,“2.52.2.A5.201412171046”
-“Open AT OS Package”,“6.52.0.201402271535”
-“Firmware Package”,“7.52.2.A5.201411281219”
-“Internet Library Package”,“5.56.0.201305170830”

#define GPRS_APN        "replace"
#define GPRS_USER       ""
#define GPRS_PASSWORD   ""
#define GPRS_PINCODE    "7999"
#define TCP_PEER_STRADDR "replace.com"
#define TCP_PEER_PORT    40900


// 12 errors in 30 minutes
#define HEARTBEAT1_PERIOD   31
#define HEARTBEAT2_PERIOD   32

// no errors in 30 minutes
//#define HEARTBEAT1_PERIOD   21
//#define HEARTBEAT2_PERIOD   22


// -----------------------------------------------------------------------------

//
// netcat can be used on the server to listen for the connection.
//
// example of data corruption (if no errors, wait at least for 30 minutes):
//
//    abc@machine: nc -vvv -l 0.0.0.0 40900
//    Listening on [0.0.0.0] (family 0, port 40900)
//    Connection from [83.180.13.158] port 40900 [tcp/*] accepted (family 2, sport 49152)
//    123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxy
//    123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxy
//    123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvtuvwxy
//    123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_``abcdefghijklmnopqrstuvwxy
//    123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxy
//    123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxy
//    123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxy
//    123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVVWXYZ[\]^_`abcdefghijklmnopqrstuvwxy
//    123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefgefghijklmnopqrstuvwxy
//    123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxy
//    123456789:;<=>?@ABCDEFGHIJHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxy
//    123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxy
//    123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxy
//    123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxy
//    123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrsttuvwxy
//    123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxy
//    123456789:;<=>?@ABCDEFGGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxy
//    123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijkljklmnopqrstuvwxy
//    123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxy
//    123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxy
//    ...
//

/*
ati9
"DWL"," S4_1_0_21BT R2371 CNSHZ-ED-XP0031 2014/11/27 16:53:47","","Sierra Wireless",0,"","00000000","00000000"
"FW","FW_752_A5_68_F2_5.SL808Fx","R7.52.2.A5.201411281219.SL8082T","Sierra Wireless",1804452,"112814 12:19","02e18b16","10002020"
"MODEM 3G+","Revision: S4_1_0_21AP R2371 CNSHZ-ED-XP0031 2014/11/27 16:53:47"
"OAT","1.0.0.20150403050918","MyApplication","Test",109016,"040315 05:09","6318a5a2","10700000"
 -"Developer Studio","3.3.0.201501281210"
 -"Open AT Application Framework package","2.52.2.A5.201412171046"
 -"Open AT OS Package","6.52.0.201402271535"
 -"Firmware Package","7.52.2.A5.201411281219"
 -"Internet Library Package","5.56.0.201305170830"
"ROM","8400000"
"RAM","4000000"
"OATRAM","b00000"
"DWLNAME","SL808x"
*/


#include "adl_global.h"
#include "generated.h"

#include "wip.h"

//  Stack size Declaration
#ifdef __GNU_GCC__
    // GCC needs more stack than ARM compilers
    #define DECLARE_CALL_STACK(X)       (X*3)
    #define DECLARE_LOWIRQ_STACK(X)     const u32 wm_apmIRQLowLevelStackSize = X*3
    #define DECLARE_HIGHIRQ_STACK(X)    const u32 wm_apmIRQHighLevelStackSize = X*3
#else
    #define DECLARE_CALL_STACK(X)       (X)
    #define DECLARE_LOWIRQ_STACK(X)     const u32 wm_apmIRQLowLevelStackSize = X
    #define DECLARE_HIGHIRQ_STACK(X)    const u32 wm_apmIRQHighLevelStackSize = X
#endif // #ifndef __GNU_GCC__

/////////////////////////////////////////////////////////////////////////////

#define GPRS_BEARER     "GPRS"
#define REG_STATE_REG   1
#define REG_STATE_ROAM  5
#define CREG_POLLING_PERIOD 20

#define ASSERT( pred )  if( !(pred)) TRACE(( 1, "ASSERT %i: " #pred "\n", __LINE__))

static wip_channel_t g_socket;

// every call returns the next byte in a cycle from '1' to 'y', and ends with '\n'.
// 123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxy
// 123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxy
// ..
static u8 get_next_byte(void)
{
    static u8 b = '1';
    if (b == 'z') {
        b = '1';
        return '\n';
    } else {
        return b++;
    }
}

// send len num bytes to g_socket. get the bytes on-by-one from get_next_byte()
static void send_next_bytes(u8 len)
{
    if (g_socket && len) {
        static u8 buf[255];
        memset(buf, '*', sizeof(buf));
        u8 i;
        for (i = 0; i < len; i++) {
            buf[i] = get_next_byte();
        }

        int n = wip_write(g_socket, buf, len);
        if (n <= 0) {
            TRACE(( 1, "send_next_bytes wip_write error %i", n ));
        }
    }
}

static void SendHeartbeat2(u8 Id, void* ctx)
{
    send_next_bytes(1);
}

static void SendHeartbeat1(u8 Id, void* ctx)
{
    send_next_bytes(3);
}

// Handling events happening on the TCP client socket.
static void cbevh(wip_event_t *ev, void *ctx)
{
    switch (ev->kind)
    {
        case WIP_CEV_OPEN:
        {
            TRACE((1, "WIP_CEV_OPEN"));
            g_socket = ev->channel;
            break;
        }

        case WIP_CEV_READ: break;

        case WIP_CEV_WRITE:
        {
            TRACE((1, "WIP_CEV_WRITE - %d bytes can be written", ev->content.write.writable));
            break;
        }

        case WIP_CEV_ERROR:
        case WIP_CEV_PEER_CLOSE:
        default:
        {
            TRACE((1, "WIP_CEV EVENT %d. restart modem.", ev->kind));
            wip_close(ev->channel);
            g_socket = 0;
            break;
        }
    }
}

static void cbEvhBearer(wip_bearer_t b, s8 event, void *ctx)
{
    TRACE((1, "cbEvhBearer: event  %i", event));
    if (WIP_BEV_IP_CONNECTED == event) {
        TRACE(( 1, "cbEvhBearer: connecting to client " TCP_PEER_STRADDR ":%i", TCP_PEER_PORT ));
        wip_channel_t socket;
        socket = wip_TCPClientCreate(TCP_PEER_STRADDR, TCP_PEER_PORT, cbevh, NULL);
        ASSERT(socket);
    } else {
        TRACE((1, "cbEvhBearer: unknown event. restart modem.")) ;
    }
}

static void OpenAndStartBearer(u8 Id, void* ctx)
{
    TRACE((1, "OpenAndStartBearer"));
    static wip_bearer_t br;

    s8 r = wip_bearerOpen(&br, GPRS_BEARER, cbEvhBearer, NULL);
    ASSERT(r == OK);

    r = wip_bearerSetOpts(br,
                            WIP_BOPT_GPRS_APN, GPRS_APN,
                            WIP_BOPT_LOGIN, GPRS_USER,
                            WIP_BOPT_PASSWORD, GPRS_PASSWORD,
                            WIP_BOPT_END);
    ASSERT(r == OK);

    r = wip_bearerStart(br);
    ASSERT(OK == r || WIP_BERR_OK_INPROGRESS == r);
}

static void PollCreg(u8 Id, void* ctx);

static bool cbPollCreg(adl_atResponse_t* Params) {
    TRACE((1, "cbPollCreg"));
    ascii regStateString[300]; // nowhere is said how large this should be to guarantee safety.

    wm_strGetParameterString(regStateString, Params->StrData, 2);
    s32 regStateInt = wm_atoi(regStateString);

    if (REG_STATE_REG == regStateInt || REG_STATE_ROAM == regStateInt) {
        TRACE((1, "cbPollCreg: Registered on GPRS network."));
        OpenAndStartBearer(0, NULL) ;
    } else {
        adl_tmrSubscribe(FALSE, CREG_POLLING_PERIOD, ADL_TMR_TYPE_100MS, (adl_tmrHandler_t)PollCreg);
    }
    return FALSE;
}

static void PollCreg(u8 Id, void* ctx)
{
    TRACE((1, "PollCreg: waiting for GPRS network registration."));
    adl_atCmdCreate("AT+CREG?", FALSE, (adl_atRspHandler_t)cbPollCreg, ADL_STR_CREG, NULL);
}

static void cbEvhSim(u8 event)
{
    TRACE((1, "cbEvhSim: event %d", event));
    if (ADL_SIM_EVENT_FULL_INIT == event) {
        PollCreg(0, NULL);
    }
}

void CfgGprs()
{
    TRACE((1, "CfgGprs: waiting for SIM."));
    adl_simSubscribe(cbEvhSim, GPRS_PINCODE);
}

void main_task(void)
{
    TRACE(( 1, "main_task init" ));
    wip_logEvents = true;

    adl_tmrSubscribe(TRUE, HEARTBEAT1_PERIOD, ADL_TMR_TYPE_100MS, (adl_tmrHandler_t)SendHeartbeat1);
    adl_tmrSubscribe(TRUE, HEARTBEAT2_PERIOD, ADL_TMR_TYPE_100MS, (adl_tmrHandler_t)SendHeartbeat2);

    // initialize the network stack
    s8 ret = wip_netInitOpts(WIP_NET_OPT_END);
    if (OK == ret) {
        CfgGprs();
    } else {
        TRACE((1, "ERROR: net stack init failed: %i : restart modem.", ret));
    }
}

what is the error you are getting…/? during wip_write?

Ah, sorry. I meant to use “data corruption” instead of “error”. There are no errors returned anywhere, but the data that reaches the server is corrupted. Running the program from my post results in this when listening from a server (all lines should be the same if there’s no corruption):

// netcat can be used on the server to listen for the connection.
//
// example of data corruption (if no errors, wait at least for 30 minutes):
//
//    abc@machine: nc -vvv -l 0.0.0.0 40900
//    Listening on [0.0.0.0] (family 0, port 40900)
//    Connection from [83.180.13.158] port 40900 [tcp/*] accepted (family 2, sport 49152)
//    123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxy
//    123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxy
//    123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvtuvwxy
//    123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_``abcdefghijklmnopqrstuvwxy
//    123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxy
//    123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxy
//    123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxy
//    123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVVWXYZ[\]^_`abcdefghijklmnopqrstuvwxy
//    123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefgefghijklmnopqrstuvwxy
//    123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxy
//    123456789:;<=>?@ABCDEFGHIJHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxy
//    123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxy
//    123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxy
//    123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxy
//    123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrsttuvwxy
//    123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxy
//    123456789:;<=>?@ABCDEFGGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxy
//    123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijkljklmnopqrstuvwxy
//    123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxy
//    123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxy
//    ...
//

Yesterday I tried with WIP_COPT_NODELAY turned on, and so far there’s no data corruption for 12 hours.

Hiya,

I’ve seen this before, and it was an internal buffer truncation issue.

If I read your WIP_WRITE code correctly, it looks to me you are not correctly dealing with the cased where the internal TCP write buffer is full. In particular, these lines:

int n = wip_write(g_socket, buf, len);
        if (n <= 0) {
            TRACE(( 1, "send_next_bytes wip_write error %i", n ));
        }

wip_write() returns the number of bytes successfully written to the internal buffer (positive number) or an error (negative number). This MAY NOT be the whole contents of your “buf” buffer that you have supplied.

wip_write() is non-blocking, so will return immediately.

You need to check that

n==len

after wip_write() returns. If n != len, then only the first n bytes of your buffer were loaded into the internal TCP write buffer, and you need to wait for the next WIP_WRITE event to load the rest of the buffer.

Just out if interest, see if you are getting dropped data at around 1492 or 1500 bytes (or multiples thereof).

ciao, Dave

Well that was an embarrassing error on my part : ) but still, no dice.

I have the following code now

int n = wip_write(g_socket, buf, len);
        if (n != len) {
            TRACE(( 1, "send_next_bytes wip_write error %i", n ));
        }
    }
}

static void SendHeartbeat2(u8 Id, void* ctx) {
    send_next_bytes(1);
}

static void SendHeartbeat1(u8 Id, void* ctx) {
    send_next_bytes(1);
}

and got extra bytes on server after first ~20 characters. No log entries about wip_write problems nor WIP_WRITE events.

Can anybody from Sierra Wireless confirm this bug exists? Seems a pretty serious and fundamental problem if TCP can be fuzzed with timers.

The example code provided in the first post is the simplest program I could make that connects to a server (plus just two timers and a special data-stream generator to be able to easily visually detect errors in the received stream). I’d be happy to help however I can.

Thanks davidc for pointing out the wip_write return code handling problems in my code. But in this case those problems didn’t matter because wip_write always reported that it accepted the whole buffer.

I am also seeing a few bytes being duplicated within packets. I my instance it seems to occur (sometimes) when the TCP stack breaks the data into multiple packets. The last byte/s of a packet are duplicated on the next packet… e.g.

I send the following to wip_write(
“ABCDEFGHIJKLMNOPQRSTUVWXYZ”,…)

Tracing by clikcing “Start IP traffic analyser” shows that the buffer is split into two packet as follow as

1st : “ABCDEFGHIJK”
2nd: “JKLMNOPQRSTUVW” -> “JK” is repeated here

Note that “JK” is repeated corrupting the originally intended packet. As mentioned above, this does not occur every time the packet is split but occasionally.

Any ideas?

Thanks,

Looking into this further seems to indicate that the modem FW is not managing its internal buffer properly.

Symptoms indicate that this “COULD” be an error where the internal pointer is not updated correctly but the count is. i.e. the pointer is moved 1-3bytes to little causing those bytes to be resent. But the resultant count of bytes in the TCP packet are then correct.

This also seems to only happen when calling wip_write while there is still data in the internal TCP Buffer. I have a scenario where I send 512 bytes at a time. Normally the previously 512 bytes are sent before I call wip_write again. But on the occasion where the TCP buffer is not empty and I call wip_write (i.e. adding to the buffer), the error happens (not always, but frequently enough).

I would appreciate someone at SW looking into this and double checking that they are handling the interalal TCP buffer.

But in the meantime the work around is to make sure the TCP buffer is completely empty before sending any more data by calling

wip_getOpts(channel, WIP_COPT_NWRITE, &NWRITE, WIP_COPT_END);

and ensuring the returned value equals the size of the buffer which I set using

wip_setOpts(ev->channel, WIP_COPT_SND_BUFSIZE, MAX_TCP_SND_BUFFER, WIP_COPT_NODELAY, TRUE, WIP_COPT_END)

I hope that helps someone else.

There is a CRC in an ethernet frame header which will drop a frame if it detects corruption. IPv4 has a header checksum to detect corruption in the IP header (IPv6 doesn’t have this). TCP has a checksum for the entire TCP segment to detect data corruption, and TCP guarantees delivery and out-of-order reassembly