UART bug in OpenAT 7.x

Hi,

after a lot of debugging we’ve discovered what seems to be a bug in Open AT 7.44 (at least for Q26x devices) and possibly some earlier versions. The problem is that the ~CTS pin is set high too late when the UART buffer is about to be overrun. This leads to a loss of data while, for example, sending data through GPRS to a server.

We can reproduce the problem on both a Q26 Extreme and Q2687 with Open AT 7.44 and WipSoft 5.40 on the Q26 development board with a standard PC and an RS232 cable. If we downgrade the Q2687 to earlier versions of Open AT (in the 6.XX region) the problem goes away.

Steps to reproduce the problem:

  1. Program device with the latest software packages (Open AT 7.44, WipSoft 5.40)
  2. Connect to the device via an RS232 cable to the development board or some other USART connector.
  3. Start the application (WipSoft) with “at+wopen=1”
  4. Start a server somewhere (e.g. “nc -l 8080” on a POSIX server)
  5. Connect to the server with the following AT-commands:

AT+IFC=2,2
AT+WIPCFG=1
AT+WIPCFG=2,14,1
AT+WIPBR=1,6
AT+WIPBR=2,6,11,"YOUR.APN.HERE"
AT+WIPBR=4,6,0
AT+WIPCREATE=2,1,"YOUR.SERVER.HERE",8080
AT+WIPDATA=2,1,1

  6. Run a script (Python, bash or whatever) which does the following:
for(i=0; i<5000; i++) {
 while(CTS is false) {
  wait 1 second
 }
 send i to Q26_UART_1
 send "\n" to Q26_UART_1
}

This results in 1-3 lost bytes at each point where CTS is de-asserted in newer versions of the software. To simplify the testing we limit the incoming bandwidth on the server so that the modem has to use flow control.
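For anyone who wants to repeat the test, the loop above might look roughly like this in Python. It is only a sketch: it assumes the pyserial package, and the device name and baud rate are placeholders rather than values from our setup. `send_numbers` just needs an object with a `cts` attribute and a `write()` method, which pyserial's `Serial` provides. Hardware flow control is left off in the driver (`rtscts=False`) so that the script polls CTS itself, as the pseudocode does.

```python
import time


def send_numbers(port, count=5000, poll_interval=1.0, sleep=time.sleep):
    """Write 0..count-1, one number per line, pausing while CTS is deasserted.

    `port` is any object with a boolean `cts` attribute and a `write(bytes)`
    method (pyserial's Serial has both). Returns the number of bytes written.
    """
    written = 0
    for i in range(count):
        # Mirror the pseudocode: wait while the modem is not clear to send.
        while not port.cts:
            sleep(poll_interval)
        line = f"{i}\n".encode("ascii")
        port.write(line)
        written += len(line)
    return written


def run_on_serial_port(device="/dev/ttyUSB0", baudrate=115200):
    """Open a real port and run the test. Device and baud rate are placeholders."""
    import serial  # pyserial, assumed to be installed

    # rtscts=False: the driver does not gate on CTS, the loop above does.
    with serial.Serial(device, baudrate, rtscts=False) as uart:
        return send_numbers(uart)
```

Call `run_on_serial_port()` once the data connection has been opened with the AT commands listed earlier.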

We’re currently trying to find an older Open AT and WipSoft version for the Q26 Extreme (of which we currently have around 50 that are unusable) that doesn’t have this bug, to use until Sierra fixes the bug or at least takes a look at it.

Hope this saves some of you from wasting as much time as we have on this issue.

Best regards
Fredrik

I have found odd things with R7.44 UART which I have not resolved yet.
https://forum.sierrawireless.com/t/control-is-not-going-to-data-handler-in-fcm-uart1/4689/1
R7.40 was OK for me.

Ok, sounds promising. We’ve managed to find 7.40a in the Developer Studio repository but no earlier versions or even WPKs for 7.40a. Where did you find 7.40?

R7.40a should be OK on the UART front. Give it a try. I downloaded R7.40 when it was current.

The R7.4a version we found seems to be somewhat unstable. Do you have a WPK of R7.4 for the Extreme you could post?

Best regards
Fredrik

Our local distributor just sent us R7.4a00 and WipSoft 5.12, which work as expected but have the same CTS problem as all the newer versions. I guess the bug is older than that.

It’s quite odd that no one else here has run into the same bug.

I’ve run into something like that but generally ignored it, as I don’t need to send such amounts of data. I did some tests with large data and found I was losing some of it, but left it at that.

We’ve done some further testing and have found the bug as early as Open AT 7.4a, that is, all versions that are compatible with the Q26 Extreme.

Is someone from Sierra reading this? Either there’s a major bug in your software or the documentation left something important out. We’re calling our local distributor every other day for help, but they don’t seem to get any answers from Sierra.

I’ve been informed by our local distributor that this thread has been sent to the development team, so hopefully we will have some answers here soon.

It’s a real shame that SiWi don’t see fit to actually engage on their own forum.

:(

They could certainly learn a lot from the way that TI engage on their forums…

awneil: I agree. We still haven’t heard anything from Sierra about this issue despite several e-mails to our local distributor. Not even an e-mail saying they are looking at it.

Is anyone associated with Sierra reading this thread? We’ve been waiting for about 3 weeks for an answer now. Either there is a major bug in Open AT 7.44 which should be quite easy to fix (it is very easy to reproduce on all modules we’ve tested), or there’s something missing in the documentation.

We’ve had representatives from our local distributor here who have verified the bug with their own equipment and sent traces to Sierra. Unfortunately, there has been no response at all from Sierra yet.

Just like to add my 2 cents. We are also experiencing this same issue and have gotten no response from our problem reports.

I have to agree with awneil, the level of support from TI puts these guys to shame.

What is TI, Texas Instruments? 8)

Yes: TI = Texas Instruments

That’s really disturbing. We’ve designed our products for the Q26E with scheduled deployment during 2011. We’re expecting to deploy at least a few thousand units during 2011/2012, but that won’t work with a manufacturer that doesn’t even answer basic questions.

Have you guys had any experience with other brands of GSM/GPRS/UMTS modules? We’re looking for temperature-tolerant, energy-efficient modules that are, most of all, robust and come from a manufacturer who understands the importance of documentation and support. Without that, the best HW becomes unusable for large deployments.

Short update for those with similar issues:

We’ve received a response from Sierra (who has analyzed the debug traces we collected) suggesting that there might be three possible explanations for this issue:

  1. Hardware flow control is not enabled
  2. At module side, UART RX FIFO is too small
  3. At PC side, UART TX FIFO size is too large, so that when the PC sees CTS deasserted and stops the transfer, the data already in the FIFO is still transmitted to the module side, causing the overrun

#1 is not true since we use AT+IFC=2,2 and the CTS line asserts and deasserts when we flood the unit with data (just a little late). #3 could be true, but we’ve tried with both embedded modules where we control every bit put on the TX line and standard RS232 cables from a PC, with the same result. So, if this is true, the buffer on the Q26 side is so small that basically no equipment can communicate with it using flow control.
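To put rough numbers on explanation #3: whatever the sender has already queued in its TX FIFO when CTS deasserts still goes out on the wire, so the module needs at least that much free RX space at the moment it raises the line. A back-of-the-envelope sketch follows. The function name and the headroom figure are made up for illustration and are not Q26 values; 16 bytes is the TX FIFO depth of a common 16550-style PC UART.

```python
def bytes_lost(tx_fifo_pending, rx_headroom):
    """Bytes dropped after CTS deasserts.

    tx_fifo_pending: bytes already queued in the sender's TX FIFO, which are
                     still transmitted after the sender sees CTS deasserted.
    rx_headroom:     free bytes left in the receiver's RX buffer at the moment
                     it deasserts CTS (hypothetical value, not measured).
    """
    return max(0, tx_fifo_pending - rx_headroom)


# A 16-byte TX FIFO against a receiver that leaves 14 bytes free.
print(bytes_lost(16, 14))  # 2
# A sender that stops within one byte does not overrun that receiver.
print(bytes_lost(1, 14))   # 0
```

By this reasoning, losing bytes even when the sender stops within a single character would mean the headroom is close to zero.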

So, that leaves us with #2, which is what we’ve been suspecting for a while. To verify this, Sierra has sent “OASIS 2.35 WP10”, which supports “AT+WHCNF=6,2”, which should increase the RX FIFO. Unfortunately, this firmware version won’t recognize an external SIM card, which makes it impossible for us to test anything. We’re currently waiting for an answer from Sierra once again.

Update for those following this issue:

We’re still waiting for Sierra to fix the external SIM-card issue mentioned above so that we can verify that the buffer increase resolves the bug. We received a new build of Oasis 2.35 last week, but with the exact same SIM-card issue as the first one. We’re still trying to figure out why they sent us this. Either they are really stressed and just sent us something to make us happy for a few hours, or our feedback is lost on the way to Sierra support via our local distributor.

I’ve seen a forum administrator answering some questions in other threads, how come there isn’t even a comment on this whole issue here?

We have now received a version of 7.45 beta where we can set the FIFO size, but the bug remains on the Q26 Extreme. All responses from Sierra (via our local distributor) contain conflicting information about possible causes for the bug and how different buffers interact, which doesn’t make sense. We’ve even received questions on how to set up a simple server!

Since the deployment of our whole system is stalled due to this bug (causing major economic losses each day), we have decided to make contact with people higher up in the Sierra organization.

Parallel to this, we have put together a client and a server to reproduce the bug using a standard Sierra demo-board and a Q26 Extreme module running Open AT 7.4x or 7.5x beta (remove the WHCNF line if you are using 7.4x), with simple instructions (client and server attached to this message):


The two files attached are a client and a server to reproduce the bug found in all OpenAT 7.x firmware releases running on Q26Extreme. Follow these instructions step by step:

  1. Put the file uplinkTestServer.py on a computer with a public IP
  2. Run “python uplinkTestServer.py” on that computer
  3. Connect another computer to a Sierra demo-board with a Q26Extreme running OpenAT 7.5x beta and WipSoft 5.40 on it
  4. Send “AT+WIND=255” to the Q26Extreme followed by “AT&W”
  5. Replace the string YOURHOSTNAMEHERE.COM in testUplink.py with the IP or domain name of the computer running uplinkTestServer.py
  6. Run “python testUplink.py” on the computer connected to the demoboard
  7. Reset the Q26Extreme

The client will now use the Q26 to connect to the server running uplinkTestServer.py and write the numbers 0-100000 via TCP/IP to the server, one number per line. As soon as the server receives a line that does not consist of the last number +1, it will exit and print the erroneous line received. We have tried this with several releases of Open AT, both 7.4x and 7.5x, and they all produce erroneous lines!
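The attached scripts are not reproduced in this post, but the kind of check the server performs can be sketched as follows. This is an illustration only, not the contents of uplinkTestServer.py, and the function name `first_bad_line` is made up for the example.

```python
def first_bad_line(lines, start=0):
    """Return (index, line) for the first line that is not the expected
    next number, or None if the whole sequence is intact."""
    expected = start
    for index, raw in enumerate(lines):
        text = raw.strip()
        if text != str(expected):
            return index, text
        expected += 1
    return None


# Example: the newline after "2" was dropped on the UART,
# so "2" and "3" arrive fused into a single line.
received = "0\n1\n23\n4\n".splitlines()
print(first_bad_line(received))  # (2, '23')
```

A lost byte shows up either as a fused line like the one above or as a number with a digit missing, and both fail the comparison.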


Arkiv.zip (2.38 KB)

Just something to check which caused us problems once.
Have you looked at your TCP/IP packet sizes? If the TCP/IP is set to NO_DELAY you can get really small packets which the server/connection can struggle to handle, so you lose some bits of data in a large transfer.
Perhaps the default has changed??
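For reference, on an ordinary PC or server socket the corresponding option is TCP_NODELAY, and it can be inspected like this. This sketch covers only a Python socket on the computer side; how the option is set on the module is not shown here, and the helper name `nodelay_enabled` is made up for the example.

```python
import socket


def nodelay_enabled(sock):
    """True if TCP_NODELAY is set on the socket, i.e. Nagle's algorithm is off."""
    return sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY) != 0


with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
    print("default:", nodelay_enabled(s))
    # Turn the option on explicitly and read it back.
    s.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    print("after setsockopt:", nodelay_enabled(s))
```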