GobiNet Linux Driver Bug

We have discovered a bug in the QMIDevice of the GobiNet Linux driver. We are using driver version S2.27N2.40 on Linux kernel 4.5.4. We are principally using this with MC7354 and MC7304. We have observed this behaviour on all available drivers on the SW website.

The bug we are seeing is that any child process exiting (e.g. one spawned by a system() call) causes the QMIDevice to close, which puts the network interface into the carrier-off state while a data session is active.

The following test app demonstrates the behaviour. The GobiNet driver needs to be loaded with debug=1.

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>   /* system() */
#include <unistd.h>   /* sleep(), close() */

int main()
{
  int fd = open( "/dev/qcqmi0", O_RDWR | O_NONBLOCK );
  printf("Opened qmi device %d\n", fd);

  sleep(10);

  printf("running system call\n");
  system("sleep 10");
  printf("system call complete\n");
  // look in dmesg for 'GobiNet::UserspaceClose'

  sleep(10);

  printf("closing fd\n");
  close(fd);
  sleep(10);
  // look in dmesg for 'GobiNet::UserspaceClose' and 'bad file data'

  printf("DONE\n");

  return 0;
}
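
As a point of comparison, the same fork-and-exit sequence against an ordinary pipe leaves the parent's descriptors usable. The sketch below is illustrative only: the helper name fd_survives_child_exit is made up for this example, and no GobiNet device is involved.

```c
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: returns 1 if the parent's descriptors still work
 * after a forked child (holding duplicates of them) has exited, 0 if they
 * do not, and -1 on a setup error. */
int fd_survives_child_exit(void)
{
    int p[2];
    if (pipe(p) != 0)
        return -1;

    pid_t pid = fork();
    if (pid < 0) {
        close(p[0]);
        close(p[1]);
        return -1;
    }
    if (pid == 0) {
        /* Child: exiting closes its duplicates of p[0] and p[1]. */
        _exit(0);
    }

    /* Parent: wait for the child to exit, much as system() would. */
    int status = 0;
    if (waitpid(pid, &status, 0) < 0) {
        close(p[0]);
        close(p[1]);
        return -1;
    }

    /* The parent's own descriptors should be unaffected by the child. */
    const char msg[] = "ping";
    char buf[sizeof msg] = {0};
    int ok = write(p[1], msg, sizeof msg) == (ssize_t)sizeof msg &&
             read(p[0], buf, sizeof buf) == (ssize_t)sizeof msg &&
             memcmp(msg, buf, sizeof msg) == 0;

    close(p[0]);
    close(p[1]);
    return ok ? 1 : 0;
}
```

Calling the helper from a small main and printing the result is enough to try it. This is the behaviour we would expect /dev/qcqmi0 to follow as well.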

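The problem is caused by the QMIDevice closing down in ‘flush’, which is called whenever any fd referring to the open file is closed, including a duplicated fd in a child process. (‘release’, by contrast, is called only once, when the last reference to the file is dropped.) The following patch fixes the problem in our system by ignoring the close while other references to the file remain. A better fix might be to do the teardown in ‘release’ instead of ‘flush’.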

--- S2.27N2.40-orig/GobiNet/QMIDevice.c 2017-01-05 09:10:28.000000000 +0000
+++ S2.27N2.40/GobiNet/QMIDevice.c  2017-03-03 11:38:14.774798855 +0000
@@ -3310,6 +3310,7 @@ int UserspaceClose(
    fl_owner_t          unusedFileTable )
 {
    sQMIFilpStorage * pFilpData = NULL;
+   long c = 0;
    DBG( "\n" );
 
    if(pFilp ==NULL)
@@ -3317,7 +3318,14 @@ int UserspaceClose(
       printk( KERN_INFO "bad file data\n" );
       return -EBADF;
    }
-   
+
+   c = atomic_long_read(&pFilp->f_count);
+   if (c > 1)
+   {
+     DBG("f_count %ld - ignoring close", c);
+     return 0;
+   }   
+
    pFilpData = (sQMIFilpStorage *)pFilp->private_data;
    if (pFilpData == NULL)
    {
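
To make the effect of the f_count check easier to see, here is a toy userspace model of the reference counting. It is a sketch under stated assumptions, not kernel or driver code: struct model_file, model_close and the two flush functions are invented names, and the only behaviour modelled is that the flush hook runs on every close while the count still includes the descriptor being closed.

```c
#include <assert.h>

/* Toy model of one kernel open file description (a "struct file"). */
struct model_file {
    long f_count;    /* descriptors currently referring to this file */
    int  device_up;  /* 1 while the modelled QMI client is still alive */
};

typedef void (*flush_fn)(struct model_file *);

/* Teardown on every flush: the behaviour described for the unpatched driver. */
void flush_always_teardown(struct model_file *f)
{
    f->device_up = 0;
}

/* Same hook with the guard from the patch: ignore the close while other
 * references to the file remain. */
void flush_guarded(struct model_file *f)
{
    if (f->f_count > 1)
        return;
    f->device_up = 0;
}

/* Model of close(): the flush hook runs on every close, and the count
 * still includes the descriptor being closed while it runs. */
void model_close(struct model_file *f, flush_fn flush)
{
    assert(f->f_count > 0);
    flush(f);
    f->f_count--;
}

/* Parent opens the device, a child inherits a duplicate and exits.
 * Returns device_up as the parent sees it afterwards. */
int device_up_after_child_exit(flush_fn flush)
{
    struct model_file f = { .f_count = 1, .device_up = 1 }; /* open() */
    f.f_count++;                     /* fork(): child holds a duplicate */
    model_close(&f, flush);          /* child exits */
    return f.device_up;
}

/* Same sequence, followed by the parent's own final close(). */
int device_up_after_last_close(flush_fn flush)
{
    struct model_file f = { .f_count = 1, .device_up = 1 };
    f.f_count++;
    model_close(&f, flush);          /* child exits */
    model_close(&f, flush);          /* parent closes the last descriptor */
    return f.device_up;
}
```

In this model the unguarded hook takes the device down as soon as the child exits, while the guarded hook keeps it up until the parent's own close. Moving the teardown into ‘release’ would give the same result without inspecting the count.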

Hi,

Thanks for the detailed feedback. I have passed your post on to the relevant team internally for them to look at and potentially improve the drivers.

Regards

Matt

Hi,

A quick update - the bug is still present in the latest driver (S2.28N2.44). There have been a lot of changes to QMIDevice.c but nothing that fixes this problem. Consequently we have had to update our patch:

--- S2.28N2.44-orig/GobiNet/QMIDevice.c	2017-07-06 09:04:36.000000000 +0100
+++ S2.28N2.44/GobiNet/QMIDevice.c	2017-08-08 20:45:22.752289132 +0100
@@ -3817,12 +3817,20 @@ int UserspaceClose(
    pid_t pid = -1;
    int iTimeout = 0;
    int iFile_count = 0;
+   long c = 0;
    mb();
    if(pFilp ==NULL)
    {
       printk( KERN_INFO "bad file data\n" );
       return -EBADF;
    }
+
+   c = file_count(pFilp);
+   if (c > 1)
+   {
+     DBG("f_count %ld - ignoring close", c);
+     return 0;
+   }   
    
    pFilpData = (sQMIFilpStorage *)pFilp->private_data;
    if (pFilpData == NULL)

Is there a plan to fix this in the driver at some point so that we don’t have to keep regenerating this patch every time we take a driver update?

Best regards,
Andrew

Hi,

Just as some feedback, we are going to integrate this change into our drivers in the very near future.

Regards

Matt

Thanks Matt.

As a general point, would it be possible to find more information about which combinations of module and kernel versions are tested at each release? I understand it is virtually impossible to test every possible combination, but we have recently encountered two separate issues which indicate that the combinations involved could not possibly have been tested.

Driver S2.27N2.40 with MC74xx modules and Kernel 4.5.4 on Intel x64
No IP data is ever received by application code, and packet socket data is corrupted on receive (an ethernet header was present when using SOCK_DGRAM with packet sockets).

This issue was eventually resolved by updating to S2.28N2.44 - but nothing in the release notes would indicate this was likely to resolve the issue. There is a mention of a change to hard_header_len in ethernet mode - but this problem only occurs in raw-ip mode.

Driver S2.28N2.44 with MC74xx and MC73xx with Kernel 3.6.8 on Atmel SAM9
QMIDevice triggers a kernel BUG (‘scheduling while atomic’) during device registration.

We are still investigating this issue and will raise it separately here and with our supplier.

In both of these instances the fault is very serious and means the combination could not have been tested at all. What kind of support can we get in the future to help avoid these types of issues?

Regards,
Andrew

Andrew,

The short answer is no. What you are after is our SDK/driver/firmware validation test plan, which I do not have access to (and which will certainly not be available to anyone externally), and which changes with every release we do. The general rules, I think, will be (and I am guessing here):

  • We always test with the latest and greatest of everything.
  • For the SDK/drivers, 99% of our validation testing will be done on i686. We might perform a passing test on ARM (because it is so popular), but the other architectures will just be generated as part of the standard, proven build process.

As an example, we will not generate a release of firmware for a given unit and then perform backwards-compatibility testing against all driver versions, SDKs and architectures, as the scope is just way too large.

What you have referred to are extremely low-level and specific issues. As with all validation, you can only test to a degree. Once a problem has been found it is blindingly obvious that it is there, so people do ask how it was missed, but finding it in the first place is very difficult.

Regards

Matt

Hi Matt,

Thanks for that. I agree that those issues are very specific and only become obvious once you have found them. Knowing that there is likely to be no regression testing does help us in terms of our planning and expectations when we take a driver update or move to a new hardware revision (e.g. 74xx).

Do you know if the ‘scheduling while atomic’ issue is a known issue? If so, is this likely to be fixed soon?

Best regards,
Andrew