MC7455 MBIM mode rcu_preempt stalls


#1

Hello there!

We are experiencing some issues with the MC7455 modem module. First, some context:

- We are currently using Linux Kernel version 4.4.93.
- We ran into trouble integrating the QMI drivers into our image, so we switched to the in-kernel MBIM drivers.

Everything now appears to work as expected. However, our device crashes randomly. So far we have not been able to reproduce the crash on demand; it just happens from time to time, after anything from a few hours to several days. Judging from the debug console log (see below), the issue seems to be related to the ‘mbim-proxy’ process, whose job is to communicate with the modem.

rcu_preempt kthread starved for 5176724 jiffies! g1109445 c1109444 f0x2 s4 ->state=0x0
INFO: rcu_preempt detected stalls on CPUs/tasks:
(detected by 3, t=5247862 jiffies, g=1109445, c=1109444, q=3051957)
All QSes seen, last rcu_preempt kthread activity 5183005 (9234805-4051800), jiffies_till_next_fqs=1, root ->qsmask 0x0
mbim-proxy R running 0 619 1 0x00000002
[<8001a5b4>] (unwind_backtrace) from [<80014d54>] (show_stack+0x20/0x24)
[<80014d54>] (show_stack) from [<800546d4>] (sched_show_task+0xc4/0x114)
[<800546d4>] (sched_show_task) from [<80082054>] (rcu_check_callbacks+0xa60/0xa6c)
[<80082054>] (rcu_check_callbacks) from [<800870fc>] (update_process_times+0x4c/0x74)
[<800870fc>] (update_process_times) from [<80099438>] (tick_sched_handle+0x58/0x5c)
[<80099438>] (tick_sched_handle) from [<800994a0>] (tick_sched_timer+0x64/0xa8)
[<800994a0>] (tick_sched_timer) from [<80087e58>] (__hrtimer_run_queues+0x17c/0x3f4)
[<80087e58>] (__hrtimer_run_queues) from [<8008880c>] (hrtimer_interrupt+0xc0/0x20c)
[<8008880c>] (hrtimer_interrupt) from [<800192f8>] (twd_handler+0x40/0x50)
[<800192f8>] (twd_handler) from [<80076eac>] (handle_percpu_devid_irq+0xac/0x238)
[<80076eac>] (handle_percpu_devid_irq) from [<800723a4>] (generic_handle_irq+0x34/0x44)
[<800723a4>] (generic_handle_irq) from [<800726c8>] (__handle_domain_irq+0x8c/0xfc)
[<800726c8>] (__handle_domain_irq) from [<800094f8>] (gic_handle_irq+0x58/0x9c)
[<800094f8>] (gic_handle_irq) from [<800157c0>] (__irq_svc+0x40/0x74)
Exception stack(0xbd4dfb70 to 0xbd4dfbb8)
fb60: 00000002 00000003 00000003 bf7c304c
fb80: bf7cd900 80cc4614 bf7cd904 80cc4dc4 00000004 00000001 80cc4dc4 bd4dfbf4
fba0: 00000001 bd4dfbc0 8009e20c 8009e23c 20070113 ffffffff
[<800157c0>] (__irq_svc) from [<8009e23c>] (smp_call_function_many+0x2b0/0x2d4)
[<8009e23c>] (smp_call_function_many) from [<8009e3f0>] (on_each_cpu_mask+0x48/0xb4)
[<8009e3f0>] (on_each_cpu_mask) from [<800ec3c0>] (drain_all_pages+0xec/0xf4)
[<800ec3c0>] (drain_all_pages) from [<800ef918>] (__alloc_pages_nodemask+0x5f4/0xa58)
[<800ef918>] (__alloc_pages_nodemask) from [<800f3c0c>] (__do_page_cache_readahead+0x118/0x25c)
[<800f3c0c>] (__do_page_cache_readahead) from [<800e8e48>] (filemap_fault+0x26c/0x48c)
[<800e8e48>] (filemap_fault) from [<801b7428>] (ext4_filemap_fault+0x3c/0x50)
[<801b7428>] (ext4_filemap_fault) from [<8011021c>] (__do_fault+0x4c/0xa8)
[<8011021c>] (__do_fault) from [<801133b4>] (handle_mm_fault+0x2c8/0xccc)
[<801133b4>] (handle_mm_fault) from [<8001ec10>] (do_page_fault+0x148/0x3a8)
[<8001ec10>] (do_page_fault) from [<800092a0>] (do_PrefetchAbort+0x48/0xac)
[<800092a0>] (do_PrefetchAbort) from [<80015ca0>] (ret_from_exception+0x0/0x20)
Exception stack(0xbd4dffb0 to 0xbd4dfff8)
ffa0: 0013bfa0 00000000 00000386 0013de70
ffc0: 0013bfa0 0013dea8 00000000 00000000 00135488 7eab9bd8 7eab9bd4 00000000
ffe0: 0015e04e 7eab9b90 76f46470 76f3fc4c 20070010 ffffffff
rcu_preempt kthread starved for 5183029 jiffies! g1109445 c1109444 f0x2 s4 ->state=0x0
INFO: rcu_preempt detected stalls on CPUs/tasks:
(detected by 3, t=5254167 jiffies, g=1109445, c=1109444, q=3051957)
All QSes seen, last rcu_preempt kthread activity 5189310 (9241110-4051800), jiffies_till_next_fqs=1, root ->qsmask 0x0
mbim-proxy R running 0 619 1 0x00000002
[<8001a5b4>] (unwind_backtrace) from [<80014d54>] (show_stack+0x20/0x24)
[<80014d54>] (show_stack) from [<800546d4>] (sched_show_task+0xc4/0x114)
[<800546d4>] (sched_show_task) from [<80082054>] (rcu_check_callbacks+0xa60/0xa6c)
[<80082054>] (rcu_check_callbacks) from [<800870fc>] (update_process_times+0x4c/0x74)
[<800870fc>] (update_process_times) from [<80099438>] (tick_sched_handle+0x58/0x5c)
[<80099438>] (tick_sched_handle) from [<800994a0>] (tick_sched_timer+0x64/0xa8)
[<800994a0>] (tick_sched_timer) from [<80087e58>] (__hrtimer_run_queues+0x17c/0x3f4)
[<80087e58>] (__hrtimer_run_queues) from [<8008880c>] (hrtimer_interrupt+0xc0/0x20c)
[<8008880c>] (hrtimer_interrupt) from [<800192f8>] (twd_handler+0x40/0x50)
[<800192f8>] (twd_handler) from [<80076eac>] (handle_percpu_devid_irq+0xac/0x238)
[<80076eac>] (handle_percpu_devid_irq) from [<800723a4>] (generic_handle_irq+0x34/0x44)
[<800723a4>] (generic_handle_irq) from [<800726c8>] (__handle_domain_irq+0x8c/0xfc)
[<800726c8>] (__handle_domain_irq) from [<800094f8>] (gic_handle_irq+0x58/0x9c)
[<800094f8>] (gic_handle_irq) from [<800157c0>] (__irq_svc+0x40/0x74)
Exception stack(0xbd4dfb70 to 0xbd4dfbb8)
fb60: 00000002 00000003 00000003 bf7c304c
fb80: bf7cd900 80cc4614 bf7cd904 80cc4dc4 00000004 00000001 80cc4dc4 bd4dfbf4
fba0: 00000001 bd4dfbc0 8009e20c 8009e23c 20070113 ffffffff
[<800157c0>] (__irq_svc) from [<8009e23c>] (smp_call_function_many+0x2b0/0x2d4)
[<8009e23c>] (smp_call_function_many) from [<8009e3f0>] (on_each_cpu_mask+0x48/0xb4)
[<8009e3f0>] (on_each_cpu_mask) from [<800ec3c0>] (drain_all_pages+0xec/0xf4)
[<800ec3c0>] (drain_all_pages) from [<800ef918>] (__alloc_pages_nodemask+0x5f4/0xa58)
[<800ef918>] (__alloc_pages_nodemask) from [<800f3c0c>] (__do_page_cache_readahead+0x118/0x25c)
[<800f3c0c>] (__do_page_cache_readahead) from [<800e8e48>] (filemap_fault+0x26c/0x48c)
[<800e8e48>] (filemap_fault) from [<801b7428>] (ext4_filemap_fault+0x3c/0x50)
[<801b7428>] (ext4_filemap_fault) from [<8011021c>] (__do_fault+0x4c/0xa8)
[<8011021c>] (__do_fault) from [<801133b4>] (handle_mm_fault+0x2c8/0xccc)
[<801133b4>] (handle_mm_fault) from [<8001ec10>] (do_page_fault+0x148/0x3a8)
[<8001ec10>] (do_page_fault) from [<800092a0>] (do_PrefetchAbort+0x48/0xac)
[<800092a0>] (do_PrefetchAbort) from [<80015ca0>] (ret_from_exception+0x0/0x20)
Exception stack(0xbd4dffb0 to 0xbd4dfff8)
ffa0: 0013bfa0 00000000 00000386 0013de70
ffc0: 0013bfa0 0013dea8 00000000 00000000 00135488 7eab9bd8 7eab9bd4 00000000
ffe0: 0015e04e 7eab9b90 76f46470 76f3fc4c 20070010 ffffffff
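The stall reports measure time in jiffies, and the length of a jiffy depends on the kernel's `CONFIG_HZ`, which the post does not state. A minimal Python sketch that pulls the starvation count out of a line such as the first one in the log and converts it to hours, assuming HZ is either 100 or 1000 (common values, not confirmed for this board):

```python
import re
from typing import Optional

# Matches e.g. "rcu_preempt kthread starved for 5176724 jiffies!"
STARVED_RE = re.compile(r"rcu_preempt kthread starved for (\d+) jiffies")


def starved_jiffies(line: str) -> Optional[int]:
    """Return the starvation count in jiffies, or None if the line doesn't match."""
    m = STARVED_RE.search(line)
    return int(m.group(1)) if m else None


def jiffies_to_seconds(jiffies: int, hz: int) -> float:
    """Convert jiffies to seconds; hz must match the kernel's CONFIG_HZ."""
    return jiffies / hz


if __name__ == "__main__":
    line = ("rcu_preempt kthread starved for 5176724 jiffies! "
            "g1109445 c1109444 f0x2 s4 ->state=0x0")
    j = starved_jiffies(line)
    for hz in (100, 1000):  # assumed candidate CONFIG_HZ values
        print(f"HZ={hz}: {jiffies_to_seconds(j, hz) / 3600:.1f} h")
```

This prints 14.4 h for HZ=100 and 1.4 h for HZ=1000, so under either assumption the RCU kthread had been starved for well over an hour when the report was printed.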

We would like to know whether you have seen a similar error before and/or whether you could shed some light on this matter. Any further information is welcome!

Kind regards!


#2

@danlor,

To respond to this: running MBIM on Linux is not the usual way to use these modems (not that it is invalid), so very few other customers will have deployed this setup, and I am not personally aware of anyone else using the unit this way. The MC/EM products are typically very stable, so resets are quite unusual.

These are the steps I would take:

  • Get a design review performed through your commercial channel.
  • Some logging could be enabled to see why the unit is resetting, but this would need to be done under guidance, and the way you are using the modem complicates matters.

Why did you decide to go down this route rather than the normal QMI drivers/QMI daemon/AT commands?

Regards

Matt


#3

@mlw,

Thank you for your prompt response.

We tried to make it work with the QMI drivers, but we ran into issues due to “raw-ip” incompatibilities in our Linux kernel. While the MC7533 did work via QMI, the MC7455 didn’t. So we had to choose between upgrading our kernel beyond 4.5 and simply using MBIM; the latter seemed the more straightforward option to us.
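For reference, newer mainline kernels let qmi_wwan put an interface into raw-IP mode through a sysfs attribute, `/sys/class/net/<iface>/qmi/raw_ip`, which reads `Y` or `N`. A minimal, read-only Python sketch that checks whether a given interface exposes that attribute, as one way to tell whether the running qmi_wwan supports raw-IP. The interface name `wwan0` and the helper name `raw_ip_state` are assumptions for illustration only:

```python
from pathlib import Path
from typing import Optional


def raw_ip_state(iface: str, sysfs_root: str = "/sys/class/net") -> Optional[bool]:
    """Report qmi_wwan raw-IP mode for `iface` (hypothetical helper).

    Returns None if the qmi/raw_ip attribute does not exist (e.g. a kernel
    whose qmi_wwan lacks raw-IP support, or a non-qmi_wwan interface);
    otherwise True/False for the current 'Y'/'N' setting.
    """
    attr = Path(sysfs_root) / iface / "qmi" / "raw_ip"
    if not attr.is_file():
        return None
    return attr.read_text().strip().upper() == "Y"


if __name__ == "__main__":
    # "wwan0" is an assumed interface name; adjust for your system.
    state = raw_ip_state("wwan0")
    if state is None:
        print("no qmi/raw_ip attribute: raw-IP not available on this interface")
    else:
        print(f"raw_ip is {'enabled' if state else 'disabled'}")
```

The sysfs root is a parameter so the check can be exercised against a scratch directory instead of a live system.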

Kind regards,

Daniel


#4

@danlor
The crash you are experiencing seems to be in the MBIM driver. If you are using open-source drivers, the qmi_wwan driver (RMNET) may be better suited. A patch to qmi_wwan, described at https://www.systutorials.com/linux-kernels/739143/qmi_wwan-set-dtr-for-modems-in-forced-usb2-mode-linux-4-14-73/, made sure that the MC7455 modem works well with it.


#5

No, it is not. The current userspace task is mbim-proxy, but that does not tell you anything useful.

The warnings look completely unrelated to me. I can’t see any network/USB/MBIM stuff in the stack traces. I don’t know what the problem could be, but I do see ext4_filemap_fault etc. in there, which might indicate a file system issue? But who knows.
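The check described above (does any frame in the backtrace belong to network, USB or MBIM code?) can be sketched as a quick scan of the logged frames. A minimal Python sketch; the keyword lists are a hypothetical heuristic chosen for illustration, not an authoritative mapping of kernel symbols to subsystems:

```python
import re
from typing import Dict, List

# Captures the function name in "[<addr>] (func) from [<addr>] (caller+0x../0x..)"
FRAME_RE = re.compile(r"\] \(([A-Za-z0-9_.]+)(?:\+0x[0-9a-f]+/0x[0-9a-f]+)?\)")

# Hypothetical keyword heuristic: substrings hinting at each subsystem.
SUBSYSTEMS: Dict[str, List[str]] = {
    "usb/mbim/net": ["usb", "mbim", "cdc", "wdm", "netif", "skb", "qmi"],
    "filesystem": ["ext4", "filemap", "readahead"],
    "memory": ["alloc_pages", "drain_all_pages", "page_fault", "mm_fault"],
}


def classify(trace: str) -> Dict[str, List[str]]:
    """Group the function names found in a backtrace by matching subsystem."""
    hits: Dict[str, List[str]] = {name: [] for name in SUBSYSTEMS}
    for func in FRAME_RE.findall(trace):
        for name, keys in SUBSYSTEMS.items():
            if any(k in func for k in keys) and func not in hits[name]:
                hits[name].append(func)
    return hits


if __name__ == "__main__":
    # A few frames copied from the log in this thread.
    trace = """\
[<800ec3c0>] (drain_all_pages) from [<800ef918>] (__alloc_pages_nodemask+0x5f4/0xa58)
[<800e8e48>] (filemap_fault) from [<801b7428>] (ext4_filemap_fault+0x3c/0x50)
[<801133b4>] (handle_mm_fault) from [<8001ec10>] (do_page_fault+0x148/0x3a8)
"""
    for subsystem, funcs in classify(trace).items():
        print(f"{subsystem}: {funcs}")
```

An empty "usb/mbim/net" result does not prove a driver is innocent; it only shows that none of its functions were on the stack when the report was printed.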

Anyway, trying other drivers is a good idea. I suspect you’ll have exactly the same issue, since it is probably unrelated. But it’s good to have it verified.

And to those who worry: yes, the MC7455 works very well in MBIM mode. It might not be a common mode for the MC version, but it’s certainly the most common mode for the EM7455, which is essentially the same module.


#6

Thank you both for your responses.

We’ve switched to the Sierra Gobi drivers and will start the testing phase today. I’ll update the post if we manage to clarify this matter, but I’m afraid the issue is not caused by the MBIM driver, as @dl5162 says. Only time will tell.

Kind regards,
Daniel