MC7304 loses mobile connection and won't auto-reconnect until power cycle

I’ve inherited support for a bunch of Windows 10 industrial tablets which have embedded MC7304 modems in them. These units are installed onboard trains which I don’t have easy physical access to.

The issue I have is that sometimes a unit boots and the modem never connects, and sometimes it boots, connects OK, then at some point loses connectivity and never reconnects. Most of the time, though, the units work as expected (connect, auto-reconnect if the connection is lost, etc).

From what I can tell the units have always had this issue. Previously the units would just store their data locally while offline, and the next time they were power cycled (eg by a user noticing they weren’t connected) the modem would (hopefully) reconnect and the unit would upload/download its data. However, a feature has been added to the software on the units which requires more real-time data exchange with the devices and therefore more constant connectivity, so having to power cycle the devices to restore connectivity is now an issue.

I’ve put a script onto the devices which periodically runs netsh mbn show interfaces, pings our server, and does some DNS lookups to test connectivity. During these disconnected periods netsh shows the modem as ‘Connected’ with a good signal strength, but the pings and DNS lookups all fail. It looks like the modem gets itself into a state where it thinks it is connected but actually isn’t.
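A periodic check along these lines could be sketched in Python as below. This is only an illustration, not the script actually running on the tablets: the function names are made up here, the parser simply assumes netsh prints 'Key : Value' lines, and the ping options are the standard Windows ones (-n count, -w timeout in ms).

```python
import socket
import subprocess


def parse_key_values(text):
    """Parse 'Key : Value' lines (the layout netsh is assumed to use) into a dict."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(" : ")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def modem_fields():
    """Run 'netsh mbn show interfaces' and return its fields as a dict."""
    result = subprocess.run(
        ["netsh", "mbn", "show", "interfaces"],
        capture_output=True, text=True,
    )
    return parse_key_values(result.stdout)


def ping_ok(address, timeout_ms=2000):
    """Send one ping; require a real echo reply, not just exit code 0."""
    result = subprocess.run(
        ["ping", "-n", "1", "-w", str(timeout_ms), address],
        capture_output=True,
    )
    # Windows ping can exit 0 on 'Destination host unreachable',
    # so also look for the TTL= marker of a genuine reply.
    return result.returncode == 0 and b"TTL=" in result.stdout


def dns_ok(hostname):
    """Return True if the hostname resolves."""
    try:
        socket.gethostbyname(hostname)
        return True
    except OSError:
        return False


def classify(modem_state, ping_passed, dns_passed):
    """Combine the three checks into one coarse status string."""
    if ping_passed or dns_passed:
        return "online"
    if modem_state == "Connected":
        # The case described above: modem claims a connection, no traffic passes.
        return "connected-but-no-traffic"
    return "disconnected"
```

A scheduled task could call classify(modem_fields().get("State"), ping_ok("8.8.8.8"), dns_ok("example.com")) and append the result to a log file; the host name here is a placeholder.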

I realize these modems are EOL, but I was hoping that someone here might recognize this issue and tell me “change this setting and it will all come good”.

I found some old forum posts which sound very similar to my issue
https://forum.sierrawireless.com/t/mc7304-driving-me-nuts/9343
https://forum.sierrawireless.com/t/mc7304-on-windows-8-1-embedded/7820
https://forum.sierrawireless.com/t/mc7304-suddently-stops-working/7836
but none of these posts end with a solution.

The devices were on firmware SWI9X15C_05.05.58.00 9904567 05, but I saw that SWI9X15C_05.05.78.00 was available so I’ve tried that (from https://source.sierrawireless.com/resources/airprime/software/airprime-em73xx_mc73xx-fw-package-build-4837/).

I also updated to the latest driver I could find on the web site (https://source.sierrawireless.com/resources/airprime/software/airprime-em_mc-series-windows-drivers-qmi-build-5087/).

This has improved things - in that I’d say the frequency of the modem getting itself into a permanently disconnected state until a power cycle has halved, but it is still happening multiple times a day across the fleet. Any one modem might go days without having the issue though. Unfortunately, when the modem goes into this state I cannot connect remotely to do in-place diagnostics.

If there isn’t a solution that someone knows, are there diagnostic tools people can suggest that may allow me to troubleshoot the issue? The modems appear to be put into MBIM mode by the Windows drivers, so I cannot use standard AT commands to ask the modems what is happening. Most of my probing has therefore been via netsh.

Try this and see if the AT command port is enumerated:

  1. Download GenericDriverSetup_5087.exe from the SOURCE website:
    https://source.sierrawireless.com/resources/airprime/software/airprime-em_mc-series-windows-drivers-qmi-build-5087/#sthash.YyNMMlJJ.dpbs

  2. Open a folder and put GenericDriverSetup_5087.exe and Configuration.ini in it together.

  3. Open Configuration.ini and add the following:

[Default Values]
CHECKFORDEVICE=0
LOCATIONDRIVER=1
WIN8LOCATION=1
USBCOMP=08

Other users have been able to make this work.

Thanks for that. We did try this but the COM port didn’t enumerate. We’ve tried again on our bench device and found that the device appears in a different place in the device manager to what we were expecting, but we now have the AT port.

We’ll set up a script to poll AT!GSTATUS? to see if it shows anything interesting during the disconnect periods, unless someone can give us a setting to change or a better place to look for the issue.

You might also check whether restarting the telecom stack lets the module register on the network:

AT+CFUN=0
AT+CFUN=1
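
If this gets scripted, the two commands could be sent over the AT port roughly as in the Python sketch below. It is an assumption-laden illustration: send_at and cycle_radio are hypothetical helper names, the port object only needs write() and readline() (for example a pyserial Serial object opened with a read timeout), and the 5 second pause between the commands is a guess rather than a documented requirement.

```python
import time


def send_at(port, command, max_lines=50):
    """Send one AT command and collect response lines until OK or an error."""
    port.write((command + "\r").encode("ascii"))
    lines = []
    for _ in range(max_lines):
        raw = port.readline()
        if not raw:
            # Read timed out or no more data: give up on this command.
            break
        line = raw.decode("ascii", errors="replace").strip()
        if not line:
            continue
        lines.append(line)
        if line == "OK" or "ERROR" in line:
            break
    return lines


def cycle_radio(port, settle_seconds=5.0):
    """Take the radio offline with AT+CFUN=0, wait, bring it back with AT+CFUN=1.

    Returns True only if both commands ended with OK.
    """
    off = send_at(port, "AT+CFUN=0")
    time.sleep(settle_seconds)  # assumed pause, not a documented value
    on = send_at(port, "AT+CFUN=1")
    return bool(off) and off[-1] == "OK" and bool(on) and on[-1] == "OK"
```

With pyserial installed, port could be something like serial.Serial("COM5", 115200, timeout=5), where the COM port name is whatever Device Manager shows for the AT port.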

I was hoping for a setting/recommendation that prevented the issue in the first place.

We have thought about doing AT+CFUN or AT!RESET or netsh mbn set powerstate=off then on to try and recover, but all these require us to first detect the modem is “locked up” and won’t ever reconnect vs just going through a tunnel or coverage blackspot and thus temporarily lost connection. We can try and wait a while to see if the modem reconnects, but ideally we’d get the modem to just auto-reconnect.

If this is not possible, we want a value we can poll for in the modem that tells us it is locked up, then we can try and reset it.

What does AT!GSTATUS? return when the module cannot register on the network?

I don’t know yet. I only just got AT access to the bench device!

I’ll have to remotely reinstall the driver with USBCOMP=08 on a field unit, periodically log the GSTATUS data to a file, wait until the modem gets into this unresponsive state, then wait for a user to power cycle the unit so it gets connectivity back (although we are contemplating just force-restarting the device at 2am while we investigate). Then I can remotely access the file, have a look at what happened, and hope that it shows something useful. That may take a while, assuming the device we pick hits this case reasonably soon after we add all the logging.

If you have any idea of what the trigger could be that puts the modem into this state, I could try and create that situation on the bench device. I’ve been unable to simulate the situation myself - so I have to use the field units to work out what is going on. I’m not on a train, bouncing around and moving through tunnels though…

I think you need a workaround, such as a script that checks the registration state and, if the module cannot register on the network for a particular period of time, resets the telecom stack.

We managed to get the driver updated and scripting installed yesterday and managed to capture an outage soon afterwards followed by getting a power reset done.

To cut the log volume, we only started logging once our ping script reported a lost connection, and then grabbed GSTATUS every 2 minutes. Here are 2 GSTATUS outputs. The first is 4 seconds after we detected a connection loss. The second is 14 minutes into the outage.

!GSTATUS: 
Current Time:  16908		Temperature: 25
Bootup Time:   0		Mode:        ONLINE         
System mode:   LTE        	PS state:    Attached     
LTE band:      B1     		LTE bw:      15 MHz  
LTE Rx chan:   522		LTE Tx chan: 24225
EMM state:     Registered     	No Cell        
RRC state:     RRC Idle       
IMS reg state: No Srv  		

RSSI (dBm):    -60		Tx Power:    0
RSRP (dBm):    -88		TAC:         29E3 (10723)
RSRQ (dB):     -9		Cell ID:     0042F613 (4388371)
SINR (dB):     -1.8


OK

!GSTATUS: 
Current Time:  17748		Temperature: 24
Bootup Time:   0		Mode:        ONLINE         
System mode:   LTE        	PS state:    Attached     
LTE band:      B3     		LTE bw:      20 MHz  
LTE Rx chan:   1617		LTE Tx chan: 24225
EMM state:     Registered     	No Cell        
RRC state:     RRC Idle       
IMS reg state: No Srv  		

RSSI (dBm):    -53		Tx Power:    0
RSRP (dBm):    -105		TAC:         29E3 (10723)
RSRQ (dB):     -20		Cell ID:     00332902 (3352834)
SINR (dB):     -13.2


OK
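
For comparing dumps like these across a long log, it can help to turn each one into a dictionary. The Python sketch below is one way to do that, assuming the layout shown above: 'Label: value' pairs, with the two columns on a line separated by a tab or at least two spaces. parse_gstatus is a hypothetical helper name, not part of any Sierra Wireless tool.

```python
import re

# A label starts with a letter and may contain letters, digits, parentheses
# and single spaces (never a run of two spaces), followed by a colon.
_LABEL = re.compile(r"([A-Za-z](?:[A-Za-z0-9()]| (?! ))*):")


def parse_gstatus(text):
    """Turn an AT!GSTATUS? response into a {label: value} dict of strings."""
    fields = {}
    for line in text.splitlines():
        stripped = line.strip()
        # Skip the '!GSTATUS:' header, the final 'OK' and blank lines.
        if not stripped or stripped.startswith("!") or stripped == "OK":
            continue
        matches = list(_LABEL.finditer(line))
        for i, match in enumerate(matches):
            # A value runs from the end of its label to the next label (or end of line).
            end = matches[i + 1].start() if i + 1 < len(matches) else len(line)
            value = " ".join(line[match.end():end].split())
            fields[match.group(1).strip()] = value
    return fields
```

Values are kept as strings, and the 'EMM state' line, which carries two values, is collapsed into one string such as 'Registered No Cell'.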

We’ve analyzed the data a bit more. The train is moving along the tracks and appears to lose connectivity. At time 16908 (which is immediately after we detected a disconnect) the SINR is poor.

Current Time:  16908		Temperature: 25
Bootup Time:   0		Mode:        ONLINE         
System mode:   LTE        	PS state:    Attached     
LTE band:      B1     		LTE bw:      15 MHz  
LTE Rx chan:   522		LTE Tx chan: 24225
EMM state:     Registered     	No Cell        
RRC state:     RRC Idle       
IMS reg state: No Srv  		

RSSI (dBm):    -60		Tx Power:    0
RSRP (dBm):    -88		TAC:         29E3 (10723)
RSRQ (dB):     -9		Cell ID:     0042F613 (4388371)
SINR (dB):     -1.8

At time 17028 it has changed to a new band and a different cell, but the SINR is still not great.

Current Time:  17028		Temperature: 24
Bootup Time:   0		Mode:        ONLINE         
System mode:   LTE        	PS state:    Attached     
LTE band:      B3     		LTE bw:      20 MHz  
LTE Rx chan:   1617		LTE Tx chan: 24225
EMM state:     Registered     	No Cell        
RRC state:     RRC Idle       
IMS reg state: No Srv  		

RSSI (dBm):    -58		Tx Power:    0
RSRP (dBm):    -91		TAC:         29E3 (10723)
RSRQ (dB):     -14		Cell ID:     00780901 (7866625)
SINR (dB):      0.6

At 17148 it is still on B3 but on a new cell, and the SINR seems extremely bad.

Current Time:  17148		Temperature: 26
Bootup Time:   0		Mode:        ONLINE         
System mode:   LTE        	PS state:    Attached     
LTE band:      B3     		LTE bw:      20 MHz  
LTE Rx chan:   1617		LTE Tx chan: 24225
EMM state:     Registered     	No Cell        
RRC state:     RRC Idle       
IMS reg state: No Srv  		

RSSI (dBm):    -53		Tx Power:    0
RSRP (dBm):    -105		TAC:         29E3 (10723)
RSRQ (dB):     -20		Cell ID:     00332902 (3352834)
SINR (dB):     -13.2

For the next 7 hours the modem reports essentially identical information. The cell doesn’t change and the signal levels don’t change. Only time and temperature change. The modem appears to be ‘locked up’, even though the train is still moving.

!GSTATUS: 
Current Time:  42228		Temperature: 28
Bootup Time:   0		Mode:        ONLINE         
System mode:   LTE        	PS state:    Attached     
LTE band:      B3     		LTE bw:      20 MHz  
LTE Rx chan:   1617		LTE Tx chan: 24225
EMM state:     Registered     	No Cell        
RRC state:     RRC Idle       
IMS reg state: No Srv  		

RSSI (dBm):    -53		Tx Power:    0
RSRP (dBm):    -105		TAC:         29E3 (10723)
RSRQ (dB):     -20		Cell ID:     00332902 (3352834)
SINR (dB):     -13.2

But it shows the network as attached…
Do you mean you cannot establish a data connection?
If so, have you tried resetting the telecom stack with AT+CFUN=0 and AT+CFUN=1?

Yes, it says attached, but it also says ‘RRC Idle’ and ‘No Cell’ and the Cell ID, SINR, RSSI etc never change for 7 hours even though the device is moving.

From 16908 until the machine was rebooted just after time 42228 all data requests failed. We are using ping 8.8.8.8 as the primary detection, but all HTTP, TCP etc connections dropped at time 16908 and all reported failures for the next 7 hours.

We are setting up a test today to try different types of ‘reset’ to see what it takes to unlock it.

But I’m still keen to know if you (or anyone else) have seen this situation before and whether there is a setting we could potentially set to prevent it getting into this state in the first place.

Can it send an SMS in that state?

I doubt the SIMs have an SMSC configured or a plan with SMS enabled (I don’t supply the SIMs - the train operators have to source the SIMs for their carriers) or allow international SMS (I’m not in the same country as most of the devices), but I can try if you give me the command to use. We only use the modems for data so that is all we care about.

AT+CMGF=1
AT+CMGS="1234567_your_phone_number"
>test
(here type CTRL+Z to send the SMS)

Or you can send an SMS to the SIM card in this module and see if the module can receive it.

As we don’t have remote connectivity to the device when the modem is broken, I cannot run these AT commands manually. We’ve not yet scripted them, and I have not yet got permission from the owner of the SIM to do it automatically.

However, we did update our script to do a CFUN cycle when the ping fails 3 times in a row, and that does seem to get it to reconnect successfully: with this in place, no outage on the devices we were monitoring lasted longer than 20 seconds (as opposed to the 7 hour outage yesterday).
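
The "3 failed pings in a row, then CFUN cycle" rule can be kept separate from the ping and AT plumbing, which makes it easy to test on the bench. A minimal Python sketch follows, with PingWatchdog and reset_modem as hypothetical names; our actual script may be structured differently.

```python
class PingWatchdog:
    """Count consecutive ping failures and trigger a modem reset at a threshold."""

    def __init__(self, reset_modem, threshold=3):
        self.reset_modem = reset_modem  # callable that performs the CFUN cycle
        self.threshold = threshold
        self.failures = 0

    def record(self, ping_succeeded):
        """Feed in one ping result; return True if this call triggered a reset."""
        if ping_succeeded:
            # Any success clears the run of failures.
            self.failures = 0
            return False
        self.failures += 1
        if self.failures >= self.threshold:
            self.reset_modem()
            # Start counting again so the modem gets time to re-register.
            self.failures = 0
            return True
        return False
```

The monitoring loop would call record() once per ping attempt. A short tunnel that costs one or two pings never reaches the threshold, so it does not trigger a reset.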

So if we have proved the radio is getting stuck, is there anything we can do to prevent it getting in this state?

Or are we going to have to poll GSTATUS, parse the output, see if the cellid fields stop changing value, and when they do, do a CFUN cycle?
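
If it does come to that, the "fields stop changing" test could look something like the Python sketch below. It assumes each GSTATUS poll has already been parsed into a dict keyed by the labels in the dumps above. StuckDetector, the choice of fields and the limit of 3 identical repeats are all illustrative guesses, not values taken from any Sierra Wireless documentation.

```python
# Fields that were frozen for 7 hours in the captured outage.
RADIO_KEYS = ("Cell ID", "RSSI (dBm)", "RSRP (dBm)", "RSRQ (dB)", "SINR (dB)")


class StuckDetector:
    """Flag the modem as stuck when its radio readings repeat exactly."""

    def __init__(self, limit=3, keys=RADIO_KEYS):
        self.limit = limit      # identical repeats needed before flagging
        self.keys = keys
        self.last = None
        self.repeats = 0

    def update(self, snapshot):
        """Feed in one parsed GSTATUS dict; return True once it looks stuck."""
        fingerprint = tuple(snapshot.get(key) for key in self.keys)
        if fingerprint == self.last:
            self.repeats += 1
        else:
            self.last = fingerprint
            self.repeats = 0
        return self.repeats >= self.limit
```

With polls every 2 minutes, a limit of 3 means 6 minutes of identical readings after the first one. It is probably safest to act on this only while pings are also failing, since a unit parked in steady coverage might repeat readings for innocent reasons.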

No, I don’t have any other ideas about why it gets into this state.
You will probably need to implement your workaround.