[openib-general] Loading / unloading IB modules

Mon Mar 29 08:53:10 PST 2004

OK, here's the output of going through the list and rmmod'ing the modules one by one:

# oib9 /root > ifdown ib0
# oib9 /root > lsmod
Module                  Size  Used by    Not tainted
nfs                    98384   3  (autoclean)
lockd                  58832   1  (autoclean) [nfs]
sunrpc                 96348   1  (autoclean) [nfs lockd]
ib_useraccess          10180   0  (unused)
ib_useraccess_cm       15808   0  (unused)
ib_cm                  46680   0  [ib_useraccess_cm]
ib_udapl               40920   0  (unused)
ib_ip2pr               29500   0  [ib_useraccess_cm ib_udapl]
ib_ipoib               58380   0  [ib_udapl ib_ip2pr]
ib_sa_client           29076   0  [ib_udapl ib_ip2pr ib_ipoib]
ib_client_query        12736   0  [ib_udapl ib_ip2pr ib_ipoib ib_sa_client]
ib_tavor               23940   0  (autoclean) [ib_useraccess_cm]
mod_vapi              129892   0  (autoclean) [ib_useraccess_cm ib_udapl 
ib_tavor]
mod_vipkl             219360   0  (autoclean) [mod_vapi]
mod_thh               257920   0  (autoclean) [mod_vapi]
mod_hh                 15608   0  (autoclean) [mod_vipkl mod_thh]
mod_mpga               23584   0  (autoclean) [mod_vapi]
mod_vapi_common        65276   0  (autoclean) [ib_useraccess_cm ib_udapl 
ib_tavor mod_vapi mod_vipkl mod_thh]
mosal                 110053   0  (autoclean) [mod_vapi mod_vipkl mod_thh 
mod_mpga mod_vapi_common]
ib_mad                 21068   0  [ib_useraccess ib_cm ib_client_query]
ib_poll                14616   0  [ib_cm ib_ip2pr ib_client_query]
ib_core                47636   0  [ib_useraccess ib_useraccess_cm ib_cm 
ib_udapl ib_ip2pr ib_ipoib ib_sa_client ib_tavor ib_mad]
ib_packet_lib         147024   0  [ib_mad ib_core]
ib_services            16932   0  [ib_useraccess ib_useraccess_cm ib_cm 
ib_udapl ib_ip2pr ib_ipoib ib_sa_client ib_client_query ib_tavor ib_mad 
ib_poll ib_core ib_packet_lib]
e1000                  74208   1 
microcode               5472   0  (autoclean)
# oib9 /root > rmmod ib_useraccess
# oib9 /root > rmmod ib_useraccess_cm
# oib9 /root > rmmod ib_cm           
# oib9 /root > rmmod ib_udapl
# oib9 /root > rmmod ib_ip2pr
# oib9 /root > rmmod ib_ipoib
# oib9 /root > rmmod ib_sa_client
# oib9 /root > rmmod ib_client_query
# oib9 /root > rmmod ib_tavor       
# oib9 /root >

At this point, the console shows the following (the system is still running):

VAPI(4): hobul.c[559]: HOBUL_delete: Invoked while 0x7 resources are still 
allocated
 VAPI(2): EVAPI_release_hca_hndl HOBUL_delete failed return: Resource is busy
[KERNEL_IB][tsIbTavorQpDestroy][tavor_qp.c:562]InfiniHost0: VAPI_destroy_qp 
failed, return code = -244 (Invalid HCA Handle.)
[KERNEL_IB][tsIbTavorQpDestroy][tavor_qp.c:562]InfiniHost0: VAPI_destroy_qp 
failed, return code = -244 (Invalid HCA Handle.)
[KERNEL_IB][tsIbTavorQpDestroy][tavor_qp.c:562]InfiniHost0: VAPI_destroy_qp 
failed, return code = -244 (Invalid HCA Handle.)
[KERNEL_IB][tsIbTavorQpDestroy][tavor_qp.c:562]InfiniHost0: VAPI_destroy_qp 
failed, return code = -244 (Invalid HCA Handle.)
[KERNEL_IB][tsIbTavorMemoryDeregister][tavor_mr.c:126]InfiniHost0: 
VAPI_deregister_mr failed, return code = -244 (Invalid HCA Handle.)
[KERNEL_IB][tsIbTavorCqDestroy][tavor_cq.c:124]InfiniHost0: 
EVAPI_clear_comp_eventh failed, return code = -244 (Invalid HCA Handle.)
[KERNEL_IB][tsIbTavorCqDestroy][tavor_cq.c:131]InfiniHost0: VAPI_destroy_cq 
failed, return code = -244 (Invalid HCA Handle.)
[KERNEL_IB][tsIbTavorPdDestroy][tavor_pd.c:76]InfiniHost0: VAPI_dealloc_pd 
failed, return code = -244 (Invalid HCA Handle.)

Now, back to the rmmod'ing:

# oib9 /root > rmmod mod_vapi
# oib9 /root > rmmod mod_vipkl
# oib9 /root >

At this point (still no crash) I see on the console:

 VIPKL(1): em.c[88]: EM delete:found unreleased async object
 VIPKL(1): qpm.c[156]: QPM delete: found unreleased qp in array 

 VIPKL(1): qpm.c[156]: QPM delete: found unreleased qp in array 

 VIPKL(1): qpm.c[156]: QPM delete: found unreleased qp in array 

 VIPKL(1): qpm.c[156]: QPM delete: found unreleased qp in array 

 VIPKL(1): cqm.c[62]: CQM delete:found unreleased cq
 VIPKL(1): mmu.c[97]: MM delete:found unreleased mr
 VIPKL(1): pdm.c[44]: PDM delete:found unreleased pd
 THH(1): tmrwm.c[1384]: found unreleased internal mr!!!!

 THH(1): tmrwm.c[1384]: found unreleased internal mr!!!!

 THH(1): tmrwm.c[1384]: found unreleased internal mr!!!!

 THH(1): tmrwm.c[1384]: found unreleased internal mr!!!!

 THH(1): tmrwm.c[1384]: found unreleased internal mr!!!!

 THH(1): tmrwm.c[1409]: found unreleased mr!!!!

More rmmoding:

# oib9 /root > rmmod mod_thh  
# oib9 /root > rmmod mod_hh 
# oib9 /root > rmmod mod_mpga
# oib9 /root > rmmod mod_vapi_common
# oib9 /root > rmmod mosal          
# oib9 /root > rmmod ib_mad
# oib9 /root > rmmod ib_poll
# oib9 /root > rmmod ib_core
# oib9 /root > rmmod ib_packet_lib
# oib9 /root > rmmod ib_services  
# oib9 /root > lsmod
Module                  Size  Used by    Not tainted
nfs                    98384   3  (autoclean)
lockd                  58832   1  (autoclean) [nfs]
sunrpc                 96348   1  (autoclean) [nfs lockd]
e1000                  74208   1 
microcode               5472   0  (autoclean)
# oib9 /root >

After the rmmod of mod_thh, I see the following on the console:

 THH(1): thh_mod_obj.c[378]: cleanup_module: destroying InfiniHost0
THH kernel module removed successfully

Interestingly, the system still ran (without a problem).  So, I rebooted and 
tried unloading the modules as a group with "modprobe -r" instead:

# oib9 /root > ifdown ib0
# oib9 /root > lsmod
Module                  Size  Used by    Not tainted
nfs                    98384   3  (autoclean)
lockd                  58832   1  (autoclean) [nfs]
sunrpc                 96348   1  (autoclean) [nfs lockd]
ib_useraccess          10180   0  (unused)
ib_useraccess_cm       15808   0  (unused)
ib_cm                  46680   0  [ib_useraccess_cm]
ib_udapl               40920   0  (unused)
ib_ip2pr               29500   0  [ib_useraccess_cm ib_udapl]
ib_ipoib               58380   0  [ib_udapl ib_ip2pr]
ib_sa_client           29076   0  [ib_udapl ib_ip2pr ib_ipoib]
ib_client_query        12736   0  [ib_udapl ib_ip2pr ib_ipoib ib_sa_client]
ib_tavor               23940   0  (autoclean) [ib_useraccess_cm]
mod_vapi              129892   0  (autoclean) [ib_useraccess_cm ib_udapl 
ib_tavor]
mod_vipkl             219360   0  (autoclean) [mod_vapi]
mod_thh               257920   0  (autoclean) [mod_vapi]
mod_hh                 15608   0  (autoclean) [mod_vipkl mod_thh]
mod_mpga               23584   0  (autoclean) [mod_vapi]
mod_vapi_common        65276   0  (autoclean) [ib_useraccess_cm ib_udapl 
ib_tavor mod_vapi mod_vipkl mod_thh]
mosal                 110053   0  (autoclean) [mod_vapi mod_vipkl mod_thh 
mod_mpga mod_vapi_common]
ib_mad                 21068   0  [ib_useraccess ib_cm ib_client_query]
ib_poll                14616   0  [ib_cm ib_ip2pr ib_client_query]
ib_core                47636   0  [ib_useraccess ib_useraccess_cm ib_cm 
ib_udapl ib_ip2pr ib_ipoib ib_sa_client ib_tavor ib_mad]
ib_packet_lib         147024   0  [ib_mad ib_core]
ib_services            16932   0  [ib_useraccess ib_useraccess_cm ib_cm 
ib_udapl ib_ip2pr ib_ipoib ib_sa_client ib_client_query ib_tavor ib_mad 
ib_poll ib_core ib_packet_lib]
e1000                  74208   1 
microcode               5472   0  (autoclean)
# oib9 /root > modprobe -r ib_udapl

At this point, on the console, I see the same messages as before, up to the 
point where mod_thh is removed, but then:

Unable to handle kernel paging request at virtual address f8ccb183
 printing eip:
f8ccb183
*pde = 376d4067
*pte = 00000000
Oops: 0000
nfs lockd sunrpc ib_ipoib ib_sa_client ib_client_query mod_vapi_common mosal 
ib_mad ib_poll ib_core ib_packet_lib ib_services e1000 microcode  
CPU:    0
EIP:    0060:[<f8ccb183>]    Not tainted
EFLAGS: 00010282

EIP is at __insmod_mod_vapi_common_S.data_L4 [mod_vapi_common] 0xc928b 
(2.4.21-9.EL-IB_patches/i686)
eax: c483e580   ebx: f7984200   ecx: f7983000   edx: 00000000
esi: f8ccb183   edi: f7b3c5b8   ebp: f63e3edc   esp: f63e3ea8
ds: 0068   es: 0068   ss: 0068
Process modprobe (pid: 792, stackpage=f63e3000)
Stack: f8ba16e0 f7984200 00000282 f63e3ed8 00000282 f7b3c408 f7b3c49c 00000282 
       00000015 f7b3c408 f7b3c408 f7b3c400 f7b3c408 f63e3f0c f906eff7 f7984200 
       f9070147 00000015 00000003 00200000 f907012a f7b3c408 f7b3c408 f7b3c408 
Call Trace:   [<c010cd9d>] show_stack [kernel] 0x6c
[<c010cf31>] show_registers [kernel] 0x169
[<c010d0b2>] die [kernel] 0x8a
[<c011cdbc>] do_page_fault [kernel] 0x2e4

Code: Bad EIP value.

Kernel panic: Fatal exception
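
A side note on reading the "EIP is at" line above, offered as an interpretation
rather than a diagnosis.  The oops appears to name the nearest preceding symbol
it knows about, so the offset is worth comparing against the module size from
lsmod.  The sketch below does that arithmetic; the helper name is hypothetical,
and it assumes the lsmod "Size" column is in bytes.

```python
def offset_can_be_inside(symbol_offset, module_size):
    """Return True if an address at symbol_offset past a symbol of a module
    could still lie inside that module.

    Any address inside the module is less than module_size bytes past any
    symbol of the module, so a larger offset rules the module out.
    """
    return 0 <= symbol_offset < module_size


# Values taken from the oops and the lsmod listing in this message.
offset = 0xc928b   # "[mod_vapi_common] 0xc928b"
size = 65276       # mod_vapi_common size reported by lsmod (assumed bytes)

print("offset:", offset, "module size:", size)
print("could be inside mod_vapi_common:", offset_can_be_inside(offset, size))
```

If the check comes back false, the faulting address is well past the end of
mod_vapi_common, which would be consistent with a jump into memory that
belonged to a module that had already been removed ("*pte = 00000000",
"Bad EIP value").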

This is pretty reproducible for me.  Basic sequence:

0) Reboot with the modules.conf that I sent out on Friday (result is full boat 
of modules, and ipoib).
1) ifdown ib0
2) modprobe -r ib_udapl

The order seems to matter, because the following sequence fails in a 
different way:

0) Reboot
1) modprobe -r ib_udapl
2) ifdown ib0

This way, I see the following:

# oib9 /root > modprobe -r ib_udapl
# oib9 /root >

On the console are the same messages, all the way up to the unloading of 
mod_thh, but then the following is spewed slowly:

[KERNEL_IB][ib_cached_sm_path_get_Rsmp_93f76742][core_cache.c:84]Bad magic 
0x2000000 at f7630c80 for DEVICE
[KERNEL_IB][ib_cached_gid_get_Rsmp_efcd7458][core_cache.c:137]Bad magic 
0x2000000 at f7630c80 for DEVICE
[KERNEL_IB][ib_mad_send_Rsmp_43faef95][mad_ib.c:132]Bad magic 0x2000000 at 
f7630c80 for DEVICE

Back to the command line:

# oib9 /root > lsmod
Module                  Size  Used by    Not tainted
nfs                    98384   3  (autoclean)
lockd                  58832   1  (autoclean) [nfs]
sunrpc                 96348   1  (autoclean) [nfs lockd]
ib_ipoib               58380   1 
ib_sa_client           29076   0  [ib_ipoib]
ib_client_query        12736   0  [ib_ipoib ib_sa_client]
ib_mad                 21068   0  [ib_client_query]
ib_poll                14616   0  [ib_client_query]
ib_core                47636   0  [ib_ipoib ib_sa_client ib_mad]
ib_packet_lib         147024   0  [ib_mad ib_core]
ib_services            16932   0  [ib_ipoib ib_sa_client ib_client_query 
ib_mad ib_poll ib_core ib_packet_lib]
e1000                  74208   1 
microcode               5472   0  (autoclean)
# oib9 /root >

That's pretty scary: "modprobe -r" of ib_udapl seems to unload all of the 
Mellanox modules as well, even though ib_ipoib is still loaded and in use.  
Therefore, ipoib is completely hosed.  So, if I attempt to ifdown ib0:

Unable to handle kernel paging request at virtual address f8cca343
 printing eip:
f8cca343
*pde = 3747f067
*pte = 00000000
Oops: 0000
nfs lockd sunrpc ib_ipoib ib_sa_client ib_client_query ib_mad ib_poll ib_core 
ib_packet_lib ib_services e1000 microcode  
CPU:    0
EIP:    0060:[<f8cca343>]    Not tainted
EFLAGS: 00010246

EIP is at __insmod_ib_mad_S.data_L288 [ib_mad] 0x110223 
(2.4.21-9.EL-IB_patches/i686)
eax: f72c33cc   ebx: f72c33c0   ecx: f74ba400   edx: f726cb80
esi: f74ba408   edi: 00001002   ebp: f624de44   esp: f624de10
ds: 0068   es: 0068   ss: 0068
Process ip (pid: 777, stackpage=f624d000)
Stack: f8ba0ca9 f72c33c0 00000002 f90703fc f74ba408 00000800 00000800 f624de50 
       00000100 00000297 f74ba400 f74ba400 f74ba408 f624de74 f906ecf9 f72c33c0 
       f624de88 f90683a4 f74ba408 f74ba49c 00000286 f74ba408 f74ba52c f72c33c0 
Call Trace:   [<c010cd9d>] show_stack [kernel] 0x6c
[<c010cf31>] show_registers [kernel] 0x169
[<c010d0b2>] die [kernel] 0x8a
[<c011cdbc>] do_page_fault [kernel] 0x2e4

Code: Bad EIP value.

Kernel panic: Fatal exception

Quite strange behaviour.  Hopefully you've made it all the way to the bottom 
of this email...
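
For anyone who wants to repeat the one-by-one test without typing the list by
hand, here is a minimal sketch that derives an rmmod order from 2.4-style
lsmod output: a module is only listed after every module that references it.
The function names and the prefix filter (ib_, mod_, mosal) are my own
assumptions for illustration, and the sample is a short excerpt of the listing
above.  It only prints the commands; it does not run them.

```python
import re

SAMPLE = """\
Module                  Size  Used by    Not tainted
ib_useraccess_cm       15808   0  (unused)
ib_cm                  46680   0  [ib_useraccess_cm]
ib_udapl               40920   0  (unused)
ib_tavor               23940   0  (autoclean) [ib_useraccess_cm]
mod_vapi              129892   0  (autoclean) [ib_useraccess_cm ib_udapl
ib_tavor]
e1000                  74208   1
"""


def parse_lsmod(text):
    """Map each module to the list of modules shown in its [..] column."""
    users = {}
    pending = ""
    for line in text.splitlines():
        if line.startswith("Module") or not line.strip():
            continue  # skip the header and blank lines
        pending = (pending + " " + line).strip()
        if pending.count("[") > pending.count("]"):
            continue  # the bracketed list wrapped onto the next line
        match = re.search(r"\[(.*)\]", pending)
        users[pending.split()[0]] = match.group(1).split() if match else []
        pending = ""
    return users


def is_ib_module(name):
    """Assumed filter: only the IB stack modules seen in this thread."""
    return name.startswith(("ib_", "mod_")) or name == "mosal"


def unload_order(users, wanted):
    """Order the wanted modules so each one follows all modules that use it."""
    remaining = [m for m in users if wanted(m)]
    order = []
    while remaining:
        for module in remaining:
            if not any(u in remaining for u in users[module]):
                break  # nothing still loaded references this module
        else:
            raise ValueError("circular references among: " + " ".join(remaining))
        remaining.remove(module)
        order.append(module)
    return order


if __name__ == "__main__":
    for module in unload_order(parse_lsmod(SAMPLE), is_ib_module):
        print("rmmod", module)
```

Feeding it the full lsmod output instead of SAMPLE should give a complete
list; the use count column is ignored here, so it will not notice a module
that is busy for some other reason (an interface that is still up, for
example).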

On Friday 26 March 2004 14:16, Roland Dreier wrote:
> Thanks for the information.  Which module triggers the problems on
> unloading?  In other words, if you go through and start unloading
> modules in the order they appear in lsmod, when do you get the crash?
>
>  - Roland

-- 
(((((((((((((((((((((((((((((((((())))))))))))))))))))))))))))))))))
 Makia Minich                 The memories of a man in his old age,
                   are the deeds of a man in his Prime.
 makia at llnl.gov                                    -- Waters/Wright
(((((((((((((((((((((((((((((((((())))))))))))))))))))))))))))))))))
