[openib-general] Loading / unloading IB modules
Makia Minich
Mon Mar 29 08:53:10 PST 2004
OK, here's the output of going through the list and rmmod'ing the modules one by one:
# oib9 /root > ifdown ib0
# oib9 /root > lsmod
Module Size Used by Not tainted
nfs 98384 3 (autoclean)
lockd 58832 1 (autoclean) [nfs]
sunrpc 96348 1 (autoclean) [nfs lockd]
ib_useraccess 10180 0 (unused)
ib_useraccess_cm 15808 0 (unused)
ib_cm 46680 0 [ib_useraccess_cm]
ib_udapl 40920 0 (unused)
ib_ip2pr 29500 0 [ib_useraccess_cm ib_udapl]
ib_ipoib 58380 0 [ib_udapl ib_ip2pr]
ib_sa_client 29076 0 [ib_udapl ib_ip2pr ib_ipoib]
ib_client_query 12736 0 [ib_udapl ib_ip2pr ib_ipoib ib_sa_client]
ib_tavor 23940 0 (autoclean) [ib_useraccess_cm]
mod_vapi 129892 0 (autoclean) [ib_useraccess_cm ib_udapl
ib_tavor]
mod_vipkl 219360 0 (autoclean) [mod_vapi]
mod_thh 257920 0 (autoclean) [mod_vapi]
mod_hh 15608 0 (autoclean) [mod_vipkl mod_thh]
mod_mpga 23584 0 (autoclean) [mod_vapi]
mod_vapi_common 65276 0 (autoclean) [ib_useraccess_cm ib_udapl
ib_tavor mod_vapi mod_vipkl mod_thh]
mosal 110053 0 (autoclean) [mod_vapi mod_vipkl mod_thh
mod_mpga mod_vapi_common]
ib_mad 21068 0 [ib_useraccess ib_cm ib_client_query]
ib_poll 14616 0 [ib_cm ib_ip2pr ib_client_query]
ib_core 47636 0 [ib_useraccess ib_useraccess_cm ib_cm
ib_udapl ib_ip2pr ib_ipoib ib_sa_client ib_tavor ib_mad]
ib_packet_lib 147024 0 [ib_mad ib_core]
ib_services 16932 0 [ib_useraccess ib_useraccess_cm ib_cm
ib_udapl ib_ip2pr ib_ipoib ib_sa_client ib_client_query ib_tavor ib_mad
ib_poll ib_core ib_packet_lib]
e1000 74208 1
microcode 5472 0 (autoclean)
# oib9 /root > rmmod ib_useraccess
# oib9 /root > rmmod ib_useraccess_cm
# oib9 /root > rmmod ib_cm
# oib9 /root > rmmod ib_udapl
# oib9 /root > rmmod ib_ip2pr
# oib9 /root > rmmod ib_ipoib
# oib9 /root > rmmod ib_sa_client
# oib9 /root > rmmod ib_client_query
# oib9 /root > rmmod ib_tavor
# oib9 /root >
At this point the console shows the following (but the system is still running):
VAPI(4): hobul.c[559]: HOBUL_delete: Invoked while 0x7 resources are still
allocated
VAPI(2): EVAPI_release_hca_hndl HOBUL_delete failed return: Resource is busy
[KERNEL_IB][tsIbTavorQpDestroy][tavor_qp.c:562]InfiniHost0: VAPI_destroy_qp
failed, return code = -244 (Invalid HCA Handle.)
[KERNEL_IB][tsIbTavorQpDestroy][tavor_qp.c:562]InfiniHost0: VAPI_destroy_qp
failed, return code = -244 (Invalid HCA Handle.)
[KERNEL_IB][tsIbTavorQpDestroy][tavor_qp.c:562]InfiniHost0: VAPI_destroy_qp
failed, return code = -244 (Invalid HCA Handle.)
[KERNEL_IB][tsIbTavorQpDestroy][tavor_qp.c:562]InfiniHost0: VAPI_destroy_qp
failed, return code = -244 (Invalid HCA Handle.)
[KERNEL_IB][tsIbTavorMemoryDeregister][tavor_mr.c:126]InfiniHost0:
VAPI_deregister_mr failed, return code = -244 (Invalid HCA Handle.)
[KERNEL_IB][tsIbTavorCqDestroy][tavor_cq.c:124]InfiniHost0:
EVAPI_clear_comp_eventh failed, return code = -244 (Invalid HCA Handle.)
[KERNEL_IB][tsIbTavorCqDestroy][tavor_cq.c:131]InfiniHost0: VAPI_destroy_cq
failed, return code = -244 (Invalid HCA Handle.)
[KERNEL_IB][tsIbTavorPdDestroy][tavor_pd.c:76]InfiniHost0: VAPI_dealloc_pd
failed, return code = -244 (Invalid HCA Handle.)
Now, back to the rmmod'ing:
# oib9 /root > rmmod mod_vapi
# oib9 /root > rmmod mod_vipkl
# oib9 /root >
At this point (still no crash) I see on the console:
VIPKL(1): em.c[88]: EM delete:found unreleased async object
VIPKL(1): qpm.c[156]: QPM delete: found unreleased qp in array
VIPKL(1): qpm.c[156]: QPM delete: found unreleased qp in array
VIPKL(1): qpm.c[156]: QPM delete: found unreleased qp in array
VIPKL(1): qpm.c[156]: QPM delete: found unreleased qp in array
VIPKL(1): cqm.c[62]: CQM delete:found unreleased cq
VIPKL(1): mmu.c[97]: MM delete:found unreleased mr
VIPKL(1): pdm.c[44]: PDM delete:found unreleased pd
THH(1): tmrwm.c[1384]: found unreleased internal mr!!!!
THH(1): tmrwm.c[1384]: found unreleased internal mr!!!!
THH(1): tmrwm.c[1384]: found unreleased internal mr!!!!
THH(1): tmrwm.c[1384]: found unreleased internal mr!!!!
THH(1): tmrwm.c[1384]: found unreleased internal mr!!!!
THH(1): tmrwm.c[1409]: found unreleased mr!!!!
More rmmod'ing:
# oib9 /root > rmmod mod_thh
# oib9 /root > rmmod mod_hh
# oib9 /root > rmmod mod_mpga
# oib9 /root > rmmod mod_vapi_common
# oib9 /root > rmmod mosal
# oib9 /root > rmmod ib_mad
# oib9 /root > rmmod ib_poll
# oib9 /root > rmmod ib_core
# oib9 /root > rmmod ib_packet_lib
# oib9 /root > rmmod ib_services
# oib9 /root > lsmod
Module Size Used by Not tainted
nfs 98384 3 (autoclean)
lockd 58832 1 (autoclean) [nfs]
sunrpc 96348 1 (autoclean) [nfs lockd]
e1000 74208 1
microcode 5472 0 (autoclean)
# oib9 /root >
After the rmmod of mod_thh, I see the following on the console:
THH(1): thh_mod_obj.c[378]: cleanup_module: destroying InfiniHost0
THH kernel module removed successfully
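For anyone repeating the one-by-one pass, the order can be derived from the "Used by" column rather than read off by eye: a module is a candidate for rmmod once none of the modules named in its bracket list is still loaded. Below is a minimal Python sketch of that idea, assuming the 2.4-style lsmod layout shown above. The function names are made up for illustration, use counts (the third column) are ignored, and it only computes an order; it does not call rmmod.

```python
import re


def join_wrapped(text):
    """Return lsmod rows, rejoining rows whose [user list] wrapped."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # A real row looks like "name size count ..."; the header does not.
        if len(fields) >= 3 and fields[1].isdigit():
            rows.append(line.strip())
        elif rows and "[" in rows[-1] and "]" not in rows[-1]:
            # Continuation of a bracket list that was cut by line wrapping.
            rows[-1] += " " + line.strip()
    return rows


def parse_lsmod(text):
    """Map each module name to the set of modules listed as using it."""
    users = {}
    for row in join_wrapped(text):
        bracket = re.search(r"\[([^\]]*)\]", row)
        users[row.split()[0]] = set(bracket.group(1).split()) if bracket else set()
    return users


def unload_order(users):
    """Pick, in listing order, modules whose users are all gone already."""
    remaining = dict(users)
    order = []
    while remaining:
        for name, used_by in remaining.items():
            if used_by.isdisjoint(remaining):
                break
        else:
            raise ValueError("circular dependency among: " + ", ".join(remaining))
        order.append(name)
        del remaining[name]
    return order
```

Feeding it a saved lsmod output gives one valid order. A module that is busy (non-zero use count, such as e1000 above) would still refuse to unload, since the sketch does not look at that column.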
Interestingly, the system still ran (without a problem). So, I rebooted and
attempted to unload the modules in groups:
# oib9 /root > ifdown ib0
# oib9 /root > lsmod
Module Size Used by Not tainted
nfs 98384 3 (autoclean)
lockd 58832 1 (autoclean) [nfs]
sunrpc 96348 1 (autoclean) [nfs lockd]
ib_useraccess 10180 0 (unused)
ib_useraccess_cm 15808 0 (unused)
ib_cm 46680 0 [ib_useraccess_cm]
ib_udapl 40920 0 (unused)
ib_ip2pr 29500 0 [ib_useraccess_cm ib_udapl]
ib_ipoib 58380 0 [ib_udapl ib_ip2pr]
ib_sa_client 29076 0 [ib_udapl ib_ip2pr ib_ipoib]
ib_client_query 12736 0 [ib_udapl ib_ip2pr ib_ipoib ib_sa_client]
ib_tavor 23940 0 (autoclean) [ib_useraccess_cm]
mod_vapi 129892 0 (autoclean) [ib_useraccess_cm ib_udapl
ib_tavor]
mod_vipkl 219360 0 (autoclean) [mod_vapi]
mod_thh 257920 0 (autoclean) [mod_vapi]
mod_hh 15608 0 (autoclean) [mod_vipkl mod_thh]
mod_mpga 23584 0 (autoclean) [mod_vapi]
mod_vapi_common 65276 0 (autoclean) [ib_useraccess_cm ib_udapl
ib_tavor mod_vapi mod_vipkl mod_thh]
mosal 110053 0 (autoclean) [mod_vapi mod_vipkl mod_thh
mod_mpga mod_vapi_common]
ib_mad 21068 0 [ib_useraccess ib_cm ib_client_query]
ib_poll 14616 0 [ib_cm ib_ip2pr ib_client_query]
ib_core 47636 0 [ib_useraccess ib_useraccess_cm ib_cm
ib_udapl ib_ip2pr ib_ipoib ib_sa_client ib_tavor ib_mad]
ib_packet_lib 147024 0 [ib_mad ib_core]
ib_services 16932 0 [ib_useraccess ib_useraccess_cm ib_cm
ib_udapl ib_ip2pr ib_ipoib ib_sa_client ib_client_query ib_tavor ib_mad
ib_poll ib_core ib_packet_lib]
e1000 74208 1
microcode 5472 0 (autoclean)
# oib9 /root > modprobe -r ib_udapl
At this point, on the console, I see the same messages all the way up to the
point of rmmod'ing mod_thh, but then:
Unable to handle kernel paging request at virtual address f8ccb183
printing eip:
f8ccb183
*pde = 376d4067
*pte = 00000000
Oops: 0000
nfs lockd sunrpc ib_ipoib ib_sa_client ib_client_query mod_vapi_common mosal
ib_mad ib_poll ib_core ib_packet_lib ib_services e1000 microcode
CPU: 0
EIP: 0060:[<f8ccb183>] Not tainted
EFLAGS: 00010282
EIP is at __insmod_mod_vapi_common_S.data_L4 [mod_vapi_common] 0xc928b
(2.4.21-9.EL-IB_patches/i686)
eax: c483e580 ebx: f7984200 ecx: f7983000 edx: 00000000
esi: f8ccb183 edi: f7b3c5b8 ebp: f63e3edc esp: f63e3ea8
ds: 0068 es: 0068 ss: 0068
Process modprobe (pid: 792, stackpage=f63e3000)
Stack: f8ba16e0 f7984200 00000282 f63e3ed8 00000282 f7b3c408 f7b3c49c 00000282
00000015 f7b3c408 f7b3c408 f7b3c400 f7b3c408 f63e3f0c f906eff7 f7984200
f9070147 00000015 00000003 00200000 f907012a f7b3c408 f7b3c408 f7b3c408
Call Trace: [<c010cd9d>] show_stack [kernel] 0x6c
[<c010cf31>] show_registers [kernel] 0x169
[<c010d0b2>] die [kernel] 0x8a
[<c011cdbc>] do_page_fault [kernel] 0x2e4
Code: Bad EIP value.
Kernel panic: Fatal exception
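One thing worth checking in an oops like this is whether the reported EIP can really be inside the module that the symbol lookup names. The offset printed after the symbol (0xc928b) can be compared with the size lsmod reports for mod_vapi_common (65276 bytes). A small Python sketch of that comparison follows. The function name is hypothetical, and the interpretation rests on the assumption that the `__insmod_..._S.data_L4` symbol marks the start of the module's .data section.

```python
import re

# Matches e.g. "EIP is at <symbol> [<module>] 0x<offset>"
EIP_LINE = re.compile(r"EIP is at (\S+) \[(\w+)\] (0x[0-9a-fA-F]+)")


def eip_outside_module(eip_line, module_sizes):
    """Return (module, offset, True if the offset exceeds the module size)."""
    match = EIP_LINE.search(eip_line)
    if match is None:
        raise ValueError("not an 'EIP is at' line")
    module = match.group(2)
    offset = int(match.group(3), 16)
    # The symbol lies inside the module, so an offset larger than the whole
    # module puts the address past the module's end.
    return module, offset, offset > module_sizes[module]


line = "EIP is at __insmod_mod_vapi_common_S.data_L4 [mod_vapi_common] 0xc928b"
print(eip_outside_module(line, {"mod_vapi_common": 65276}))
```

Here 0xc928b is 823947, far larger than 65276, so the address is probably not in mod_vapi_common at all; the lookup has most likely just picked the nearest lower symbol. Together with `*pte = 00000000`, that would be consistent with a jump into code that had already been unloaded, though the trace alone does not prove it.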
This is pretty reproducible for me. Basic sequence:
0) Reboot with the modules.conf that I sent out on Friday (result is full boat
of modules, and ipoib).
1) ifdown ib0
2) modprobe -r ib_udapl
It seems that you must do it this way, because the following has a different
path of failure:
0) Reboot
1) modprobe -r ib_udapl
2) ifdown ib0
This way, I see the following:
# oib9 /root > modprobe -r ib_udapl
# oib9 /root >
On the console are the same messages all the way up to unloading mod_thh,
but then the following is spewed slowly:
[KERNEL_IB][ib_cached_sm_path_get_Rsmp_93f76742][core_cache.c:84]Bad magic
0x2000000 at f7630c80 for DEVICE
[KERNEL_IB][ib_cached_gid_get_Rsmp_efcd7458][core_cache.c:137]Bad magic
0x2000000 at f7630c80 for DEVICE
[KERNEL_IB][ib_mad_send_Rsmp_43faef95][mad_ib.c:132]Bad magic 0x2000000 at
f7630c80 for DEVICE
Back to the command line:
# oib9 /root > lsmod
Module Size Used by Not tainted
nfs 98384 3 (autoclean)
lockd 58832 1 (autoclean) [nfs]
sunrpc 96348 1 (autoclean) [nfs lockd]
ib_ipoib 58380 1
ib_sa_client 29076 0 [ib_ipoib]
ib_client_query 12736 0 [ib_ipoib ib_sa_client]
ib_mad 21068 0 [ib_client_query]
ib_poll 14616 0 [ib_client_query]
ib_core 47636 0 [ib_ipoib ib_sa_client ib_mad]
ib_packet_lib 147024 0 [ib_mad ib_core]
ib_services 16932 0 [ib_ipoib ib_sa_client ib_client_query
ib_mad ib_poll ib_core ib_packet_lib]
e1000 74208 1
microcode 5472 0 (autoclean)
# oib9 /root >
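To make explicit which modules went away, the listing above can be compared with the one taken right after boot. A minimal Python sketch, with hypothetical function names and assuming both listings were saved as text:

```python
def module_names(lsmod_text):
    """Collect module names from 2.4-style lsmod output."""
    names = set()
    for line in lsmod_text.splitlines():
        fields = line.split()
        # Real rows are "name size count ..."; the header, shell prompts and
        # wrapped continuation lines have no numeric second field.
        if len(fields) >= 3 and fields[1].isdigit():
            names.add(fields[0])
    return names


def removed_modules(before, after):
    """Names present in the first listing but missing from the second."""
    return sorted(module_names(before) - module_names(after))
```

Calling `removed_modules(boot_listing, current_listing)` on the two saved outputs lists everything that `modprobe -r` took out.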
That's pretty scary: "modprobe -r" of ib_udapl seems to unload all of the
Mellanox modules as well, so ipoib is completely hosed. So, if
I attempt to ifdown ib0:
Unable to handle kernel paging request at virtual address f8cca343
printing eip:
f8cca343
*pde = 3747f067
*pte = 00000000
Oops: 0000
nfs lockd sunrpc ib_ipoib ib_sa_client ib_client_query ib_mad ib_poll ib_core
ib_packet_lib ib_services e1000 microcode
CPU: 0
EIP: 0060:[<f8cca343>] Not tainted
EFLAGS: 00010246
EIP is at __insmod_ib_mad_S.data_L288 [ib_mad] 0x110223
(2.4.21-9.EL-IB_patches/i686)
eax: f72c33cc ebx: f72c33c0 ecx: f74ba400 edx: f726cb80
esi: f74ba408 edi: 00001002 ebp: f624de44 esp: f624de10
ds: 0068 es: 0068 ss: 0068
Process ip (pid: 777, stackpage=f624d000)
Stack: f8ba0ca9 f72c33c0 00000002 f90703fc f74ba408 00000800 00000800 f624de50
00000100 00000297 f74ba400 f74ba400 f74ba408 f624de74 f906ecf9 f72c33c0
f624de88 f90683a4 f74ba408 f74ba49c 00000286 f74ba408 f74ba52c f72c33c0
Call Trace: [<c010cd9d>] show_stack [kernel] 0x6c
[<c010cf31>] show_registers [kernel] 0x169
[<c010d0b2>] die [kernel] 0x8a
[<c011cdbc>] do_page_fault [kernel] 0x2e4
Code: Bad EIP value.
Kernel panic: Fatal exception
Quite strange behaviour. Hopefully you've made it all the way to the bottom
of this email...
On Friday 26 March 2004 14:16, Roland Dreier wrote:
> Thanks for the information. Which module triggers the problems on
> unloading? In other words, if you go through and start unloading
> modules in the order they appear in lsmod, when do you get the crash?
>
> - Roland
--
(((((((((((((((((((((((((((((((((())))))))))))))))))))))))))))))))))
Makia Minich The memories of a man in his old age,
are the deeds of a man in his Prime.
makia at llnl.gov -- Waters/Wright
(((((((((((((((((((((((((((((((((())))))))))))))))))))))))))))))))))