Bug #3770


ipmi driver hangs due to attach/detach/attach cycle

Added by Alek Pinchuk about 11 years ago. Updated about 11 years ago.

driver - device drivers


36657:May 11 11:06:02 myhost ipmi: [ID 550463 kern.warning] WARNING: Timed out waiting for GET_DEVICE_ID

Noticed this one when FMD didn't come up:

> ::stack
__pollsys+0x15(8046e70, 1, 0, 0)
pselect+0x199(a, 8046fa0, 0, 0, 0, 0)
select+0x78(a, 8046fa0, 0, 0, 0, 0)
ipmi_bmc_send+0x122(828d348, 80470d8, 810d450, 810dc84)
ipmi_send+0x30(810d448, 80470d8, 80470f8, fe1e6b82)
ipmi_sdr_get_info+0x40(810d448, fe1fb000, 8047128, fe1e6d72)
ipmi_sdr_refresh+0x20(810d448, 14, fe1e4114, fe1e75be)
ipmi_sdr_iter+0x33(810d448, fe1e3900, 0, fe1e3e0a)
ipmi_entity_refresh+0x51(810d448, 80471cc, 1, fe1e3e7e)
ipmi_entity_iter+0x20(810d448, fde71968, 8047200, fde71f32)
ipmi_enum+0x103(812b2a8, 823c2b8, 811c3d0, 0, 64, 0)
topo_mod_enumerate+0xb3(812b2a8, 823c2b8, 82a1ab8, 811c3d0, 0, 64)
enum_run+0xa8(812bb68, 8297b20, a, fec570d6)
topo_xml_range_process+0x102(812bb68, 81ef278, 8297b20, 8047338)
tf_rdata_new+0xf5(812bb68, 8130480, 81ef278, 823c2b8)
topo_xml_walk+0x243(812bb68, 8130480, 81ef3b8, 823c2b8)
topo_xml_walk+0x1ae(812bb68, 8130480, 8153b20, 823c2b8)
dependent_create+0x12f(812bb68, 8130480, 8136c30, 8153b20, 823c2b8, fec5e7ec)
dependents_create+0x7c(812bb68, 8130480, 8136c30, 8153da0, 823c2b8, 8118128)
pad_process+0x53b(812bb68, 8297e50, 8153da0, 823c2b8, 8297e78, 811c540)
topo_xml_range_process+0x282(812bb68, 8153da0, 8297e50, 80475f8)
tf_rdata_new+0xf5(812bb68, 8130480, 8153da0, 81182a8)
topo_xml_walk+0x243(812bb68, 8130480, 814e798, 81182a8)
topo_xml_enum+0x4c(812bb68, 8130480, 81182a8, fec461a1)
topo_file_load+0xe1(812bb68, 81182a8, fe20a7b8, fe20a94c, 0, fe20b0bc)
topo_mod_enummap+0x29(812bb68, 81182a8, fe20a7b8, fe20a94c)
x86pi_enum_start+0x210(812bb68, 8047880, 80478a8, fe204492)
x86pi_enum+0x56(812bb68, 81182a8, 811cd50, 0, 0, 0)
topo_mod_enumerate+0xb3(812bb68, 81182a8, 8130998, 811cd50, 0, 0)
enum_run+0xa8(812bbd8, 8120998, a, fec570d6)
topo_xml_range_process+0x102(812bbd8, 810a010, 8120998, 80479a8)
tf_rdata_new+0xf5(812bbd8, 81309b0, 810a010, 81182a8)
topo_xml_walk+0x243(812bbd8, 81309b0, 810a150, 81182a8)
topo_xml_enum+0x4c(812bbd8, 81309b0, 81182a8, fec461a1)
topo_file_load+0xe1(812bbd8, 81182a8, 811cf28, 811cd60, 0, 2c)
topo_tree_enum+0x85(812af40, 81100e0, 3a103fd8, fec531f2)
topo_tree_enum_all+0x2d(812af40, fec6f000, 8047c58, fec50e6a)
topo_snap_create+0xd6(812af40, 8047cac, 1, fec51086)
topo_snap_hold+0x45(812af40, 0, 8047cac, 807e806)
fmd_topo_update+0xaa(1, 0, 8047dd8, 8060e82)
fmd_topo_init+0xd(0, 0, 0, 40000000, 8085540, 0)
fmd_run+0x14f(809d848, 4, 8047e58, 8074def)
main+0x2fa(1, 8047e90, 8047e98, 8047e4c)
_start+0x7d(1, 8047f20, 0, 8047f34, 8047f4c, 8047f6d)
> ::pgrep fmd
S    PID   PPID   PGID    SID    UID      FLAGS             ADDR NAME
R   2047   1882   2047   1853      0 0x4a004000 ffffff0d3e0f61c8 fmd
R    485      9    485    485      0 0x4a004000 ffffff0d563971e8 fmd
R    496    485    496    496      0 0x42000000 ffffff0d3766d1c0 fmd
> ffffff0d3766d1c0::walk thread | ::findstack
stack pointer for thread ffffff0d56f31040: ffffff005d636bb0
[ ffffff005d636bb0 _resume_from_idle+0xf1() ]
  ffffff005d636be0 swtch+0x145()
  ffffff005d636c40 cv_wait_sig_swap_core+0x174()
  ffffff005d636c60 cv_wait_sig_swap+0x18()
  ffffff005d636cd0 cv_waituntil_sig+0x13c()
  ffffff005d636db0 poll_common+0x462()
  ffffff005d636e30 pollsys+0xe4()
  ffffff005d636ec0 dtrace_systrace_syscall32+0x11a()
  ffffff005d636f10 _sys_sysenter_post_swapgs+0x149()

Anonymous DTrace of ipmi_attach/detach showed an attach, then a detach, then another attach during boot.

I've seen this on several different makes and models of servers, but it's not consistently reproducible on most HW. The Nexenta lab has one box that exhibits the problem every time.

The proposed fix does the following:
- Explicitly set the ipmi_detaching variable to zero in case detach had set it to 1
- Fail attach if ipmi_startup fails
- Destroy the global mutexes and cvs (originally noticed by Hans)
- Fail attach for instance(s) other than zero, since the driver assumes there is only one instance
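
The four points above can be illustrated with a small user-space model. This is only a sketch, not the actual driver change: apart from the ipmi_detaching name taken from this report, every identifier here (model_attach, model_detach, model_startup, model_startup_ok, model_locks_ready, the MODEL_* codes) is hypothetical, and a plain flag stands in for the real kernel mutexes and cvs.

```c
/*
 * Hypothetical user-space model of the attach/detach logic described in
 * this bug. It is NOT the real ipmi driver code; only the ipmi_detaching
 * name comes from the bug report.
 */
enum { MODEL_OK = 0, MODEL_FAIL = -1 };

static int ipmi_detaching;        /* set to 1 by detach (name from the report) */
static int model_locks_ready;     /* stands in for the global mutexes and cvs */
static int model_startup_ok = 1;  /* test knob: makes the startup step fail */

/* Stand-in for ipmi_startup; returns 0 on success. */
static int
model_startup(void)
{
	return (model_startup_ok ? 0 : -1);
}

static int
model_attach(int instance)
{
	/* The driver assumes a single instance, so refuse any other. */
	if (instance != 0)
		return (MODEL_FAIL);

	/* A previous detach may have left the flag at 1; reset it explicitly. */
	ipmi_detaching = 0;

	model_locks_ready = 1;		/* "initialize" the global locks */

	/* Fail the attach if startup fails, undoing the lock setup. */
	if (model_startup() != 0) {
		model_locks_ready = 0;
		return (MODEL_FAIL);
	}
	return (MODEL_OK);
}

static int
model_detach(int instance)
{
	if (instance != 0)
		return (MODEL_FAIL);

	ipmi_detaching = 1;		/* tell any worker to stop */
	model_locks_ready = 0;		/* "destroy" the global locks */
	return (MODEL_OK);
}
```

Calling model_attach(0), model_detach(0), model_attach(0) in that order reproduces the boot-time sequence seen with DTrace; in this model the second attach succeeds and leaves ipmi_detaching at zero, which is the state the first bullet is meant to guarantee.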

Actions #1

Updated by Rich Lowe about 11 years ago

  • Status changed from New to Resolved
  • % Done changed from 0 to 100

Resolved in e1c99a7

