Bug #3770
ipmi driver hangs due to attach/detach/attach cycle
Description
36657:May 11 11:06:02 myhost ipmi: [ID 550463 kern.warning] WARNING: Timed out waiting for GET_DEVICE_ID
Noticed this one when FMD didn't come up:
> ::stack
libc_hwcap1.so.1`__pollsys+0x15(8046e70, 1, 0, 0)
libc_hwcap1.so.1`pselect+0x199(a, 8046fa0, 0, 0, 0, 0)
libc_hwcap1.so.1`select+0x78(a, 8046fa0, 0, 0, 0, 0)
libipmi.so.1`ipmi_bmc_send+0x122(828d348, 80470d8, 810d450, 810dc84)
libipmi.so.1`ipmi_send+0x30(810d448, 80470d8, 80470f8, fe1e6b82)
libipmi.so.1`ipmi_sdr_get_info+0x40(810d448, fe1fb000, 8047128, fe1e6d72)
libipmi.so.1`ipmi_sdr_refresh+0x20(810d448, 14, fe1e4114, fe1e75be)
libipmi.so.1`ipmi_sdr_iter+0x33(810d448, fe1e3900, 0, fe1e3e0a)
libipmi.so.1`ipmi_entity_refresh+0x51(810d448, 80471cc, 1, fe1e3e7e)
libipmi.so.1`ipmi_entity_iter+0x20(810d448, fde71968, 8047200, fde71f32)
ipmi.so`ipmi_enum+0x103(812b2a8, 823c2b8, 811c3d0, 0, 64, 0)
libtopo.so.1`topo_mod_enumerate+0xb3(812b2a8, 823c2b8, 82a1ab8, 811c3d0, 0, 64)
libtopo.so.1`enum_run+0xa8(812bb68, 8297b20, a, fec570d6)
libtopo.so.1`topo_xml_range_process+0x102(812bb68, 81ef278, 8297b20, 8047338)
libtopo.so.1`tf_rdata_new+0xf5(812bb68, 8130480, 81ef278, 823c2b8)
libtopo.so.1`topo_xml_walk+0x243(812bb68, 8130480, 81ef3b8, 823c2b8)
libtopo.so.1`topo_xml_walk+0x1ae(812bb68, 8130480, 8153b20, 823c2b8)
libtopo.so.1`dependent_create+0x12f(812bb68, 8130480, 8136c30, 8153b20, 823c2b8, fec5e7ec)
libtopo.so.1`dependents_create+0x7c(812bb68, 8130480, 8136c30, 8153da0, 823c2b8, 8118128)
libtopo.so.1`pad_process+0x53b(812bb68, 8297e50, 8153da0, 823c2b8, 8297e78, 811c540)
libtopo.so.1`topo_xml_range_process+0x282(812bb68, 8153da0, 8297e50, 80475f8)
libtopo.so.1`tf_rdata_new+0xf5(812bb68, 8130480, 8153da0, 81182a8)
libtopo.so.1`topo_xml_walk+0x243(812bb68, 8130480, 814e798, 81182a8)
libtopo.so.1`topo_xml_enum+0x4c(812bb68, 8130480, 81182a8, fec461a1)
libtopo.so.1`topo_file_load+0xe1(812bb68, 81182a8, fe20a7b8, fe20a94c, 0, fe20b0bc)
libtopo.so.1`topo_mod_enummap+0x29(812bb68, 81182a8, fe20a7b8, fe20a94c)
x86pi.so`x86pi_enum_start+0x210(812bb68, 8047880, 80478a8, fe204492)
x86pi.so`x86pi_enum+0x56(812bb68, 81182a8, 811cd50, 0, 0, 0)
libtopo.so.1`topo_mod_enumerate+0xb3(812bb68, 81182a8, 8130998, 811cd50, 0, 0)
libtopo.so.1`enum_run+0xa8(812bbd8, 8120998, a, fec570d6)
libtopo.so.1`topo_xml_range_process+0x102(812bbd8, 810a010, 8120998, 80479a8)
libtopo.so.1`tf_rdata_new+0xf5(812bbd8, 81309b0, 810a010, 81182a8)
libtopo.so.1`topo_xml_walk+0x243(812bbd8, 81309b0, 810a150, 81182a8)
libtopo.so.1`topo_xml_enum+0x4c(812bbd8, 81309b0, 81182a8, fec461a1)
libtopo.so.1`topo_file_load+0xe1(812bbd8, 81182a8, 811cf28, 811cd60, 0, 2c)
libtopo.so.1`topo_tree_enum+0x85(812af40, 81100e0, 3a103fd8, fec531f2)
libtopo.so.1`topo_tree_enum_all+0x2d(812af40, fec6f000, 8047c58, fec50e6a)
libtopo.so.1`topo_snap_create+0xd6(812af40, 8047cac, 1, fec51086)
libtopo.so.1`topo_snap_hold+0x45(812af40, 0, 8047cac, 807e806)
fmd_topo_update+0xaa(1, 0, 8047dd8, 8060e82)
fmd_topo_init+0xd(0, 0, 0, 40000000, 8085540, 0)
fmd_run+0x14f(809d848, 4, 8047e58, 8074def)
main+0x2fa(1, 8047e90, 8047e98, 8047e4c)
_start+0x7d(1, 8047f20, 0, 8047f34, 8047f4c, 8047f6d)
> ::pgrep fmd
S    PID   PPID   PGID    SID    UID      FLAGS             ADDR NAME
R   2047   1882   2047   1853      0 0x4a004000 ffffff0d3e0f61c8 fmd
R    485      9    485    485      0 0x4a004000 ffffff0d563971e8 fmd
R    496    485    496    496      0 0x42000000 ffffff0d3766d1c0 fmd
> ffffff0d3766d1c0::walk thread | ::findstack
stack pointer for thread ffffff0d56f31040: ffffff005d636bb0
[ ffffff005d636bb0 _resume_from_idle+0xf1() ]
  ffffff005d636be0 swtch+0x145()
  ffffff005d636c40 cv_wait_sig_swap_core+0x174()
  ffffff005d636c60 cv_wait_sig_swap+0x18()
  ffffff005d636cd0 cv_waituntil_sig+0x13c()
  ffffff005d636db0 poll_common+0x462()
  ffffff005d636e30 pollsys+0xe4()
  ffffff005d636ec0 dtrace_systrace_syscall32+0x11a()
  ffffff005d636f10 _sys_sysenter_post_swapgs+0x149()
Anonymous DTrace tracing of ipmi_attach and ipmi_detach showed an attach, then a detach, then another attach during boot.
I've seen this on several makes and models of servers, but it's not consistently reproducible on most hardware. The Nexenta lab has one particular box that exhibits the problem every time.
The proposed fix performs the following:
- Explicitly set the ipmi_detaching variable to zero on attach, in case a previous detach had set it to 1
- Fail attach if ipmi_startup fails
- Destroy the global mutexes and condition variables (originally noticed by Hans)
- Fail attach for any instance other than zero, since the driver assumes there is only one instance
Updated by Rich Lowe about 10 years ago
- Status changed from New to Resolved
- % Done changed from 0 to 100
Resolved in e1c99a7