Bug #12069
closed
Backport sh_delay() and tvsleep() from ksh-2020.0.0
100%
Description
We have seen a few cases where /usr/bin/sleep slept forever. Here is the truss output showing what happened:
18962: sigaction(SIGALRM, 0x080468E0, 0x08046960) = 0
18962: new: hand = 0xFEEDC869 mask = 0xFFBFFEFF 0xFFFFFFFF 0x000001FF 0 flags = 0x0000
18962: old: hand = 0xFEEDC869 mask = 0xFFBFFEFF 0xFFFFFFFF 0x000001FF 0 flags = 0x0000
18962: setitimer(ITIMER_REAL, 0x080469A0, 0x08046990) = 0
18962: value: interval: 0.000000 sec value: 399.000000 sec
18962: ovalue: interval: 0.000000 sec value: 0.000000 sec
18962: pause() (sleeping...)
18962: Received signal #14, SIGALRM, in pause() [caught]
18962: pause() Err#4 EINTR
18962: lwp_sigmask(SIG_SETMASK, 0x00002000, 0x00000000, 0x00000000, 0x00000000) = 0xFFBFFEFF [0xFFFFFFFF]
18962: alarm(0) = 0
18962: lwp_sigmask(SIG_SETMASK, 0x00000000, 0x00000000, 0x00000000, 0x00000000) = 0xFFBFFEFF [0xFFFFFFFF]
18962: sigaction(SIGALRM, 0x08046610, 0x08046690) = 0
18962: new: hand = 0xFEEDC869 mask = 0xFFBFFEFF 0xFFFFFFFF 0x000001FF 0 flags = 0x0000
18962: old: hand = 0xFEEDC869 mask = 0xFFBFFEFF 0xFFFFFFFF 0x000001FF 0 flags = 0x0000
18962: setitimer(ITIMER_REAL, 0x080466D0, 0x080466C0) = 0
18962: value: interval: 0.000000 sec value: 0.001263 sec
18962: ovalue: interval: 0.000000 sec value: 0.000000 sec
18962: Received signal #14, SIGALRM [caught]
18962: lwp_sigmask(SIG_SETMASK, 0x00002000, 0x00000000, 0x00000000, 0x00000000) = 0xFFBFFEFF [0xFFFFFFFF]
18962: alarm(0) = 0
18962: setcontext(0x080461E0)
18962: setcontext(0x08046540)
18962: getpid() = 18962 [18961]
18962: pause() (sleeping...)
The problem is the second setitimer() call, which sets the timeout to a very short value (0.001263 sec), so SIGALRM arrives almost immediately. The alarm is then cleared by the subsequent alarm(0) call.
The root cause can be seen in the sleep() function in usr/src/lib/libshell/common/bltins/sleep.c (called from sh_delay() for delays longer than 30 seconds). First, sh_timeradd() is called to set the alarm (line 133); later, pause() is called to wait for the signal (line 137). If a signal arrives before the timeout expires, the do-while loop (lines 134-152) calls pause() again, expecting that the sigalrm() handler has configured a new alarm by calling setalarm() in the meantime. This does happen, but if the second scheduled alarm interval is too short (in our case 0.001263 sec), the signal can arrive before pause() is called the second time. That pause() then hangs indefinitely, or until some unrelated signal arrives, which might never happen.
Related issues