Bug #3055
Closed
sendfile sets *off to 0 when interrupted by SIGALRM
Description
After spending several hours investigating a strange problem in Netatalk running on OpenIndiana, I finally tracked it down to sendfile() occasionally returning -1 with an errno of EINTR and *off reset to 0.
truss shows:
8.6034 0.0092 sendfilev64(1, 9, 0x08047970, 1, 0x08047964) Err#4 EINTR
sfv_fd=6 sfv_flag=0x0 sfv_off=644100096 sfv_len=253952
8.6037 0.0003 Received signal #14, SIGALRM [caught]
siginfo: SIGALRM pid=13721 uid=0
sfv_off is NOT the returned value. To get at that, I added some debugging code after my sendfile() call, which showed that *off was 644100096 BEFORE sendfile() but 0 afterwards:
Apr 15 06:03:06.564213 afpd22475 {dsi_stream.c:354} (E:DSI): sendfile: EINTR - old offset: 644100096, new offset: 0
Apr 15 06:03:15.043380 afpd22475 {dsi_stream.c:354} (E:DSI): sendfile: EINTR - old offset: 821833728, new offset: 0
Apr 15 06:03:54.438215 afpd22475 {dsi_stream.c:354} (E:DSI): sendfile: EINTR - old offset: 278147072, new offset: 0
Apr 15 06:04:01.431494 afpd22475 {dsi_stream.c:354} (E:DSI): sendfile: EINTR - old offset: 501493760, new offset: 0
Clearly, if sendfile() returns with something like EAGAIN or EINTR, *off must still contain the current offset, possibly updated by the number of bytes sent.
Workaround:
If sendfile() returns -1 with an errno of EINTR, check whether *off is 0 and reassign *off appropriately. I'm assuming 0 bytes have been sent in this case.
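The workaround described above could be sketched roughly as follows. This is only an illustration, not the Netatalk code: sendfile_keep_offset, sendfile_fn and buggy_sendfile_stub are hypothetical names. The real call is passed in as a function pointer so the logic can be exercised on a system that does not show the bug, and the stub merely mimics the reported behaviour.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>

/* Same shape as sendfile() on Solaris/illumos and on Linux. */
typedef ssize_t (*sendfile_fn)(int out_fd, int in_fd, off_t *off, size_t len);

/*
 * Hypothetical wrapper: call fn once. If it fails with EINTR and has left
 * *off at 0 although it was non-zero before the call, restore the saved
 * offset. Assumes, as the report does, that no bytes were sent in that case.
 */
ssize_t sendfile_keep_offset(sendfile_fn fn, int out_fd, int in_fd,
                             off_t *off, size_t len)
{
    off_t saved;
    ssize_t ret;

    assert(fn != NULL && off != NULL);
    saved = *off;
    ret = fn(out_fd, in_fd, off, len);

    if (ret == -1 && errno == EINTR && *off == 0 && saved != 0)
        *off = saved;   /* plain assignment, errno stays EINTR */
    return ret;
}

/* Stand-in that mimics the behaviour reported here; not a real sendfile. */
ssize_t buggy_sendfile_stub(int out_fd, int in_fd, off_t *off, size_t len)
{
    (void)out_fd;
    (void)in_fd;
    (void)len;
    *off = 0;
    errno = EINTR;
    return -1;
}
```

In real use the first argument would be sendfile from <sys/sendfile.h>. A transfer that legitimately starts at offset 0 is left alone, since there is nothing to restore.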
Updated by Daniil Lunev about 11 years ago
And what about other signals? Do they reset the offset?
Updated by Frank Lahm about 11 years ago
Daniil Lunev wrote:
And what about other signals? Do they reset the offset?
Don't know, SIGALRM is the only signal I tested.
In the production code SIGALRM is sent by a timer set to 30 seconds, so the problem only seems to happen rarely, when a long-lasting file transfer is in progress (Netatalk sends files to the client in 256k chunks).
I managed to reproduce the issue quickly by sending the process a SIGALRM from a shell loop every 0.1 seconds:
# while true; do kill -SIGALRM PID; sleep 0.1; done
As can be seen in the truss output, sendfilev quite often returns -1 with errno EAGAIN. There the offset seems to be updated correctly. Obviously that's a different return path: it's not caused by a signal but by a full TCP buffer (or similar).
The TCP socket in question is in non-blocking IO mode of course.
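To make the difference between the two error returns concrete, here is a minimal sketch of a send loop on a non-blocking socket that treats EINTR as "retry" and EAGAIN as "wait until writable". It is an assumption about how such a loop might look, not the Netatalk code: send_range, sendfile_fn and stub_sendfile are hypothetical names, and the loop relies on the callee keeping *off up to date, which is exactly what this report says goes wrong on EINTR.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <stddef.h>
#include <sys/types.h>

/* Same shape as sendfile() on Solaris/illumos and on Linux. */
typedef ssize_t (*sendfile_fn)(int out_fd, int in_fd, off_t *off, size_t len);

/*
 * Hypothetical helper: send len bytes starting at *off. Returns 0 when
 * everything was sent, -1 on a real error or unexpected end of input.
 */
int send_range(sendfile_fn fn, int out_fd, int in_fd, off_t *off, size_t len)
{
    off_t end;

    assert(fn != NULL && off != NULL);
    end = *off + (off_t)len;

    while (*off < end) {
        ssize_t n = fn(out_fd, in_fd, off, (size_t)(end - *off));

        if (n > 0)
            continue;           /* progress: fn advanced *off */
        if (n == 0)
            return -1;          /* input shorter than expected */
        if (errno == EINTR)
            continue;           /* interrupted by a signal: retry */
        if (errno == EAGAIN) {
            /* Socket buffer full: block until the socket is writable. */
            struct pollfd pfd = { .fd = out_fd, .events = POLLOUT };
            if (poll(&pfd, 1, -1) == -1 && errno != EINTR)
                return -1;
            continue;
        }
        return -1;              /* any other errno is treated as fatal */
    }
    return 0;
}

/* Test stub, not a real sendfile: fails once with EINTR (leaving *off
 * untouched), then "sends" at most 4096 bytes per call. */
ssize_t stub_sendfile(int out_fd, int in_fd, off_t *off, size_t len)
{
    static int calls;
    size_t n = len < 4096 ? len : 4096;

    (void)out_fd;
    (void)in_fd;
    if (calls++ == 0) {
        errno = EINTR;
        return -1;
    }
    *off += (off_t)n;
    return (ssize_t)n;
}
```

With a well-behaved sendfile the EINTR branch simply retries from the unchanged offset. If *off came back as 0, the same loop would start resending from the beginning of the file.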
Updated by Daniil Lunev about 11 years ago
libsendfile calls SYS_sendfile. This syscall checks for pending signals in three places. When it finds that there is a signal for the current thread, it stops what it is doing and returns. So it is very likely that all signals reset the offset, but I want to be sure. Can you try something like this:
- while true; do kill -SIGHUP PID; sleep 0.1; done
or - while true; do kill -SIGPOLL PID; sleep 0.1; done
Updated by Frank Lahm about 11 years ago
Daniil Lunev wrote:
libsendfile calls SYS_sendfile. This syscall checks for pending signals in three places. When it finds that there is a signal for the current thread, it stops what it is doing and returns. So it is very likely that all signals reset the offset, but I want to be sure. Can you try something like this:
- while true; do kill -SIGHUP PID; sleep 0.1; done
yes, happens with SIGHUP too. truss:
189.3762 0.0092 sendfilev64(1, 10, 0x08047960, 1, 0x08047954) Err#4 EINTR
sfv_fd=13 sfv_flag=0x0 sfv_off=26992640 sfv_len=8192
189.3766 0.0004 Received signal #1, SIGHUP [caught]
siginfo: SIGHUP pid=13347 uid=0
Application log:
Aug 08 16:49:33.643558 afpd[27189] {dsi_stream.c:357} (E:Default): sendfile interrupted: errno: 4, offset: 26992640, pos: 0
The code basically looks like this:
while (something to send) {
    ret = sendfile(non-blocking-fd, ...)
    ...error checking...
    /*
     * Workaround a bizarre bug in Openindiana where pos is set to 0
     * after sendfile() returned -1 with EINTR after receiving a SIGALRM
     */
    if (errno == EINTR)
        LOG(log_error, logtype_default, "sendfile interrupted: errno: %d, offset: %zd, pos: %zd", errno, offset, pos);
    if (pos == 0) {
        pos = offset;
        continue;
    }
}
Full code is here:
https://github.com/franklahm/Netatalk/blob/solaris-sendfile-eintr-fix/libatalk/dsi/dsi_stream.c#L338
Note that sys_sendfile() just calls through to sendfile() on Solaris/OI.
Updated by Daniil Lunev about 11 years ago
I can't reproduce this bug on my system. It works fine. I call sendfilev64, then send SIGALRM, and sfv_off isn't changed.
Netatalk changes the SIGALRM signal handler. Could you describe what the new handler does?
Updated by Daniil Lunev about 11 years ago
And sendfile64, even after receiving SIGALRM, returns the right value: off + len.
Updated by Frank Lahm about 11 years ago
Daniil Lunev wrote:
I can't reproduce this bug on my system. It works fine. I call sendfilev64, then send SIGALRM, and sfv_off isn't changed.
I suggest calling sendfile(), not sendfilev()! I suspect that the sendfilev() wrapper code of sendfile() may be causing this. My suspicion comes from the fact that the code in question that uses sendfile() is from Netatalk 2.x.
Netatalk 3 has been rewritten to call sendfilev(), and afaict the problem doesn't seem to occur there.
Netatalk changes the SIGALRM signal handler. Could you describe what the new handler does?
Call setitimer(...) to ensure the timer is installed, then return.
<https://github.com/franklahm/Netatalk/blob/solaris-sendfile-eintr-fix/etc/afpd/afp_dsi.c#L293>
For the purpose of investigating this bug, I've added a return directly after the call to setitimer() in line 300.
I also added the very same code (setitimer(), then return) to the SIGHUP handler.
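For readers trying to reproduce this, a handler of the shape just described (re-arm the timer, then return) might look roughly like the sketch below. It is not the Netatalk handler: install_alarm_timer, alarm_handler and tick are hypothetical names, the use of ITIMER_REAL with a repeating interval is an assumption, and the period is a parameter (the thread mentions 30 seconds).

```c
#define _XOPEN_SOURCE 700   /* expose sigaction/setitimer under strict -std */
#include <assert.h>
#include <signal.h>
#include <string.h>
#include <sys/time.h>

/* Timer settings shared with the handler; filled in by the setup helper. */
static struct itimerval tick;

/* Handler in the shape described above: re-arm the timer, then return. */
static void alarm_handler(int sig)
{
    (void)sig;
    setitimer(ITIMER_REAL, &tick, NULL);
}

/* Hypothetical setup helper; returns 0 on success, -1 on error. */
int install_alarm_timer(long seconds)
{
    struct sigaction sa;

    assert(seconds > 0);
    tick.it_value.tv_sec = seconds;
    tick.it_interval.tv_sec = seconds;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = alarm_handler;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;    /* no SA_RESTART: interrupted calls can fail with EINTR */
    if (sigaction(SIGALRM, &sa, NULL) == -1)
        return -1;
    return setitimer(ITIMER_REAL, &tick, NULL);
}
```

Leaving SA_RESTART unset is what lets an in-flight call such as sendfile() return -1 with EINTR when the alarm fires, which is the situation under investigation.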
Updated by Daniil Lunev about 11 years ago
I call sendfile() with a non-blocking socket and it catches SIGALRM. The offset isn't changed. What version of OI do you use?
Updated by Frank Lahm about 11 years ago
Daniil Lunev wrote:
I call sendfile() with a non-blocking socket and it catches SIGALRM. The offset isn't changed. What version of OI do you use?
-bash-4.0$ pkg publisher
PUBLISHER TYPE STATUS URI
openindiana.org origin online http://pkg.openindiana.org/dev/
-bash-4.0$ uname -a
SunOS oi 5.11 oi_151a5 i86pc i386 i86pc
-bash-4.0$ cat /etc/release
OpenIndiana Development oi_151.1.5 X86 (powered by illumos)
Copyright 2011 Oracle and/or its affiliates. All rights reserved.
Use is subject to license terms.
Assembled 26 June 2012
-bash-4.0$
Updated by Daniil Lunev about 11 years ago
Maybe I am doing something wrong, but off doesn't become zero in the following code, even with this loop running:
while true; do kill -SIGALRM PID; sleep 0.1; done
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <signal.h>
#include <sys/sendfile.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <netinet/in.h>
#include <sys/fcntl.h>

void catch (int sig)
{
}

int main ()
{
    sigset (SIGALRM, catch);
    FILE * inf = fopen ("test_in", "r");
    off64_t off;
    int is;
    struct sockaddr_in addr;

    is = socket (AF_INET, SOCK_STREAM, 0);
    fcntl (is, F_SETFL, fcntl (is, F_GETFL, 0) | O_NONBLOCK);
    memset (&addr, 0, sizeof (struct sockaddr_in));
    addr.sin_family = AF_INET;
    addr.sin_port = htons (27001);
    addr.sin_addr.s_addr = htonl (INADDR_LOOPBACK);
    connect (is, (struct sockaddr *)&addr, sizeof (addr));
    for (;;) {
        off = 512;
        sendfile64 (is, fileno (inf), &off, 35 * 1024 * 1024);
        if ((unsigned int) off == 0)
            break;
    }
    printf ("Break!\n");
    fclose (inf);
    return 0;
}
Updated by Daniil Lunev about 11 years ago
Here is the server for it. Can you test this server and client? Do they reproduce the bug?
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <signal.h>
#include <sys/sendfile.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <netinet/in.h>
#include <sys/fcntl.h>

int main ()
{
    int os, tmp;
    char buf [1024];

    os = socket (AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr;
    memset (&addr, 0, sizeof (struct sockaddr_in));
    addr.sin_family = AF_INET;
    addr.sin_port = htons (27001);
    addr.sin_addr.s_addr = htonl (INADDR_LOOPBACK);
    bind (os, (struct sockaddr *)&addr, sizeof (addr));
    listen (os, 1);
    tmp = accept (os, NULL, NULL);
    fcntl (tmp, F_SETFL, fcntl (tmp, F_GETFL, 0) | O_NONBLOCK);
    while (1) {
        recv (tmp, buf, 1024, 0);
    }
    return 0;
}
Updated by Frank Lahm about 11 years ago
Daniil Lunev wrote:
Here is the server for it. Can you test this server and client? Do they reproduce the bug?
I've nearly finished my own minimal client/server app, which forks. I'll let you know whether I can reproduce the bug that way and whether I can post the code.
Updated by Frank Lahm almost 11 years ago
- Status changed from New to Feedback
I was never able to really reproduce this. I think we should close this issue.
Updated by Marcel Telka almost 4 years ago
- Status changed from Feedback to Closed