Bug #3055

sendfile sets *off to 0 when interrupted by SIGALRM

Added by Frank Lahm about 8 years ago. Updated 11 months ago.

Status: Closed
Priority: High
Assignee: -
Category: kernel
Start date: 2012-08-04
Due date:
% Done: 0%
Estimated time:
Difficulty: Hard
Tags: needs-triage
Gerrit CR:

Description

After spending several hours investigating a strange problem in Netatalk running on OpenIndiana, I finally tracked it down to sendfile occasionally returning -1 with an errno of EINTR and *off reset to 0.

truss shows:

8.6034  0.0092 sendfilev64(1, 9, 0x08047970, 1, 0x08047964)    Err#4 EINTR
sfv_fd=6 sfv_flag=0x0 sfv_off=644100096 sfv_len=253952
8.6037 0.0003 Received signal #14, SIGALRM [caught]
siginfo: SIGALRM pid=13721 uid=0

sfv_off is NOT the returned value. To get at that, I added some debugging code after my sendfile call, which showed that *off was 644100096 BEFORE sendfile() but 0 afterwards:

Apr 15 06:03:06.564213 afpd22475 {dsi_stream.c:354} (E:DSI): sendfile: EINTR - old offset: 644100096, new offset: 0
Apr 15 06:03:15.043380 afpd22475 {dsi_stream.c:354} (E:DSI): sendfile: EINTR - old offset: 821833728, new offset: 0
Apr 15 06:03:54.438215 afpd22475 {dsi_stream.c:354} (E:DSI): sendfile: EINTR - old offset: 278147072, new offset: 0
Apr 15 06:04:01.431494 afpd22475 {dsi_stream.c:354} (E:DSI): sendfile: EINTR - old offset: 501493760, new offset: 0

Clearly, if sendfile() returns with something like EAGAIN or EINTR, *off must still contain the current offset, possibly updated by the number of bytes sent.

Workaround:
If sendfile returns -1 with an errno of EINTR, check whether *off is 0 and reassign *off appropriately. I'm assuming 0 bytes have been sent in this case.


Files

truss.txt.zip (170 KB) Frank Lahm, 2012-08-04 01:25 PM

History

#1

Updated by Daniil Lunev about 8 years ago

And what about other signals? Do they reset the offset?

#2

Updated by Frank Lahm about 8 years ago

Daniil Lunev wrote:

And what about other signals? Do they reset the offset?

Don't know, SIGALRM is the only signal I tested.

In the production code SIGALRM is sent by a timer set to 30 seconds, so the problem happens only rarely, while a long-running file transfer is in progress (Netatalk sends files to the client in 256k chunks).

I managed to reproduce the issue quickly by sending the process a SIGALRM from a shell loop every 0.1 second:

# while true; do kill -SIGALRM PID; sleep 0.1; done

As can be seen in the truss output, sendfilev quite often returns -1 with errno EAGAIN. In that case the offset seems to be updated correctly. Obviously that's a different return path: it's caused not by a signal but by a full TCP buffer (or similar).

The TCP socket in question is in non-blocking I/O mode, of course.
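
To make the two failure paths concrete, here is a minimal sketch of how a sender on a non-blocking socket might tell them apart. The names send_action, classify_send_errno and wait_writable are hypothetical and do not come from Netatalk; the sketch only assumes the standard errno values and poll().

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>

/* What the caller should do after a send-type call returned -1. */
enum send_action { SEND_RETRY, SEND_WAIT, SEND_FAIL };

static enum send_action
classify_send_errno(int err)
{
    if (err == EINTR)
        return SEND_RETRY;      /* interrupted by a signal */
    if (err == EAGAIN || err == EWOULDBLOCK)
        return SEND_WAIT;       /* socket buffer full (or similar) */
    return SEND_FAIL;           /* anything else is a real error */
}

/*
 * Wait until poll() reports an event on fd or timeout_ms expires.
 * Returns > 0 if poll() reported an event (writable, or an error or
 * hangup condition), 0 on timeout, -1 on error. The wait is restarted
 * with the full timeout if a signal interrupts it.
 */
static int
wait_writable(int fd, int timeout_ms)
{
    struct pollfd pfd;
    int n;

    assert(timeout_ms >= -1);
    pfd.fd = fd;
    pfd.events = POLLOUT;
    pfd.revents = 0;
    do {
        n = poll(&pfd, 1, timeout_ms);
    } while (n == -1 && errno == EINTR);
    return n;
}
```

A send loop would call wait_writable() only in the SEND_WAIT case and retry immediately in the SEND_RETRY case; in both cases it relies on the offset being left intact, which is exactly what fails in the EINTR case reported here.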

#3

Updated by Daniil Lunev about 8 years ago

libsendfile calls SYS_sendfile. This syscall checks for pending signals in three places. When it finds a signal pending for the current thread, it stops what it is doing and returns. So it is very likely that all signals reset the offset, but I want to be sure. Can you try something like this:

  1. while true; do kill -SIGHUP PID; sleep 0.1; done
    or
  2. while true; do kill -SIGPOLL PID; sleep 0.1; done

#4

Updated by Frank Lahm about 8 years ago

Daniil Lunev wrote:

libsendfile calls SYS_sendfile. This syscall checks for pending signals in three places. When it finds a signal pending for the current thread, it stops what it is doing and returns. So it is very likely that all signals reset the offset, but I want to be sure. Can you try something like this:

  1. while true; do kill -SIGHUP PID; sleep 0.1; done

Yes, it happens with SIGHUP too. truss shows:

189.3762         0.0092 sendfilev64(1, 10, 0x08047960, 1, 0x08047954)   Err#4 EINTR
sfv_fd=13       sfv_flag=0x0    sfv_off=26992640        sfv_len=8192
189.3766         0.0004     Received signal #1, SIGHUP [caught]
      siginfo: SIGHUP pid=13347 uid=0

Application log:

Aug 08 16:49:33.643558 afpd[27189] {dsi_stream.c:357} (E:Default): sendfile interrupted: errno: 4, offset: 26992640, pos: 0

The code basically looks like this:

    while (something to send) {
        ret = sendfile(non-blocking-fd, ...)
        ...error checking...

        /*
         * Workaround a bizarre bug in Openindiana where pos is set to 0
         * after sendfile() returned -1 with EINTR after receiving a SIGALRM
         */
        if (errno == EINTR)
            LOG(log_error, logtype_default, "sendfile interrupted: errno: %d, offset: %zd, pos: %zd", errno, offset, pos);

        if (pos == 0) {
            pos = offset;
            continue;
        }
    }

Full code is here:
https://github.com/franklahm/Netatalk/blob/solaris-sendfile-eintr-fix/libatalk/dsi/dsi_stream.c#L338
Note that sys_sendfile() just calls on to sendfile on Solaris/OI.

#5

Updated by Daniil Lunev about 8 years ago

I can't reproduce this bug on my system. It works fine: I call sendfilev64, then send SIGALRM, and sfv_off isn't changed.
Netatalk changes the SIGALRM signal handler. Could you describe what the new handler does?

#6

Updated by Daniil Lunev about 8 years ago

And sendfile64, even after receiving SIGALRM, returns the right value: off + len.

#7

Updated by Frank Lahm about 8 years ago

Daniil Lunev wrote:

I can't reproduce this bug on my system. It works fine: I call sendfilev64, then send SIGALRM, and sfv_off isn't changed.

I suggest calling sendfile(), not sendfilev()! I suspect the sendfilev() wrapper code of sendfile() may be causing this. My suspicion comes from the fact that the code in question, which uses sendfile(), is from Netatalk 2.x.
Netatalk 3 has been rewritten to call sendfilev(), and afaict the problem doesn't occur there.

Netatalk changes the SIGALRM signal handler. Could you describe what the new handler does?

Call setitimer(...) to ensure the timer is installed, then return.

<https://github.com/franklahm/Netatalk/blob/solaris-sendfile-eintr-fix/etc/afpd/afp_dsi.c#L293>

I've added a return directly after the call to setitimer in line 300 for the purpose of investigating this bug.
Also, I had added the very same code (setitimer() then return) to the SIGHUP handler.

#8

Updated by Daniil Lunev about 8 years ago

I call sendfile with a non-blocking socket, and it catches SIGALRM. The offset isn't changed. What version of OI do you use?

#9

Updated by Frank Lahm about 8 years ago

Daniil Lunev wrote:

I call sendfile with a non-blocking socket, and it catches SIGALRM. The offset isn't changed. What version of OI do you use?

-bash-4.0$ pkg publisher
PUBLISHER                             TYPE     STATUS   URI
openindiana.org                       origin   online   http://pkg.openindiana.org/dev/
-bash-4.0$ uname -a
SunOS oi 5.11 oi_151a5 i86pc i386 i86pc
-bash-4.0$ cat /etc/release 
             OpenIndiana Development oi_151.1.5 X86 (powered by illumos)
        Copyright 2011 Oracle and/or its affiliates. All rights reserved.
                        Use is subject to license terms.
                           Assembled 26 June 2012
-bash-4.0$

#10

Updated by Daniil Lunev about 8 years ago

Maybe I am doing something wrong, but off doesn't become zero in the following code.
while true; do kill -SIGALRM PID; sleep 0.1; done

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <signal.h>
#include <fcntl.h>
#include <sys/sendfile.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <netinet/in.h>

/* Empty handler: SIGALRM only has to interrupt sendfile64(). */
void
catch (int sig)
{
    (void) sig;
}

int
main (void)
{
    sigset (SIGALRM, catch);
    FILE * inf = fopen ("test_in", "r");
    off64_t off;
    int is;
    struct sockaddr_in addr;

    if (inf == NULL) {
        perror ("fopen test_in");
        return 1;
    }

    /* Non-blocking TCP connection to the test server on localhost:27001. */
    is = socket (AF_INET, SOCK_STREAM, 0);
    fcntl (is, F_SETFL, fcntl (is, F_GETFL, 0) | O_NONBLOCK);
    memset (&addr, 0, sizeof (struct sockaddr_in));
    addr.sin_family = AF_INET;
    addr.sin_port = htons (27001);
    addr.sin_addr.s_addr = htonl (INADDR_LOOPBACK);
    connect (is, (struct sockaddr *)&addr, sizeof (addr));

    /* Keep sending from offset 512 until a call leaves off at 0. */
    for (;;) {
        off = 512;
        sendfile64 (is, fileno (inf), &off, 35 * 1024 * 1024);
        if (off == 0)
            break;
    }
    printf ("Break!\n");
    fclose (inf);
    return 0;
}

#11

Updated by Daniil Lunev about 8 years ago

Here is the server for it. Can you test this server and client? Do they reproduce the bug?

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <signal.h>
#include <fcntl.h>
#include <sys/sendfile.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <netinet/in.h>

int
main (void) {
    int os, tmp;
    char buf [1024];

    /* Listen on localhost:27001 and accept a single client. */
    os = socket (AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr;
    memset (&addr, 0, sizeof (struct sockaddr_in));
    addr.sin_family = AF_INET;
    addr.sin_port = htons (27001);
    addr.sin_addr.s_addr = htonl (INADDR_LOOPBACK);
    bind (os, (struct sockaddr *)&addr, sizeof (addr));
    listen (os, 1);
    tmp = accept (os, NULL, NULL);
    fcntl (tmp, F_SETFL, fcntl (tmp, F_GETFL, 0) | O_NONBLOCK);

    /* Non-blocking recv in a busy loop: read and discard whatever arrives. */
    while (1) {
        recv (tmp, buf, 1024, 0);
    }
    return 0;
}

#12

Updated by Frank Lahm about 8 years ago

Daniil Lunev wrote:

Here is the server for it. Can you test this server and client? Do they reproduce the bug?

I've nearly finished my own minimal client/server app, which forks. I'll let you know whether I can reproduce the bug that way and whether I can post the code.

#13

Updated by Frank Lahm almost 8 years ago

  • Status changed from New to Feedback

I was never able to reliably reproduce this. I think we should close this issue.

#14

Updated by Marcel Telka 11 months ago

  • Status changed from Feedback to Closed
