SO_REUSEADDR woes

Lazyweb,

  1. When I kill my WebSocket server, I have to wait 2+ minutes to restart it because the kernel (2.6.32, CentOS 6.2) says my port is still in use. I've tried trapping the signal and calling disconnect on each client connection, to no avail. Also no help: setsockopt($S, SOL_SOCKET, SO_REUSEADDR, 1).

    And this is a hassle because:

  2. Every few days this server code just goes catatonic. The process is alive but no longer accepting connections. I've added log messages to every callback, and there's no obvious triggering event that causes it to break; it just seems to happen... every now and then. At that point there's nothing to do but to kill and restart it, which sucks as per above. Attaching a gdb to it didn't tell me anything. How do you debug something like this?


Update: I think I've solved problem #1, thanks to the suggestions below about how I was using ReuseAddr. Problem #2 persists. And also I am now on kernel 3.10.0 CentOS 7.5, which has not changed the problem #2 behavior.
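
For anyone landing here with the same restart problem, here is a minimal sketch of the ordering that matters: set SO_REUSEADDR on the listening socket before bind(), not after. It is written in Python rather than the Perl the server actually uses, and make_listener and restart_demo are hypothetical names made up for the illustration. The demo closes the server side of a connection first, which should leave that end in TIME_WAIT, and then rebinds the same port the way a restarted server would.

```python
import socket


def make_listener(port: int, host: str = "127.0.0.1") -> socket.socket:
    """Create a listening TCP socket with SO_REUSEADDR set before bind()."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        # The option has to be set before bind(); setting it afterwards
        # does not help the bind that has already happened (or failed).
        s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        s.bind((host, port))
        s.listen(16)
    except OSError:
        s.close()
        raise
    return s


def restart_demo() -> int:
    """Simulate a kill-and-restart on one port; returns the port used."""
    first = make_listener(0)          # port 0: let the kernel pick a free port
    port = first.getsockname()[1]

    client = socket.create_connection(("127.0.0.1", port))
    conn, _addr = first.accept()
    conn.close()                      # server closes first: its end should go to TIME_WAIT
    client.close()
    first.close()                     # the "kill"

    second = make_listener(port)      # the "restart": bind the same port again
    second.close()
    return port
```

Calling restart_demo() should return without raising "Address already in use". Dropping the setsockopt() line is an easy way to see the difference on a Linux box, though the exact behavior without it varies by platform.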


Previously.


19 Responses:

  1. CJ says:

    Try SO_REUSEPORT?

  2. Christian Vogel says:

    I think you have to setsockopt() before the bind() for it to work properly, the code in backstage/src/websock/dna-websock.pl suggests you are using the other order. As the bind probably takes place within IO::Socket::INET->new, you probably have to patch it in there.


    socket(AF_INET, SOCK_STREAM|SOCK_CLOEXEC, IPPROTO_IP) = 3
    setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
    bind(3, {sa_family=AF_INET, sin_port=htons(8000), sin_addr=inet_addr("0.0.0.0")}, 16) = 0

    • Philip Guenther says:

      The ReuseAddr=>1 argument to IO::Socket::INET->new() makes it call setsockopt(SO_REUSEADDR,1) on the new socket before calling bind or listen. Even the earliest version of Net::WebSocket::Server in metacpan.org passes that to IO::Socket::INET->new().

      I guess the question to jwz is which socket is unable to bind?

  3. foo says:

    By any chance, is the server port in the ephemeral range? I was recently told that the rules applied to ephemeral ports (including reuse) were quite confusing.

    • Philip Guenther says:

      Nope, whoever told you that is misunderstanding how the 2min timeout only applies to an end which sent a FIN before seeing a FIN from the other side. (Normally that's just one end, whichever closes first, but on simultaneous close both ends will go into TIME_WAIT.)

      Ephemeral ports are (normally) used by clients and, for many but not all protocols, clients close first, leading to more TIME_WAIT states on the "ephemeral port" end, but that's not a rule and there's nothing in the protocol specs that changes the TCP state machine depending on the port numbers involved.

  4. Line Noise says:

    You can tune the kernel to clear the socket quicker (10 seconds in this case):

    echo 10 > /proc/sys/net/ipv4/tcp_tw_recycle

    And you can tune it to reuse sockets by default:

    echo 1 > /proc/sys/net/ipv4/tcp_tw_reuse

    There are warnings and caveats about doing this but I've successfully used these settings on busy database servers that were running out of sockets because of too many lingering closed sockets in TIME_WAIT mode.

  5. Perry says:

    The first step is to use netstat to examine what state the kernel thinks the socket is in immediately after the server dies. That won't necessarily make it obvious how to fix it, but it will at least give you a hint of what's going on in the TCP state machine.
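
If netstat is not handy, roughly the same information can be read from /proc/net/tcp on Linux. The following is a small sketch, not a replacement for netstat: states_for_port is a hypothetical helper that takes the text of that file and counts the socket states seen on one local port. It only covers IPv4 (IPv6 sockets are listed in /proc/net/tcp6).

```python
from collections import Counter

# State codes as they appear in the "st" column of /proc/net/tcp.
TCP_STATES = {
    "01": "ESTABLISHED", "02": "SYN_SENT", "03": "SYN_RECV",
    "04": "FIN_WAIT1", "05": "FIN_WAIT2", "06": "TIME_WAIT",
    "07": "CLOSE", "08": "CLOSE_WAIT", "09": "LAST_ACK",
    "0A": "LISTEN", "0B": "CLOSING",
}


def states_for_port(proc_net_tcp: str, port: int) -> Counter:
    """Count TCP states for sockets whose local port equals `port`."""
    counts = Counter()
    for line in proc_net_tcp.splitlines()[1:]:      # first line is the header
        fields = line.split()
        if len(fields) < 4:
            continue
        # local_address looks like "0100007F:1F40" (hex address : hex port)
        local_port = int(fields[1].rsplit(":", 1)[1], 16)
        if local_port == port:
            state = fields[3].upper()
            counts[TCP_STATES.get(state, state)] += 1
    return counts
```

Right after the server dies, something like states_for_port(open("/proc/net/tcp").read(), 8000) would show whether the port is held by leftover TIME_WAIT entries, by a socket still in LISTEN (meaning some process never actually exited), or by something else. Port 8000 here is just the port from the strace excerpt above.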

    • Nick Lamb says:

      "The process is alive but no longer accepting connections"

      ... also suggests value in diagnosing what this process is doing. It's presumably either entirely asleep (blocked in the kernel) or slowly spinning (polling at, say, a one-second timeout in a loop); otherwise Jamie would say it's chewing CPU.

      As well as using netstat (or I'd suggest lsof -p $pid to get information on what, including sockets, is actually open in that process), it makes sense to take a look at the system calls. strace -p $pid can show you what system calls a process is making, which is often revealing. If it shows just one line and stalls, try making a connection to see whether that causes further system calls.

      For example, you may find the process consumes new connections (they get allocated an fd) but somehow never asks any further questions about that fd, so it stalls out. Or you might find the process is stuck on a lock that isn't socket related at all. In Linux the symptom for the latter will usually be a futex() system call, which is the contended slow path for the futex synchronisation primitive.

      • jwz says:

        Well, I got an strace of the ~8 hours preceding it going catatonic. The last thing that happens is a connection is initiated from a Comcast IP address, and then it does this:

        20:31:54 read(4, "...", 11) = 11
        20:31:54 read(4, "...", 534) = 534
        20:31:54 write(4, "...", 137) = 137
        20:31:54 read(4, 0x30bc473, 5) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)

        And that's the last syscall logged until my cron job sends it SIGTERM for being unresponsive.

        I assume that 2 reads and 1 write means that it's not even out of SSL setup yet. So my first thought was, maybe this is some attack script mangling SSL in some way that is breaking things. Except, that IP address has, in the recent past, belonged to an employee's phone. So I don't think it's an attack, I think it's just a web page waking up.

        That's the only occurrence of ERESTARTSYS in the ~8 hour log.

        So, I don't really know what to make of that.

        • jwz says:

          Yeah, happened again, same place.

          I wonder if there's some exception-handling I could put in place on the Perl end to trap this ERESTARTSYS lossage?
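
One reading of that trace, offered as a guess rather than a diagnosis: ERESTARTSYS is what strace prints when a blocked system call gets interrupted by a signal, so it is probably the SIGTERM from the cron job landing on a read() that had been sitting there the whole time. If so, there is no error to trap; the process was simply parked in a blocking read on a peer that went quiet mid-handshake. The usual defense is a deadline on the read. Below is a sketch of that idea in Python, not the Perl in question; recv_with_deadline is a hypothetical helper, not part of any WebSocket library.

```python
import socket
from typing import Optional


def recv_with_deadline(sock: socket.socket, nbytes: int,
                       seconds: float) -> Optional[bytes]:
    """Read up to nbytes, but give up after `seconds` instead of blocking forever.

    Returns the bytes read, or None if the peer sent nothing in time.
    """
    sock.settimeout(seconds)          # applies to subsequent blocking calls on sock
    try:
        return sock.recv(nbytes)
    except socket.timeout:            # alias of TimeoutError on Python 3.10+
        return None
```

A single-threaded server could call this during connection setup and drop any client for which it returns None, so that one silent peer cannot stall everyone else. Whether the same effect is easy to get inside the Perl WebSocket and SSL modules is a separate question this sketch does not answer.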

  6. R p herrold says:

    CentOS 6.2 kernel code dates from the fall of 2011. I would be tempted to bounce to a current C6 series kernel, as the networking stack has gotten cleaner over time.

    Setting unusual parameters, such as fast connection closing, early and before setting up a given connection seems sensible as well.

    Ps something missing as to auth tokens for the Twitter based auth ...

  7. Joss says:

    I'm posting here because comments in the XScreenSaver 5.4.0 topic seem to have been closed already. Is anyone else getting crashes in macOS Mojave 10.14.1 after editing screensaver preferences? (This happens to me in all of the XScreenSavers.) The two standalone apps seem to work fine, more or less at least. ;) Any help is much appreciated, but I fear it may be a bug or system (preference pane) incompatibility.

  8. Kaleberg says:

    Is that timeout still around? I'm getting flashbacks to an SGI UNIX in the early 1990s.
