SO_REUSEADDR woes

Lazyweb,

  1. When I kill my WebSocket server, I have to wait 2+ minutes to restart it because the kernel (2.6.32, CentOS 6.2) says my port is still in use. I've tried trapping the signal and calling disconnect on each client connection, to no avail. Also no help: setsockopt($S, SOL_SOCKET, SO_REUSEADDR, 1).

    And this is a hassle because:

  2. Every few days this server code just goes catatonic. The process is alive but no longer accepting connections. I've added log messages to every callback, and there's no obvious triggering event that causes it to break; it just seems to happen... every now and then. At that point there's nothing to do but kill and restart it, which sucks, as per above. Attaching gdb to it didn't tell me anything. How do you debug something like this?

Previously.


17 Responses:

  1. CJ says:

    Try SO_REUSEPORT?

  2. Christian Vogel says:

    I think you have to call setsockopt() before the bind() for it to work; the code in backstage/src/websock/dna-websock.pl suggests you are using the other order. As the bind probably takes place within IO::Socket::INET->new, you would have to patch it in there. Here's the order you want, as it appears under strace:


    socket(AF_INET, SOCK_STREAM|SOCK_CLOEXEC, IPPROTO_IP) = 3
    setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
    bind(3, {sa_family=AF_INET, sin_port=htons(8000), sin_addr=inet_addr("0.0.0.0")}, 16) = 0
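
    A minimal sketch of that order in plain Perl (the port is the one from the strace; this is illustrative, not jwz's actual code):

    use Socket;

    # Create the socket, set SO_REUSEADDR, *then* bind: the option
    # only affects the address check that bind() itself performs.
    socket(my $sock, AF_INET, SOCK_STREAM, getprotobyname("tcp"))
        or die "socket: $!";
    setsockopt($sock, SOL_SOCKET, SO_REUSEADDR, 1)
        or die "setsockopt: $!";
    bind($sock, sockaddr_in(8000, INADDR_ANY))
        or die "bind: $!";
    listen($sock, SOMAXCONN)
        or die "listen: $!";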

    • Philip Guenther says:

      The ReuseAddr => 1 argument to IO::Socket::INET->new() makes it call setsockopt(SO_REUSEADDR, 1) on the new socket before calling bind or listen. Even the earliest version of Net::WebSocket::Server on metacpan.org passes that to IO::Socket::INET->new().

      I guess the question for jwz is: which socket is unable to bind?
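
      (For reference, a minimal listener built that way; the port number is assumed:)

      use IO::Socket::INET;

      # ReuseAddr => 1 becomes setsockopt(SO_REUSEADDR, 1),
      # applied before the internal bind()/listen().
      my $server = IO::Socket::INET->new(
          LocalPort => 8000,
          Listen    => 5,
          ReuseAddr => 1,
      ) or die "listen: $!";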

  3. foo says:

    By any chance, is the server port in the ephemeral range? I was recently told that the rules that apply to ephemeral ports (including reuse) are quite confusing.

    • Philip Guenther says:

      Nope; whoever told you that misunderstands it. The 2-minute timeout only applies to an end that sent a FIN before seeing a FIN from the other side. (Normally that's just one end, whichever closes first, but on a simultaneous close both ends go into TIME_WAIT.)

      Ephemeral ports are (normally) used by clients, and for many but not all protocols the client closes first, leading to more TIME_WAIT states on the "ephemeral port" end. But that's not a rule, and there's nothing in the protocol specs that changes the TCP state machine depending on the port numbers involved.

  4. Line Noise says:

    You can tune the kernel to recycle TIME_WAIT sockets more aggressively (note that tcp_tw_recycle is a boolean, not a timeout in seconds; the TIME_WAIT interval itself is a compile-time constant in Linux):

    echo 1 > /proc/sys/net/ipv4/tcp_tw_recycle

    And you can tune it to reuse sockets by default:

    echo 1 > /proc/sys/net/ipv4/tcp_tw_reuse

    There are warnings and caveats about doing this (tcp_tw_recycle in particular is known to break clients behind NAT), but I've successfully used these settings on busy database servers that were running out of sockets because of too many lingering closed connections in the TIME_WAIT state.

  5. Perry says:

    The first step is to use netstat to examine what state the kernel thinks the socket is in immediately after the server dies. That won't necessarily make it obvious how to fix it, but it will at least give you a hint of what's going on in the TCP state machine.
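
    For example (with 8000 standing in for the real port):

    netstat -tan | grep 8000

    A pile of entries in TIME_WAIT is the 2-minute timer described above. If the listening socket itself still shows up in LISTEN after the kill, something (e.g. a forked child that inherited the fd) is still holding it open, and SO_REUSEADDR won't get you past an active listener.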

    • Nick Lamb says:

      "The process is alive but no longer accepting connections"

      ... which also suggests value in diagnosing what this process is doing: it's presumably either entirely asleep (blocked in the kernel) or slowly spinning (polling on, say, a one-second timeout in a loop), or else Jamie would have said it's chewing CPU.

      As well as using netstat (or, I'd suggest, lsof -p $pid to see what's actually open in that process, sockets included), it makes sense to look at the system calls. strace -p $pid shows what system calls a process is making, which is often revealing. If it prints just one line and stalls, try making a connection to see whether that causes further system calls:

      E.g. you may find the process consumes new connections (they get allocated an fd) but never asks any further questions about that fd, so it stalls out; or you might find the process is stuck on a lock that's not socket-related per se. On Linux the symptom of the latter will usually be a futex() system call, which is the contended slow path of the futex synchronisation primitive.
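
      Concretely, something like this (pid and port are placeholders):

      lsof -nP -p $pid       # what fds, including sockets, the process holds
      strace -f -p $pid      # attach and watch system calls
      nc localhost 8000      # from another shell: poke the listener

      If the server is healthy you should see accept() (or accept4()) show up in the strace output when nc connects; if nothing at all happens, the process is never making it back to its event loop.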

  6. R p herrold says:

    The CentOS 6.2 kernel code dates from the fall of 2011. I would be tempted to bounce to a current C6-series kernel, as the networking stack has gotten cleaner over time.

    Setting unusual parameters, such as fast connection closing, early and before setting up a given connection seems sensible as well.

    P.S. Something seems to be missing as to auth tokens for the Twitter-based auth ...

  7. Joss says:

    I'm posting here because comments on the XScreenSaver 5.40 topic seem to have been closed already. Is anyone else getting crashes on macOS Mojave 10.14.1 after editing screen-saver preferences? (This happens to me with all of the XScreenSaver savers.) The two standalone apps seem to work fine, more or less at least. ;) Any help is much appreciated, but I fear it may be a bug or a system (preference pane) incompatibility.

  8. Kaleberg says:

    Is that timeout still around? I'm getting flashbacks to an SGI UNIX in the early 1990s.
