stupid ssh.

Dear Lazyweb,

I suspect the answer to this is also "Apple horked it in a recent security update", but I still desire to know how to fix it. Lately, when I'm doing rsync+ssh backups of various machines, ssh craps out partway through. Ssh is running on MacOS 10.4 PPC (OpenSSH 4.7p1) with the latest updates, and is aimed at various Linuxen that haven't been upgraded since, say, 2005 (OpenSSH 4.3). It dies like so:

    rsync [...]
    building file list ... done
    ...dozens of files get transferred successfully...
    Disconnecting: Bad packet length 787964.

It's dying after transferring a bunch of data, not during connection setup. Both sides are most assuredly speaking SSH2.

Googling this error message only results in very old threads where people say, "Oh, that's because you're using SSH2! You should use SSH1 instead." This answer is clearly bullshit. What's the real fix?

Current Music: Rocket -- Funtime ♬

28 Responses:

  1. mattbot says:

    Check out the MacPorts (aka DarwinPorts) version of ssh. I think it's still newer than the version that Apple ships and has fewer compatibility problems with Linux.

  2. Apple's modified the stock rsync to transfer all their weird filesystem goo (resource forks, xattrs, whatever the hell they have this year). Dunno if that causes compatibility problems.

  3. stuartl says:

    We've seen the openssh keepalives barf other non-openssh clients. Could be related...?

    • _maybe_ -- I don't have an OS X box here to test -- it can be worked around using some "sysctl -w net.inet.tcp.sendspace=X", "sysctl -w net.inet.tcp.recvspace=X", and "sysctl -w kern.ipc.maxsockbuf=Y" magic (with appropriate values of X and Y).

      • rapier1 says:

        Looking at the bug, that could be the problem, but it would depend on the size of the window being advertised when the connection is instantiated. The best move is really just to upgrade the local client to 5.0.

      • mattbot says:

        Also along these lines, you may want to add in /etc/sysctl.conf:

        net.inet.tcp.delayed_ack=0

        During large file transfers between Mac and Linux boxen the Macs don't handle queued up acks very well and tend to drop the connection. I've mostly seen this with NFS but if the problem is indeed on the tcp layer and not the app layer this could help.

        The magic formula I use for kern.ipc.maxsockbuf is:

        • mattbot says:

          (Grr, f*ing Firefox...)

          is...:

          bandwidth in bits/1sec * 1byte/8bits * average latency in msec * 1sec/1000msec = magic number in bytes.
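The formula above is essentially a bandwidth-delay product, and it can be sketched as a few lines of Python. This is only an illustration: the function name and the 100 Mbit/s, 40 ms link are made up for the example, and since the latency is given in milliseconds it has to be divided by 1000 for the result to come out in bytes.

```python
def socket_buffer_bytes(bandwidth_bits_per_sec: int, latency_ms: int) -> int:
    """Bandwidth-delay product in bytes (hypothetical helper, not from the thread).

    bits/sec * (1 byte / 8 bits) * msec * (1 sec / 1000 msec) = bytes
    Integer arithmetic is used so the result is exact for whole-number inputs.
    """
    return bandwidth_bits_per_sec * latency_ms // (8 * 1000)


# Hypothetical link: 100 Mbit/s with 40 ms average latency.
print(socket_buffer_bytes(100_000_000, 40))
```

The result is a candidate value for kern.ipc.maxsockbuf, not a tested recommendation for any particular machine.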

  4. discogravy says:

    drop a sniffer (wireshark et al) to see if 787964 is the largest packet you're sending.

    my suggested fix: recompile ssh yourself. or wait for an apple bugfix.

    • rapier1 says:

      That won't work. The packet in question is an SSH packet - which is not analogous to a TCP packet. The SSH packet is strictly an application layer construct.

      • discogravy says:

        it still goes over TCP/IP, non? you should be able to see the header and frames -- content encrypted or not -- and see if it's an oversize packet compared to other packets during that stream. If I'm missing something that makes packets coming from an SSH session not use TCP frames, or encrypts the packet itself, please specify; it was my understanding that only the payload was encrypted. Even if the whole packet is encrypted, AFAIK an SSH binary packet shouldn't be more than 35k bytes, so a packet that's 787964 bytes would kinda stick out, even if binary and opaque.

        • rapier1 says:

          You can still get all of the bytes and everything with a sniffer but the SSH packet is encapsulated in the TCP packet as a full encrypted payload. You have to decrypt the incoming data to find out what the expected SSH packet length is. This comment is from the packet.c code in packet_read_poll2


          /*
           * check if input size is less than the cipher block size,
           * decrypt first block and extract length of incoming packet
           */

          Just to be clear - the TCP packet size really has nothing to do with the SSH packet size. The SSH packet will often be broken up over many TCP packets. SSH packet boundaries don't necessarily correspond to TCP packet boundaries either.
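To make the point about boundaries concrete, here is a minimal sketch of length-prefixed framing over a byte stream. It is a deliberately simplified stand-in for SSH: there is no encryption, padding, or MAC, and the names `frame` and `PacketReader` are invented for the example. What it shows is that the receiver reassembles packets from whatever chunks arrive, so chunk (TCP segment) sizes say nothing about packet sizes.

```python
import struct
from typing import List


def frame(body: bytes) -> bytes:
    """Prefix a body with its length as a 4-byte big-endian integer."""
    return struct.pack(">I", len(body)) + body


class PacketReader:
    """Reassembles length-prefixed packets from arbitrarily sized chunks."""

    def __init__(self) -> None:
        self._buf = bytearray()

    def feed(self, chunk: bytes) -> List[bytes]:
        """Add a chunk and return every packet that is now complete."""
        self._buf += chunk
        packets = []
        while len(self._buf) >= 4:
            (length,) = struct.unpack(">I", self._buf[:4])
            if len(self._buf) < 4 + length:
                break  # the rest of this packet has not arrived yet
            packets.append(bytes(self._buf[4:4 + length]))
            del self._buf[:4 + length]
        return packets


# Two packets sent back to back, delivered in 3-byte chunks.
stream = frame(b"hello") + frame(b"world!!")
reader = PacketReader()
received = []
for i in range(0, len(stream), 3):
    received.extend(reader.feed(stream[i:i + 3]))
print(received)
```

In real SSH the length field is inside the first encrypted cipher block, which is why a sniffer that only sees ciphertext cannot read it.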

  5. ultranurd says:

    I used to get this error when tunneling a Microsoft RDC connection into work (the SSH portal was some kind of Linux box), when RDC would decide to do a major screen refresh (like a lot of scrolling in a browser window, say). It went away when I upgraded to Leopard.

    I saw suggestions to try different ciphers, but that didn't seem to help.

    • ultranurd says:

      On closer inspection, it doesn't make sense that switching to Leopard would have fixed it, since that's also using 4.7p1... that said, I haven't seen the error since then.

  6. rapier1 says:

    A couple things can cause this. Usually this is caused by having unexpected data injected into the incoming (local) data stream. Generally, this is a 4 byte value though so the value you are seeing is undersized for that. Is the value always around that or do you sometimes have larger 4 byte values?
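As a rough illustration of the "4 byte value" point: the receiver reads four bytes as a big-endian unsigned integer and rejects it if it falls outside a sane range. The sketch below assumes limits of 5 bytes and 256 KiB, which is my recollection of the check in OpenSSH's packet.c and should be treated as an assumption; `check_packet_length` is a made-up name. The value from the error message, 787964, is 0x000C05FC, while four bytes of stray ASCII text decode to a far larger number.

```python
import struct

# Assumed limits, roughly matching the sanity check in OpenSSH's packet.c.
MIN_PACKET = 1 + 4
MAX_PACKET = 256 * 1024


def check_packet_length(first_four: bytes) -> int:
    """Decode a 4-byte big-endian length and reject implausible values."""
    (length,) = struct.unpack(">I", first_four)
    if length < MIN_PACKET or length > MAX_PACKET:
        raise ValueError(f"Bad packet length {length}.")
    return length


# The length from the error message, then four bytes of injected ASCII text.
for raw in (bytes.fromhex("000c05fc"), b"SSH-"):
    try:
        check_packet_length(raw)
    except ValueError as err:
        print(err)
```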

    Sometimes you can also see this when there is a protocol mismatch. The easiest way to figure that out is to force everything to use SSH2 with the -2 or -oProtocol=2 switch. I have no idea how you pass ssh options when using rsync, so you may have to define that in a local ssh_config. It's an easy thing to try though.
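For what it's worth, rsync does accept ssh options through its -e (--rsh) option, which names the remote shell command to run. The sketch below only builds and prints such a command line; it does not run it. The paths and host are hypothetical, and -2 is the protocol switch from OpenSSH releases of this era, so it may not be accepted by newer versions.

```python
import shlex


def rsync_over_ssh2(source: str, dest: str) -> list:
    """Build an rsync argv that runs ssh with protocol 2 forced (hypothetical helper)."""
    # "-e" tells rsync which remote shell command to use, options included.
    return ["rsync", "-a", "-e", "ssh -2", source, dest]


# Hypothetical source and destination, for illustration only.
cmd = rsync_over_ssh2("/Users/me/", "backup@host.example:/backups/me/")
print(shlex.join(cmd))
```

Setting "Protocol 2" in a local ssh_config, as suggested above, gets the same effect without touching the rsync command.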

  7. n0man says:

    In my case, I think it was actually the router that was the problem, but one way or another, that fixed it for me.

    • jwz says:

      Good idea. That seems to be working so far.

    • rapier1 says:

      Something to keep in mind is that the SSH packet length is encrypted and stored in the header as a discrete value. If it were being corrupted en route you'd expect to see "Corrupted MAC on input" errors as well - probably much more frequently than bad packet length errors. Since the block that contains the SSH packet length doesn't have a fixed position in the TCP payload or TCP data stream, random bit flipping would, it seems, be more likely to affect the SSH packet payload than to consistently affect this one data block. The alternative is that somehow the router is able to identify this encrypted block and consistently flip a bit when it's feeling a little tetchy. In which case the NSA wants your router.
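A back-of-the-envelope version of that argument: if a single random bit flip lands uniformly somewhere in one SSH packet, the chance that it hits the cipher block holding the length is roughly block size over packet size. The numbers below (a 16-byte cipher block, a 32 KiB packet) are assumptions picked for illustration, the function name is invented, and the sketch ignores cipher-mode chaining effects.

```python
from fractions import Fraction


def chance_flip_hits_length_block(packet_bytes: int, cipher_block_bytes: int = 16) -> Fraction:
    """Fraction of a packet occupied by the first cipher block, which holds the length."""
    return Fraction(cipher_block_bytes, packet_bytes)


# Hypothetical 32 KiB SSH packet with a 16-byte cipher block.
p = chance_flip_hits_length_block(32 * 1024)
print(p, float(p))
```

Under those assumptions, a flip elsewhere in the packet would be expected to surface as a MAC failure rather than a bad length, which is why MAC errors should dominate if the corruption were random.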

      • n0man says:

        It was a long time ago, but if memory serves, I was lucky enough to notice the router rebooting / glitching / something or other out of the corner of my eye at the same time as the problem occurred. I definitely agree with what you're saying here, but that was nonetheless my experience. Perhaps rsync was mis-reporting the problem, or perhaps the problem wasn't the router after all.

        • rapier1 says:

          I wouldn't even try to argue against what you actually experienced. I only wanted to point out that statistically speaking a consistent bad packet length error is unlikely to be caused by a randomly shifting bit. I can think of a couple scenarios in which it might happen with some consistency but nothing involving rsync.

          To be perfectly honest, OpenSSH is a bit of a... well... let me just say that tearing it down and starting over wouldn't be a bad idea. I'm pretty familiar with it because I've written a series of patches for it (hpn-ssh) and I keep threatening my coworkers with just rebuilding the whole thing. Then they tell me it wouldn't matter because OpenSSH is the 400 pound gorilla.

          • artkiver says:

            OpenSSH wasn't the gorilla when released; hell it's not even 10 years old yet. Reasons for its popularity are myriad, but hpn-ssh is neat (and OpenBSD devs have repeatedly said they want nothing to do with it). Alternate implementations that provide useful features are still worthwhile. But writing something from scratch isn't really a feature anyone but a developer might appreciate, so I hope you have others in mind.

            • rapier1 says:

              Well, that's a big part of it. SSH is a phenomenally useful application and OpenSSH has done a great job of making it do what people really need it to do. The problem is that, at times, you look at the code and go "wtf?" but you pretty much do that with any sufficiently complex code, I believe. What's nice is that this occasionally leaves room for us to go in there and do something neat - like the dynamic window tuning and multithreaded cipher we rolled into it. The main advantage to writing it from scratch would be that we could fine-tune it for the user base I'm most interested in (very high performance networks, super clusters, grid, etc). But I'm realistic and not really interested in wasting my time :)

              BTW: thanks for the nice comment about hpn-ssh.

        • waider says:

          Interesting. I had to swap my cheapie 3com wireless router for a Linksys WRT54GL as my shiny new macbook would, without fail, cause the 3com to reload itself once I hit the network hard enough (e.g. doing high-volume data transfers). And even with the Linksys I've seen a few burps in connectivity, usually when using bittorrent. I wonder if the Mac's using overly aggressive TCP tuning?

          • krick says:

            A lot of routers had trouble with this sort of thing back in the day and in most cases, it turned out to be a heat problem. Under heavy loads, the inside of the router would get hot enough to cause reboots.

            Some people had good results just standing the router on an edge so that the heat could disperse from the top and bottom of the router (the biggest surfaces) equally. Other people had to resort to running their routers with the top half of the case off to keep them cool.

            I think that early router thermal designs didn't consider people running eMule 24/7 downloading multi-gigabyte movies.