really hard bug

I found a reproducible xscreensaver crash, and I'm still at a loss as to how to find a fix for it. Help me figure this out and I will concede that your kung fu is the best. <lj-cut text="how I spent the last six hours"> To reproduce:

  • have `fade' and `unfade' turned on;
  • xscreensaver-command -activate
  • after it starts fading out, click the mouse (to stop it);
  • when it starts fading back in, click the mouse (to stop it again);
  • Xlib: sequence lost (0x2b714 > 0x1b714) in reply type 0x0!

"Sequence lost" means "game over." What it means is that Xlib went to read a reply from the X server (on the file descriptor that is its connection to the server) and it got nonsense back. There are three typical causes of this:

  • Making calls to Xlib routines from a signal handler (since Xlib is (usually) not re-entrant, this can cause a write or read to happen on the fd out of turn, confusing everything); or
  • Doing a printf to the X server's fd (this can happen if your program was launched with its stdin/stdout/stderr descriptors closed (that is, >&- rather than >/dev/null.) What can happen in that case is that the Xlib file descriptor gets allocated as fd #3, which is the magic stderr number. Then your program tries to write to stderr, and you blow up X); or

  • Similar lossage to the previous way, except involving forked subprocesses that inherited the Xlib fd and then did bad things to it.

So, of course, I have a high degree of confidence that none of these three things are happening. Yet, when I do that set of actions, I get "sequence lost" in _XSetLastRequestRead() from XlibInt.c in the bowels of Xlib. The backtrace is always:

    _XSetLastRequestRead (Xlib/XlibInt.c)
    _XError (Xlib/XlibInt.c)
    _XGetWindowProperty (Xlib/GetProp.c)
    kill_xsetroot_data_1 (xscreensaver/driver/windows.c)

Just after the "sequence lost" error, xscreensaver goes down with a BadImplementation error from XGetProperty(). This error is likely spurious: a result of the client trying to interpret nonsense data from the server, and noting that it's nonsense.

There's only one call to _XError() in XGetWindowProperty(), and it's getting called because the reply has format = 0, which is nonsense:

    type = 1,
    format = 0,
    sequenceNumber = 1472,
    length = 0,
    propertyType = 20971521 (0x1400001),
    bytesAfter = 1,
    nItems = 4

That 0x1400001 looks suspicious, but it's neither an atom ID nor a window ID. The property it's trying to read (actually, read-and-delete) is _XSETROOT_ID, and the window does not, in fact, have that property on it. (This is normal and often expected.) Which means XGetWindowProperty() should just return non-zero (meaning, "could not read-and-delete that property.") I guess this means that it should have gotten a non-zero value from _XReply(), but instead, it's getting 0, and trying to read a (nonsense) reply. Which is consistent with the theory that the stream of bytes coming back from the server has been horked (the server presumably really did send back a 1, but we're seeing a 0, and thus running off into the weeds.)

If I set a breakpoint on kill_xsetroot_data_1 and don't allow it to ever call XGetWindowProperty(), then lose sequence later with "unexpected async reply (sequence 0x664)!" from the next call to _XReply(), XEventsQueued(), XtAppNextEvent(). So the problem is still upstream somewhere.

What a total nightmare! I'm running with -sync, so it can't be far upstream, but I'm stumped as to how to narrow it down any farther.

I suppose the next step is to re-learn how to run Xnest or a protocol monitor. I can't remember the last time I've had to resort to that!

Tags: , , ,
Current Music: the disco crapfest we call "DNA Lounge friday night"

9 Responses:

  1. zztzed says:

    Fix: Tell people not to do that! Har har har.

  2. ciphergoth says:

    Could something be trampling on the data before sending or after reply? Maybe throw some sort of bounds checker at it?

    • jwz says:

      All the bounds checkers for Linux are complete ass.

      There's a beta of Purify for Linux, but I spent an hour trying to debug their installer, with zero success.

      Anyway, I had tried "memprof" (which is, as I said, ass) and it doesn't catch anything. This isn't actually very surprising, since chances are the corruption is happening in the read buffer inside of Xlib, which is both large and constantly reused. So it would be not out of the question for something to chew on that without actually going past an allocation boundary.

      I set gdb watchpoints on Display->request and ->last_request_read, but those never went bad before the failure inside of _XSetLastRequestRead().

  3. ivorjawa says:

    Does it show up on other platforms, or is it Linux-only?

    • jwz says:

      I only have access to Linux. But, in fact, it only happens on one of the three machines I've tried it on. It fails on a Red Hat 8 machine with XFree86-4.2.0-72; it does not fail on another Red Hat 8 machine with that same version of X; and it also does not fail on a Red Hat 7.2 machine with XFree86-4.1.0-25.

      Isn't that special.

      Of course those machines all have different video cards and CPUs, so if it's even slightly timing-related...

  4. ex_meta says:

    Wow, you just reminded me how much I really hated debugging X code. All kinda of useful error messages, mostly of the form "segmentation fault: core dumped"