I am worthless and weak.

My kung fu is not the best.

This X bug is still kicking my ass, for the third day. I was able to reproduce it on a second machine, and I watched it happen literally hundreds of times, and I still have no idea what's causing it. I even got a debug build of Xlib going, and have been single stepping through the library, watching it pull bits off the wire and assemble them into events, and I still haven't been able to catch it in the act of going south. For a long time, it looked like it was malfunctioning every time it tried to call XGetWindowProperty() with the `delete' flag set. (For a while, it was always getting a BadImplementation error down in XGetWindowProperty() because the reply it was seeing had a `type' of 1 (XA_PRIMARY, which is nonsense) but a `format' of 0 (also nonsense).)

But no, sometimes it only fails much later, after it has gone back to the main loop and run some Xt timer functions (which are polled, not signal-based.) But only if XGetWindowProperty() has already been called three times. (Yeah, sure.)

No matter what I've tried, I've not been able to narrow it down to the exact spot where things go wrong: timing influences it. Single-stepping changes the behavior. Attaching commands to breakpoints (to dump variables, print backtraces) changes the behavior. Yet memory checkers (memprof and valgrind) report no reads or writes of freed memory.

Running it through xmon (an X protocol-monitoring proxy) changes where the problem occurs, but it still happens -- and nothing that xmon prints out looks out of place. In particular, the last GetProperty reply that comes through is totally sensible while on the wire, then somehow turns to shit by the time XGetWindowProperty() gets the result from _XReply():

        REQUEST: GetProperty
sequence number: 033e
         delete: True
 request length: 0006
         window: WIN 00400020
       property: ATM 00000103
           type: AnyPropertyType
    long-offset: 00000000
    long-length: 00000001
          REPLY: GetProperty
         format: 00
sequence number: 033e
   reply length: 00000000
           type: <NONE>      <-- notably not 1
    bytes-after: 00000000
length of value: 00000000
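
A trace like the one above can be checked mechanically rather than by eye. What follows is only a minimal sketch, assuming the xmon output has been saved as text in the same `name: value` layout; `parse_xmon_block` and `reply_is_sane` are hypothetical helpers, not part of xmon. The sanity rule is the one from the post: a `format' of 0 only makes sense alongside a `type' of <NONE>.

```python
SAMPLE = """\
        REQUEST: GetProperty
sequence number: 033e
         delete: True
         window: WIN 00400020
       property: ATM 00000103
          REPLY: GetProperty
         format: 00
sequence number: 033e
           type: <NONE>      <-- notably not 1
    bytes-after: 00000000
"""

def parse_xmon_block(text):
    """Split a dump into (kind, fields) records, one per REQUEST or REPLY."""
    records = []
    for line in text.splitlines():
        if ":" not in line:
            continue
        key, _, value = line.partition(":")
        key = key.strip()
        # Drop hand-written annotations such as "<-- notably not 1".
        value = value.split("<--")[0].strip()
        if key in ("REQUEST", "REPLY"):
            records.append((key, {"name": value}))
        elif records:
            records[-1][1][key] = value
    return records

def reply_is_sane(fields):
    """A format of 0 is only plausible when the type is <NONE>."""
    fmt = int(fields["format"], 16)
    return fmt != 0 or fields["type"] == "<NONE>"

if __name__ == "__main__":
    for kind, fields in parse_xmon_block(SAMPLE):
        if kind == "REPLY":
            print(fields["sequence number"], reply_is_sane(fields))
```

Running the same check over whatever Xlib reports for each reply would show which sequence number first goes bad, provided the extra logging does not itself disturb the timing.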

Of course, I haven't actually been able to watch _XReply() perform this reverse-alchemical trick, because to do that, I'd have to know which of the thousands of calls to _XReply() was the one that was about to go wrong: because if I look at more than one of them, I throw the timing off, and the problem doesn't occur.

Attempting to make a small test case program was fruitless, for the same reason; I've not yet found a sequence of small-number-of-hundreds of events that cause this to happen reliably.

I'm just totally flailing at this point, changing things at random. If I could find a way to make it always die in the same place, I could start tediously binary-searching from there, looking at the contents of the read buffer, comparing memory dumps between subsequent runs, something. But instead I just keep running it over and over, watching it fail in a different place each time, and hoping an idea occurs to me.

I used to be good at this. I think someone stole my mojo.

2 Responses:

  1. pexor says:

    I'd write it off as a bug in Xlib and find another way to effect the same fade/unfade. My days of spending countless hours pursuing bugs that, after careful examination, could not have been the fault of my code ended the day I fell out of love with Java because I realized that the current Object Oriented paradigm is fundamentally broken and led to some really horrific design decisions. I've uttered those classic, hubris-ridden words too many times: "It *can't* be my code." A lot of times it is, but there are times when it isn't. I don't sweat it anymore.

    That being said, your kung fu is ancient compared to mine. Mayhaps your many seasons have taught you better than my few have taught me.

    "Well now, from your kung fu, your dad's useless... wouldn't even hire him to wipe my ass." -- Hwang Jang Lee, Drunken Master

  2. federico says:

    Instrument your server so that it writes all the results of GetProperty requests to a file, and then your Xlib so that it does the same with the stuff it gets back. Compare the sequence numbers and see if the data matches. Maybe you do have an X server bug.
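
The comparison step could look something like the sketch below. It assumes both sides write one line per GetProperty reply in a made-up format, `seq type format nitems`, with the sequence number in hex; that log format and the names `load_log` and `first_mismatch` are illustrative assumptions, not anything the X server or Xlib produces on its own.

```python
def load_log(lines):
    """Map sequence number -> (type, format, nitems) from lines like '033e 0 0 0'."""
    replies = {}
    for line in lines:
        parts = line.split()
        if len(parts) != 4:
            continue  # skip blank or malformed lines
        seq = int(parts[0], 16)
        replies[seq] = tuple(int(p) for p in parts[1:])
    return replies

def first_mismatch(server_lines, client_lines):
    """Return (seq, server_reply, client_reply) for the first disagreement, else None."""
    server = load_log(server_lines)
    client = load_log(client_lines)
    for seq in sorted(server.keys() | client.keys()):
        if server.get(seq) != client.get(seq):
            return seq, server.get(seq), client.get(seq)
    return None

if __name__ == "__main__":
    server_log = ["033d 31 8 4", "033e 0 0 0"]
    client_log = ["033d 31 8 4", "033e 1 0 0"]  # type 1 with format 0, as in the post
    print(first_mismatch(server_log, client_log))
```

One caveat: sequence numbers are 16 bits on the wire, so a long run wraps around, and a real comparison would need to log something that disambiguates the wraps (a running counter, say) rather than sorting on the raw value.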