This X bug is still kicking my ass, now on day three. I was able to reproduce it on a second machine, I've watched it happen literally hundreds of times, and I still have no idea what's causing it. I even got a debug build of Xlib going, and have been single-stepping through the library, watching it pull bits off the wire and assemble them into events, and I still haven't been able to catch it in the act of going south. For a long time, it looked like it was malfunctioning every time it called XGetWindowProperty() with the `delete' flag set. (For a while, it was always getting a BadImplementation error down in XGetWindowProperty(), because the reply it was seeing had a `type' of 1 (XA_PRIMARY, which is nonsense) but a `format' of 0 (also nonsense).)
But no, sometimes it only fails much later, after it has gone back to the main loop and run some Xt timer functions (which are polled, not signal-based.) But only if XGetWindowProperty() has already been called three times. (Yeah, sure.)
No matter what I've tried, I haven't been able to narrow it down to the exact spot where things go wrong, because it's timing-sensitive. Single-stepping changes the behavior. Attaching commands to breakpoints (to dump variables or print backtraces) changes the behavior. And yet the memory checkers (memprof and valgrind) report no reads or writes of freed memory.
Running it through xmon (an X protocol-monitoring proxy) changes where the problem occurs, but it still happens, and nothing that xmon prints out looks out of place. In particular, the last GetProperty reply that comes through is totally sensible while it's on the wire, then somehow turns to shit by the time XGetWindowProperty() gets the result back from _XReply():
    REQUEST: GetProperty
        sequence number: 033e
        delete: True
        request length: 0006
        window: WIN 00400020
        property: ATM 00000103
        type: AnyPropertyType
        long-offset: 00000000
        long-length: 00000001

    REPLY: GetProperty
        format: 00
        sequence number: 033e
        reply length: 00000000
        type: <NONE>              <-- notably not 1
        bytes-after: 00000000
        length of value: 00000000
Of course, I haven't actually been able to watch _XReply() perform this reverse-alchemical trick, because to do that, I'd have to know which of the thousands of calls to _XReply() was the one that was about to go wrong; and if I look at more than one of them, I throw the timing off and the problem doesn't occur.
Attempting to make a small test-case program was fruitless, for the same reason: I haven't yet found a sequence of a few hundred events that causes this to happen reliably.
I'm just totally flailing at this point, changing things at random. If I could find a way to make it always die in the same place, I could start tediously binary-searching from there, looking at the contents of the read buffer, comparing memory dumps between subsequent runs, something. But instead I just keep running it over and over, watching it fail in a different place each time, and hoping an idea occurs to me.
I used to be good at this. I think someone stole my mojo.