Dooming us to inhuman toil, etc, etc.

You'll be happy to know that this weekend I've dragged XScreenSaver kicking and screaming into 2003 and have made all the hacks that load text able to properly display UTF-8 input, and it's even anti-aliased.

But don't worry: while doing so, it still parses HTML using regular expressions. I would never re-make a classic hit like that.

Previously, previously, previously.


12 Responses:

  1. Hey, what's this module "jarjar.saver" doing in here?

  2. Ed says:

    Hey man, thanks for your work on this stuff. I really appreciate it and it makes me happy to still have xscreensaver loved and nurtured in 2014. Cheers!

    • I also wanted to say thanks. I emailed you last year about this issue and I'm happy to see that you've addressed it. Now I will be able to display French fortunes on my screen! Thanks again.

  3. Leonardo Herrera says:

    Do you have a public repository available?

  4. jwz says:

    This involved writing a bunch of code that looked like this. ZALGO HE COMES.

      } else if ((c & 0xFC) == 0xF8 && in+3 < end) {
                                            /*  111110xx - 26 bits               */
        uc = (((c     & 0x03) << 24) |      /*  00000011--------+-------+------- */
              ((in[0] & 0x3F) << 18) |      /*        00111111--+-------+------- */
              ((in[1] & 0x3F) << 12) |      /*              00111111----+------- */
              ((in[2] & 0x3F) << 6)  |      /*                    00111111------ */
              ((in[3] & 0x3F)));            /*                          00111111 */
        in += 4;
      } else if ((c & 0xFE) == 0xFC && in+4 < end) {

    Also, to make it work portably, I had to implement most of Xft in terms of XDrawString16().

    • Nick Lamb says:

      Um. I feel bad only saying this now, but did you read RFC 3629?

      All those crazy-long five byte and six byte sequences are no longer permissible because Unicode voluntarily committed itself to the range U+0000 to U+10FFFF.

      All invalid sequences, including the single bytes 0xC0, 0xC1 and 0xF5 through 0xFF, must never appear. If this code can raise an error, it should do so; if not, it should treat each invalid sequence as codepoint U+FFFD and continue. For example the input 0xF8 0x41 0x42 0x43 0x44 should either raise an error or be handled exactly like 0xEF 0xBF 0xBD 0x41 0x42 0x43 0x44. Your code above appears to consume five bytes as a single Unicode codepoint beyond the documented maximum of U+10FFFF, which is the Wrong Thing.

      The reason for using U+FFFD if you can't / won't throw an error is that it's the least dangerous option. Nobody thinks this codepoint is a reserved identifier, directory separator, whitespace, comment marker, escape character or other magic totem, and most typefaces represent it as a black diamond with a white question mark that looks very clearly as if something went wrong, which it did.

      If you're 100% confident that all input to this code is sanitised then no problem, do whatever you like. If on the other hand it processes data it found in a public toilet, or worse, on the open web, then you need to be very careful to handle it as explained in the standard so that at least it's not actually your fault when inevitably something goes wrong.
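
      For illustration, the handling described above might be sketched roughly as follows. This is not XScreenSaver's code: the function name utf8_decode_strict and the REPLACEMENT macro are made up for this example, and emitting one U+FFFD per rejected sequence is only one of several defensible policies.

```c
#include <stddef.h>
#include <stdint.h>

#define REPLACEMENT 0xFFFDu   /* hypothetical name for U+FFFD */

/* Hypothetical helper: decode one code point from [in, end).
   Stores the result in *out and returns the number of bytes consumed
   (0 only when the input is empty).  Anything RFC 3629 forbids is
   reported as U+FFFD rather than as an out-of-range integer. */
static size_t
utf8_decode_strict (const unsigned char *in, const unsigned char *end,
                    uint32_t *out)
{
  uint32_t uc, min;
  size_t need, i;
  unsigned char c;

  if (in >= end) { *out = REPLACEMENT; return 0; }
  c = *in;

  if (c < 0x80) { *out = c; return 1; }                      /* ASCII */
  else if (c >= 0xC2 && c <= 0xDF) { need = 1; uc = c & 0x1F; min = 0x80; }
  else if ((c & 0xF0) == 0xE0)     { need = 2; uc = c & 0x0F; min = 0x800; }
  else if (c >= 0xF0 && c <= 0xF4) { need = 3; uc = c & 0x07; min = 0x10000; }
  else { *out = REPLACEMENT; return 1; }   /* 0x80-0xC1 and 0xF5-0xFF */

  for (i = 1; i <= need; i++) {
    /* Truncated input or a non-continuation byte: give up on this
       sequence but leave the offending byte for the next call. */
    if (in + i >= end || (in[i] & 0xC0) != 0x80) {
      *out = REPLACEMENT;
      return i;
    }
    uc = (uc << 6) | (in[i] & 0x3F);
  }

  /* Reject overlong forms, UTF-16 surrogates, and values past U+10FFFF. */
  if (uc < min || uc > 0x10FFFF || (uc >= 0xD800 && uc <= 0xDFFF))
    uc = REPLACEMENT;

  *out = uc;
  return need + 1;
}
```

      Called in a loop over 0xF8 0x41 0x42 0x43 0x44, this sketch yields U+FFFD followed by U+0041 through U+0044, consuming one byte each time, which matches the example above.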

      • jwz says:

        No, I didn't, I mostly cargo-culted it and ran a bunch of self-tests. I'm reasonably certain there are no buffer overflows lurking in my UTF-8 encoding and decoding code, so the worst that an invalid UTF-8 sequence could produce is an unsigned 32 bit (alleged) unicode character with arbitrary bits lit up. What's the exploit there?

        • Nick Lamb says:

          As you say, arbitrary bits can get lit up with code like the above. For example you can sneak a zero value into uc in the above fragment by using invalid sequences which would surprise some people. But that's just another arbitrary 31-bit integer, and if your code really is fine with arbitrary 31-bit integers then no ill will come of it.

          http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt

          is a good set of test input data for seeing whether anything unexpected happens when you shove various things through your decoder.

          • jwz says:

            Yup, saw that, built my test cases based on that plus some other stuff. I don't get precisely the results he says are correct for invalid input, but it doesn't overflow...

            I suppose if I were to be paranoid about whether the underlying OS dealt sensibly with arbitrary random numbers as unichars (the devil you say), I could just mask the unichars after the fact into a legal range.

            Though actually, I think that case only comes up on X11 systems that don't have Xft, in which case the unichar gets truncated to 16 bits anyway and XDrawText16 is used.
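
            A minimal sketch of that after-the-fact cleanup, assuming a substitution with U+FFFD rather than a literal bit mask. Both function names are hypothetical and neither comes from XScreenSaver:

```c
#include <stdint.h>

/* Hypothetical helper: force an arbitrary 32-bit value into the legal
   Unicode range by replacing anything illegal with U+FFFD. */
static uint32_t
sanitize_unichar (uint32_t uc)
{
  if (uc > 0x10FFFF)                  /* beyond the Unicode maximum */
    return 0xFFFD;
  if (uc >= 0xD800 && uc <= 0xDFFF)   /* UTF-16 surrogate range */
    return 0xFFFD;
  return uc;
}

/* Hypothetical helper for a 16-bit drawing path: instead of silently
   truncating, substitute U+FFFD for anything that does not fit. */
static uint16_t
unichar_to_ucs2 (uint32_t uc)
{
  uc = sanitize_unichar (uc);
  return (uint16_t) (uc > 0xFFFF ? 0xFFFD : uc);
}
```

            Substituting instead of masking means a bogus value cannot alias some unrelated legal character, which is the same reasoning given earlier in the thread for preferring U+FFFD.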
