perl and unicode go together like apples and razor blades

That scramble thing has really been making the rounds: I've seen the text translated into three or four other (human) languages now, not to mention all the people writing their own scripts in their marginalized geek-language du jour.

But my script was malfunctioning for a bunch of people, and I finally figured out why. Fucking Unicode again. If $LANG contains "utf8" (which is the default on recent Red Hat systems), then "^\w" doesn't work right, among other things. Check this out:

    setenv LANG en_US
    echo -n "foo.bar" | \
    perl -e '$_ = <>; print join (" | ", split (/([^\w]+)/)) . "\n";'

          ===> "foo | . | bar" (right)

    setenv LANG en_US.utf8
    echo -n "foo.bar" | \
    perl -e '$_ = <>; print join (" | ", split (/([^\w]+)/)) . "\n";'

          ===> "foo.bar" (wrong!)

It works fine in both cases if you do $_ = "foo.bar" instead of reading it from stdin.

perl-5.8.0-88, Red Hat 9. Hate.


32 Responses:

  1. scjody says:

    Turn off unicode: use bytes; at the top of your script.

    export LANG=en_US.utf8
    echo -n "foo.bar" | \
    perl -Mbytes -e '$_ = <>; print join (" | ", split (/([^\w]+)/)) . "\n";'

    ===> "foo | . | bar"
    • jwz says:

      Do you actually understand "use bytes", or are you cargo-culting it?

      Because my (quite possibly incorrect) understanding is that "use bytes" does not mean "turn off unicode", it means only "let me type arbitrary 8-bit characters in literal Perl strings."

      There's all kinds of other Unicode crap in, e.g., the file I/O layers that I don't think are affected by "use bytes".

      I see that adding "use bytes" does make it work, but I'd like to understand why.

      I'd also like to understand whether it's a bug that [^\w] stops working, or whether that's considered "correct" behavior in whatever Bizarro-world Unicode comes from.

      • deadmoose says:

        I'd also like to understand whether it's a bug that [^\w] stops working, or whether that's considered "correct" behavior in whatever Bizarro-world Unicode comes from.

        I'd guess that it's a bug; I tried \W, which should mean the exact same thing as [^\w] as far as I know, and it works with en_US.UTF-8.

      • waider says:

        Bytes covers read data as well as inline data. From the manpage, "... data that has come from a source that has been marked as being of a particular character encoding..." will be treated as character data (potentially multibyte) unless you specify the bytes pragma. A stupid hack to get around this portably (i.e. it will work on Perls that don't know about the 'bytes' pragma and would thus die horribly) is to use the binmode() function on any filehandles you want treated as bytestreams. While I've not exhaustively or logically tested this, it certainly gives a very strong appearance of working where I've tried it.

      • scjody says:

        use bytes forces strings to be treated as sequences of bytes ("byte semantics"), as opposed to letting Perl decide to use "character semantics" or "byte semantics" depending on where the input came from. "Turn off Unicode" is a bit of a simplification, but not much: Unicode is effectively disabled when handling strings.

        perldoc perlunicode has a list of things that are different under "character semantics", such as character classes in regular expressions. Note that having 'utf8' in $LANG turns on character semantics for strings from STDIN. According to perldoc open:

        If your locale environment variables (LC_ALL, LC_CTYPE, LANG) contain the strings 'UTF-8' or 'UTF8' (case-insensitive matching), the default encoding of your STDIN, STDOUT, and STDERR, and of any subsequent file open, is UTF-8.

        perldoc perlunicode states that a filehandle with a UTF-8 encoding is treated as a Unicode source, and Perl will use character semantics for such strings.

        So use bytes has the same effect, in this case, as removing utf8 from $LANG: regular expressions on $_ use byte semantics, which works around the Unicode bug. That [^\w] matches . in your example looks like a bug to me, especially considering:

        export LANG=en_US.utf8
        echo -n "." | \
        perl -ne 'print "match\n" if /[^\w]/'

        -=> match
  2. evan says:

    I was about to type something about how that couldn't work due to some property of Unicode, and then I realized that it *should* work.

    lulu:~% echo $LANG
    en_US.UTF-8
    lulu:~% echo -n "foo.bar" | \
    perl -e '$_ = <>; print join (" | ", split (/([^\w]+)/)) . "\n";'
    foo | . | bar
    lulu:~% perl --version | grep 'This is'
    This is perl, v5.8.0 built for powerpc-linux-thread-multi

    That's pretty mysterious. (I get this same behavior with "utf8" instead of "UTF-8", btw.)

    • jwz says:

      I'm using the Red Hat 9 RPM (perl-5.8.0-88.) Maybe it's different than whatever binary you have?
      md5sum `which perl` => 56c623abd14a2f39c4b08080fec14b6e /usr/bin/perl

    • evan says:

      I can't find anything that defines what \w should be, but the Unicode Regular Expression Guidelines mentions "A basic extension of this to work for Unicode is to make sure that the class of <word_character> includes all the Letter values from the Unicode character database, from UnicodeData.txt." According to the Unicode Character Database, a-z are of class L (letter), while a period is of class Po (punctuation, other).

      Possibly unrelated, but the technical report on word boundaries (which really should apply to \b in Perl) has an explicit rule which doesn't allow a word boundary when a "." is surrounded by letters:
      ALetter × (MidLetter | MidNumLet) ALetter
      (where × denotes "no break" and . is included in MidNumLet).


      • tkil says:

        The canonical reference for this should probably be the perlre man page, which in turn defines it to be:

        \w — Match a "word" character (alphanumeric plus "_")

        So in a Unicode world, I'd expect one to define “alphanumeric” as the union of all characters with “L...” and “N...” general categories, plus the underscore.

        In this particular case, it looks like Red Hat screwed up somehow, since \w and \W both do the right thing; it's just the construct [^\w]+, and maybe even only inside a split, that seemed to do the wrong thing.

  3. tkil says:

    On my redhat 8 box:

    $ echo "foo.bar" | LANG=en_US.utf8 \
    perl -lnwe 'print "$ENV{LANG}: " .
    join "|", split /([^\w]+)/;'
    en_US.utf8: foo|.|bar

    $ rpm -q perl
    perl-5.8.0-55

    But on my friend's RH9 box:

    $ echo "foo.bar" | LANG=en_US.utf8 \
    perl -lnwe 'print "$ENV{LANG}: " .
    join "|", split /([^\w]+)/;'
    en_US.utf8: foo.bar

    $ rpm -q perl
    perl-5.8.0-88

    I don't know if this is more likely to be a bug in the perl RPM, or if there are underlying libraries that it uses for UTF8 / Unicode handling. I would guess that perl handles the mechanics itself, but it's likely that it relies on external tables or other data to figure out what to do.

    Maybe time to look for and/or file a bug?

    (A few perl-optimizing comments: see the perlrun man page for info on the very helpful -l and -n flags. Also, note that \W (backslash, capital W) is a nice shorthand for [^\w]; details in perlre.)

    • tkil says:

      JWZ —

      Are you doing something peculiar to monospaced fonts with your comments stylesheet? My above post uses <pre> and <tt>, yet the contents of those tags are rendered as normal text. Interestingly enough, <code> seems to be formatted correctly. Are you trying to give us a hint?

      • jwz says:

        I didn't do shit, I'm just using "S2 Generator". Blame <lj user="brad">.

        TT/PRE stuff looks fine to me (though it's somewhat larger than the surrounding text, which is not the case in plain-old-no-stylesheet-HTML documents.)

        • kfringe says:

          It looks fine here.

          Wait... didn't I have this conversation with someone in 1995?

        • tkil says:

          You'll be disappointed to know that, all indications to the contrary, we are not in the future. Yet.

          My <pre> and <tt> content looks fine on Mac OS X, but doesn't look any different from normal text on Linux. Both running Mozilla 1.4 final.

          Although, now that I think about it, I might be using more aggressive “ignore site formatting” settings on the Linux box. ... but <code> works. My head hurts.

    • tkil says:

      Definitely broken. Consider this little test program (also available on the web):

      #!/usr/bin/perl -w

      my $re_raw = shift @ARGV;
      my $re = qr/$re_raw/;

      print "regex: '$re_raw'\n";

      while (<>)
      {
          print "$ENV{LANG}: " . join( "|", split /($re+)/ ), "\n";
          print "$_\n";
          foreach my $c ( split // )
          {
              print( $c =~ /$re/ ? "." : "!" );
          }
          print "\n";
          my @chars = map { sprintf "%02x", $_ } unpack "U*", $_;
          print "@chars\n";
      }

      Now take a look at these test runs, on my
      friend's RedHat 9 box (perl-5.8.0-88):

      $ echo "foo.bar" | LANG=en_US.utf8 ./jwz1.plx '\W'
      regex: '\W'
      en_US.utf8: foo|.|bar
      66 6f 6f 2e 62 61 72

      $ echo "foo.bar" | LANG=en_US.utf8 ./jwz1.plx '[^\w]'
      regex: '[^\w]'
      en_US.utf8: foo.bar
      66 6f 6f 2e 62 61 72

      $ echo "foo.bar" | LANG=en_US ./jwz1.plx '[^\w]'
      regex: '[^\w]'
      en_US: foo|.|bar
      66 6f 6f 2e 62 61 72

      The middle one — which is, of course, the one most closely modeled after JWZ's original — has the amusing viewpoint that the full stop by itself is a non-word character, but it doesn't find it in the original split.

      Also, as you pointed out, doing the assignment other ways — I originally tried to pass in the string in @ARGV to avoid the need for echo — seems to avoid the problem... Which makes me think that it has to do with the input layer doing weird things. But it recognizes the full stop on its own! Grrr!

      And note that \W (upper-case) works, but [^\w] (lower-case) doesn't; this is also quite distressing.

      As before, running it on the RedHat 8 box (perl-5.8.0-55) works just fine:

      $ echo "foo.bar" | LANG=en_US.utf8 ./jwz1.plx '[^\w]'
      regex: '[^\w]'
      en_US.utf8: foo|.|bar
      66 6f 6f 2e 62 61 72

      So this just screams “bug” to me. I took a quick look through the Red Hat Bugzilla, but I didn't find anything obvious. (Although I know that I'm really bad at searching Bugzillas in general, so...)

      • jwz says:

        I reported 104540 earlier today; 102106 looks similar, but I couldn't be bothered to figure out what "try it in rawhide" means (as I suspect it's more effort than I'm interested in.)

        • tkil says:

          Rawhide is the beta builds of pretty much everything in the RedHat distribution. So, presumably, this issue might be resolved in the latest beta.

          I have no idea how to determine when that beta will percolate out to an actual release.
        • havardk says:

          I picked up the current perl version from rawhide, and rebuilt it for Red Hat 9. The test above works ok in a utf8 locale with this version.

          If anyone is interested, I put the rpms I built here.

  4. cschmidt says:

    Well, it.. uhh.. works for me, sorta:

    $ setenv LANG en_US.utf8
    $ echo -n "foo.bar" | \
    pipe> perl -e '$_ = <>; print join (" | ", split (/([^\w]+)/)) . "\n";'
    perl: warning: Setting locale failed.
    perl: warning: Please check that your locale settings:
    LANGUAGE = (unset),
    LC_ALL = (unset),
    LANG = "en_US.utf8"
    are supported and installed on your system.
    perl: warning: Falling back to the standard locale ("C").
    foo | . | bar
  5. denshi says:

    There have been a number of Scheme implementations written recently for tight Unix integration (my favorite being Gauche).

    Would using one of these, and rewriting some modules and dealing with the wandering Schemeisms, be better than this periodic self-abuse you put yourself through with Perl?

    • jwz says:

      No, not really; I use Perl for the same reason I use C: not because it's good, but because it's ubiquitous. It works absolutely everywhere without my programs having to be accompanied by a list of prerequisites that people will scoff at. (And I find that I do get value from being in the situation that other people are using the things I've written for myself.)

      I used to write everything in Emacs Lisp. After that, I wrote everything in Java. Eventually I stopped chasing the holy grail and started just using what everyone else uses.

      That's why I laugh at people who suggest Python, Ruby, and whatever else the geek flavor of the week is. I was a beta tester of the "marginalized ghetto-language self-abuse kit", I don't need to do that again. I gave my Lisp Machines away.

      The Perl self-abuse is bad, but I guess I prefer it to the form of self-abuse that goes, "there are only ten people in the world who will ever run your silly little emacs-lisp function."

      • ciphergoth says:

        Of course, hardly any Windows machines have Perl installed, and they outnumber the Unix machines by a huge factor. Most likely, in a few years C# will be the most prevalently-installed scripting language in the world by an enormous margin.

        I'd assume that most Linux boxes, at least, have more than one scripting language installed. I'd be surprised to learn that Perl's prevalence was an order of magnitude ahead of any of the others.

        • jwz says:

          Since I actively discourage people from running my software on Windows, that's fine with me.

          Though, if in a few years C# turns out to be the scripting language of choice, that'd be fine with me, since it's basically Java. But I doubt it will be, since it's not really a "scripting" language in the sense that sh and perl are; it has strong typing, so the level of competency required is much higher.

          I suspect there are an order of magnitude more people who know Perl than who know the other commonly-available scripting languages.

        • cnoocy says:

          It's likely that the programming language with the largest install base and competent user base is Microsoft Excel.
          I do not predict or recommend that everyone start writing their clever scripts as Excel functions.

      • denshi says:

        I was a beta tester of the "marginalized ghetto-language self-abuse kit",

        Good one.

        As for me, maybe 90% of the stuff I write talks to the world through port 80, so distribution has not traditionally been a concern of mine. Given that, I'd rather not tear my hair out dealing with C and Perl inanities. Code fast, die young, leave a good-looking CVS tree.

      • jonabbey says:

        I think Python is becoming a bit more than the geek flavor of the week. A lot of good stuff is being done with it, including a lot of critical systems infrastructure (all the Red Hat install/update stuff, all of BitTorrent..).

        The vast majority of code I've written since Java came out has either been in Java or Perl, but I really think that Python is the most rewarding direction to go in for new work, unless I need particularly tightly multithreaded and/or windows-portable stuff.