But my script was malfunctioning for a bunch of people, and I finally figured out why. Fucking Unicode again. If $LANG contains "utf8" (which is the default on recent Red Hat systems), then "[^\w]" doesn't work right, among other things. Check this out:
echo -n "foo.bar" | \
perl -e '$_ = <>; print join (" | ", split (/([^\w]+)/)) . "\n";'
===> "foo | . | bar" (right)
setenv LANG en_US.utf8
echo -n "foo.bar" | \
perl -e '$_ = <>; print join (" | ", split (/([^\w]+)/)) . "\n";'
===> "foo.bar" (wrong!)
It works fine in both cases if you assign the string directly ($_ = "foo.bar";) instead of reading it from stdin ($_ = <>;).
perl-5.8.0-88, Red Hat 9. Hate.
Turn off unicode: put "use bytes;" at the top of your script.
Do you actually understand "use bytes", or are you cargo-culting it?
Because my (quite possibly incorrect) understanding is that "use bytes" does not mean "turn off unicode", it means only "let me type arbitrary 8-bit characters in literal Perl strings."
There's all kinds of other Unicode crap in, e.g., the file I/O layers that I don't think are affected by "use bytes".
I see that adding "use bytes" does make it work, but I'd like to understand why.
I'd also like to understand whether it's a bug that [^\w] stops working, or whether that's considered "correct" behavior in whatever Bizarro-world Unicode comes from.
I'd guess that it's a bug; I tried \W, which should mean the exact same thing as [^\w] as far as I know, and it works with en_US.UTF-8.
Bytes covers read data as well as inline data. From the manpage, "... data that has come from a source that has been marked as being of a particular character encoding..." will be treated as character data (potentially multibyte) unless you specify the bytes pragma. A stupid hack to get around this portably (i.e., it will work on Perls that don't know about the 'bytes' pragma and would thus die horribly on "use bytes") is to use the binmode() function on any filehandles you want treated as bytestreams. While I've not exhaustively or logically tested this, it certainly gives a very strong appearance of working where I've tried it.
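For instance, a minimal sketch, assuming the problem input arrives on STDIN: binmode() strips any encoding layer from the handle, so the regex sees raw bytes even under a utf8 locale.
binmode(STDIN);    # treat STDIN as a raw bytestream; no 'bytes' pragma needed
$_ = <STDIN>;
print join (" | ", split (/([^\w]+)/)) . "\n";
===> "foo | . | bar" again, at least where I've tried it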
use bytes forces strings to be treated as sequences of bytes ("byte semantics"), as opposed to letting Perl decide to use "character semantics" or "byte semantics" depending on where the input came from. "Turn off Unicode" is a bit of a simplification, but not much: Unicode is effectively disabled when handling strings.
perldoc perlunicode has a list of things that are different under "character semantics", such as character classes in regular expressions. Note that having 'utf8' in $LANG turns on character semantics for strings from STDIN. According to perldoc open, a filehandle with a UTF-8 encoding layer is treated as a Unicode source, and Perl will use character semantics for strings read from it.
So use bytes has the same effect, in this case, as removing utf8 from $LANG: regular expressions on $_ use byte semantics, which works around the Unicode bug. That [^\w] fails to match the . in your example looks like a bug to me, especially considering:
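For what it's worth, a minimal sketch of the pragma workaround: with "use bytes" in scope, the regex engine uses byte semantics no matter what the locale says.
use bytes;       # byte semantics for everything below
$_ = <STDIN>;    # would otherwise be UTF-8-flagged under en_US.utf8
print join (" | ", split (/([^\w]+)/)) . "\n";
===> "foo | . | bar"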
I was about to type something about how that couldn't work due to some property of Unicode, and then I realized that it *should* work.
That's pretty mysterious. (I get this same behavior with "utf8" instead of "UTF-8", btw.)
I'm using the Red Hat 9 RPM (perl-5.8.0-88.) Maybe it's different than whatever binary you have?
md5sum `which perl` => 56c623abd14a2f39c4b08080fec14b6e /usr/bin/perl
Yeah, this is PPC and Debian. Redhat has weird custom mods in their Perl.
I can't find anything that defines what \w should be, but the Unicode Regular Expression Guidelines mentions "A basic extension of this to work for Unicode is to make sure that the class of <word_character> includes all the Letter values from the Unicode character database, from UnicodeData.txt." According to the Unicode Character Database, a-z are of class L (letter), while a period is of class Po (punctuation, other).
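A quick sketch of checking those categories from Perl itself, using the \p{} escapes that 5.8 understands:
perl -e 'print "a" =~ /\p{L}/  ? "L\n"  : "not L\n";'     # letter
perl -e 'print "." =~ /\p{Po}/ ? "Po\n" : "not Po\n";'    # punctuation, other
So a \w built on the Letter categories should never swallow a full stop.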
Possibly unrelated, but the technical report on word boundaries (which really should apply to \b in Perl) has an explicit rule which doesn't allow a word boundary when a "." is surrounded by letters:
ALetter × (MidLetter | MidNumLet) ALetter
(where × denotes "no break" and . is included in MidNumLet).
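Note, though, that Perl's \b is defined in terms of \w (it's the boundary between a \w character and a \W character, per perlre), so a stock perl does report a boundary around the full stop. A quick sketch:
perl -e 'print "foo.bar" =~ /foo\b/ ? "boundary\n" : "no boundary\n";'
===> "boundary"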
Hmm.
The canonical reference for this should probably be the perlre man page, which in turn defines it to be a "word" character (alphanumeric plus "_"). So in a Unicode world, I'd expect one to define “alphanumeric” as the union of all characters with “L...” and “N...” categories.
In this particular case, it looks like RedHat screwed up somehow, since \w and \W both do the right thing; it's just the construct [^\w]+, and maybe even only inside a split, that seemed to do the wrong thing.
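A sketch of that comparison, in the style of the original one-liners ("works"/"fails" being what the affected perl-5.8.0-88 reportedly does under en_US.utf8):
echo -n "foo.bar" | \
perl -e '$_ = <>; print join (" | ", split (/(\W+)/)) . "\n";'
===> "foo | . | bar" (works)
echo -n "foo.bar" | \
perl -e '$_ = <>; print join (" | ", split (/([^\w]+)/)) . "\n";'
===> "foo.bar" (fails)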
On my redhat 8 box:
But on my friend's RH9 box:
I don't know if this is more likely to be a bug in the perl RPM, or if there are underlying libraries that it uses for UTF8 / Unicode handling. I would guess that perl handles the mechanics itself, but it's likely that it relies on external tables or other data to figure out what to do.
Maybe time to look for and/or file a bug?
(A few perl-optimizing comments: see the perlrun man page for info on the very helpful -l and -n flags. Also, note that \W (backslash, capital W) is a nice shorthand for [^\w]; details in perlre.)
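E.g., a sketch of the same test using those flags, where -n supplies the implicit while (<>) loop and -l takes care of the newlines:
echo -n "foo.bar" | perl -lne 'print join " | ", split /(\W+)/'
===> "foo | . | bar"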
JWZ —
Are you doing something peculiar to monospaced fonts with your comments stylesheet? My above post uses <pre> and <tt>, yet the contents of those tags are rendered as normal text. Interestingly enough, <code> seems to be formatted correctly. Are you trying to give us a hint?
I didn't do shit, I'm just using "S2 Generator". Blame <lj user="brad">.
TT/PRE stuff looks fine to me (though it's somewhat larger than the surrounding text, which is not the case in plain-old-no-stylesheet-HTML documents.)
It looks fine here.
Wait... didn't I have this conversation with someone in 1995?
You'll be disappointed to know that, all indications to the contrary, we are not in the future. Yet.
My <pre> and <tt> content looks fine on Mac OS X, but doesn't look any different from normal text on Linux. Both running Mozilla 1.4 final. Although, now that I think about it, I might be using more aggressive “ignore site formatting” settings on the Linux box. ... but <code> works. My head hurts.
Definitely broken. Consider this little test program (also available on the web at http://scrye.com/~tkil/perl/jwz1.plx):
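The actual script is at the URL above; a hypothetical reconstruction from the discussion below (it may well differ from the real jwz1.plx) would be something like:
#!/usr/bin/perl
# hypothetical reconstruction -- the real jwz1.plx may differ
$_ = <STDIN>;    # read "foo.bar" from stdin, as in the original post
chomp;
print "split [^\\w]+ : ", join(" | ", split /([^\w]+)/), "\n";
print "match [^\\w]  : ", ("." =~ /[^\w]/ ? "matches" : "no match"), "\n";
print "split \\W+    : ", join(" | ", split /(\W+)/), "\n";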
Now take a look at these test runs, on my
friend's RedHat 9 box (perl-5.8.0-88):
The middle one — which is, of course, the one most closely modeled after JWZ's original — has the amusing viewpoint that the full stop by itself is a non-word character, but it doesn't find it in the original split.
Also, as you pointed out, doing the assignment other ways — I originally tried to pass in the string in @ARGV to avoid the need for echo — seems to avoid the problem... Which makes me think that it has to do with the input layer doing weird things. But it recognizes the full stop on its own! Grrr!
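For reference, a sketch of that @ARGV variant, which reportedly sidesteps the bug because the string never passes through the input layer:
perl -e '$_ = shift; print join (" | ", split (/([^\w]+)/)) . "\n";' "foo.bar"
===> "foo | . | bar"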
And note that \W (upper-case) works, but [^\w] (lower-case) doesn't; this is also quite distressing.
As before, running it on the RedHat 8 box (perl-5.8.0-55) works just fine:
So this just screams “bug” to me.
I took a quick look through the RedHat Bugzilla, but I didn't find anything obvious. (Although I know that I'm really bad at searching Bugzillas in general, so...)
I reported 104540 earlier today; 102106 looks similar, but I couldn't be bothered to figure out what "try it in rawhide" means (as I suspect it's more effort than I'm interested in.)
Rawhide is the beta builds of pretty much everything in the RedHat distribution. So, presumably, this issue might be resolved in the latest beta. I have no idea how to determine when that beta will percolate out to an actual release, though.
I picked up the current perl version from rawhide, and rebuilt it for Red Hat 9. The test above works ok in a utf8 locale with this version.
If anyone is interested, I put the rpms I built here.
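For anyone who'd rather rebuild it themselves, the usual procedure is something like the following (the exact source-RPM name is an assumption; use whatever perl src.rpm rawhide currently carries):
rpmbuild --rebuild perl-*.src.rpm    # hypothetical package name
rpm -Uvh /usr/src/redhat/RPMS/i386/perl-*.i386.rpm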
Well, it.. uhh.. works for me, sorta:
Presumably that means you have no /usr/lib/locale/en_US.utf8/
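One way to check is to list the locales the system actually has installed:
locale -a | grep -i utf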
Indeed. I'm glad to be part of the 'standard locale' majority :)
There have been a number of Scheme implementations written recently for tight Unix integration (my favorite being Gauche).
Would using one of these, and rewriting some modules and dealing with the wandering Schemeisms, be better than this periodic self-abuse you put yourself through with Perl?
No, not really; I use Perl for the same reason I use C: not because it's good, but because it's ubiquitous. It works absolutely everywhere without my programs having to be accompanied by a list of prerequisites that people will scoff at. (And I find that I do get value from being in the situation that other people are using the things I've written for myself.)
I used to write everything in Emacs Lisp. After that, I wrote everything in Java. Eventually I stopped chasing the holy grail and started just using what everyone else uses.
That's why I laugh at people who suggest Python, Ruby, and whatever else the geek flavor of the week is. I was a beta tester of the "marginalized ghetto-language self-abuse kit", I don't need to do that again. I gave my Lisp Machines away.
The Perl self-abuse is bad, but I guess I prefer it to the form of self-abuse that goes, "there are only ten people in the world who will ever run your silly little emacs-lisp function."
Of course, hardly any Windows machines have Perl installed, and they outnumber the Unix machines by a huge factor. Most likely, in a few years C# will be the most prevalently-installed scripting language in the world by an enormous margin.
I'd assume that most Linux boxes, at least, have more than one scripting language installed. I'd be surprised to learn that Perl's prevalence was an order of magnitude ahead of any of the others.
Since I actively discourage people from running my software on Windows, that's fine with me.
Though, if in a few years C# turns out to be the scripting language of choice, that'd be fine with me, since it's basically Java. But I doubt it will be, since it's not really a "scripting" language in the sense that sh and perl are; it has strong typing, so the level of competency required is much higher.
I suspect there are an order of magnitude more people who know Perl than who know the other commonly-available scripting languages.
It's likely that the programming language with the largest install base and competent user base is Microsoft Excel.
I do not predict or recommend that everyone start writing their clever scripts as Excel functions.
I was a beta tester of the "marginalized ghetto-language self-abuse kit",
Good one.
As for me, maybe 90% of the stuff I write talks to the world through port 80, so distribution has not traditionally been a concern of mine. Given that, I'd rather not tear my hair out dealing with C and Perl inanities. Code fast, die young, leave a good-looking CVS tree.
I think Python is becoming a bit more than the geek flavor of the week. A lot of good stuff is being done with it, including a lot of critical systems infrastructure (all the Red Hat install/update stuff, all of BitTorrent..).
The vast majority of code I've written since Java came out has either been in Java or Perl, but I really think that Python is the most rewarding direction to go in for new work, unless I need particularly tightly multithreaded and/or windows-portable stuff.
"Sewer rat may taste like pumpkin pie, but I'll never know."
The whitespace thing is a total dealbreaker.
Heh.
It's like s-exprs delimited with tabs. One could probably write python in a macro, if one was sufficiently nerdy.