Configuring Netscape Mail On Unix:
Why the Content-Length Format is Bad

(a humble opinion)


Message-ID: <319CEA7A.7A79@netscape.com>
Date: Fri, 17 May 1996 14:07:06 -0700
From: Jamie Zawinski <jwz@netscape.com>
Newsgroups: comp.mail.headers
Subject: Re: "From_" specification

RFCs specify internet protocols, that is, on-the-wire formats. The thing that the original poster is looking for is a description of the BSD Mailbox file format (which is not something an RFC would cover.)

But here's the good news: there is no true specification of this file format, just a collection of word-of-mouth behaviors of the various programs that have used the format over the last few decades.

Essentially the only safe way to parse that file format is to consider all lines which begin with the characters ``From '' (From-space), which are preceded by a blank line or beginning-of-file, to be the division between messages. That is, the delimiter is "\n\nFrom .*\n" except for the very first message in the file, where it is "^From .*\n".
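As a rough illustration of that rule, here is a minimal Python sketch. It assumes the whole mailbox fits in memory as bytes and uses bare "\n" line endings; the names DELIM and split_mbox are made up for this example and don't come from any real mail program.

```python
import re

# A "From " line is a delimiter only at beginning-of-file or right after
# a blank line. Nothing beyond the first five characters is inspected.
DELIM = re.compile(rb"(?:\A|(?<=\n\n))From [^\n]*\n")

def split_mbox(data: bytes) -> list:
    """Return the raw messages, without their "From " separator lines."""
    seps = list(DELIM.finditer(data))
    messages = []
    for i, sep in enumerate(seps):
        # A message runs from the end of its separator line to the start
        # of the next separator (or to end-of-file for the last one).
        end = seps[i + 1].start() if i + 1 < len(seps) else len(data)
        messages.append(data[sep.end():end])
    return messages

sample = (
    b"From a@b Fri May 17\nSubject: one\n\nbody\nFrom here on, text\n\n"
    b"From c@d whatever\nSubject: two\n\nhi\n"
)
msgs = split_mbox(sample)
```

In this sketch the blank line that precedes a separator stays attached to the end of the previous message. The body line ``From here on, text'' is not taken as a delimiter, because no blank line precedes it.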

Some people will tell you that you should do stricter parsing on those lines: check for user names and dates and so on. They are wrong. The random crap that has traditionally been dumped into that line is without bound; comparing the first five characters is the only safe and portable thing to do. Usually, but not always, the next token on the line after ``From '' will be a user-id, or email address, or UUCP path, and usually the next thing on the line will be a date specification, in some format, and usually there's nothing after that. But you can't rely on any of this.

In the BSD format, the only safe way to add a message to a file is to mangle occurrences of the ``From '' delimiter in the body of messages to some other string, usually ``>From ''. This is mangling, not quoting, because it's not a reversible process (since ``>From '' is not also quoted.)
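A small Python sketch of that mangling, and of why it can't be undone, might look like this. The function name mangle_from is hypothetical, and this version mangles every body line that starts with ``From ''; some programs are pickier about which lines they touch.

```python
def mangle_from(body: str) -> str:
    """Prefix every body line that starts with "From " with ">"."""
    lines = body.split("\n")
    return "\n".join(">" + ln if ln.startswith("From ") else ln for ln in lines)

# Two different bodies: one where the sender typed "From ", and one where
# the sender really did type ">From ".
typed_from = "From the top:\nhello\n"
typed_quoted = ">From the top:\nhello\n"

stored_a = mangle_from(typed_from)
stored_b = mangle_from(typed_quoted)
```

Both bodies end up stored as the same text, so a reader has no way to tell which one the sender actually wrote.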

Now, there are actually two very similar-looking file formats. One is the BSD format, which I've described. The other, which one might as well call the ``content-length'' format, is used by some SYSV-derived systems, notably Solaris. It's very similar, but subtly incompatible. This format does not quote ``From '' lines, but instead relies on a Content-Length header in the message proper to indicate the exact byte-position of the end of each message.

This latter format is non-portable, easily-corruptible, and overall, brain-damaged (that's a technical term.) But I'll refrain from ranting about it again right now...


Message-ID: <319D3B7A.6201@netscape.com>
Date: Fri, 17 May 1996 19:52:42 -0700
From: Jamie Zawinski <jwz@netscape.com>
Newsgroups: comp.mail.headers
Subject: Re: "From_" specification

I'm not sure exactly what you're trying to say, but I'll clarify what I meant: I'm not saying that the BSD Mailbox format is good. Just that the Content-Length variant of that format is worse.

Ok, so someone took the From_ format, and extended it to not require mangling by adding a length indicator to the format. At first glance, this may sound simple and elegant, but it breaks the world, and one shouldn't encourage its use to spread.

The thing that breaks is taking an existing, widely-implemented format, and adding a requirement that it have a length indicator. This means that any existing software that already thinks it knows how to manipulate that format is going to damage the file (any change to the data will cause the length indicator to be wrong with respect to the new specification but not with respect to the old specification.)

If the content-length-based format were not otherwise indistinguishable from the ``From '' format, there wouldn't be a problem; the old software would simply fail to work with this new file format, instead of ``corrupting'' the documents (in quotes, because it's really just a matter of which spec you're following.)

Also, mailboxes are by their nature a textual format, but the content-length header measures in bytes rather than lines. This means that if you move the file to a system which has a different end-of-line representation (Windows <=> Mac, or Windows <=> Unix) then the content-lengths will suddenly be wrong, because the linebreaks now take two bytes instead of one, or vice versa.
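To make the arithmetic concrete, here is a tiny Python sketch. The helper to_crlf is a made-up stand-in for whatever transfer tool rewrites the line endings.

```python
def to_crlf(body: bytes) -> bytes:
    """Convert line endings to CRLF, as a text-mode transfer might."""
    return body.replace(b"\r\n", b"\n").replace(b"\n", b"\r\n")

body = b"Hello,\nthis is a two-line body.\n"
recorded_length = len(body)      # what a Content-Length writer on Unix records

moved = to_crlf(body)            # same text after moving to a CRLF system
drift = len(moved) - recorded_length
```

The text is unchanged, but the body is now one byte longer per line, so the recorded length points two bytes short of the real end of the message.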

It's impossible for a mail client to look at a file, and tell which of the two formats (From_ or Content-Length) it is in; they are programmatically indistinguishable. The presence of a Content-Length header is not enough, because suppose you were on a system which knew nothing at all about that header, and some incoming message just happened to have that header in it. Then that header would end up in your mailbox (because nobody would have known to remove or recalculate it), and it would possibly be incorrect. (Presume further that the header was not just incorrect, but intentionally malicious...)

Stricter parsing of the ``From '' separator line doesn't help either, because there are many, many variations on what goes in that line (since it was never standardized either); and also, some mail readers include that line verbatim when forwarding messages (Sun's MailTool, for example) so a stricter parser wouldn't help that case at all, because message bodies tend to contain valid matches.

Some mail readers attempt to cope with this by recognizing the case where the Content-Length does not land exactly on a message delimiter, and then searching forward and backward for the nearest one; but this is obviously not foolproof, and makes one's parser much less efficient (requiring arbitrary lookahead and backtracking.)
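The first half of that heuristic, checking whether a claimed length lands somewhere plausible, could be sketched in Python as below. The function name and the exact acceptance rules are assumptions made for illustration; real readers differ, for instance on whether the count includes the trailing newline.

```python
def lands_on_boundary(data: bytes, body_start: int, content_length: int) -> bool:
    """True if body_start + content_length points at EOF or at a "From " line."""
    end = body_start + content_length
    if content_length < 0 or end > len(data):
        return False
    rest = data[end:]
    # Accept end-of-file, or an optional newline followed by "From ".
    return rest in (b"", b"\n") or rest.startswith((b"From ", b"\nFrom "))

mbox = b"From a\nContent-Length: 5\n\nhello\nFrom b\nSubject: x\n\nyo\n"
start = mbox.index(b"\n\n") + 2   # first byte of the first message body

ok = lands_on_boundary(mbox, start, 5)       # the header tells the truth
stale = lands_on_boundary(mbox, start, 3)    # header left over from elsewhere
huge = lands_on_boundary(mbox, start, 9999)  # points past end-of-file
```

Passing this check proves nothing: a malicious header can just as easily point at a ``From '' line sitting in the middle of a body.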

Conventional wisdom is, ``if you believe the Content-Length header, I've got a bridge to sell you.''


Message-ID: <33F6624C.B937B92C@netscape.com>
Date: Sat, 16 Aug 1997 19:30:36 -0700
From: Jamie Zawinski <jwz@netscape.com>
Newsgroups: comp.os.linux.development.apps
Subject: Re: /var/spool/mail format?

Those headers are required by RFC 822.

In the absence of those required headers, it can't hurt to try. But if you do that, you should consider it a second pass: try to parse it after identifying it. The only way to identify it is by its first five characters. Don't assume that it's not a message delimiter just because you can't parse a date out of it.

I completely disagree.

I was describing what the historical truth of this file format is: the facts about what is out there in the world today. If you do quoting, not mangling, you are following different rules than have been followed for decades.

If, when I send a message, I type ``>From '' at the beginning of a line, and you treat it as quoting instead of mangling, you will alter the message. And you will alter it in a different way than all the other software out there which is following the ancient de-facto standard.

Whether you treat it as quoting or mangling, either method will cause certain messages to be changed: neither is foolproof, as both make the assumption that all software is following the same rules, whichever they may be.
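A short Python sketch shows both failure modes. The three function names are hypothetical: mangle is the traditional rule, while quote and unquote stand for the reversible ``improved'' scheme.

```python
import re

def mangle(body: str) -> str:
    """Traditional rule: "From " becomes ">From "; nothing else is touched."""
    return re.sub(r"^From ", ">From ", body, flags=re.M)

def quote(body: str) -> str:
    """Reversible scheme: add one ">" to any ">*From " line."""
    return re.sub(r"^(>*From )", r">\1", body, flags=re.M)

def unquote(body: str) -> str:
    """Inverse of quote(): strip one ">" from any ">+From " line."""
    return re.sub(r"^>(>*From )", r"\1", body, flags=re.M)

typed = ">From the manual:\n"          # the sender really typed ">From "

# Traditional writer, "improved" reader: the message is altered.
mixed_1 = unquote(mangle(typed))

# "Improved" writer, traditional reader (which never un-mangles): also altered.
mixed_2 = quote(typed)

# Only when both ends agree on the reversible scheme does it round-trip.
matched = unquote(quote(typed))
```

Here mixed_1 loses the ``>'' the sender typed, mixed_2 gains one, and only the matched pair gives back the original text.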

I say, since neither method is foolproof, use the older and more prevalent rules, since that's what will get you maximal interoperability.

You may think that turning ``>From '' into ``From '' is somehow more harmless than turning ``From '' into ``>From '', but I don't see why.

For example, consider the case of a message with a clear-signed cryptographic signature (multipart/signed, say.) Any change to the body will cause the signature verification to fail. Now, suppose that the software doing the signing, being aware of the prevalence of From-mangling, were to do that mangling before signing. This would interoperate perfectly with software that did mangling in the traditional way (like, say... sendmail, or /bin/Mail).

But software that tried to ``improve'' matters by changing the rules -- by mangling ``>From '' to ``>>From '', or by mangling ``>From '', to ``From '' -- would cause the signature to no longer match.

It is in nobody's best interest for random pieces of software to try to ``improve'' on the BSD mbox format. It's a crummy format, but it is what it is. If you don't want to use it, don't -- use something else. But please, use something else that can be self-identified, that is programmatically distinguishable from the historical format, so that software can know what it's parsing.

Don't make us need to guess even more than we do now.

There's a lot of software that thinks the delimiter is ``\nFrom .*\n''. After all, there's no official document describing the format, so every implementor over the years has had to guess, to rediscover it for themselves.

Saying ``don't process the file'' isn't very helpful. ``Be strict in what you send and lenient in what you accept'' and all that.

Doing different things based on the local architecture doesn't work either. If /var/mail/ is NFS mounted, then what matters is not the architecture the mail client is running on, but rather, the architecture that the MDA is running on (which may or may not be the machine which has exported the file system.)

Besides /var/mail, people's home directories (and thus their saved mail) are also often shared between different machines. And most people don't throw away all of their old mail when they change machines or jobs.

And if a folder written in the Content-Length format is later touched by software using the BSD format (or vice versa) then you've got one folder with messages in both formats.

You can't win.

Such a test will tell you what the MDA running today does to /var/mail/. It won't tell you what will happen when the user logs in to a different workstation at the same site; it won't tell you a thing about any of the user's saved mail.

