mail summary files.
© 1998-2000 Jamie Zawinski <jwz@jwz.org>


``The awful thing about getting it right the first time is that nobody realizes how hard it was.''
-- Unknown.

In this modern world, the only sensible thing for someone implementing a mail reader to do is use BSD `mbox' files to store mail messages, because that is the de-facto standard used by almost every popular mail reader, on Unix, Windows, and Mac.

All versions of Netscape have always used the mbox file format to store messages: ``converting'' to or from Netscape's folder format is simply a matter of copying the files to or from the place you keep your /bin/Mail or Eudora (etc) folders. (Just remember to do Empty Trash first, to flush out recent changes.)

However, to make use of the mbox format be fast, some tricks are necessary.

The mbox file format is basically one big text file containing every message of the ``folder'' sequentially, with separator lines between them. This has obvious efficiency problems: you don't want to spend a lot of time doing linear searches through the file, and you don't want to ever pull the whole file into memory at once, and deletions are expensive.

So, you have to summarize it, and be lazy about updates. My approach, when designing this for Netscape Mail in the 2.0 and 3.0 timeframe, was to identify the pieces of information that needed to be gotten at quickly and repeatedly, and store those in a cache. Most other information was only needed on a per-message basis (not when operating on many messages at once) so that latter sort was simply parsed from the message on demand, as the message was being displayed.

Think of it this way: when you are looking at a folder and a message in it, you're looking at two windows: one lists all the messages in the folder, including a brief (one line) summary of their sender, date, etc; and the other window shows the full headers and body of the message that you've selected. That second window is easy: to present that information, you can just re-parse that whole message at display time (there's only one, so it's fast.) But to display the other window, the message list window, you need a small bit of info from every message in the folder, and you don't want to have to parse the entire folder every time.

So, we have these ``summary files,'' one summary file for each folder, which provide the necessary info to generate that message list quickly.

Each summary file is a true cache, in that if it is deleted (accidentally or on purpose) it will be regenerated when it is needed again (which might take a little while, but which only needs to be done once.)

In Netscape 4.0, the new team went both C++ and Database happy, threw away the tightly-tuned mail summary files I had designed, and generally screwed the pooch raw.

My code had summary files that were on the order of 2% of the size of the folder they were summarizing, and was blazingly fast in all respects. The 4.0 code had an overhead closer to 30% (last time I checked) and was insanely slow, not to mention extremely fragile: their summary files got corrupted all the time.

But they were databases and C++ so they were better.

But I'm not bitter.

Here are some messages I exchanged with a member of the 5.0 team, trying to explain to him how the 2.0/3.0 code worked, and why he shouldn't follow in the 4.0 team's footsteps. It was a hard sell, since he's a database person, and as far as I've seen, once those database worms eat into your brain, it's hard to ever get anything practical done again. To a database person, every nail looks like a thumb. Or something like that.


The Netscape 2.x / 3.x code discussed here was never released to the public, but my port of that code to Java was released as part of the Grendel project. You can find a descendant of that code on ftp.mozilla.org. Here's the original source release: grendel-1998-09-05.tar.gz; and a later version, ported to more modern Java APIs: grendel-1999-05-14.tar.gz.

The mailsum code is in storage/MailSummaryFile.java and related classes. The mime/ parser and threading code may also be of interest.


Message-ID: <367999AE.556B2FF4@mozilla.org>
Subject: Re: [MORK] some summary file problem statements
Date: Thu, 17 Dec 1998 15:54:22 -0800
From: Jamie Zawinski <jwz@mozilla.org>
Newsgroups: netscape.public.mozilla.mail-news

For the record, the summary file format used by Netscape 2.0 and 3.0 is documented at MailSummaryFileCheddar.java#35

I am still of the opinion that a custom in-memory read-the-whole-file type of cache is the way to go, and I think that the summary files that 2.0/3.0 used were brilliantly speedy and compact, if I do say so myself. (I have disk usage statistics on this format that has been known to make young girls squeal, but I don't hav them handy right now.)

When I was reimplementing this stuff for Grendel, and doing the only sane thing and ignoring the whole 4.x summary file debacle, I thought of a way to reduce the summary file sizes even more than 2.x/3.x did (by not storing message IDs, but instead storing MD5 hashes of them); that new format, which never really left the building, is documented here: MailSummaryFileGrendel.java#39

I distrust the idea of using a general-purpose database for mail summary files. I smell overengineering.

--
Jamie Zawinski
jwz@mozilla.org    http://www.mozilla.org/   (work)
jwz@jwz.org        http://www.jwz.org/       (play)


Message-ID: <36799AB1.58915311@mozilla.org>
Subject: Re: [MORK] some summary file problem statements
Date: Thu, 17 Dec 1998 15:58:41 -0800
From: Jamie Zawinski <jwz@mozilla.org>
Newsgroups: netscape.public.mozilla.mail-news

You are summarizing mail folders. Mail folders are composed of mail messages. Mail messages have headers which are composed of short-line 8-bit (7-bit, technically) quantities. Therefore, the summary files need only store data composed of printable ASCII (or perhaps Latin1) characters. So how do you do i18n stuff? The same way you do it in mail headers: MIME-2 encoding.

I still smell overengineering.

--
Jamie Zawinski
jwz@mozilla.org    http://www.mozilla.org/   (work)
jwz@jwz.org        http://www.jwz.org/       (play)


Message-ID: <3679AD90.53F2FFB2@jwz.org>
Subject: Re: [MORK] some summary file problem statements
Date: Thu, 17 Dec 1998 17:19:12 -0800
From: Jamie Zawinski <jwz@mozilla.org>
Newsgroups: netscape.public.mozilla.mail-news

Well really the requirement is "small changes should be fast", right? In 2.x/3.x, we did this by writing out the summary file lazily (when switching folders, and other times when file IO would be happening (never as a result of a fast action like toggling the "read" flag.)

In Grendel, we did it from a background thread; when a change was made, a background thread was created and went to sleep, and after 30 seconds or so, the batched up changes would be written out. This was even better, because the UI was still live during I/O.

Right now I'm looking at a folder in 3.0. It has 15,466 messages in it. Selecting this folder takes less than a second (it's hard to eyeball it, but I'd say it takes about 1/2 to 3/4 second from when I click to when I see the message summary on the screen). The BSD mbox file is 57.2MB (1.2 million lines) and the summary file is 1.3MB (2% of the size of the folder.)

This is on a P266 with a local IDE disk (Linux.)

Oh, if I have threading on, it takes about 1.5 seconds to select the folder. (Note that unlike 4.x, there is no "pre-threading" in the summary files; this is all done in core, from the first principles of the Message-ID and References data stored in the summary file.)

An MD5 hash is 128 bit quantity. In my code, I was only using 64 bits of that, since I wasn't really worried about it being cryptographically strong. The result is that I had to store 8 bytes per message ID rather than the full message ID string, which tended to average somewhere around 25 or 30 bytes (I don't remember the actual number, but I collected stats back when I was working on this, and it was a big number.)

It's much better than CRC, because MD5 is intended to be cryptographically strong: as far as I know, nobody has yet come up with two strings that have the same MD5 hash; RSA's FAQ (http://www.rsasecurity.com/rsalabs/faq/3-6-6.html) says in part:

More detailed MD5 info: http://www.rsasecurity.com/rsalabs/cryptobytes/ (click on ``Spring 1995,'' the article is on page 5.)

I used MD5 because it's a great hash function, and we had the code handy. It might well be overkill for this purpose.

Now wait, I see two different problems: you need a mail summary file; and you need a representation for address books. It's not obvious to me that those are anywhere *near* the same problem, or that one should try to conflate them and solve them simultaniously. The requirements of the two problems seem pretty dramatically different to me.

On the one hand, it's a drag to do two different implementations, but on the other hand, it's a drag to have one of the two problems be solved badly because of baggage from the other problem; or to have the all-singing-all-dancing solution take longer than two independent solutions together.

I see mail summary files as a very specialized need for a very fast cache for very specific kinds of data and access patterns. I can see how an address book would need something more general than that (and would also have less severe speed and storage requirements.) And I can also see how a general data store for RDF-ish things would be really, really useful. But I'd hate to see either of those latter two desires make mail summary files be bloated, slow, or late.

--
Jamie Zawinski
jwz@mozilla.org    http://www.mozilla.org/   (work)
jwz@jwz.org        http://www.jwz.org/       (play)


Message-ID: <3679C7C9.6F04B9AF@mozilla.org>
Subject: Re: [MORK] some summary file problem statements
Date: Thu, 17 Dec 1998 19:11:05 -0800
From: Jamie Zawinski <jwz@mozilla.org>
Newsgroups: netscape.public.mozilla.mail-news

I think the answer is that it's the time needed for the I/O required to do two things: read the whole summary file for the source folder into memory; plus the time needed to fseek() in the source file and copy the message body bytes from one file to the other.

But note that in most cases, the summary data would already have been loaded, so really the cost would be purely the message-body I/O. If you're dragging messages from one folder to another, then you're gesturing at the source folder, so the source folder is on the screen, which means it summary data is in memory (or you wouldn't be seeing it at all.)

In more detail: in the model used by 2.x/3.x and Grendel, there was an in-memory object representing a mail folder; and the two disk-oriented operations you could perform on it were "read the whole from scratch" or "write the whole thing." (Actually there was a 3rd one, a sort of recovery-mode for when file dates or sizes had changed on us unexpectedly, but pay no attention to the man behind the curtain.)

So the act of copying several messages from folder A to folder B would be:

Now, the current state is that B's summary file indicates in its header, "I am a summary for N messages through byte position X of a file of length Y with date Z."

The next time folder B is selected, we do the error checking -- does the file length and date match? If not, we go into "recovery mode", which I'll continue to handwave over.

If it does match, but the summarized portion is smaller than the length of the file, then the act of creating the in-memory summary object has two passes: first the data is parsed from the summary file; then more data is appended to the in-memory summary object by parsing the file itself, starting at position X (say, 10 messages from the end of the file.)

At this point we have an in-memory summary object for the folder, and if needed, it is marked as dirty. When the time comes to write out dirty summary data, the whole summary file is rewritten from the beginning (which is fast, because summary files are small.)

The act of marking a message deleted in a folder is pretty simple, too; the "deleted" flag is set on the in-memory summary object; and the summary object is marked as dirty. At some point in the future, the summary file will be flushed to disk; and also, we will fseek() through the mbox file itself and overwrite the X-Mozilla-Status headers if they exist. In this way, we maintain the (important) goal that the summary file is a cache -- a redundant store for information that can be recomputed from the mbox file itself.

This only breaks down if the mbox file doesn't contain space for X-Mozilla-Status headers; in that case, there can be situations where crucial information exists only in the summary file. However, this is corrected when the folder is compacted (when Trash is emptied) by rewriting the mbox so that all messages have X-Mozilla-Status.

I suppose, but in 3.x and Grendel, the summary files weren't really thought of as a "database" at all; they were more like a disk representation of a set of objects. Sometimes you'd write those objects to disk, or read them back, and the layer that dealt with the file format had the mission of doing this I/O quickly and using minimal disk space.

It was really more of a "serialization" exercise than a "database" exercise.

--
Jamie Zawinski
jwz@mozilla.org    http://www.mozilla.org/   (work)
jwz@jwz.org        http://www.jwz.org/       (play)


[ up ]