© 1997-2002 Jamie Zawinski <email@example.com>
In this document, I describe what is, in my humble but correct opinion, the best known algorithm for threading messages (that is, grouping messages together in parent/child relationships based on which messages are replies to which others.) This is the threading algorithm that was used in Netscape Mail and News 2.0 and 3.0, and in Grendel.
Sadly, my C implementation of this algorithm is not available, because it was purged during the 4.0 rewrite, and Netscape refused to allow me to free the 3.0 source code.
However, my Java implementation is available in the Grendel source. You can find a descendant of that code on ftp.mozilla.org. Here's the original source release: grendel-1998-09-05.tar.gz; and a later version, ported to more modern Java APIs: grendel-1999-05-14.tar.gz. The threading code is in view/Threader.java. See also IThreadable and TestThreader. (The mailsum code in storage/MailSummaryFile.java and the MIME parser in the mime/ directory may also be of interest.)
This is not the algorithm that Netscape 4.x uses, because this is another area where the 4.0 team screwed the pooch, and instead of just continuing to use the existing working code, replaced it with something that was bloated, slow, buggy, and incorrect. But hey, at least it was in C++ and used databases!
This algorithm is also described in the imapext-thread Internet Draft: Mark Crispin and Kenneth Murchison formalized my description of this algorithm, and propose it as the THREAD extension to the IMAP protocol (the idea being that the IMAP server could give you back a list of messages in a pre-threaded state, so that it wouldn't need to be done on the client side.) If you find my description of this algorithm confusing, perhaps their restating of it will be more to your taste.
First some background on the headers involved.
The In-Reply-To header was originally defined by RFC 822, the 1982 standard for mail messages. In 2001, its definition was tightened up by RFC 2822.
RFC 822 defined the In-Reply-To header as, basically, a free-text header. The syntax of it allowed it to contain basically any text at all. The following is, literally, a legal RFC 822 In-Reply-To header:
So you're not guaranteed to be able to parse anything useful out of In-Reply-To if it exists, and even if it contains something that looks like a Message-ID, it might not be (especially since Message-IDs and email addresses have identical syntax.)
However, most of the time, In-Reply-To headers do have something useful in them. Back in 1997, I grepped over a huge number of messages and collected some damned lies, I mean, statistics, on what kind of In-Reply-To headers they contained. The results:
In a survey of 22,950 mail messages with In-Reply-To headers:
|18,396||had at least one occurrence of <>-bracketed text.|
|4,554||had no <>-bracketed text at all (just names and dates.)|
|714||contained one <>-bracketed addr-spec and no message IDs.|
|4||contained multiple message IDs.|
|1||contained one message ID and one <>-bracketed addr-spec.|
The most common forms of In-Reply-To seemed to be:
|31%||NAME's message of TIME <ID@HOST>|
|9%||<ID@HOST> from NAME at "TIME"|
|8%||USER's message of TIME <ID@HOST>|
|7%||USER's message of TIME|
|6%||Your message of "TIME"|
|17%||hundreds of other variants (average 0.4% each?)|
Of course these numbers are very much dependent on the sample set, which, in this case, was probably skewed toward Unix users, and/or toward people who had been on the net for quite some time (due to the age of the archives I checked.)
However, this seems to indicate that it's not unreasonable to
assume that, if there is an In-Reply-To field, then the first
<>-bracketed text found therein is the Message-ID of the
parent message. It is safe to assume this, that is, so long as
you still exhibit reasonable behavior when that assumption turns
out to be wrong, which will happen a small- RFC 2822, the
RFC 822, updated the definition of In-Reply-To: by the more modern
standard, In-Reply-To may contain only message IDs. There will usually
be only one, but there could be more than one: these are the IDs of the
messages to which this one is a direct reply (the idea being that you
might be sending one message in reply to several others.)
RFC 2822, the successor to RFC 822, updated the definition of In-Reply-To: by the more modern standard, In-Reply-To may contain only message IDs. There will usually be only one, but there could be more than one: these are the IDs of the messages to which this one is a direct reply (the idea being that you might be sending one message in reply to several others.)
The References header was defined by RFC 822 in 1982. It was defined in, effectively, the same way as the In-Reply-To header was defined: which is to say, its definition was pretty useless. (Like In-Reply-To, its definition was also tightened up in 2001 by RFC 2822.)
However, the References header was also defined in 1987 by RFC 1036 (section 2.2.5), the standard for USENET news messages. That definition was much tighter and more useful than the RFC 822 definition: it asserts that this header contain a list of Message-IDs listing the parent, grandparent, great-grandparent, and so on, of this message, oldest first. That is, the direct parent of this message will be the last element of the References header.
It is not guaranteed to contain the entire tree back to the root-most message in the thread: news readers are allowed to truncate it at their discretion, and the manner in which they truncate it (from the front, from the back, or from the middle) is not defined.
Therefore, while there is useful info in the References header, it
is not uncommon for multiple messages in the same thread to have
seemingly- RFC 2822
updated the mail standard to have the same semantics of References
as the news standard,
RFC 2822 updated the mail standard to have the same semantics of References as the news standard, RFC 1036.
In practice, if you ever see a References header in a mail message, it will follow the RFC 1036 (and RFC 2822) definition rather than the RFC 822 definition. Because the References header both contains more information and is easier to parse, many modern mail user agents generate and use the References header in mail instead of (or in addition to) In-Reply-To, and use the USENET semantics when they do so.
You will generally not see In-Reply-To in a news message, but it can occasionally happen, usually as a result of mail/news gateways.
So, any sensible threading software will have the ability to take both In-Reply-To and References headers into account.
Note: RFC 2822 (section 3.6.4) says that a References field should contain the contents of the parent message's References field, followed by the contents of the parent's Message-ID field (in other words, the References field should contain the path through the thread.) However, I've been informed that recent versions of Eudora violate this standard: they put the parent Message-ID in the In-Reply-To header, but do not duplicate it in the References header: instead, the References header contains the grandparent, great-grand-parent, etc.
This implies that to properly reconstruct the thread of a message in the face of this nonstandard behavior, we need to append any In-Reply-To message IDs to References.
This algorithm consists of five main steps, and each of those steps is somewhat complicated. However, once you've wrapped your brain around it, it's not really that complicated, considering what it does.
In defense of its complexity, I can say this:
Well, enough with the disclaimers.
|Message message;||// (may be null)|
|Container child;||// first child|
|Container next;||// next element in sibling list, or null|
|ID message_id;||// the ID of this message|
|ID *references;||// list of IDs of parent messages|
The References field is populated from the ``References'' and/or ``In-Reply-To'' headers. If both headers exist, take the first thing in the In-Reply-To header that looks like a Message-ID, and append it to the References header.
If there are multiple things in In-Reply-To that look like Message-IDs, only use the first one of them: odds are that the later ones are actually email addresses, not IDs.
These ID objects can be strings, or they can be any other token on which you can do meaningful equality comparisons.
Only two things need to be done with the subject strings: ask whether they begin with ``Re:'', and compare the non-Re parts for equivalence. So you can get away with interning or otherwise hashing these, too. (This is a very good idea: my code does this so that I can use == instead of strcmp inside the loop.)
The ID objects also don't need to be strings, for the same reason. They can be hashes or numeric indexes or anything for which equality comparisons hold, so it's way faster if you can do pointer-equivalence comparisons instead of strcmp.
The reason the Container and Message objects are separate is because the Container fields are only needed during the act of threading: you don't need to keep those around, so there's no point in bulking up every Message structure with them.
At presentation-time, these will show up as unselectable ``parent'' containers, for example, if we have the thread
-- A |-- B \-- C -- Dand we know about messages B and C, but their common parent A does not exist, there will be a placeholder for A, to group them together, and prevent D from seeming to be a sibling of B and C.
These ``dummy'' messages only ever occur at depth 0.
Note that this could cause this message to now have no parent, if it has no references field, but some message referred to it as the non-first element of its references. (Which would have been some kind of lie...)
Note that at all times, the various ``parent'' and ``child'' fields must be kept inter-consistent.
Walk over the elements of id_table, and gather a list of the Container objects that have no parents.
Note: Normally such containers won't occur, but they can show up when two messages have References lines that disagree. For example, assuming A and B are messages, and 1, 2, and 3 are references for messages we haven't seen:
There is ambiguity as to whether 3 is a child of 1 or of 2. So, depending on the processing order, we might end up with either
-- 1 |-- 2 \-- 3 |-- A \-- Bor
-- 1 |-- 2 <--- non root childless container! \-- 3 |-- A \-- B
Do not promote the children if doing so would promote them
to the root
If any two members of the root set have the same subject, merge them. This is so that messages which don't have References headers at all still get threaded (to the extent possible, at least.)
For each Container in the root set:
(People who reply to messages without using ``Re:'' and without using a References line will break this slightly. Those people suck.)
(It has occurred to me that taking the date or message number into account would be one way of resolving some of the ambiguous cases, but that's not altogether straightforward either.)
You might be wondering what Netscape Confusicator 4.0 broke. Well, basically they never got threading working right. Aside from crashing, corrupting their databases files, and general bugginess, the fundamental problem had been twofold:
Plus my pet peeve,
That is, 4.0 gives you these choices: ``Sort by Date; Sort by Subject; Sort by message number; or Thread.'' Where they assume that ``Thread'' implies ``Sort by Date.'' So that means that there's no way to see a threaded set of messages that are sorted by message number, or by sender, etc.
There should be options for how to sort the messages; and then, orthogonal to that should be the boolean option of whether the messages should be threaded.
I seem to recall there being some other problem that was a result of the thread hierarchy being stored in the database, instead of computed as needed in realtime (there were was some kind of ordering or stale-data issue that came up?) but maybe they finally managed to fix that.
My C version of this code was able to thread 10,000 messages in
less than half a second on a low-end
Also bunk is the idea that databases are needed for ``scalability.'' This code can thread 100,000 messages without a horrible delay, and the fact is, if you're looking at a 100,000 message folder (or for that matter, if you're running Confusicator at all), you're doing so on a machine that has sufficient memory to hold these structures in core. Also consider the question of whether your GUI toolkit contains a list/outliner widget that can display a million elements in the first place. (The answer is probably ``no.'') Also consider whether you have ever in your life seen a single folder that has a million messages in it, and that further, you've wanted to look at all at once (rather than only looking at the most recent 100,000 messages to arrive in that newsgroup...)
In short, all the arguments I've heard for using databases to implement threading and mbox summarization are solving problems that simply don't exist. Show me a real-world situation where the above technique actually falls down, and then we'll talk.
Just say no to databases!