message threading.
© 1997-2002 Jamie Zawinski <jwz@jwz.org>


In this document, I describe what is, in my humble but correct opinion, the best known algorithm for threading messages (that is, grouping messages together in parent/child relationships based on which messages are replies to which others.) This is the threading algorithm that was used in Netscape Mail and News 2.0 and 3.0, and in Grendel.

Sadly, my C implementation of this algorithm is not available, because it was purged during the 4.0 rewrite, and Netscape refused to allow me to free the 3.0 source code.

However, my Java implementation is available in the Grendel source. You can find a descendant of that code on ftp.mozilla.org. Here's the original source release: grendel-1998-09-05.tar.gz; and a later version, ported to more modern Java APIs: grendel-1999-05-14.tar.gz. The threading code is in view/Threader.java. See also IThreadable and TestThreader. (The mailsum code in storage/MailSummaryFile.java and the MIME parser in the mime/ directory may also be of interest.)

This is not the algorithm that Netscape 4.x uses, because this is another area where the 4.0 team screwed the pooch, and instead of just continuing to use the existing working code, replaced it with something that was bloated, slow, buggy, and incorrect. But hey, at least it was in C++ and used databases!

This algorithm is also described in the imapext-thread Internet Draft: Mark Crispin and Kenneth Murchison formalized my description of this algorithm, and propose it as the THREAD extension to the IMAP protocol (the idea being that the IMAP server could give you back a list of messages in a pre-threaded state, so that it wouldn't need to be done on the client side.) If you find my description of this algorithm confusing, perhaps their restating of it will be more to your taste.

I'm told this algorithm is also used in the Evolution and Balsa mail readers. Also, Simon Cozens and Richard Clamp have written a Perl version; Frederik Dietz has written a Ruby version; and Max Ogden has written a JavaScript version. (I've not tested any of these implementations, so I make no claims as to how faithfully they implement it.)


First some background on the headers involved.

In-Reply-To:

The In-Reply-To header was originally defined by RFC 822, the 1982 standard for mail messages. In 2001, its definition was tightened up by RFC 2822.

RFC 822 defined the In-Reply-To header as, basically, a free-text header. The syntax of it allowed it to contain basically any text at all. The following is, literally, a legal RFC 822 In-Reply-To header:

So you're not guaranteed to be able to parse anything useful out of In-Reply-To if it exists, and even if it contains something that looks like a Message-ID, it might not be (especially since Message-IDs and email addresses have identical syntax.)

However, most of the time, In-Reply-To headers do have something useful in them. Back in 1997, I grepped over a huge number of messages and collected some damned lies, I mean, statistics, on what kind of In-Reply-To headers they contained. The results:

Of course these numbers are very much dependent on the sample set, which, in this case, was probably skewed toward Unix users, and/or toward people who had been on the net for quite some time (due to the age of the archives I checked.)

However, this seems to indicate that it's not unreasonable to assume that, if there is an In-Reply-To field, then the first <>-bracketed text found therein is the Message-ID of the parent message. It is safe to assume this, that is, so long as you still exhibit reasonable behavior when that assumption turns out to be wrong, which will happen a small-but-not-insignificant portion of the time.

RFC 2822, the successor to RFC 822, updated the definition of In-Reply-To: by the more modern standard, In-Reply-To may contain only message IDs. There will usually be only one, but there could be more than one: these are the IDs of the messages to which this one is a direct reply (the idea being that you might be sending one message in reply to several others.)

References:

The References header was defined by RFC 822 in 1982. It was defined in, effectively, the same way as the In-Reply-To header was defined: which is to say, its definition was pretty useless. (Like In-Reply-To, its definition was also tightened up in 2001 by RFC 2822.)

However, the References header was also defined in 1987 by RFC 1036 (section 2.2.5), the standard for USENET news messages. That definition was much tighter and more useful than the RFC 822 definition: it asserts that this header contain a list of Message-IDs listing the parent, grandparent, great-grandparent, and so on, of this message, oldest first. That is, the direct parent of this message will be the last element of the References header.

It is not guaranteed to contain the entire tree back to the root-most message in the thread: news readers are allowed to truncate it at their discretion, and the manner in which they truncate it (from the front, from the back, or from the middle) is not defined.

Therefore, while there is useful info in the References header, it is not uncommon for multiple messages in the same thread to have seemingly-contradictory References data, so threading code must make an effort to do the right thing in the face of conflicting data.

RFC 2822 updated the mail standard to have the same semantics of References as the news standard, RFC 1036.

In practice, if you ever see a References header in a mail message, it will follow the RFC 1036 (and RFC 2822) definition rather than the RFC 822 definition. Because the References header both contains more information and is easier to parse, many modern mail user agents generate and use the References header in mail instead of (or in addition to) In-Reply-To, and use the USENET semantics when they do so.

You will generally not see In-Reply-To in a news message, but it can occasionally happen, usually as a result of mail/news gateways.

So, any sensible threading software will have the ability to take both In-Reply-To and References headers into account.

Note: RFC 2822 (section 3.6.4) says that a References field should contain the contents of the parent message's References field, followed by the contents of the parent's Message-ID field (in other words, the References field should contain the path through the thread.) However, I've been informed that recent versions of Eudora violate this standard: they put the parent Message-ID in the In-Reply-To header, but do not duplicate it in the References header: instead, the References header contains the grandparent, great-grand-parent, etc.

This implies that to properly reconstruct the thread of a message in the face of this nonstandard behavior, we need to append any In-Reply-To message IDs to References.


The Algorithm

This algorithm consists of five main steps, and each of those steps is somewhat complicated. However, once you've wrapped your brain around it, it's not really that complicated, considering what it does.

In defense of its complexity, I can say this:

Well, enough with the disclaimers.


Definitions:

The Algorithm:

  1. For each message:

    1. If id_table contains an empty Container for this ID:
      • Store this message in the Container's message slot.
      Else:
      • Create a new Container object holding this message;
      • Index the Container by Message-ID in id_table.

    2. For each element in the message's References field:

      • Find a Container object for the given Message-ID:
        • If there's one in id_table use that;
        • Otherwise, make (and index) one with a null Message.

      • Link the References field's Containers together in the order implied by the References header.
        • If they are already linked, don't change the existing links.
        • Do not add a link if adding that link would introduce a loop: that is, before asserting A->B, search down the children of B to see if A is reachable, and also search down the children of A to see if B is reachable. If either is already reachable as a child of the other, don't add the link.

    3. Set the parent of this message to be the last element in References. Note that this message may have a parent already: this can happen because we saw this ID in a References field, and presumed a parent based on the other entries in that field. Now that we have the actual message, we can be more definitive, so throw away the old parent and use this new one. Find this Container in the parent's children list, and unlink it.

      Note that this could cause this message to now have no parent, if it has no references field, but some message referred to it as the non-first element of its references. (Which would have been some kind of lie...)

      Note that at all times, the various ``parent'' and ``child'' fields must be kept inter-consistent.

  2. Find the root set.

    Walk over the elements of id_table, and gather a list of the Container objects that have no parents.

  3. Discard id_table. We don't need it any more.

  4. Prune empty containers.
    Recursively walk all containers under the root set.
    For each container:
    1. If it is an empty container with no children, nuke it.

      Note: Normally such containers won't occur, but they can show up when two messages have References lines that disagree. For example, assuming A and B are messages, and 1, 2, and 3 are references for messages we haven't seen:

        A has references: 1, 2, 3
        B has references: 1, 3

      There is ambiguity as to whether 3 is a child of 1 or of 2. So, depending on the processing order, we might end up with either

            -- 1
               |-- 2
                   \-- 3
                       |-- A
                       \-- B
      or
            -- 1
               |-- 2            <--- non root childless container!
               \-- 3
                   |-- A
                   \-- B

    2. If the Container has no Message, but does have children, remove this container but promote its children to this level (that is, splice them in to the current child list.)

      Do not promote the children if doing so would promote them to the root set -- unless there is only one child, in which case, do.

  5. Group root set by subject.

    If any two members of the root set have the same subject, merge them. This is so that messages which don't have References headers at all still get threaded (to the extent possible, at least.)

    1. Construct a new hash table, subject_table, which associates subject strings with Container objects.

    2. For each Container in the root set:

      • Find the subject of that sub-tree:
        • If there is a message in the Container, the subject is the subject of that message.
        • If there is no message in the Container, then the Container will have at least one child Container, and that Container will have a message. Use the subject of that message instead.
        • Strip ``Re:'', ``RE:'', ``RE[5]:'', ``Re: Re[4]: Re:'' and so on.
        • If the subject is now "", give up on this Container.
        • Add this Container to the subject_table if:
          • There is no container in the table with this subject, or
          • This one is an empty container and the old one is not: the empty one is more interesting as a root, so put it in the table instead.
          • The container in the table has a ``Re:'' version of this subject, and this container has a non-``Re:'' version of this subject. The non-re version is the more interesting of the two.

    3. Now the subject_table is populated with one entry for each subject which occurs in the root set. Now iterate over the root set, and gather together the difference.

      For each Container in the root set:

      • Find the subject of this Container (as above.)
      • Look up the Container of that subject in the table.
      • If it is null, or if it is this container, continue.

      • Otherwise, we want to group together this Container and the one in the table. There are a few possibilities:

        • If both are dummies, append one's children to the other, and remove the now-empty container.

        • If one container is a empty and the other is not, make the non-empty one be a child of the empty, and a sibling of the other ``real'' messages with the same subject (the empty's children.)

        • If that container is a non-empty, and that message's subject does not begin with ``Re:'', but this message's subject does, then make this be a child of the other.

        • If that container is a non-empty, and that message's subject begins with ``Re:'', but this message's subject does not, then make that be a child of this one -- they were misordered. (This happens somewhat implicitly, since if there are two messages, one with Re: and one without, the one without will be in the hash table, regardless of the order in which they were seen.)

        • Otherwise, make a new empty container and make both msgs be a child of it. This catches the both-are-replies and neither-are-replies cases, and makes them be siblings instead of asserting a hierarchical relationship which might not be true.

          (People who reply to messages without using ``Re:'' and without using a References line will break this slightly. Those people suck.)

      (It has occurred to me that taking the date or message number into account would be one way of resolving some of the ambiguous cases, but that's not altogether straightforward either.)

  6. Now you're done threading!
    Specifically, you no longer need the ``parent'' slot of the Container object, so if you wanted to flush the data out into a smaller, longer-lived structure, you could reclaim some storage as a result.

  7. Now, sort the siblings.
    At this point, the parent-child relationships are set. However, the sibling ordering has not been adjusted, so now is the time to walk the tree one last time and order the siblings by date, sender, subject, or whatever. This step could also be merged in to the end of step 4, above, but it's probably clearer to make it be a final pass. If you were careful, you could also sort the messages first and take care in the above algorithm to not perturb the ordering, but that doesn't really save anything.


You might be wondering what Netscape Confusicator 4.0 broke. Well, basically they never got threading working right. Aside from crashing, corrupting their databases files, and general bugginess, the fundamental problem had been twofold:

Plus my pet peeve,

I seem to recall there being some other problem that was a result of the thread hierarchy being stored in the database, instead of computed as needed in realtime (there were was some kind of ordering or stale-data issue that came up?) but maybe they finally managed to fix that.

My C version of this code was able to thread 10,000 messages in less than half a second on a low-end (90 MHz) Pentium, so the argument that it has to be in the database for efficiency is pure bunk.

Also bunk is the idea that databases are needed for ``scalability.'' This code can thread 100,000 messages without a horrible delay, and the fact is, if you're looking at a 100,000 message folder (or for that matter, if you're running Confusicator at all), you're doing so on a machine that has sufficient memory to hold these structures in core. Also consider the question of whether your GUI toolkit contains a list/outliner widget that can display a million elements in the first place. (The answer is probably ``no.'') Also consider whether you have ever in your life seen a single folder that has a million messages in it, and that further, you've wanted to look at all at once (rather than only looking at the most recent 100,000 messages to arrive in that newsgroup...)

In short, all the arguments I've heard for using databases to implement threading and mbox summarization are solving problems that simply don't exist. Show me a real-world situation where the above technique actually falls down, and then we'll talk.

Just say no to databases!


[ up ]