``Intertwingularity is not generally acknowledged -- people
keep pretending they can make things deeply hierarchical, categorizable
and sequential when they can't. Everything is deeply intertwingled.''
-- Ted Nelson
In the following, I outline a potential project to make it easier to
deal with a massive volume of personal messages: excavating, traversing,
relating, reporting, annotating.
I call this hypothetical program ``Intertwingle.''
-
Intertwingle can be seen as a unification of a search tool and an address
book. It is not, however, a mail reader. The presentation of query results
could be done through a mail reader, but the intention is that one's
choice of mail reader should be orthogonal to the use of this tool. The two
kinds of tools just happen to operate on the same data.
The design philosophy is that any time there is a visual representation
of an object, the corresponding object should be accessible with a gesture,
and that chasing links is easier than composing search terms (though both
are needed.)
The target audience is individuals who have a lot of mail. The target
audience is not inhabitants of the corporation, it is people.
Needs which are specific to IS Managers, or to Enterprise Directory Services
are not of interest. This is about the general problem of handling lots of
individual mail. (Whether in the context of personal mail, or job-related
mail, the problem is the same: you've got a lot of it; now what do you do?)
Sharing is an interesting problem, and may be addressed, but I feel it is
explicitly secondary in priority to solving the problem in the non-shared
domain. (But, we should think about it up front, because that kind of
thing tends to be hard to retrofit.)
-
The sheer multitude of representations-of-objects yields a colossal
number of potential links to follow, which is why I anticipate
link-chasing to be a (usually) far easier method of excavation than
searching. For example, here are the headers of a typical message:
    Date:        Sun, 3 Jul 94 16:40:07 PDT
    From:        Jamie Zawinski <jwz@mcom.com>
    To:          eng
    Subject:     printing
    In-Reply-To: Chris Houck's message of Sun 3-Jul-94 13:19:23 -0700
                 <9407032019.AA18853@neon.mcom.com>
    Message-ID:  <19940703093034.jwz@islay.mcom.com>
    References:  <9407032019.AA18853@neon.mcom.com>
There is a great deal of structure there:
- Sun, 3 Jul 94 16:40:07 PDT
- This is a representation of a point in time. From here one can
envision traversing to a list of other messages within some range
of that moment: that hour, that day, that month, that year.
- Jamie Zawinski <jwz@mcom.com>
- This is a description of a particular person. From here one should
be able to easily get to information related to that person: an
address book entry, or a list of all messages sent by them, or sent
to them, or any number of other annotations.
- Jamie Zawinski
- This is a name, not a person, and names are notoriously non-unique.
From here it would be useful to get to a list of all known people
who have claimed that name (from the set of people who are message
senders or recipients.)
- jwz@mcom.com
- This is an email address, not a person, and while one email address
is usually not used by more than one person, it's quite common
for one person to have many email addresses (or many variations on the
same address.) From here it would be useful to get to a list of all
known people who have used that address (from the set of people who are
message senders or recipients) and from there to the set of other
addresses used by that person or those people. One might also find it
useful to get a list of messages associated with this address (while
excluding messages from other addresses of the same person.)
- eng
- This is an email address, yet it happens to be a mailing list.
There is no one person associated with it, yet the set of operations
one might like to perform on it is very similar.
- printing
- This is unstructured text, and what one does with unstructured text is
attempt to match patterns in it. There are any number of other
properties associated with this particular piece of text: it is in a
header field called Subject in a message from Jamie
Zawinski, on Sunday, July 3rd, and so on. All of these are
interesting properties that are within one or two link-hops of the text
itself. Their proximity is what makes them interesting.
- Chris Houck
- A name, as above.
- Chris Houck's message
- An ambiguous reference to a message. From here, one should be able to
get to the set of all messages from someone who claimed the name
Chris Houck.
- Chris Houck's message of Sun 3-Jul-94 13:19:23 -0700
- Another reference to a message, probably less ambiguous.
- <9407032019.AA18853@neon.mcom.com>
- <19940703093034.jwz@islay.mcom.com>
- These also are references to particular messages, the least ambiguous
representations so far; however, they are still slightly ambiguous,
since message IDs refer to original messages: there could be multiple
copies of these messages with slightly different headers or other
annotations within the message-store.
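To make the list above concrete, here is a minimal sketch of pulling those link targets out of the example headers. It uses Python's standard `email` package; the function name `link_targets`, the `MSG_ID` pattern, and the dictionary keys are my own inventions for illustration, not part of any existing tool.

```python
import re
from email import message_from_string
from email.utils import parseaddr, parsedate_to_datetime

RAW = """\
Date: Sun, 3 Jul 94 16:40:07 PDT
From: Jamie Zawinski <jwz@mcom.com>
To: eng
Subject: printing
In-Reply-To: Chris Houck's message of Sun 3-Jul-94 13:19:23 -0700
 <9407032019.AA18853@neon.mcom.com>
Message-ID: <19940703093034.jwz@islay.mcom.com>
References: <9407032019.AA18853@neon.mcom.com>

"""

# Hypothetical pattern: anything shaped like <left@right> is treated as a message ID.
MSG_ID = re.compile(r"<[^<>\s]+@[^<>\s]+>")


def link_targets(raw):
    """Return the distinct objects that one message's headers refer to."""
    msg = message_from_string(raw)
    name, address = parseaddr(msg["From"])
    referenced = msg.get("In-Reply-To", "") + " " + msg.get("References", "")
    return {
        "time": parsedate_to_datetime(msg["Date"]),  # a point in time
        "name": name,                                # a name, not a person
        "address": address,                          # an address, not a person
        "recipient": msg["To"].strip(),              # here, a mailing list
        "subject": msg["Subject"],                   # unstructured text
        "message_id": msg["Message-ID"].strip(),
        "refers_to": sorted(set(MSG_ID.findall(referenced))),
    }


targets = link_targets(RAW)
```

Each value in `targets` is a candidate starting point for a traversal. Note that the free-text part of In-Reply-To ("Chris Houck's message of ...") is ignored here; resolving that kind of ambiguous reference is exactly the harder problem described above.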
Any time there is a link, one can imagine an equal but opposite
counter-link: when we talk of reaching lists of objects above, the object
by which we reached that list will always be a member of the list. And
if A is three hops away from D, then D is three hops
away from A, and traversal in both directions should be possible.
However, the object at the other end of the link does not necessarily
encode the reverse path in its usual visual representation. For example,
while messages point to the message to which they are a reply, the parent
doesn't (in itself) point to the children. This implicit relationship must
be made explicit: it must be easy to get from a message to the set of
messages which refer to it. All links must be bidirectional.
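One way to picture "all links must be bidirectional" is an index that records the reverse of every link at the moment the link is stored. The following is only a sketch under that assumption; the class name `LinkIndex`, the relation label, and the example message IDs are hypothetical.

```python
from collections import defaultdict


class LinkIndex:
    """Stores every link together with its reverse, so traversal works both ways."""

    def __init__(self):
        self._forward = defaultdict(set)   # (source, relation) -> targets
        self._backward = defaultdict(set)  # (target, relation) -> sources

    def add(self, source, relation, target):
        # Recording both directions at insert time makes the implicit
        # parent-to-children relationship explicit.
        self._forward[(source, relation)].add(target)
        self._backward[(target, relation)].add(source)

    def targets(self, source, relation):
        return set(self._forward.get((source, relation), ()))

    def sources(self, target, relation):
        return set(self._backward.get((target, relation), ()))


index = LinkIndex()
index.add("<reply-1@example.com>", "in-reply-to", "<parent@example.com>")
index.add("<reply-2@example.com>", "in-reply-to", "<parent@example.com>")

# The parent never mentioned its children, but the index can still find them.
children = index.sources("<parent@example.com>", "in-reply-to")
```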
Further structure exists outside of the message headers themselves:
- Messages live in folders.
- Folders have names.
- Folders are sometimes arranged in a hierarchy.
- Folders tend to store messages linearly, in a particular order:
thus, each message has ``previous'' and ``next'' relationships
with other messages.
- Messages can contain other messages (forwarded messages, or digests.)
Each such message is a message in its own right, but the containment
relationship can be important.
- Messages have bodies.
- The bodies can contain unstructured text.
- The bodies can contain text that is named, for example, an
attached text file which has a file name or description specified
in its attachment headers.
- The bodies can contain binary objects which, while not textually
searchable, are named and described.
- Bodies can contain hyperlinks. Plain-text messages might happen to
have detectable URLs in them, and HTML messages have many mechanisms
for referring to other objects. This implies that it would be
interesting to traverse from a message, to information about a web
page that it refers to, and back to a set of messages which refer to
objects on that server.
-
Following a link only gives you one dimension of mobility. A search can
be seen as following multiple links, and finding the intersection (or union)
of the results of those links.
Any link-relationship should be searchable. For example:
- All messages from person between date and date
that have pattern in the body.
- All messages from person which contain a message
from person.
- All messages to mailing-list which refer to URL.
- All messages containing text in the main body, but not in an
attachment.
- All messages with an attachment whose file name contains string.
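The idea that a search is just the intersection or union of several link traversals can be sketched with plain sets. Everything here is made up for illustration: the tiny `MESSAGES` store, the addresses, and the helper `following`.

```python
from datetime import date

# A toy message store: db id -> a few properties of the message.
MESSAGES = {
    1: {"from": "alice@example.com", "date": date(1997, 3, 1), "body": "printing is broken"},
    2: {"from": "alice@example.com", "date": date(1997, 4, 2), "body": "lunch?"},
    3: {"from": "bob@example.com", "date": date(1997, 3, 15), "body": "printing works again"},
    4: {"from": "alice@example.com", "date": date(1997, 3, 20), "body": "meeting notes"},
}


def following(link):
    """Follow one link: return the ids of all messages the predicate accepts."""
    return {db_id for db_id, message in MESSAGES.items() if link(message)}


from_alice = following(lambda m: m["from"] == "alice@example.com")
in_march = following(lambda m: date(1997, 3, 1) <= m["date"] <= date(1997, 3, 31))
mentions_printing = following(lambda m: "printing" in m["body"])

# "All messages from person between date and date that have pattern in the body"
narrowed = from_alice & in_march & mentions_printing

# Union widens instead of narrowing.
widened = from_alice | mentions_printing
```

A real store would answer each `following` call from an index rather than by scanning, but the set algebra on the results would be the same.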
-
The basic components of this system are:
- parser.
The module which reads the existing message store (directories of
BSD mbox files, or news spool directories, or whatever) and parses
them into tagged, indexable data.
It needs to understand where messages begin and end, understand how
to descend into MIME structures, how to translate HTML into indexable
text, how to recognise URLs, and so on, and so on.
It will presumably generate an intermediate data representation which
can be more easily fed to the database. A pretty-printed version of
the representation of a message might look like this (if you will
excuse my lisp-centric upbringing; here in the modern world, this
would presumably be done with XML):
    (:message
      (:db-id "globally-unique-identifier")
      (:header-field (:key "from")
                     (:addr "Jamie Zawinski" "jwz"))
      (:header-field (:key "newsgroups")
                     (:news "mcom.test"))
      (:header-field (:key "subject")
                     (:text "hey"))
      (:link "http://url-found-in-some-textual-header/")
      (:attachment
        (:type "text/plain")
        (:body "message body text")
        (:link "http://url-found-in-body-text")
        (:addr-or-id "email@address.found.in.body")
        (:addr-or-id "or@maybe.its.really.a.message.id"))
      (:attachment
        (:type "text/plain")
        (:name "filename")
        (:description "description")
        (:text "decoded/stripped text of attachment")
        (:link "http://ijkl")
        (:link "http://mnop"))
      (:attachment
        (:type "application/postscript"))
      (:attachment
        (:type "message/rfc822")
        (:message-pointer "db-id")))
These objects are shallow: that last "db-id"
mentioned in the example is a pointer to a top-level message object
that will be coming up soon (probably next in the stream.) That is,
deeply nested trees of messages are flattened. (An interesting search
term might be ``depth > 1'' for when you're
looking for something, and you know it was in a forwarded
message, but you don't remember from whom.)
Deeply nested MIME structures (multipart/ forms) are also
flattened. Content-Disposition is always assumed to be inline
for purposes of indexing; we index the body of any part that is of a
text type. There is no special handling for
multipart/alternative forms: each part is indexed as for
multipart/mixed.
A more formal representation might be:

    msg_desc     = db_id *msg_header *link_part *addr_id_part *msg_body
    msg_header   = header_name header_body
    msg_body     = text_part / link_part / attach_part
    header_name  = keyword
    header_body  = text / *mailbox / *newsgroup / *msg_id / date
    mailbox      = name address
    name         = keyword
    address      = keyword
    newsgroup    = keyword
    msg_id       = keyword
    date         = <time_t>
    text_part    = content_type text
    content_type = keyword
    link_part    = url
    addr_id_part = address / msg_id
    url          = text
    attach_part  = content_type [attach_name] [attach_desc] [attach_value]
                   *link_part *addr_id_part
    attach_name  = text
    attach_desc  = text
    attach_value = text_part / db_id
    db_id        = <uint32>
    keyword      = <an interned string>
    text         = <an uninterned, searchable string>
(Note: I've actually already written this parser; it's not a lot of code,
but it seems to work fairly well. If anyone is seriously interested in
taking this project and running with it, I'll see about getting permission to
release that code.)
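The flattening described above (nested messages become top-level records joined by a pointer, and multipart containers dissolve into their leaves) might look roughly like this. To be clear, this is not the parser mentioned in the note; it is a fresh sketch on top of Python's standard `email` package, and the record layout, the `depth` field, and the function names are assumptions of mine.

```python
import itertools
from email.message import EmailMessage


def flatten(msg, out, ids, depth=1):
    """Append a shallow record for msg to out and return its (hypothetical) db id."""
    record = {"db_id": next(ids), "depth": depth,
              "subject": str(msg.get("Subject", "")), "parts": []}
    out.append(record)
    _collect(msg, record, out, ids, depth)
    return record["db_id"]


def _collect(part, record, out, ids, depth):
    ctype = part.get_content_type()
    if ctype == "message/rfc822":
        # A contained message becomes its own top-level record; keep only a pointer.
        child_id = flatten(part.get_payload(0), out, ids, depth + 1)
        record["parts"].append((ctype, child_id))
    elif part.is_multipart():
        # multipart/* containers are flattened: only their leaves are recorded.
        for sub in part.get_payload():
            _collect(sub, record, out, ids, depth)
    elif part.get_content_maintype() == "text":
        record["parts"].append((ctype, part.get_content().strip()))
    else:
        # Non-text leaves are recorded by type only; there is no text to index.
        record["parts"].append((ctype, None))


# Build a forwarded message to exercise the sketch.
inner = EmailMessage()
inner["Subject"] = "original report"
inner.set_content("the printer is on fire")

outer = EmailMessage()
outer["Subject"] = "Fwd: original report"
outer.set_content("see below")
outer.add_attachment(inner)

records = []
flatten(outer, records, itertools.count(1))

# The "depth > 1" search: things that arrived inside some other message.
forwarded = [r["subject"] for r in records if r["depth"] > 1]
```

Because the contained message is emitted right after its container, the pointer in the container's record refers to a record that shows up next in the stream, as described above.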
- database.
The module which stores the output of the parser on disk in some
quickly-retrievable format. It needs to have both relational
and full-text-indexing properties; many of the searches
we want to do could be accomplished with a database that was nothing
but a glorified set of hash tables; but body searches need to be done
in some more clever way. (Perhaps simply putting every word in a hash
table would be sufficient, but I doubt it.) And more to the point,
the text searches have to take advantage of the tagging of the data,
so that, for example, constraining a search to be in the subject and
not the body actually makes the search go faster instead of slower.
Incremental updates are probably pretty important. I doubt we could
get away with a setup that required a nightly update.
It seems clear that RDF would be the way to go here.
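To illustrate why tagging should make a constrained search cheaper rather than more expensive, here is a toy inverted index whose postings are keyed by (field, word). It is a plain in-memory structure, not RDF and not a real database; the class name `TaggedIndex` and its methods are hypothetical.

```python
import re
from collections import defaultdict


class TaggedIndex:
    """Inverted index whose postings are keyed by (field, word)."""

    def __init__(self):
        self._postings = defaultdict(set)  # (field, word) -> db ids
        self._fields = set()

    def add(self, db_id, field, text):
        # Adding a message touches only its own words, so updates are incremental.
        self._fields.add(field)
        for word in re.findall(r"\w+", text.lower()):
            self._postings[(field, word)].add(db_id)

    def search(self, word, field=None):
        word = word.lower()
        if field is not None:
            # Constrained search: exactly one lookup.
            return set(self._postings.get((field, word), ()))
        # Unconstrained search: one lookup per known field, then a union.
        found = set()
        for known_field in self._fields:
            found |= self._postings.get((known_field, word), set())
        return found


index = TaggedIndex()
index.add(1, "subject", "printing")
index.add(1, "body", "the queue is stuck")
index.add(2, "body", "printing again")

subject_hits = index.search("printing", field="subject")
anywhere_hits = index.search("printing")
```

Restricting the search to the subject does strictly less work than searching everywhere, which is the property asked for above. Whether a hash table of words is sufficient at scale is the open question already raised.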
- query tool.
All of the web search engines force the user to type in boolean
expressions. Sometimes that's ok, but we should do something better,
that lets the user construct expressions with a GUI.
Drawing on the notion that searches are really set operations, perhaps
one aspect of the search tool could be drag-and-drop: to add a set
of messages to the union of messages returned, drop the link on the
``Or'' box. To add it to the intersection of messages returned, drop
it on the ``And'' box. Of course, that doesn't handle deeper boolean
expressions, or textual searches. Maybe it's a dumb idea.
- presentation tools.
There are objects, sets of objects, and presentation tools.
There is a presentation tool for each kind of object; and one for
each kind of object set.
names, addresses, or people.
The presentation tools for these kinds of objects needn't be complicated,
since there's not a lot of information to show: just a bunch of links and/or
commands. For example, there needs to be a place to hang the ``show me all
people with this name'' gesture, and the ``show me all messages from this
user'' gesture. But just including the list there isn't going to work, since
it's long; really, there wants to be a way to initialize a search with this
user. Perhaps activating one of those controls would bring up the search
tool with some terms already filled in, like
user = "Jamie Zawinski <jwz@mozilla.org>"
Getting back to the drag-and-drop idea, dragging that button
onto an existing search tool could expand the search to include that term.
One should be able to store annotations on people: even something as
simple as a single text field would add a great deal of power. These
annotations should themselves be searchable. These annotations should
be able to contain (clickable!) references to other people or messages
or newsgroups or...
BBDB convinces me that this is an absolute requirement.
The problem with the annotation notion is that it's the first time that
we consider a piece of data which is not merely a projection of data already
present in the message store: it is out-of-band data that needs to
be stored somewhere. In the address book? In LDAP? I have no idea.
sets of people.
Perhaps a simple list is sufficient, with options to sort in various ways
(by last name, first name, email, host-name, or host-domain.)
messages.
Presenting a single message is straightforward: just return a
message/rfc822 or text/html document. However, there should be some
other controls available: Reply-To-Sender, Reply-To-All, Forward.
And there needs to be a place to hang the reciprocal links to the
referring messages, to the folder, and so on.
Annotations of messages would be interesting as well. For example,
one might want to make a note to oneself that two messages from different
people refer to the same issue and should be dealt with at the same time.
sets of messages.
This presentation has to be fairly powerful; it needs to present a decent
summary of the messages (with resizable columns for sender, recipient,
date, and so on) and be able to do all the usual sorting and threading
tricks. Basically, this has to be a very good thread display.
It should also be able to incrementally update as results are coming
back from the database, so that the user can see the results they're getting
(and even examine messages) while more results are still coming in.
Note that, to this view, the concept of ``folder'' is meaningless:
a folder name is just another property by which searches can be pruned.
Today, I can point my ``message set browser'' at my Inbox folder, but
I can't point it at the set of messages with word in the body.
The special treatment of Inbox is arbitrary and limiting.
Annotating a message-set could mean manually including and excluding
specific messages: a message-set could be considered a ``bucket'' which
the user can then manipulate by hand, assign a name, and keep around, for
use as a ``to do'' list, say. (Message inclusion and exclusion
could be handled by manipulating the search terms, so it's not as hard
a problem as textual annotations in general.)
Presentation tools should be linked as well: one should be
able to pick up the sets displayed in one tool and project them into
another. For example:
- Show me all messages with word in body.
- Drag the sender column away: that's a set of people,
therefore it is displayed using a ``people browser''.
- In the people browser, click on an address: refine the search to
contain only those in the same domain as that address. A new,
smaller list of people is presented.
- Project the addresses of those people into a message-set-viewer:
this shows all mail received from any of those people.
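The four steps above amount to projecting a set from one kind of object onto another and back. A minimal sketch, with an invented in-memory `MESSAGES` list and invented addresses standing in for the real store and the real browsers:

```python
MESSAGES = [
    {"id": 1, "from": "alice@example.com", "body": "printing is broken"},
    {"id": 2, "from": "bob@example.org", "body": "printing works for me"},
    {"id": 3, "from": "carol@example.com", "body": "lunch on friday?"},
    {"id": 4, "from": "alice@example.com", "body": "never mind"},
]


def domain(address):
    """Return the part of an address after the last '@'."""
    return address.rpartition("@")[2]


# 1. Show me all messages with a word in the body: a message set.
hits = [m for m in MESSAGES if "printing" in m["body"]]

# 2. Drag the sender column away: the message set becomes a set of people.
people = {m["from"] for m in hits}

# 3. Click on an address: keep only the people in the same domain.
clicked = "alice@example.com"
same_domain = {p for p in people if domain(p) == domain(clicked)}

# 4. Project those people back into a message-set viewer:
#    all mail received from any of them, not just the original hits.
projected = [m["id"] for m in MESSAGES if m["from"] in same_domain]
```

Note that the final set contains a message that did not match the original search; the projection goes through the people, not through the first result list.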
Perhaps the message-set presentation is a simulated IMAP folder.
Perhaps the message and message-set presentation tools are a mail reader.
However, every element of the display needs to be deeply intertwingled
with the database. Simply dropping the messages into a mail reader would
defeat the purpose, which is that every structured piece of text on the
screen should be a hyperlink.
The presentation tools could be implemented as client-side Java, or
partly as client-side Java and partly as server-generated HTML. (It seems
unlikely that the message-set presentation could be implemented solely in
HTML, though that's conceivable for the other presentations.)
The other components are server-side, and need to be far faster than
Java is currently capable of. And hopefully we won't have to write the
database at all, but can just use something publicly available.
-
There are other interesting data-visualization possibilities here as
well; since really what we have is nodes and connections between them,
tools like graphers and histogram charts might be applicable as well,
to answer questions like
- show me a graph of the age-distribution of my unanswered mail, or,
- show me a graph of people who are known to have directly exchanged
mail with each other so that I can see the ``clumping'' of my correspondents.
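As a rough illustration of the second question, the ``clumping'' of correspondents can be approximated by finding connected components in the graph of who has exchanged mail with whom. This is one simple reading of "clumping", swapped in for whatever a real grapher would do; the function `clumps` and the sample names are hypothetical.

```python
from collections import defaultdict


def clumps(exchanges):
    """Group correspondents into connected components of the exchange graph."""
    neighbours = defaultdict(set)
    for a, b in exchanges:
        # Exchanging mail is symmetric, so record the edge in both directions.
        neighbours[a].add(b)
        neighbours[b].add(a)

    seen, groups = set(), []
    for start in sorted(neighbours):
        if start in seen:
            continue
        # Depth-first walk collecting everyone reachable from this person.
        group, stack = set(), [start]
        while stack:
            person = stack.pop()
            if person in group:
                continue
            group.add(person)
            stack.extend(neighbours[person] - group)
        seen |= group
        groups.append(sorted(group))
    return groups


EXCHANGES = [("alice", "bob"), ("bob", "carol"), ("dave", "erin")]
groups = clumps(EXCHANGES)
```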
The object/presentation infrastructure should be designed so that new
tools drop in easily, with few interdependencies.
This sort of model is not applicable merely to the domain of
messages; it applies equally well to any corpus which has structured,
potentially-ambiguous references (or rather, representations of
references.)
For example, source code.