DadaDodo
Exterminate All Rational Thought

© 1997-2003 Jamie Zawinski <jwz@jwz.org>

don't read the words

just look at the shapes

I never metadiscourse I didn't like

deconstruct this, monkey boy

the fun link is at the bottom

DadaDodo is a program that analyzes texts for word probabilities, and then generates random sentences based on that. Sometimes these sentences are nonsense; but sometimes they cut right through to the heart of the matter, and reveal hidden meanings.

William S. Burroughs called this ``cut-up theory.'' His approach was to take a page of text, divide it into quadrants, rearrange the quadrants, and then read the page across the divisions. He wrote this way: writing, cutting up, shuffling, publishing the result. Collage and randomness applied to words. He saw this as a way of escaping from a prison that words create for us, locking us down into one way of thinking: an idea echoed in Orwell's ``1984,'' where the purpose of Newspeak was to make thoughtcrime impossible by making it inexpressible: ``The Revolution will be complete when the language is perfect.''

In 1976, industrial music found a name, when Throbbing Gristle formed Industrial Records (``Industrial Music for Industrial People'') along with such bands as Cabaret Voltaire and ClockDVA. These bands were heavily influenced by Burroughs' ideas, and cut-up theory made its way into their music: the bands would make tape recordings of found sounds (machinery, short-wave radio, television newscasts, public conversations) and cut up, rearrange, and splice the tapes, turning them into music.

This was long before digital audio: this was done with razor blades. Today, it's called sampling, and the influence of these bands is felt in nearly all branches of modern pop music.

This wasn't the first time ``natural'' sounds had been used in musical compositions; that sort of thing had been going on at least as far back as the 19th century, and the surrealists and futurists of the 1920s and 1930s were way into this kind of thing.

Ted Nelson, the inventor of hypertext, published ``Computer Lib'' in 1973. This book was more a stream-of-consciousness collage than anything else, nominally about nonlinear texts, and effectively an example of the same. It was written as hundreds of individual typewritten rants, and then pasted together for printing. Ironically, it was printed with a third of the pages out of order, allegedly due to a mix-up with the printer: one wonders, however, whether that really mattered.

DadaDodo is one of the class of programs known as ``dissociators,'' a term perhaps coined by the amazing Emacs hack, ``Dissociated Press.''

DadaDodo works rather differently than Dissociated Press; whereas Dissociated Press (which, incidentally, refers to itself as a ``travesty generator'') simply grabs segments of the body of text and shuffles them, DadaDodo tries to work on a larger scale: it scans bodies of text, and builds a probability tree expressing how frequently word B tends to occur after word A, and various other statistics; then it generates sentences based on those probabilities.

The theory here is that, with a large enough corpus, the generated sentences will tend to be grammatically correct, but semantically random: exterminate all rational thought.
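
Concretely, the word-pair version of this looks something like the following -- a minimal sketch in portable C, not DadaDodo's actual data structures (a real program wants a hash table, not a linear scan; the table sizes are arbitrary):

    /* Count how often word B follows word A, then emit words by
       weighted random choice.  A sketch only. */

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <time.h>

    #define MAX_PAIRS 100000
    #define MAX_WORD  64

    struct pair { char a[MAX_WORD], b[MAX_WORD]; int count; };
    static struct pair pairs[MAX_PAIRS];
    static int npairs = 0;

    /* Remember that word `b' occurred after word `a' once more. */
    static void
    add_pair (const char *a, const char *b)
    {
      int i;
      for (i = 0; i < npairs; i++)
        if (!strcmp (pairs[i].a, a) && !strcmp (pairs[i].b, b))
          { pairs[i].count++; return; }
      if (npairs < MAX_PAIRS)
        {
          strcpy (pairs[npairs].a, a);
          strcpy (pairs[npairs].b, b);
          pairs[npairs++].count = 1;
        }
    }

    /* Pick a successor of `a' with probability proportional to how
       often it followed `a' in the input. */
    static const char *
    next_word (const char *a)
    {
      int i, total = 0, r;
      for (i = 0; i < npairs; i++)
        if (!strcmp (pairs[i].a, a)) total += pairs[i].count;
      if (!total) return 0;
      r = rand () % total;
      for (i = 0; i < npairs; i++)
        if (!strcmp (pairs[i].a, a) && (r -= pairs[i].count) < 0)
          return pairs[i].b;
      return 0;
    }

    int
    main (void)
    {
      char prev[MAX_WORD] = "", word[MAX_WORD];
      int i;
      srand ((unsigned) time (0));
      while (scanf ("%63s", word) == 1)    /* build the chain */
        {
          if (*prev) add_pair (prev, word);
          strcpy (prev, word);
        }
      if (!npairs) return 1;
      strcpy (prev, pairs[0].a);           /* generate 20 words */
      fputs (prev, stdout);
      for (i = 0; i < 20; i++)
        {
          const char *w = next_word (prev);
          if (!w) break;
          printf (" %s", w);
          strcpy (prev, w);
        }
      putchar ('\n');
      return 0;
    }

Everything interesting is in next_word: the weighted random pick is the whole trick.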

M-x psychoanalyze-pinhead RET

This kind of probability histogram is called a Markov chain, after Andrei Markov, the fellow who invented it. It turns out that Markov chains can actually be used for things other than generating random text. They are used in image processing, for feature recognition; and can be used to analyze finite-state machines for bottlenecks and critical paths: the states which occur most often are where the bottlenecks are.

DadaDodo doesn't work quite as well as I would like it to.

Here's the bug: the smaller the amount of input text, the better the sentences it generates. (Above a certain minimum, that is: with too little input, you just get the input text back.)

I think I understand why this is. My guess is that as the body of input text increases, the probabilities even out to normal-English-language distributions, and the end result starts behaving more like picking words at random in a vacuum. You don't tend to get sequences of 4, 8, 10 words together that all make a kind of sense; you get 2, then another 2. The distance between any two words gets smaller, and the overall probabilities become obscured.

I don't think treating every pair of words as one ``word'' for statistical purposes will work very well; that will be too clumpy.

I want some kind of cumulative probability -- X is n% likely after A+B, but only m% likely after A+C. And A+D+E+F+G adds a correspondingly smaller amount of influence.

I'd kind of like to do this without adding another dimension to my graph, because it's already pretty huge. Another order of magnitude just won't do.

But it seems that human language isn't a system which can be modeled by a Markov chain of length 1.
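
For what it's worth, one standard way to get that kind of diminishing influence is linear interpolation of n-gram orders -- not what DadaDodo does, just a sketch. count1(), count2(), and count3() are hypothetical lookups into word, pair, and triple tables; and note that storing triples at all costs exactly the extra dimension complained about above (the pruning idea below is one way to keep that in check).

    extern long count1 (const char *a);                 /* word count   */
    extern long count2 (const char *a, const char *b);  /* pair count   */
    extern long count3 (const char *a, const char *b,
                        const char *x);                 /* triple count */

    /* Probability of `x' following the pair `a b': mostly the triple
       estimate, backed off toward the pair estimate, so the more
       distant word contributes correspondingly less. */
    double
    prob (const char *a, const char *b, const char *x)
    {
      double w3 = 0.7, w2 = 0.3;   /* interpolation weights: guesses */
      double p3 = count2 (a, b) ? (double) count3 (a, b, x) / count2 (a, b) : 0;
      double p2 = count1 (b)    ? (double) count2 (b, x)    / count1 (b)    : 0;
      return w3 * p3 + w2 * p2;
    }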

Dissociated Press works better because it only ever operates on small inputs, and always shuffles large-ish chunks. Burroughs' cut-ups work better because they operate on large-ish chunks, and there are spatial relations that come into play -- even if you shuffle all segments of all pages, some words on the same segment are still similar distances apart, even if new words have been interspersed.

But I really like the idea of breaking the original text down into probabilities and then generating from that, rather than taking the original text and shuffling it. The shuffling approach feels like it preserves too much of the original content, whereas all I want to preserve is the original grammar. Maybe that's not possible (practical).

I don't want to have huge lists of nouns/verbs. I don't want to encode knowledge of the language into it. That way lies intellectually corrupt AI projects which I'll not taunt by naming here.

    One possibility would be to keep only the most popular 3-way and higher combinations around; for the more common ones, I could hold a pointer to a sub-table, instead of a flat probability. Their popularity could be found using a quadtree-style subdivision of the word space (to find the words that ``clump'' together.) The problem with this is, it doesn't work in a streaming fashion; you basically have to have the whole N-way graph around before you can throw away the less-popular combinations, because you won't know which ones are popular until the end.
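
The prune pass itself would be simple enough; here's a sketch over a hypothetical flat table of triples. The catch, as noted above, is that it can't run until all the counting is done.

    struct triple { int a, b, c;     /* word indices     */
                    long count; };   /* occurrence count */

    /* Compact the table in place, keeping only triples seen at least
       `threshold' times; returns the new table size.  This can only
       run after the whole corpus has been counted -- you don't know
       which triples are popular until the end. */
    int
    prune_triples (struct triple *t, int n, long threshold)
    {
      int i, kept = 0;
      for (i = 0; i < n; i++)
        if (t[i].count >= threshold)
          t[kept++] = t[i];
      return kept;
    }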

Another good compression trick would be to quantize the values; though the maximal numerator or denominator that we need to express the probabilities might be a 16 bit number (or higher), we probably could make do with 8 bits (or less) of resolution: have an 8-bit lookup table of approximate probabilities. I suspect that the values in this table would end up being on a logarithmic scale (since that's how nature works.)
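
A sketch of such a table, assuming the smallest probability worth representing is 1/65536 -- that floor is a guess; the log spacing is the suspicion above made concrete:

    #include <math.h>

    #define QLEVELS 256                       /* 8 bits of resolution */
    static double qtable[QLEVELS];
    static const double min_p = 1.0 / 65536;  /* smallest probability
                                                 worth keeping: a guess */

    /* Fill the lookup table with QLEVELS log-spaced probabilities
       running from min_p up to 1.0. */
    void
    init_qtable (void)
    {
      int i;
      for (i = 0; i < QLEVELS; i++)
        qtable[i] = min_p * pow (1.0 / min_p, i / (double) (QLEVELS - 1));
    }

    /* Quantize a probability to the nearest bucket on the log scale:
       8 bits where the exact ratio would need 16 or more. */
    unsigned char
    quantize (double p)
    {
      double x = log (p / min_p) / log (1.0 / min_p) * (QLEVELS - 1);
      if (x < 0) x = 0;
      if (x > QLEVELS - 1) x = QLEVELS - 1;
      return (unsigned char) (x + 0.5);
    }

    /* Dequantizing is just a table lookup. */
    double
    dequantize (unsigned char q)
    {
      return qtable[q];
    }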

DadaDodo doesn't do quite as much as I would like it to.

I want it to crawl the web and consume text.

I want it to sometimes randomly bounce to a similar-sounding word, for an auto-punning effect.
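
One crude way to do that would be to bucket the vocabulary by Soundex code and, every so often, jump from the chosen word to a random other word in the same bucket -- just a sketch of the punning idea, not anything DadaDodo does, and it skips one of classic Soundex's edge cases (the h/w rule):

    #include <ctype.h>
    #include <string.h>

    /* Map a consonant to its Soundex digit class, or 0 for vowels
       and everything else. */
    static char
    soundex_digit (char c)
    {
      switch (tolower ((unsigned char) c))
        {
        case 'b': case 'f': case 'p': case 'v':   return '1';
        case 'c': case 'g': case 'j': case 'k':
        case 'q': case 's': case 'x': case 'z':   return '2';
        case 'd': case 't':                       return '3';
        case 'l':                                 return '4';
        case 'm': case 'n':                       return '5';
        case 'r':                                 return '6';
        default:                                  return 0;
        }
    }

    /* Write the 4-character Soundex code for `word' into `code'
       (which must hold 5 bytes). */
    void
    soundex (const char *word, char *code)
    {
      char last;
      int n = 1;
      if (!*word) { strcpy (code, "0000"); return; }
      last = soundex_digit (word[0]);
      code[0] = toupper ((unsigned char) word[0]);
      for (word++; *word && n < 4; word++)
        {
          char d = soundex_digit (*word);
          if (d && d != last) code[n++] = d;
          last = d;
        }
      while (n < 4) code[n++] = '0';
      code[4] = 0;
    }

``Robert'' and ``Rupert'' both come out R163, so a generator could trade one for the other now and then.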

I want it to count syllables, and thereby generate haiku. This could be done by simply generating random sentences until we get ones that have the word and sentence breaks in the right places: it shouldn't take more than a few hundred or thousand iterations each.

I think syllable counting is just a hyphenation problem.
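
Here's a crude sketch of both halves: approximate syllables by counting vowel groups (a poor man's hyphenation -- it miscounts silent e's), and rejection-sample until a line has the right count. random_sentence() is a hypothetical stand-in for the generator.

    #include <ctype.h>
    #include <string.h>

    extern const char *random_sentence (void);  /* hypothetical generator */

    static int
    is_vowel (char c)
    {
      return strchr ("aeiouy", tolower ((unsigned char) c)) != 0;
    }

    /* Count maximal runs of vowels -- a crude stand-in for real
       hyphenation; silent e's and vowelless tokens will miscount. */
    int
    count_syllables (const char *s)
    {
      int n = 0, in_vowel = 0;
      for (; *s; s++)
        {
          int v = isalpha ((unsigned char) *s) && is_vowel (*s);
          if (v && !in_vowel) n++;
          in_vowel = v;
        }
      return n;
    }

    /* Rejection-sample a line: keep generating until the syllable
       count comes out right, within a bounded number of tries. */
    const char *
    haiku_line (int syllables)
    {
      int i;
      for (i = 0; i < 10000; i++)
        {
          const char *s = random_sentence ();
          if (count_syllables (s) == syllables)
            return s;
        }
      return 0;   /* gave up */
    }

A haiku is then just haiku_line (5), haiku_line (7), haiku_line (5) -- with rejection counts in the range guessed at above.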

I want access to the raw AltaVista database. In order to implement their NEAR search term, they must have proximity information in their database. I want to reconstruct a Markov chain from that, and generate text based on the whole of the web.

Quoth a random Netscape employee, about The Dr. Bronner's Peppermint Castile Soap School of Web Site Design and Panhandling:

    Well, I've seen at least one person who appreciates the advanced aesthetics of our web site. As I was driving through Scotts Valley on my way to work one day not long ago, I chanced to stop at a light where a man was standing on a corner holding a sign. You know, like those people who ``will work for food.'' Except he had quite a large sign, and it was divided into rectangular areas, each containing a separate message, although none of these were actually large enough for me to read.

    If I see this guy again, so help me, I'm going to refer him for hiring as a web site designer.

Was it Robert McElwaine?

DadaDodo is still kinda cool, though.

Click here to run it. This will select among a few pre-parsed corpora, and generate new random text -- just for you!

Click here to download the source. This is a gzipped tar file. It was written on Unix, but it should be fairly portable ANSI C.

