Illuminating blacked-out words:

By realigning the document, it was possible to use another program Whelan had written to determine that it had been formatted in the Arial font. Next, they found the number of pixels that had been blacked out in the sentence: "An Egyptian Islamic Jihad (EIJ) operative told an xxxxxxxx service at the same time that Bin Ladin was planning to exploit the operative's access to the U.S. to mount a terrorist strike." They then used a computer to determine the pixel length of words in the dictionary when written in the Arial font.

The program rejected all of the words that were not within three pixels of the length of the word that was probably under the blacked-out area in the document.

The software then reduced the number of possible words to just seven from 1,530 by using semantic guidelines, including the grammatical context. The researchers selected the word "Egyptian" from the seven possible words, rejecting "Ukrainian" and "Ugandan," because those countries would be less likely to have such information.
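The width-matching step described above can be sketched roughly as follows. This is a minimal illustration, not Whelan's program: the `char_width` table is a made-up stand-in for real Arial metrics, kerning is ignored, and the word list and the 45-pixel target are hypothetical.

```python
# Sketch of the pixel-width filter. All widths below are invented for
# illustration; they are NOT real Arial metrics, and kerning is ignored.

def char_width(ch):
    """Return a hypothetical advance width, in pixels, for one character."""
    if ch in "ijlI":
        return 3
    if ch in "ftr":
        return 4
    if ch in "mw":
        return 9
    if ch in "MW":
        return 10
    if ch.isupper():
        return 8
    return 6


def pixel_width(word):
    """Total width of a word as the sum of its character widths."""
    return sum(char_width(ch) for ch in word)


def candidates(words, target_width, tolerance=3):
    """Keep only words whose width is within `tolerance` pixels of the target."""
    return [w for w in words if abs(pixel_width(w) - target_width) <= tolerance]


if __name__ == "__main__":
    word_list = ["Egyptian", "Ukrainian", "Ugandan", "Iraqi", "British", "Israeli"]
    print(candidates(word_list, target_width=45))
```

A real attempt would measure widths with the actual font at the document's point size and resolution; the semantic filtering that follows is a separate step.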

24 Responses:

  1. lars_larsen says:

    Fuck yeah! I love it when poor security fails.

    Unfortunately most documents have just one word NOT blacked out, and like 6 paragraphs solid black.

  2. srcosmo says:

    An argument for using fixed-width fonts if I ever heard one...

    Apparently the government has a different idea, however.

    • greatbiggary says:

      I don't get how a fixed-width font would help in this matter. Wouldn't it just aid you in figuring out exactly how many letters the word or words in question were composed of? You could use low-tech means, like a ruler, or a pencil to mark a strip of paper and compare lengths with any other word, or a simple folding of the paper over a light table, or even your basic eyeballing by a graphic designer.

      "Ah, we're looking for an exactly 9-letter word." That removes the need to try out every word in a particular font and measure its pixel width against the font, including all that kerning and letter-pairing business. This all sounds more like an argument for Dingbats to me.

      • jwz says:

        Teh gvomerennt iwll jsut haev to tsart keepnig its recodrs in spamglish.

        • ronbar says:

          The reading comprehension of the target audience for the most sensitive documents in this country is pretty low. Maybe they should use scramable.pl and write a descramable.pl for the White House Communications Agency to "decrypt" documents intended for our noble and honorable royal majesty to read.

          descramable.pl should also add lots of pretty colors and happy pictures to motivate him to continue to read to the top of page 2. Maybe eventually, after a lot of intensive tutoring and reading comprehension exercises, he'll even make it to the undiscovered territory of page 3.

          Then again, maybe it'd just be easier to wait until Kerry is inaugurated in January.

      • bazil says:

        There are exactly n words that could be used that are 9 letters long. (An online Scrabble wordfinder, a2zwordfinder.com, found 24788 words 9 letters long. Also check out http://www.thewordlist.com/wordox9.html)

        Now, pick any two words out of that list at random, and enter them in a text editor in a non-fixed-width font. They are each different lengths, obviously. Each letter would have its own specific width, which may differ by only a pixel, from the smallest 'l' or 'i' to the largest of 'W' or 'M' or such. So, maybe we have a range of 2-8 pixels for each letter. The longer a blacked-out word gets, I'm betting (without trying to do any sort of analysis; it would be an interesting program, indeed!) that the number of potentials decreases, by sheer fact that being 9 letters, each one of variable length of 2-8, that provides a range between 18 and 72 pixels wide over a range of 24,000 words or so (with, more than likely, a median range somewhere in the middle, leaning towards the lower spectrum, because of so few long-letters). This means that for each specific pixel range we might expect to see between 200 and 800 words within that range, for 9-letters long.
        Understand, of course, that some words that are 8 or 10 letters long would have the same pixel lengths as 9-letters, and so forth.

        I even made a chart, because I'm insane. (Discounting 1-letter words. I, a, ... and... uh... Yeah. Also, assuming a variable-length font of 2-8 pixelwidth, hypothetically. Used abovementioned a2zwordfinder for wordcounts.)
        Letters, PixelRange, #Words, Words Per PixelLen
        02, 04-16, 00094, 7.8
        03, 06-24, 00953, 52.9
        04, 08-32, 03831, 159.6
        05, 10-40, 08513, 283.8
        06, 12-48, 15013, 417.0
        07, 14-56, 22773, 542.2
        08, 16-64, 27969, 582.7
        09, 18-72, 24788, 459.0
        10, 20-80, 20189, 336.5
        11, 22-88, 15404, 233.4
        12, 24-96, 11271, 156.5

        So our range for 2-12 character words is 4-96 pixels wide. There's a lot of overlapping in there, and you can think about that if you want, but it might be observed that the majority of words would appear in the lower-center weight (probably around the 1/3 point, I betcha) of the pixel-widths.
        Take a word 56 pixels wide. If it were fixed width, and you knew that your fixed-width font was 8 pixels wide (including the gap between letters), you would come up with 7-letter words. The chances of picking out exactly the right 7-letter word from context out of a volume of 22,773 words are a lot lower than those of picking out exactly the right n-letter word from a volume of ~2308 (adding up the words-per-pixel-length figures for every length from 7 through 12 {pretending there are no 13+ words, which there are}) based on that same 56-pixel-width word.

        Fixed-width gives you chances of 1/22,773, whereas variable-width gives you chances of 1/2308 for a 56-pixel-width word. This is why fixed-width fonts should be used at the least for government documents, because they offer greater resilience to this sort of attack. Ultimately, it would take a multiarrayed font with each letter having, say, 2-10 variants, each one carrying a different width, in order to curb the ability of a cryptographer to analyze the frequencies to figure it all out. That would increase the strength by a factor of n, I guess, where n is the number of variants per letter. So instead of 1/22,773 for a single word, it would be something like 1/75,000. That's a lot of words to sift through for the correct one.
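        The figures above can be reproduced with a short sketch. It takes the word counts straight from the chart and keeps the same simplifying assumptions (2-8 pixels per letter, words of each length spread evenly over that length's pixel range); the function names are made up for illustration. Summing lengths 7 through 12 gives roughly 2310, close to the ~2308 quoted.

```python
# Word counts per word length, copied from the chart above.
WORD_COUNTS = {
    2: 94, 3: 953, 4: 3831, 5: 8513, 6: 15013, 7: 22773,
    8: 27969, 9: 24788, 10: 20189, 11: 15404, 12: 11271,
}

# Assumed narrowest and widest letter, in pixels (hypothetical font).
MIN_LETTER, MAX_LETTER = 2, 8


def words_per_pixel(length):
    """Average number of words per pixel width for one word length,
    assuming an even spread over that length's pixel range."""
    span = length * MAX_LETTER - length * MIN_LETTER
    return WORD_COUNTS[length] / span


def expected_candidates(width):
    """Sum the per-pixel densities of every word length whose
    pixel range includes the given width."""
    return sum(
        words_per_pixel(n)
        for n in WORD_COUNTS
        if n * MIN_LETTER <= width <= n * MAX_LETTER
    )


if __name__ == "__main__":
    print(round(words_per_pixel(7), 1))   # density for 7-letter words
    print(round(expected_candidates(56)))  # variable-width candidates at 56 px
    print(WORD_COUNTS[7])                  # fixed-width candidates at 56 px
```

        The even-spread assumption is the weak point: real widths cluster toward the middle of each range, so the true count near the middle is higher and the count at the edges is lower.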

        Of course, none of this is really any good for multiple-line paragraphs blacked out. That's where having access to the originals, and sophisticated forensics would come into play!

        One last thing I learned:
        'aarrghh' is a legal word to play in Scrabble, and the first word in the 7-letter list. Remember this, for it will serve you well.

        • greatbiggary says:

          I see. By spreading out the word lengths over many possible exact pixel lengths, you up the total possible number of word lengths several times, and thus reduce the number of possible words per length, several times. Thanks for all the footwork on that one.

          • bazil says:

            Welcome! It's also why the idea, mentioned elsewhere in these comments, of variable-length characters is a Good Idea©.

      • belgand says:

        I think the answer would be some sort of random-width font that varies the width of each letter. This would make it largely impossible to utilize this sort of method. Of course, it would make things harder to read and make typing the document a bit of a pain in the ass.

        • greatbiggary says:

          Do fonts support random widths? Is there any way with current font types to have a letter change between a variety of them at random, or simply scale in x to some random width?

          Maybe a font with a unique spacing for every letter pair... I believe that would work out to 26^26 letter pairs, or roughly 6156119580207157310796674288400200000 combinations. My long number namer tells me that's around six undecillion, one hundred fifty-six decillion, one hundred nineteen nonillion, five hundred eighty octillion, two hundred and seven septillion, one hundred fifty-seven sextillion, three hundred ten quintillion, seven hundred ninety-six quadrillion, six hundred seventy-four trillion, two hundred eighty-eight billion, four hundred million, two hundred thousand.

          I guess anyone with the font would be able to run a similar program to calculate width, since every letter pair would be fixed. It really does need some randomness in there. At least it would be a very large font with all those pairs. I don't know that anyone has the bandwidth to download something like that, let alone hdd space to store it, or the ram to use it ;)

          • bazil says:

            It would be possible to create a font like that, with the newer possibilities of unicode and other higher-bit encoding schemes. (65536 characters, I think, in unicode-16bit) Just set up the scheme of your font to use 1-10 as 'a', 11-20 as 'b,' and so forth. The key problem would be the programming of a text editor that would implement such a font. It would have to know that all 'a's are range 1-10, and, when a letter is typed on a keyboard, randomly select one of those ten versions of the letter 'a.' Of course, the more variances in character width, the uglier the document would look. ;)

            It would be a huge government study into 'how contorted can we make a letter before you can't read it' and 'is it really important to make your document nearly illegible to support the ability to censor it?' Neat thought, though. I'm imagining a 10pixel-wide font by 13-high. 130 bits per character, and using the entire unicode-16 format would be, what, 8,519,550 bits? Or, just over a whopping 1MB, I think. I might be wrong. That isn't a large font to download, and would offer around 2500 possibilities (font-size capabilities speaking) per letter.
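            The slot scheme described above (1-10 for 'a', 11-20 for 'b', and so on) might look something like this in the editor. It is only a sketch: the slot numbers are abstract indices rather than real Unicode code points, and `encode` and `decode` are hypothetical names.

```python
import random

# From the comment: slots 1-10 are 'a', 11-20 are 'b', and so forth.
VARIANTS_PER_LETTER = 10


def encode(text, rng):
    """Replace each lowercase letter with a randomly chosen variant slot.
    Anything else (spaces, punctuation) passes through unchanged."""
    slots = []
    for ch in text:
        if "a" <= ch <= "z":
            base = (ord(ch) - ord("a")) * VARIANTS_PER_LETTER + 1
            slots.append(base + rng.randrange(VARIANTS_PER_LETTER))
        else:
            slots.append(ch)
    return slots


def decode(slots):
    """Map every variant slot back to its plain letter."""
    out = []
    for s in slots:
        if isinstance(s, int):
            out.append(chr(ord("a") + (s - 1) // VARIANTS_PER_LETTER))
        else:
            out.append(s)
    return "".join(out)


if __name__ == "__main__":
    rng = random.Random()
    slots = encode("top secret", rng)
    print(slots)
    print(decode(slots))
```

            Each variant would carry its own width in the font, so the same word renders at a different length every time it is typed.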

            • jwz says:

              I have an idea. It's called "reprint it using an ellipsis". I'm applying for a patent, hands off.

            • jayrtfm says:

              so it's possible to include a bit of code in the font definition to randomize the character shape and kerning.
              see http://cgm.cs.mcgill.ca/~luc/fontsamples.html
              from that page:
              "Mike McDougall created a random font called "Tekla", which uses several handwritten samples as parents to create random offspring. A companion article has appeared in "Electronic Publishing". Its letters vary every time a character is needed. A type 3 font of unique versatility, Tekla may be used to simulate drunkenness, and, as the sample shows, varying degrees of instability on one page. It should prove useful in testing character recognition software."

              • jayrtfm says:

                un/pw and subject was on Netscape's autofill, and I clicked on "post" before noticing the subject.

                It should have said "Postscript IS a programming language"

          • edge_walker says:

            I believe that would work out to 26^26 letter pairs

            How in the world do you arrive at that? Each letter can be combined with 26 others. At 26 letters to combine, that makes 26*26, ie 26^2, ie a far cry from your 26^26.

          • belgand says:

            I was thinking this would be a custom job integrated into the text editor rather than a font itself. Then again, my knowledge of the internals of font design and such is more or less non-existent.

            Besides, it's the government. A customized, over-designed program is exactly the sort of thing I'd expect from them.

  3. king_mob says:

    That is fucking awesome.

  4. fo0bar says:

    I would have rejected "Ukrainian" and "Ugandan" much sooner, considering "told an Ukrainian" is not grammatically valid.

    • deeptape says:

      Now that just makes too much sense.

    • aprilized says:

      That made me wonder....

      Only countries that begin with "Y" and "U" are exempt from the 'AN in front of a vowel' rule...

      I think they're easier to pronounce with an 'a' in front of them....

      just a thought..

      • wfaulk says:

        The rule is that “an” precedes words that start with a vowel sound, not just simply a vowel. Since “Ugandan”, for example, starts with a ‘Y’ sound, it gets an “a”.

        It's also important to point out that while ‘Y’ is sometimes a vowel, the traditional ‘Y’ sound is not a vowel sound. It's only a vowel when pronounced like one of the always-a-vowel vowels. For example ‘W’ is also sometimes a vowel (“cwm”, e.g.), but you'd never say “an wheelbarrow”.

        Also somewhat interesting is that people who pronounce a very slight initial ‘H’ use “an” when the ‘H’ is followed by a vowel sound, as in “an hamburger”. This is often viewed as more British.

    • 33mhz says:

      I imagine it's much faster to match against words that start with a vowel and then remove the ones that start with a consonant sound by hand than to plug in phonological data for your whole dictionary.
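      That two-pass approach could be sketched like this: keep words that start with a vowel letter, and set aside the ones a person should check by ear. Flagging only initial "u" is a crude assumption made for this example; it misses cases like "European" or "one", and the function name is made up.

```python
VOWEL_LETTERS = set("aeiou")


def split_after_an(words):
    """Split words that could follow "an" into two lists:
    those kept automatically, and those set aside for manual review."""
    keep, review = [], []
    for word in words:
        first = word[0].lower()
        if first not in VOWEL_LETTERS:
            continue  # starts with a consonant letter, cannot follow "an"
        if first == "u":
            review.append(word)  # may start with a 'Y' sound, check by hand
        else:
            keep.append(word)
    return keep, review


if __name__ == "__main__":
    print(split_after_an(["Egyptian", "Ukrainian", "Ugandan", "British"]))
```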

  5. naturalborn says:

    A much simpler attack is to request the same document twice. They don't always black out the same stuff.