Tales from the PDF Wars

Someone wrote an article about how awesome PDF is, but the fun part is all of the rebuttals on this Metafilter thread. Some choice ones:

PDF isn't "a" file format, it's a metastatic tumor of file dialects.

It's so nice to know that closed source proprietary bullshit with all kinds of security issues is "king" in the workplace. Adobe can get off my lawn.

I stopped taking PDF seriously when I found out they'd extended it to allow for embedded Javascript.

PostScript is a complete computer language. Any task that can be expressed in Javascript can also be expressed in PostScript.

The whole point of PDF was to be a cut-down subset of PostScript that was not a complete scripting language and therefore wasn't so demanding on the limited processors of the day. If the format had any integrity at all, and if the designers had any real interest in minimizing its security attack surface, such scripting as it needs would be provided by relaxing restrictions and allowing more of PostScript (perhaps just enough to allow for execution of transpiled Javascript) to be supported, not by bolting in a second script interpreter.

I find the blind trust folks place in PDFs especially hair-raising ever since the one time some numbers in a PDF, which Preview on my Mac let me select and copy, turned out as other numbers when pasted (something like that weird old photocopier effect).

I've seen PDFs where everything that actually makes it onto the paper is just a single pre-rendered 300dpi compressed image; it's entirely possible that the weird old photocopier effect is exactly what was going on with your PDF, and that the numbers you copied and pasted, being derived from a hidden semantic layer rather than the visible graphics so tempting to take as definitive, were actually the correct ones.

On the other hand, those graphics might have come from a scanner and the "underlying" semantic layer added after scanning (but possibly before image compression) via OCR. Hard to tell without examining the PDF concerned in a text editor.
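If squinting at a PDF in a text editor sounds tedious, a rough first pass at the same inspection can be scripted. The sketch below is only that: it counts a few name tokens in the raw bytes of the file, and the marker list, `count_pdf_markers`, and `scan_file` are all made up for illustration. It assumes the interesting names appear uncompressed and unescaped, which is often not true, since objects can live inside compressed object streams (hence the `/ObjStm` count as a warning sign) and names can be written with `#`-hex escapes. A zero here proves nothing; a nonzero `/JavaScript` count is merely a hint to look closer.

```python
import re

# Name tokens worth counting. This list is an illustrative assumption,
# not an exhaustive inventory of what a PDF can contain.
MARKERS = [b"/JavaScript", b"/JS", b"/Font", b"/Image", b"/ObjStm"]


def count_pdf_markers(data: bytes) -> dict:
    """Count occurrences of each marker name in raw, undecoded PDF bytes."""
    counts = {}
    for marker in MARKERS:
        # Reject matches followed by a letter or digit, so that /JS
        # does not also match a longer name such as /JSFoo.
        pattern = re.escape(marker) + rb"(?![A-Za-z0-9])"
        counts[marker.decode("ascii")] = len(re.findall(pattern, data))
    return counts


def scan_file(path: str) -> dict:
    """Read a file as bytes and count the markers in it."""
    with open(path, "rb") as f:
        return count_pdf_markers(f.read())
```

A file that reports plenty of `/Image` and no `/Font` is a decent candidate for the "single pre-rendered image" case; one with both may well be a scan with an OCR text layer behind it. Telling those apart for certain still means actually parsing the thing.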

And when Apple added vector artwork to iOS, they settled on PDF as the file format. Which is a colossal WTF.

Presumably there is code somewhere (ideally at compiletime, though probably at runtime) that goes through the PDF, picks out the vector paths, ignoring the myriad of other possibilities for what could be in there, and creates some sort of concise in-memory data structure. One hopes that the sheer openness of the PDF format does not lead to an attack surface measurable in parsecs. (What's to say that, at some tributary of the code path, the PDF isn't passed to a third-party library which runs the JavaScript in it and has a buffer overflow in it somewhere, for example?)

And when Apple added vector artwork to iOS, they settled on PDF as the file format. Which is a colossal WTF.

Because the Quartz stack is literally a working GPU-accelerated implementation of Display PostScript.

NeXT used Display PostScript, so this goes back a while.

My D&D campaign has ended up using some crazy PDF files for our character sheets. They don't work in OS X's Preview, only in the Acrobat viewer - in fact they claim that even opening them in Preview can irrevocably fuck them up, though that hasn't happened to me yet. They are chock full of all kinds of JavaScript to deal with all the arcane calculations involved in building and leveling up a D&D character; they have their own toolbar that pops up. It's kind of insane.



11 Responses:

  1. Other Jamie says:

    In time, someone will build a self-hosted PDF editor in a PDF.

  2. Happen Muche says:

    Working for a small company swallowed up by a bureaucratic multinational, I once spent two hours editing a previous quarter's PDF invoice to a client with LibreOffice, to match what the current quarter's invoice to the client should have read, if the accountant responsible for generating the invoice hadn't fucked off on holiday without ensuring that someone would cover doing that. Which we only found out about because the contact at the client got a call from their accountants saying that if they didn't get an invoice within the day, we would not be paid that month.

    Fortunately the boss managed to get someone who wasn't out of contact to sort it out, and we got it legitimately generated, with a real invoice number and everything. But I'm pretty sure I would rather have faced the fallout from illicitly making up an invoice than fucked around with that awful file for even another hour.

  3. emacsomancer says:

    Be that as it may, I'll still take a PDF over a .docx any day.

    • internetimal says:

      It's interesting to look at docx as the result of a very necessary internal exercise to document what the seething mess of the .doc format was even doing, from which the very convenient political / marketing hell it allowed for was only a side effect.

      Apple's use of XML mostly as a container for opaque piles of hex strings gets a runner-up award.

  4. tfb says:

    Is there yet a second-order PDF: a format which is a safe, fast subset of PDF the way PDF once was of PS? This should obviously be called PPDF or P2DF (I don't think sup is an allowed tag sadly).

    • internetimal says:

      PDF/A and possibly PDF/UA attempt to be this and at least the former is relied on for government archival - court filings etc. I am not sure how well the standard culls the more bizarre stuff because as much focus is on making sure enough of the kitchen sink is in there so there can even be a deterministic rendering without external references.

    • asciiphil says:

      The Qubes developers thought about this for at least a little bit. What they came up with is basically rasterizing the PDF page by page. Not the most satisfying answer.

  5. Wout says:

    So why isn't there a real standard for a zipball of html + all deps + page size definitions? I can't think of a single thing that PDF does that this wouldn't be better at.

  6. walrus says:

    I may be over a decade out of the game, but in the late 90’s when I was up to my neck in deadlines in a publishing industry still trying to wean itself from hand-stripped lithographic separations while “press-ready” artwork was being submitted in corrupted Word files, or worse, wonky Corel EPS files, PDF was the only thing that saved my sanity.

  7. thielges says:

    The cause of this and other standards borkage is misplaced business authority. Here’s what happens: a big customer like Bank of America is up for renewal of their three-year all-you-can-eat contract with Adobe. The deal is worth millions and BofA knows that this is their chance to get concessions, both technical and contractual.

    So BofA gets their top tech person in the loop to start making technical enhancement demands with the hopes of coercing Adobe into a better deal. The top tech gopher says something like “must has JavaScript”. Adobe pushes back, pointing out that PostScript already supports the needed features of JS. That tech gopher is now a temporary tin pot dictator and realizes that they have weight to throw around. Stunningly, they get their way and JS is implemented. BofA doesn’t get their better deal but instead gets an unnecessary feature, which was a short term financial gain for Adobe.

    That’s speculative fiction above. I’m not an Adobe insider but I do deal with standards and see this exact scenario play out a few times a year. In fact I’m currently defending against one this week.

    When it is $$ versus long term purity and stability, the people steering the $$ always win, in the short term. Long term the pain is spread far and wide.
