I wrote software to detect redaction rectangles—it turns out these are relatively easy to recognize based on their color, shape, and the specific commands used to draw them. Out of 1.8 million PACER documents, there were approximately 2000 documents with redaction rectangles. (There were also about 3500 documents that were redacted by replacing text by strings of Xes.)
Next, my software checked to see if these redaction rectangles overlapped with text. My software identified a few hundred documents that appeared to have text under redaction rectangles, and examining them by hand revealed 194 documents with failed redactions. The majority of the documents (about 130) appear to be from commercial litigation, in which parties have unsuccessfully attempted to redact trade secrets such as sales figures and confidential product information. Other improperly redacted documents contain sensitive medical information, addresses, and dates of birth. Still others contain the names of witnesses, jurors, plaintiffs, and one minor.
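The overlap check described above can be sketched roughly as follows. This is not the author's software, just a minimal illustration that assumes the filled rectangles and the text spans have already been pulled out of a PDF page with their bounding boxes. The `Box` and `TextSpan` types, the `text_under_rectangles` function, and the `min_covered` threshold are all made up for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box in page coordinates (hypothetical type)."""
    x0: float
    y0: float
    x1: float
    y1: float

    def area(self) -> float:
        return max(0.0, self.x1 - self.x0) * max(0.0, self.y1 - self.y0)


@dataclass(frozen=True)
class TextSpan:
    """A run of extracted text and where it sits on the page (hypothetical type)."""
    text: str
    box: Box


def intersection_area(a: Box, b: Box) -> float:
    """Area shared by two boxes, or 0.0 if they do not overlap."""
    w = min(a.x1, b.x1) - max(a.x0, b.x0)
    h = min(a.y1, b.y1) - max(a.y0, b.y0)
    return w * h if w > 0 and h > 0 else 0.0


def text_under_rectangles(
    rects: List[Box], spans: List[TextSpan], min_covered: float = 0.5
) -> List[TextSpan]:
    """Return spans with at least `min_covered` of their area under one rectangle.

    The 0.5 default is an arbitrary assumption, not a value from the article.
    """
    hits = []
    for span in spans:
        span_area = span.box.area()
        if span_area == 0:
            continue  # skip degenerate boxes rather than divide by zero
        for rect in rects:
            if intersection_area(rect, span.box) / span_area >= min_covered:
                hits.append(span)
                break  # one covering rectangle is enough
    return hits


# Made-up page: one black rectangle and three text spans.
rects = [Box(100, 700, 300, 715)]
spans = [
    TextSpan("hidden text", Box(105, 702, 250, 712)),     # fully covered
    TextSpan("visible text", Box(100, 740, 160, 750)),    # no overlap
    TextSpan("mostly visible", Box(290, 702, 400, 712)),  # small overlap only
]
suspects = text_under_rectangles(rects, spans)
```

A hit from a check like this is only a candidate. As the article says, the flagged documents still had to be examined by hand to confirm a failed redaction.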
Previously, previously, previously, previously, previously, previously.
So, counting only the manually confirmed ones, that's a redaction failure rate of at least 9.7% (194 of roughly 2000). :/
Reminds me of the PostScript program you (via IRC) inspired me to reinvent for an April Fool's prank back in 2002...
The "CM/ECF" system(s) that take in the PDFs already modify them enough to bang a timestamp, etc. up top.
I guess adding software de-stupiding really is a pain in the ass when you have a text stream and need to figure out what text is actually under the "someone-didn't-think-this-through polygon" and replace it without reflowing the text so that other crap ends up under the polygon. And it's Not Really the Government's Problem, since it's the attorneys who are obliged to get it right and who face penalties for failure.
[Also, when using the black marker method, you are still doomed to discover oh-shit-someone-missed-one right after it gets filed. The procedure for correcting that is pretty arcane and amounts to an admission, so standard operating procedure is to Hope Nobody Notices. Adding something like a 6-hour did-you-really-want-to-do-that? delay would help, but it would also make deadline compliance more confusing and be exploitable to get another 6 hours under deadline pressure (which would probably increase quality immeasurably, but that's another story).]
Even software under-black-removal wouldn't help against previously #3, though.