Redacting the Redactors

Timothy B. Lee: Studying the Frequency of Redaction Failures in PACER

I wrote software to detect redaction rectangles—it turns out these are relatively easy to recognize based on their color, shape, and the specific commands used to draw them. Out of 1.8 million PACER documents, there were approximately 2000 documents with redaction rectangles. (There were also about 3500 documents that were redacted by replacing text by strings of Xes.)

Next, my software checked to see if these redaction rectangles overlapped with text. My software identified a few hundred documents that appeared to have text under redaction rectangles, and examining them by hand revealed 194 documents with failed redactions. The majority of the documents (about 130) appear to be from commercial litigation, in which parties have unsuccessfully attempted to redact trade secrets such as sales figures and confidential product information. Other improperly redacted documents contain sensitive medical information, addresses, and dates of birth. Still others contain the names of witnesses, jurors, plaintiffs, and one minor.
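The overlap check he describes is simple geometry once you have the boxes. Here's a rough sketch of the idea, not Lee's actual code: it assumes some PDF parser has already handed you the filled rectangles (with their fill color) and the bounding boxes of the text runs on a page. The names `Box`, `looks_like_redaction` and `covered_text`, and all of the thresholds, are made up for illustration.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class Box:
    """Axis-aligned rectangle in page coordinates (hypothetical type)."""
    x0: float
    y0: float
    x1: float
    y1: float

    def area(self) -> float:
        return max(0.0, self.x1 - self.x0) * max(0.0, self.y1 - self.y0)


def intersection_area(a: Box, b: Box) -> float:
    """Area shared by two boxes, or 0.0 if they do not overlap."""
    w = min(a.x1, b.x1) - max(a.x0, b.x0)
    h = min(a.y1, b.y1) - max(a.y0, b.y0)
    return w * h if w > 0 and h > 0 else 0.0


def looks_like_redaction(fill_rgb: Sequence[float], box: Box,
                         max_channel: float = 0.1,
                         min_area: float = 20.0) -> bool:
    """Guess: a near-black filled rectangle that is not tiny.

    Both thresholds are assumptions, not values from the study.
    """
    return max(fill_rgb) <= max_channel and box.area() >= min_area


def covered_text(redactions: List[Box],
                 text_runs: List[Tuple[str, Box]],
                 min_fraction: float = 0.5) -> List[str]:
    """Return text runs that are mostly covered by some redaction box."""
    hits = []
    for text, tbox in text_runs:
        if tbox.area() == 0:
            continue
        for rbox in redactions:
            if intersection_area(rbox, tbox) / tbox.area() >= min_fraction:
                hits.append(text)
                break
    return hits


if __name__ == "__main__":
    redactions = [Box(100, 100, 200, 120)]
    runs = [
        ("secret", Box(110, 102, 190, 118)),   # fully under the rectangle
        ("public", Box(300, 100, 380, 118)),   # nowhere near it
    ]
    print(covered_text(redactions, runs))
```

Anything this flags is only a candidate. As the excerpt says, the hits still had to be examined by hand to confirm that the redaction had actually failed.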

Google drops another turd in the punchbowl

You may have heard that Google went and invented a new still-image format, because the zillion we already have apparently aren't good enough. It's a disaster and Mozilla has rejected it, but they're putting it in Chrome anyway.

Oh well, despite that, I'm sure it will be every bit as successful as VP8, Orkut, Wave and Buzz were. (And Ogg, though we can't pin that one on them.)

Jeff Muizelaar:

WebP also comes across as half-baked. Currently, it only supports a subset of the features that JPEG has. It lacks support for any color representation other than 4:2:0 YCrCb. JPEG supports 4:4:4 as well as other color representations like CMYK. WebP also seems to lack support for EXIF data and ICC color profiles, both of which have become quite important for photography. Further, it has yet to include any features missing from JPEG like alpha channel support. [...]

Every image format that becomes "part of the Web platform" exacts a cost for all time: all clients have to support that format forever, and there's also a cost for authors having to choose which format is best for them. [...]

Where does that leave us? WebP gives a subset of JPEG's functionality with more modern compression techniques and no additional IP risk to those already shipping WebM. I'm really not sure it's worth adding a new image format for that. Even if WebP was a clear winner in compression, large image hosts don't seem to care that much about image size. Flickr compresses their images at libjpeg quality of 96 and Facebook at 85: both quite a bit higher than the recommended 75 for "very good quality". Neither of them optimize the huffman tables, which gives a lossless 4-7% improvement in size. Further, switching to progressive JPEG gives an even larger improvement of 8-20%.

jwz mixtape 146E

My one hundredth mixtape is coming up soon, but before that, I thought I'd re-release a few of my favorite mixtapes from the first year. These are audio-only, and so they will expire in two weeks. Please enjoy mixtapes ØØ1, ØØ4, ØØ6 and Ø14.

Physics of My Little Pony

"How to fix this: Butterflies could have been made from dark matter."

It's probably about time that you re-read Apocamon.

Moonman-language tweets mentioning @jwz in the last two weeks:

Click Trajectories: End-to-End Analysis of the Spam Value Chain

This paper is awesome:

Spam-based advertising is a business. While it has engendered both widespread antipathy and a multi-billion dollar anti-spam industry, it continues to exist because it fuels a profitable enterprise. We lack, however, a solid understanding of this enterprise’s full structure, and thus most anti-spam interventions focus on only one facet of the overall spam value chain (e.g., spam filtering, URL blacklisting, site takedown). In this paper we present a holistic analysis that quantifies the full set of resources employed to monetize spam email— including naming, hosting, payment and fulfillment—using extensive measurements of three months of diverse spam data, broad crawling of naming and hosting infrastructures, and over 100 purchases from spam-advertised sites. We relate these resources to the organizations who administer them and then use this data to characterize the relative prospects for defensive interventions at each link in the spam value chain. In particular, we provide the first strong evidence of payment bottlenecks in the spam value chain; 95% of spam-advertised pharmaceutical, replica and software products are monetized using merchant services from just a handful of banks.

Bobby Pflugelblager

Referer snooping

Dear Lazyweb, is there a service like Google Blog Search but that actually works?

We will define "works" as: gives me an RSS feed of timely references to or links to my sites; omits links that are months or years old; and is not brimming with spam. (Google Blog Search fails on all three of these.)

