"The plaintiff gains the power to traverse multiple silos of data"

The Traceability of an Anonymous Online Comment

Suppose that I post an anonymous and potentially defamatory comment on a Boing Boing article, but Boing Boing for some reason is unable to supply the plaintiff with any hints about who I am -- not even my IP address. The plaintiff will only know that my comment was posted publicly at "9:42am on Fri. Feb 5." But as I mentioned yesterday, Boing Boing -- like almost every other site on the web -- takes advantage of a handful of useful third party web services.

For example, one of these services -- for an article that happens to feature video -- is an embedded streaming media service that hosts the video that the article refers to. The plaintiff could issue a subpoena to the video service and ask for information about any user that loaded that particular embedded video via Boing Boing around "9:42am on Fri. Feb 5." There might be one user match or a few user matches, depending on the site's traffic at the time, but for simplicity, say there is only one match -- me. Because the video service tracks each user with a unique persistent cookie, the service can and probably does keep a log of all videos that I have ever loaded from their service, whether or not I actually watched them. The subpoena could give the plaintiff a copy of this log.

In perusing my video logs, the plaintiff may see that I loaded a different video, earlier that week, embedded into an article on TechCrunch. He may notice further that TechCrunch uses Google Analytics. With two more subpoenas -- one to TechCrunch and one to Google -- and some simple matching up of dates and times from the different logs, the plaintiff can likely rebuild a list of all the other Analytics-enabled websites that I've visited, since these will likely be noted in the records tied to my Analytics cookie.
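To make the "simple matching up of dates and times" concrete, here is a minimal Python sketch. Everything in it is invented for illustration: the log rows, the cookie names, the field names, the year, the webmail site, and the five-minute window are all assumptions, not anything a real video or analytics service is known to record in this form.

```python
from datetime import datetime, timedelta

# Hypothetical rows from the video host's log (one row per embedded-video load).
video_log = [
    {"cookie": "vid-abc", "page": "boingboing.net/article", "time": datetime(2010, 2, 5, 9, 41)},
    {"cookie": "vid-abc", "page": "techcrunch.com/post", "time": datetime(2010, 2, 3, 14, 10)},
    {"cookie": "vid-xyz", "page": "boingboing.net/article", "time": datetime(2010, 2, 5, 16, 30)},
]

# Hypothetical rows from the analytics provider's log (one row per page view).
analytics_log = [
    {"cookie": "ga-111", "page": "techcrunch.com/post", "time": datetime(2010, 2, 3, 14, 10)},
    {"cookie": "ga-111", "page": "example-webmail.test/inbox", "time": datetime(2010, 2, 4, 8, 0)},
    {"cookie": "ga-222", "page": "techcrunch.com/post", "time": datetime(2010, 2, 3, 18, 0)},
]


def cookies_near(log, page, when, window):
    """Return the cookies that touched `page` within `window` of `when`."""
    return {
        row["cookie"]
        for row in log
        if row["page"] == page and abs(row["time"] - when) <= window
    }


def history(log, cookie):
    """Return every (time, page) pair recorded for one cookie, oldest first."""
    return sorted((row["time"], row["page"]) for row in log if row["cookie"] == cookie)


def trace(page, when, window=timedelta(minutes=5)):
    """Comment time -> video cookie -> analytics cookie -> visited pages."""
    sites = set()
    for vid_cookie in cookies_near(video_log, page, when, window):
        for seen_at, seen_page in history(video_log, vid_cookie):
            # Look for an analytics cookie on the same page at about the same time.
            for ga_cookie in cookies_near(analytics_log, seen_page, seen_at, window):
                sites.update(p for _, p in history(analytics_log, ga_cookie))
    return sorted(sites)


comment_time = datetime(2010, 2, 5, 9, 42)
print(trace("boingboing.net/article", comment_time))
```

Note that neither log needs to contain a name or an IP address. The join key is just "same page, about the same time," which is why a wider time window or a busier site produces more candidate matches.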

The bottom line: From the moment I first load that video on Boing Boing, the plaintiff gains the power to traverse multiple silos of data, held by independent third party entities, to trace my activities and link my anonymous comment to my web browsing history. Given how heavily I use the web, my browsing history will tell the plaintiff a lot about me, and it will probably be enough to uniquely identify who I am.

But this is just one example of many potential paths that a plaintiff could take to identify me. Recall from yesterday that when I visit Boing Boing, the site quietly forwards my information to the servers of at least 17 other parties. Each one of these 17 is a potential subpoena target in the first round of discovery. The information culled from this first round -- most importantly, what other websites I've visited and at what times -- could inform a second round of subpoenas, targeted to these other now-relevant websites and third parties. From there, as you might already be able to tell, the plaintiff can repeat this data linking process and expand the circle of potentially identifying information.
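The round-by-round expansion described here is essentially a breadth-first search. A rough Python sketch follows, using a made-up map of which party's logs point to which other parties; the party names and links are hypothetical and only loosely follow the example above.

```python
# Hypothetical map: subpoena target -> other parties its logs would point to.
reveals = {
    "Boing Boing page": ["video host", "analytics provider"],
    "video host": ["TechCrunch"],
    "TechCrunch": ["analytics provider"],
    "analytics provider": ["webmail site"],
    "webmail site": [],
}


def subpoena_rounds(start, reveals):
    """Group newly discovered parties by the discovery round that reveals them."""
    seen = {start}
    frontier = [start]
    rounds = []
    while frontier:
        next_frontier = []
        for party in frontier:
            for other in reveals.get(party, []):
                if other not in seen:  # each party is targeted only once
                    seen.add(other)
                    next_frontier.append(other)
        if next_frontier:
            rounds.append(next_frontier)
        frontier = next_frontier
    return rounds


for number, targets in enumerate(subpoena_rounds("Boing Boing page", reveals), start=1):
    print(f"round {number}: {targets}")
```

The process stops only when a round turns up no party that has not already been seen, so the reach of discovery is set by how densely the third parties are linked, not by anything the commenter did.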

See also EFF's Panopticlick, which shows that even with cookies turned off, your user agent string alone contains enough information to narrow you down (on average) to about 1 in 1,500 people in the world.
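For a sense of scale, a figure like "1 in 1,500" can be restated as bits of identifying information. The Python sketch below does that arithmetic. The world population figure is my own rough assumption, and multiplying several signals together assumes they are statistically independent, which real browser attributes generally are not.

```python
import math

WORLD_POPULATION = 6_800_000_000  # rough assumption, not a figure from Panopticlick


def bits(one_in_n):
    """Bits of identifying information carried by a '1 in N' signal."""
    return math.log2(one_in_n)


def crowd_size(population, signals):
    """People left sharing your values, assuming the signals are independent."""
    remaining = population
    for one_in_n in signals:
        remaining /= one_in_n
    return remaining


print(f"user agent alone: {bits(1500):.2f} bits")
print(f"needed to single out one person: {bits(WORLD_POPULATION):.2f} bits")
print(f"people sharing your user agent: {crowd_size(WORLD_POPULATION, [1500]):,.0f}")
```

The user agent on its own still leaves you in a crowd of millions. The point is that each additional signal divides the crowd again.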

The most surprising thing to me was that web servers can get the list of all the fonts installed on your system -- and that that is usually even more uniquely-identifying than the user-agent string.

21 Responses:

  1. fantasygoat says:

    They might be able to identify a unique machine, but how might they connect that machine to a person?

    For example, I surf from work, which uses a single NATed IP for all workstations. At best they could narrow it down to an office of dozens of people.

    However, doing the subpoena dance they could probably eventually track it down to a name.

    • jwz says:

      IP → ISP → employer → logs on NAT box. Two subpoenas, and both recipients will roll over on you at the drop of a hat.

      • jered says:

        Also, the series of cookie-linking described in the article means you could be tracked to a webmail account, personal photo album... Anything with your username in the URL sent as a referrer to an ad or analytics provider.

        The combination of fonts and plugins means my laptop tested unique. Of course, all iPhones of the same sw rev and time zone test the same, as will the iPad.

        • jwz says:

          Surprisingly, my iPhone and a friend's iPhone, same hardware model and OS, did not show up as identical. I'm not sure what the difference was.

          Someone theorized that some installed apps show up as browser plugins, and the server can get a list of those (and their version numbers).

      • fantasygoat says:

        You assume that such logs are kept for any length of time, or at all. For example, my employer doesn't log any network traffic at all on the internal network.

        I know this because I run it. Now, at a Fortune 500 company there might be a process but in my 15 years of industry work I've never seen a small to medium shop keep more than a day's worth of traffic logs.

        An ARIN query will connect the office to an IP without even a subpoena, but after that, they're SOL generally.

        If they're smart, they'll go another way and get a list of destinations from the ISP for the time period and work back from that. The ISP *may* keep such logs, although again in my experience never more than a week's worth.

        • jwz says:

          Well aren't you just Mr. Special Fancypants then.

        • strspn says:

          Essentially every "small to medium" shop and person in the developed world keeps about 30-60 days of traffic on their hard disk, and a representative distribution of its sources in their ISP's DNS cache.

        • mhoye says:

          Your experience is the precise opposite of mine. Centralized, backed up logging is the first thing I set up at any sysadmin job I've had; how else can you figure out what's going on?

          In this modern age disk space is basically free and Splunk is needle-in-a-moon-sized-haystack fantastic. Why wouldn't you?

          • fantasygoat says:

            There's logging and then there's *logging*. "Basically free" and "free" aren't quite the same thing. And Splunk ain't free either.

            When you're forced to build a site's main router out of pieces you find in the back room because you have a limited budget, or worse, when you need to get management approval to buy someone a USB hub, spending tens of thousands of dollars to store detailed router logs going back to the stone ages suddenly doesn't quite matter as much.

            Apache logs? Sure, I've got those going back 60 days. But router logs? Forget it.

    • lionsphil says:

      Orthogonal to the "getting down to the individual machine" part, I wonder if there's a potential argument here along the same lines as the need for front-facing photo speed cameras: proof that the actual person using the computer was the defendant.

      Mind you, given that there have been cases of teachers getting put on kiddie-fiddler lists for having machines packed with malware start popping up porn ads during presentations, this is presumably another point where computers are considered magical and the user account is inextricably tied to a single person.

  2. krick says:

    Do you have a source for the font list thing?

    • chrisb74 says:

      Try this and it will happily show you your fonts:

      https://panopticlick.eff.org/

      • luserspaz says:

        I have Java disabled (good riddance) and Flashblock installed, and that seems to be enough to prevent that from working.

    • leolo says:

      If you want to know if font X is installed, this can be done with a javascript hack: http://www.lalit.org/wordpress/wp-content/uploads/2008/05/fontdetect.js?ver=1.5

      To get a list of fonts installed, you use flash:
      http://www.maratz.com/blog/archives/2006/08/18/detect-visitors-fonts-with-flash/

    • emn13 says:

      If you take a look at the panopticlick output, you'll see the fonts aren't reported in any particularly obvious order. Hearsay has it that the order is actually determined by whatever order the file entries happen to be on disk. The OS does not guarantee any particular ordering in directory entries; the order may well depend on fragmentation at the time of writing, any defragmenting you've done, the order in which you install apps that install new fonts, the speed and caching effects of your I/O subsystem (and other unrelated load at the time of writing) etc.

      In short, even if you have a fairly common setup with apps and OS similar to those of many others, and even if you have the same fonts, that does not mean that your font list is actually the same.

      Supposedly, anyhow. Sounds believable, though...

  3. discogravy says:

    i'm reading this as "do anything potentially obnoxious via a linux LiveCD from a wifi-enabled DNA Lounge hotspot cafe"

  4. lionsphil says:

    Font information was enough to get me uniquely. Turn off JavaScript (leaving plugins on), however, and "one in 84,945 browsers have the same fingerprint as yours".

    Unfortunately, the tendency for people to write the damn object elements out using JavaScript these days means you won't be watching any BoingBoing embedded videos this way anyway.