News flash: "anonymized" data sets aren't.

Riding with the Stars: Passenger Privacy in the NYC Taxicab Dataset

There has been a lot of online comment recently about a dataset released by the New York City Taxi and Limousine Commission. It contains details about every taxi ride (yellow cabs) in New York in 2013, including the pickup and drop off times, locations, fare and tip amounts, as well as anonymized (hashed) versions of the taxi's license and medallion numbers. It was obtained via a FOIL request earlier this year and has been making waves in the hacker community ever since.

The release of this data in this unalloyed format raises several privacy concerns. The most well-documented of these deals with the hash function used to "anonymize" the license and medallion numbers. A bit of lateral thinking from one civic hacker and the data was completely de-anonymized. This data can now be used to calculate, for example, any driver's annual income. More disquieting, though, in my opinion, is the privacy risk to passengers. With only a small amount of auxiliary knowledge, using this dataset an attacker could identify where an individual went, how much they paid, weekly habits, etc. I will demonstrate how easy this is to do in the following section.
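
To see why hashing a small identifier space offers so little protection, here is a minimal sketch. It assumes the identifiers were hashed with unsalted MD5 and follow a digit-letter-digit-digit pattern such as 5X55. Both are assumptions for illustration, not details stated in the excerpt above, and the function names are hypothetical.

```python
import hashlib
import string
from itertools import product


def md5_hex(text: str) -> str:
    """Return the hex MD5 digest of an ASCII string (assumed hash scheme)."""
    return hashlib.md5(text.encode("ascii")).hexdigest()


def build_lookup() -> dict[str, str]:
    """Hash every ID of the assumed form digit-letter-digit-digit.

    The keyspace is only 10 * 26 * 10 * 10 = 26,000 values, so the whole
    hash-to-plaintext table can be precomputed almost instantly.
    """
    table = {}
    for d1, letter, d2, d3 in product(
        string.digits, string.ascii_uppercase, string.digits, string.digits
    ):
        plain = f"{d1}{letter}{d2}{d3}"
        table[md5_hex(plain)] = plain
    return table


lookup = build_lookup()

# Pretend this digest came from the "anonymized" column of the dataset.
hashed = md5_hex("7B41")
print(len(lookup))     # 26000
print(lookup[hashed])  # 7B41
```

When the set of possible inputs is this small, an unsalted hash works as a reversible encoding rather than as anonymization: anyone can hash every candidate and read the original value back out of the table.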

tl;dr: Jessica Alba didn't tip.

Previously, previously, previously.


7 Responses:

  1. Jason McHuff says:

    If so, I do not think it was appropriate to release the data at such a detailed level. Zip code-level can be OK, but not exact pick up+drop off location. Also, it's a tiny bit surprising that the city is storing every single detail since taxis are private businesses and that level doesn't seem to be needed for regulatory purposes.

    And the only thing that they did (poorly) anonymize--the particular cab that did the trip--doesn't seem to be important.

  2. John Adams says:

    Years ago this happened with AOL's dataset. When will people learn? https://en.wikipedia.org/wiki/AOL_search_data_leak

  3. John Adams says:

    I guess people didn't learn from last time.

    https://en.wikipedia.org/wiki/AOL_search_data_leak

  4. tobias says:

    Double plus bonus points to those who match up senators from time-tagged, geolocated photos to taxis and routes, to see who's been sleeping around or coming from unexpected locations.

  5. nooj says:

    The same is true of supposedly-HIPAA-compliant use of medical data. That thing you sign at the dr's office that allows them to transfer medical records electronically? Yeah, that transfer and those databases might be insecure. Even the anonymized versions.

    I was doing some database work for a doomed startup wanting to get into this field. These were two or three guys who had a good presentation once which got them an offer for $10M for their company. Instead of taking the offer, they decided to redouble their efforts and shoot for $100M or $1B. They figured they could "redirect" and sell backend services to multiple major players.

    Anyway, so they hired me to throw on a couple of bells and whistles, and they hired a marketing guy to rebrand, etc. So I'm testing with a bunch of chest x-rays and colonoscopy data, seeing plaintext names splattered all over within the non-standardized data fields, realizing that none of this shit is secure at all, wondering if anyone will actually be paid to fix it before the prototype becomes release 1.1. At the other end of the room, the CEO and marketing guy are thinking of a new slogan.

    I say, "Our backend is open to you."

    The entire room burst out laughing, but, sadly, no one was brave enough to use it.
