Shortlinks

Shortlinks are terrible for all kinds of reasons, but this post isn't about that. But let me get that part out of the way first:

  • They obscure the destination you're about to click on, making them a primary tool for phishing attacks.
  • They train people that not looking at their link destinations is a reasonable thing to do.

  • Each shortening "service" introduces a new point of failure: when, not if, they go out of business, they will have broken a vast swath of the web equivalent to their market share.

  • The real reason that link shorteners exist is not actually to save typing, or reading, but as a tool of surveillance: the shortening "service" wants to interject itself between your mouse and the destination site to sell those hit statistics to other people.

  • Twitter, who inflicted this blight upon the world in the first place, won't even respect the shortlinks that sites provide on their own, but instead double-encode them using their own shortener. They say this is for "security" reasons but that's a bald-faced lie that I'm sure I don't have to... unpack... for you.

So, all that aside -- it's still an interesting numerical / bit-twiddling problem, on a purely technical level.

Back when I switched to WordPress, I noticed that the "shortlinks" it generated for every post were terrible. They really weren't that short at all, just appending the base 10 numeric post ID to the blog's base URL. They were barely shorter than the long URL that includes the post's whole subject. So I wrote a plugin to do better. For example, the blog post:

    https://www.jwz.org/blog/2011/08/base64-shortlinks/

has this default shortlink:

    https://www.jwz.org/blog/?p=13240780 (35 bytes)

Other services give us:

    http://tinyurl.com/3et9fw7 (26 bytes)
    http://bit.ly/qbFuII (20 bytes)
    http://goo.gl/xraFX (19 bytes)
    http://t.co/jJAv1SQ (19 bytes)
    http://dnklg.tk/ (16 bytes)

My code gives us:

    http://jwz.org/b/ygnM (21 bytes)

I did that by just encoding the post's ID number in base64, which is the same thing those other shorteners do, except that the ID in question is intrinsic to the post. Other shorteners either just increment a global variable, or pick a random non-conflicting number. Of course the smaller that number is, the more traversable the space is, which can be a problem.

But since the post's ID number isn't a secret maybe it could be shorter? Could it be fewer than 4 bytes? Sure, if your post IDs were smaller. By default, a brand new WordPress blog gives its first post the ID 100, which encodes as "ZA". This blog currently has 9469 posts, so that would have still been way down in the three-byte space, "JP0". The post IDs don't increase quite monotonically (the number increases every time you do a preview, among other things), but it still would have fit in three.
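The IDs quoted here ("ZA" for 100, "JP0" for 9469, "ygnM" for 13240780) are consistent with one particular reading of "base64": write the integer as the fewest big-endian bytes that hold it, base64-encode those bytes with the URL-safe alphabet, and drop the "=" padding. A minimal Python sketch under that assumption follows. The function names are made up for illustration, and this is not the plugin's actual code.

```python
import base64

def encode_id(n: int) -> str:
    """Encode a non-negative integer ID as unpadded URL-safe base64."""
    # Use the fewest whole bytes that hold n (at least one, so 0 still encodes).
    nbytes = max(1, (n.bit_length() + 7) // 8)
    raw = n.to_bytes(nbytes, "big")
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")

def decode_id(s: str) -> int:
    """Invert encode_id: restore the padding, decode, read as big-endian."""
    padded = s + "=" * (-len(s) % 4)
    return int.from_bytes(base64.urlsafe_b64decode(padded), "big")

print(encode_id(100))       # ZA
print(encode_id(9469))      # JP0
print(encode_id(13240780))  # ygnM
```

Because whole bytes are encoded, the length goes in steps: one byte gives two characters, two bytes give three, and three bytes give four.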

Unfortunately, I used to host my blog on Livejournal, and only migrated it here in 2010. The tool I used to import the blog preserved Livejournal's post ID numbers in the WordPress database. Those were already four bytes: "FDWn" was the last one. And then immediately after that, something went wonky with the import, and subsequent WordPress IDs jumped by eleven million for some reason, all the way up to "ygO-". If I had noticed it at the time, I could have done surgery to pull that number back down, but since then there have been almost 5,000 more posts, and I suspect that WordPress might lose its mind if post IDs are non-increasing. It doesn't matter, though, because these IDs will still fit in 4 bytes for the next 3.5 million posts.
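The "3.5 million" figure is simple arithmetic: four base64 characters carry 24 bits, which is exactly three bytes. A quick check in Python, taking the ID from this post's ?p= URL as the current high-water mark (an assumption, since later posts have higher IDs):

```python
# Four base64 characters hold 4 * 6 = 24 bits, i.e. three whole bytes.
CHARS = 4
capacity = 2 ** (CHARS * 6)      # number of distinct IDs that fit

current_id = 13240780            # the ID in this post's ?p= URL
remaining = capacity - current_id

print(capacity)   # 16777216
print(remaining)  # 3536436
```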

Anyway, a few weeks ago I decided to waste some time making shortlinks for the DNA web site. Since there was no Livejournal fuckery, the WP blog over there already had nice and small IDs that fit in three bytes, so its shortlinks looked like http://dnalounge.com/b/FAM. But I thought it might be interesting to make shortlinks for the various other pages on the site, too. Most of those pages are date-based, so that suggests a way to generate unique IDs that are predictable and do not require a global counter: just use the date! But a time_t is a big number that takes six bytes to encode, so that won't do.

So I computed the number of days since the Epoch instead of the number of seconds (no, you can't just divide, because of leap years and daylight savings). Then there's the matter of the directory (is this a blog post, a calendar page, a flyer page, a gallery page?) and the room suffix (is this a daytime event in the main room, a nighttime event in Above DNA, etc?) So I use 3 bits for each of those, adding 6 bits to the 15-bit day number, and a 21-bit number still handily encodes as 4 bytes.
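Here is a rough Python sketch of that packing. The post only gives the field widths (3 + 3 + 15 bits), so the field order below, the numeric codes for directories and rooms, and the function names are all guesses for illustration, not the site's documented scheme. The day number comes from calendar arithmetic rather than from dividing a time_t.

```python
import base64
from datetime import date

EPOCH = date(1970, 1, 1)

def day_number(d: date) -> int:
    """Whole calendar days since the Epoch, via date arithmetic."""
    return (d - EPOCH).days

def pack(directory: int, room: int, d: date) -> str:
    """Pack directory (3 bits), room (3 bits) and day (15 bits) into 4 chars."""
    day = day_number(d)
    if not (0 <= directory < 8 and 0 <= room < 8 and 0 <= day < 2 ** 15):
        raise ValueError("field out of range")
    # Assumed field order: directory in the top bits, then room, then day.
    n = (directory << 18) | (room << 15) | day
    raw = n.to_bytes(3, "big")  # 21 bits fit in 3 bytes
    # 3 bytes always encode to exactly 4 characters, so there is no padding.
    return base64.urlsafe_b64encode(raw).decode("ascii")

def unpack(s: str):
    """Split a 4-character ID back into (directory, room, day)."""
    n = int.from_bytes(base64.urlsafe_b64decode(s), "big")
    return n >> 18, (n >> 15) & 0b111, n & 0x7FFF

print(pack(6, 6, date(2017, 6, 21)))  # G0O6
print(unpack("G0O6"))                 # (6, 6, 17338)
```

With these guessed field positions, directory code 6, room code 6 and 2017-06-21 come out as "G0O6", the same string as the gallery link in this post. What a code like 6 actually stands for is not something the sketch can tell you. A 15-bit day counter runs out some time in 2059.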

So here's a gallery: http://dnalounge.com/b/G0O6 and its calendar page: http://dnalounge.com/b/C0O6 and flyer: http://dnalounge.com/b/E0O6 and a blog post from around the same time: http://dnalounge.com/b/AUPC. That they start with low capital letters means there's plenty of space left.

Of course those aren't actually all that short, since unsurprisingly, whoever was squatting "DNA.com" back in 1998 never answered my email when I tried to find out what their price for it would be. But if someone wanted to buy me "dnaloun.ge" from the Registrar of the Great Nation of Georgia, I wouldn't say no.

BTW, autocomplete keeps changing "shortlink" to "chortling", which is what I think we should call them now.

Previously.


22 Responses:

  1. Chas. Owens says:

    Do you consider the fact that the short URLs will eventually contain obscenity a feature or a bug? Is it worth planning for the eventual obscenities (ie making sure an appropriate page is reached by the obscenity)? Out of sheer luck, http://dnalounge.com/b/ASS points to https://www.dnalounge.com/backstage/log/2003/04/30.html which starts with "Some photos of the Fetish Ball are up now. ".

    • Phil says:

      Over beers, an IT person told me of a MacBook with FUCK embedded in the serial number. The serial number is exposed in that company's provisioning process, and the person issued the device demanded a different one.

      • Chas. Owens says:

        Heh, I would have loved to have had that serial number.

        So, since it is story time, the company I work for recently decided it wanted to send out coupons to new users. Now, we have had a lot of problems with people sharing their coupons (that are linked to their account) on sites like RetailMeNot.com. These coupons will only work if they have been activated for a given user, so sharing the coupon code tends to lead to angry phone calls to support wanting to know why the code doesn't work. The higher ups decided that, to discourage sharing the coupon code, the username of the user the coupon was issued to should be part of the coupon code.

        This wasn't a bad idea, but, for some reason that was never adequately explained to me, the coupon code was limited to twenty characters and they wanted it to start with "WELCOME<two digit month><two digit year>". That meant there were only nine characters available for the username (which can be up to fifty characters long). In chat I suggested this would cause a problem for users named things like countyassessor, but the best real example (using user names we actually had) was tastybuttercup51. I suggested an obscenity filter, but I don't know if the guy implementing it actually added one or not.

        Of course, one of the users chose fuckinguselesswebsite as his/her username, so there is a limit to what is reasonable for us to do to clean up the coupon code. Heh, I just ran a query against the DB, we have 155 usernames that contain the word "fuck". We also have "blowmeyoucunts". And a potential Scunthorpe problem: EarlyAtticUntiques.

  2. Brian Van Nieuwenhoven says:

    Wonky tech observation:

    WordPress' default links are actually post queries, not meant to be encoded shortlinks. They're the default URL scheme if you don't turn on mod-rewrite-backed "pretty permalinks" (which are infinitely configurable).

    It's because everything you throw at WordPress in any case goes to https://www.jwz.org/blog/index.php, which then calculates the page/view it's supposed to serve.

    "?p=13240780" is a basic GET parameter that is read as "the post ID is 13240780". This is what your public post links internally get mutated into, if you turn on the pretty permalinks - it's the server-side canonical URL for the post. It will ALWAYS work if you manually type it.

    Nobody uses the WP default though as their public link scheme. Its primary failure is bad SEO. (and it's ugly) Rare to see it in the wild unless a writer (who isn't a web dev) set up their own server halfway.

    It could be someone's scheme for "bad shortlinks" if they really wanted it. On a small server, post IDs don't often get past 10^6 or even 10^5.

    Your plugin creates shortlinks that are much better.

    -----

    Another wonky observation/tip: People have made plugins that clean out the post database, deleting/removing stale revisions & other non-public post data. This would free up a lot of IDs but it doesn't solve the problem of grouping the remaining DB rows into a smaller ID cluster range, and it's another thing to chomp away needlessly at your free time, so.

    • jwz says:

      By default WordPress puts the ?p= URL into the <link rel="shortlink"> so to claim that those are not really shortlinks doesn't stand up.

      • Brian Van Nieuwenhoven says:

        LOL. I look stupid

        But to humbly submit a defense: The link/meta block in <head> is declared (or referenced) in header.php in the theme, so it's entirely theme dependent and easily added by a theme author or erased by the end user.

        I thought I might be forgetful for a second about this being a convention for WP, but WordPress' default themes (for self-hosted sites) don't seem to come with a shortlink declaration.

        On the other hand, my WP-based site (which doesn't use a default theme) seems to be writing these as shortlink rels. (I may edit this error out tonight)

        Further - WordPress.com, which runs on some mutated version of the O/S downloadable one, has a completely different shortlink generator.

        So it's somebody's convention at least.

        Regardless of whatever WordPress thinks is good or not, a shortlink isn't supposed to be the URL. It should be a condensed link acting as an alias, pointing at a resource that redirects to the URL. The ?p= convention is, technically, not an alias in the least.

        The idea that someone would call a direct resource link a "shortlink", just because the resource link is not a full English sentence/datestamp, seems pretty dumb to me. And yet here I am with my own website doing it.

        • David A says:

          Why are you explaining what GET parameters are to JWZ?

          • Brian Van Nieuwenhoven says:

            Sorry. I knew that he knew it was a GET parameter.

            The explanation part is where that parameter goes during bootstrapping. Although he figured that out too. It's not encoded or anything, it's the database row ID number. (That's what I really meant by "basic") It gets fed right into a DB query.

            Shortlinks are usually hashes & include a wider charset than base 10. They're also usually routed and not parameterized. There's also usually some intentional logic for shortening rather than posting up a real URL.

            I was intending to make the case that the convention for theme authors setting up rel links for these URLs as "shortlink" is somewhat bizarre & clumsy.

  3. Geoff Smith says:

    Kinda want some chitlins, now.

  4. Ryan says:

    Stupid question: is the "b" for "blog" and "backstage" and only incidentally the same letter, or did you pick it to mean something else (base64)? Or is it just bs all the way down, and you're actually a madder genius than anyone has previously given you credit for?

  5. jwz says:

    Speaking of interjecting rentiers along the lines of link-shortening "services" -- another common technique, used since day one by both Google and Facebook, is to rewrite all outgoing links on their service to go through a middleman, to capture the click. Back in the Jurassic, what you clicked on was between you and the site you were loading (who also got to know from whence you came) but no, the Clown Farms can't tolerate being cut out of the bargain. So they steal your rightful information and you get nothing.

    Meaning that tonight as I watch my error logs scroll by, there is some page on Facebook, probably a promoter-created event page, that has a mangled URL on it that is hitting me with 404 errors.

    But all I get to know about it is that it comes from "facebookexternalhit/1.1".

    GFY, FB.

    • Anonymous says:

      This also breaks the readability of links that end up in your clipboard if you try to copy the link (because Google rewrites the href on mousedown).

      This would have been solved by the "ping" attribute, but the idea was attacked on the grounds that it was non-standard and enabled tracking. What detractors didn't take the time to think about back when "ping" was proposed is that this was a form of tracking that was already starting to happen and was only going to happen more. So we're now in the worst timeline, where the same tracking happens, but it's not easily detectable and disabled; because "ping" was attacked and never got widespread adoption, everyone who would otherwise be using it is mangling links.

    • Dan says:

      For those using Firefox to hit Google, https://github.com/palant/searchlinkfix takes care of the problem. It's just maddening that it's needed at all.

  6. Gustaf says:

    Did you use Julian dates as a basis for the DNALounge day arithmetic?

  7. raven says:

    you're killing me with the plain text urls! LOL

  8. cxed says:

    A couple of months ago, I wrote about this with a focus on how normal (and fancy) people can actually figure out what's lurking behind the link.
    http://xed.ch/b/2017/0513.html
    HTH

  9. I started writing this as a solution to the issues with shortlinks. It should work now, although there are a bunch of things that could be done to make it better. But I don't have the time to set up hosting for it right now.

    https://github.com/kerkeslager/big.ly