Waybackify

I wrote this thing to replace all the links in an HTML document with Wayback Machine links. It looks at the file's date and tries to find contemporaneous archives.
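
As a rough illustration of the idea (this is not the actual script, which is Perl), here is a minimal Python sketch. It assumes the usual Wayback URL shape of `https://web.archive.org/web/<timestamp>/<url>`, where the Archive serves the capture nearest to the timestamp. The function names are hypothetical, the regex-based `href` matching is a simplification rather than real HTML parsing, and using the file's modification time as "the file's date" is an assumption.

```python
import os
import re
from datetime import datetime, timezone

WAYBACK_PREFIX = "https://web.archive.org/web/"

# Simplification: only matches double-quoted absolute http(s) hrefs.
HREF_RE = re.compile(r'(href=")(https?://[^"]+)(")', re.IGNORECASE)


def wayback_url(url, when):
    """Build a Wayback Machine URL asking for the capture nearest `when`."""
    stamp = when.strftime("%Y%m%d%H%M%S")
    return f"{WAYBACK_PREFIX}{stamp}/{url}"


def waybackify_html(html, when):
    """Rewrite every absolute href in `html` to a Wayback link dated `when`."""
    def repl(match):
        url = match.group(2)
        # Leave links alone if they already point into the Wayback Machine.
        if "web.archive.org/web/" in url:
            return match.group(0)
        return match.group(1) + wayback_url(url, when) + match.group(3)
    return HREF_RE.sub(repl, html)


def file_date(path):
    """Assumption: the file's mtime stands in for the document's date."""
    return datetime.fromtimestamp(os.path.getmtime(path), tz=timezone.utc)
```

Calling `waybackify_html(text, file_date(path))` returns the rewritten text rather than touching the file, so the caller decides where to write it.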

I think I'll also do that with links in every blog post that is more than, say, 5 years old. Opinions differ widely on what the half-life of a link is, but it is Not Long because Everything is Terrible.

It would be nice if there were an easy way to detect whether a site is still really there, and only replace it with an archive link if not, but that's basically impossible to automate since so many sites turn into domain-parking spam pages.

By the way, there are many ways to tell the Archive to save a page immediately. The easiest way to script that is to just load "https://web.archive.org/save/...URL...". Whenever I make a post it does that on everything I've linked to, just in case.
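
A minimal Python sketch of scripting that, assuming only the `https://web.archive.org/save/` prefix mentioned above. The function names and the User-Agent string are made up for illustration, and any rate limits or response details on the Archive's side are not covered here.

```python
import urllib.request

SAVE_PREFIX = "https://web.archive.org/save/"


def save_url(url):
    """Return the URL that, when loaded, asks the Archive to save `url`."""
    return SAVE_PREFIX + url


def request_save(url, timeout=30):
    """Load the save URL and return the HTTP status code of the response."""
    req = urllib.request.Request(
        save_url(url),
        headers={"User-Agent": "example-archiver/0.1"},  # hypothetical name
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.status
```

Looping `request_save` over every link in a new post is the "just in case" step; failures raise `urllib.error.URLError`, which a real script would probably want to catch and log.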

The Archive is a critical piece of infrastructure of both the Internet, and increasingly, our society. Give them money.

Previously, previously.

17 Responses:

  1. Grego says:

    1. Thank you for this and thank you for supporting archive.org.

    2. I suggest #!/usr/bin/env perl -w instead of hardcoding the path.

    3. I was surprised to have my original file overwritten by default; that's bad behavior. I suggest writing to stdout by default and adding a flag to allow overwriting in place.

    4. curl isn't a bot.

    $ curl -L https://www.jwz.org/hacks/waybackify.pl | grep -i bots
    <TITLE>403 Bots Forbidden</TITLE>
    no robots.</TD></TR>

    -G

    • jwz says:

      That curl thing is there for a very good reason: someone who doesn't know how to work around that is too ignorant to be allowed to point command-line tools at my web site.

  2. Aristotle says:

    The issue with parked domains and such is something I’ve run up against as well. One thought I’ve had, but have yet to get around to implementing, is to save a sentence or two from the meat of a linked page in some custom attribute of the A tag.

    Then a link checker can look for that in a 200 response, and if it’s there it would know for sure that the link still works. If not, it’s less certain what that means – maybe the excerpt just happens to have been edited, or edited out, or, like, the article got re-/paginated or something. Also, some of those cases mean you might want to flip the link to the Wayback Machine even if it still works. So my guess is that these cases will require manual review.

    But I am also guessing that between having reliable failures (DNS lookup fail, 410, etc.) and reliable successes (200 with matching excerpt), the cases to review should drop to something that can be reasonably eyeballed. (Esp. if the output includes more signals, like was there a redirect, and what is the new URL.)

    • jwz says:

      save a sentence or two from the meat of a linked page

      Well, we already have the entire original page -- in the Wayback Machine. Perhaps it would be possible to use that to compute some sort of distance metric to guess whether the page has changed "too much".

      • Moofie says:

        You could use the Levenshtein distance algorithm to compute the differences between the text of the online page and the Wayback page. However, the running time of the Levenshtein distance algorithm is O(n^2), so you would need to be very careful about how much text to compare. There is even Perl source code for the algorithm online if you don't like CPAN.
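
For reference, a minimal Python sketch of the distance-metric idea from this thread. It uses the standard two-row dynamic-programming form of Levenshtein distance, which is quadratic in time but only linear in memory. The `similarity` helper and its `limit` cutoff are assumptions added for illustration: truncating both texts is one crude way to bound the cost, and where to set a "changed too much" threshold is left open.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (insert, delete, substitute)."""
    # Keep the shorter string as the inner loop so each row stays small.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def similarity(a, b, limit=2000):
    """Score in [0, 1]; 1.0 means identical within the first `limit` chars."""
    a, b = a[:limit], b[:limit]
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest
```

Comparing extracted plain text rather than raw HTML would probably give a more meaningful score, since markup churn would otherwise dominate the distance.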

  3. Aristotle says:

    Maybe by running it through a page content extractor (like Reader View) that spits out plaintext? But it doesn’t make me happy to have to fudge a scoring threshold… the advantage of manually picking an important, fairly unique bit from the meat of the page is that it limits the matching to some bit that a human has declared meaningful to humans. But of course that doesn’t help for links created before you started doing that…
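
The excerpt-based check discussed above could be sketched in Python roughly as follows. Everything here is hypothetical: the `CheckResult` shape, the three verdict names, and the choice to treat only a missing response (for example a DNS failure) and a 410 as reliably dead. Other statuses could be added to `DEAD_STATUSES` if they prove trustworthy.

```python
from dataclasses import dataclass
from typing import Optional

# Assumption: statuses treated as a reliable "gone" signal.
DEAD_STATUSES = {410}


@dataclass
class CheckResult:
    status: Optional[int]  # None if no response at all (e.g. DNS failure)
    body: str              # response text, empty if there was none
    final_url: str         # URL after following any redirects


def _squash(text):
    """Collapse runs of whitespace so line wrapping doesn't break matching."""
    return " ".join(text.split())


def classify(excerpt, result):
    """Return 'dead', 'alive', or 'review' for one checked link."""
    if result.status is None or result.status in DEAD_STATUSES:
        return "dead"
    if result.status == 200 and excerpt and _squash(excerpt) in _squash(result.body):
        return "alive"
    # Everything else (parked domains, edited pages, odd statuses)
    # goes to a human.
    return "review"
```

Printing `final_url` next to each "review" verdict would surface redirects, which is one of the extra signals suggested above.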

  4. Published name says:

    Don't get me wrong, the Internet Archive is an amazing piece of magic and they certainly deserve every dollar (and floppy disk) you throw at them. But pre-emptively changing all links to archive links introduces a single point of failure, similar to URL shorteners. Perhaps this is the kind of thing that a browser plugin should take care of instead?

    • Aristotle says:

      The difference is how trivial it is to recover the original URL from a Wayback Machine link. You can run a batch job to write it back out of all your links, or readers can just do it for themselves. With URL shorteners the whole point is that you can’t tell the original URL from the shortened link.
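
To show how small that batch job could be, here is a minimal Python sketch. It assumes Wayback links have the form `https://web.archive.org/web/<digits>/<original URL>`, optionally with a two-letter modifier such as `id_` after the digits. The function name is hypothetical.

```python
import re

# 1-14 digit timestamp, optional two-letter modifier, then the original URL.
WAYBACK_RE = re.compile(
    r"^https?://web\.archive\.org/web/\d{1,14}(?:[a-z]{2}_)?/(.+)$"
)


def original_url(link):
    """Return the URL embedded in a Wayback link, or `link` unchanged."""
    match = WAYBACK_RE.match(link)
    return match.group(1) if match else link
```

Because non-Wayback links pass through untouched, the function can be mapped over every link in a document without checking each one first.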

    • d_c_5 says:

      There is a browser plugin that archive.org offers, so that end users reading a site which has not already run jwz's script are able to get a similar effect client-side. Here is the Firefox flavor: https://addons.mozilla.org/en-US/firefox/addon/wayback-machine_new

      I would argue that both website-owner archival efforts, and website-reader archival plugins, are good things. Wikipedia does pre-emptive server-side archival, as well ... https://en.wikipedia.org/wiki/Wikipedia:Link_rot#Automatic_archiving

    • jwz says:

      It's hard to get real numbers without a bunch of manual effort, but a spot-check suggests to me that the answer to the question, "How many of my years-old links still work?" is "very few". Certainly less than half. That means that this change makes things less bad. Not necessarily good. Less bad.

      If browsers behaved differently, or if web site maintainers behaved differently, or if literally every person who uses a web browser installed this thing or that thing, none of this would be necessary. But all of that is outside of my purview.

  5. Matt says:

    This could be a cool WP plugin, drop me an email if that’s interesting to you.

  6. Thank you.

    (That is all.)

  7. John Doty says:

    This is fantastic, thank you.

  8. Gokulakrishna says:

    Hi, I created "Wayback Everywhere" addon: auto-redirect all URLs opened from Twitter/feeds to archived versions in Wayback Machine, auto-save to WM if not yet archived. Has updatable "excludes list" of 800 sites to provide a starting point and it can auto-exclude sites.

    https://gitlab.com/gkrishnaks/WaybackEverywhere-Firefox/blob/master/README.md

    My 9 month stats:

    https://pbs.twimg.com/media/Dykno6SU8AAbNmd?format=jpg&name=large

  9. Cat Mara says:

    One thing I just noticed: on the xscreensaver download page, the instructions for Debian say to use the package from the "unstable" branch in preference to the main one, to avoid logging bugs on years-old versions of xscreensaver. But the link to the "unstable" package has been waybackified. Is this not likely to cause the very problem that using the "unstable" package is intended to solve?

  10. Lloyd says:

    How do we convince Rotten Tomatoes to use this for its review links?

    This is important.
