Today in Computational Necromancy

Remember that guy who built a Cray-1A out of FPGAs? One of the problems he had was that, after recreating the hardware, he had access to no software to run on it, including the OS.

Well, someone sent him an old Cray drive pack and enclosure! Fantastic! Only a few problems:

The sound-foam inside had decayed into moving-part-hating dust...

And it was full of spiders...

And also wasps.

His paper is here: "Digital Archaeology with Drive-Independent Data Recovery: Now With More Drive Dependence!" I hope the irony is not lost on you that an almost-entirely-textual paper about an insanely difficult data-recovery problem is presented -- by archive.org of all people -- as a slideshow of images of scanned pages of non-OCR'ed text, in a horrid-to-use custom Javascript "reader".

No, really. We're utterly, utterly doomed.

I especially enjoy how the URLs in the footnotes are blue and underlined, despite being unclickable.

To put any enticing pull-quotes in here, I'd have to actually re-type them. Forgive me if I don't bother. TL/DR: He couldn't get any of the drive electronics working, and instead built a custom stepper-motor robot to move the read-heads in sub-track increments, then pulled off 8+ analog scans of each track, saved that raw data, and plans to re-digitize it all in software, deciding which streams are the tracks and which are inter-track noise statistically. After that comes the task of trying to turn a set of concentric rings of bits back into a file system.
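
For the curious, here is a rough sketch of what "deciding which streams are the tracks" could look like. This is a guess at the general idea, not the method from the paper: it assumes each scan is just a list of analog samples, that scans are ordered by radial head position, and that a scan taken over a track center is stronger than its neighbors. The names `rms` and `pick_track_scans` are made up for illustration.

```python
import math


def rms(samples):
    """Root-mean-square amplitude of one analog scan."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def pick_track_scans(scans):
    """Return indices of scans that look like they sit on a track.

    `scans` is a list of sample lists, ordered by radial head position.
    Assumption: a scan counts as on-track when its RMS amplitude is above
    the mean of all scans and is a local maximum among its neighbors.
    Everything else is treated as inter-track noise.
    """
    strength = [rms(s) for s in scans]
    mean = sum(strength) / len(strength)
    picks = []
    # The first and last scans have only one neighbor, so they are skipped.
    for i in range(1, len(strength) - 1):
        is_peak = strength[i] > strength[i - 1] and strength[i] >= strength[i + 1]
        if is_peak and strength[i] > mean:
            picks.append(i)
    return picks
```

With 8 scans per track you would expect roughly one pick for every 8 indices. Real data would presumably need something less naive, since a weak track and strong crosstalk can look alike by amplitude alone.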

39 Responses:

  1. Chris Morgan says:

    yeah. except for all these other formats : http://www.archive.org/details/2011-cdc-disk-archaeology-fenton

  2. Jason Scott says:

    Hey, jwz. Big fan and the guy who wrote the Internet Archive weblog entry.

    The PDF online streamer allows a quick and easy preview of endless amounts of scanned materials, most of them long predating PDFs - hence it's a godsend for many of the millions of books at the archive that have been scanned, but less so for more demanding PDFs, which is why, as mentioned above, all the other derivative formats and the original format are preserved as well. I was able to use it to link directly to a specific page within the PDF for people to see without making them download a 6mb PDF and start a reader, and expect it would work on most browsers. But the article also links to the main page as well, because I know dumping people in the middle of the paper wouldn't work for everyone.

    Keep on keeping on, and thanks for the mention!

    • jwz says:

      Hey there! I am a big fan of your work!

      Still, that reader makes me die inside. When you've got the raw text of the document in N different formats including text-based PDF, a system that results in the most "usable" form -- in any context -- being images-humped-by-JS is just unconscionable. It's not like converting such simple PDFs to HTML is rocket surgery...

      HTML: it works pretty good.™

      • Jason Scott says:

        God, how many times have you sat in a meeting while some douchenozzle in another seat goes "Well, you see... it's more complicated than that..." Sorry to be the douchenozzle this time.

        The online bookreader is the best solution to a host of problems that Internet Archive is trying to solve, mostly related to accessing millions of books scanned in as images, and often scanned in by many, many thousands. (I documented the scanning of books over on this entry and you may appreciate (or hate!) the example I used to move through their system. But getting more attention and scrutiny to the world of the Internet Archive is one of the jobs I've had since joining up in May, and that seemed the way to do it.)

        Specifically, the archive takes in digital copies through a whole host of methods, be they software, text, video, audio, and deep in there is the Deriving System, a php nightmare that analyzes the system as Best It Can and then creates versions from that. So if you upload, say, a pile of .JPG files as a .zip, it will turn that pile into a PDF, Kindle, Djvu, etc, as well as that streaming version you hate. If someone uploads, like I did, a .PDF with all the trimmings, it will keep that, but it will produce all the other versions from it. If I upload a .zip of images or other formats, it will try and get all the other formats out of it. Sometimes it fails, sometimes it succeeds.

        The online reader is a number of years old now, and things like the newer javascript PDF reader someone made are starting to kick its ass in terms of functionality, but the advantage of this particular one is that it works nearly everywhere, and the core function of the archive, besides the preservation of the original data (which it ALWAYS is), is to make it so that people can benefit from the mass of free books as fast as possible on as many platforms as possible.

        It's a tiny group and not everything is perfect, but that's the goal.

        While I have your momentary attention, by the way, you might enjoy this page: http://www.textfiles.com/underconstruction/netscape/

        • jwz says:

          That Netscape page:

          Oh my god.

          • Jason Scott says:

            Scraped from a million Geocities pages I helped archive, MD5-checked so they're all unique (although as you can see, some still look similar). There's one for Under Construction GIFs as well. Anyway, I intentionally made the Netscape page grey just to capture that feeling.

            • Rick C says:

              Why, in the name of all that is unholy, do all those images (and yes, by extension, the real-world signs they are modeled on) have sideways shovels?

      • Justin Kerk says:

        For most of the archive.org texts which are scanned images of books, the JS reader really is the most usable form. In order to have reasonable file sizes, the auto-generated PDFs use quite aggressive compression which works OK for text but falls down for illustrations, and don't even get me started on the quality of the OCR that populates the text/EPUB/MOBI versions. So it's either download several hundred MB of JPEGs when I will probably lose interest after 10 pages anyway, or fire up the Javascript, which is honestly pretty darn slick. (I have yet to see a PDF reader with simulated page-flipping....)

        I agree that it's less than ideal for born-digital content, and hopefully they'll roll out a magical HTML5 version with integrated text at some point.

        • jwz says:

          Look, I'm sure there are plenty of texts for which that JS thing is the best that you can easily do, but for this case -- which I suspect is a very common case -- where you have all of the (non-OCR) text inside the PDF, and a few images, then converting it to HTML and a stack of JPEGs is the obviously winning approach for online presentation. Web pages are made of HTML for a reason. Other formats were tried! Those lost!

          You may have your eyes on the forest, but don't lose sight of some pretty damned important trees.

          > (I have yet to see a PDF reader with simulated page-flipping...)

          Whereas I've yet to see any web page with simulated page flipping that didn't make me want to stab the author in the throat.

          If I wanted simulated page flipping, I'd be using a browser that does that with its pages. Instead, I choose to use one that has this invention called a scrollbar.

          • Tyler Wagner says:

            My EzPDF reader on Android has simulated page flipping. And I f&%^ing HATE IT. It's not a book, I know it's not a book, and I see no point at all in wasting my time and my CPU's cycles simulating part of one.

            • gryazi says:

              I will save my five minute hate for whoever decided Kindle for Android should have both "look how cute it is to swipe-flip your greasy finger across the 'page' to turn it" support and "okay, you're right, that's a bad idea, tapping the right screen edge for PgDn will work too", then put the invisible rotation lock button in the bottom right corner where your thumb naturally rests (and is the only place on the goddamn screen where your fingerprint won't be left atop text).

              Still no bookmarking/resume, either. This is the future?

              [Yes, you can use hardware volume buttons for scrolling. This would be great if I didn't suspect my phone's were built to survive about 80% of the number of cycles they're expected to take as volume buttons alone before something else kills the phone.]

              • Tyler Wagner says:

                If your mobi files aren't DRM-crippled, consider FBReader. It's awesome. That, plus Calibre, plus calibre2opds is the future of your personal library.

        • Art Delano says:

          I have seen PDF readers with animated page-flipping. They are frequently desired by people who personally have no use for the web and are responsible for their companies' websites. They can be difficult to talk out of. I envy you your naivete.

          • Alex says:

            I have seen them too. I have absolutely no use for animated page-fucking-flipping. It is a total waste of time and like all unnecessary things, a source of brokenness. (It's not like design austerity is a new concept in Western culture!)

            • Alex says:

              ps, JWZ, congratulations on having an OpenID consumer implementation that doesn't suck, even if the "authenticate this" runs off to the right and snarls up with the "connect with facebook" button.

              • jwz says:

                Yeah, it only gets mangled if the thread is deep and it has run out of room on the page. I'll try to sacrifice another CSS chicken.

  3. TJIC says:

    > instead built a custom stepper-motor robot to move the read-heads in sub-track increments, then pulled off 8+ analog scans of each track, saved that raw data, and plans to re-digitize it all in software,

    I know that de gustibus non est disputandum, and all hobbies look stupid from the outside (I spend my weekends using 10th century technology to make bowls out of dead trees, after all), but...

    WHY?

    I can see little intrinsic and zero extrinsic benefit from this hobby.

    I know, I know, I'm being judgemental. Some folks recover Cray software. Others bang heroin between their toes or - shudder - go to Burning Man.

    Who am I to judge, right?

    Still.

    • Jason Scott says:

      You can look at it from several angles.

      * He was doing it to publish a paper (it was a research project).
      * There's an actual problem where a lot of Cray software has been lost.
      * It was inspiring tinkering with FPGAs and hackery that might have other uses.
      * In the future, fewer items will be considered "lost" simply because the hardware associated with reading media is gone.

    • Dusk says:

      To take it a step further — right now, from what I understand, while there are a bunch of probably-working Cray computers in museums and whatnot (as well as emulations, like this FPGA implementation), there is no way to get them running again, as there is no software currently available. SGI destroyed most of it back in the 90s, so recovering it from hard disks (like this guy is doing!) is currently our best bet.

      • Samuel Erikson says:

        "SGI destroyed most of it back in the 90s[...]"

        Why the hell would they do that?

        • Andy says:

          The target hardware is no longer under maintenance, the systems you've built to maintain access to the archives are expensive to operate, the people who maintain the storage systems would like to move on to jobs that are actually relevant, and you're frantically trying to cut costs and compete in your marketplace. Keeping useless archives for decades, at six figures per year, would be the height of foolishness.

          Yes, I would have kept one or two copies of all the tapes in an archive vault somewhere, but that costs money. It could have been done more cheaply, but that too would have cost money.

          For all we know, actually, a copy still exists in a classified bunker somewhere -- a significant fraction of the Cray production went into classified sites, and once data storage media enters a classified location it *never* leaves. (Until it's destroyed, then it's *really* destroyed.)

          • candice says:

            About ten years ago, I got a tour of what was sometimes called the "DEC Historic Lab" in Nashua. It had, at the time, machines which took disk packs like this that actually worked, 9-track tape machines, pdp-11s, the works. All for outrageously expensive long term support contracts. Compaq hadn't gotten around to dismantling it, but HP or whoever uses that building surely has by now.

    • Andy says:

      > WHY?

      There is a physical object, which previously was used (by a long-lost tribe) to store data. There is still (maybe) data stored in it. But we can't read it, because of [insert esoteric technical challenge here].

      Sounds like a grand challenge, and a worthy quest, to me.

      How is your question different from "why would you climb a mountain when you can just hire a helicopter to fly you to the top?"? Except in this case, the helicopter doesn't exist and climbing the mountain [building a robot to read the magnetic media directly] is the only way to recover that data, because all known readable copies of the data were lost long ago.

      For that matter, this seems, to me, vastly more relevant than decoding the original Rosetta Stone.

      Many of the techniques, systems, and philosophies people like @textfiles and Fenton are developing are directly applicable, or are fundamental stepping-stones, to our coming age of necessary digital archaeology. Being able to read our digital heritage, and being able to recover from its inevitable partial demise, is becoming more relevant every month as more and more of everything moves into digital storage.

      • gryazi says:

        > For that matter, this seems, to me, vastly more relevant than decoding the original Rosetta Stone.

        You sure? I think we might need to wait another decade until the quality of the porn recovered meets or exceeds.

        • Elusis says:

          Dude, that ascii piece of a guy fucking a horse was totally SWEET.

          • TJIC says:

            Heh. Just last night a friend and I were talking about telegraphy as "The Victorian Internet" and I was heard to say "dash dot dot dash dash dash dot dash .... wow, check it! She's got a HUGE rack!".

    • Tom says:

      Why?

      Nothing cements learning and methods better than practical application. We humans take on challenges because they help us grow in knowledge and experience. In this case, you learn even more about the tools you use, how certain electronics work and (especially in this case) what can fail and why.

      Would you ever take someone who has read every recipe ever written but never cooked and toss them in a 5-star restaurant kitchen? Heck, I learned stuff just reading it!

      • TJIC says:

        > Nothing cements learning and methods better than practical application.

        Point taken.

        OK, I withdraw my question / objection.

  4. db48x says:

    Yea, for pdfs it's suboptimal. Much better than nothing, but still not entirely optimal. Perhaps pdf.js will save us.

  5. Daen de Leon says:

    There's a plaintext version here.

  6. Daen de Leon says:

    I worked for J P Morgan Securities in London in the early 90s. We were looking for something to do realtime options pricing across the FTSE, which was decidedly non-trivial at the time. When the nice Cray salesmen quoted us a seven-figure price tag, we showed them out of the building. We ended up going for a Parsys 32-node Transputer (T800) box instead. Big as a fridge, way less grunt than a Y-MP EL, but about two orders of magnitude cheaper.

  7. Lee says:

    Dear God... How do you manage to get that much extra time on your hands??