Well, someone sent him an old Cray drive pack and enclosure! Fantastic! Only a few problems:
The sound-foam inside had decayed into moving-part-hating dust...
And it was full of spiders...
And also wasps.
No, really. We're utterly, utterly doomed.
I especially enjoy how the URLs in the footnotes are blue and underlined, despite being unclickable.
To put any enticing pull-quotes in here, I'd have to actually re-type them. Forgive me if I don't bother. TL;DR: He couldn't get any of the drive electronics working, and instead built a custom stepper-motor robot to move the read-heads in sub-track increments, then pulled off 8+ analog scans of each track, saved that raw data, and plans to re-digitize it all in software, deciding statistically which streams are the tracks and which are inter-track noise. After that comes the task of trying to turn a set of concentric rings of bits back into a file system.
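For a rough feel of what "deciding statistically" could mean, here is a minimal sketch. It is not Fenton's method; it just assumes that a scan taken over a track center carries more signal energy than one taken between tracks, so the RMS amplitude peaks at the track centers. The function names, the synthetic data, and the median cutoff are all made up for illustration.

```python
import math
import random


def rms(samples):
    """Root-mean-square amplitude of one analog scan."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def pick_track_offsets(scans):
    """Return the head offsets that look like track centers.

    scans: dict mapping head offset (in sub-track steps) -> list of samples.
    An offset is picked when its RMS is an interior local maximum and also
    above the median RMS of all scans (assumed heuristic, not from the paper).
    """
    offsets = sorted(scans)
    energy = [rms(scans[o]) for o in offsets]
    median = sorted(energy)[len(energy) // 2]
    picked = []
    # Skip the first and last offsets: a peak needs a neighbor on each side.
    for i in range(1, len(offsets) - 1):
        e = energy[i]
        if e > median and e >= energy[i - 1] and e > energy[i + 1]:
            picked.append(offsets[i])
    return picked


def synthetic_scans(steps_per_track=8, tracks=3, n=256, seed=0):
    """Fake data: a square wave that fades with distance from each track center."""
    rng = random.Random(seed)
    half = steps_per_track // 2
    scans = {}
    for offset in range(steps_per_track * tracks):
        distance = abs((offset % steps_per_track) - half)
        amplitude = 1.0 - distance / half  # 1.0 on center, 0.0 between tracks
        scans[offset] = [
            amplitude * (1.0 if (j // 4) % 2 == 0 else -1.0) + rng.gauss(0, 0.05)
            for j in range(n)
        ]
    return scans


if __name__ == "__main__":
    print(pick_track_offsets(synthetic_scans()))
```

With 8 steps per track, the synthetic track centers sit at offsets 4, 12 and 20. Real scans would need something smarter than an energy peak, since a badly decayed track might be quieter than the crosstalk next to it.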
yeah. except for all these other formats : http://www.archive.org/details/2011-cdc-disk-archaeology-fenton
Ah, I didn't see that. That "reader" is still execrable and embarrassing though.
Hey, jwz. Big fan and the guy who wrote the Internet Archive weblog entry.
The PDF online streamer allows a quick and easy preview of endless amounts of scanned materials, most of them long predating PDFs - hence it's a godsend for many of the millions of books at the archive that have been scanned, but less so for more demanding PDFs, which is why, as mentioned above, all the other derivative formats and the original format are preserved as well. I was able to use it to link directly to a specific page within the PDF for people to see without making them download a 6MB PDF and start a reader, and I expect it would work on most browsers. But the article also links to the main page as well, because I know dumping people in the middle of the paper wouldn't work for everyone.
Keep on keeping on, and thanks for the mention!
Hey there! I am a big fan of your work!
Still, that reader makes me die inside. When you've got the raw text of the document in N different formats including text-based PDF, a system that results in the most "usable" form -- in any context -- being images-humped-by-JS is just unconscionable. It's not like converting such simple PDFs to HTML is rocket surgery...
HTML: it works pretty good.™
God, how many times have you sat in a meeting while some douchenozzle in another seat goes "Well, you see... it's more complicated than that..." Sorry to be the douchenozzle this time.
The online bookreader is the best solution to a host of problems that Internet Archive is trying to solve, mostly related to accessing millions of books scanned in as images, and often scanned in by many, many thousands. (I documented the scanning of books over on this entry and you may appreciate (or hate!) the example I used to move through their system.) But getting more attention and scrutiny to the world of the Internet Archive is one of the jobs I have since joining up in May, and that seemed the way to do it.
Specifically, the archive takes in digital copies through a whole host of methods, be they software, text, video, or audio, and deep in there is the Deriving System, a PHP nightmare that analyzes the system as Best It Can and then creates versions from that. So if you upload, say, a pile of .JPG files as a .zip, it will turn that pile into a PDF, Kindle, DjVu, etc, as well as that streaming version you hate. If someone uploads, like I did, a .PDF with all the trimmings, it will keep that, but it will produce all the other versions from it. If I upload a .zip of images or other formats, it will try and get all the other formats out of it. Sometimes it fails, sometimes it succeeds.
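The gist of that, as a toy sketch: keep whatever was uploaded, then try to produce every other format from it. This is not the real Deriving System (which is PHP and far hairier). The format names and the `plan_derivatives` function are assumptions for illustration, using only the formats mentioned above.

```python
# Hypothetical target list, limited to the formats named in this comment.
TARGET_FORMATS = ["pdf", "kindle", "djvu", "streaming"]


def plan_derivatives(uploaded_format):
    """Return which file is kept as-is and which formats get derived from it."""
    original = uploaded_format.lower()
    return {
        "keep": original,
        # Never re-derive the format that was uploaded; keep the original.
        "derive": [fmt for fmt in TARGET_FORMATS if fmt != original],
    }


if __name__ == "__main__":
    print(plan_derivatives("PDF"))
    print(plan_derivatives("zip_of_jpg"))
```

Uploading a PDF keeps the PDF and derives the other three. Uploading a zip of JPGs derives all four. The "sometimes it fails" part is left out.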
It's a tiny group and not everything is perfect, but that's the goal.
While I have your momentary attention, by the way, you might enjoy this page: http://www.textfiles.com/underconstruction/netscape/
That Netscape page:
Oh my god.
Scraped from a million Geocities pages I helped archive, MD5-checked so they're all unique (although as you can see, some still look similar). There's one for Under Construction GIFs as well. Anyway, I intentionally made the Netscape page grey just to capture that feeling.
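For the curious, MD5-based dedup is about this much code. This is a minimal sketch and not the actual script used for the page; `dedupe_by_md5` and the sample data are made up. It only catches byte-identical files, which is why near-duplicates still show up.

```python
import hashlib


def dedupe_by_md5(files):
    """Keep the first file seen for each distinct MD5 digest.

    files: iterable of (name, content_bytes) pairs.
    Returns a list of the names that survive, in input order.
    """
    seen = set()
    unique = []
    for name, content in files:
        digest = hashlib.md5(content).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(name)
    return unique


if __name__ == "__main__":
    sample = [
        ("a/netscape.gif", b"GIF89a-now"),
        ("b/ns_now.gif", b"GIF89a-now"),      # same bytes, different name
        ("c/netscape2.gif", b"GIF89a-now!"),  # one byte longer, so it is kept
    ]
    print(dedupe_by_md5(sample))
```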
Why, in the name of all that is unholy, do all those images (and yes, by extension, the real-world signs they are modeled on) have sideways shovels?
Because a shovel in its functional orientation looks like a stick.
I agree that it's less than ideal for born-digital content, and hopefully they'll roll out a magical HTML5 version with integrated text at some point.
Look, I'm sure there are plenty of texts for which that JS thing is the best that you can easily do, but for this case -- which I suspect is a very common case -- where you have all of the (non-OCR) text inside the PDF, and a few images, then converting it to HTML and a stack of JPEGs is the obviously winning approach for online presentation. Web pages are made of HTML for a reason. Other formats were tried! Those lost!
You may have your eyes on the forest, but don't lose sight of some pretty damned important trees.
Whereas I've yet to see any web page with simulated page flipping that didn't make me want to stab the author in the throat.
If I wanted simulated page flipping, I'd be using a browser that does that with its pages. Instead, I choose to use one that has this invention called a scrollbar.
My EzPDF reader on Android has simulated page flipping. And I f&%^ing HATE IT. It's not a book, I know it's not a book, and I see no point at all in wasting my time and my CPU's cycles simulating part of one.
I will save my five minute hate for whoever decided Kindle for Android should have both "look how cute it is to swipe-flip your greasy finger across the 'page' to turn it" support and "okay, you're right, that's a bad idea, tapping the right screen edge for PgDn will work too", then put the invisible rotation lock button in the bottom right corner where your thumb naturally rests (and is the only place on the goddamn screen where your fingerprint won't be left atop text).
Still no bookmarking/resume, either. This is the future?
[Yes, you can use hardware volume buttons for scrolling. This would be great if I didn't suspect my phone's were built to survive about 80% of the number of cycles they're expected to take as volume buttons alone before something else kills the phone.]
If your mobi files aren't DRM-crippled, consider FBReader. It's awesome. That, plus Calibre, plus calibre2opds is the future of your personal library.
I have seen PDF readers with animated page-flipping. They are frequently desired by people who personally have no use for the web and are responsible for their companies' websites. They can be difficult to talk out of. I envy you your naivete.
I have seen them too. I have absolutely no use for animated page-fucking-flipping. It is a total waste of time and like all unnecessary things, a source of brokenness. (It's not like design austerity is a new concept in Western culture!)
ps, JWZ, congratulations on having an OpenID consumer implementation that doesn't suck, even if the "authenticate this" runs off to the right and snarls up with the "connect with facebook" button.
Yeah, it only gets mangled if the thread is deep and it has run out of room on the page. I'll try to sacrifice another CSS chicken.
> instead built a custom stepper-motor robot to move the read-heads in sub-track increments, then pulled off 8+ analog scans of each track, saved that raw data, and plans to re-digitize it all in software,
I know that de gustibus non est disputandum, and all hobbies look stupid from the outside (I spend my weekends using 10th century technology to make bowls out of dead trees, after all), but...
I can see little intrinsic and zero extrinsic benefit from this hobby.
I know, I know, I'm being judgemental. Some folks recover Cray software. Others bang heroin between their toes or - shudder - go to Burning Man.
Who am I to judge, right?
You can look at it from several angles.
* He was doing it to publish a paper (it was a research project)
* There's an actual problem where a lot of Cray software has been lost.
* It was an inspiring bit of tinkering with FPGAs and hackery that might have other uses
* In the future, fewer items will be considered "lost" simply because the hardware associated with reading media is gone.
To take it a step further — right now, from what I understand, while there are a bunch of probably-working Cray computers in museums and whatnot (as well as emulations, like this FPGA implementation), there is no way to get them running again, as there is no software currently available. SGI destroyed most of it back in the 90s, so recovering it from hard disks (like this guy is doing!) is currently our best bet.
"SGI destroyed most of it back in the 90s[...]"
Why the hell would they do that?
The target hardware is no longer under maintenance, the systems you've built to maintain access to the archives are expensive to operate, the people who maintain the storage systems would like to move on to jobs that are actually relevant, and you're frantically trying to cut costs and compete in your marketplace. Keeping useless archives for decades, at six figures per year, would be the height of foolishness.
Yes, I would have kept one or two copies of all the tapes in an archive vault somewhere, but that costs money. It could have been done more cheaply, but that too would have cost money.
For all we know, actually, a copy still exists in a classified bunker somewhere -- a significant fraction of the Cray production went into classified sites, and once data storage media enters a classified location it *never* leaves. (Until it's destroyed, then it's *really* destroyed.)
About ten years ago, I got a tour of what was sometimes called the "DEC Historic Lab" in Nashua. It had, at the time, machines which took disk packs like this that actually worked, 9-track tape machines, pdp-11s, the works. All for outrageously expensive long term support contracts. Compaq hadn't gotten around to dismantling it, but HP or whoever uses that building surely has by now.
HP will apparently still sell you a MIPS-based NonStop (nee Tandem) server. They last sold a new VAX in 2005 or so, though.
There is a physical object, which previously was used (by a long-lost tribe) to store data. There is still (maybe) data stored in it. But we can't read it, because of [insert esoteric technical challenge here].
Sounds like a grand challenge, and a worthy quest, to me.
How is your question different from "why would you climb a mountain when you can just hire a helicopter to fly you to the top?"? Except in this case, the helicopter doesn't exist and climbing the mountain [building a robot to read the magnetic media directly] is the only way to recover that data, because all known readable copies of the data were lost long ago.
For that matter, this seems, to me, vastly more relevant than decoding the original Rosetta Stone.
Many of the techniques, systems, and philosophies people like @textfiles and Fenton are developing are directly applicable, or are fundamental stepping-stones, to our coming age of necessary digital archaeology. Being able to read our digital heritage, and being able to recover from its inevitable partial demise, is becoming more relevant every month as more and more of everything moves into digital storage.
You sure? I think we might need to wait another decade until the quality of the porn recovered meets or exceeds.
Dude, that ascii piece of a guy fucking a horse was totally SWEET.
Heh. Just last night a friend and I were talking about telegraphy as "The Victorian Internet" and I was heard to say "dash dot dot dash dash dash dot dash .... wow, check it! She's got a HUGE rack!".
Nothing cements learning and methods better than practical application. We humans take on challenges because they help us grow in knowledge and experience. In this case, you learn even more about the tools you use, how certain electronics work and (especially in this case) what can fail and why.
Would you ever take someone who has read every recipe ever written but never cooked and toss them in a 5-star restaurant kitchen? Heck, I learned stuff just reading it!
> Nothing cements learning and methods better than practical application.
OK, I withdraw my question / objection.
Yea, for pdfs it's suboptimal. Much better than nothing, but still not entirely optimal. Perhaps pdf.js will save us.
There's a plaintext version here.
Oh, already covered. Sorry.
I worked for J P Morgan Securities in London in the early 90s. We were looking for something to do realtime options pricing across the FTSE, which was decidedly non-trivial at the time. When the nice Cray salesmen quoted us a seven-figure price tag, we showed them out of the building. We ended up going for a Parsys 32-node Transputer (T800) box instead. Big as a fridge, way less grunt than a Y-MP EL, but about two orders of magnitude cheaper.
Apologies, it was an 8-node box.
Dear God... How do you manage to get that much extra time on your hands??
It's not the time that's hard to come by, it's the tenacity that should be impressing you.