hard like a criminal

So here's my life for the last five days:

<LJ-CUT TEXT=" three fisted tales of how my computer hates me ">

I say to myself, ``Self, you've been really slack about doing backups on the machine that has all your MP3s lately.'' I retorted that doing backups sucks, because I do it to 12G DATs, and each one takes about 3.5 hours to write, and the same to verify, so when I get it in my head to ``do backups'', that's an every-night-for-a-week job. Then I said to myself, ``dude, here's a nickel, go get another disk drive and back them up to that instead!''

So I get this 120G drive, and I start copying my existing MP3s to it. God damn this machine is slow. I mean really slow. Wait, it shouldn't be that slow. This is crazy. Writing DATs is faster than this.

Shit, no wonder performance went to hell at around the time I upgraded to RH7.3: apparently, around that time, one of my three disks (the one my system is on, along with about 1/4th of my MP3s) decided to start running two orders of magnitude slower than it used to:

    /dev/hda: 64 MB in 526.75 seconds = 124.42 kB/sec
    /dev/hdb: 64 MB in 2.91 seconds = 21.99 MB/sec
    /dev/hdc: 64 MB in 3.29 seconds = 19.45 MB/sec
    /dev/hdd: 64 MB in 1.70 seconds = 37.65 MB/sec

Wow. Nice. So I tried a bunch of things -- checked jumper settings, bought new IDE cables, etc -- no luck. The next thing to try was to move it to another IDE controller and see if it still loses. Oh, but it's my system disk. So first I have to copy my system from this disk to a new one. I consolidate space so that I can overwrite one of my disks, and clone my system disk to it. Which takes forever, because note, I'm copying files off of my system disk at a whopping 125KB/s.
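For a sense of scale, here is a quick sanity check of the timing numbers quoted above. This is only a sketch: the function names are made up for illustration, and it assumes the tool counts 1 MB as 1024 kB, which is what the hda line implies.

```python
def throughput_kb_per_sec(megabytes, seconds):
    """Throughput in kB/sec, assuming 1 MB = 1024 kB (an assumption)."""
    return megabytes * 1024 / seconds


def slowdown(slow_seconds, fast_seconds):
    """How many times longer the slow disk took to read the same amount."""
    return slow_seconds / fast_seconds


# Timings for reading 64 MB, copied from the output above.
hda = throughput_kb_per_sec(64, 526.75)   # the sick disk
ratio = slowdown(526.75, 2.91)            # hda compared with hdb

print(f"hda: {hda:.2f} kB/sec, roughly {ratio:.0f}x slower than hdb")
```

That ratio is where "two orders of magnitude" comes from.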

So I boot the new system disk, and yup, machine's a lot faster, and yup, the slow disk is still slow. I also attached the slow disk to a different computer, and it was also slow there. Great. So that disk is essentially dead, and now I'm down to only having room for one copy of my MP3s instead of two. So I need to buy another 120G disk. Except by now it's Saturday night, and I can't get one until Monday. Oh well, I'll spend the weekend copying the rest of the MP3s off of the slow disk. This is a lamentably manual process, because I don't trust the machine to, well, function, so I'm babysitting it a lot.

Somewhere in here I have a genuine premonition, and say to myself, ``Self, you ought to make checksums of all the files on the slow disk, and compare them to the copies. Just in case.'' This makes everything take twice as long (since I'm reading each file twice.)
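A minimal sketch of that checksum-and-compare step, assuming Python and SHA-256. The post doesn't say which checksum tool was actually used, and `file_digest`, `tree_digests` and `compare_trees` are hypothetical names for illustration.

```python
import hashlib
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of one file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def tree_digests(root):
    """Map each file's path (relative to root) to its digest."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): file_digest(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def compare_trees(src_root, dst_root):
    """Return (mismatched, missing): files whose copy differs, and files
    present under src_root that have no copy under dst_root."""
    src = tree_digests(src_root)
    dst = tree_digests(dst_root)
    mismatched = sorted(p for p in src if p in dst and src[p] != dst[p])
    missing = sorted(p for p in src if p not in dst)
    return mismatched, missing
```

Called as `compare_trees("/mnt/slow", "/mnt/new")` (both paths hypothetical), it returns the files that need re-copying. Every file gets read once on each side, which matches the "takes twice as long" cost mentioned above.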

So on Monday, it's almost done copying and summing, and I go get another 120G disk. To add insult to injury, the price of 120G disks has gone up by $30 over the weekend.

So now I've got a machine with three disks, a small-ish one for the system, a big one for MP3s, and a big one for a copy of the MP3s, under the assumption that both disks probably won't fail at the same time.

I take advantage of my premonition, and check the checksums of the files on the new disk. Gasp! Some of them (a few dozen files, out of the many thousands) don't match! How did that happen? Well, the "slow disk" is obviously failing, so maybe this is just another symptom. I re-copy those files over, and they match this time.

I feel like I'm just about done. Ho ho ho!

Because I was using three disks before and now I'm using one, the partition sizes aren't the same in the new world, so there are some partitions that aren't all the way full. So I start moving directories around to pack things in. It's going well, and I'm basking in the glorious speed of the new disks, compared to the broken one.

Then the machine crashes.

And when I boot up, all those directories I was moving around? They're gone.

All of them.

The ext3 file system decided it was going to roll back the journal by at least fifteen minutes -- FIFTEEN MINUTES -- on the destination partitions. The source partitions, it left alone. So it went ahead and let the deletions happen, but un-did all of the file creations.

I spend some time tracking down what went missing, and it's like 70+ albums.

I check the contents of the decommissioned disks -- nope, none of the files are there. It turns out that all of these files happened to originally live on the disk I reformatted to be my new system disk.

I found about 20 of them on my old DAT backups from two years ago, and was able to restore them. But the rest were all things I'd gotten more recently than that. So now I have to re-rip 50+ CDs. And I can't even find them all: since my Damned Shelves have been full for years, new acquisitions have been sitting scattered in piles on the floor, and apparently some of those piles have gone where the socks in the dryer go. Or something.

Oh but wait, there's more!

Remember that premonition about the checksums? Well guess what. When I copy files from the new "main" MP3 disk to the new "backup" MP3 disk, I find that some of the files don't match. This can't be blamed on the "slow" disk, because it's not even attached to the system at this point. What the hell? I pull both versions of one of the files into Emacs and compare them. They're the same length, but starting a few MB into the file, there are a few bytes that have been changed in non-obvious ways. Oddly, mp3_check reports no MP3 errors in either file, which I guess just means the bytes didn't happen to be diddled in an MPEG header.
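The same comparison can be done without eyeballing it in an editor. This is a rough sketch, assuming Python; `differing_bytes` is a hypothetical helper, not anything from the post. It lists every offset where two same-length files disagree, along with both byte values.

```python
import os


def differing_bytes(path_a, path_b, chunk_size=1 << 20):
    """Return a list of (offset, byte_in_a, byte_in_b) for every position
    where two equal-length files differ. Raises ValueError otherwise."""
    if os.path.getsize(path_a) != os.path.getsize(path_b):
        raise ValueError("files differ in length")
    diffs = []
    base = 0  # offset of the current chunk within the file
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            a = fa.read(chunk_size)
            b = fb.read(chunk_size)
            if not a:
                break
            if a != b:  # only scan byte-by-byte when the chunk differs
                for i, (x, y) in enumerate(zip(a, b)):
                    if x != y:
                        diffs.append((base + i, x, y))
            base += len(a)
    return diffs
```

Printing `x ^ y` for each entry shows which bits changed, which can help tell a single flipped bit from a whole run of replaced bytes.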

So I re-copy them again, and again that works.

Since then, I've seen this same kind of file corruption when moving files from partition to partition within the same disk, on both of the new disks. Let's recap:

  • brand new disks, two different vendors (IBM, Western Digital)
  • brand new IDE cables
  • latest stable kernel, 2.4.19
  • recently ran "memtest86 3.0" to verify my RAM
  • failures seen between two partitions both on the IBM disk
  • failures seen between two partitions both on the WD disk

So that sounds like either: ext3 is a way flakier file system than it can believably be, given how widely deployed it is; or, my mobo's IDE controller has lost its mind; or, there's some mysterious white-hole source of cosmic rays under my desk, flipping bits willy-nilly.

Someone said they had seen this kind of thing when slow disks were being used on a fast bus, but I'm pretty sure these disks are way faster than my bus. And my bus may even be running slower than it should be. I don't remember what mobo is in this machine, and I don't want to take it apart again to look, but syslog says

    ide: Assuming 33MHz system bus speed for PIO modes; override with idebus=xx

So the disks are probably running at like, 1/4th the speed they're capable of? That sounds like it ought to be pretty fucking safe.

So, after all this, I've been disabused of the fantasy that I don't need to back up to DAT any more, and my machine is now sitting here making the harmonizing rackets of ripping CDs and writing DATs at the same time. I'm going to be doing this for at least another week.

Wheee.

I really fucking hate computers. I just want an appliance that works.


12 Responses:

  1. q says:

    Several of the IDE RAID onboard controllers (like the Promise PDC202) have serious corruption problems.... My brother just found this the hard way.

    I hate PC hardware.

  2. icis_machine says:

    not like this is helpful or anything, but i did just do the research on tape backups...

    for about $3200, you can get an autoloader from seagate in a bundle that will store 20GB uncompressed but 240GB max... and since it holds 6 tapes, you can have the entire system automated for the week. tapes are around $25.

    this may also compensate for penis size.

    eewwww i can hear my cat pee

  3. osi says:

    .. of shelves that is :)

    i made a similar construct out of pipe last year.. the threading part was a pita to deal with..

    http://fotap.org/gallery/table

  4. ivorjawa says:

    Stay away from Western Digital drives. They started going downhill a few years ago, and never recovered. Similarly, IBM used to have the best disks available for love or money, but they started turning out crap about two years ago.

    The reliable vendors are now Seagate and, surprisingly, Maxtor. Maxtor used to build the crappiest drives imaginable, but for the last couple of years they've been the most reliable cheap drives you can get.

    Get a 3ware Escalade controller. It's a true hardware RAID controller, has onboard processor and RAM, and it looks to the system like a SCSI controller. The one I have is a 32-bit dual-channel card, which they unfortunately no longer build. If you have a 64-bit PCI slot, you really want to take a look at this:
    3Ware Escalade 7410

    • jwz says:

      I have this theory. My theory is as follows. If you walk up to a group of five nerds and say "I bought [hardware] by [company]", one of them will always say, "oh, never buy from them, I had two fail in one year!" and one will say "you're crazy, they're the best, I've been using the same one for ten years."

      In other words, I believe all PC hardware to be identically crappy, and all nerd opinions about its relative stability to be pretty much worthless. Because in any group of nerds, you're going to have statistical good and bad experiences about which those nerds feel the need to go on at authoritative length.

      • evan says:

        To support that theory: I recently cataloged the six drives I have (four IBM, two Maxtor) and noticed that of the six, the only two that were making failing noises were the Maxtors. I used to buy only IBM, but then they had that mass recall.

        I'd say the best thing to do would be some sort of mirror across two different brands (preferably in a different room, 'cause that stuff is *noisy*), but I'm not one to talk because all of my music is on one disk that could fail at any time.

        Because it's sorta on-topic, I'll point out that I have been tracking pricewatch for disk prices: as of today, you can see cost versus drive size and dollars per gigabyte versus drive sizes. For a while you could get 80gb drives for less than a dollar per gig but now we continue to float just above that.

  5. rasp_utin says:

    If you have even an inkling of an idea of what music is missing, feel free to post about it, comment, etc.

    I may be able to help.

  6. bdu says:

    I believe you can force the journaling fs to take a snapshot to avoid ending up in a timewarp like this, you should look into that... There's definitely some funkiness here, as I've done some major file copying in ext3 and had no such problems.

  7. ch says:

    If it's mission critical, use SCSI.

  8. octal says:

    I'm pretty happy using 3Ware Escalades and IBM drives; I've got about 500 IBM drives, and have had corruption on 2 desktop drives and 1 32gb laptop drive in the past 4 years.

    Cooling is key to drive happiness.

    I actually don't bother backing up mp3s, I just keep them on a couple systems with big disk space. I'm probably getting a DVD-R changer once they become standardized -- I'm more interested in exchanging data in volume than having internal backups, and tape drives universally suck for interchangeability.

    You could always get a Mac with cdrw/dvd-r and run OSX if you want an "appliance" and the ability to drive DVDs. I'm waiting for a laptop with dvd-r, or a g5, then I'll switch to mac.

  9. Personally, I've fallen in love with this sweet baby.

    There are smaller ones, but you can't beat price/performance. I was able to do .8 TB of RAID5 storage for under 5k.

    Of course, you still have the problem of a crappy filesystem. :)

  10. ronbar says:

    For burn-in tests on a machine, you want cerberus/CTCS at http://sourceforge.net/projects/va-ctcs/. It works much much better than memtest86 at finding hardware problems, in my experience.

    A great resource on HDs is http://www.storagereview.com/. A former co-worker of mine and a friend of his run it; they really know their shit.