Webcast Computer Hell Update

To recap: I have this Mac Mini with two external drives attached to it. It was working basically fine for years, but had the problem that when it would reboot, sometimes the drives wouldn't mount without power-cycling them. This became a bigger deal when the machine started crashing regularly (locking up, black screen, no logs, no auto-reboot).

Attempts to fix this so far have included:

  1. Replace and increase the machine's RAM.
  2. Put the USB2 drives on a powered USB hub.
  3. Replace both drive enclosures.
  4. Switch from USB2 to Firewire 400 through a powered hub.
  5. Replace both drive enclosures again.
  6. Upgrade from MacOS 10.5.8 to 10.6.7.
  7. Replace the entire Mini itself (macmini1,1 → macmini2,1).

None of these have fixed the problem (and #6 has caused another), and at this point I have replaced every piece of hardware except the power cable and the drives themselves (which fsck fine).

I guess it's time to buy a pair of new drives, unless someone has another idea. (I have put this off until last because copying 1TB of data takes almost two days, each.)

Incidentally, I think I understand the source of the crash/hang: the other day I saw one of the drives get into a situation where "ls" on that drive would hang forever in the kernel and be unkillable with -9, so that's an express train to a full process table and an inability to reboot right there. I haven't seen that kind of shit since the bad old NFS days. Apparently my Mac has nostalgia for SunOS 4.1.3.
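
For the record, the wedged state looks something like this (volume name and pid invented for illustration):

    $ ls /Volumes/Backup1 &                 # hangs forever
    $ ps axo pid,stat,command | grep Backup1
     4242  U    ls /Volumes/Backup1         # "U" = uninterruptible kernel wait
    $ kill -9 4242                          # no effect; the signal can't be delivered
                                            # until the process leaves the kernel, which is never

Every process that subsequently touches that mount point piles up in the same state, and there goes your process table.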


21 Responses:

  1. Andrew Wilcox says:

    This is actually quite common in MacOS 10.6. It's been happening for me since the developer preview days, even with internal drives. The best I can find is that somewhere in the kernel the IOATAFamily extension gets hung up waiting for the drive to be ready (DRQ), and all processes that attempt to access the disk get stuck in kernel-mode uninterruptible I/O.

    Or, in plain terms: Yes, MacOS 10.6 has nostalgia for SunOS 4. I never encountered this issue on 10.5, though I didn't run it for very long (I was on 10.4 pretty much until 10.6 was out).

  2. James C. says:

    Strangely, I just had the exact same hang occur while playing with System Preferences' preview of various XScreenSaver modules. I checked with my running fs_usage, which showed that it was busy in the kernel, and indeed it was unkillable. shutdown -r failed since the drive was now inaccessible. This seems to be entirely a software problem for me, since my drive always behaves normally otherwise. Presumably you've got *both* a hardware problem and this obnoxious SunOS 4 style kernel bug.
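
    For the curious, the spot-check was roughly this (pid hypothetical; fs_usage wants root):

        $ sudo fs_usage -w -f filesys 4242

    fs_usage logs each filesystem call with its elapsed time as it completes, so a trace that goes silent mid-operation means the process is stuck inside a call in the kernel.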

    You should try replacing the power cable, just for “fun”. Do you have any chickens around?

  3. Zygo says:

    My experience with consumer hard drives is that they're the second thing to suspect for random hang problems, after cooling & power issues and before bad RAM. Of the drives I've been involved with that have had to be replaced, most fail by hanging without reporting failure on any kind of diagnostic test--other than getting a valid command from the host and never bothering to complete a reply to it. This failure mode is much more common than medium errors and total drive failure (which isn't surprising, since the other failure modes tend to result in warranty replacement that costs the drive vendors money, while random hangs can be blamed on the user's software choices for free).

    In some cases the failure modes are more specific, like some Seagate 1.5TB drives I had that were fine unless you happened to try to read the sector at LBA 0xFFFFFFFFF and the one after it in the same read request. You could run badblocks and fsck and SMART tests all day and never see that problem, but try to read a file that happens to straddle those sectors and the drive locks up. If that file is, say, some temporary file in your browser cache, you can go insane trying to debug this as the problem comes and goes like sunshine on a partly cloudy day.

    Drive-level hangs (assuming the OS doesn't have a driver bug) are a firmware problem, often with symptoms triggered by drive age, and the fix is to get different firmware into the drive or get a drive with different firmware. I generally end up moving the offending drives to machines that care less about uptime, and use some other brand or size of drive to replace them. Repeat until the hanging occurs at a tolerable frequency (say once every two or three years, or less than once per intentional reboot).

    Needless to say, if the thing that has to deal with your drives isn't a high-quality OS, but a cheap USB or Firewire bridge... well, it'll get confused. Badly.

    I build every system from big media servers with a dozen disks down to PDAs running off SD/MMC cards with LVM and RAID layers so that switching out storage media is a simple matter of plugging a new drive in, waiting a few hours for it to sync up, and disposing of the old drive. Of course you want to have done this long before your older drives start hanging every few hours.
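
    The actual swap is only a few commands (a sketch with Linux md; the device names are whatever your hardware hands you):

        mdadm /dev/md0 --add /dev/sdc1       # new drive joins the array as a hot spare
        mdadm /dev/md0 --fail /dev/sdb1      # retire the suspect drive; the spare starts rebuilding
        cat /proc/mdstat                     # watch the resync grind along for a few hours
        mdadm /dev/md0 --remove /dev/sdb1    # then pull the old drive for disposal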

    • jwz says:

      I would have assumed that every new model of drive, even from the same manufacturer, has a new and exciting set of bugs. E.g., that replacing a 2-year-old 1TB drive with a 2TB drive from the same manufacturer spins the bug-wheel again just as well as switching to a new manufacturer does?

      On any given day, Central tends to have exactly one model in stock that is cheap, so getting picky about brand names is something I try to avoid needing to do. So I'd like the answer to this to be yes...

      Does a Firewire hub actually have smarts in it that I need to be picky about? I assumed those were basically a transformer and wire. (This Mini only has one FW port and I have 3 FW devices.)

      • Andrew Stern says:

        Avoid Seagate like the demon plague; we've seen a roughly 75% failure rate within 1 year. The WDs sourced from Central seem to be very reliable in our operation; many machines have seen 5+ years of use with no significant disk issues.

        I've found the WD RE3 and RE4 "enterprise" models to be reliable.

        And with Seagate purchasing the HD business from Samsung today, there really are only two players now, Seagate and WD (who swallowed Hitachi, who had swallowed IBM's drive business, etc. etc.)

        • Zygo says:

          I have no data that supports the idea that any drive vendor is better than any other with a strong enough correlation for prediction--if anything, it supports the idea of mirroring drives in pairs from different vendors. Usually when I see multiple failures, they all happen to similar drives at about the same time, so the last thing you want is to have a storage monoculture.

          • Tyler Wagner says:

            Agreed. Over the years I've seen every manner of "Avoid drive X, but drive Y is fine" posts, for every permutation of drive vendors and models. Specific models or production runs may be suspect, but I don't see any reason to suspect a manufacturer's entire line.

            Buy what's cheap, keep backups, and use RAID if it really matters.

          • phuzz says:

            Ditto. I used to work at a medium-small company building PCs; they used pretty much every manufacturer, and out of the several thousand disks that passed through my hands there was a pretty even spread of failures.
            The only clear lesson is that paying extra for 'Enterprise Class' drives will get you a better warranty.
            (A few years back there was a Google paper on the hard-drive failure rates they were seeing; IIRC they didn't find any manufacturer-specific issues either.)

        • jwz says:

          Hooray, I got Seagate! That was the only $80 2TB model in stock. The others were twice as much and I didn't buy that they'd be twice as likely to work.

        • Shitmittens. I'd been using Samsung's (quiet, power-efficient, cheap, reliable) drives to escape from Seagate hell for a few years now.

          • Joe Thompson says:

            Ditto. I was just thinking of buying a Spinpoint 2 TB to upgrade from the 1 TB I got a year or so ago -- might have to do that sooner rather than later... On the plus side, as previously noted the WD RE drives seem not to suck noticeably worse than anything else on the market, so it's not a complete landscape of fail quite yet.

      • Zygo says:

        The answer is "yes, mostly." The odds are in your favor, but it's still possible to get another copy of the same drive firmware even two years later. Supply chain warehouses are huge, and drive vendors try to spread their investment in firmware and embedded controllers over a few years and over several sizes of drive. Ditto for the USB- and FW-to-SATA bridges inside drive enclosures--you can get some frighteningly old chips in those, especially the lower-volume Firewire ones.

        I've never had a signal quality problem that I could trace to a hub, either FW or USB. I've seen a lot of awful USB host ports and cables, though. I've worn out all three USB ports on my old laptop with "only a few" thousand insert/remove cycles each.

    • dalvenjah says:

      Mostly this; it sounds like you have one or more spots on the drive that are marginal, the drive is taking several passes to try and read data, and at some point OS X gives up before the drive does.

      I forget where I first read this, but it seems to have been borne out by experience: "Consumer" drives (workstation/SATA - i.e. what's cheap) will try very very hard to return data when reading, to the point of rereading the same spot and taking several seconds (sometimes tens of seconds) to time out -- they assume that's the only copy of your data.

      "Enterprise" drives on the other hand, (SCSI/FC/SAS, and now the "nearline/enterprise grade" SATA drives), mostly assume you're running in a RAID setup, and will timeout a read much quicker.

      The other clever piece is that if you go look at drive specs, MTBF calculations (which are pretty meaningless anyway) for consumer drives assume it's on 9-5 and turned off at night; the enterprise drives get MTBF calculated assuming the drive is on 24/7. Buying one or the other isn't a guarantee either way, but I'll generally fork out for the enterprise drive if it's not going in a desktop, and if nothing else I at least get a longer warranty.

      Also, 1TB+ drives scare me if they're not in a RAID6 group; I'd get 2 more 1TB drives instead of a single 2TB drive.

      S.M.A.R.T.'s "OK/not OK" health check is mostly useless, but if you can get at the counters and the raw value of "Current Pending Sectors", "Reallocated Sectors", or "Offline Uncorrectable" is non-zero, you're seeing hard read errors. (This forum post is a decent explanation of the three.) Seek Error Rate, Hardware ECC Recovered, and Raw Read Error Rate will be non-zero (and can be quite high), and that's mostly normal. If you can get software that'll kick off an extended self-test, too, do that -- the test will abort on a bad section of disk, and if that section holds an important file (which SMART won't tell you), it can indicate you'll see the "I'm trying very hard to read this sector and you can just wait" problem.

      smartmontools is command-line only (and for OS X you need to compile it or use Fink or MacPorts), but lets you do all of the above. SMART Utility puts a GUI wrapper around it, but wants to charge $25.
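
      The incantations for all of the above boil down to something like this (assuming the drive shows up as /dev/disk1, and that your USB/Firewire bridge passes SMART commands through at all, which many don't):

          sudo smartctl -A /dev/disk1            # dump the raw attribute counters
          sudo smartctl -t long /dev/disk1       # start the extended self-test
          sudo smartctl -l selftest /dev/disk1   # read the results; a failed test logs the first bad LBA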

  4. Well, now your story is starting to sound like subtle filesystem corruption. You might want to get started on that slow copy...

  5. CBGoodBuddy says:

    I experienced USB-related system level hangs with my iMac (10.6.x) and USB+Firewire controller from Otherworld Computing. Connected with Firewire, the drive+controller would get painfully hot and the iMac would hang within a day. Connected with USB, the system was more stable, but any other USB activity could cause a hang. Somehow the kernel would get stuck servicing the dead USB drive, and slowly the other OS X system elements would hang, until eventual death. I tried exchanging drives, enclosures and USB cables, but nothing helped. The only usable solution I found was to ditch the drive and get a different one.

  6. Andrew Wilcox says:

    Without resorting to saying "You're all wrong", you're all wrong. Oh wait, I said it anyway.

    The drive manufacturer has absolutely nothing to do with it. I've used Seagate, WD, Samsung, Toshiba, hell, my old mini had a Hitachi in it (pre-WD). The problems only ever manifest themselves in OS X.6, and are caused by kernel issues that are not the disk's fault.

    Zygo's argument is pretty much moot at this point: MacOS is the issue, not the drives, as I've tried many different drives on many different buses (even /internal-only/ can cause this issue) and always experienced this issue on 10.6, and never on FreeBSD, Linux, or Windows Vista (eugh). I've even experienced this issue on a hackintoshed system, so I know it isn't a fault of Apple's shoddy Mac mini chipsets either. No, this issue is OS-related.

    To jwz, your best bet is to go back to 10.5 and put up with the mounting issue / get a new drive and hope for the best with it. There's something amiss in the 10.6 kernel and I'm afraid you won't be able to get around it.

    • Travis Dixon says:

      Does abstracting the storage (e.g. moving them to a NAS) help in that situation? IOW, is the issue one of interacting with the filesystem, or with connected hardware? Just curious...

      • Andrew Wilcox says:

        Somewhat. SMBFS is notoriously full of crap on OS X, and pretty much every kernel panic that hasn't been caused by logic board failures has been caused by com.apple.smbfs.

        That said, kernel NFS and user-land SSHFS both work great. I haven't tried anything more exotic. Oh, I guess I have. AFP also has no issues.

        So basically, your important Mac data should be stored on a NAS running NFS, SSHFS, or AFP.
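
        For the NFS case, the Mac side is one mount command (server name and export path made up here; the -o resvport matters because most NFS servers reject clients that don't connect from a reserved port, which OS X doesn't use by default):

            sudo mkdir -p /Volumes/backup
            sudo mount -t nfs -o resvport nas.local:/export/backup /Volumes/backup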

  7. Back when you first posted about this issue, I was going to postulate that your computers were suffering an overheating problem. (Yet I never did, for fear of possibly incurring your wrath.) But I'm going to boldly make the suggestion now--if you haven't already, try upping the fan speed using a utility like smcFanControl. Seems you've mentioned that the room where these machines live is fairly warm.

    I've run distributed.net on Intel Mac mini systems both new and old. Every time I did, they got disturbingly hot (160-200 degrees Fahrenheit at the CPU!) and the fan speed would not budge. I can't say as they crashed, but the heat level bothered me greatly. So the moment I found out about smcFanControl, I set it up, turned the fan speed up to about twice what it was, and it made a huge difference. It's the one thing I've noticed as a common factor amongst the various Mac mini models--Apple runs 'em hot.

    All I can say regarding your USB problem is that I feel your pain. I don't understand how the Mac OS X USB stack (especially the EHCI part) can be so brain damaged, but it sure is. I wanted to add USB 2.0 support to a 2002 QuickSilver G4. What an education that turned out to be! I finally settled on an NEC-based USB 2.0 card that works 95% of the time. This is stupid. If even Microsoft can make any old cheap-n-nasty USB host adapter work properly under Windows, why can't the Apple brigade get it right?

    I also don't claim to understand how Mac OS X can become so fatally broken when I/O operations go awry. I have seen I/O failures that would have been a mere hiccup under any other operating system you'd care to name bring Mac OS X (10.3-10.5, I'm not on 10.6 yet) to its knees in bizarre ways (mostly by deadlocking some part of the system). This is also just plain inexcusable.

    I hope someone from Apple is reading these postings...

    • jwz says:

      If even Microsoft can make any old cheap-n-nasty USB host adapter work properly under Windows, why can't the Apple brigade get it right?

      To be fair, the reason is that you have that exactly backwards. Microsoft ships whatever shit they like, and then the drive manufacturers don't ship their firmware until it works with that. They don't do that for Apple, who have to chase the tail-lights to be Microsoft-bug-compatible.

    • Andrew Wilcox says:

      I've run distributed.net on Intel Mac mini systems both new and old. Every time I did, they got disturbingly hot (160-200 degrees at the CPU!) and the fan speed would not budge. I can't say as they crashed, but the heat level bothered me greatly. So the moment I found out about smcFanControl, I set it up, turned up the fan speed to about twice what it was and it made a huge difference. It's the one thing I've noticed as a common factor amongst the various Mac mini models–Apple runs 'em hot.

      WRONG. This is a sign of your SMC starting to fail. You should never have to run smcFanControl to make the fan spin up past the low 1700rpm base speed. Apple's firmware is designed to start raising the fan speed at around 75 C, and you'll be at max (3500rpm on 3rd gen, 5000rpm on 4th gen) by 90 C. I've had the "fans stuck at idle" problem on 2 Intel Mac minis, and both bricked themselves within 6 months. smcFanControl seems to make them last longer (probably due to turning down the heat) but they all eventually brick.

      tl;dr: If you have to get a mini, get one with AppleCare.