
Activity Monitor is useless. I don't think I have ever had it answer the question, "Why are things slow?" It seems to only be capable of ever telling you, "Load is zero! kernel_task has used 7 CPU-years!" The macports version of "top" seems similarly worthless. "fs_usage" can't seem to say anything other than "yep, iTunes is reading your disk" (which is not surprising as it's playing music).
I've got 51 GB free, 32 GB of RAM, zero swap utilization, and Disk Utility has no complaints, even in recovery mode. But it's behaving like, I dunno, some bus is saturated and everything stops. Even just typing this text into an Emacs buffer, it randomly gave me a hypnowheel for 30+ seconds. Opening a Safari window just took a full minute, but yup, load is zero. This is some fun new definition of "load" with which I was previously unfamiliar. Even "ps" is slow.
And then, eventually it stops and everything is back to normal again. For a while.
If the problem is that the disk is dying, I'd like some diagnostic software to actually prove that to me before I pull the machine apart, because replacing a disk in an iMac is a gigantic pain in the ass (not to mention the 3 days of downtime waiting for the restore to happen.)
I think that's pretty much what my previous MBP was doing in the couple of weeks before it started spontaneously rebooting, and that was a logic board problem.
I waved a rubber chicken at PRAM several times.
Two suggestions:
1) Apple Diagnostics
2) Try booting from an external disk. That'll be slow too but not as hypnowheely if your internal disk is dying.
Also, before you open it up, try reinstalling the OS.
I didn't know about Apple Diagnostics. But it says "no issues".
It may be worth a detour through Disk Utility to check for a slowly dying storage medium:
https://support.apple.com/kb/PH25413?locale=en_US&viewlocale=en_US
My old 2011-ish iMac had exactly the same symptoms. I also suspected the hard drive but could never find anything specifically wrong with it. I even tried a complete OS reinstall. I fixed it in the end by replacing the HD with an SSD. You also need a thermal sensor, since the original HD had one built in and the iMac fans run at full speed without it. Search for “OWC iMac thermal sensor”.
It still annoys me that I couldn’t diagnose the exact problem but the SSD fixed it.
Same thing for me with an '07 iMac. It just started getting slower and slower, no matter what I did. Like, 20-30 seconds to simply move from one tab to another in Safari (yikes!). All diagnostic things I tried showed that everything was fine. The thing that made me think it was a drive issue was when I hooked an external drive with nothing but the OS on it via Firewire, and suddenly everything was pretty much back to normal. I replaced the HDD with a SSD, and now it's well again.
I suspect MacOS load average is only ever CPU related, being derived from a BSD kernel. Linux is the odd one, due to someone deciding to include disk IO and waiting on kernel locks in the load average calculation back in the day:
http://www.brendangregg.com/blog/2017-08-08/linux-load-averages.html
A test would be to mount some NFS volume, take the volume away at the server end, then watch the load as you try to access it. It spikes on linux boxes, probably stays the same on MacOS.
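Something like this would show it (untested sketch; "someserver:/export" is a placeholder for whatever you have handy):

    sudo mkdir -p /mnt/nfstest
    sudo mount -t nfs someserver:/export /mnt/nfstest
    # ...now unplug or firewall the server, then:
    ls /mnt/nfstest &     # hangs in uninterruptible sleep
    uptime                # 1-minute load climbs on Linux, probably not on macOS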
Down in the linux ghetto, I'd be reaching for dmesg to look for some hardware bus timing out and being generally useless, but that doesn't help you :-(
I thought the loadavg change came in for compatibility with one of the Unixes at the time (Solaris? SunOS? Ultrix?). I recall participating in (or at least reading closely) a thread discussing the pros and cons of counting IO-wait with CPU-run at the time. I also think I recall seeing a Solaris box in the 1990's with a loadavg "mountain" on xload because Solaris had no writeback caching by default (or at all?), and IO saturation from our application was piling ever-increasing loadavg load on the system.
There are definitely new things in Linux that should be excluded from loadavg. Tasks in FROZEN state are marked uninterruptible because they can't run, not even to handle signals, but they are counted in the load average. So you freeze 100 processes in a cgroup because you want to pause a build, and now your completely-idle machine has a loadavg of 100.
My 2012 iMac has been doing that more and more, and for the longest time it was kernel_task growing into an ungodly behemoth managing.. I dunno. Sync services? It seems better now, but I'm pretty sure it's a race between the storage and the display for what will force me into a new one with a probably worse CPU.
Harumph. Stupid state of CPUs.
If you have a spinning hard disk, I'm inclined to think it's due to fragmentation - some parts of your user profile being spread far and wide across the disk so when you ask it to load Safari it has to go and fetch 20 bits of data from 20 different places on the disk just to be able to populate your bookmarks, for instance.
Try doing a complete backup of your user profile, then blow it away and copy it back (which will put everything on the disk in order instead of scattered), see if the problem is improved.
An SSD/HDD combo might help a great deal, either a fusion drive or a setup like this person's: https://mattgemmell.com/using-os-x-with-an-ssd-plus-hdd-setup/
DTrace can answer all these questions. You might need to disable SIP.
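macOS ships a pile of canned DTrace scripts (iosnoop, iotop, and friends); if they're still present on your release, a rough starting point looks like this, with SIP toggled from the Recovery Terminal if it gets in the way:

    csrutil status       # 'csrutil disable' has to be run from Recovery, not here
    sudo iosnoop         # per-I/O latency, process, and filename
    sudo iotop 5         # which processes are generating the disk traffic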
This is happening to me right now on my 2012 iMac. It has behaved poorly with the last few releases of OSX with occasional beach-balling for a few seconds. I presumed none of the devs have spinning disks, and they just keep writing in more disk I/O.
But it got really bad last weekend (beach balling for minutes at a time), and I suspected the disk is dying.
The scant evidence I saw was in Console.app - the disk error messages don't seem to make it to syslog anymore. There was a disk IO error reading a sqlite database for photos.app. I have no idea if this is a one-time occurrence, but smartctl says that it has 24 "current_pending_sector", which I think is bad? (This SMART stuff is new to me.) smartctl still claims that the health of the disk is fine, even while reporting that it failed a "quick" self-test with a read error.
I ended up breaking down and buying a MBP (even though there is no escape). If I ever find some spare time and motivation, I'll tear apart the iMac and replace the HD. Maybe give it to the kid or something.
Happened on one of my household's Macs (2012 Mac Mini) after the High Sierra update. I turned Time Machine off, and the problem went away, so I assumed Apple borked the USB disk driver even more than it already was borked.
For diagnostics you want something that will look at the SMART (sic) health of the drive. I use Disk Drill whose free edition will let you view SMART.
Sorry I have no fix.
Disk Utility has a field for SMART and I've never in my life seen it say something other than OK.
I went to a talk a while ago (years) by someone who should have known about disks and they said that, based on presumably large samples, SMART was useless: in almost all cases it said everything was fine until the disk actually failed. Perhaps this is no longer true, but I suspect it is.
As a general rule, SMART "OK" means anything from "OK" to "oh God oh God we're all going to die."
The individual attributes are the interesting bits:
On an SSD, you'll want to look at "% lifetime used". Every SSD has one of these, and each one is stored as a vendor-specific attribute. If Disk Drill understands your SSD, then it includes "wear level" under the menu bar icon. As flash cells wear out it gets harder to erase them and write speed degrades.
On a HDD, you'll want to look at the number of remapped sectors. When the HDD firmware has trouble reading a sector, it will spend a lot of time messing with amplifiers and re-trying. If it spends too much time doing that, it gets mad and moves the data permanently into a spare sector. I'd say if you have more than a dozen of these, you should worry.
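If you have smartmontools (MacPorts or brew), something like this pulls out just the attributes worth staring at; the exact names vary by vendor, so treat the grep pattern as a guess:

    sudo smartctl -a /dev/disk0 | egrep -i \
      'Reallocated_Sector|Current_Pending_Sector|Offline_Uncorrectable|Wear_Leveling|Lifetime'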
Looks like my internal drive is a hybrid, 3TB physical, 120GB SSD. "smartctl -a" doesn't appear to have any real complaints, but Reallocated_Sector_Ct on the physical disk is 208. So that's bad? Maybe? And Wear_Leveling_Count on the SSD has the completely non-opaque value of 4741777524599. There are a zillion attributes here and the documentation is.... exactly what I expected it to be.
I think your HDD is dying. At work we make RAID systems and we would absolutely tell a customer to replace it.
Would like to hear others chime in before you spend money though.
It might be causing the hypnowheel, but I doubt it.
If you do have to replace anything, replace the SSD+HDD with a single SSD if at all possible.
Your symptoms are similar to what I used to experience on my 2010 Macbook Pro when it had a dying hard drive. Although it wasn't as bad as you've described, there were certain times where it would hang for no reason, then resume. It also got ungodly slow, and it got worse and worse over time. The stats you've quoted for reallocated sectors also doesn't sound good.
When I looked at copying the drive over, I found certain areas were completely unreadable. I didn't lose anything important, and I had backups, but that was the smoking gun for it having been a hard drive issue.
I know its a pain to change out the hard drive on the iMac, but it might be time to get a largish SSD and go that way. The SSD made my Macbook Pro much faster and is a worthy upgrade even if your system wasn't having issues.
I realize I am repeating what has already been said, but hopefully my contribution amplifies the signal. I'm fairly sure this is your issue, and hope you are able to get it fixed soon.
Good luck.
It's already a hybrid SSD. I don't think 3.5" 3TB full-SSDs are even a thing that exists.
They're starting to. I've seen a couple in the 3-4TB range from Samsung & Intel but they're pretty pricey - well over $1k.
They have 2.5" drives of various sizes and adapters here. I installed one in my machine a couople of years ago, and quite like the result.
None are listed at 3TB, but, depending on the model of iMac you have, I believe you could install two of them to get the storage size you desire.
https://eshop.macsales.com/shop/ssd/owc/imac
And even if I can fit a pair of 2TB drives in there, it would cost me $1,400.
At that price, I think spinning platters sound pretty good.
You could get a 500GB or 1TB sata ssd, which are cheap, and put all your slow IO (videos/music/etc) on a thunderbolt or firewire drive taped to the back of your imac.
500GB sata ssds are about $150. Pay someone to install it for you.
Actually, it looks like a normal ssd over thunderbolt 2 is faster than an hdd on sata (I could be wrong), so if you have thunderbolt 2 or greater, you don't even need to disassemble or install anything. Just get a 500GB sata ssd and a 6 TB hdd and an enclosure that fits both, and then dive into `ln -s` madness, and you are done.
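The symlink part is roughly this, assuming the external volume mounts as /Volumes/BigSlow (pick your own name):

    mv ~/Movies /Volumes/BigSlow/Movies
    ln -s /Volumes/BigSlow/Movies ~/Movies
    # repeat for ~/Music, ~/Pictures, or whatever else is huge and rarely touched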
208 reallocated sectors definitely means your disk is in trouble. Current pending sector count (sectors that the disk found to be bad, or had trouble reading) is also an important value.
Disk Utility won't indicate any problem with a disk's SMART status until a threshold (set by the disk vendor) has been exceeded. Most failing drives are in serious trouble long before the threshold value is finally exceeded.
This++. If the drive reports reallocated sectors, or pending (unreadable) sectors, the drive is starting to have physical problems. It's time to replace it when these values are nonzero. It's not a guarantee that it's about to drop dead suddenly, but that is a definite possibility at any time now.
The other poster is right about the "OK" SMART overall status value being useless. Only the detailed fields contain any useful info; reallocated and pending are the ones that are the best indicators of drive health.
C.
I'm sure this is a question with no good answer, but would it not be sensible, then, for that "Ok" in Disk Utility to change to "Time to buy a new disk" when that number was nonzero?
For that matter, doesn't it seem likely that every single person who has ever downloaded and installed "smartctl" just wants to know "is my disk fucked?" (dot com)
There is no correlation between the numbers and disk failure. People have been looking for one, sometimes in studies of tens of thousands of drives at a time, but so far nobody's found evidence that a correlation exists.
I have disks that have been operating for 6 years with anywhere from 8 to 250 reallocated sectors. I also have disks with 0 reallocated sectors, but they just stopped spinning up one day. I even have disks that old, with no service issues I can detect, but reporting SMART "not-OK" status.
Basically the way you detect that a disk is getting slow because its head positioning servos are jittery is by measuring the raw disk access latencies and going "yep, that's too slow to use, throw it away." SMART doesn't detect or report that (a gap in the SMART schema IMHO), so you have to do it yourself.
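There's no polished tool for it that I know of; a crude, untested sketch is just timing a pile of small reads at scattered offsets on the raw device (read-only, so it shouldn't hurt anything, but double-check the device name first):

    # 100 4 KB reads at arbitrary offsets; a healthy disk answers each in a few ms,
    # a sick one takes hundreds of ms or worse
    time sudo sh -c 'for i in $(seq 100); do
      dd if=/dev/rdisk0 of=/dev/null bs=4k count=1 skip=$((RANDOM * 997)) 2>/dev/null
    done'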
I think of the number of reallocated sectors like the PSA reading for prostate cancer. It's not the value so much as how fast that value is increasing.
Looking at the PSA value is about as useful as the SMART status.
Most men die with a little bit of prostate cancer, but hardly anyone dies because of it. Similarly, most drives plug along just fine until after you throw them away.
(While we shouldn't go too far with this analogy, a little finger in the ass is welcome once in a while.)
Yes, it would be perfectly sensible for Disk Utility to say "Buy a new disk" instead of "OK", but you're thinking like an engineer. To understand why SMART is useless, you need to think like a marketer.
From the first days of SMART's introduction, technical folks have wondered why it's so useless at predicting drive failure. The processor(s) in modern hard drives - and by "modern", I mean everything from the post-ST-506, IDE-and-later era - collect enough information from embedded sensors of various types that the drive should be able to predict all of the common hardware and media failures which can doom your data. That means things like spindle bearings, motor and voice coil windings, head actuator bearings, etc.
So why don't they? The marketing guys made the engineers turn off prediction of failures, because it made the products look less reliable. If you don't predict "this drive is going cactus in a few months, because I'm getting a lot more flying height errors and servo track alignment errors than I should", then the drive looks very reliable. Right up until it actually dies and throws your data on the floor.
Can you prove it? No. But this was discussed at length, with at least one guy claiming to be a primary source for it:
https://groups.google.com/forum/#!topic/alt.folklore.computers/o_6UXZAoIl8%5B676-700%5D
Yes, people want SMART to tell them about impending failures. But the marketers didn't like it.
So a clean SMART bill of health means nothing; your drive could still be on the edge of throwing all your data on the floor. But certain of the reported values, when they start climbing, are indicative of developing problems. Reallocated sectors and pending/unreadable sectors are two of these.
Why should you believe me? Well, maybe you shouldn't. But I administer a significant number of machines, with a lot of spinning disks, and have done so for a lot of years. I've replaced a lot of disks.
C.
Sure, that all makes sense, but I was talking about the GUI of Disk Utility -- and from the POV of Apple, who are interested in A) having tech support spend less time talking to customers and diagnosing fiddly nonsense and B) moving product... "buy a new disk" seems the right thing to suggest.
Likewise, I assume the authors of smartctl are... shall we say, not unduly influenced by the drive manufacturers' marketing departments.
If Disk Utility had said to me "your disk will soon no longer be a disk" I'd have just spent the money and avoided a lot of frustration. And a blog post.
True, but the SMART metadata is just a bag of key/value pairs. There's no standard schema for the key names, and the values can be in any units the manufacturer wants. The only good thing is that the drive also reports a threshold for each value; if the value is beyond the threshold then the drive is failing (or old, or whatever). Of course, the reported thresholds might be meaningless...
I think the issue here is that the "Drive health: OK" is actually the value being reported by the SMART self-test.
You're right that the smartmontools devs - and presumably Apple devs - should look at the detailed values being returned and override that bullshit "All is well" summary. I don't know why Apple's guys don't. Every sysadmin I know ignores the SMART overall health indicator as useless, and relies on the detailed values. The smartmontools/smartd developers at least do send warnings when these values climb, even if the drive still reports everything is okay, so they're sort of doing that already.
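For reference, the smartd side of that is about one line of config; this is from memory, so check smartd.conf(5) for your platform's device names and the right file path:

    # smartd.conf -- monitor all attributes, mail when something starts climbing
    /dev/disk0 -a -m you@example.com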
C.
Carlos, I think the link for the Frank McCoy message you're referring to is https://groups.google.com/d/msg/alt.folklore.computers/o_6UXZAoIl8/GX-LjjiICucJ -- your link doesn't open a specific message for me.
How do we get the good predictive models back in SMART?
Aren't there CEOs capable of understanding that appearing to be more reliable and losing more data is not as profitable as being more accurate and saving more data?
As SSDs plummet in price and match spinning disk capacity, can we at least get a way to report SSD expected time to failure in background OS monitoring? Surely the OS vendors can't miss the value proposition, or am I missing something?
Once your reallocated sector count starts going up, it's time to replace the drive. Sector reallocation appears as long IO pauses to the OS, and the hypnowheel is basically just a UI for the mach completion ports that are polling. Not saying this is the true root cause, but it's certainly consistent with the data.
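You can sometimes catch those pauses in the act with fs_usage's disk-I/O filter (assuming -f diskio still exists on your release); reads that take whole seconds instead of milliseconds are the tell:

    sudo fs_usage -w -f diskio    # watch the elapsed-time column on the right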
There are basically two categories of drives: desktop (aka "green", "cheap", the drive that likely ships with your computer, and the drive you're likely to get if you sort by price on Amazon) and NAS (aka "enterprise" or "RAID").
The difference--that you pay anywhere from 25% to 300% more for--is that the NAS drives report IO errors and the desktop drives don't. The SMART data is better too because it's not based on lies.
Desktop drives are designed for users with no competent admin (i.e. no backups and crap software that just keeps processing garbage instead of handling IO errors) and single-disk configurations. They spend up to 120 seconds retrying reads before giving up, and they only give up when they are dead. In most cases just reinstalling software over and over will recover enough lost data to limp along until the next bad sector, so people just reinstall stuff and the drive magically gets faster.
NAS drives are designed for use with a competent admin (i.e. backups, RAID redundancy, and OS or application software that knows what IO errors are and what to do when they happen). The drives spend 0.7 seconds retrying read errors, then kick the error up to the OS. The OS then replaces the lost data with a copy from another disk in the RAID array, or the admin replaces it from a backup. The admin monitors the recovery events per disk and replaces disks that are being recovered too often. Overwriting bad sectors remaps them to good sectors, so the pattern with these drives is a handful of bad sectors each year that go away as the OS replaces them.
If you have a hybrid drive you could have SSD and HDD failure modes overlapping.
SSDs get slow because they are internally fragmented. Every new write forces the drive to rewrite up to 4x as much existing data in order to fit it into erase blocks. The way to fix this is to enable discard in your OS filesystem, or run a utility that does TRIM in batches. Enabling discard all the time makes many SSDs run slower on average, and some SSDs have devastating firmware bugs which mean discard should not be used at all.
In the very worst cases for SSD drives, run a full-device TRIM or SECURE ERASE to completely reset the FTL block address translation layer. This erases your entire disk, so you'll want to not do that while the disk is your root filesystem, but it will make the disk go fast again when you restore your backups onto it.
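For the everyday case, the knobs are roughly these (hedged: trimforce prints a scary warning and reboots the machine, and fstrim is the Linux spelling):

    sudo trimforce enable    # macOS: turn on TRIM for third-party SSDs
    sudo fstrim -v /         # Linux: one-shot batch TRIM of the root filesystem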
HDDs get slower as they age. Cheaper disks get slower faster than NAS/enterprise disks. The head starts to wobble over the track and the only way to get data is to just keep spinning until eventually the track wanders under the read head on its own. Cheap HDDs will wait up to 120 seconds for that to happen, making the drive five orders of magnitude (99.999%) slower, but reporting no errors.
Smartmontools (via ports or brew) might give you more detailed information on that front.
I had similar issues with a 2012 non-Retina MacBook Pro. Mysterious beach-balling that seemed vaguely related to disk IO, minimal CPU usage, no memory pressure. Disk checks all looked fine, RAM checks were fine.
Turned out that the SATA cable was going bad. Replaced that and everything immediately worked fine. This is a very common issue with MacBooks of that era and it's really hard to diagnose.
Typically you'd expect a desktop computer to have a much more robust SATA cable than the wispy, brittle thing in a MacBook, but I'm not sure about the iMac. Might be worth looking into anyway.
Get sar back. For some ungodly reason they removed it from Sierra. I should add, it won't tell you what was doing it, but it'll tell you what kinds of activity were going on, which you can cross-reference with other logging.
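In the meantime a poor man's sar is just logging the stock samplers and lining the timestamps up with the freezes later; something like:

    while true; do
      date; iostat -d disk0; vm_stat | head -5
      sleep 60
    done >> ~/activity.log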
Or it might just be the way Finder is made.
For my mini, it turned out kernel_task was just appearing to take up CPU because it thought everything was overheating (fan was at max, but not very loud; CPU speed was being artificially slowed). Blowing the dust out brought everything back to normal.
What's so infuriating about this is that there are 30 procs in some kind of kernel wait state and Activity Monitor is unable to monitor that activity in any meaningful way.
YOU HAD ONE JOB
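At least plain ps will show them, which is more than Activity Monitor manages; state 'U' is uninterruptible wait on BSD-ish systems, if I'm remembering the codes right:

    ps axo pid,stat,comm | awk '$2 ~ /^U/'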
Not being a Mac user myself, I'm unsure this stuff is still applicable, but the idea of investigating with dtrace seems worth a try:
http://dtrace.org/blogs/brendan/2011/10/10/top-10-dtrace-scripts-for-mac-os-x/
It's just time to replace this OSX shit with GNU/Linux, where the diagnostic tools actually work.
Uh huh. So does closing the lid on your laptop work yet? Asking for a friend.
Okay I'll bite: it is supposed to work nowadays.
"it is supposed to work nowadays." -- The FOSS Credo
Suspending works. Resuming is still a crap shoot, though.
That said, modern laptops run for like 15 hours if you just send SIGSTOP to a handful of processes (e.g. chromium) when the lid is closed and send SIGCONT again when it's open. More than enough time to transit between power sources. I haven't suspended my laptop or turned it off since...August? July? Definitely not in 2018.
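The stop/continue part really is just this (chromium standing in for whatever your own power hogs are):

    pkill -STOP -f chromium    # on lid close: freeze the hogs
    pkill -CONT -f chromium    # on lid open: thaw them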
I wanna play: Everyone knows emacs is shit bloatware. I only use vi and I have never had this problem.
Ye olde dmesg(8)?
(i just wanted to say "olde" - `log show` and `log stream` are nicer interfaces to all of macos's logs, including what the kernel emits. Console.app is just the GUI version of log(1))
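e.g. something like this; the predicate syntax is NSPredicate-ish and I won't swear the field names are exactly right:

    log show --last 2h --style syslog --predicate 'eventMessage CONTAINS[c] "error"'
    log stream --predicate 'process == "kernel"'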
A story that might not help you directly, but might spark some debate that does, I hope. I use the Intel 'rapid storage' thing (on a Windows machine). The similarity to your case, I think, is that it uses a small(ish) SSD as a cache for the HD. Over the last month, I started to experience more frequent and longer system freezes. I'd had this before, and did again what I did then: remove the cache SSD, then just add it back. That fixed the issue immediately and completely.
So, for your case: is there any chance the hybrid setup also does some form of caching, and that this 'cache on an SSD' thing has the same issue (not sure what exactly; cache fragmentation, or internal stuff in the SSD?), and that there's a way to tell the hybrid system to drop or re-init its cache?
Everyone seems to be putting their money on the hard drive. I'm placing my bet on battery or temp sensor issues--just like the slow iPhone fuss a few months ago.
So now I've got other bad craziness going on.
I have a new external drive, and I've installed the OS on it and copied all of my files back (then re-installed the OS over that with Recovery, because for some reason it wouldn't boot without that, but whatever.)
I am booted off the new external drive, and the old internal drive is unmounted.
Every now and then, the machine just freezes solid for 10 to 20 seconds, then comes back.
Several times now, it has frozen; and then over the course of a few seconds, windows would start to spontaneously close; and then it would be stuck so solid I had to power cycle it. Once I got a kernel panic.
Possibly related, but who can tell: I have a second external drive that I use for Time Machine, and the internal drive is in the process of making its initial backup to it. Maybe the freeze doesn't happen, or happens less frequently, if backupd is not running.
What kind of component failure could cause this? I assume the hardware diagnostic would have detected bad RAM.
I'd really rather not have to spend $5k on a whole new computer to fix this, especially when that new computer isn't going to be significantly faster.
Oh, and you know what's fucking awesome?? If it's the very first Time Machine backup, and the machine reboots before it is 100% done, the next time it starts over from scratch. Since this backup is going to take three days, and the machine can't seem to stay booted that long, it will never be able to complete.
It's fucking rsync! How do you screw up this badly??
Oh, but apparently I've got multiple multi-TB temporary directories in the "inProgress" directory now. Nice.
Dan Benjamin had a similar issue this week, and discovered that Crashplan was burning a lot of cycles without it showing up in Activity Monitor: https://twitter.com/danbenjamin/status/973969181145288705 Are you using it, by chance?
Nope.
I got the Time Machine backup to finally work by adding nearly everything to the "don't back up" list, which let the initial backup be tiny and succeed. Apparently the very first backup is non-incremental and restarts from scratch, but subsequent backups will continue incrementally after a crash. Sigh.
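(For the record, the exclusion list can be driven from the command line too; I'd double-check the flags against tmutil's man page on your release:)

    sudo tmutil addexclusion -p /path/to/huge/stuff     # exclude by path
    tmutil isexcluded /path/to/huge/stuff               # verify
    sudo tmutil removeexclusion -p /path/to/huge/stuff  # undo later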
If you really want some hardcore diagnostics, there's DTrace. Learning how to make it do useful stuff is completely another matter though. Here's the best writeup that I've seen on this topic http://dtrace.org/blogs/brendan/2011/10/10/top-10-dtrace-scripts-for-mac-os-x/
Some other links here: https://awesome-dtrace.com
Nice links!