wherein Spotlight is found to be useless

On OSX, the "xscreensaver-getimage-file" Perl script is really slow. I'm trying to fix that.

This is the script that picks a random image file for the screen savers to load. It descends a directory, gathers up all the JPEG, PNG, and GIF files, and picks one at random.
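
As a rough illustration of that job (in Python rather than the Perl of the real script, and with the function names and the extension list being assumptions made for this sketch), the naive version looks something like this:

```python
import os
import random

# Assumed extension list, for illustration only.
IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png", ".gif")

def list_images(root):
    """Return every file under root whose name ends in an image extension."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.lower().endswith(IMAGE_EXTENSIONS):
                found.append(os.path.join(dirpath, name))
    return found

def pick_random_image(root):
    """Pick one image uniformly at random, or None if there are none."""
    images = list_images(root)
    return random.choice(images) if images else None
```

Every image has the same chance of being picked, but the whole tree has to be listed first, and that is the slow part.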

On my Linux machine, running it on a directory containing 47,000 files in 750 subdirectories takes 20 seconds the first time, and 5 seconds on subsequent runs. On my Mac, it takes 49 seconds the first time, and 43 seconds the second time. As both machines have 7200RPM SATA drives, this leads me to believe that HFS+ sucks compared to Ext3fs, but that insight isn't particularly helpful.

So I thought, maybe I can speed this up by using Spotlight instead of iterating the directories myself. So after some googling, I tried:

    mdfind -onlyin /Users/jwz/Pictures \
          "kMDItemContentTypeTree == 'public.image'"

Oops, sorry, turns out "ContentType" is useless: it's only set on certain files, like ones created with Photoshop; I've got a ton of .jpg files that don't have that attribute (though Finder knows they're images.) Nice. So instead you have to do:

    mdfind -onlyin /Users/jwz/Pictures \
          "kMDItemContentTypeTree == 'public.image' ||
          kMDItemFSName == '*.jpg' ||
          kMDItemFSName == '*.jpeg' ||
          kMDItemFSName == '*.pjpeg' ||
          kMDItemFSName == '*.pjpg' ||
          kMDItemFSName == '*.png' ||
          kMDItemFSName == '*.gif' ||
          kMDItemFSName == '*.tif' ||
          kMDItemFSName == '*.tiff' ||
          kMDItemFSName == '*.xbm' ||
          kMDItemFSName == '*.xpm'"


And that takes... wait for it... wait for it... six minutes the first time, and seven minutes the second time.

Update: So, it turns out that when "mdls" on a .jpg file does not list kMDItemContentTypeTree, what that means is "your Spotlight index is incomplete." Somehow, Spotlight failed to index a whole bunch of my files. So last night I nuked and re-created the index ("mdutil -E /") and now I can match files with either kMDItemContentTypeTree or kMDItemDisplayName. (kMDItemDisplayName is faster than the "FSName" parameters, which apparently stat() the file every time.) But it's still incredibly slow. Way slower than traversing the disk directly.

I guess I could cache the results somewhere, and only re-list the directory once a day or something, but that's pretty lame.

Any other suggestions?

Update: I managed to speed it up a lot by reducing the number of stats (by assuming that certain file extensions are the gospel truth). It seems like using Spotlight for this is just a bad idea, which is too bad.
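
A minimal sketch of that idea, in Python rather than the script's Perl, with a hypothetical function name and extension list: any name with an image extension is accepted without a stat, and only the remaining names are checked to see whether they are directories.

```python
import os
import stat

# Assumed extension list, for illustration only.
IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png", ".gif")

def list_images_few_stats(root):
    """Walk root, calling lstat() only on names that lack an image extension.

    Returns (image_paths, number_of_lstat_calls).
    """
    images, stat_calls = [], 0
    pending = [root]
    while pending:
        directory = pending.pop()
        for name in os.listdir(directory):
            path = os.path.join(directory, name)
            if name.lower().endswith(IMAGE_EXTENSIONS):
                # Assumption: the extension is trusted, so no stat here.
                images.append(path)
                continue
            stat_calls += 1
            if stat.S_ISDIR(os.lstat(path).st_mode):
                pending.append(path)
    return images, stat_calls
```

The trade-off is that a directory named something like "foo.jpg" would be reported as an image and never descended into.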

Tags: , , , ,

47 Responses:

  1. alecm says:

    Other than "configure a UFS partition somewhere", no, alas I don't. I've run MH / EXMH as a mail system for 19 years (eek) and that means I currently have 115,000 mail messages stored as separate files in 75 folders/directories.

    I used to run this on MacOS and there is/was nothing I could do to get HFS+ to deal with it efficiently, and it was worse when I had FileVault switched on.

    You might kludge around it by creating a UFS volume containing copies of all your pictures, but that's not really sane.

    Your best bet really is some form of filename caching, perhaps with a quick cache-freshness check against mtime of respective directories in ~/Pictures
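
One way the caching suggestion could look, sketched in Python under several assumptions: the cache is a JSON file at a caller-supplied path, and the function names and extension list are made up for the example. Each directory's listing is reused as long as that directory's mtime has not changed, so a warm run stats the directories but not the files.

```python
import json
import os

# Assumed extension list, for illustration only.
IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png", ".gif")

def scan_directory(directory):
    """List one directory: returns (image names, subdirectory names)."""
    images, subdirs = [], []
    with os.scandir(directory) as entries:
        for entry in entries:
            if entry.is_dir(follow_symlinks=False):
                subdirs.append(entry.name)
            elif entry.name.lower().endswith(IMAGE_EXTENSIONS):
                images.append(entry.name)
    return images, subdirs

def cached_image_list(root, cache_file):
    """Return all image paths under root, re-listing only changed directories."""
    try:
        with open(cache_file) as handle:
            cache = json.load(handle)
    except (OSError, ValueError):
        cache = {}  # no usable cache yet: every directory gets scanned
    fresh, result, pending = {}, [], [root]
    while pending:
        directory = pending.pop()
        mtime = os.stat(directory).st_mtime_ns
        old = cache.get(directory)
        if old and old["mtime"] == mtime:
            images, subdirs = old["images"], old["subdirs"]
        else:
            images, subdirs = scan_directory(directory)
        fresh[directory] = {"mtime": mtime, "images": images, "subdirs": subdirs}
        result.extend(os.path.join(directory, name) for name in images)
        pending.extend(os.path.join(directory, name) for name in subdirs)
    with open(cache_file, "w") as handle:
        json.dump(fresh, handle)
    return result
```

A directory's mtime only changes when entries are added, removed, or renamed directly inside it, which is why every directory still has to be checked individually rather than just the top one.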

    • alecm says:

      I don't have Tiger around to check that DiskUtility still permits you to create UFS volumes in .dmg files; but I can attest that when I moved my MH ~/Mail to a UFS partition on my iBook, the system suddenly sprang back to life.

  2. jarkman says:

    Do you actually need the whole list, or just one random pick?

    Perhaps you could random-walk the directory tree to pick the file. It would screw up the relative probabilities of images at different depths, but it would run quick.

    • jwz says:

      Screwing up the relative probabilities would be Bad.
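
To put numbers on how far a random walk skews things, here is a small Python calculation over a made-up tree (the layout and the function name are hypothetical). It assumes the walk picks one entry uniformly at random at each directory level.

```python
from fractions import Fraction

def random_walk_probabilities(tree, prefix="", weight=Fraction(1)):
    """Map each file path to its chance of being picked by a random walk.

    tree is a dict: a value of None marks a file, a nested dict a directory.
    """
    chances = {}
    if not tree:
        return chances  # empty directory: the walk dead-ends here
    share = weight / len(tree)
    for name, child in tree.items():
        path = prefix + "/" + name
        if child is None:
            chances[path] = share
        else:
            chances.update(random_walk_probabilities(child, path, share))
    return chances

# Hypothetical layout: one image at the top level, 99 in a subdirectory.
tree = {"top.jpg": None, "album": {f"img{i}.jpg": None for i in range(99)}}
chances = random_walk_probabilities(tree)
print(chances["/top.jpg"])         # 1/2
print(chances["/album/img0.jpg"])  # 1/198
```

A uniform pick would give each of the 100 images a 1/100 chance; the walk shows the top-level image 50 times as often as that.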

      • solarbird says:

        Still, jarkman has half an idea. You know the total number of files in a tree more easily than you know their individual metadata, and that should be a lot faster to determine. Then pick a well-distributed random number and walk the tree, in order, to the file at that position. If you end up with something that's not a picture, you can either hop over an additional (smaller) random number with rather minimal effort, or start the whole procedure over with a new random number. Unless you have a lot of not-images in your Pictures directory, you shouldn't have to restart often. Regardless, do this intelligently with a lot of directory counting more than directory name or metadata reading and it should be a lot faster, yes? I mean, even if it were no faster at all for some reason, it should cut the average selection time (roughly) in half without affecting the randomness of the distribution.

        Obviously this would not be effective in a file structure containing many fewer pictures than other file types, but that isn't the case here so I would think you would be safe from too much of that sort of Monte Carlo badness.

        • jwz says:

          You can't know the total number of files in a tree without knowing whether each of them is a directory or a file, and you can't know that without stat()ing each of them, in which case you already have all the (FS) metadata.

          • solarbird says:

            Okay, so you count files and directories, and if the random number you picked happened to plonk you on a directory, treat it as a non-graphic-file and pick a new random number. It's the same principle.

            The whole point, really, is to avoid doing all the stat()s. Of course, given my results with simple ls -R, it may be that you might also just consider dropping the wildcard/extended info filtering in the OS and doing it yourself instead. Maybe that's the bottleneck.

            • jwz says:

              You can't know the total number of files -- the modulus of the random number -- until you've already statted everything.

          • bodyfour says:

            > and you can't know that without stat()ing each of them

            OS X (or HFS+) doesn't fill in de->d_type?

            • jwz says:

              I only just learned about getdirentries(), below -- but Perl doesn't seem to expose it.

              • bodyfour says:

                Sorry, missed the fact that we're talking about perl here. Maybe it's time to write it in C instead?

                d_type is part of the normal "struct dirent" that you get back from readdir(). The things you have to remember about it though:

                • I think most modern UNIXes have it, but if you're trying to be 100% portable you shouldn't assume it's there (protect via "#ifdef DT_UNKNOWN" or the like)
                • Not all filesystems support it. Others will return it only if it can be determined without an additional seek. So you need to always be prepared to do the stat() yourself if you get DT_UNKNOWN.

                It's a nice optimization though since you can skip the stat() a lot of the time. Haven't tested on HFS+ though so I can't say for sure if it's supported there.

                See /usr/include/sys/dirent.h
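
The same optimization can be sketched without writing C. Python's os.scandir() returns entries whose is_dir() and is_file() answers come from the directory entry's type field when the filesystem supplies one, and fall back to a stat otherwise. This is a Python stand-in with an assumed function name and extension list, not the Perl script itself:

```python
import os

# Assumed extension list, for illustration only.
IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png", ".gif")

def list_images_scandir(root):
    """Collect image files under root, using directory-entry types where available."""
    images, pending = [], [root]
    while pending:
        directory = pending.pop()
        with os.scandir(directory) as entries:
            for entry in entries:
                # With follow_symlinks=False these calls normally need no
                # extra system call when the entry type is already known.
                if entry.is_dir(follow_symlinks=False):
                    pending.append(entry.path)
                elif (entry.is_file(follow_symlinks=False)
                      and entry.name.lower().endswith(IMAGE_EXTENSIONS)):
                    images.append(entry.path)
    return images
```

Whether this actually saves the stat calls depends on the filesystem reporting a type other than DT_UNKNOWN, as noted above.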

      • solarbird says:

        It looks like just losing the wildcard and metadata searching makes it much faster. Even ls -R | wc on my user directory (>37000 files) on my 837MHz G4 PowerBook returns in 9 seconds (multiplies out to 11.4 seconds for 47000 files) and that's doing a lot more than you'd need to in the counting scenario I outline above. Even if it weren't, that would still be a 4.5 second (5.7 second) average search time, so I think this method could be quite fast. The only thing that I see offhand that would bog it down badly is if people put a lot of not-pictures in their pictures directory tree forcing many new random file number reselections, and even that could be optimised quite a bit.

  3. mkj says:

    Can perl use fts(3) ? I suspect it might be more HFS-optimised than opendir()/readdir(). (Not that useful, I know)

    • furia_krucha says:

      fts(3) is a user-level function, so it uses the same underlying kernel api as readdir().

      • mkj says:

        Ah, I was thinking it might be using something smarter like searchfs(). But searchfs() itself seems to have the same problem as mdfind etc. bleh.

  4. evan says:

    Agree with the first comment. Used to use mbox, couldn't use it on OS X.

    But, here's a hack: I wonder if, rather than indexing the directory yourself, you could use the locate cache. Just parse the output of "locate /path/to/dir".

    • edouardp says:

      Well I was going to offer my own brilliant suggestion, but apparently "what he said above" seems to cover it. I use locate on the command line quite frequently to find stuff.

      locate /Users/jwz | egrep '(jpg|tif|png)$'

      Unfortunately it's not part of the spotlight database (it's the same program as the Linux & BSD one), and so the results are based on a nightly update, and not a live version of the filesystem.

      • taffer says:

        The nightly doesn't seem to go off if you've got the machine slept or whatever; I tend to run updatedb by hand when I think of it because the database is always getting stale on my iBook because it's almost never turned on in the middle of the night.

  5. edouardp says:

    Oh goody - I get to post a "works for me" on Jamie's blog!

    The command "mdfind -onlyin /Users/edouard image" seems to find every image file on my system, even ones I've copied across from my linux box with scp a few seconds earlier. Perhaps your spotlight database is borked? You could try a "sudo mdutil -E /" before you go to bed, and see if the metadata works properly in the morning?

    It's not what you would call instant on my ancient 867MHz Powerbook though - it takes approx 1 min 20 secs to find 22790 images in my home dir. Your result of 7 minutes on a much faster machine (in both CPU and disk speed) seems pretty bad - I guess Spotlight doesn't have a decent query optimiser...

    Have you considered having a faceless background helper app that runs the full search (file based or spotlight based) once on startup, then uses the spotlight file modification APIs to maintain a live list of the contents of the pictures dir (by tracking file creation and deletions). Then the screensaver, when it kicks off, could ask the helper app for a random picture, and have an instant reply. A friend of mine who has written a slideshow application does this to have an accurate list of files in a directory tree to show slides from, and it works well.

    • alecm says:

      I'm torn by your posting, because I really wanted to like Spotlight, but after playing with it for a couple of weeks, I decided that I don't.

      In the end I switched it off because I have multiple 250GB drives attached to my G5 iMac and really, really didn't need to use it to find stuff, not to mention having it spinning disks and burning CPU to index things that I'd never have to retrieve.

      I know this is just one opinion, but I would like not to have to re-enable Spotlight just to run a screensaver.

      • edouardp says:

        You're right - since users can turn off Spotlight, xscreensaver shouldn't rely on it being a valid source of data for image files to use.

        I think the correct approach therefore is mostly what I said above then - use find (or a hand-rolled equivalent) to make an initial list of files from /Users/foo/Pictures (or another user-supplied dir) and then use kqueue file modification code (e.g. http://developer.apple.com/samplecode/FileNotification/listing1.html) to maintain an up-to-date live filelist within that dir. The screensavers can then simply ask the background app maintaining that list for a random file. (I called that monitoring "spotlight" before, but it's a kernel feature that spotlight itself uses, so is not reliant on spotlight running to work.)

        Just a data point, but Spotlight on my Quad G5 at work takes just over a second to return the results of "mdfind -onlyin /Users/edouard image" (on the second run). Spotlight runs well on top-end machines, but is unacceptably slow on low-end machines (like my old powerbook at home).

    • jgreely says:

      This is more precise: "mdfind 'kMDItemContentTypeTree = public.image'". If you just use "image", you also get a bunch of text files containing the word "image", as well as every disk image and ISO file on your system.

      On my 1.25GHz G4 PowerBook, running this command on a mostly-full 80GB drive returns 16,762 image files in 2.4 seconds. It does need some additional filtering, because public.image includes Illustrator AI, Photoshop PSD, and various camera RAW formats you might not be interested in, as well as a few oddballs like "README.sgi" in the mkisofs distribution. :-)


    • taffer says:

      I just tried mdfind -onlyin ~/Pictures and it worked pretty well... 44 seconds for the first run, writing to iTerm (current CVS version). I ran it again and redirected to /dev/null and got 29 seconds. That's only a bit over 3800 images though. When I ran it the third time to pipe through wc -l, it took 1m18s. WTF, shouldn't that all be cached or something after two runs?

      I noticed that it returned a couple of MS Weird documents though, and all the thumbnails from iPhoto.

      For comparison, find ~/Pictures -type f takes under a second after it's already run once (sorry, blew the first run, but it was way less than 30 seconds), and adding grep to cut out the non-images isn't going to add an order of magnitude to the run time.

      So, I have no idea what mdfind is doing, or why it's bad at it.

  6. spike says:

    Disk Inventory X is (1) useful in and of itself, (2) open-source, and (3) good at reading buttloads of directory information lickety-split. On my 1.25GHz PowerBook G4 with its 4200 RPM IDE drive, DIX can "inventory" the sizes and types of every file in my /Developer/ tree (about 87,000 files) in about 20 seconds. According to the release notes, the directory scanning code was tuned with help from "Dave Payne from Apple" in mid-2004. So that's one possibility. It's written in ObjC. :/

    Alternatively, you could use find ... with a regexp matching all the file types.

    Alternatively alternatively, you could use locate | egrep ... which seems to be very very fast, but is only as up-to-date as the last update_db, which you could force, or just wait for...

    And now the three-year-old princess has awakened, and I am summoned to her bedroom chambers...


  7. edouardp says:

    (This post doesn't really help you, for which I apologise in advance.)

    I remember, back in the olden days of Mac OS 7 or 8, you could search for a filename on your Mac and the Finder would present you with the results in just a few short seconds. It could do this because every file's meta data (well, filename at least) was stored in the HFS catalog B-tree file at the start of the volume. Hence a search for a file with a particular name (or sub-string) involved a single pass through the catalog b-tree, and this was really fast.

    Somewhere along the way this ability seems to have been lost. The b-tree is still there, it's just no-one lets you search it any more.

    (As an aside, I never understood why, on Windows machines with NTFS drives, you also didn't have near instant searches for filenames. Esp. since on NTFS the equivalent of the catalog b-tree, the MFT, is actually a normal file itself in the file system, and you could simply open it and search the contents with a single pass. Instead a search takes 5 minutes and uses the Win32 API to open each directory's contents recursively, with the Win32 APIs in turn eventually looking up the files in the MFT after passing through the NTFS subsystem. It always seemed like a lot of unnecessary work to me...)

  8. rubeon says:

    using find seems to work pretty quickly, at least with the 10,000 or so images I have in my Pictures directory on my Powerbook. You could pipe that into your program.

    rubook:~ rube$ time find Pictures/ -iname '*.jpg' -or -iname '*.png' > peg.txt

    real 0m1.174s
    user 0m0.272s
    sys 0m0.317s
    rubook:~ rube$ time find Pictures/ -iname '*.jpg' -or -iname '*.png' > peg.txt

    real 0m0.665s
    user 0m0.271s
    sys 0m0.315s

    rubook:~ rube$ wc peg.txt
    9470 33670 639200 peg.txt

    • jwz says:

      Find does not do some secret magic that I'm not capable of (and already) doing manually from Perl. There's no backdoor to the file system that find() uses, it has to walk the directory tree and stat() just like everyone else.

      • strspn says:

        stat() shouldn't be necessary here, should it? Can't you trust m/\.(jpg|gif|jpeg|png)$/i and just read the directories?

      • gnu find will look at a directory's nlink count and try to avoid calling stat() for files in leaf directories, which is a huge speedup if you're only looking at filename. some filesystems also store filetype in the directory, which gets exposed by a syscall like getdirentries(), which means find can avoid stat() in non-leaf directories too.

        perl's File::Find module will do the nlink optimization, but not the filetype optimization. I don't have os/x handy, but when I looked at this on a linux system a year ago, using File::Find was still noticeably slower than opening a pipe to find.
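
A rough Python sketch of the nlink optimization described above (the function name is hypothetical, and this is not how find or File::Find are actually written). It assumes the traditional Unix convention that a directory's link count is 2 plus its number of subdirectories, and falls back to stat-ing everything when the count is below 2:

```python
import os
import stat

def list_files_nlink(root):
    """List non-directory entries under root, skipping lstat() once a
    directory's link count says no subdirectories are left to find."""
    files, pending = [], [root]
    while pending:
        directory = pending.pop()
        nlink = os.lstat(directory).st_nlink
        # Traditional rule: nlink == 2 + number of subdirectories.
        # A count below 2 means the filesystem does not follow that rule.
        subdirs_left = nlink - 2 if nlink >= 2 else None
        for name in os.listdir(directory):
            path = os.path.join(directory, name)
            if subdirs_left == 0:
                files.append(path)  # every subdirectory is already accounted for
                continue
            if stat.S_ISDIR(os.lstat(path).st_mode):
                pending.append(path)
                if subdirs_left is not None:
                    subdirs_left -= 1
            else:
                files.append(path)
    return files
```

In a leaf directory the count is 2, so none of its entries are stat-ed. On a filesystem that reports 2 for directories that do have subdirectories this sketch would misreport them as files, which is the situation GNU find's -noleaf option exists for.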

        • jwz says:

          I didn't know the nlink trick, that's nice. I can probably use that in the perl script.

          One other trick xscreensaver-getimage-file does is that it avoids getting stuck in symlink loops (by keeping a table of directory inode numbers it has already seen) but I guess I could just lose that and say "if you have looping symlinks, you're hosed."

          Doesn't look like perl exposes getdirentries().
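
The loop-avoidance trick can be sketched like this in Python (a stand-in with a hypothetical function name, not the script's actual Perl). One assumption beyond the description above: the table is keyed on the device number together with the inode number, since inode numbers alone can repeat across mounted volumes.

```python
import os
import stat

def walk_following_symlinks(root):
    """Return all non-directory paths under root, following symlinked
    directories but visiting each real directory only once."""
    seen, files, pending = set(), [], [root]
    while pending:
        directory = pending.pop()
        info = os.stat(directory)  # follows symlinks
        key = (info.st_dev, info.st_ino)
        if key in seen:
            continue  # already visited: a loop or a second link to the same place
        seen.add(key)
        for name in os.listdir(directory):
            path = os.path.join(directory, name)
            try:
                is_dir = stat.S_ISDIR(os.stat(path).st_mode)
            except OSError:
                continue  # dangling symlink or unreadable entry
            if is_dir:
                pending.append(path)
            else:
                files.append(path)
    return files
```

The cost is one stat per entry, which is exactly what the other optimizations in this thread are trying to avoid.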

          • using the nlink trick implies not following symlinks at all (symlinks don't affect the nlink count, so you won't notice them in leaf dirs without stat of every file).

            if you use the general syscall interface to do getdirentries(), you can follow symlinks and still detect loops cheaply, since you only need to stat the dirs and symlinks in the treewalk.

          • taffer says:

            Doesn't a symlink loop generally mean you're a bozo, or whoever maintains that part of the machine is a bozo? And it'll cause ELOOP eventually...

  9. cavorite says:

    Spotlight is so bad I have just disabled it completely for now. It was slow and pathetic as a search tool, and then when it started randomly eating up 85% of my CPU when I was not even using it, I had enough.

  10. kalephunk says:

    Just a thought -- I did a random sampling of images on my system and found that images that I'd *want* to find in a search for images all contained "public.image" as an attribute of kMDItemContentTypeTree, whereas "miscellaneous imagery" that I didn't care about (think: 32x32px program icons, buttons, arrows, etc) did not have that tag. It seems, from the files that have that tag versus those that don't, the tag is added when there's user interaction with the image. Files I've never looked at, not even in a Finder window, don't have that tag, whereas files in my ~/Pictures folder, even those I've never opened and grabbed with a wget or scp, have that tag.

    Of course, this sounds like Spotlight trying to be smarter than the user, but for your purposes, I'd imagine it'd work, no? For what it's worth, an mdfind "kMDItemContentTypeTree == 'public.image'" takes about a half second on my machine, finding 2800 images on my 30GB drive.

    Hope these observations help somehow.

  11. erorus says:

    I'd probably just run the perl script to get the candidate image filenames and cache the list to a tempfile on the first call.. then load from cache on every subsequent call while xscreensaver is running. I doubt there are many cases when the image folders are getting new files while the user isn't interacting with the desktop (although there could always be background imagegrabbing processes, I guess).

    Perhaps make it an option either to pull from cache on 2nd-nth call, or to pull from the directory tree every time?

  12. mark242 says:

    That timing seems incredibly broken, as running a query of *.mp3 against my music library takes a few milliseconds to complete on a slow disk against a huge iTunes directory.

    Maybe the indexing on your drive got hosed in some way. Try "mdimport -f /Users/jwz/Pictures" and see if the times go up.

    • jwz says:

      Apparently it did (updated above), but even after re-indexing it's still incredibly slow. Today's lesson: Spotlight is highly unreliable in several different ways.

  13. ichae says:

    I'm trying to go through my image collection and make sure all the meta-data is right. I thought I would start with something easy by starting with the orientation of the image - so I could, for example, search for only portrait oriented images.

    The problem is, half of the images in portrait format were at some point rotated to the correct orientation (losslessly, I *think*), but don't have the EXIF tag set. Spotlight refuses to set kMDItemOrientation to 1 on these even if I set the EXIF flag. On the other hand, if I run the file through jpegtran to rotate them back, Spotlight doesn't even give them a kMDItemOrientation tag *at all*!

    And yet, Spotlight recognizes *some* of the already rotated images as being portrait. I know they don't have the EXIF tag set because Finder shows EXIF-tag-only rotated files *not* rotated. I really wish there was an easy way to force this specific piece of meta-data, but I guess I'll just have to putz around some more trying to figure out what Spotlight likes...

    • babysimon says:

      > The problem is, half of the images in portrait format were at some point rotated to the correct orientation (losslessly, I *think*), but don't have the EXIF tag set. Spotlight refuses to set kMDItemOrientation to 1 on these even if I set the EXIF flag. On the other hand, if I run the file through jpegtran to rotate them back, Spotlight doesn't even give them a kMDItemOrientation tag *at all*!

      Don't use jpegtran if you care about keeping the EXIF data; it will silently remove it. Use exiftran instead (with the added bonus it can rotate the data based on the EXIF orientation).

      I'm just bitter because I lost three years worth of EXIF data on all my portrait images...

      • ichae says:

        hmm odd... maybe they have fixed that deficiency? I did a test run first on a copy of an image, and then manually compared the EXIF headers. They looked identical...