dear XCopyArea, please stop exploring your frame buffer, it freaks me out.

Dear Lazyweb,

Here's today's unanswerable Mac programming question:

When I'm copying bits from an image to a window, Shark seems to show that I'm spending more than half of my time doing colorspace conversions:

    29.9%   CGContextDrawImage
    29.9%     CGContextDelegateDrawImage
    29.7%     ripc_DrawImage
    20.6%       ripc_AcquireImage
    20.0%         CGSImageDataLockWithReference
    19.8%           img_data_lock
    16.5%             img_colormatch_read
    14.6%               CGColorTransformConvertData
    14.4%                 CWMatchBitmap

My understanding is that this should only happen if my image and context did not have the same colorspace; if the colorspaces were already the same, then this should turn into basically a memmove(), which is what I want. (And what Apple's X11 server appears to accomplish somehow.)

In the case I'm looking at, I'm trying to copy a rectangle from a window, back onto itself, without scaling. Say, 50x50@10,10 to 50x50@200,200. I'm getting the bits off of the window with:

    NSBitmapImageRep *bm = [NSBitmapImageRep alloc];
    [bm initWithFocusedViewRect:rect];
    CGDataProviderRef prov = CGDataProviderCreateWithData (...);
    CGImageRef cgi = CGImageCreate (...);

The colorspace I'm using when creating that image is the default one of this window's CGDirectDisplayID, so it should match:

    CMGetProfileByAVID ((CMDisplayIDType) cgdpy, &profile);
    colorspace = CGColorSpaceCreateWithPlatformColorSpace (profile);

Then I draw the bits back onto the window with CGContextDrawImage(). CGImageGetColorSpace() says the image I'm drawing has the color space I expect. So how do I tell why I'm getting a colorspace mismatch?
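
About the best diagnostic I can think of is comparing the colorspace objects directly; CGColorSpaceRef is a CF type, so CFEqual() ought to work on it. An untested sketch, using the cgi and colorspace variables from the snippets above:

    /* See which spaces actually compare equal.  If the image's space
       doesn't match DeviceRGB, maybe the window's context (whose
       colorspace I can't find any way to query) is the odd one out. */
    CGColorSpaceRef img_cs = CGImageGetColorSpace (cgi);
    CGColorSpaceRef dev_cs = CGColorSpaceCreateDeviceRGB ();
    fprintf (stderr, "img==display: %d, img==device: %d\n",
             (int) CFEqual (img_cs, colorspace),
             (int) CFEqual (img_cs, dev_cs));
    CGColorSpaceRelease (dev_cs);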

I also hate that this process involves copying the image data at least twice, but I don't see any way around that. But the main problem here seems to be that not only is it copying it, it's bit-twiddling it too.

Reading the bits off the window is also slow, but according to Shark, it doesn't seem to be calling any obvious "convert" routines; looks like it spends all of its time directly inside of _NSReadImage:

    30.1%   -[NSBitmapImageRep initWithFocusedViewRect:]
    28.5%     _NSReadImage
     3.4%       _NSImageRealloc
     0.4%       CGSLockWindowRectBits
     0.4%       _NSImageMalloc
     0.4%       CGSUnlockWindowBits
     1.3%     -[NSBitmapImageRep initWithBitmapDataPlanes:...]

The thing is, I'm just implementing the X11 routine XCopyArea() here. When running an actual X11 program against Apple's X11 server, XCopyArea() is fast, so I know that it's possible to move bits around in the frame buffer quickly, in a 2D-graphics context. I just can't see how. I'm guessing that the real X11 server is not going in via NSBitmapImageRep and CGContextDrawImage(), but I have yet to find any lower-level API that will give me more direct access to an NSView's backing store.

15 Responses:

  1. ahruman says:

    My guess is that the real X11 server isn't written in Cocoa. Carbon can give more direct access to the contents of a whole window, although that way of doing things is deprecated. There are probably internal CoreGraphics/WindowServer routines it could be using... hey, isn't it open source? Yes it is.

    It's possible that your first problem could be solved by a more direct way of expressing the colour space identity. Dunno what offhand, though; my programming-type stuff isn't on this computer.

  2. mayoff says:

    Have you considered using glCopyPixels? I know you'll have to do some extra work to set up GL, but at least it should be hardware-accelerated. This page has sample code that might be helpful. It uses glReadPixels to create a screen-grab.
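
    For the copy-within-the-window case described in the post, the call would be something like this (untested; note that GL's origin is the bottom-left corner, and this assumes a projection that maps GL coordinates 1:1 to window pixels):

        /* copy 50x50 @ 10,10 to 50x50 @ 200,200 */
        glRasterPos2i (200, 200);                 /* destination corner */
        glCopyPixels (10, 10, 50, 50, GL_COLOR);  /* source rectangle   */
        glFlush ();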

    • jwz says:

      That would just trade one performance disaster for another. Trying to implement Pixmaps and XImages in terms of OpenGL textures would be fucking nasty, and probably involve falling back to software GL rendering all the time anyway. E.g., you want to draw an arc into a Pixmap, then copy that Pixmap to the screen a few times. To do that kind of thing in GL, you have to render to an offscreen viewport (which I'm pretty sure on most systems will use software rendering) then copy the viewport to a texture (sending the whole bitmap down the graphics pipeline again). Finally, you get to use GL to actually put that texture on the screen, but getting there was very expensive.
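
      (To make the cost concrete, here's an untested sketch of that round trip; tex is assumed to be a texture already defined with glTexImage2D, w and h its size, and both contexts and an orthographic projection are assumed to be set up already:)

          /* 1: render the arc into the offscreen (pbuffer) context. */
          /* ... */
          /* 2: copy the offscreen framebuffer back into the texture,
             i.e. send the whole bitmap down the pipeline again. */
          glBindTexture (GL_TEXTURE_2D, tex);
          glCopyTexSubImage2D (GL_TEXTURE_2D, 0, 0, 0, 0, 0, w, h);
          /* 3: make the window context current, and finally draw. */
          glEnable (GL_TEXTURE_2D);
          glBegin (GL_QUADS);
          glTexCoord2f (0, 0); glVertex2i (0, 0);
          glTexCoord2f (1, 0); glVertex2i (w, 0);
          glTexCoord2f (1, 1); glVertex2i (w, h);
          glTexCoord2f (0, 1); glVertex2i (0, h);
          glEnd ();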

      • > To do that kind of thing in GL, you have to render to an offscreen viewport (which I'm pretty sure on most systems will use software rendering)

        But not on a Mac, since that is what they use for Quartz.
        Rendering simple shapes to offscreen viewports is apparently hardware-optimized; otherwise they couldn't do (for example) the cube-like user-switch effect.

      • strangehours says:

        Cross platform texture render targets are supported by the framebuffer object extension in OGL 2.0. Before that, pbuffers provide a cross(ish)-platform solution, and OS specific extensions exist on OSX and Windows.
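
        A rough sketch of a texture render target via EXT_framebuffer_object, with names as in the extension spec (w and h assumed; error checking omitted):

            GLuint fbo, tex;
            glGenTextures (1, &tex);
            glBindTexture (GL_TEXTURE_2D, tex);
            glTexParameteri (GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_LINEAR);
            glTexImage2D (GL_TEXTURE_2D, 0, GL_RGBA8, w, h, 0,
                          GL_RGBA, GL_UNSIGNED_BYTE, NULL);
            glGenFramebuffersEXT (1, &fbo);
            glBindFramebufferEXT (GL_FRAMEBUFFER_EXT, fbo);
            glFramebufferTexture2DEXT (GL_FRAMEBUFFER_EXT,
                                       GL_COLOR_ATTACHMENT0_EXT,
                                       GL_TEXTURE_2D, tex, 0);
            /* drawing now lands in tex; bind FBO 0 to get the window back */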

        If you only care about supporting OSX for this, for the X11 wrapper, then it would be very easy to produce a hardware-accelerated implementation of Pixmaps (I guess XImages would be harder, as they're present in the client address space, though there's another Apple-specific extension for that too, I think).

        Having said all that, I suspect going down that path would end up gradually sucking you into the OpenGL vortex to the point where you end up redoing most, if not all, of what you've done so far.

        By the way, is there any reason why all the savers seem to be capped at 15 FPS on my laptop?

        • jwz says:

          As far as I can tell, -[ScreenSaverView setAnimationTimeInterval:] is a no-op, so everybody gets whatever frame rate ScreenSaverView has hardcoded into it.

  3. chanson says:

    I don't have an answer to your question, but I do have a strong suggestion about Cocoa programming: Always nest invocations of +alloc and -init methods. The reason is that -init methods can actually return an object other than the one that was allocated.

    The reason for this is that many classes are actually implemented as class clusters, where one class provides a common interface and a number of private subclasses may be used depending on what you're trying to accomplish. (For example, you could get back a different type of NSMutableDictionary from -[NSMutableDictionary initWithCapacity:] depending on the capacity passed.)
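
    For instance, the snippet from the post becomes:

        /* -initWithFocusedViewRect: may return a different object than
           +alloc did, so keep its return value rather than the alloc'd
           pointer. */
        NSBitmapImageRep *bm =
            [[NSBitmapImageRep alloc] initWithFocusedViewRect:rect];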

    I doubt this will have any effect on your performance issue, but it's important for correctness in Cocoa development.

  4. ajaxxx says:

    Wow, the Darwin DDX is nasty.

    So, the rootless code for the nested X server has some acceleration hooks for solid fill, blit (XCopyArea), and alpha blend. The function used for accelerating copies is xp_copy_bytes(), which is not actually defined in the X source but is loaded from a plugin, /usr/lib/libXplugin.dylib. Which, you guessed it, doesn't have source available.

    Running nm over that file gives, among other things:

                 U _CGBlt_copyBytes
                 U _CGBlt_fillBytes
                 U _CGSColorMaskCopyARGB8888
                 U _CGSConvertXRGB8888toARGB8888
                 u __xp_log
        8a83f03c D _xp_composite_area_threshold
        8a832af4 T _xp_composite_pixels
        8a832978 T _xp_copy_bytes
        8a83f038 D _xp_copy_bytes_threshold
        8a832a50 T _xp_fill_bytes
        8a83f034 D _xp_fill_bytes_threshold
                 u dyld_stub_binding_helper

    So clearly CGBlt_copyBytes() is what you want. Too bad CoreGraphics.framework doesn't have a declaration for it in any of the headers.

    If you're willing to accept a dependency on having libXplugin installed (don't remember if it's in the default profile or not) you might be able to use xp_copy_bytes(), which doesn't seem to have any dependency on either X or CG:

        extern void xp_copy_bytes (unsigned int width, unsigned int height,
                                   const void *src, unsigned int src_rowbytes,
                                   void *dst, unsigned int dst_rowbytes);

    It's probably a fair bet that CGBlt_copyBytes has a similar function signature.
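
    (Untested idea: you could soften the dependency by looking the symbol up at runtime with dlopen()/dlsym(), so nothing breaks at link time when libXplugin is missing:)

        #include <dlfcn.h>

        typedef void (*xp_copy_bytes_fn) (unsigned int width,
                                          unsigned int height,
                                          const void *src,
                                          unsigned int src_rowbytes,
                                          void *dst,
                                          unsigned int dst_rowbytes);

        static xp_copy_bytes_fn
        get_xp_copy_bytes (void)
        {
          void *lib = dlopen ("/usr/lib/libXplugin.dylib", RTLD_LAZY);
          /* NULL if the library or the symbol isn't there. */
          return lib ? (xp_copy_bytes_fn) dlsym (lib, "xp_copy_bytes") : NULL;
        }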

    • jwz says:

      Hmm. What's going on here then: Xserver/hw/darwin/quartz_1.3/XView.m. At the end of the file, it looks like X windows are rendered by doing exactly what I'm doing: CGImageCreate(), CGContextDrawImage(). Interestingly, they're using CGColorSpaceCreateDeviceRGB() instead of the display's colorspace.

      My guess is that xp_copy_bytes() is an accelerated implementation of memmove(), and not something that actually knows about frame buffers or backing stores.

      • ajaxxx says:

        The Imakefile would seem to indicate that the quartz_1.3 module is old, and that the plain quartz module is the one that gets used nowadays.

        The quartz_1.3 module appears to only use that copyToScreen method in a push sense; it implements CopyArea as just surface prep, call down the GC chain, and mark damage for flushing, and -copyToScreen looks like it only gets called from dispatch paths other than CopyArea. If it's storing all the X image data in host memory and just shoving it across the bus periodically, that'd still be reasonably performant, but not necessarily hardware accelerated.

        The rootless accel code only shows up in the quartz module's xpr backend, and that's the one using the Xplugin stuff. There appear to be two other backends, cr for plain Cocoa rootless mode (which is nearly identical to the quartz_1.3 code), and fullscreen for, well. It looks like the only way they were able to make it go fast, was to cheat. But hey, xpr's MIT-licensed, go nuts.

        It might well be that using CGColorSpaceCreateDeviceRGB() does the right thing and avoids the conversion you're hitting; I have exactly zero clue there.

        • jwz says:

          Looks like CRStartDrawing() in this code in the "cr" directory writes directly into the window's backing store via GetPortPixMap([nsview qdPort]), which sounds like it might be worth trying... except that's a QuickDraw routine, and isn't QuickDraw going to become unsupported some day soon? (Or is it Carbon? I can't tell QuickDraw and Carbon apart. Anyway, aren't they both marked for death?)
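
          If I'm reading it right, the guts of it amount to something like this (untested; -qdPort is whatever that XView class uses to get at its Carbon port, and all of these are QuickDraw calls):

              CGrafPtr port = [nsview qdPort];
              if (LockPortBits (port) == noErr) {
                  PixMapHandle pm = GetPortPixMap (port);
                  char *base     = GetPixBaseAddr (pm);   /* backing store */
                  long rowbytes  = GetPixRowBytes (pm);
                  /* ... move pixels around in base directly ... */
                  UnlockPortBits (port);
              }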

          Which of these N backends is the one that the X server I have is actually running?

          • ajaxxx says:

            Good question. I tried to examine the output of defaults to see if there was a pref, but didn't find one, even though -loadDisplayBundle would indicate that it can be set somehow. Neither would gdb let me set a breakpoint on QuartzModeBundleInit; maybe it's only broken on 10.3. Gross hack: attach Sampler to the X server, run x11perf, and grovel through the call frames until distinguishing calls are found. I found a few hits for xp_unlock_window, which is only ever called from xpr.

          • legolas says:

            You seem to be right about QuickDraw; this says (under 'QuickDraw Reference'): "Describes the C API for the legacy two-dimensional drawing engine in Mac OS."

          • legolas says:

            I guess you may have seen this already? (Especially the bit 'Flushing to the Window Buffer', but if I read it correctly that talks about QuickDraw only again?)

          • ahruman says:

            Bit late, but... Carbon is not deprecated. QuickDraw is, but will be around for yonks. QuickDraw has direct access to the window's back buffer (I unfortunately referred to this less specifically as Carbon in my previous post), which is almost certainly implemented as access to a rectangular texture in AGP-accessible memory with the Apple-proprietary "client storage" attribute set. This is likely to be the fastest way to draw straight to a window's back buffer... although in full-screen mode, using your own back buffer and copying it with OpenGL (using a new texture with client storage, or glCopySubPixels) is pretty certain to be faster. It also gives you free scaling.