
The breakthrough is a different type of sensor that captures what are known as light fields -- basically, all the light that is moving in all directions in the view of the camera. That offers several advantages over traditional photography, the most revolutionary of which is that photos no longer need to be focused before they are taken.
Lytro's camera works by positioning an array of tiny lenses between the main lens and the image sensor, with the microlenses measuring both the total amount of light coming in as well as its direction.
There are some neat demo photos on their site, where clicking in the Flash app moves the focus point; they show you the theoretical end result, but I don't understand how this works at all. Their "simple explanation" just says,
"Recording light fields requires an innovative, entirely new kind of sensor called a light field sensor. The light field sensor captures the color, intensity and vector direction of the rays of light. This directional information is completely lost with traditional camera sensors, which simply add up all the light rays and record them as a single amount of light."
which I'm pretty sure is the same as saying:

Their paper is mostly about how you simulate N different cameras once you have the light-field info, but I still don't understand how you de-focus light that has already passed through a lens (and there is a lens in front of this array of tiny lenses) or how having one lens per pixel gets you the direction of the ray. Because throwing away that information is what lenses are for.
It's just a commercialization of this stuff: http://graphics.stanford.edu/papers/lfcamera/
For a detailed explanation, this is pretty good: http://graphics.stanford.edu/papers/lfcamera/lfcamera-150dpi.pdf
Any idea how much the commercial implementations are going to cost?
Perhaps by determining the polarization of the incoming light they can determine the distance; somehow I think the "focusing" part will take some time even with an i7. It's like seeing 3D with one eye. The lens is probably just set to capture the maximum amount of light.
Which is why you'd offload most of the heavy-lifting to a GPU.
The first thing to note is that you don't actually have one camera; you effectively have many cameras in a single package. Each mini lens directs light towards an exclusive set of pixels on the image sensor (that is, no other mini lens shines on them).
The light is not necessarily focused per se. An optical lens performs a mathematical operation: the Fourier Transform. So the light on each mini lens is not collected in a focused state, but the FFT can be applied digitally after the light is captured to create a focused image. IIRC, image resolution is proportional to the size of your lens array.
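If memory serves, the precise version of that claim is the Fourier slice photography result (the thesis linked further down in the thread is the authority here, so treat this as a from-memory sketch): a photograph refocused at relative depth alpha is, up to scaling, the inverse 2D Fourier transform of a 2D slice of the 4D Fourier transform of the light field,

    \mathcal{P}_\alpha[L] \;\propto\; \mathcal{F}_2^{-1}\!\Big[\, \mathcal{S}_\alpha\big[\mathcal{F}_4[L]\big] \Big]

where \mathcal{F}_4 is the 4D Fourier transform of the light field L, \mathcal{S}_\alpha extracts a 2D plane whose orientation depends on the chosen focus depth \alpha, and \mathcal{F}_2^{-1} is the inverse 2D transform. So "focus digitally with an FFT" is roughly right, but it's slicing in the 4D Fourier domain, not undoing the main lens.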
I forget how you place the lenses to collect the proper rays, but I'm pretty sure there are some academic papers on it if you google around.
Basically, there are more sensor pixels than microlenses. (In current sensors there's a microlens per pixel, to focus all of the incoming rays onto the pixel.) In this one, the "in focus" pixel receives information and the surrounding "out of focus" pixels ALSO receive information. And then these out-of-focus pixels can be processed, because you can guess the path a particular ray took through the system based on what the various pixels saw.
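To make that concrete, here's a rough numpy sketch of the bookkeeping, with made-up numbers (an 8x8 patch of pixels behind each microlens) rather than Lytro's actual layout: the raw capture gets re-binned into a 4D array indexed by which microlens a ray hit (position) and which pixel under that microlens it landed on (direction).

    import numpy as np

    # Toy raw sensor: 4000 x 4000 pixels, with an 8x8 patch of pixels
    # sitting behind every microlens (all numbers are made up).
    PIX_PER_LENS = 8
    raw = np.random.rand(4000, 4000)       # stand-in for a raw capture

    n_s = raw.shape[0] // PIX_PER_LENS     # microlens rows (spatial)
    n_t = raw.shape[1] // PIX_PER_LENS     # microlens cols (spatial)

    # Re-bin into a 4D light field L[s, t, u, v]:
    #   (s, t) = which microlens the ray landed on  -> position
    #   (u, v) = which pixel under that microlens   -> direction
    #            (i.e. which part of the main lens it came through)
    L = (raw.reshape(n_s, PIX_PER_LENS, n_t, PIX_PER_LENS)
            .transpose(0, 2, 1, 3))

    print(L.shape)   # (500, 500, 8, 8): 500x500 spatial samples, 64 directions each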
Basically, this: http://web.media.mit.edu/~raskar/Mask/
Figure 1 of that paper is a good summary. Instead of all the beams converging on the sensor, they converge at the microlens array, which spreads them over many sensor pixels.
The hand-wavy semi-information-theoretic explanation is that this camera isn't capturing any more information than a standard camera, but the type of data it's capturing is different. There's a trade-off between image resolution and the amount of light field information you can capture given a particular sensor resolution. Their main argument is that 5-10 megapixels is absurd for most people and could be put to better use by capturing light field information.
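To put toy numbers on that trade-off (illustrative only, not Lytro's actual specs):

    # Illustrative resolution trade-off for a light field sensor
    sensor_pixels = 10_000_000       # a "10 MP" sensor
    pixels_per_microlens = 10 * 10   # 10x10 directional samples per microlens

    spatial_pixels = sensor_pixels // pixels_per_microlens
    print(spatial_pixels)            # 100000 -> only ~0.1 MP of spatial resolution,
                                     # but 100 directional samples at every point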
Your hand-wavy explanation there isn't strictly correct: a regular camera does throw away information. It throws away the (relative) phase information that is present in the wavefront. That is, even if adjacent pixels are near the diffraction limit of the lens train, the pixels capture only the intensity. Any difference in phase—which corresponds in the ray/particle model to the angle of incidence—is thrown away.
In a more general sense, you're right. The camera is using pixels for something different.
Adobe's demo last year gives a good overview of how this works.
https://www.youtube.com/watch?v=-EI75wPL0nU&feature=related
Lenses don't "throw vector information away"--they transform it. Most lenses pick a transform that produces planar images some distance behind the lens intended for a flat sensor that can't use vector information--but that's certainly not the only useful function possible, especially if you can do math on the output of the sensor. It reminds me of insect eyes, which don't bother with focus at all.
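One concrete equation behind "transform, not throw away": in the paraxial picture, a thin lens of focal length f maps a ray's height y and angle \theta as

    \begin{pmatrix} y' \\ \theta' \end{pmatrix}
    =
    \begin{pmatrix} 1 & 0 \\ -1/f & 1 \end{pmatrix}
    \begin{pmatrix} y \\ \theta \end{pmatrix}

The matrix has determinant 1, so the map is invertible: the directional information is re-encoded, not destroyed. It only disappears when a flat sensor integrates over all angles arriving at each pixel, which is exactly the step the microlens array sidesteps.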
This thing does work but it will take a whole lot more pixels to make it reasonable.
The PhD thesis has all the details: http://www.lytro.com/renng-thesis.pdf
The basic idea is that, in a traditional camera, when the lens is properly focused on an object, all the light that hits the lens from a single point on that object is aimed by the lens so it hits a single pixel on the sensor (so that that point is in focus).
Now place a microlens array where the sensor was (and move the sensor slightly back). The light converging from the lens onto that single pixel now converges onto a single microlens--but that light is still coming from different directions. The microlens now spreads that light out to diverge again. But since the sensor is very close, the microlens makes that light diverge into a very small area.
Of course, if the object is in focus, all the light that you just spread out is colored the same since it came from the same point on the object. But if it wasn't in focus, you'd have light coming from different things converging on this microlens.
Effectively, you have a whole bunch of microlenses "seeing" the scene from slightly different positions; the microlenses are the effective pixels, and the data spread from each microlens records the difference in lighting from different directions.
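For the curious, here's a crude numpy sketch of the shift-and-add refocusing that falls out of this picture (textbook version, not Lytro's actual pipeline; L is a 4D array L[s, t, u, v] like the one re-binned in the earlier sketch, and the shift parameter moves the synthetic focal plane):

    import numpy as np

    def refocus(L, shift_per_step):
        """Shift-and-add refocusing over a 4D light field L[s, t, u, v].

        Each (u, v) slice is the scene seen through one part of the main
        lens; shifting those views relative to each other before summing
        moves the synthetic focal plane.  shift_per_step = 0 reproduces
        the image focused at the microlens plane.  Integer shifts and
        wrap-around at the borders are crude simplifications.
        """
        n_s, n_t, n_u, n_v = L.shape
        out = np.zeros((n_s, n_t))
        for u in range(n_u):
            for v in range(n_v):
                du = int(round(shift_per_step * (u - n_u // 2)))
                dv = int(round(shift_per_step * (v - n_v // 2)))
                out += np.roll(L[:, :, u, v], (du, dv), axis=(0, 1))
        return out / (n_u * n_v)

    # e.g. a small focal stack: [refocus(L, s) for s in (-2, -1, 0, 1, 2)]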
And Jef is right, we need higher resolutions to get a high "microlens resolution" while still sampling directions reasonably. The dissertation suggests 100MP as a useful and achievable number.
100MP was my offhand guess too.
Outside of pro-photographers looking for new tricks, what problem is this solving?
Modern auto-focus works pretty well, and pocket cameras usually have relatively slow (high f-stop) lenses with plenty of depth of field. Most photos are ruined by camera shake, not bad focus.
Problems solved = modern auto-focus works for shit under anything less than standing-on-the-surface-of-the-sun conditions, and auto-focusing takes perceptible time and chooses the wrong target anyway.
The dissertation this derives from also shows an algorithm for computing a deep-focused image which uses all the photons gathered by the (wider aperture) lens, so you can take a deep-focused image with a wider aperture and thus a shorter shutter time and thus less camera shake. (Note that this algorithm is non-physical; it doesn't simulate something an actual camera can do. There are also some other non-photographic tricks, such as making two planes in focus when photographing two people at different depths. It doesn't mention tilt-shift photography, but simulating that is possible as well.)
But yes, to some extent the dissertation does come across as "here's a neat thing derived from computer graphics that we can apply to photography" and less like "oh my god focusing is such a problem we need a solution to that stat".
I wonder if depth information can be recovered from the data this camera captures.
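At least coarsely, yes, and it ties back to the deep-focus point above: render a focal stack with something like the shift-and-add sketch earlier in the thread, then for each pixel find the refocus setting that makes its neighbourhood sharpest. That setting is a rough depth proxy, and taking each pixel from its sharpest slice gives an all-in-focus image. A toy sketch (assumes the refocus() helper from the earlier sketch; this is generic focus stacking, not the dissertation's algorithm):

    import numpy as np
    from scipy.ndimage import uniform_filter

    def depth_and_all_in_focus(L, shifts):
        """Toy depth proxy + all-in-focus image from a refocus stack.

        Uses the refocus() helper sketched earlier in the thread.  For
        each pixel, the refocus shift that maximizes local contrast (a
        cheap sharpness measure) is taken as a depth proxy, and the
        pixel value from that sharpest slice builds an all-in-focus image.
        """
        stack = np.stack([refocus(L, s) for s in shifts])   # (n_shifts, H, W)
        # local contrast = local mean of squares - square of local mean
        sharp = np.stack([uniform_filter(im**2, 5) - uniform_filter(im, 5)**2
                          for im in stack])
        best = np.argmax(sharp, axis=0)                     # per-pixel sharpest slice
        depth_proxy = np.asarray(shifts)[best]
        all_in_focus = np.take_along_axis(stack, best[None], axis=0)[0]
        return depth_proxy, all_in_focus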
I understand at a high level how the plenoptic lens concept operates ... and I know that the technology is really real ... but the fact that this is now actually feasible ... makes me want to say ...
ZOOM. ENHANCE. ENHANCE. ENHANCE.
https://www.youtube.com/watch?v=Vxq9yj2pVWk
I've just thought about it. This is even better than depth information. This could be the basis for something even better than normal mapping. Imagine 3D realtime-rendered textures with volumetric effects! Think about it. You don't even need a fancy camera: render the volumetric texture in a simulated version of the camera, and raytrace through your polygon to the prerendered image.
This is gold. I want to try it.
Feathers, fur, hahah.
Oh, and I've even figured out how to pull off the effect in a painstaking manual process with a normal camera and multiple photographs, a la HDR photography.
I saw this camera a few years ago in person. It works. (One of their initial funders was a classmate of mine, so I got an intro - I love computational photography stuff.)
A lens is basically a computational structure. It's a mathematical transform from 3d to 2d. Most modern lenses are very similar to pinhole cameras, effectively. However, you can do less of the computation in glass and more of it in software.
A light field camera captures a bunch of rays, not just the focused plane. You use software later to actually do the focusing. The new lens tech (which is not the same as traditional lenses) plus the microarray (to capture the angle the rays came through the lens) plus some software are computationally equivalent to a traditional lens.
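To make the "many cameras in one package" point concrete: with the toy 4D layout L[s, t, u, v] used in the sketches above, fixing (u, v) and letting (s, t) vary gives the scene as seen through one small patch of the main lens, and neighbouring (u, v) choices give slightly shifted vantage points, which is the parallax the refocusing software exploits.

    # Toy sub-aperture views from the 4D array L[s, t, u, v] above
    view_center = L[:, :, 4, 4]   # scene seen through the middle of the main lens
    view_edge   = L[:, :, 0, 0]   # scene seen through one edge of the main lens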
Other computational photography methods are similarly interesting. I like this example of coded apertures: http://groups.csail.mit.edu/graphics/CodedAperture/
Hope this makes more sense.
The downside of this tech, btw, is that it throws away a lot of resolution in order to capture the angle of the incoming rays.