I was going to start this post off with a discussion of how we all benefit from sharing information, but I just got an email from John Carmack that nicely sums up what I was going to say, so I’m going to go with that instead:
Subject: What a wonderful world…
Just for fun, I was considering writing a high performance line drawing routine for the old Apple //c that Anna got me for Christmas. I could do a pretty good one off the top of my head, but I figured a little literature review would also be interesting. Your old series of articles comes up quickly, and they were fun to look through again. I had forgotten about the run-length slice optimization.
What struck me was this paragraph:
First off, I have a confession to make: I’m not sure that the algorithm I’ll discuss is actually, precisely Bresenham’s run-length slice algorithm. It’s been a long time since I read about this algorithm; in the intervening years, I’ve misplaced Bresenham’s article, and have been unable to unearth it. As a result, I had to derive the algorithm from scratch, which was admittedly more fun than reading about it, and also ensured that I understood it inside and out. The upshot is that what I discuss may or may not be Bresenham’s run-length slice algorithm—but it surely is fast.
The notion of misplacing a paper and being unable to unearth it again seems like a message from another world from today’s perspective. While some people might take the negative view that people no longer figure things out from scratch for themselves, I consider it completely obvious that having large fractions of the sum total of human knowledge at your fingertips within seconds is one of the greatest things to ever happen to humanity.
Hooray for today!
But what’s in it for me?
Hooray for today indeed – as I’ve written elsewhere (for example, the last section of this), there’s huge value to shared knowledge. However, it takes time to write something up and post it, and especially to answer questions. So while we’re far better off overall from sharing information, it seems like any one of us would be better off not posting, but rather just consuming what others have shared.
This appears to be a classic example of the Prisoner’s Dilemma. It’s not, though, because there are generally large, although indirect and unpredictable, personal benefits. There’s no telling when they’ll kick in or what form they’ll take, but make no mistake, they’re very real.
For example, consider how the articles I wrote over a ten-year stretch – late at night after everyone had gone to sleep – opened up virtually all the interesting opportunities I’ve had over the last twenty years.
In 1992, I was writing graphics software for a small company, and getting the sense that it was time to move on. I had spent my entire career to that point working at similar small companies, doing work that was often interesting but that was never going to change the world. It’s easy to see how I could have spent my entire career moving from one such job to another, making a decent living but never being in the middle of making the future happen.
However, in the early 80’s, Dan Illowsky, publisher of my PC games, had wanted to co-write some articles as a form of free advertising. There was nothing particularly special about the articles we wrote, but I learned a lot from doing them, not least that I could get what I wrote published.
Then, in the mid-80’s, I came across an article entitled “Optimizing for Speed” in Programmer’s Journal, a short piece about speeding up bit-doubling on the 8088 by careful cycle counting. I knew from optimization work I’d done on game code that cycle counts weren’t the key on the 8088; memory accesses, which took four cycles per byte, limited almost everything, especially instruction fetching. On a whim, I wrote an article explaining this and sent it off to PJ, which eventually published it, and that led to a regular column in PJ. By the time I started looking around for a new job in 1992, I had stuff appearing in several magazines on a regular basis.
One of those articles was the first preview of Turbo C. Borland had accidentally sent PJ the ad copy for Turbo C before it was announced, and when pressed agreed to let PJ have an advance peek. The regular C columnist couldn’t make it, so as the only other PJ regular within driving distance, I drove over the Santa Cruz Mountains on zero notice one rainy night and talked with VP Brad Silverberg, then wrote up a somewhat breathless (I wanted that development environment) but essentially correct (it really did turn out to be that good) article.
By 1992, Brad had moved on to become VP of Windows at Microsoft, and when I sent him mail looking for work, he referred me to the Windows NT team, where I ended up doing some of the most challenging and satisfying work of my career. Had I not done the Turbo C article, I wouldn’t have known Brad, and might never have had the opportunity to work on NT. (Or I might have; Dave Miller, who I worked with at Video Seven, referred me to Jeff Newman, who pointed me to the NT team as well – writing isn’t the only way opportunity knocks!)
I was initially a contractor on the NT team, and I floundered at first, because I had no experience with working on a big project. I would likely have been canned after a few weeks, were it not for Mike Harrington, who had read some of my articles and thought it was worth helping me out. Mike got me set up on the network, showed me around the development tools, and took me out for dinner, giving me a much-needed break in the middle of a string of 16-hour workdays.
After a few years at Microsoft, I went to work at Id, an opportunity that opened up because John Carmack had read my PJ articles when he was learning about programming the PC. And a few years later, Mike Harrington would co-found Valve, licensing the Quake source code from Id, where I would be working at the time – and where I would help Valve get the license – and thirteen years after that, I would go to work at Valve.
If you follow the thread from the mid-80’s on, two things are clear: 1) it was impossible to tell where writing would lead, and 2) writing opened up some remarkable opportunities over time.
It’s been my observation that both of these points are true in general, not just in my case. The results from sharing information are not at all deterministic, and the timeframe can be long, but generally possibilities open up that would never have been available otherwise. So from a purely selfish perspective, sharing information is one of the best investments you can make.
The unpredictable but real benefits of sharing information are part of why I write this blog. It has brought me into contact with many people who are well worth knowing, both to learn from and to work with; for example, I recently helped Pravin Bhat, who emailed me after reading a blog post and now works at Valve, optimize some very clever tracking code that I hope to talk about one of these days. If you’re interested in AR and VR – or if you’re interested in making video games, or Linux, or hardware, or just think Valve sounds like a great place to work (and you should) – take a look at the Valve Handbook. If, after reading the Handbook, you think you fit the Valve template and Valve fits you, check out Valve’s job openings or send me a resume. We’re interested in both software and hardware – mechanical engineers are particularly interesting right now, but Valve doesn’t hire for specific projects or roles, so I’m happy to consider a broad range of experience and skills – but please, do read the Handbook first to see if there’s likely to be a fit, so you can save us both a lot of time if that’s not the case.
The truth is, I wrote all those articles, and I write this blog, mostly because of the warm feeling I get whenever I meet someone who learned something from what I wrote; the practical benefits were an unexpected bonus. Whatever the motivation, though, sharing information really does benefit us all. With that in mind, I’m going to start delving into what we’ve found about the surprisingly deep and complex reasons why it’s so hard to convince the human visual system that virtual images are real.
How images get displayed
There are three broad factors that affect how real – or unreal – virtual scenes seem to us, as I discussed in my GDC talk: tracking, latency, and the way in which the display interacts perceptually with the eye and the brain. Accurate tracking and low latency are required so that images can be drawn in the right place at the right time; I’ve previously talked about latency, and I’ll talk about tracking one of these days, but right now I’m going to treat latency and tracking as solved problems so we can peel the onion another layer and dive into the interaction of head mounted displays with the human visual system, and the perceptual effects thereof. More informally, you could think of this line of investigation as: “Why VR and AR aren’t just a matter of putting a display an inch in front of each eye and rendering images at the right time in the right place.”
In the next post or two, I’ll take you farther down the perceptual rabbit hole, to persistence, judder, and strobing, but today I’m going to start with an HMD artifact that’s both useful for illustrating basic principles and easy to grasp intuitively: color fringing. (I discussed this in my GDC talk, but I’ll be able to explain more and go deeper here.)
A good place to start is with a simple rule that has a lot of explanatory power: visual perception is a function of where and when photons land on the retina. That may seem obvious, but consider the following non-intuitive example. Suppose the eye is looking at a raster-scan display. Further, suppose a vertical line is being animated on the display, moving from left to right, and that the eye is tracking it. Finally, assume that the pixels on the display have zero persistence – that is, each one is illuminated very brightly for a very short portion of the frame time. What will the eye see?
The pattern shown on the display for each frame is a vertical line, so you might expect that to be what the eye sees, but the eye will actually see a line slanting from upper right to lower left. The reasons for this were discussed here, but what they boil down to is that the pattern in which the photons from the pixels land on the retina is a slanted line. This is far from unusual; it is often the case that what is perceived by the eye differs from what is displayed on an HMD, and the root cause of this is that the overall way in which display-generated photons are presented to the retina has nothing in common with real-world photons.
Real-world photons are continuously reflected or emitted by every surface, and vary constantly. In contrast, displays emit fixed streams of photons from discrete pixel areas for discrete periods of time, so photon emission is quantized both spatially and temporally; furthermore, with head-mounted displays, pixel positions are fixed with respect to the head, but not with respect to the eyes or the real world. In the case described above, the slanted line results from eye motion relative to the pixels during the time the raster scan sweeps down the display.
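To make the slanted-line case concrete, here is a minimal sketch of where each row of the vertical line lands relative to the eye. It assumes the raster scan sweeps from top to bottom at a uniform rate over one full frame time, that pixels have zero persistence, and that the eye tracks at a constant speed; the function name and the four-row display are made up for illustration.

```python
def raster_row_offsets(eye_speed_deg_per_s, rows, scan_time_s=1.0 / 60.0):
    """Eye-relative horizontal offset, in degrees, of each row of a vertical line.

    Assumes a top-to-bottom scan at a uniform rate and zero-persistence pixels,
    so each row emits its photons at a single instant.
    """
    offsets = []
    for row in range(rows):
        # Time at which this row is lit, measured from the start of the scan.
        lit_time_s = scan_time_s * row / rows
        # By then the eye has moved this far to the right of where it started...
        eye_travel_deg = eye_speed_deg_per_s * lit_time_s
        # ...so the row lands that far to the LEFT relative to the eye.
        offsets.append(0.0 - eye_travel_deg)
    return offsets


# Hypothetical 4-row display, eye tracking left to right at 60 degrees/second.
offsets = raster_row_offsets(eye_speed_deg_per_s=60.0, rows=4)
```

Under these assumptions the top row has no offset and each lower row lands farther to the left, which is the upper-right-to-lower-left slant described above.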
You could think of the photons from a display as a three-dimensional signal: pixel_color = f(display_x, display_y, time). Quantization arises because pixel color is constant within the bounds defined by the pixel boundaries and the persistence time (the length of time any given pixel remains lit during each frame). When that signal is projected onto the retina, the result for a given pixel is a tiny square that is swept across the retina, with the color constant over the course of a frame; the distance swept per frame is proportional to the distance the eye moves relative to the pixel during the persistence time. The net result is a smear, unless persistence is close to zero or the eye is not moving relative to the pixel.
The above description is a simplification, since pixels aren’t really square or uniformly colored, and illumination isn’t truly constant during the persistence time, but it will suffice for the moment. We will shortly see a case where it’s each pixel color component that remains lit, not the pixel as a whole, with interesting consequences.
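The proportionality between smear and eye motion can be written down in a few lines. This is only a sketch of the simplified model just described (uniform pixels, constant illumination during the persistence time, constant eye speed); the function name and the sample numbers are illustrative assumptions, not measurements.

```python
def smear_width_deg(relative_speed_deg_per_s, persistence_s):
    """Angular distance a lit pixel is swept across the retina per frame.

    Simplified model: the pixel stays lit, with constant color, for
    persistence_s seconds while the eye moves at a constant speed
    relative to it.
    """
    return abs(relative_speed_deg_per_s) * persistence_s


frame_time_s = 1.0 / 60.0

# Full persistence (lit for the whole 60 Hz frame), eye moving at 120 deg/s.
full_persistence_smear = smear_width_deg(120.0, frame_time_s)

# Hypothetical low-persistence case: lit for 2 ms of each frame.
low_persistence_smear = smear_width_deg(120.0, 0.002)

# No relative motion means no smear, regardless of persistence.
static_smear = smear_width_deg(0.0, frame_time_s)
```

The two ways out of smear named above fall straight out of the product: drive the persistence toward zero, or keep the eye from moving relative to the pixel.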
The discrete nature of photon emission over time is the core of the next few posts, because most display technologies have significant persistence, which means that most HMDs have a phenomenon called judder, a mix of smearing and strobing (that is, multiple simultaneous perceived copies of images) that reduces visual quality considerably, and introduces a choppiness that can be fatiguing and may contribute to motion sickness. We’ll dive into judder next time; in this post we’ll establish a foundation for the judder discussion, using the example of color fringing to illustrate the basics of the interaction between the eye and a display.
The key is relative motion between the eye and the display
Discrete photon emission produces artifacts to varying degrees for all display- and projector-based technologies. However, HMDs introduce a whole new class of artifacts, and the culprit is rapid relative motion between the eye and the display, which is unique to HMDs.
When you look at a monitor, there’s no situation in which your eye moves very rapidly relative to the monitor while still being able to see clearly. One reason for this is that monitors don’t subtend a very wide field of view – even a 30-inch monitor would be less than 60 degrees at normal viewing distance – so a rapidly-moving image would vanish off the screen almost as soon as the eye could acquire and track it. In contrast, the Oculus Rift has a 90-degree FOV.
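As a quick sanity check on the monitor figure, the horizontal angle a flat screen subtends follows from basic trigonometry. The 16:10 aspect ratio and the 24-inch viewing distance below are assumptions chosen for illustration; neither comes from the text above.

```python
import math


def horizontal_fov_deg(diagonal_in, viewing_distance_in, aspect=(16, 10)):
    """Horizontal angle subtended by a flat screen viewed head-on from its center."""
    aspect_w, aspect_h = aspect
    # Recover the screen width from the diagonal and the aspect ratio.
    width_in = diagonal_in * aspect_w / math.hypot(aspect_w, aspect_h)
    # Half the width over the distance gives the tangent of half the angle.
    half_angle_rad = math.atan((width_in / 2.0) / viewing_distance_in)
    return math.degrees(2.0 * half_angle_rad)


# Assumed: 30-inch 16:10 monitor viewed from 24 inches.
monitor_fov = horizontal_fov_deg(diagonal_in=30.0, viewing_distance_in=24.0)
```

With those assumed numbers the result comes out in the mid-50s of degrees, consistent with the "less than 60 degrees" estimate; sitting closer or using a wider screen would change it.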
An even more important reason why the eye can move much more rapidly relative to head-mounted displays than to monitors is that HMDs are attached to heads. Heads can rotate very rapidly – 500 degrees per second or more. When the head rotates, the eye can counter-rotate just as fast and very accurately, based on the vestibulo-ocular reflex (VOR). That means that if you fixate on a point on the wall in front of you, then rotate your head as rapidly as you’d like, that point remains clearly visible as your head turns.
Now consider what that means in the context of an HMD. When your head turns while you fixate on a point in the real world, the pixels on the HMD move relative to your eyes, and at a very high speed – easily ten times as fast as you can smoothly track a moving object. This is particularly important because it’s common to look at a new object by first moving the eyes to acquire the target, then remaining fixated on the target while the head turns to catch up. This VOR-based high-speed eye-pixel relative velocity is unique to HMDs.
Let’s look at a few space-time diagrams that help make it clear how HMDs differ from the real world. These diagrams plot x position relative to the eye on the horizontal axis, and time advancing down the vertical axis. This shows how two of the three dimensions of the signal from the display land on the retina, with the vertical component omitted for simplicity.
First, here’s a real-world object sitting still.
I’ll emphasize, because it’s important for understanding later diagrams, that the x axis is horizontal position relative to the eye, not horizontal position in the real world. With respect to perception of images on HMDs it’s eye-relative position that matters, because that’s what affects how photons land on the retina. So the figure above could represent a situation in which both the eye and the object are not moving, but it could just as well represent a situation in which the object is moving and the eye is tracking it.
The figure would look the same for the case where both a virtual image and the eye are not moving, unless the color of the image was changing. In that case, a real-world object could change color smoothly, while a virtual image could only change color once per frame. However, matters would be quite different if the virtual image was moving and the eye was tracking it, as we’ll see shortly.
Next let’s look at a case where something is moving relative to the eye. Here a real-world object is moving from left to right at a constant velocity relative to the eye. The most common case of this would be where the eye is fixated on something else, while the object moves through space from left to right.
Now let’s examine the case where a virtual image is moving from left to right relative to the eye, again while the eye remains fixated straight ahead. There are many types of displays that this might occur on, but for this example we’re going to assume we’re using a color-sequential liquid crystal on silicon (LCOS) display.
Color-sequential LCOS displays, which are (alas, for reasons we’ll see soon) often used in HMDs, display red, green, and blue separately, one after another, for example by reflecting a red LED off a reflective substrate that’s dynamically blocked or exposed by pixel-resolution liquid crystals, then switching the liquid crystals and reflecting a green LED, then switching the crystals again and reflecting a blue LED. (Many LCOS projectors actually switch the crystals back to the green configuration again and reflect the green LED a second time each frame, but for simplicity I’ll ignore that.) The diagram below shows how the red, green, and blue components of a moving white virtual image are displayed over time, again with the eye fixated straight ahead.
Once again, remember that the x axis is horizontal position relative to the eye. If the display had an infinite refresh rate, the plot would be a diagonal line, just like the second space-time diagram above. Given actual refresh rates, however, something quite different happens.
For a given pixel, each color displays for one-third of each frame. (It actually takes time to switch the liquid crystals, so each color displays for more like 2 ms per frame, and there are dark periods between colors, but for ease of explanation, let’s assume that each frame is evenly divided between the three colors; the exact illumination time for each color isn’t important to the following discussion.) At 60 Hz, the full cycle is displayed over the course of 16.6 ms, and because that interval is shorter than the time during which the eye integrates incident light, the visual system blends the colors for each point together into a single composite color. The result is that the eye sees an image with the color properly blended. This is illustrated in the figure below, which shows how the photons from a horizontal white line on an LCOS display land on the retina.
Here the three color planes are displayed separately, one after another, and, because the eye is not moving relative to the display, the three colored lines land on top of each other to produce a perceived white line.
Because each pixel can update only once a frame and remains lit for the persistence time, the image is quantized to pixel locations spatially and to persistence time temporally, resulting in stepped rather than continuous motion. In the case shown above, that wouldn’t produce noticeable artifacts unless the image moved too far between frames – “too far” being on the order of five or ten arc-minutes, depending on the frequency characteristics of the image. In that case, the image would strobe; that is, the eye would perceive multiple simultaneous copies of the image. I’ll talk about strobing in the next post.
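To get a feel for how little motion it takes to cross that threshold, the step size per frame can be computed directly. The sketch below treats the five-to-ten arc-minute figure as a rough cutoff only, since the real threshold depends on image content; the function names and the default threshold are illustrative assumptions.

```python
ARCMIN_PER_DEGREE = 60.0


def step_per_frame_arcmin(image_speed_deg_per_s, refresh_hz=60.0):
    """Distance the image jumps between consecutive frames, in arc-minutes."""
    return image_speed_deg_per_s / refresh_hz * ARCMIN_PER_DEGREE


def may_strobe(image_speed_deg_per_s, refresh_hz=60.0, threshold_arcmin=5.0):
    """Rough test: does the per-frame jump exceed the assumed threshold?"""
    return step_per_frame_arcmin(image_speed_deg_per_s, refresh_hz) > threshold_arcmin


slow_step = step_per_frame_arcmin(2.0)    # image drifting at 2 degrees/second
fast_step = step_per_frame_arcmin(30.0)   # image moving at 30 degrees/second
```

At 60 Hz the step in arc-minutes happens to equal the speed in degrees per second, so under this rough model anything moving faster than about five to ten degrees per second relative to the eye is a candidate for strobing.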
So far, so good, but we haven’t yet looked at motion of the eye relative to the display, and it’s that case that’s key to a number of artifacts. As I noted earlier, the eye can move relative to the display, while still being able to see clearly, either when it’s tracking a moving virtual image or when it’s fixated on a static virtual image or real object via VOR while the head turns. (I say “see clearly” because the eye can also move relative to the display by saccading, but in that case it can’t see clearly, although, contrary to popular belief, it does still acquire and use visual information.) As explained above, the VOR case is particularly interesting, because it can involve very high relative velocities between the eye and the display.
So what happens if the eye is tracking a moving virtual object that’s exactly one pixel in size from left to right? (Assume that the image lands squarely on a pixel center each frame, so we can limit this discussion to the case of exactly one pixel being lit per frame.) The color components of each pixel will then each line up differently with the eye, as you can see in the figure below, and color fringes will appear. (This figure also contains everything you need in order to understand judder, but I’ll save that discussion for the next post.)
Remember, the x position is relative to the eye, not the real world.
For a given frame, the red component of the pixel gets drawn in the correct location – that is, to the right pixel – at the start of the frame (assuming either no latency or perfect prediction). However, the red component remains in the same location on the display and is the same color for one-third of the frame; in an ideal world, the pixel would move continuously at the same speed as the image is supposed to be moving, but of course it can’t go anywhere until the next frame. Meanwhile, the eye continues to move along the path the image is supposed to be following, so the pixel slides backward relative to the eye, as you can see in the figure above. After one-third of the frame, the green component replaces the red component, falling farther behind the correct location, and finally the blue component slides even farther for the final one-third of the frame. At the start of the next frame, the red component is again drawn at the correct pixel (a different one, because the image is moving across the display), so the image snaps back to the right position, and again starts to slide. Because each pixel component is drawn at a different location relative to the eye, the colors are not properly superimposed, and don’t blend together correctly.
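The sliding described here can be sketched numerically. The code below assumes the idealized case from earlier in this post: each frame is split evenly into red, green, and blue subframes with no dark periods, the pixel is drawn at the correct location at the start of the frame, and the eye tracks at a constant speed. The function name and the chosen speeds are illustrative assumptions.

```python
def color_subframe_offsets(eye_speed_deg_per_s, refresh_hz=60.0,
                           colors=("red", "green", "blue")):
    """Eye-relative span (start, end) in degrees covered by each color subframe.

    Negative values mean the pixel has fallen behind the point the eye
    is tracking. Assumes equal-length subframes with no gaps.
    """
    frame_time_s = 1.0 / refresh_hz
    subframe_s = frame_time_s / len(colors)
    spans = {}
    for index, color in enumerate(colors):
        # The pixel stays put on the display while the eye keeps moving,
        # so the offset grows linearly with time since the frame began.
        start = 0.0 - eye_speed_deg_per_s * (index * subframe_s)
        end = 0.0 - eye_speed_deg_per_s * ((index + 1) * subframe_s)
        spans[color] = (start, end)
    return spans


# Eye tracking at 120 degrees/second: the colors land in different places.
tracking = color_subframe_offsets(120.0)

# Eye not moving relative to the display: all colors land on the same spot.
fixated = color_subframe_offsets(0.0)
```

In the tracking case red covers the first third of the slide, green the second, and blue the last, so the three components end up side by side rather than superimposed; in the fixated case every span collapses to zero and the colors blend as intended.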
Here’s how color fringing would look for eye movement from left to right – color fringes appear at the left and right sides of the image, due to the movement of the eye relative to the display between the times the red, green, and blue components are illuminated.
It might be hard to believe that color fringes can be large enough to really matter, when a whole 60 Hz frame takes only 16.6 ms. However, if you turn your head at a leisurely speed, that’s about 100 degrees/second, believe it or not; in fact, you can easily turn at several hundred degrees/second. (And remember, you can do that and see clearly the whole time if you’re fixating, thanks to VOR.) At just 60 degrees/second, one 16.6 ms frame is a full degree; at 120 degrees/second, one frame is two degrees. That doesn’t sound like a lot, but one or two degrees can easily be dozens of pixels – if such a thing as a head-mounted display that approached the eye’s resolution existed, two degrees would be well over 100 pixels – and having rainbows that large around everything reduces image quality greatly.
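The arithmetic here is easy to reproduce. In the sketch below, the 60 pixels-per-degree figure stands in for a display approaching the eye's resolution (roughly one pixel per arc-minute), and 15 pixels per degree is a made-up value for a lower-resolution HMD; both densities are assumptions for illustration, not specifications of any real device.

```python
def degrees_per_frame(head_speed_deg_per_s, refresh_hz=60.0):
    """Angle the eye sweeps relative to the display during one frame."""
    return head_speed_deg_per_s / refresh_hz


def fringe_extent_pixels(head_speed_deg_per_s, pixels_per_degree, refresh_hz=60.0):
    """Approximate width, in pixels, spanned by the color fringes in one frame."""
    return degrees_per_frame(head_speed_deg_per_s, refresh_hz) * pixels_per_degree


one_degree = degrees_per_frame(60.0)     # 60 degrees/second at 60 Hz
two_degrees = degrees_per_frame(120.0)   # 120 degrees/second at 60 Hz

# Assumed pixel densities (hypothetical displays).
modest_hmd_pixels = fringe_extent_pixels(120.0, pixels_per_degree=15.0)
eye_limited_pixels = fringe_extent_pixels(120.0, pixels_per_degree=60.0)
```

With these assumed densities, a 120 degrees/second head turn spreads the fringes over about 30 pixels on the modest display and about 120 pixels on the eye-limited one, in line with the "dozens" and "well over 100" estimates.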
Color-sequential displays in projectors and TVs don’t suffer to any significant extent from color fringing because there’s no rapid relative motion between the eye and the display involved, for the two reasons mentioned earlier: because projectors and TVs have limited fields of view, and because they don’t move with the head and thus aren’t subject to the high relative eye velocities associated with VOR. Not so for HMDs; color-sequential displays should be avoided like the plague in HMDs intended for AR or VR use.
Necessary but not sufficient
There are two important conclusions to be drawn from the discussion to this point. The first is that it should now be clear that relative motion between the eye and a head-mounted display can produce serious artifacts, and what the basic mechanism underlying that is. The second is that a specific artifact, color fringing, is a natural by-product of color-sequential displays, and that as a result AR/VR displays need to illuminate all three color components simultaneously, or at least nearly so.
Illuminating all three color components simultaneously is, alas, necessary but not sufficient. Doing so will eliminate color fringing, but it won’t do anything about judder, so that’s the layer we’ll peel off the perceptual onion next time.