A Good Post by John Carmack about Latency

John Carmack of Id Software has posted a long, detailed piece that discusses a number of ways to address VR latency. It’s a good complement to my recent post about latency; the two posts have a good deal in common, but John approaches the problem from a different angle, and has a very nice way of diagramming pipeline latency. Definitely worth a read.

Raster-Scan Displays: More Than Meets The Eye

Back in the spring of 1986, Dan Illowsky and I were up against the deadline for an article that we were writing for PC Tech Journal. The name of the article might have been “Software Sprites,” but I’m not sure, since it’s one of the few things I’ve written that seems not to have made it to the Internet. In any case, I believe the article showed two or three different ways of doing software animation on the very simple graphics hardware of the time. With the deadline looming, both the article and the sample code that would accompany it were written, but one part of the code just wouldn’t work right.

As best I can remember, the problematic sample moved two animated helicopters and a balloon around the screen. All the drawing was done immediately after vsync; the point was to show that since nothing was being scanned out to the display at that time (vsync happens in the middle of the vertical blanking interval), the contents of the frame buffer could be modified with no visible artifacts. The problem was that when an animated object got high enough on the screen, it would start vanishing – oddly enough, from the bottom up – and more and more of the object would vanish as it rose until it was completely gone. Stranger still, the altitude at which this happened varied from object to object. We had no idea why that was happening – and the clock was ticking.

I’m happy to report that we did solve the mystery before the deadline. The problem was that back in those days of dog-slow 8088s and slightly faster 80286s, the display was scanning out pixels before the code had finished updating them. And if that explanation doesn’t make much sense to you at the moment, it should all be clear by the end of today’s post, which covers some decidedly non-intuitive consequences of an interesting aspect of the discussion of latency in the last post – the potentially problematic AR/VR implications of a raster scan display, and the way that racing the beam interacts with the raster scan to address those problems.

Raster scanning

Raster scanning is the process of displaying an image by updating each pixel one after the other, rather than all at the same time, with all the pixels on the display updated over the course of one frame. Typically this is done by scanning each row of pixels from left to right, and scanning rows from top to bottom, so the rightmost pixel on each scan line is updated a few microseconds after the leftmost pixel, and the bottommost row on the screen is updated a few milliseconds (roughly 15 ms for 60 Hz refresh – less than 16.7 ms because of vertical blanking time) after the topmost row. Figure 1 shows the order in which pixels are updated on an illustrative if not particularly realistic 8×4 raster scan display.
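To make that timing concrete, here’s a tiny sketch (my own illustration, not code for any real display) that prints the approximate time, in milliseconds after the start of scan-out, at which each pixel of that illustrative 8×4, 60 Hz display gets updated, assuming roughly 15 ms of active scan time and ignoring horizontal blanking:

```c
/* Minimal sketch: approximate update time (ms after the start of scan-out)
 * for each pixel of an illustrative 8x4 raster-scan display at 60 Hz.
 * Assumes ~15 ms of active scan time and ignores horizontal blanking. */
#include <stdio.h>

#define WIDTH  8
#define HEIGHT 4
#define ACTIVE_SCAN_MS 15.0   /* ~16.7 ms frame minus vertical blanking */

int main(void)
{
    double line_ms  = ACTIVE_SCAN_MS / HEIGHT;  /* time to scan one row   */
    double pixel_ms = line_ms / WIDTH;          /* time to scan one pixel */

    for (int row = 0; row < HEIGHT; row++) {
        for (int col = 0; col < WIDTH; col++)
            printf("%6.2f ", row * line_ms + col * pixel_ms);
        printf("\n");
    }
    return 0;
}
```

The value in the bottom-right corner comes out near the full 15 ms, which is the whole point: the last pixel of a frame is updated a substantial fraction of a frame later than the first.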

Originally, the raster scan pattern directly reflected the way the electron beam in a CRT moved to update the phosphors. There’s no longer an electron beam on most modern displays; now the raster scan reflects the order in which pixel data is scanned out of the graphics adapter and into the display. There’s no reason that the scan-in has to proceed in that particular order, but on most devices that’s what it does, although there are variants like scanning columns rather than rows, scanning each pair of lines in opposite directions, or scanning from the bottom up. If you could see events that happen on a scale of milliseconds (and, as we’ll see shortly, under certain circumstances you can), you would see pixel updates crawling across the screen in raster scan order, from left to right and top to bottom.

It’s necessary that pixel data be scanned into the display in some time-sequential pattern, because the video link (HDMI, for example) transmits pixel data in a stream. However, it’s not required that these changes become visible over time. It would be quite possible to scan in a full frame to, say, an LCD panel while it was dark, wait until all the pixel data has been transferred, and then illuminate all the pixels at once with a short, bright light, so all the pixel updates become visible simultaneously. I’ll refer to this as global display, and, in fact, it’s how some LCOS, DLP, and LCD panels work. However, in the last post I talked about reducing latency by racing the beam, and I want to follow up by discussing the interaction of that with raster scanning in this post. There’s no point to racing the beam unless each pixel updates on the display as soon as the raster scan changes it; that means that global display, which doesn’t update any pixel’s displayed value until all the pixels in the frame have been scanned in, precludes racing the beam.

So for the purposes of today’s discussion, I’ll assume we’re working with a display that updates each pixel on the screen as soon as the scanned-in pixel data provides a new value for it; I’ll refer to this as rolling display. I’ll also assume we’re working with zero persistence pixels – that is, pixels that illuminate very brightly for a very short period after being updated, then remain dark for the remainder of the frame. This eliminates the need to consider the positions and times of both the first and last photons emitted, and thus we can ignore smearing due to eye movement relative to the display. Few displays actually have zero persistence or anything close to it, although scanning lasers do, but it will make it easier to understand the basic principles if we make this simplifying assumption.

Raster scanning is not how anything works in nature

To recap, racing the beam is when rendering proceeds down the frame just a little ahead of the raster, so that pixels appear on the screen shortly after they’re drawn. Typically this would be done by rendering the scene in horizontal strips of perhaps a few dozen lines each, using the latest reading from the tracking system to position each strip for the current HMD pose just before rendering it.

This is an effective latency-reducing technique, but it’s hard to implement, because it’s very timing-dependent. There’s no guarantee as to how long a given strip will take to render, so there’s a delicate balance involved in leaving enough padding so the raster won’t overtake rendering, while still getting close enough to the raster to reap significant latency reduction. As discussed in the last post, there are some interesting ways to try to address that balance, such as rendering the whole frame, then warping each strip based on the latest position data. In any case, racing the beam is capable of reducing display latency purely in software, and that’s a rare thing, so it’s worth looking into more deeply. However, before we can even think about racing the beam, we need to understand some non-intuitive implications of rolling display, which, as explained above, is required in order for racing the beam to provide any benefit.
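To make the idea a bit more tangible, here’s a rough sketch of what a beam-racing render loop might look like. Everything in it is hypothetical – Pose, read_current_pose, render_strip, and wait_until_raster_reaches are placeholders standing in for tracker and display APIs, not anything from a real SDK:

```c
/* Sketch of racing the beam: each horizontal strip is rendered with the
 * freshest head pose, starting just far enough ahead of the raster that the
 * strip is guaranteed to finish before the raster scans those lines out.
 * All helper functions here are hypothetical placeholders. */
typedef struct { float x, y, z, yaw, pitch, roll; } Pose;

extern void wait_until_raster_reaches(int scanline); /* negative => still in vblank */
extern Pose read_current_pose(void);                 /* latest tracker sample       */
extern void render_strip(int first_line, int line_count, const Pose *pose);

#define TOTAL_LINES  1080
#define STRIP_LINES    40  /* lines rendered per strip                         */
#define MARGIN_LINES   60  /* worst-case strip render time in scan lines, plus */
                           /* padding so the raster never overtakes rendering  */

void render_frame_racing_the_beam(void)
{
    for (int first = 0; first < TOTAL_LINES; first += STRIP_LINES) {
        /* Start this strip when the raster is MARGIN_LINES above it, so the
         * strip completes (with padding to spare) before it's scanned out. */
        wait_until_raster_reaches(first - MARGIN_LINES);

        /* Sample the tracker as late as possible; this is where the latency
         * win over whole-frame rendering comes from. */
        Pose pose = read_current_pose();

        render_strip(first, STRIP_LINES, &pose);
    }
}
```

The delicate balance mentioned above lives in MARGIN_LINES: too small and the raster overtakes rendering, too large and the latency advantage shrinks back toward whole-frame rendering.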

So let’s look at a few scenarios. If you’re wearing an HMD with a 60 Hz rolling display, and rendering each frame in its entirety, waiting for vsync, and then scanning the frame out to the display in the normal fashion (with no racing the beam involved at this point), what do you think you’d see in each of the following scenarios? (Hint: think about what you’d see in a single frame for each scenario, and then just repeat that.)

Scenario 1: Head is not moving; eyes are fixated on a vertical line that extends from the top to the bottom of the display, as shown in Figure 2; the vertical line is not moving on the display.

Scenario 2: Head is not moving; the vertical line in Figure 2 is moving left to right on the display at 60 degrees/second; eyes are tracking the line.

Scenario 3: Head is not moving; the vertical line in Figure 2 is moving left to right relative to the display at 60 degrees/second across the center of the screen; eyes are fixated on the center of the screen, and are not tracking the line.

Scenario 4: Head is rotating left to right at 60 degrees/second; the vertical line in Figure 2 is moving right to left on the display at 60 degrees/second, compensating for the head motion so that to the eye the image appears to stay in the same place in the real world; eyes are counter-rotating, tracking the line.

Take a second to think through each of these and write down what you think you’d see. Bear in mind that raster scanning is not how anything works in nature; the pixels in a raster image are updated at differing times, and in the case of zero persistence aren’t even on at the same time. Frankly, it’s a miracle that raster images look like anything coherent at all to us; the fact that they do has to do with the way our visual system collects photons and makes inferences from that data, and at some point I hope to talk about that a little, because it’s fascinating (and far from fully understood).

Here are the answers, as shown in Figure 3, below:

Scenario 1: an unmoving vertical line.

Scenario 2: a line moving left to right, slanted to the right by about one degree from top to bottom. (The slant is exaggerated in Figure 3 to make it easier to see; in an HMD, even a one-degree slant is quite noticeable, for reasons I’ll discuss a little later.)

Scenario 3: a vertical line moving left to right.

Scenario 4: a line staying in the same place relative to the real world (although moving right to left on the display, compensating for the display movement from left to right), slanted to the left by about one degree from top to bottom.

How did you do? If you didn’t get all four, don’t feel bad; as I said at the outset, this is not intuitive – which is what makes it so interesting.

In a moment, I’ll explain these results in detail, but here’s the underlying rule for understanding what happens in such situations: your perception will be based on whatever pattern is actually produced on your retina by the photons emitted by the image. That may sound obvious, and in the real world it is, but with an HMD, the time-dependent sequence of pixel illumination makes it anything but.

Given that rule, we get a vertical line in scenario 1 because nothing is moving, so the image registers on the retina exactly as it’s displayed.

Things get more complicated with scenario 2. Here, the eye is smoothly tracking the image, so it’s moving to the right at 60 degrees/second relative to the display. (Note that 60 degrees/second is a little fast for smooth pursuit without saccades, but the math works out neatly on a 60 Hz display, so we’ll go with that.) The topmost pixel in the vertical line is displayed at the start of the frame, and lands at some location on the retina. Then the eye continues moving to the right, and the raster continues scanning down. By the time the raster reaches the last scan line and draws the bottommost pixel of the line, it’s something on the order of 15 ms later, and here we come to the crux of the matter – the eye has moved about one degree to the right since the topmost pixel was drawn. (Note that the eye will move smoothly in tracking the line, even though the line is actually drawn as a set of discrete 60 Hz samples.)

That means that the bottommost pixel will land on the retina about one degree to the right of the topmost pixel, which, due to the way images are formed on the retina and then flipped, will cause the viewer to perceive it to be one degree to the left of the topmost pixel. The same is true of all the pixels in the vertical line, in direct proportion to how much later they’re drawn relative to the topmost pixel. The pixels of the vertical line land on the retina slanted by one degree, so we see a line that’s similarly slanted, as shown in Figure 4 for an illustrative 4×4, 60 Hz display.

Note that for clarity, Figure 4 omits the retinal image flipping step and just incorporates its effects into the final result. The slanted pixels are shown at the locations where they’d be perceived; the pixels would actually land on the retina offset in the opposite direction, and reversed vertically as well, due to image inversion, but it’s the perceived locations that matter.
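The “about one degree” figure is just eye velocity multiplied by the top-to-bottom scan time; here’s the trivial arithmetic, using the same numbers as the scenario:

```c
/* Scenario 2 arithmetic: how far the tracking eye moves between the top and
 * bottom scan lines of one 60 Hz frame, which is the apparent slant. */
#include <stdio.h>

int main(void)
{
    double eye_deg_per_s = 60.0;    /* smooth-pursuit speed in the scenario */
    double scan_time_s   = 0.015;   /* ~15 ms from top to bottom scan line  */

    printf("apparent slant: ~%.1f degrees\n", eye_deg_per_s * scan_time_s);
    return 0;
}
```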

If it’s that easy to produce this effect, you may well ask: Why can’t I see it on a monitor? The answer depends on whether rendering waits for vsync; that is, whether a new rendered frame is copied to the frame buffer only once per displayed frame (i.e., at the refresh rate), or copied as fast as frames can be drawn (so multiple rendered frames affect a single displayed frame, each in its own horizontal strip – a crude form of racing the beam).

In the case where vsync isn’t waited for, you won’t see lines slant for reasons that may already be obvious to you – because each horizontal strip is drawn at the right location based on the most recent position data; we’ll return to this later. However, in this case it’s easy to see the problem with not waiting for vsync as well. If vsync is off on your monitor, grab a screen-height window that has a high-contrast border and drag it rapidly left to right, then back right to left, and you’ll see that the vertical edge breaks up into segments. The segments are separated by the scan lines where the copy to the screen overtook the raster. If you move the window to the left and don’t track it with your eyes, the lower segments will be to the left of the segments above them, because as soon as the copy overtakes the raster (this assumes that the copy is faster than the raster update, which is very likely to be the case), the raster starts displaying the new pixels, which represent the most up-to-date window position as it moves to the left. This segmentation is called tearing, and is a highly visible artifact that needs to be carefully smoothed over for any HMD racing-the-beam approach.

In contrast, if vsync is waited for, there will be no tearing, but the slanting described above will be visible. If your monitor waits for vsync, grab a screen-height window and drag it back and forth, tracking it with your eyes, and you will see that the vertical edges do in fact tilt as advertised; it’s subtle, because it’s only about a degree and because the pixels smear due to long persistence, but it’s there.

In either case, the artifacts are far more visible for AR/VR in an HMD, because objects that dynamically warp and deform destroy the illusion of reality; in AR in particular, it’s very apparent when artifacts mis-register against the real world. Another factor is that in an HMD, your eyes can counter-rotate and maintain fixation while you turn your head (via the combination of the vestibulo-ocular reflex, or VOR, and the optokinetic response, or OKR), and that makes possible relative speeds of rotation between the eye and the display that are many times higher than the speeds at which you can track a moving object (via smooth pursuit) while holding your head still, resulting in proportionally greater slanting.

By the way, although it’s not exactly the same phenomenon, you can see something similar – and more pronounced – on your cellphone. Put it in back-facing camera mode, point it at a vertical feature such as a door frame, and record a video while moving it smoothly back and forth. Then play the video back while holding the camera still. You will see the vertical feature tilt sharply, or at least that’s what I see on my iPhone. This differs from scenario 4 because it involves a rolling shutter camera (if you don’t see any tilting, either you need to rotate your camera 90 degrees to align with the camera scan direction – I had to hold my iPhone with the long dimension horizontal – or your camera has a global shutter), but the basic principles of the interaction of photons and motion over time are the same, just based on sampling incoming photons in this case rather than displaying outgoing ones. (Note that it is risky to try to draw rolling display conclusions relevant to HMDs from experiments with phone cameras because of the involvement of rolling shutter cameras, because the frame rates and scanning directions of the cameras and displays may differ, and because neither the camera nor the display is attached to your head.)

Scenario 3 results in a vertical line for the same reason as scenario 1. True, the line is moving between frames, but during a frame it’s drawn as a vertical line on the display. Since the eye isn’t moving relative to the display, that image ends up on the retina exactly as it’s displayed. (A bit of foreshadowing for some future post: the image for the next frame will also be vertical, but will be at some other location on the retina, with the separation depending on the velocity of motion – and that separation can cause its own artifacts.)

It may not initially seem like it, but scenario 4 is the same as scenario 2, just in the other direction. I’ll leave this one as an exercise for the reader, with the hint that the key is the motion of the eye relative to the display.

Rolling displays can produce vertical effects as well, and they can actually be considerably more dramatic than the horizontal ones. As an extreme but illustrative example (you’d probably injure yourself if you actually tried to move your head at the required speed), take a moment and try to figure out what would happen if you rotated your head upward over the course of a frame at exactly the same speed that the raster scanned down the display, while fixating on a point in the real world.

Ready?

The answer is that the entire frame would collapse into a single horizontal line, because every scan line would land in exactly the same place on the retina. Less rapid upward head motion will result in vertical compression of the image; head motion in the same direction as the raster scan – that is, downward, so the eye moves up the display, against the scan – will similarly result in vertical expansion. Either case can also cause intra- or inter-frame brightness variation.
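If you want to see where the collapse, compression, and expansion come from, here’s a small sketch of the geometry under my own simplifying assumptions (constant velocities, the full vertical FOV scanned top to bottom during the active scan time, zero persistence): each scan line lands on the retina at its display angle minus however far the gaze has moved by the time that line is drawn.

```c
/* Sketch of the rolling-display vertical geometry: the perceived height of a
 * frame, given the eye's vertical angular velocity relative to the display.
 * All constants are illustrative, not taken from any particular HMD. */
#include <stdio.h>

#define VFOV_DEG     90.0    /* vertical field of view of the display */
#define SCAN_TIME_S   0.015  /* active scan time, top to bottom       */

/* Positive velocity = eye moving in the same direction as the scan. */
static double perceived_height_deg(double eye_vel_deg_per_s)
{
    return VFOV_DEG - eye_vel_deg_per_s * SCAN_TIME_S;
}

int main(void)
{
    double scan_rate = VFOV_DEG / SCAN_TIME_S;   /* deg/s the raster sweeps */

    printf("eye still:             %6.1f deg\n", perceived_height_deg(0.0));
    printf("eye at half scan rate: %6.1f deg\n", perceived_height_deg(scan_rate / 2));
    printf("eye at full scan rate: %6.1f deg  (collapse)\n", perceived_height_deg(scan_rate));
    printf("eye against the scan:  %6.1f deg  (expansion)\n", perceived_height_deg(-scan_rate / 2));
    return 0;
}
```

When the eye’s vertical velocity relative to the display matches the scan rate, the perceived height goes to zero – the collapsed line; motion against the scan stretches the frame instead.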

None of this is hypothetical, nor is it a subtle effect. I’ve looked at cubes in an HMD that contort as if they’re made of Jell-O, leaning this way and that, compressing and expanding as I move my head around. It’s hard to miss.

Racing the beam fixes everything – or does it?

In sum, rolling display of a rendered frame produces noticeable shear, compression, expansion, and brightness artifacts that make both AR and VR less solid and hence less convincing; the resulting distortion may also contribute to simulator sickness. What’s to be done? Here we finally return to racing the beam, which updates the position of each scan line or block of scan lines just before rendering, which in turn occurs just before scan-out and display, thereby compensating for intra-frame motion and placing pixels where they should be on the retina. (Here I’m taking “racing the beam” to include the whole family of warping and reconstruction approaches that were mentioned in the last post and the comments on the post.) In scenario 4, HMD tracking data would cause each scan line or horizontal strip of scan lines to be drawn slightly to the left of the one above, which would cause the pixels of the image to line up in proper vertical arrangement on the retina. (Another approach would be the use of a global display; that comes with its own set of issues, not least the inability to reduce latency by racing the beam, which I hope to talk about at some point.)

So it appears that racing the beam, for all its complications, is a great solution not only to display latency but also to rolling display artifacts – in fact, it seems to be required in order to address those artifacts – and that might well be the case. But I’ll leave you with a few thoughts (for which the bulk of the credit goes to Atman Binstock and Aaron Nicholls, who have been diving into AR/VR perceptual issues at Valve):

1) The combination of racing the beam and compensating for head motion can fix scenario 4, but that scenario is a specific case of a general problem; head-tracking data isn’t sufficient to allow racing the beam to fix the rolling display artifacts in scenario 2. Remember, it’s the motion of the eye relative to the display, not the motion of the head, that’s key.

2) It’s possible, when racing the beam, to inadvertently repeat or omit horizontal strips of the scene, in addition to the previously mentioned brightness variations. (In the vertical rotation example above, where all the scan lines collapse into a single horizontal line, think about what each scan line would draw.)

3) Getting rid of rolling display artifacts while maintaining proper AR registration with the real world for moving objects is quite challenging – and maybe even impossible.

These issues are key, and I’ll return to them at some point, but I think we’ve covered enough ground for one post.

Finally, in case you still aren’t sure why the sprites in the opening story vanished from the bottom up, it was because both the raster and the sprite rendering were scanning downward, with the raster going faster. Until it caught up to the current rendering location, the raster scanned out pixels that had already been rendered; once it passed the current rendering location, it scanned out background pixels, because the foreground image hadn’t yet been drawn to those pixels. Different images started to vanish at different altitudes because the images were drawn at different times, one after the other, and vanishing was a function of the raster reaching the scan lines the image was being drawn to as it was being drawn, or, in the case of vanishing completely, before it was drawn. Since the raster scans at a fixed speed, images that were drawn sooner would be able to get higher before vanishing, because the raster would still be near the top of the screen when they were drawn. By the time the last image was drawn, the raster would have advanced far down the screen, and the image would start to vanish at that much lower level.

Latency – the sine qua non of AR and VR

Christmas always makes me think of Mode X – which surely requires some explanation, since it’s not the most common association with this time of year (or any time of year, for that matter).

IBM introduced the VGA graphics chip in the late 1980s as a replacement for the EGA. The biggest improvement in the VGA was the addition of the first 256-color mode IBM had ever supported – Mode 0x13 – sporting 320×200 resolution. Moreover, Mode 0x13 had an easy-to-program linear bitmap, in contrast to the Byzantine architecture of the older 16-color modes, which involved four planes and used four different pixel access modes, controlled through a variety of latches and registers. So Mode 0x13 was a great addition, but it had one downside – it was slow.

Mode 0x13 only allowed one byte of display memory – one pixel – to be modified per write access; even if you did 16-bit writes, they got broken into two 8-bit writes. The hardware used in 16-color modes, for all its complexity, could write a byte to each of the four planes at once, for a total of 32 bits modified per write. That four-times difference meant that Mode 0x13 was by far the slowest video mode.

Mode 0x13 also didn’t result in square pixels; the standard monitor aspect ratio was 4:3, which was a perfect match for the 640×480 high-res 16-color mode, but not for Mode 0x13’s 320×200. Mode 0x13 was limited to 320×200 because the video memory window was only 64KB, and 320×240 wouldn’t have fit. 16-color modes didn’t have that problem; all four planes were mapped into the same memory range, so they could each be 64KB in size.

In December of 1989, I remember I was rolling Mode 0x13’s aspect ratio around in my head on and off for days, thinking how useful it would be if it could support square pixels. It felt like there was a solution there, but I just couldn’t tease it out. One afternoon, my family went to get a Christmas tree, and we brought it back and set it up and started to decorate it. For some reason, the aspect ratio issue started nagging at me, and I remember sitting there for a minute, watching everyone else decorate the tree, phased out while ideas ran through my head, almost like that funny stretch of scrambled thinking just before you fall asleep. And then, for no apparent reason, it popped into my head:

Treat it like a 16-color mode.

You see, the CPU-access side of the VGA’s frame buffer (that is, reading and writing of its contents by software) and the CRT controller side (reading of pixels to display them) turned out to be completely independently configurable. I could leave the CRT controller set up to display 256 colors, but reconfigure CPU access to allow writing to four planes at once, with all the performance benefits of the 16-color hardware – and, as it turned out, a write that modified all four planes would update four consecutive pixels in 256-color mode. This meant fills and copies could go four times as fast. Better yet, the 64KB memory window limitation went away, because now four times as many bytes could be addressed in that window, so a few simple tweaks to get the CRT controller to scan out more lines produced a 320×240 mode, which I dubbed “Mode X” and wrote up in the December, 1991, Dr. Dobb’s Journal. Mode X was widely used in games for the next few years, until higher-res linear 256-color modes with fast 16-bit access became standard.
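For the curious, the heart of the trick looks roughly like this. It’s a from-memory sketch in DOS-era, Borland-style C (outportb and MK_FP from dos.h), and it omits the mode set itself – unchaining memory and reprogramming the CRT controller for 240 lines – so treat it as illustrative rather than production Mode X code:

```c
/* Mode X pixel addressing sketch (unchained 256-color VGA): four horizontally
 * adjacent pixels share one byte address, one per plane, selected through the
 * Sequence Controller's Map Mask register. DOS-era Borland-style C. */
#include <dos.h>

#define SC_INDEX      0x3C4   /* Sequence Controller index port */
#define MAP_MASK      0x02    /* Map Mask register index        */
#define MODE_X_WIDTH  320
#define VGA_MEM       ((unsigned char far *)MK_FP(0xA000, 0))

/* Write one pixel: select its plane, then write its byte. */
void putpixel_modex(int x, int y, unsigned char color)
{
    outportb(SC_INDEX, MAP_MASK);
    outportb(SC_INDEX + 1, (unsigned char)(1 << (x & 3)));  /* just this plane */
    VGA_MEM[y * (MODE_X_WIDTH / 4) + (x >> 2)] = color;
}

/* Fill: enable all four planes so each byte written sets four pixels. */
void fill_modex(unsigned char color, unsigned int byte_count)
{
    unsigned int i;
    outportb(SC_INDEX, MAP_MASK);
    outportb(SC_INDEX + 1, 0x0F);        /* all four planes at once */
    for (i = 0; i < byte_count; i++)
        VGA_MEM[i] = color;              /* 4 pixels per byte written */
}
```

A fill done this way touches four pixels per byte written, which is exactly where the four-times speedup over stock Mode 0x13 comes from.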

If you’re curious about the details of Mode X – and there’s no reason you should be, because it’s been a long time since it’s been useful – you can find them here, in Chapters 47-49.

One interesting aspect of Mode X is that it was completely obvious in retrospect – but then, isn’t everything? Getting to that breakthrough moment is one of the hardest things there is, because it’s not a controllable, linear process; you need to think and work hard at a problem to make it possible to have the breakthrough, but often you then need to think about or do something – anything – else, and only then does the key thought slip into your mind while you’re not looking for it.

The other interesting aspect is that everyone knew that there was a speed-of-light limit on 256-color performance on the VGA – and then Mode X made it possible to go faster than that limit by changing the hardware rules. You might think of Mode X as a Kobayashi Maru mode.

Which brings us, neat as a pin, to today’s topic: when it comes to latency, virtual reality (VR) and augmented reality (AR) are in need of some hardware Kobayashi Maru moments of their own.

Latency is fundamental

When it comes to VR and AR, latency is fundamental – if you don’t have low enough latency, it’s impossible to deliver good experiences, by which I mean virtual objects that your eyes and brain accept as real. By “real,” I don’t mean that you can’t tell they’re virtual by looking at them, but rather that your perception of them as part of the world as you move your eyes, head, and body is indistinguishable from your perception of real objects. The key to this is that virtual objects have to stay in very nearly the same perceived real-world locations as you move; that is, they have to register as being in almost exactly the right position all the time. Being right 99 percent of the time is no good, because the occasional mis-registration is precisely the sort of thing your visual system is designed to detect, and will stick out like a sore thumb.

Assuming accurate, consistent tracking (and that’s a big if, as I’ll explain one of these days), the enemy of virtual registration is latency. If too much time elapses between the time your head starts to turn and the time the image is redrawn to account for the new pose, the virtual image will drift far enough so that it has clearly wobbled (in VR), or so that it is obviously no longer aligned with the same real-world features (in AR).

How much latency is too much? Less than you might think. For reference, games generally have latency from mouse movement to screen update of 50 ms or higher (sometimes much higher), although I’ve seen numbers as low as about 30 ms for graphically simple games running with tearing (that is, with vsync off). In contrast, I can tell you from personal experience that more than 20 ms is too much for VR and especially AR, but research indicates that 15 ms might be the threshold, or even 7 ms.

AR/VR is so much more latency-sensitive than normal games because, as described above, virtual images are expected to stay stable with respect to the real world as you move, while with normal games, your eye and brain know they’re looking at a picture. With AR/VR, all the processing power that originally served to detect anomalies that might indicate the approach of a predator or the availability of prey is brought to bear on bringing virtual images that are wrong by more than a tiny bit to your attention. That includes images that shift when you move, rather than staying where they’re supposed to be – and that’s exactly the effect that latency has.

Suppose you rotate your head at 60 degrees/second. That sounds fast, but in fact it’s just a slow turn; you are capable of moving your head at hundreds of degrees/second. Also suppose that latency is 50 ms and resolution is 1K x 1K over a 100-degree FOV. Then as your head turns, the virtual images being displayed are based on 50 ms-old data, which means that their positions are off by three degrees, which is wider than your thumb held at arm’s length. Put another way, the object positions are wrong by 30 pixels. Either way, the error is very noticeable.
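The arithmetic is worth writing down, since it’s the same calculation you’d do for any head speed, latency, and display; the numbers below are the ones from the example above:

```c
/* Back-of-the-envelope latency error: how far a virtual object lags behind
 * its correct position for a given head speed, latency, and display. */
#include <stdio.h>

int main(void)
{
    double head_deg_per_s = 60.0;     /* slow head turn          */
    double latency_s      = 0.050;    /* 50 ms motion-to-photons */
    double fov_deg        = 100.0;
    double pixels_across  = 1000.0;

    double error_deg = head_deg_per_s * latency_s;             /* 3 degrees */
    double error_px  = error_deg * (pixels_across / fov_deg);  /* 30 pixels */

    printf("registration error: %.1f degrees = %.0f pixels\n", error_deg, error_px);
    return 0;
}
```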

You can do prediction to move the drawing position to the right place, and that works pretty well most of the time. Unfortunately, when there is a sudden change of direction, the error becomes even bigger than with no prediction. Again, it’s the anomalies that are noticeable, and reversal of direction is a common situation that causes huge anomalies.
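To see why reversals hurt, compare the simplest possible predictor – current position plus current velocity times latency – against no prediction at all, at the moment the head snaps from one direction to the other. This is my own illustration of the effect, not a description of any particular tracking system:

```c
/* Why prediction hurts at reversals: with a constant-velocity predictor and
 * latency L, a head turning steadily at +w displays with near-zero error, but
 * just after the head snaps to -w the predictor is still extrapolating the
 * stale velocity, and the error grows to about 2*w*L. */
#include <stdio.h>

int main(void)
{
    double w = 60.0;      /* head speed, degrees/second   */
    double L = 0.050;     /* motion-to-photons latency, s */

    double no_prediction_error  = w * L;        /* steady-state lag            */
    double steady_predicted     = 0.0;          /* prediction cancels the lag  */
    double worst_after_reversal = 2.0 * w * L;  /* stale +w used while the     */
                                                /* head actually moves at -w   */

    printf("no prediction, steady turn:     %.1f deg\n", no_prediction_error);
    printf("with prediction, steady turn:   %.1f deg\n", steady_predicted);
    printf("with prediction, just reversed: %.1f deg\n", worst_after_reversal);
    return 0;
}
```

During a steady turn the predictor wins decisively; immediately after a reversal it’s extrapolating a stale velocity in the wrong direction, and the worst-case error is roughly twice the unpredicted lag.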

Finally, latency seems to be connected to simulator sickness, and the higher the latency, the worse the effect.

So we need to get latency down to 20 ms, or possibly much less. Even 20 ms is very hard to achieve on existing hardware, and 7 ms, while not impossible, would require significant compromises and some true Kobayashi Maru maneuvers. Let’s look at why that is.

The following steps have to happen in order to draw a properly registered AR/VR image:

1) Tracking has to determine the exact pose of the HMD – that is, the exact position and orientation in the real world.
2) The application has to render the scene, in stereo, as viewed from that pose. Antialiasing is not required but is a big plus, because, as explained in the last post, pixel density is low for wide-FOV HMDs.
3) The graphics hardware has to transfer the rendered scene to the HMD’s display. This is called scan-out, and involves reading sequentially through the frame buffer from top to bottom, moving left to right within each scan line, and streaming the pixel data for the scene over a link such as HDMI to the display.
4) Based on the received pixel data, the display has to start emitting photons for each pixel.
5) At some point, the display has to stop emitting those particular photons for each pixel, either because pixels aren’t full-persistence (as with scanning lasers) or because the next frame needs to be displayed.

There’s generally additional buffering that happens in 3D pipelines, but I’m going to ignore that, since it’s not an integral part of the process of generating an AR/VR scene.

Let’s look at each of the three areas – tracking, rendering, and display – in turn.

Tracking latency is highly dependent on the system used. An IMU (3-DOF gyro and 3-DOF accelerometer) has very low latency – on the order of 1 ms – but drifts. In particular, position derived from the accelerometer drifts badly, because it’s derived via double integration from acceleration. Camera-based tracking doesn’t drift, but has high latency due to the need to capture the image, transfer it to the computer, and process the image to determine the pose; that can easily take 10-15 ms. Right now, one of the lowest-latency non-drifting accurate systems out there is a high-end system from NDI, which has about 4 ms of latency, so we’ll use that for the tracking latency.

Rendering latency depends on CPU and GPU capabilities and on the graphics complexity of the scene being drawn. Most games don’t attain 60 Hz consistently, so they typically have rendering latency of more than 16 ms, which is too high for AR/VR, which requires at least 60 Hz for a good experience. Older games can run a lot faster, up to several hundred Hz, but that’s because they’re doing relatively unsophisticated rendering. So let’s say rendering latency is 16 ms.

Once generated, the rendered image has to be transferred to the display. How long that takes for any particular pixel depends on the display technology and generally varies across the image, but for scan-based display technology, which is by far the most common, the worst case is nearly a full frame time between the frame buffer update and scan-out of a given pixel – about 16 ms at 60 Hz. For example, suppose a frame finishes rendering just as scan-out starts to read the topmost scan line on the screen. The topmost scan line then has almost no scan-out latency, but it will be nearly 16 ms (almost a full frame time – not quite, because there’s a vertical blanking period between successive frames) before scan-out reads the bottommost scan line and sends its pixel data to the display; for that line, the latency between rendering and transmission is nearly a full frame.

Sometimes each pixel’s data is immediately displayed as it arrives, as is the case with some scanning lasers and OLEDs. Sometimes it’s buffered and displayed a frame or more later, as with color-sequential LCOS, where the red components of all the pixels are illuminated at the same time, then the same is done separately for green, and then again for blue. Sometimes the pixel data is immediately applied, but there is a delay before the change is visible; for example, LCD panels take several milliseconds at best to change state. Some televisions even buffer multiple frames in order to do image processing. However, in the remainder of this discussion I’ll assume the best case, which is that we’re using a display that turns pixel data into photons as soon as it arrives.

Once the photons are emitted, there is no perceptible time before they reach your eye, but there’s still one more component to latency, and that’s the time until the photons from a pixel stop reaching your eye. That might not seem like it matters, but it can be very important when you’re wearing an HMD and the display is moving relative to your eye, because the longer a given pixel state is displayed, the farther it gets from its correct position, and the more it smears. From a latency perspective, far better for each pixel to simply illuminate briefly and then turn off, which scanning lasers do, than to illuminate and stay on for the full frame time, which some OLEDs and LCDs do. Many displays fall in between; CRTs have relatively low persistence, for example, and LCDs and OLEDs can have a wide range of persistence. Because the effects of persistence are complicated and subtle, I’ll save that discussion for another day, and simply assume zero persistence from here on out – but bear in mind that if persistence is non-zero, effective latency will be significantly worse than the numbers I discuss below; at 60 Hz, full persistence adds an extra 16 ms to worst-case latency.

So the current total latency is 4+16+16 = 36 ms – a long way from 20 ms, and light-years away from 7 ms.

Changing the rules

Clearly, something has to change in order for latency to get low enough for AR/VR to work well.

On the tracking end, the obvious solution is to use both optical tracking and an IMU, via sensor fusion. The IMU can be used to provide very low-latency state, and optical tracking can be used to correct the IMU’s drift. This turns out to be challenging to do well, and there are no current off-the-shelf solutions that I’m aware of, so there’s definitely an element of changing the hardware rules here. Properly implemented, sensor fusion can reduce the tracking latency to about 1 ms.
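As one concrete (and drastically simplified) illustration of the idea, here’s a one-axis complementary-filter-style sketch: the gyro is integrated at IMU rate for low-latency orientation, and each optical measurement, whenever it arrives, nudges the estimate back toward ground truth to cancel drift. A real system is far more involved – full 6-DOF, Kalman-style filtering, careful timestamp alignment – so this is only a sketch of the concept:

```c
/* One-axis complementary-filter sketch of IMU/optical sensor fusion.
 * The gyro provides a low-latency angular-rate signal that is integrated
 * every sample; occasional optical measurements correct accumulated drift. */
typedef struct {
    double angle_deg;        /* current fused orientation estimate            */
    double correction_gain;  /* how strongly optical fixes pull the estimate  */
} FusedOrientation;

/* Called at IMU rate (e.g. every 1 ms): integrate the gyro rate. */
void fuse_gyro(FusedOrientation *f, double gyro_deg_per_s, double dt_s)
{
    f->angle_deg += gyro_deg_per_s * dt_s;
}

/* Called whenever an optical (camera-based) measurement arrives, typically
 * tens of milliseconds apart: blend toward it to cancel gyro drift. */
void fuse_optical(FusedOrientation *f, double optical_angle_deg)
{
    f->angle_deg += f->correction_gain * (optical_angle_deg - f->angle_deg);
}
```

The latency of the fused estimate is essentially the IMU’s, because the optical data is only used to correct the estimate, never waited on.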

For rendering, there’s not much to be done other than to simplify the scenes to be rendered. AR/VR rendering on PCs will have to be roughly on the order of five-year-old games, which have low enough overall performance demands to allow rendering latencies on the order of 3-5 ms (200-333 Hz). Of course, if you want to do general, walk-around AR, you’ll be in the position of needing to do very-low-latency rendering on mobile processors, and then you’ll need to be at the graphics level of perhaps a 2000-era game at best. This is just one of many reasons that I think walk-around AR is a long way off.

So, after two stages, we’re at a mere 4-6 ms. Pretty good! But now we have to get the rendered pixels onto the display, and it’s here that the hardware rules truly need to be changed, because 60 Hz displays require about 16 ms to scan all the pixels from the frame buffer onto the display, pretty much guaranteeing that we won’t get latency down below 20 ms.

I say “pretty much” because in fact it is theoretically possible to “race the beam,” rendering each scan line, or each small block of scan lines, just before it’s read from the frame buffer and sent to the screen. (It’s called racing the beam because it was developed back when displays were CRTs; the beam was the electron beam.) This approach (which doesn’t work with display types that buffer whole frames, such as color-sequential LCOS) can reduce display latency to just long enough to be sure the rendering of each scan line or block is completed before scan-out of those pixels occurs, on the order of a few milliseconds. With racing the beam, it’s possible to get overall latency down into the neighborhood of that 7 ms holy grail.

Unfortunately, racing the beam requires an unorthodox rendering approach and considerably simplified graphics, because each scan line or block of scan lines has to be rendered separately, at a slightly different point on the game’s timeline. That is, each block has to be rendered at precisely the time that it’s going to be scanned out; otherwise, there’d be no point in racing the beam in the first place. But that means that rather than doing rendering work once every 16.6 ms, you have to do it once per block. Suppose the screen is split into 16 blocks; then one block has to be rendered per millisecond. While the same number of pixels still need to be rendered overall, some data structure – possibly the whole scene database, or maybe just a display list, if results are good enough without stepping the internal simulation to the time of each block – still has to be traversed once per block to determine what to draw. The overall cost of this is likely to be a good deal higher than normal frame rendering, and the complexity of the scenes that could be drawn within 3-5 ms would be reduced accordingly. Anything resembling a modern 3D game – or resembling reality – would be a stretch.

Racing the beam also has to avoid visible shear along the boundaries between blocks. That shear might or might not be acceptable; it would look like tear lines, and tear lines are quite visible and distracting. If that’s a problem, it might work to warp the segments to match up properly. And obviously the number of segments could be increased until no artifacts were visible, at a performance cost; in the limit, you could eliminate all artifacts by rendering each scan line individually, but that would induce a very substantial performance loss. On balance, it’s certainly possible that racing the beam, in one form or another, could be a workable solution for many types of games, but it adds complexity and has a significant performance cost, and overall at this point it doesn’t appear to me to be an ideal general solution to display latency, although I could certainly be wrong.

It would be far easier and more generally applicable to have the display run at 120 Hz, which would immediately reduce display latency to about 8 ms, bringing total latency down to 12-14 ms. Rendering should have no problem keeping up, since we’re already rendering at 200-333 Hz. 240 Hz would be even better, bringing total latency down to 8-10 ms.
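Pulling the estimates from this post together in one place (using the upper end of the 3-5 ms rendering figure; these are rough budgets, not measurements):

```c
/* Rough motion-to-photons budgets from the estimates in this post:
 * tracking + rendering + display, under several display assumptions. */
#include <stdio.h>

static void budget(const char *label, double track_ms, double render_ms, double display_ms)
{
    printf("%-32s %3.0f + %3.0f + %3.0f = %3.0f ms\n",
           label, track_ms, render_ms, display_ms,
           track_ms + render_ms + display_ms);
}

int main(void)
{
    budget("today (60 Hz, full scan-out)",    4, 16, 16);
    budget("fusion + simple scenes, 60 Hz",   1,  5, 16);
    budget("same, 120 Hz display",            1,  5,  8);
    budget("same, 240 Hz display",            1,  5,  4);
    return 0;
}
```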

Higher frame rates would also have benefits in terms of perceived display quality, which I’ll discuss at some point, and might even help reduce simulator sickness. There’s only one problem: for the most part, high-refresh-rate displays suitable for HMDs don’t exist.

For example, the current Oculus Rift prototype uses an LCD phone panel for a display. That makes sense, since phone panels are built in vast quantities and therefore are inexpensive and widely available. However, there’s no reason why a phone panel would run at 120 Hz, since it would provide no benefit to the user, so no one makes a 120 Hz phone panel. It’s certainly possible to do so, and likewise for OLED panels, but unless and until the VR market gets big enough to drive panel designs, or to justify the enormous engineering costs for a custom design, it won’t happen.

There’s another, related potential solution: increase the speed of scan-out and the speed with which displays turn streamed pixel data into photons without increasing the frame rate. For example, suppose that a graphics chip could scan out a frame buffer in 8 ms, even though the frame rate remained at 60 Hz; scan-out would complete in half the frame time, and then no data would be streamed for the next 8 ms. If the display turns that data into photons as soon as it arrives, then overall latency would be reduced by 8 ms, even though the actual frame rate is still 60 Hz. And, of course, the benefits would scale with higher scan-out rates. This approach would not improve perceived display quality as much as higher frame rates would, but neither would it place higher demands on rendering, so no reduction in rendering quality would be required. Like higher frame rates, though, this would only benefit AR/VR, so it is not going to come into existence in the normal course of the evolution of display technology.

And this is where a true Kobayashi Maru moment is needed. Short of racing the beam, there is no way to get low enough display latency out of existing hardware that also has high enough resolution, low enough cost, appropriate image size, compact enough form factor and low enough weight, and suitable pixel quality for consumer-scale AR/VR. (It gets even more challenging when you factor in wide FOV for VR, or see-through for AR.) Someone has to step up and change the hardware rules to bring display latency down. It’s eminently doable, and it will happen – the question is when, and by whom. It’s my hope that if the VR market takes off in the wake of the Rift’s launch, the day when display latency comes down will be near at hand.

If you ever thought that AR/VR was just a simple matter of showing an image on the inside of glasses or goggles, I hope that by this point in the blog it’s become clear just how complex and subtle it is to present convincing virtual images – and we’ve only scratched the surface. Which is why, in the first post, I said we needed smart, experienced, creative hardware and software engineers who work well in teams and can manage themselves – maybe you? – and that hasn’t changed.

A bit of housekeeping

When I put up my first post, back in April, I was immediately buried in email. Over the next few months, I gradually dug out from under the pile, responding to most of the people who had written. However, there were a few hundred that I put in a folder for later, and there they sat until it seemed too late to respond. A lot of those emails were from people asking how they could get into the game industry, or what they should do to get hired at Valve, and for them, the best answers I can give are in this post and the Valve Handbook. If you mailed me back in April, and if after reading the above links and the blog to this point your question or topic remains unaddressed, please send your mail again – my turnaround time is much better these days.

When it comes to resolution, it’s all relative

The first video game I ever wrote ran in monochrome, at 160×72 resolution. My next four games moved up to four colors at 320×200 resolution. The game after that (Quake) normally ran with 256 colors at 320×200 resolution, but could go all the way up to 640×480 on the brand-new Pentium Pro, the first out-of-order Intel processor.

Those sound like Stone Age display modes, now that games routinely run in 24-bit color at 1600×1200 or even 2560×1600, but you know what? All those games looked great at the time. Quake at 640×480 would look pathetically low-resolution now, but when it shipped, even 320×200 looked great; it’s all a matter of what you’re used to.

That’s relevant just now because the first generation of consumer-priced VR head-mounted displays is likely to top out at 960×1080 resolution, for the simple reason that that’s what you get when you split a 1080p screen across two eyes, and 1080p is probably going to be the highest-resolution panel available in the near future that’s small enough to fit in a head-mounted display. At first glance, that doesn’t seem so bad; it falls short of 2560×1600, or even 1600×1200, but it’s half of the latter, so it’s in the same resolution ballpark as monitors. And besides, it’s way higher-resolution than any of my earlier games, and in fact it’s higher-resolution than anything that was available for more than 15 years after the PC was introduced, and, as I noted, those lower-resolution graphics looked great then. By analogy, VR should be in good shape at 960×1080, right?

Alas, it’s not that simple, because when it comes to resolution, it’s all relative. What do I mean by that? There are two very different interpretations, both applicable to the present discussion. We’ve seen the first one already: how good a given resolution looks depends on what you’re used to looking at. 160×72 looks great when the alternative is a text-based game, but less so next to a state-of-the-art game at 2560×1600. This first interpretation applies to VR in two senses. The first is that VR will inevitably be compared to current PC graphics – clearly not a favorable comparison. However, the second is that, like my early games, VR will also be judged against previous VR graphics in the PC space, and that’s a favorable comparison indeed, since there are none. For the latter reason, if VR is a unique enough experience, people will surely be very forgiving about low resolution; the brain is very good at filling in details, given an otherwise compelling experience, as happened, for example, with Quake at 320×200.

Another way to think about resolution, however, is relative to the field of view the pixels are spread across. The total number of pixels matters, of course, but the density of the pixels matters as well, and it’s here that VR faces some unique issues. Let’s run some numbers on that.

My very first game ran on a monitor that I’d estimate to have a horizontal field of view of maybe 15 degrees at a normal viewing distance. At 160×72, that’s about 11 pixels per horizontal degree.

A 30” monitor at 2560×1600 has about a 50-degree field of view at a normal viewing distance. That’s roughly 50 pixels per horizontal degree, and approximately the same is true of a 20” monitor at 1600×1200.

The first consumer VR head-mounted displays should have fields of view that are no less than 90 degrees, and I’d hope for more, because field of view is key to a truly immersive experience. At 960×1080 resolution, that yields slightly less than 11 pixels per horizontal degree – the same horizontal pixel density as the CP/M machine I wrote my first game for in 1980, and barely one-fifth of the horizontal pixel density we routinely use now.

And that’s only the horizontal pixel density. The vertical pixel density is the same, and in combination they mean that a first-generation consumer head-mounted display will have about one-twentieth of the two-dimensional pixel density of a desktop monitor. As another way to understand just how low a wide field of view drives pixel density, consider that the iPhone 5 is 640×1136 – two-thirds as many pixels as the upcoming head-mounted displays, packed into a vastly smaller field of view; at a normal viewing distance, I’d estimate the iPhone has roughly 100 pixels per degree, so overall pixel density could be close to one-hundred times that of upcoming VR head-mounted displays.
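Those comparisons each boil down to a division and a squaring; here they are in one place, using the same rough field-of-view estimates as above:

```c
/* The pixel-density comparisons above, in one place. The fields of view and
 * the iPhone figure are the rough estimates used in the text, not measured. */
#include <stdio.h>

int main(void)
{
    double monitor_ppd = 2560.0 / 50.0;  /* 30" 2560x1600 at ~50 deg HFOV    */
    double hmd_ppd     =  960.0 / 90.0;  /* 960x1080 per eye at ~90 deg HFOV */
    double iphone_ppd  =  100.0;         /* rough per-degree estimate above  */

    printf("monitor: %.0f px/deg   HMD: %.1f px/deg   iPhone: ~%.0f px/deg\n",
           monitor_ppd, hmd_ppd, iphone_ppd);

    /* Density applies per degree in both directions, so 2D ratios are the
     * squares of the linear ratios. */
    printf("monitor vs HMD: ~%.0fx the 2D pixel density\n",
           (monitor_ppd / hmd_ppd) * (monitor_ppd / hmd_ppd));
    printf("iPhone vs HMD:  ~%.0fx the 2D pixel density\n",
           (iphone_ppd / hmd_ppd) * (iphone_ppd / hmd_ppd));
    return 0;
}
```

The two ratios land close to the one-twentieth and nearly-one-hundred-times figures quoted above.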

It is certainly true that the brain can fill in details, especially when viewing scenes filled with moving objects. However, it would be highly optimistic to believe that a reduction in pixel density of more than an order of magnitude wouldn’t be obvious, and indeed it is. It’s certainly hard to miss the difference between these two images, which reflect the same base image at two different pixel densities:

And that’s only a 4X difference – imagine what 20X would be like.

If there were no monitors to compare to, low pixel density might not be as noticeable, but there are, not to mention omnipresent mobile devices with even higher pixel densities. Also, games that depend on very precise aiming may not work well on a head-mounted display where pixel location is accurate to only five or six arc-minutes. For that reason, antialiasing, which effectively provides subpixel positioning, will be very important for at least the first few generations of VR.

That’s not to say that the upcoming VR head-mounted displays won’t be successful; a huge field of view, together with high-quality tracking and low latency, can produce a degree of immersion that’s unlike anything that’s come before, with the potential to revolutionize the whole gaming experience. But I can tell you from personal experience that the visual difference between a 960×1080 40-degree horizontal field of view head-mounted display and a 640×800 90-degree HFOV HMD (both of which I happen to have worked with recently) is enormous – what looks like a blurry clump of pixels on one looks like a little spaceship you could reach out and touch on the other – and that’s only a ten-times difference.

So I’m pretty confident that we’ll be begging for more resolution from our head-mounted displays for a long time. Obviously, that was also the case for decades with monitors; the difference here is that every day we’ll encounter much higher pixel densities on our monitors, our laptops, our tablets, and even our phones than on our head-mounted displays, and that comparison is going to be a challenge for the consumer VR industry for some time to come.

Given which, the obvious question is: how high does VR resolution need to go before it’s good enough? I don’t know what would be ideal, but getting to parity with monitors in terms of pixel density seems like a reasonable target. Given a 90-degree field of view in both directions, 4K-by-4K resolution would be close to achieving that, and 8K-by-8K would exceed it. That doesn’t sound all that far from where monitors are now, but actually it’s four to sixteen times as many pixels; there’s no existing video link that can pump that many pixels – in stereo – at 60 Hz (which is the floor for VR), not to mention the lack of panels or tiny projectors that can come close to those resolutions (and the lack of business reasons to develop them right now), so pixel density parity is not just around the corner. However, if VR can become established as a viable market, competitive pressures of the same sort that operated (and continue to operate) in the 3D graphics chip business will drive VR resolutions, and hence pixel densities, rapidly upward. VR could well become the primary force driving GPU performance as well, because it will take a lot of rendering power to draw 16 megapixels, antialiased, in stereo, at 60 Hz – to say nothing of 64 megapixels.
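To get a feel for why no existing video link can carry that many pixels, here’s the raw arithmetic for uncompressed 24-bit color at 60 Hz in stereo, ignoring blanking and protocol overhead (so real links would need even more):

```c
/* Raw video-link bandwidth needed for the resolutions discussed above:
 * stereo, 60 Hz, uncompressed 24-bit color, ignoring blanking overhead. */
#include <stdio.h>

static void link_rate(const char *label, double w, double h)
{
    double pixels_per_s = w * h * 2 /* eyes */ * 60.0 /* Hz */;
    double gbits_per_s  = pixels_per_s * 24.0 / 1e9;
    printf("%-10s %6.2f Gpixels/s, %7.1f Gbit/s\n",
           label, pixels_per_s / 1e9, gbits_per_s);
}

int main(void)
{
    link_rate("960x1080",  960, 1080);
    link_rate("4Kx4K",    4096, 4096);
    link_rate("8Kx8K",    8192, 8192);
    return 0;
}
```

The first line is roughly what a stereo 960×1080 display already costs; the other two are one to two orders of magnitude beyond it.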

Believe me, I can’t wait to have a 120-by-120-degree field of view at 8K-by-8K resolution – it will (literally) be a sight to behold. But I’m not expecting to behold it for a while.

Two Possible Paths into the Future of Wearable Computing: Part 2 – AR

The year is 2015. Wearable glasses have taken off, and they’re game-changers every bit as much as smartphones were, because these descendants of Google Glass give you access to information everywhere, all the time. You’re wearing these glasses, I’m wearing these glasses, all the early adopters are wearing them. We use them to make and receive phone calls, send and receive texts, do instant messaging, do email, get directions and route guidance, browse the Web, and do pretty much everything we do with smartphones today.

However, these glasses don’t support true AR (augmented reality); that is, they can’t display virtual objects that appear to be part of the real world. Instead, they can only display what I called HUDSpace in the last post – heads-up-display-type information that doesn’t try to seem to be part of the real world (Terminator vision, for example). No one particularly misses AR, because all the functionality of a smartphone is there in a more accessible form, and that’s enough to make the glasses incredibly useful.

But then someone comes out with a special edition of their HUDSpace glasses; the special part is that if you put a marker card down on a convenient surface, you can play virtual card and board games on it, either by yourself or with friends. This is a reasonably popular novelty; then the offerings expand to include anything you can play on a table – strategy games, RTSes, arena-type 2D arcade games extruded into 3D, and new games unique to AR – and someone comes out with a version of the glasses that doesn’t need markers, so you can play games in boring meetings, and suddenly everyone wants one. The race is on, and soon there’s room-scale AR, followed by steady progress on the long march toward general, walk-around AR.

And that’s how I think it’s most likely AR will come into our daily lives.

A quick recap

Last time, I described how my original thinking that AR was likely to be the next great platform shift had evolved to consider the possibility that VR (virtual reality) might be equally important, at least in the short and medium term, and far more tractable today, so perhaps it would make sense to pursue VR as well right now. (See the last post for definitions of AR, VR, and other terms I’ll use in this post.) Then I made the case for VR as the most promising target in the near future. I personally think that case is pretty compelling.

This time I’ll make a case for AR as more likely to succeed even in the short term (last time I explained why I think it’s the most important long-term goal), and I think that case is pretty compelling too. The truth is, given infinite resources, I’d want to pursue both as hard as possible; one doesn’t preclude the other, and both could pan out in a big way. But resources (especially time) are finite, alas, and choices have to be made, so a lot of thought has gone into choosing where to focus, and these two posts recount some of that thinking.

Of course, I hope we get this right, but as Yogi Berra put it: “It’s tough to make predictions, especially about the future.” All we can do is make our best assessment, start doing experiments, and see where that leads, constantly reevaluating and course correcting as needed (and it will be needed!). So I’m by no means laying out a roadmap of the future; this post and the last are just two possible ways wearable computing might unfold.

How AR might actually evolve

The first step in assessing whether to focus on AR or VR is to figure out how each is most likely to succeed, and then to compare the strengths and weaknesses of each of those probable paths. The likely path for VR is obvious, and already in motion: The Oculus Rift will come out, running ports of existing PC games. If it’s even moderately successful, games will be written specifically for VR, Oculus will improve the hardware and competitors will emerge, VR will likely spread to mobile and consoles, and the boom will be on.

The path AR might take to success is less clear, because there are many types of AR – tabletop, room-scale, and walk-around – and several platforms it could emerge on – PC, mobile, and console. Also, as the scenario I sketched out at the start illustrates, AR could evolve from HUDSpace. So let’s look in more detail at that scenario, and then examine why I think it might be more promising than other paths for AR and VR.

In my scenario, AR isn’t even part of the picture at first; see-through glasses emerge, but wearable computing develops along the Google Glass path, supporting only display of information that doesn’t appear to be part of the real world, rather than true AR. To be clear, it’s quite possible that Google Glass won’t be see-through, but will just provide an opaque information display above and to the side of your normal line of sight. However, I think see-through glasses have much more potential, if only because they’ll have much more screen real estate, won’t block your view, will allow for in-place annotation of the real world, and will be more comfortable to look at. That’s a good thing for my scenario, since see-through potentially leads to AR, while an opaque display out of the line of sight doesn’t.

Having information available everywhere, all the time will be tremendously valuable, and HUDSpace glasses will probably become widely used; in fact, you could make a strong argument that people who wear them will seem to be smarter than everyone else, because they will have faster access to information. Think of all the times you’ve hauled out your phone in a conversation to look something up, and now imagine you can do that without having to visibly query anything; you’ll just seem smarter. (Obviously I could be wrong in assuming that HUDSpace glasses will be widely used – it may turn out that people hate having information fed to them through glasses – but certainly there’s a strong argument to be made that better access to information is likely to be compelling, and since the rest of this scenario depends on it, I’ll just take it as a given.)

You may well wonder why these glasses wouldn’t have AR capabilities – after all, even cellphones can do AR today, right? Here I need to draw a distinction between true AR and cellphone AR. Cellphone AR, although interesting, is at best a distant cousin to true AR, for one key reason: cellphone AR doesn’t have to fool your entire visual perception system into thinking virtual images exist in the real world. By this I’m referring not to photorealistic rendering (the eye and brain are quite tolerant of cruder rendering), but rather to the requirement that virtual images appear to be solid, crisp, and in exactly the right place relative to the real world at all times as your head moves. The tolerance of the human visual system for discrepancies in those areas when viewing 3D virtual images that are supposed to appear to be part of the real world – that is, true AR – is astonishingly low; violate exceedingly tight parameters (for example, something on the order of 20 ms for latency), and virtual objects simply won’t seem like they’re part of the world around you. With cellphone AR, you’re just looking at a 2D picture, like a TV, and in that circumstance there are all sorts of automatic reflexes and visual processing that don’t kick in, which greatly relaxes hardware requirements.
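
To put that 20 ms figure in perspective, here’s a back-of-the-envelope sketch in Python. Every number in it is an illustrative assumption on my part (the head-turn rate and the pixels-per-degree ratio in particular), not a measurement from any real system, but it shows why even small latencies produce visible misregistration when the real world is the reference.

    # Rough registration error caused by latency during a head turn.
    # All numbers below are illustrative assumptions, not measurements.
    head_turn_rate_deg_per_s = 60.0   # a moderate, everyday head turn (assumed)
    latency_s = 0.020                 # the ~20 ms figure mentioned above
    pixels_per_degree = 15.0          # assumed ratio of display resolution to field of view

    angular_error_deg = head_turn_rate_deg_per_s * latency_s   # 1.2 degrees
    pixel_error = angular_error_deg * pixels_per_degree        # about 18 pixels

    print(f"A virtual object lags by {angular_error_deg:.1f} degrees, "
          f"or roughly {pixel_error:.0f} pixels - easily visible.")

On a 2D phone screen the same lag just looks like an overlay updating slightly late; only when the image is supposed to stay anchored to the real world does that error register as the virtual object swimming or smearing.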

The visual system’s low tolerance for mismatches between the virtual and real worlds means that the hardware required to make true AR work well is significantly more demanding – and expensive – than the hardware needed for HUDSpace. This is particularly the case for general, walk-around AR, which has to constantly cope with new, wildly varying settings and lighting, but it’s true even for room-scale and tabletop AR, primarily due to the requirements for display technology and tracking of the real world. At some point I’ll post about those areas, but for now, trust me, it’s a lot easier to build glasses that display HUD information, or at most images that are loosely related to the real world (like floating signs in the general direction of restaurants), than it is to build glasses that display virtual images that fool your visual system into thinking they exist in the real world.

Given that true AR is hard, expensive, and not required, HUDSpace glasses will initially almost certainly not support true AR. Interestingly, because they’ll almost certainly be see-through, HUDSpace glasses won’t even support cellphone AR well, since that style of AR relies on showing you a camera view of the real world, redrawn with modifications, on an opaque screen.

So in this scenario, a few years from now we’re all wearing HUDSpace glasses and using them to do what we do now with a smartphone, but more effectively, because the glasses give us access to information all the time, and privately. They’ll also do things that a smartphone isn’t good at, such as popping up the names of people you encounter, which you can’t politely use your smartphone to do. The obvious difference from a smartphone is that the glasses won’t have a capacitive touchscreen, and honestly I don’t know what the input method will be, but there are several plausible answers, so I’ll assume that’ll work out and skip over it for now. Several large companies are making HUDSpace glasses, and the competition is as fierce as it is in smartphones today. All kinds of great apps are being written for the glasses, including HUDSpace versions of existing casual and location-based games, but there are no true AR apps, because the hardware doesn’t support them.

As I described in the opening, it’s at this point that someone will probably put a camera on their glasses that’s good enough for tabletop AR, probably with the help of a fiducial (a marker designed to be tracked by a camera) placed where you want the AR to appear. Add good tracking code, and you’ll be able to play any tabletop game anyone cares to write. The glasses will be networked, so you’ll be able to play any card or board game you can think of, and you’ll be able to do that either with someone sitting at the table with you or with anyone on the Internet. Better tracking hardware and software will eliminate the need for fiducials, and the Tetris or Angry Birds of tabletop AR will appear, sparking a rapidly escalating AR arms race, similar to what happened with 3D accelerators and 3D games. AR will expand to room scale, which will involve group games, of course, and a general expansion of current console gaming possibilities, but also non-game applications like construction kits (living room Minecraft- and Lego-type applications), virtual toys, and virtual pets. At that point there will be a critical mass of AR users, hardware, and software that makes it economically and technically feasible to start chipping away at walk-around AR. It’ll probably take a decade or two, or even more, before truly general AR exists, but it’s easy to see how an accelerating curve heading in that direction could spring from the first wearable glasses that provide a good-enough tabletop AR experience.
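
For a sense of how tractable the fiducial step already is, here’s a minimal sketch of marker-based tabletop tracking using OpenCV’s aruco module (it requires an opencv-contrib build, and the exact function names and return shapes vary between OpenCV versions). The camera intrinsics and marker size below are placeholder assumptions, not values from any real pair of glasses; this is a sketch of the technique, not a product implementation.

    import cv2
    import numpy as np

    # Placeholder intrinsics; a real system would calibrate the glasses' camera.
    camera_matrix = np.array([[800.0,   0.0, 320.0],
                              [  0.0, 800.0, 240.0],
                              [  0.0,   0.0,   1.0]])
    dist_coeffs = np.zeros(5)        # assume negligible lens distortion for this sketch
    marker_side_m = 0.05             # fiducial printed 5 cm on a side (assumed)

    dictionary = cv2.aruco.getPredefinedDictionary(cv2.aruco.DICT_4X4_50)
    cap = cv2.VideoCapture(0)        # a webcam stands in for the headset camera

    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        corners, ids, _ = cv2.aruco.detectMarkers(gray, dictionary)
        if ids is not None:
            # Rotation and translation of each marker relative to the camera;
            # this is the pose a renderer would use to pin virtual pieces to the table.
            rvecs, tvecs, _ = cv2.aruco.estimatePoseSingleMarkers(
                corners, marker_side_m, camera_matrix, dist_coeffs)
            cv2.aruco.drawDetectedMarkers(frame, corners, ids)
        cv2.imshow("tabletop AR tracking sketch", frame)
        if cv2.waitKey(1) & 0xFF == 27:   # Esc to quit
            break

    cap.release()
    cv2.destroyAllWindows()

The markerless step described above keeps the same basic pipeline but replaces the fiducial detector with natural-feature tracking of the table itself, which is considerably harder.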

Why AR is more likely to evolve from HUDSpace than to appear on its own

There are several reasons I think evolving from HUDSpace is a more likely way for AR to come into broad use than emerging as a fully-formed product on its own.

The first thing you’ll notice is that my favored scenario doesn’t involve walk-around AR at all for a long time. That’s a huge plus; even though I think walk-around AR is the end point and hugely valuable, it’s very hard to get to in any near-term timeframe. One problem with a lot of potential technological innovations is that they require abandoning existing systems and making a wholesale jump to a new system, and it’s hard to make all the parts of those sorts of transitions happen successfully at the same time. That’s certainly true of walk-around AR, which would require display, image-generation, and tracking technology that doesn’t exist today, all packaged in a form factor similar to bulky sunglasses, running on a power budget that far exceeds what’s now possible in a mobile device, along with completely new types of applications, as I discussed in the last post. Honestly, though, I used walk-around AR as a strawman in that post; it’s clear that it’s a long way away from being good enough to be a product, so it served as a useful counterpoint to illustrate the advantages of VR.

Constrained AR, both room-scale and tabletop, lies somewhere between walk-around AR and VR, and is much closer than walk-around AR to being ready for broad use, although not as close as VR. Room-scale AR has many of the same technical challenges as walk-around AR, although to a lesser degree; tracking, for example, is difficult, but there are potentially workable, albeit currently expensive, solutions. Tabletop AR, on the other hand, is relatively tractable, although not quite to VR’s level; the problem with tabletop AR is primarily that because it’s so limited, it’s simply not as compelling or novel as room-scale or walk-around AR.

AR that emerges in stages from HUDSpace glasses, on the other hand, doesn’t require any great leaps; each step is an incremental one that stands on its own. Solving those problems separately and incrementally is far more realistic, especially assuming the preexistence of a HUDSpace business that’s big enough to justify the R&D AR will need. As a starting point, tabletop AR that evolves from HUDSpace glasses involves tracking that’s doable today, optics and image projectors that will be a manageable step from HUDSpace, power and processing technologies that will be largely driven by phones, tablets, and HUDSpace glasses, and initial software that’s familiar, including at least the tabletop games I listed in the introduction.

In short, the technological path from HUDSpace glasses to HUDSpace-plus-tabletop-AR glasses seems realistic, while going from nothing directly to walk-around or even room-scale AR seems like a big stretch. That’s true not only technically, but also in a business sense, because HUDSpace-plus-tabletop-AR doesn’t require AR to justify the cost of the hardware by itself; in contrast, standalone AR systems would be in direct competition with consoles and dedicated gaming devices, with all the costs and risks that involves.

Consider two products that support AR. The first product is a special edition of a widely-used pair of HUDSpace glasses that is normally sold for $199; the special edition sells for $299 because it has cameras and more powerful processors that let it support tabletop AR gaming. The second product is a pair of AR glasses designed specifically for living-room use; it supports room-scale AR games that you can play on your own or with friends, and costs $299, plus $199 for each additional pair of glasses.

Even though the pure AR glasses are more powerful and would support a wider variety of novel experiences for the same total price, it’s hard to see how they could be successful unless the experience was truly awesome. At $299 and up, this would be going directly against existing consoles, and it’s hard to make the first games for a whole new type of gaming be killer apps that it’s worth buying the whole system for, because it takes time to figure out what unique experiences new hardware makes possible. Getting developers to devote effort to support a new, unproven platform is hard as well – it obviously can be done, but it’s a major undertaking. Also, the up-front expenditures and risk would be relatively large, since this would be a new type of product that at least overlaps with the existing console space. In short, it would require a console-scale effort, with all the risk a new console with new technology involves. A tabletop AR product would be less of a step into the unknown, and could be somewhat less expensive – but at the same time it would be more limited and less novel than room-scale AR, so there’s still the question of whether it’d be compelling enough to justify the purchase of a complete system. I’d love to be wrong – it’d be great if a standalone tabletop or room-scale AR system could be successful on its own merits. It just seems like such a system would have to overcome considerably greater market and technical challenges than glasses that evolve from HUDSpace.

On the other hand, I have no problem imagining that a lot of people who are buying the HUDSpace glasses anyway – which they will be, because they’re very useful – would spend $100 on an upgrade that makes them more fun to use. The key here is that AR itself doesn’t have to justify the cost of the system, just the much smaller upgrade cost. You might say that’s not fair, that it’s not as powerful a system, but that’s the point – in the beefed-up HUDSpace case, AR doesn’t have to be compelling enough to justify the purchase of the glasses in the first place. If you want to convince people to buy a whole new system to put in their living room, or to buy a dedicated AR system for tabletop gaming, you have to get over the barrier of convincing them that they want to own yet another gaming device. If, on the other hand, you want to sell people established HUDSpace glasses with tabletop AR capability, they’ve already decided to make a purchase, and it’s just a question of whether they want to buy a cool and not very expensive option; in fact, far from being a barrier to purchase, the AR option makes the purchase of HUDSpace glasses more attractive.

Better yet, if you want to play a multiplayer game with someone else, they’re likely to have their own glasses, since HUDSpace glasses will probably be widely used, so there’s no incremental cost for multiplayer. The network effect from widespread adoption based on HUDSpace is a huge advantage for beefed-up HUDSpace glasses.

The bottom line is that the HUDSpace-plus-tabletop-AR scenario is a pull model, with the right incentives; a lot of the hardware and a sizeable market get developed for HUDSpace independent of AR, and AR then serves as an enhancement to help sell HUDSpace glasses into that existing market. In contrast, any scenario involving a standalone AR product is a push model, where a market for a new type of relatively expensive product has to be created and developed rapidly, in competition with existing consoles. It could happen, but it seems less likely to succeed.

Advantages of constrained AR over VR

Now that you know how I think AR is most likely to emerge, and that it will likely be constrained to tabletop and possibly room-scale AR for quite a while, we can return to our original question, which is whether it makes sense to pursue AR only, or a mix of AR and VR, especially in the near- and medium-term. Last time I discussed why VR was interesting; now it’s time to talk about why AR might be more interesting.

I will first note again that last time I compared VR to walk-around AR, and that that was a strawman argument. I don’t think there’s any world in which true walk-around AR is feasible in any way in the next five years. As I discussed above, the challenges that constrained AR – room-scale and tabletop – faces are similar to those of walk-around AR but far less daunting, and constrained AR is probably doable to at least some extent in the next five years, a somewhat but not greatly longer timeframe than VR, so the question is which makes more sense to pursue.

First off, technically VR is easier to implement with existing and near-term technology; that’s just a fact, as evidenced by the Oculus Rift. The Rift definitely has some rough edges to smooth out, but there are ways to address those, and I expect Oculus to ship a credible product at a consumer price; in contrast, as of this writing, I have been unable to obtain a pair of AR glasses capable of being a successful consumer product. The core issue generally has to do with the great difficulty of making good-enough see-through optics in glasses with an acceptable form factor and weight. However, I know of several approaches in development, any of which would be sufficient if all the kinks were ironed out, and it seems probable that this will be solved relatively soon, so it’s a disadvantage for AR, but not a decisive one.

VR is also more immersive in several ways – field of view, blocking of real-world stimuli, and full control over the color and intensity of every pixel – which can make for deeper, more compelling experiences, but there are downsides as well. Immersion may not be good for extended use, either because it induces unpleasant sensory overload or simply because it makes people sick. AR provides anchoring to the real world, and that helps a lot; I personally get simulator sickness quite easily with existing VR systems, but rarely have that problem with AR. I’m confident that AR will be easier for most people to use for long periods than VR.

Another advantage that comes with being less immersive is awareness of the real world around you, and that’s a big one.

For starters, being not-blind means that you can reach for your coffee or soda, find the keyboard and mouse and controller, answer the phone, and see if someone’s come into the room. This is such a big deal that I believe VR will not be widely adopted until VR headsets appear that make it possible to be not-blind instantly, most likely by being able to switch the display over to the feed from a camera on the headset with a touch of a button, but also possibly with a small picture-in-picture feed from the camera while otherwise immersed in VR, or with a display that can become transparent instantly.
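
As a rough illustration of what not-blind at the touch of a button might look like, here’s a toy sketch that toggles between a placeholder VR scene, a full camera passthrough, and a picture-in-picture inset. The webcam, key bindings, and window are stand-ins I’ve made up for this example, not any real headset’s API; it’s only meant to show how cheap the switching logic itself is.

    import cv2
    import numpy as np

    cap = cv2.VideoCapture(0)                 # stand-in for a headset-mounted camera
    mode = "vr"                               # "vr", "passthrough", or "pip"

    while True:
        ok, cam = cap.read()
        if not ok:
            break
        vr_frame = np.zeros_like(cam)         # placeholder for the rendered VR scene
        cv2.putText(vr_frame, "VR scene", (30, 60),
                    cv2.FONT_HERSHEY_SIMPLEX, 1.5, (255, 255, 255), 2)

        if mode == "passthrough":             # full switch to the camera feed
            out = cam
        elif mode == "pip":                   # small camera inset while staying in VR
            out = vr_frame.copy()
            h, w = cam.shape[:2]
            inset = cv2.resize(cam, (w // 4, h // 4))
            out[10:10 + h // 4, 10:10 + w // 4] = inset
        else:
            out = vr_frame

        cv2.imshow("not-blind sketch", out)
        key = cv2.waitKey(1) & 0xFF
        if key == ord("v"):
            mode = "vr"
        elif key == ord("p"):
            mode = "passthrough"
        elif key == ord("i"):
            mode = "pip"
        elif key == 27:                       # Esc to quit
            break

    cap.release()
    cv2.destroyAllWindows()

A real headset would presumably do this switching at the display or compositor level rather than in application code, but the idea of keeping the camera feed one input away at all times is the same.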

Being not-blind also means that you can give only a part of your attention to AR. For example, you could have an in-progress game of chess sitting on a corner of your desk; you’d notice it every so often, but you wouldn’t have to be focused on it all the time. This lets you use AR a lot of the time, in a variety of situations. In contrast, when you’re doing something in VR, it’s the only thing you can be doing, which considerably limits the possibilities. Being not-blind also means that you can be mobile while using AR, even if only to move around a table for a better view, while VR pretty much requires you to be immobile, further limiting the possibilities.

Most important, being able to see the real world means that you can have far more social AR experiences with other people than you can with VR. Sitting around the table with your family playing a board game, sitting on the couch with a friend seeing how high you can build a tower together, or having a quadcopter dogfight are all appealing in very different ways from isolated VR experiences – and, given how intensely social humans are, arguably in more compelling ways. In this respect, AR experiences will be more complex and unique than VR experiences, since they will incorporate both the real world and that most unpredictable and creative of factors, other people, and consequently have greater potential.

Finally, constrained AR is on the path to walk-around AR, and walk-around AR is where I think we all end up eventually.

So, AR or VR?

At long last, to quote the renowned technology sage Meat Loaf: “What’s it gonna be, boy?” Unfortunately, after our long and interesting journey through possible futures, I’m not going to give you the crisp, decisive answer you (and I) would like, because there are two time frames and two scopes at work here.

There’s no way it makes sense to simply abandon AR for VR. Interaction with the real world and especially with other people is why AR is the right target in the long run; we live our lives in the real world and in the company of other people, and eventually AR will be woven deeply into our lives. In the medium term, I believe AR will likely emerge from HUDSpace roughly along the lines of the scenario above; another possibility is that a console manufacturer will decide to make room-scale AR a key feature, as hinted at by the purported leak of Microsoft’s Project Fortaleza a few months ago. All this makes it highly likely that work on tabletop and room-scale AR now will bear fruit in the future; it might be a little early right now to be working on that, but the problems are challenging and will take time to solve, so it makes sense to investigate them now.

In the near term, though, VR hardware will be shipping, and because the requirements are more limited, it should improve more rapidly than AR hardware. Also, it’s easier to adapt existing AAA titles to VR, and while VR won’t really take off until there are great games built around what VR can do, AAA titles should get VR off the ground and attract a hard-core gaming audience. And a lot of the work done on VR will benefit AR as well.

So my personal opinion (which is not necessarily Valve’s) is that it makes sense to do VR now, and push it forward as quickly as possible, but at the same time to continue research into the problems unique to AR, with an eye to tilting more and more toward AR over time as it matures. As I said, it’s not the definitive answer we’d all like, but it’s where my thinking has led me. However, I’ve encountered intelligent opinions from one end of the spectrum to the other, and I look forward to continuing the discussion in the comments.