Christmas always makes me think of Mode X – which surely requires some explanation, since it’s not the most common association with this time of year (or any time of year, for that matter).
IBM introduced the VGA graphics chip in the late 1980s as a replacement for the EGA. The biggest improvement in the VGA was the addition of the first 256-color mode IBM had ever supported – Mode 0x13 – sporting 320×200 resolution. Moreover, Mode 0x13 had an easy-to-program linear bitmap, in contrast to the Byzantine architecture of the older 16-color modes, which involved four planes and used four different pixel access modes, controlled through a variety of latches and registers. So Mode 0x13 was a great addition, but it had one downside – it was slow.
Mode 0x13 only allowed one byte of display memory – one pixel – to be modified per write access; even if you did 16-bit writes, they got broken into two 8-bit writes. The hardware used in 16-color modes, for all its complexity, could write a byte to each of the four planes at once, for a total of 32 bits modified per write. That four-times difference meant that Mode 0x13 was by far the slowest video mode.
Mode 0x13 also didn’t result in square pixels; the standard monitor aspect ratio was 4:3, which was a perfect match for the 640×480 high-res 16-color mode, but not for Mode 0x13’s 320×200. Mode 0x13 was limited to 320×200 because the video memory window was only 64KB, and 320×240 wouldn’t have fit. 16-color modes didn’t have that problem; all four planes were mapped into the same memory range, so they could each be 64KB in size.
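The memory-window arithmetic is easy to check. Here is a minimal sketch in Python; the function names are my own, chosen for illustration, and the only facts it relies on are the 64KB window and one byte per 256-color pixel.

```python
# Sketch: does a 256-color frame fit in the VGA's 64 KB CPU window?
WINDOW_BYTES = 64 * 1024  # size of the video memory window


def frame_bytes(width, height, bytes_per_pixel=1):
    """Bytes needed for one linear frame at one byte per 256-color pixel."""
    return width * height * bytes_per_pixel


def fits_in_window(width, height):
    """True if the whole linear frame is addressable inside the window."""
    return frame_bytes(width, height) <= WINDOW_BYTES


if __name__ == "__main__":
    for w, h in [(320, 200), (320, 240)]:
        print(f"{w}x{h}: {frame_bytes(w, h)} bytes, fits={fits_in_window(w, h)}")
```

A 320×200 frame needs 64,000 bytes and squeaks in under the 65,536-byte limit; 320×240 needs 76,800 bytes and does not.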
In December of 1989, I remember I was rolling Mode 0x13’s aspect ratio around in my head on and off for days, thinking how useful it would be if it could support square pixels. It felt like there was a solution there, but I just couldn’t tease it out. One afternoon, my family went to get a Christmas tree, and we brought it back and set it up and started to decorate it. For some reason, the aspect ratio issue started nagging at me, and I remember sitting there for a minute, watching everyone else decorate the tree, phased out while ideas ran through my head, almost like that funny stretch of scrambled thinking just before you fall asleep. And then, for no apparent reason, it popped into my head:
Treat it like a 16-color mode.
You see, the CPU-access side of the VGA’s frame buffer (that is, reading and writing of its contents by software) and the CRT controller side (reading of pixels to display them) turned out to be completely independently configurable. I could leave the CRT controller set up to display 256 colors, but reconfigure CPU access to allow writing to four planes at once, with all the performance benefits of the 16-color hardware – and, as it turned out, a write that modified all four planes would update four consecutive pixels in 256-color mode. This meant fills and copies could go four times as fast. Better yet, the 64KB memory window limitation went away, because now four times as many bytes could be addressed in that window, so a few simple tweaks to get the CRT controller to scan out more lines produced a 320×240 mode, which I dubbed “Mode X” and wrote up in the July 1991 issue of Dr. Dobb’s Journal. Mode X was widely used in games for the next few years, until higher-res linear 256-color modes with fast 16-bit access became standard.
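To make the four-pixels-per-write idea concrete, here is a small software model in Python. It is only a sketch: the four planes are plain byte arrays, `write` stands in for a CPU write with a plane mask, and all the names are hypothetical. No real VGA registers are touched. The layout it models puts pixel (x, y) in plane x mod 4 at offset y × 80 + x / 4.

```python
# Software model of a planar 320x240, 256-color layout (no real hardware access).
WIDTH, HEIGHT, PLANES = 320, 240, 4
BYTES_PER_ROW = WIDTH // PLANES  # 80 CPU addresses per scan line
planes = [bytearray(BYTES_PER_ROW * HEIGHT) for _ in range(PLANES)]


def locate(x, y):
    """Return (plane, offset) holding pixel (x, y) in the planar layout."""
    return x % PLANES, y * BYTES_PER_ROW + x // PLANES


def write(offset, color, plane_mask=0b1111):
    """Model one CPU write: the byte lands in every plane enabled in the mask."""
    for p in range(PLANES):
        if plane_mask & (1 << p):
            planes[p][offset] = color


def read_pixel(x, y):
    """Read back the color of pixel (x, y)."""
    p, offset = locate(x, y)
    return planes[p][offset]


def fill_screen(color):
    """Fill the screen with all planes enabled; returns the number of writes."""
    writes = 0
    for offset in range(BYTES_PER_ROW * HEIGHT):
        write(offset, color)  # one write sets four consecutive pixels
        writes += 1
    return writes
```

In this model a full-screen fill takes 19,200 writes rather than 76,800, and those 19,200 addresses fit comfortably inside the 64KB window, which is where both the speedup and the extra scan lines come from.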
If you’re curious about the details of Mode X – and there’s no reason you should be, because it’s been a long time since it’s been useful – you can find them here, in Chapters 47-49.
One interesting aspect of Mode X is that it was completely obvious in retrospect – but then, isn’t everything? Getting to that breakthrough moment is one of the hardest things there is, because it’s not a controllable, linear process; you need to think and work hard at a problem to make it possible to have the breakthrough, but often you then need to think about or do something – anything – else, and only then does the key thought slip into your mind while you’re not looking for it.
The other interesting aspect is that everyone knew that there was a speed-of-light limit on 256-color performance on the VGA – and then Mode X made it possible to go faster than that limit by changing the hardware rules. You might think of Mode X as a Kobayashi Maru mode.
Which brings us, neat as a pin, to today’s topic: when it comes to latency, virtual reality (VR) and augmented reality (AR) are in need of some hardware Kobayashi Maru moments of their own.
When it comes to VR and AR, latency is fundamental – if you don’t have low enough latency, it’s impossible to deliver good experiences, by which I mean virtual objects that your eyes and brain accept as real. By “real,” I don’t mean that you can’t tell they’re virtual by looking at them, but rather that your perception of them as part of the world as you move your eyes, head, and body is indistinguishable from your perception of real objects. The key to this is that virtual objects have to stay in very nearly the same perceived real-world locations as you move; that is, they have to register as being in almost exactly the right position all the time. Being right 99 percent of the time is no good, because the occasional mis-registration is precisely the sort of thing your visual system is designed to detect, and will stick out like a sore thumb.
Assuming accurate, consistent tracking (and that’s a big if, as I’ll explain one of these days), the enemy of virtual registration is latency. If too much time elapses between the time your head starts to turn and the time the image is redrawn to account for the new pose, the virtual image will drift far enough so that it has clearly wobbled (in VR), or so that it is obviously no longer aligned with the same real-world features (in AR).
How much latency is too much? Less than you might think. For reference, games generally have latency from mouse movement to screen update of 50 ms or higher (sometimes much higher), although I’ve seen numbers as low as about 30 ms for graphically simple games running with tearing (that is, with vsync off). In contrast, I can tell you from personal experience that more than 20 ms is too much for VR and especially AR, but research indicates that 15 ms might be the threshold, or even 7 ms.
AR/VR is so much more latency-sensitive than normal games because, as described above, virtual images are expected to stay stable with respect to the real world as you move, while with normal games, your eye and brain know they’re looking at a picture. With AR/VR, all the processing power that originally served to detect anomalies that might indicate the approach of a predator or the availability of prey is brought to bear on bringing virtual images that are wrong by more than a tiny bit to your attention. That includes images that shift when you move, rather than staying where they’re supposed to be – and that’s exactly the effect that latency has.
Suppose you rotate your head at 60 degrees/second. That sounds fast, but in fact it’s just a slow turn; you are capable of moving your head at hundreds of degrees/second. Also suppose that latency is 50 ms and resolution is 1K x 1K over a 100-degree FOV. Then as your head turns, the virtual images being displayed are based on 50 ms-old data, which means that their positions are off by three degrees, which is wider than your thumb held at arm’s length. Put another way, the object positions are wrong by 30 pixels. Either way, the error is very noticeable.
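The arithmetic behind those numbers, as a short Python sketch. The function name is hypothetical, and I'm treating "1K" as 1,000 pixels, which works out to 10 pixels per degree over a 100-degree FOV.

```python
def registration_error(head_rate_deg_s, latency_ms, pixels_across, fov_deg):
    """Return (error in degrees, error in pixels) for a steady head turn."""
    # The image is drawn for where the head was latency_ms ago.
    error_deg = head_rate_deg_s * latency_ms / 1000.0
    # Convert the angular error to pixels using the display's pixels per degree.
    error_px = error_deg * pixels_across / fov_deg
    return error_deg, error_px


if __name__ == "__main__":
    deg, px = registration_error(60.0, 50.0, 1000, 100.0)
    print(f"{deg:.1f} degrees, {px:.0f} pixels")
```

Plugging in a faster head turn or a denser display shows how quickly the pixel error grows: both factors multiply it directly.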
You can do prediction to move the drawing position to the right place, and that works pretty well most of the time. Unfortunately, when there is a sudden change of direction, the error becomes even bigger than with no prediction. Again, it’s the anomalies that are noticeable, and reversal of direction is a common situation that causes huge anomalies.
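Here is a toy illustration of why reversals hurt. It is a sketch under simplifying assumptions, not a real predictor: yaw only, a constant 60 degrees/second turn that instantly reverses at a hypothetical time `T_REVERSE`, 50 ms of latency, and the simplest possible prediction (extrapolate the last known velocity across the latency interval).

```python
LATENCY_S = 0.050   # time between sampling the pose and showing the image
RATE = 60.0         # head turn rate in degrees/second
T_REVERSE = 1.0     # hypothetical moment the head reverses direction


def true_yaw(t):
    """Yaw in degrees: turn one way until T_REVERSE, then turn back."""
    if t <= T_REVERSE:
        return RATE * t
    return RATE * T_REVERSE - RATE * (t - T_REVERSE)


def yaw_rate(t):
    """Angular velocity in degrees/second at time t."""
    return RATE if t < T_REVERSE else -RATE


def errors_at(t):
    """Return (error without prediction, error with prediction) at display time t."""
    sample_t = t - LATENCY_S                      # when the pose was sampled
    stale = true_yaw(sample_t)                    # no prediction: use the old pose
    predicted = stale + yaw_rate(sample_t) * LATENCY_S  # constant-velocity guess
    actual = true_yaw(t)
    return abs(actual - stale), abs(actual - predicted)
```

During the steady turn, `errors_at(0.5)` gives about three degrees of error without prediction and essentially none with it. Just after the reversal, `errors_at(1.049)` gives roughly 2.9 degrees without prediction and roughly 5.9 degrees with it, because the predictor confidently extrapolated in the wrong direction.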
Finally, latency seems to be connected to simulator sickness, and the higher the latency, the worse the effect.
So we need to get latency down to 20 ms, or possibly much less. Even 20 ms is very hard to achieve on existing hardware, and 7 ms, while not impossible, would require significant compromises and some true Kobayashi Maru maneuvers. Let’s look at why that is.
The following steps have to happen in order to draw a properly registered AR/VR image:
1) Tracking has to determine the exact pose of the HMD – that is, the exact position and orientation in the real world.
2) The application has to render the scene, in stereo, as viewed from that pose. Antialiasing is not required but is a big plus, because, as explained in the last post, pixel density is low for wide-FOV HMDs.
3) The graphics hardware has to transfer the rendered scene to the HMD’s display. This is called scan-out, and involves reading sequentially through the frame buffer from top to bottom, moving left to right within each scan line, and streaming the pixel data for the scene over a link such as HDMI to the display.
4) Based on the received pixel data, the display has to start emitting photons for each pixel.
5) At some point, the display has to stop emitting those particular photons for each pixel, either because pixels aren’t full-persistence (as with scanning lasers) or because the next frame needs to be displayed.
There’s generally additional buffering that happens in 3D pipelines, but I’m going to ignore that, since it’s not an integral part of the process of generating an AR/VR scene.
Let’s look at each of the three areas (tracking, rendering, and display) in turn.
Tracking latency is highly dependent on the system used. An IMU (3-DOF gyro and 3-DOF accelerometer) has very low latency – on the order of 1 ms – but drifts. In particular, position derived from the accelerometer drifts badly, because it’s derived via double integration from acceleration. Camera-based tracking doesn’t drift, but has high latency due to the need to capture the image, transfer it to the computer, and process the image to determine the pose; that can easily take 10-15 ms. Right now, one of the lowest-latency non-drifting accurate systems out there is a high-end system from NDI, which has about 4 ms of latency, so we’ll use that for the tracking latency.
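To see why double integration is so unforgiving, consider a stationary accelerometer with a small constant bias. The sketch below, in Python, integrates that bias twice with simple Euler steps; the bias value and sample rate are made-up illustrative numbers, not measurements of any real IMU.

```python
def position_drift(bias_m_s2, seconds, sample_hz=1000):
    """Position error in meters after integrating a constant bias twice."""
    dt = 1.0 / sample_hz
    velocity = 0.0
    position = 0.0
    for _ in range(int(round(seconds * sample_hz))):
        velocity += bias_m_s2 * dt   # first integration: acceleration -> velocity
        position += velocity * dt    # second integration: velocity -> position
    return position


if __name__ == "__main__":
    for t in (1, 10):
        print(f"after {t:>2} s: {position_drift(0.01, t):.4f} m")
```

The error grows with the square of elapsed time (close to 0.5 × bias × t²), so a bias of just 0.01 m/s² produces about 5 mm of error after one second and about half a meter after ten.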
Rendering latency depends on CPU and GPU capabilities and on the graphics complexity of the scene being drawn. Most games don’t attain 60 Hz consistently, so they typically have rendering latency of more than 16 ms, which is too high for AR/VR, which requires at least 60 Hz for a good experience. Older games can run a lot faster, up to several hundred Hz, but that’s because they’re doing relatively unsophisticated rendering. So let’s say rendering latency is 16 ms.
Once generated, the rendered image has to be transferred to the display. How long that takes for any particular pixel depends on the display technology and generally varies across the image, but for scan-based display technology, which is by far the most common, the worst case is nearly one full frame time between the frame buffer update and the scan-out of the most delayed pixel. At 60 Hz, that’s about 16 ms. For example, suppose a frame finishes rendering just as scan-out starts to read the topmost scan line on the screen. Then the topmost scan line will have almost no scan-out latency, but it will be nearly 16 ms (almost a full frame time, though not quite, because there’s a vertical blanking period between successive frames) before scan-out reads the bottommost scan line on the screen and sends its pixel data to the display.
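A rough way to picture how scan-out latency varies down the screen, sketched in Python. It assumes the case described above, where rendering finishes just as scan-out begins at the top line, and it assumes lines are read at a uniform rate; `vblank_fraction` is a hypothetical parameter for the share of the frame time spent in vertical blanking.

```python
def scanout_latency_ms(line, total_lines, refresh_hz=60.0, vblank_fraction=0.0):
    """Delay between the end of rendering and scan-out of a given scan line.

    Assumes rendering completes exactly as scan-out starts reading line 0.
    """
    frame_ms = 1000.0 / refresh_hz
    active_ms = frame_ms * (1.0 - vblank_fraction)  # time spent reading visible lines
    return active_ms * line / total_lines


if __name__ == "__main__":
    for line in (0, 500, 999):
        print(f"line {line:>3}: {scanout_latency_ms(line, 1000):.2f} ms")
```

The top line goes out almost immediately, the middle of the screen waits about half a frame, and the bottom line waits nearly the whole frame time.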
Sometimes each pixel’s data is immediately displayed as it arrives, as is the case with some scanning lasers and OLEDs. Sometimes it’s buffered and displayed a frame or more later, as with color-sequential LCOS, where the red components of all the pixels are illuminated at the same time, then the same is done separately for green, and then again for blue. Sometimes the pixel data is immediately applied, but there is a delay before the change is visible; for example, LCD panels take several milliseconds at best to change state. Some televisions even buffer multiple frames in order to do image processing. However, in the remainder of this discussion I’ll assume the best case, which is that we’re using a display that turns pixel data into photons as soon as it arrives.
Once the photons are emitted, there is no perceptible time before they reach your eye, but there’s still one more component to latency, and that’s the time until the photons from a pixel stop reaching your eye. That might not seem like it matters, but it can be very important when you’re wearing an HMD and the display is moving relative to your eye, because the longer a given pixel state is displayed, the farther it gets from its correct position, and the more it smears. From a latency perspective, far better for each pixel to simply illuminate briefly and then turn off, which scanning lasers do, than to illuminate and stay on for the full frame time, which some OLEDs and LCDs do. Many displays fall in between; CRTs have relatively low persistence, for example, and LCDs and OLEDs can have a wide range of persistence. Because the effects of persistence are complicated and subtle, I’ll save that discussion for another day, and simply assume zero persistence from here on out – but bear in mind that if persistence is non-zero, effective latency will be significantly worse than the numbers I discuss below; at 60 Hz, full persistence adds an extra 16 ms to worst-case latency.
So the current total latency is 4+16+16 = 36 ms – a long way from 20 ms, and light-years away from 7 ms.
Clearly, something has to change in order for latency to get low enough for AR/VR to work well.
On the tracking end, the obvious solution is to use both optical tracking and an IMU, via sensor fusion. The IMU can be used to provide very low-latency state, and optical tracking can be used to correct the IMU’s drift. This turns out to be challenging to do well, and there are no current off-the-shelf solutions that I’m aware of, so there’s definitely an element of changing the hardware rules here. Properly implemented, sensor fusion can reduce the tracking latency to about 1 ms.
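One of the simplest ways to fuse the two sources is a complementary filter, sketched below in Python for a single yaw angle. To be clear, this is an illustrative stand-in and not a description of any particular tracking system: the head is assumed to be perfectly still, the gyro has a made-up constant bias, the optical measurement is idealized as drift-free and instantaneous (real camera tracking arrives late), and the gain and rates are arbitrary.

```python
def simulate(seconds=10.0, imu_hz=1000, optical_every=17,
             gyro_bias_deg_s=1.0, gain=0.1):
    """Return (gyro-only yaw, fused yaw) in degrees for a head that never moves."""
    dt = 1.0 / imu_hz
    true_yaw = 0.0
    gyro_only = 0.0
    fused = 0.0
    for step in range(1, int(seconds * imu_hz) + 1):
        measured_rate = 0.0 + gyro_bias_deg_s  # true rate is zero; gyro reports its bias
        gyro_only += measured_rate * dt        # pure integration drifts without bound
        fused += measured_rate * dt            # low-latency update on every IMU sample
        if step % optical_every == 0:          # roughly 60 Hz optical updates
            optical_yaw = true_yaw             # idealized drift-free measurement
            fused += gain * (optical_yaw - fused)  # nudge the estimate toward optical
    return gyro_only, fused


if __name__ == "__main__":
    gyro_only, fused = simulate()
    print(f"gyro only: {gyro_only:.2f} deg, fused: {fused:.3f} deg")
```

With these made-up numbers the gyro-only estimate wanders off by about ten degrees in ten seconds, while the fused estimate stays within a fraction of a degree, yet still updates at the IMU's rate. Dealing with the optical system's own latency and noise is where the real difficulty lies.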
For rendering, there’s not much to be done other than to simplify the scenes to be rendered. AR/VR rendering on PCs will have to be roughly on the order of five-year-old games, which have low enough overall performance demands to allow rendering latencies on the order of 3-5 ms (200-333 Hz). Of course, if you want to do general, walk-around AR, you’ll be in the position of needing to do very-low-latency rendering on mobile processors, and then you’ll need to be at the graphics level of perhaps a 2000-era game at best. This is just one of many reasons that I think walk-around AR is a long way off.
So, after two stages, we’re at a mere 4-6 ms. Pretty good! But now we have to get the rendered pixels onto the display, and it’s here that the hardware rules truly need to be changed, because 60 Hz displays require about 16 ms to scan all the pixels from the frame buffer onto the display, pretty much guaranteeing that we won’t get latency down below 20 ms.
I say “pretty much” because in fact it is theoretically possible to “race the beam,” rendering each scan line, or each small block of scan lines, just before it’s read from the frame buffer and sent to the screen. (It’s called racing the beam because it was developed back when displays were CRTs; the beam was the electron beam.) This approach (which doesn’t work with display types that buffer whole frames, such as color-sequential LCOS) can reduce display latency to just long enough to be sure the rendering of each scan line or block is completed before scan-out of those pixels occurs, on the order of a few milliseconds. With racing the beam, it’s possible to get overall latency down into the neighborhood of that 7 ms holy grail.
Unfortunately, racing the beam requires an unorthodox rendering approach and considerably simplified graphics, because each scan line or block of scan lines has to be rendered separately, at a slightly different point on the game’s timeline. That is, each block has to be rendered at precisely the time that it’s going to be scanned out; otherwise, there’d be no point in racing the beam in the first place. But that means that rather than doing rendering work once every 16.6 ms, you have to do it once per block. Suppose the screen is split into 16 blocks; then one block has to be rendered per millisecond. While the same number of pixels still need to be rendered overall, some data structure – possibly the whole scene database, or maybe just a display list, if results are good enough without stepping the internal simulation to the time of each block – still has to be traversed once per block to determine what to draw. The overall cost of this is likely to be a good deal higher than normal frame rendering, and the complexity of the scenes that could be drawn within 3-5 ms would be reduced accordingly. Anything resembling a modern 3D game – or resembling reality – would be a stretch.
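The timing for the 16-block case can be laid out in a few lines of Python. This is a sketch of the schedule only, with hypothetical function names; it ignores vertical blanking and assumes each block is rendered in the interval just before its scan-out starts.

```python
def block_schedule(num_blocks=16, refresh_hz=60.0):
    """Return (block index, scan-out start in ms, render budget in ms) per block."""
    frame_ms = 1000.0 / refresh_hz
    block_ms = frame_ms / num_blocks  # time available to render each block
    return [(b, b * block_ms, block_ms) for b in range(num_blocks)]


def traversals_per_second(num_blocks=16, refresh_hz=60.0):
    """How often the scene data has to be walked to decide what to draw."""
    return num_blocks * refresh_hz


if __name__ == "__main__":
    for block, start_ms, budget_ms in block_schedule()[:3]:
        print(f"block {block}: scan-out at {start_ms:.2f} ms, budget {budget_ms:.2f} ms")
    print(f"scene traversals per second: {traversals_per_second():.0f}")
```

Each block gets roughly a millisecond, and the scene data is traversed 960 times a second instead of 60, which is the overhead that eats into scene complexity.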
Racing the beam also raises the problem of visible shear along the boundaries between blocks. Shear might or might not be acceptable, but it would look like tear lines, and tear lines are quite visible and distracting. If that’s a problem, it might work to warp the segments to match up properly. And obviously the number of segments could be increased until no artifacts were visible, at a performance cost; in the limit, you could eliminate all artifacts by rendering each scan line individually, but that would induce a very substantial performance loss. On balance, it’s certainly possible that racing the beam, in one form or another, could be a workable solution for many types of games, but it adds complexity and has a significant performance cost, and at this point it doesn’t appear to me to be an ideal general solution to display latency, although I could certainly be wrong.
It would be far easier and more generally applicable to have the display run at 120 Hz, which would immediately reduce display latency to about 8 ms, bringing total latency down to 12-14 ms. Rendering should have no problem keeping up, since we’re already rendering at 200-333 Hz. 240 Hz would be even better, bringing total latency down to 8-10 ms.
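The budget is simple enough to tabulate. The Python sketch below adds up the three stages using 1 ms for fused tracking, 3-5 ms for rendering, and one full frame time for worst-case display latency; the function name is hypothetical. It uses the exact frame time (1000 / refresh rate) rather than the rounded 16, 8, and 4 ms figures, so its totals come out a fraction of a millisecond higher than the rounded ranges.

```python
def total_latency_ms(refresh_hz, tracking_ms=1.0, render_ms=(3.0, 5.0)):
    """Return (low, high) worst-case total latency in ms at a given refresh rate."""
    display_ms = 1000.0 / refresh_hz  # worst case: one full frame of scan-out
    low = tracking_ms + render_ms[0] + display_ms
    high = tracking_ms + render_ms[1] + display_ms
    return low, high


if __name__ == "__main__":
    for hz in (60, 120, 240):
        low, high = total_latency_ms(hz)
        print(f"{hz:>3} Hz: {low:.1f}-{high:.1f} ms")
```

At 60 Hz the display term alone pushes the total past 20 ms; doubling the refresh rate removes about 8 ms, and doubling it again removes about 4 more.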
Higher frame rates would also have benefits in terms of perceived display quality, which I’ll discuss at some point, and might even help reduce simulator sickness. There’s only one problem: for the most part, high-refresh-rate displays suitable for HMDs don’t exist.
For example, the current Oculus Rift prototype uses an LCD phone panel for a display. That makes sense, since phone panels are built in vast quantities and therefore are inexpensive and widely available. However, there’s no reason why a phone panel would run at 120 Hz, since it would provide no benefit to the user, so no one makes a 120 Hz phone panel. It’s certainly possible to do so, and likewise for OLED panels, but unless and until the VR market gets big enough to drive panel designs, or to justify the enormous engineering costs for a custom design, it won’t happen.
There’s another, related potential solution: increase the speed of scan-out and the speed with which displays turn streamed pixel data into photons without increasing the frame rate. For example, suppose that a graphics chip could scan out a frame buffer in 8 ms, even though the frame rate remained at 60 Hz; scan-out would complete in half the frame time, and then no data would be streamed for the next 8 ms. If the display turns that data into photons as soon as it arrives, then overall latency would be reduced by 8 ms, even though the actual frame rate is still 60 Hz. And, of course, the benefits would scale with higher scan-out rates. This approach would not improve perceived display quality as much as higher frame rates, but neither does it place higher demands on rendering, so no reduction in rendering quality is required. Like higher frame rates, though, this would only benefit AR/VR, so it is not going to come into existence in the normal course of the evolution of display technology.
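In other words, worst-case display latency tracks the scan-out time, not the frame period. A minimal Python sketch of that relationship, with a hypothetical function name and assuming a display that emits photons as soon as pixel data arrives:

```python
def display_latency(refresh_hz=60.0, scanout_ms=None):
    """Return (worst-case display latency in ms, idle time per frame in ms)."""
    frame_ms = 1000.0 / refresh_hz
    if scanout_ms is None:
        scanout_ms = frame_ms  # conventional case: scan-out fills the whole frame
    if scanout_ms > frame_ms:
        raise ValueError("scan-out cannot take longer than one frame")
    # The last pixel goes out scanout_ms after the first; the link then sits idle.
    return scanout_ms, frame_ms - scanout_ms


if __name__ == "__main__":
    print(display_latency(60.0))        # conventional full-frame scan-out
    print(display_latency(60.0, 8.0))   # scan-out squeezed into 8 ms
```

At 60 Hz with an 8 ms scan-out, the worst-case display latency drops from a full frame time to 8 ms, and the link is simply quiet for the rest of each frame.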
And this is where a true Kobayashi Maru moment is needed. Short of racing the beam, there is no way to get low enough display latency out of existing hardware that also has high enough resolution, low enough cost, appropriate image size, compact enough form factor and low enough weight, and suitable pixel quality for consumer-scale AR/VR. (It gets even more challenging when you factor in wide FOV for VR, or see-through for AR.) Someone has to step up and change the hardware rules to bring display latency down. It’s eminently doable, and it will happen – the question is when, and by whom. It’s my hope that if the VR market takes off in the wake of the Rift’s launch, the day when display latency comes down will be near at hand.
If you ever thought that AR/VR was just a simple matter of showing an image on the inside of glasses or goggles, I hope that by this point in the blog it’s become clear just how complex and subtle it is to present convincing virtual images – and we’ve only scratched the surface. Which is why, in the first post, I said we needed smart, experienced, creative hardware and software engineers who work well in teams and can manage themselves – maybe you? – and that hasn’t changed.