Using integrated ML to deliver low-latency mobile VR graphics

3/5/2020

What it is:

A new low-latency, power-efficient framework for running machine learning in the rendering pipeline for standalone VR devices that use mobile chipsets. This architecture makes it possible to use ML on these devices to significantly improve image quality and video rendering.

We have created a sample application under this framework to reconstruct higher-resolution rendering (known as super-resolution) to improve VR graphics fidelity on a mobile chipset with minimal compute resources. This new framework can also be used to perform compression artifact removal for streaming content, frame prediction, feature analysis, and feedback for guided foveated rendering.

How it works:

In a typical mobile VR rendering system, the application engine retrieves movement-tracking data at the beginning of each frame and uses this information to generate images for each eye. To work effectively for VR applications, the whole graphics pipeline has to run within a tight time budget: for example, about 11 milliseconds to render both eye buffers in order to sustain a 90 Hz refresh rate.
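The 11 millisecond figure follows directly from the refresh rate. A minimal sketch of that arithmetic, where the function name `frame_budget_ms` is purely illustrative and not part of any framework described here:

```python
def frame_budget_ms(refresh_hz: float) -> float:
    """Time available per displayed frame, in milliseconds."""
    if refresh_hz <= 0:
        raise ValueError("refresh rate must be positive")
    return 1000.0 / refresh_hz


# At 90 Hz each frame has roughly 11.1 ms, shared by both eye buffers.
budget = frame_budget_ms(90.0)
print(f"{budget:.1f} ms per frame at 90 Hz")
```

Any ML inference placed on the critical path would have to fit inside this same window, which is why the design that follows moves it off that path.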

To overcome these constraints, our new architecture offloads model execution to specialized processors, where it runs asynchronously. In this design, the digital signal processor (DSP) or neural processing unit (NPU) is pipelined with the graphics processing unit (GPU) and takes either portions of the rendered buffers or the entire rendered buffers for further processing. The processed content is picked up asynchronously by a GPU warping thread for latency compensation before it is sent to the display.

This graphic shows how we parallelize machine learning model execution on DSP with other processors in the graphics display pipeline.
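The pipelining described above can be sketched with ordinary threads standing in for the processors. This is a toy model under stated assumptions, not the actual implementation: `LatestSlot`, `dsp_worker`, `warp`, and `run_pipeline` are hypothetical names, a Python thread plays the role of the DSP/NPU, and the "enhancement" is a placeholder flag rather than a real model.

```python
import queue
import threading


class LatestSlot:
    """Holds only the most recent processed frame, so readers never wait on the worker."""

    def __init__(self):
        self._lock = threading.Lock()
        self._value = None

    def publish(self, value):
        with self._lock:
            self._value = value

    def latest(self):
        with self._lock:
            return self._value


def dsp_worker(rendered, slot):
    """Stand-in for the DSP/NPU: process each rendered buffer as it arrives."""
    while True:
        frame = rendered.get()
        if frame is None:  # sentinel: the renderer has finished
            break
        # Placeholder for model execution (hypothetical; no real inference here).
        slot.publish({"frame_id": frame["frame_id"], "enhanced": True})


def warp(slot, head_pose):
    """Stand-in for the GPU warping thread: take whatever is ready and apply the newest pose."""
    frame = slot.latest()
    if frame is None:
        return None  # nothing processed yet
    return {**frame, "warped_with_pose": head_pose}


def run_pipeline(num_frames, head_pose):
    rendered = queue.Queue()
    slot = LatestSlot()
    worker = threading.Thread(target=dsp_worker, args=(rendered, slot))
    worker.start()
    for frame_id in range(num_frames):  # stand-in for GPU rendering
        rendered.put({"frame_id": frame_id})
    rendered.put(None)
    worker.join()  # joined here only to make the example deterministic
    return warp(slot, head_pose)


print(run_pipeline(5, head_pose=(0.0, 0.1, 0.0)))
```

The design point the sketch tries to capture is that the warping step reads the newest finished result instead of waiting for a specific frame, so a slow inference pass delays enhanced content but not the display.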

To improve performance, we modify the graphics memory allocation system in the OS to use the specialized allocator for GPU-DSP shared memory. This is more efficient than direct mapping for two reasons: the graphics framebuffer is often optimized for GPU-only access (and performs poorly when accessed from the CPU), and a special memory registration process is needed to avoid copies during remote calls at runtime.

We tested this pipeline with a sample application that applies deep learning to improve image quality in the central region, but uses more efficient, lower-resolution rendering for other parts of the scene. The super-resolved content is blended with the surrounding regions during asynchronous timewarp. If we render at around 70 percent lower resolution in each direction, we save approximately 40 percent of GPU time, and developers can use those resources to generate better content. To achieve temporally coherent and visually pleasing results in VR, we developed recurrent networks trained with a specially designed temporal loss function. The quality comparison between network prediction and ground truth for 2x super-resolution is shown in the video below.

This video uses the game Beat Saber as an example to demonstrate the capabilities of this research work. The image on the left was generated by applying a fast super-resolution network to content rendered at 2x lower resolution. The image on the right is the full-resolution ground truth.
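The post does not say how the super-resolved central region is blended with its lower-resolution surroundings. As one plausible illustration, the sketch below uses a linear feather at the edge of the central rectangle; the feathering scheme and the names `feather_weight` and `blend_center` are assumptions, and images are plain lists of rows to keep the example dependency-free.

```python
def feather_weight(x, y, left, top, right, bottom, border):
    """Blend weight in [0, 1]: 1 well inside the central rectangle, ramping down toward its edge."""
    # Distance in pixels from (x, y) to the nearest rectangle edge (negative if outside).
    d = min(x - left, right - 1 - x, y - top, bottom - 1 - y)
    if d < 0:
        return 0.0
    return min(1.0, (d + 1) / (border + 1))


def blend_center(periphery, center, left, top, border):
    """Composite a super-resolved patch onto an upsampled full-frame image.

    periphery: full-size image (list of rows of floats), already upsampled.
    center:    super-resolved patch whose top-left corner sits at (left, top).
    border:    width in pixels of the feathered transition (assumed parameter).
    """
    out = [row[:] for row in periphery]
    height, width = len(center), len(center[0])
    right, bottom = left + width, top + height
    for y in range(top, bottom):
        for x in range(left, right):
            w = feather_weight(x, y, left, top, right, bottom, border)
            out[y][x] = w * center[y - top][x - left] + (1.0 - w) * periphery[y][x]
    return out


# Tiny usage example: a 4x4 patch of ones blended into a 6x6 frame of zeros.
frame = [[0.0] * 6 for _ in range(6)]
patch = [[1.0] * 4 for _ in range(4)]
blended = blend_center(frame, patch, left=1, top=1, border=1)
```

In a real pipeline this compositing would run on the GPU as part of the warp pass; the loop above only shows the per-pixel arithmetic.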

Why it matters:

Creating next-gen VR and AR experiences will require finding new, more efficient ways to render high-quality, low-latency graphics. Traditional rendering and super-resolution techniques may not be acceptable on low-persistence displays used in VR headsets, since temporal artifacts are more perceptible. This method provides a new way to use AI to address this challenge on devices powered by mobile chipsets.

In addition to its AR/VR applications, we believe this new framework can open the door to innovations in mobile computational graphics by removing constraints on memory and enabling further advances in image quality enhancement, artifact removal, and frame extrapolation.