How ALVR works

This document details some technologies used by ALVR.

If you have any doubt about what is (or isn't) written in here you can contact @zarik5, preferably on Discord.

This document was last updated on June 27th 2023 and refers to the master branch.

Table of contents

  • Architecture
    • The packaged application
    • Programming languages
    • Source code organization
  • Logging and error management
    • The event system
  • Session and settings
    • Procedural generation of code and UI
  • The dashboard
    • The user interface
    • Driver communication
    • Driver lifecycle
  • The streaming pipeline: Overview
  • Client-driver communication
    • Discovery
    • Streaming
  • SteamVR driver
  • Client and driver compositors
    • Foveated encoding
    • Color correction
  • Video transcoding
  • Audio
  • Tracking and display timing
  • Other streams
  • Upcoming
    • Phase sync
    • Sliced encoding

Architecture

The packaged application

ALVR is made of two applications: the streamer and client. The streamer can be installed on Windows and Linux, while the client is installed on Android VR headsets. The client communicates with the driver through TCP or UDP sockets.

The client is a single unified APK, named alvr_client_android.apk. It is powered by OpenXR and it is compatible with Quest headsets, recent Pico headsets and HTC Focus 3 and XR Elite.

The streamer is made of two main parts: the dashboard and the driver (also known as server). The driver is dynamically loaded by SteamVR. This is the file structure on Windows:

  • bin/win64/
    • driver_alvr_server.dll: The main binary, responsible for client discovery and streaming. Loaded by SteamVR.
    • driver_alvr_server.pdb: Debugging symbols
    • openvr_api.dll: OpenVR SDK used for updating the chaperone.
    • vcruntime140_1.dll: Microsoft Visual C++ runtime library used by C++ code in the driver.
  • ALVR Dashboard.exe: Dashboard binary used to change settings, manage clients, monitor statistics and perform installation actions. It can launch SteamVR.
  • driver.vrdrivermanifest: Auxiliary config file used by the driver.

At runtime, some other files are created:

  • session.json: This contains unified configuration data used by ALVR, such as settings and client records.
  • session_log.txt: Main log file. Each line is a JSON structure representing an event generated by the driver. This gets cleared each time a client connects.
  • crash_log.txt: Auxiliary log file. Same as session_log.txt, except only error logs are saved, and does not get cleared.

Programming languages

ALVR is written in multiple languages: Rust, C, C++, HLSL, GLSL. The main language used in the codebase is Rust, which is used for the dashboard, networking, video decoding and audio code. C and C++ are used for graphics, video encoding and SteamVR integration. HLSL is used for graphics shaders in the Windows driver, while GLSL is used in the Linux driver and the client. Moving forward, more code will be rewritten from C/C++ to Rust, and HLSL code will be moved to GLSL or WGSL.

Rust is a systems programming language focused on memory safety and ease of use. It is as performant as C++, but Rust code is less likely to be affected by runtime bugs. The prime Rust feature used by ALVR is enums, which correspond to tagged unions in C++. Rust's enum is a data type that can store different kinds of data, but only one variant can be accessed at a time. For example, the type Result can contain either an Ok value or an Err value, but not both. Together with pattern matching, this is the foundation of error management in Rust applications.
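As a minimal illustration (not ALVR code), this is how Result and pattern matching express a fallible operation:

// Illustrative only: Result is an enum, so exactly one variant is ever active.
fn parse_port(text: &str) -> Result<u16, String> {
    text.parse::<u16>()
        .map_err(|e| format!("invalid port number: {e}"))
}

fn main() {
    // Pattern matching forces both the Ok and the Err cases to be handled.
    match parse_port("9943") {
        Ok(port) => println!("listening on port {port}"),
        Err(message) => eprintln!("{message}"),
    }
}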

Source code organization

ALVR code is hosted in a monorepo. This is an overview of the git tree:

  • .github/: Contains scripts used by the GitHub CI.
  • alvr/: Each subfolder is a Rust crate ("crate" means a code library or executable).
    • audio/: Utility crate hosting audio related code shared by client and driver.
    • client_core/: Platform-agnostic code for the client. It is used as a Rust library by alvr_client_openxr and can also be compiled to a C ABI shared library with a .h header for integration with other projects.
    • client_mock/: Client mock implemented as a thin wrapper around alvr_client_core.
    • client_openxr/: Client implementation using OpenXR, compiled to an APK binary.
    • common/: Some common code shared by other crates. It contains code for versioning, logging, struct primitives, and OpenXR paths.
    • dashboard/: The dashboard application.
    • events/: Utility crate hosting code related to events.
    • filesystem/: Utility crate hosting code for filesystem abstraction between Windows and Linux.
    • packets/: Utility crate containing packet definitions for communication between client, driver and dashboard.
    • server/: The driver shared library loaded by SteamVR.
    • server_io/: Common functionality shared by dashboard and driver, for interaction with the host system. This allows dashboard and driver to work independently from each other.
    • session/: Utility crate related to session file and data management.
    • sockets/: Utility crate shared by client and driver with socket and protocol implementation.
    • vrcompositor_wrapper/: Small wrapper used on Linux so that SteamVR correctly loads the ALVR Vulkan layer.
    • vulkan_layer/: Vulkan WSI layer used on Linux to work around limitations of the OpenVR API on Linux. This is mostly patchwork and hopefully will be removed in the future.
    • xtask/: Utility CLI hosting a variety of scripts for environment setup, building, and packaging ALVR. Should be called with cargo xtask.
  • resources/: Resources for the README.
  • wiki/: Contains the source for the Github ALVR wiki. Changes are mirrored to the actual wiki once committed.
  • about.toml: Controls what dependency licenses are allowed in the codebase, and helps with generating the licenses file in the packaged ALVR streamer.
  • Cargo.lock: Contains versioning information about Rust dependencies used by ALVR.
  • Cargo.toml: Defines the list of Rust crates contained in the repository, and hosts some other workspace-level Rust configuration.

Logging and error management

Logging is split into interface and implementation. The interface is defined in alvr/common/src/logging.rs, the implementations are defined in each binary crate as logging_backend.rs.

ALVR's logging system is based on the log crate. log is already very powerful on its own, since its macros can collect the message, file and line number of the invocation.

ALVR defines some structures, macros and functions to ease error management. The base type used for error management is StrResult<T>, which is an alias for Result<T, String>. Read more about Rust's Result type here.
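As a rough sketch of the pattern (the alias is reproduced inline here; in ALVR it is provided by alvr_common, and the real helper macros differ in detail):

// The error side is just a String carrying a human-readable message.
type StrResult<T = ()> = Result<T, String>;

fn load_config(path: &str) -> StrResult<String> {
    // Tracing information (file and line of the failure) can be appended to
    // the message, similarly to what ALVR's helper macros do.
    std::fs::read_to_string(path)
        .map_err(|e| format!("{e} [{}:{}]", file!(), line!()))
}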

There are many ways of logging in ALVR, each one for different use-cases. To make use of them you should add use alvr_common::prelude::* at the top of the Rust source file.

  • error!(), warn!(), info!(), debug!() (reexported macros from the log crate). Logs are processed depending on the logging backend.
  • show_e() and show_w() are used to log a string message, additionally showing a popup.
  • show_err(), show_warn() work similarly to show_e() and show_w(), but they accept a Result<> and log only if the result is Err().
  • fmt_e!() adds tracing information to a message and produces an Err(), that can be returned.
  • err!() and enone!() are used respectively with .map_err() and .ok_or_else(), to map a Result or Option to a StrResult, adding tracing information.
  • Some other similarly named functions and macros with related functionality.

The event system

Events are messages used internally in the driver and sent to dashboard instances. Events are generated with send_event(); the event system is implemented on top of the logging system.

This is the layout of Event, in JSON form

{
  "timestamp": "<timestamp>",
  "event_type": {
    "id": "<EventType>",
    "content": { <depends on id> }
  }
}

Log is a special kind of event:

{
  "timestamp": "<timestamp>",
  "event_type": {
    "id": "Log",
    "content": {
      "severity": "Error or Warn or Info or Debug",
      "content": "<the message>"
    }
  }
}

The driver logs events in JSON form to session_log.txt, one per line.
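As an illustrative sketch (field and attribute choices are assumptions, not the exact alvr_events definitions), this layout maps naturally to serde's adjacently tagged enums:

use serde::{Deserialize, Serialize};

// The "id"/"content" pair corresponds to serde's adjacently tagged encoding.
#[derive(Serialize, Deserialize)]
#[serde(tag = "id", content = "content")]
enum EventType {
    Log { severity: String, content: String },
    // ...other event kinds
}

#[derive(Serialize, Deserialize)]
struct Event {
    timestamp: String,
    event_type: EventType,
}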

Currently its use is limited, but eventually this will replace the current logging system, and logging will be built on top of the event system. The goal is to create a unified star-shaped network where each client and dashboard instance sends events to the server and the server broadcasts events to all other clients and dashboard instances. This should also unify the way the server communicates with clients and dashboards, making the dashboard just another client.

Session and settings

ALVR uses a unified configuration file, session.json. It is generated the first time ALVR is launched. This file contains the following top-level fields:

  • "server_version": the current version of the streamer. It helps during a version upgrade.
  • "drivers_backup": temporary storage for SteamVR driver paths. Used by the dashboard.
  • "openvr_config": contains a list of settings that have been checked for a diff. It is used by C++ code inside the driver.
  • "client_connections": contains entries corresponding to known clients.
  • "session_settings": all ALVR settings, laid in a tree structure.

Procedural generation of code and UI

ALVR lays out settings in a tree-like structure, in a way that the code itself can efficiently make use of. Settings can contain variants (specified in PascalCase in session.json), which represent mutually exclusive options.

ALVR uses the SettingsSchema macro from the settings-schema crate to generate auxiliary code, i.e. a schema and the "default representation" of the settings. This crate was created specifically for ALVR but can be used by other projects too.

The schema is made of nested SchemaNodes that contain metadata. Some of the metadata is specified directly inside inline attributes in structures and enums.

The "default representation" (the type names are generated by concatenating the structure/enum name with Default), are structures that can hold settings in a way no not lose information about unselected variants; enums are converted to structs and variants that hold a value are converted to fields. The main goal of this is to meet the user expectation of not losing nested configuration when changing some options. The default representation is exactly what is saved inside session.json in "session_settings".

Info about the various types of schema nodes can be found here.

The dashboard makes use of schema metadata and the default representation to generate the settings UI. The end result is that the settings UI layout closely matches the structures used internally in the code, and this helps understanding the inner workings of the code.

When upgrading ALVR, the session might have a slightly different layout: usually some settings have been added/removed/moved/renamed. ALVR handles this with an extrapolation process: it starts from the default session and replaces values with ones taken from the old session file, with the help of the settings schema.

The dashboard

The dashboard is the main way of interacting with ALVR. Functionality is organized in tabs.

The user interface

These are the main components:

TODO: Add screenshots

  • Sidebar: used to select the tab for the main content page.
  • Connections tab: used to trust clients or add them manually by specifying their IP.
  • Statistics tab: shows graphs for latency and FPS and a summary page
  • Settings tab: settings page split between Presets and All Settings. All Settings are procedurally generated from a schema. Presets are controls that modify other settings.
  • Installation tab: utilities for installation: setting firewall rules, registering the driver, launching the setup wizard.
  • Logs tab: shows logs and events in a table.
  • Debug tab: debugging actions.
  • About tab: information about ALVR.
  • Lower sidebar button: can be either "Launch SteamVR" or "Restart SteamVR", depending on the driver connection status
  • Notification bar: shows logs in a non-obtrusive way.

Driver communication

The dashboard communicates with the driver in order to update its information and save configuration. This is done through an HTTP API, with base URL http://localhost:8082. These are the endpoints:

  • /api/dashboard-request: This is the main URL used by the dashboard to send messages and data to the server. The body contains the specific type and body of the request.
  • /api/events: This endpoint is upgraded to a websocket and is used for listening to events from the driver.
  • /api/ping: returns code 200 when the driver is alive.
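For example, a liveness check against this API could look like the following sketch (reqwest is used only for illustration; it is not necessarily what the dashboard uses internally):

// Returns true if the ALVR driver is loaded and its HTTP server responds.
fn driver_is_alive() -> bool {
    reqwest::blocking::get("http://localhost:8082/api/ping")
        .map(|response| response.status().as_u16() == 200)
        .unwrap_or(false)
}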

The dashboard retains some functionality when the driver is not launched. It can manage settings and clients and perform installation actions, but clients cannot be discovered. Once the driver is launched, all these actions are performed by the server, requested through the HTTP API. This mechanism ensures that there are no data races.

Driver lifecycle

The dashboard is able to launch and restart SteamVR, in order to manage the driver's lifecycle.

The driver launch procedure is as follows:

  • The driver is registered according to the "Driver launch action" setting, if needed. By default, current SteamVR drivers are unregistered and backed up inside session.json.
  • On Linux, the vrcompositor wrapper is installed if needed.
  • SteamVR is launched.

Once the driver shuts down, if there are backed up drivers, these are restored.

The driver restart procedure is as follows:

  • The dashboard notifies the driver that it should be restarted.
  • The driver sends a request for restart to the dashboard.
  • The driver asks SteamVR to shut down, without unregistering drivers.
  • The dashboard waits for SteamVR to shut down, killing it after a timeout if necessary.
  • The dashboard relaunches SteamVR.

This might seem unnecessarily complicated. The reason for the first message round trip is to plug into the existing restart mechanism used for settings invalidation, which is invoked from the driver itself. The reason the driver cannot restart itself autonomously is that any auxiliary process spawned by the driver would block SteamVR shutdown or leave it in a zombie state.

The streaming pipeline: Overview

The goal of ALVR is to bridge input and output of a PCVR application to a remote headset. In order to do this, ALVR implements pipelines to handle input, video and audio. The tracking-video pipeline (also known as the motion-to-photon pipeline) is the most complex one, and it can be summarized in the following steps:

  • Poll tracking data on the client
  • Send tracking to the driver
  • Execute the PCVR game logic and render layers
  • Compose layers into a frame
  • Encode the video frame
  • Send the encoded video frame to the client through the network
  • Decode the video frame on the client
  • Perform more compositor transformations
  • Submit the frame to the VR runtime
  • The runtime renders the frame during a vsync.

Client-driver communication

ALVR uses a custom protocol for client-driver communication. ALVR supports UDP and TCP transports. USB connection is supported although not as a first class feature; you can read more about it here.

Discovery

Usually the first step to establish a connection is discovery. When the server discovers a client, it is shown in the "New clients" section of the Connections tab. The user can then trust the client and the connection is established.

ALVR uses a UDP socket on port 9943 for discovery. The client broadcasts a packet and waits for the driver to respond. It's the client that broadcasts and the driver that then asks for a connection: this is because of the balance of responsibility between the two peers. The client becomes a portal into the PC, which can contain sensitive data. For this reason the server has to trust the client before initiating the connection.

This is the layout of the discovery packet:

| Prefix            | Protocol ID | Hostname |
| ----------------- | ----------- | -------- |
| "ALVR" + 0x0 x 12 | 8 bytes     | 32 bytes |
  • The prefix is used to filter packets and ensure a packet is really sent by an ALVR client
  • The protocol ID is a unique version identifier calculated from the semver version of the client. If the client version is semver-compatible with the streamer, the protocol ID will match.
  • Hostname: the hostname is a unique identifier for a client. When a client is launched for the first time, a hostname is chosen, and it persists for successive launches. It is reset when the app is upgraded or downgraded.

The format of the packet can change between major versions, but the prefix must remain unchanged, and the protocol ID must be 8 bytes.
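As a sketch of how a client could assemble this 56-byte packet (illustrative; the byte order of the protocol ID is assumed here and the real alvr_sockets code may differ):

// Builds the discovery packet: 16-byte prefix ("ALVR" followed by 12 zero
// bytes), 8-byte protocol ID, 32-byte zero-padded hostname.
fn discovery_packet(protocol_id: u64, hostname: &str) -> [u8; 56] {
    let mut packet = [0u8; 56];
    packet[..4].copy_from_slice(b"ALVR"); // the remaining prefix bytes stay 0x00
    packet[16..24].copy_from_slice(&protocol_id.to_le_bytes()); // endianness assumed
    let name = hostname.as_bytes();
    let len = name.len().min(32);
    packet[24..24 + len].copy_from_slice(&name[..len]);
    packet
}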

Streaming

ALVR uses two sockets for streaming: the control socket and stream socket. Currently these are implemented with async code; there's a plan to move this back to sync code.

The control socket uses the TCP transport; it is used to exchange small messages between client and server. ALVR requires TCP here to ensure reliability.

The stream socket can use UDP or TCP; it is used to send large packets and/or packets that do not require reliability. ALVR is robust to packet losses and packet reordering.

The specific packet format used over the network is not clearly defined, since ALVR uses multiple abstraction layers to manipulate the data (bincode, tokio's length-delimited codec). Furthermore, packets are broken up into shards to ensure they fit within the MTU when using UDP.
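The sharding idea can be sketched like this (the real shard header and reassembly bookkeeping are omitted):

// Splits a serialized packet into shards no larger than max_shard_size bytes,
// so that each shard fits in a single UDP datagram. The real implementation
// also prefixes each shard with indices used for reassembly on the receiver.
fn into_shards(packet: &[u8], max_shard_size: usize) -> Vec<Vec<u8>> {
    packet
        .chunks(max_shard_size)
        .map(|chunk| chunk.to_vec())
        .collect()
}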

Since the amount of data streamed is large, the socket buffer size is increased both on the driver side and on the client.

SteamVR driver

The driver is the component responsible for most of the streamer functionality. It is implemented as a shared library loaded by SteamVR. It implements the OpenVR API in order to interface with SteamVR.

Using the OpenVR API, ALVR pushes tracking and button data to SteamVR using vr::VRServerDriverHost()->TrackedDevicePoseUpdated(). SteamVR then returns a rendered game frame with the pose used for rendering it. On Windows, frames are retrieved by implementing the IVRDriverDirectModeComponent interface: SteamVR calls IVRDriverDirectModeComponent::Present(). On Linux this API doesn't work, so ALVR uses a WSI Vulkan layer to intercept display driver calls made by vrcompositor. The pose associated with the frame is obtained from the vrcompositor execution stack with the help of libunwind.

Client and driver compositors

ALVR is essentially a bridge between PC and headset that transmits tracking, audio and video. But it also implements some additional features to improve image quality and streaming performance. To this end, ALVR implements Fixed Foveated Rendering (FFR) and color correction.

The client compositor is implemented in OpenGL, while the server compositor is implemented with DirectX 11 on Windows and Vulkan on Linux. There are plans to move all compositor code to the graphics abstraction layer wgpu, mainly to unify the codebase.

It's important to note that ALVR's compositors are separate from the headset runtime compositor and the SteamVR compositor. The headset runtime compositor is part of the headset operating system; it controls compositing between different apps and overlays, and prepares the image for display (with lens distortion correction, chromatic aberration correction, mura and ghosting correction). On the driver side, on Windows ALVR takes responsibility for compositing the layers returned by SteamVR; the only responsibility of SteamVR is converting the frame into a valid DXGI texture if the game uses OpenGL or Vulkan graphics. On Linux ALVR grabs Vulkan frames that are already composited by vrcompositor. This introduces additional challenges, since vrcompositor implements async reprojection, which disrupts our head tracking mechanism.

Foveated encoding

Foveated rendering is a technique where frame images are individually compressed in a way that the human eye barely detects the compression. In particular, the center of the image is kept at the original resolution, and the rest is compressed. ALVR refers to foveated rendering as "Foveated encoding" to clarify its scope. In native standalone or PCVR apps, foveated rendering reduces the load on the GPU by rendering parts of the image at lower resolution. In ALVR's case frames are still rendered at full resolution, but are then "encoded" (compressing the outskirts of the image) before actually encoding and transmitting them. The image is then re-expanded on the client side after decoding and before display.

Currently ALVR supports only fixed foveation, but support for tracked eye foveation is planned.

Over its history, ALVR implemented different algorithms for foveated encoding. The first one was "Warp", where the image is compressed in an elliptical pattern, using the tangent function to define the compression ratio radially. A problem with this algorithm is that it causes the image to become blurry. Here is a demo of this algorithm in action. The second algorithm was "Slices", where the image is sliced up into 9 sections (center, edges, corners), compressed to different degrees and then re-packed together into a single rectangular frame. The main issue with this algorithm was its complexity. You can find a demo here. The current algorithm in use is a reimplementation of Oculus' AADT (Axis-Aligned Distorted Transfer), which simply compresses the lateral edges of the image horizontally and the vertical edges vertically. This algorithm has less compression power, but it's much simpler and less taxing on the Quest's GPU.

Color correction

Color correction is implemented on the server and adds simple brightness, contrast, saturation, gamma and sharpening controls. It's implemented on the server for performance reasons and to avoid amplifying image artifacts caused by transcoding.

Video transcoding

To be able to send frames from the driver to the client through the network, they have to be compressed, since current WiFi technology doesn't provide enough bandwidth to send raw frames. Doing a quick conservative calculation: let's say we have 2 x 2048x2048 eye images, 3 color channels, 8 bits per channel, sent 72 times per second; that would amount to almost 15 Gbps.
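Spelled out, the arithmetic behind that estimate is:

2 × (2048 × 2048) pixels × 3 channels × 8 bits × 72 Hz ≈ 14.5 × 10⁹ bit/s ≈ 14.5 Gbps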

ALVR uses the h264 and HEVC video codecs for compression. These codecs are chosen since they have hardware decoding support on Android and generally hardware encoding support on the PC side. On Windows, the driver uses NvEnc for Nvidia GPUs and AMF for AMD GPUs; on Linux ALVR supports VAAPI, NvEnc and AMF through FFmpeg. In case the GPU doesn't support hardware encoding, on both Windows and Linux ALVR supports software encoding with x264 (through FFmpeg), although the performance is often insufficient for a smooth experience. The client supports only MediaCodec, which is the API to access hardware video codecs on Android.

h264 and HEVC compression works on the assumption that consecutive frames are similar to each other. Each frame is reconstructed from past frames plus some small additional data. For this reason, packet losses may cause glitches that persist many frames after the missing frame. When ALVR detects packet loss, it requests a new IDR frame from the encoder. An IDR frame is a packet that contains all the information to build a whole frame by itself; the encoder will ensure that successive frames will not rely on frames older than the last requested IDR.

Audio

Game audio is captured on the PC and sent to the client, and microphone audio is captured on the client and sent to the PC. The Windows and Linux implementations once again differ. On Windows, game audio is captured from a loopback device; microphone audio is sent to virtual audio cable software to expose audio data from a (virtual) input device. On Linux the microphone does not work out-of-the-box, but there is a bash script available for creating and plugging into pipewire audio devices.

Unlike video, audio is sent as a raw PCM waveform, and new packets do not rely on old packets. But packet losses may still cause popping, which happens when there is a sudden jump in the waveform. To mitigate this, when ALVR detects a packet loss (or a buffer overflow or underflow) it renders a fade-out or cross-fade.
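A minimal sketch of the fade-out idea on a mono buffer of float samples (illustrative only, not the actual audio code):

// Applies a linear fade-out across the buffer so the waveform reaches zero
// instead of stopping abruptly, avoiding an audible pop.
fn fade_out(samples: &mut [f32]) {
    let len = samples.len();
    for (i, sample) in samples.iter_mut().enumerate() {
        let gain = 1.0 - i as f32 / len as f32;
        *sample *= gain;
    }
}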

Tracking and display timing

Handling head and controller tracking is tricky for VR applications, and even more for VR streaming applications.

In a normal native VR application, tracking is polled at the beginning of the rendering cycle; it is used to render the eye views from a certain perspective and to render the controller or hand models. When the game finishes rendering the frame, it submits it to the VR runtime, which will display it on screen. Between the time tracking is polled and the time the frame is displayed on screen, one or more frame durations may have passed (for example, at 72 FPS the frame duration is about 13.9 ms). Our eyes are very sensitive to latency, especially for orientation, so VR runtimes implement image reprojection (Oculus calls it Asynchronous Time Warp). Reprojection works by rendering the frame rotated in 3D to compensate for the difference in orientation between the tracking pose polled at the beginning of the rendering cycle and the actual pose of the headset at the time of vsync, when the image should be pushed to the display. To be able to correctly rotate the image, the runtime also needs to know the timestamp used for polling tracking, which can be the time of the poll or, better, the predicted time of the vsync. If a time in the future is used for the tracking poll, the polled tracking will be extrapolated.

For VR streaming applications the pipeline is similar, except that tracking is polled for a more distant point in the future, to compensate for the whole transcoding pipeline, and it's not trivial to decide how far in the future to predict. ALVR calculates the prediction offset by measuring how much time passes between the tracking poll time and the time a frame rendered with the same tracking is submitted. These interval samples are averaged and then used for future tracking polls. (To calculate the correct total latency you also need to add the VR runtime compositor latency, which in the dashboard latency graph is shown as "Client VSync".)
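A hedged sketch of the averaging step (the window size and exact statistics used are assumptions):

use std::{collections::VecDeque, time::Duration};

// Keeps the last N poll-to-submit intervals and returns their mean, which is
// then used as the prediction offset for future tracking polls.
struct PredictionOffsetEstimator {
    samples: VecDeque<Duration>,
    max_samples: usize,
}

impl PredictionOffsetEstimator {
    fn submit_sample(&mut self, interval: Duration) {
        if self.samples.len() == self.max_samples {
            self.samples.pop_front();
        }
        self.samples.push_back(interval);
    }

    fn prediction_offset(&self) -> Duration {
        if self.samples.is_empty() {
            return Duration::ZERO;
        }
        self.samples.iter().sum::<Duration>() / self.samples.len() as u32
    }
}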

On the streamer side, ALVR needs to work around an OpenVR API limitation. SteamVR returns frames with their associated pose, but ALVR is responsible for matching that pose with one of the poses submitted previously in order to recover its timestamp.

Other streams

There are some other kinds of data which can be streamed without requiring any special timing. These are button presses and haptics, respectively sent from client to driver and from driver to client.

Upcoming

Phase sync

Phase sync is not a single algorithm but many that share similar objectives: reducing latency or jitter in the rendering/streaming pipeline. The term "phase sync" comes from Oculus, which describes its algorithm for reducing latency in its OpenXR runtime by starting the rendering cycle as late as possible, to reduce waiting time before the vsync.

In general, a phase sync algorithm is composed of two parts: a queue that holds data resources or pointers, and a statistical model to predict event times. The statistical model is fed with durations or other kinds of timing samples, and as output it returns a refined time prediction for a recurring event. The statistical model could be simple and just aim for an average-controlled event time, or more complex and aim at meeting a deadline; the second case needs to take into account the variance of the timing samples. Unlike Oculus' implementation, these statistical models can be highly configurable, to tune the target mean or target variance.
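A rough sketch of the second kind of model, targeting a deadline by accounting for variance (the constants and structure are illustrative):

// Predicts how much time to budget for a recurring event so that it usually
// completes before its deadline: mean duration plus a configurable multiple
// of the standard deviation of the observed samples.
struct EventTimePredictor {
    duration_samples_s: Vec<f32>, // observed durations, in seconds
    std_multiplier: f32,          // higher values trade latency for safety
}

impl EventTimePredictor {
    fn predicted_duration_s(&self) -> f32 {
        let n = self.duration_samples_s.len() as f32;
        if n == 0.0 {
            return 0.0;
        }
        let mean = self.duration_samples_s.iter().sum::<f32>() / n;
        let variance = self
            .duration_samples_s
            .iter()
            .map(|s| (s - mean).powi(2))
            .sum::<f32>()
            / n;
        mean + self.std_multiplier * variance.sqrt()
    }
}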

There are a few phase sync algorithms planned to be implemented: frame submission timing (to reduce frame queueing on the client, controlled by shifting the phase of the driver rendering cycle), SteamVR tracking submission timing (to make sure SteamVR is using exactly the tracking sample we want) and tracking poll timing (to reduce queuing on the server side).

Sliced encoding

Sliced encoding is another technique showcased by Oculus, and it's about reducing latency by parallelizing work. In a simple streaming pipeline, frames are processed sequentially: rendering, then encoding, then transmission, then decoding. There is already some degree of parallelism, as rendering, encoding, transmission, and decoding of different frames can happen at the same time. Sliced encoding can help in reducing encoding and decoding time, as frames are split into "slices". This allows for more efficient utilization of hardware encoders/decoders, or even using hardware and software codecs in parallel. It's crucial to note that network latency cannot be optimized this way. Given the network constraint, sliced encoding can reduce waiting times between encoding/transmission and transmission/decoding, as each encoded slice can be transmitted immediately and doesn't have to wait for the rest of the frame to be encoded (a similar reasoning applies on the decoding side).