# RTC Video Encoder API v2 This file describes the next generation (or "v2") of video encoder API, which represents a large departure from the legacy `VideoEncoder` interface. ## Video Encoder Model The video encoder model is intended to be as low level as possible, while still being codec agnostic. The model is that an encoder is "just a set of buffers". Every time a frame is encoded, the user decides which buffers may be used for prediction, and which buffer to update. From this, the user can construct different encoding structures and scalability modes, dynamically change the encoding structure, implement encoding strategies such as LTR, selectively drop or re-encode frames, etc. At a high level, the API consists of two parts: the `VideoEncoderFactoryInterface` and the `VideoEncoderInterface`. * The `VideoEncoderFactoryInterface` maps one-to-one to an underlying implementation. Its main purpose is to describe what the implementation is capable of, and to create encoder instances of that implementation. * The `VideoEncoderInterface` is an instance of an encoder, and its main purpose is to encode video. --- ## VideoEncoderFactoryInterface The general outline of the `VideoEncoderFactoryInterface` is described here: ```cpp class VideoEncoderFactoryInterface { public: enum class RateControlMode { kCqp, kCbr }; struct Capabilities { struct PredictionConstraints { enum class BufferSpaceType { kMultiInstance, // multiple independent sets of buffers kMultiKeyframe, // single set of buffers, but can store multiple // keyframes simultaneously. kSingleKeyframe // single set of buffers, can only store one keyframe // at a time. }; int num_buffers; int max_references; int max_temporal_layers; BufferSpaceType buffer_space_type; int max_spatial_layers; std::vector scaling_factors; std::vector supported_frame_types; } prediction_constraints; struct InputConstraints { Resolution min; Resolution max; int pixel_alignment; std::vector input_formats; } input_constraints; std::vector encoding_formats; struct BitrateControl { std::pair qp_range; std::vector rc_modes; } rate_control; struct Performance { bool encode_on_calling_thread; std::pair min_max_effort_level; } performance; }; struct StaticEncoderSettings { struct Cqp {}; struct Cbr { TimeDelta max_buffer_size; TimeDelta target_buffer_size; }; Resolution max_encode_dimensions; EncodingFormat encoding_format; std::variant rc_mode; int max_number_of_threads; }; virtual ~VideoEncoderFactoryInterface() = default; virtual std::string CodecName() const = 0; virtual std::string ImplementationName() const = 0; virtual std::map CodecSpecifics() const = 0; virtual Capabilities GetEncoderCapabilities() const = 0; virtual std::unique_ptr CreateEncoder( const StaticEncoderSettings& settings, const std::map& encoder_specific_settings) = 0; }; ``` ### CodecName, ImplementationName, CodecSpecifics * **`CodecName()`**: Returns the name of the codec (e.g., "VP8", "VP9", "AV1"). * **`ImplementationName()`**: Returns the specific implementation used (e.g., "libaom", "vaapi/Intel gen10"). * **`CodecSpecifics()`**: Returns a string-to-string map containing codec-specific information such as profile/tier/level, and encoder-specific information like driver version or software version. ### CreateEncoder Creates an encoder instance (`VideoEncoderInterface`). #### StaticEncoderSettings Represents settings that are fixed for the entire lifetime of the encoder. * **`max_encode_dimensions`**: The maximum resolution (`Resolution`) that will be requested from the encoder. This allows the encoder to pre-allocate buffers and know that only sizes equal to or lower than those will be used. * **`encoding_format`**: The encoded output format (`EncodingFormat`). * **`rc_mode`**: The rate control mode to be used when encoding. * ** `Cqp` **: Constant QP mode. * **`Cbr`**: Constant bitrate mode. * **`max_buffer_size`**: The max buffer size, specified as a `TimeDelta` duration. * **`target_buffer_size`**: What the target buffer size should be, specified as a `TimeDelta` duration. * **`max_number_of_threads`**: The maximum number of threads that the encoder instance may spawn. --- ## Querying Encoder Capabilities One of the most important goals for the new API is to be able to reason about the encoders you have available. The factory can return an instance of a `Capabilities` struct via `GetEncoderCapabilities()`. ### Prediction Constraints To facilitate implementations of different scalability structures, such as temporal & spatial layers, we need to know how many reference buffers are available to store states from previous frames, and how those reference buffers may be used when encoding a frame. The `PredictionConstraints` describe these constraints: ```cpp struct PredictionConstraints { enum class BufferSpaceType { kMultiInstance, // multiple independent sets of buffers kMultiKeyframe, // single set of buffers, but can store multiple // keyframes simultaneously. kSingleKeyframe // single set of buffers, can only store one keyframe // at a time. }; int num_buffers; int max_references; int max_temporal_layers; BufferSpaceType buffer_space_type; int max_spatial_layers; std::vector scaling_factors; std::vector supported_frame_types; }; ``` * **`num_buffers`**: How many reference buffers are available. For a fully implemented VP8 encoder this would be 3, for VP9/AV1 it would be 8, and for H264/H265 it would be 16. Specific encoder implementations may limit this number to something lower than the spec allows. * **`max_references`**: This field indicates how many buffers can be used as reference for any given frame. For both VP8 and VP9 this would then be up to 3. Again, implementations may limit it further. * **`max_temporal_layers`**: The maximum number of temporal layers the encoder supports. This is mainly used to determine how many temporal layers can be used in CBR-mode, as many encoders lack per-layer CBR rate control. * **`max_spatial_layers`**: Maximum number of spatial/quality layers. * **`scaling_factors`**: Indicates the resolution scale factors supported when referencing other frames. Examples: * `{1, 1/2}` needed in an SVC scenario using a higher layer referencing a buffer from a lower spatial layer where the width and height are ½ of the layer currently encoded. * `{1}` A single 1/1 factor indicates the encoder can do inter-layer prediction, but only when using the same resolution. * `{}` An empty list indicates inter-layer prediction is not supported (i.e., only simulcast modes are supported). * **`buffer_space_type`**: This enum is important when multiple independent encodings are produced, i.e., simulcast or SVC s-modes. The enum indicates if the encodings have independent sets of reference buffers, or if they share a common buffer space, and if multiple keyframes can be stored at the same time in the same buffer space. * **`kMultiInstance`**: Spatial layers are encoded in separate buffer spaces. Inter-layer prediction is not supported. An example is the libvpx VP8 multires encoder. * **`kMultiKeyframe`**: Spatial layers share the same buffer space, and multiple keyframes can be stored in the encoder at the same time. An example is the libvpx VP9 encoder. * **`kSingleKeyframe`**: Spatial layers share the same buffer space, and only one keyframe can be stored in the buffer at any time. Note that encoders that support `kStartFrame` can still encode independent spatial layers. This is the typical encoder buffer space type, and an example that supports both keyframes and startframes is the libaom AV1 encoder. * **`supported_frame_types`**: Indicates which types of frames the encoder can produce: * **`FrameType::kKeyframe`**: An all-intra coded frame which depends on no previous state and refreshes all current decoder state (recreates all reference buffers, coding contexts, entropy tables, etc.). * **`FrameType::kStartFrame`**: An independently decodable frame that only updates specified decoder state, but allows previous reference frames to be kept. A new decoder must be able to start decoding from this frame. * **`FrameType::kDeltaFrame`**: A frame that references existing state as part of the coding. ### Input Constraints Input constraints define the constraints for the source frames to be encoded. ```cpp struct InputConstraints { Resolution min; Resolution max; int pixel_alignment; std::vector input_formats; }; ``` * **`min` / `max`**: The lower and upper bound on the resolutions the encoder supports (`Resolution`). * **`pixel_alignment`**: Requested output resolution must be a multiple of the pixel alignment. * **`input_formats`**: The set of formats the encoder supports for input frames (e.g., `VideoFrameBuffer::Type::kI420`, `VideoFrameBuffer::Type::kNV12`). ### Encoding Formats The supported encoded image formats, in terms of chroma subsampling and the sample bit depth. Note that the input format does not necessarily match the encoded format. E.g., the input may be in RGB but the encoded images are in YUV 4:2:0 format. ```cpp struct EncodingFormat { enum SubSampling { k420, k422, k444 }; SubSampling sub_sampling; int bit_depth; }; ``` * **`sub_sampling`**: Supported chroma subsampling (4:2:0, 4:2:2, 4:4:4). * **`bit_depth`**: Supported sample bit depth (typically 8, 10, or 12). ### Bitrate Control ```cpp struct BitrateControl { std::pair qp_range; std::vector rc_modes; }; ``` * **`qp_range`**: Min and max QP supported by the encoder. * **`rc_modes`**: The set of supported rate control modes (`RateControlMode`): * **`RateControlMode::kCbr` (Constant Bitrate)**: If the encoder indicates it can handle scalability modes (see Prediction Constraints), it MUST allow setting per-layer bitrate targets to advertise CBR support. * **`RateControlMode::kCqp` (Constant QP)**: The encoder MUST allow setting the QP on a per-frame basis in this mode. By setting per-frame QP we can implement external rate control. ### Performance Performance-related options: ```cpp struct Performance { bool encode_on_calling_thread; std::pair min_max_effort_level; }; ``` * **`encode_on_calling_thread`**: Whether the thread calling `Encode` will be used for encoding. * **`min_max_effort_level`**: Indicates min and max available effort level. The min effort level can be a negative value and indicates faster encoding but at a lower quality or compression. The max effort level can be a positive value and indicates slower encoding but at a higher quality or compression. The default (0) should always be part of the range. --- ## VideoEncoderInterface ```cpp class VideoEncoderInterface { public: virtual ~VideoEncoderInterface() = default; enum class FrameType { kKeyframe, kStartFrame, kDeltaFrame }; struct EncodingError {}; struct EncodedData { FrameType frame_type; int encoded_qp; }; using EncodeResult = std::variant; struct FrameOutput { virtual ~FrameOutput() = default; virtual std::span GetBitstreamOutputBuffer(DataSize size) = 0; virtual void EncodeComplete(const EncodeResult& encode_result) = 0; }; struct TemporalUnitSettings { VideoCodecMode content_hint = VideoCodecMode::kRealtimeVideo; Timestamp presentation_timestamp; }; struct FrameEncodeSettings { struct Cbr { TimeDelta duration; DataRate target_bitrate; }; struct Cqp { int target_qp; }; std::variant rate_options; FrameType frame_type = FrameType::kDeltaFrame; int temporal_id = 0; int spatial_id = 0; Resolution resolution; std::vector reference_buffers; std::optional update_buffer; int effort_level = 0; std::unique_ptr frame_output; }; virtual void Encode(scoped_refptr frame_buffer, const TemporalUnitSettings& settings, std::vector frame_settings) = 0; }; ``` --- ## Encoding Parameters ### TemporalUnitSettings These settings are provided for each temporal unit to be encoded. ```cpp struct TemporalUnitSettings { VideoCodecMode content_hint = VideoCodecMode::kRealtimeVideo; Timestamp presentation_timestamp; }; ``` * **`content_hint`**: Indicates if the content is primarily real-time video or screen sharing (e.g., `VideoCodecMode::kRealtimeVideo`, `VideoCodecMode::kScreenshare`). * **`presentation_timestamp`**: When the frame is meant to be presented. Only used in CBR mode, where it determines how much buffer levels should be adjusted between encoded frames. ### FrameEncodeSettings These settings are provided for each frame. Note that there may be multiple encodings for a single input frame (e.g. when spatial layering is used). This is represented by a vector of `FrameEncodeSettings`. ```cpp struct FrameEncodeSettings { struct Cbr { TimeDelta duration; DataRate target_bitrate; }; struct Cqp { int target_qp; }; std::variant rate_options; FrameType frame_type = FrameType::kDeltaFrame; int temporal_id = 0; int spatial_id = 0; Resolution resolution; std::vector reference_buffers; std::optional update_buffer; int effort_level = 0; std::unique_ptr frame_output; }; ``` * **`rate_options`** * **`Cbr`** * **`duration`**: The duration of the frame (`TimeDelta`). * **`target_bitrate`**: The target rate for the duration of the frame (`DataRate`). * **`Cqp`** * **`target_qp`**: The frame global QP to encode with. * **`frame_type`**: Whether the frame should be a keyframe, startframe, or delta frame (`FrameType`). * **`temporal_id`**: The temporal ID for this frame. Must be set to 0 for keyframes. Provided as input for the CBR rate controller. * **`spatial_id`**: The spatial ID for this frame. * **`resolution`**: The resolution to encode the frame at. Note that any reference buffer used for prediction must have a resolution that is a multiple of the supported scaling factors. * **`reference_buffers`**: List of reference buffer slots the encoder is allowed to reference when encoding this frame. It may choose to use only a subset. * **`update_buffer`**: Which buffer that the encoded frame should be placed in. For keyframes and startframes this must be set. * **`effort_level`**: Indicates the amount of effort the encoder should spend encoding the frame. * **`frame_output`**: A `unique_ptr` that provides an allocator for the bitstream buffer (`GetBitstreamOutputBuffer`) and receives the callback when the encode completes (`EncodeComplete`). Note that an encoder MUST update exactly the buffer specified in the encode call (e.g., no internal repeated keyframe interval) if it supports spatial scalability. Otherwise, it breaks assumptions needed to construct some dependency structures. ### FrameOutput The encoder MUST provide feedback for each frame we attempt to encode. We must therefore have a mechanism to map the encoding requests to the feedback. By providing a callback and output buffer allocator interface (`FrameOutput`) via `std::unique_ptr` in `FrameEncodeSettings`, each frame will have its own unique feedback and output allocation path. ```cpp struct FrameOutput { virtual ~FrameOutput() = default; virtual std::span GetBitstreamOutputBuffer(DataSize size) = 0; virtual void EncodeComplete(const EncodeResult& encode_result) = 0; }; ``` * **`GetBitstreamOutputBuffer(DataSize size)`**: The encoder calls this to request a target buffer of specified `size` for the resulting bitstream data. The caller allocates or provides the buffer (as a `std::span`), allowing the encoder to write directly to it and eliminate extra copies. * **`EncodeComplete(const EncodeResult& encode_result)`**: Called when the frame has completed encoding. ```cpp struct EncodingError {}; struct EncodedData { FrameType frame_type; int encoded_qp; }; using EncodeResult = std::variant; ``` The feedback contains either: * **`EncodingError`**: Some error occurred, the encoder is no longer usable. * **`EncodedData`**: * **`frame_type`**: Certain encoders have a fixed keyframe interval, and to support them the encoded frame type is signaled back to the application. * **`encoded_qp`**: The average frame QP. --- ## Encode ```cpp virtual void Encode(scoped_refptr frame_buffer, const TemporalUnitSettings& settings, std::vector frame_settings) = 0; ``` A `VideoEncoderInterface` instance has just a single method for interacting with it: `Encode()`. It takes three arguments: * **`frame_buffer`**: The input frame to be encoded (`scoped_refptr`). * **`settings`**: Settings common to the temporal units (`TemporalUnitSettings`). * **`frame_settings`**: One settings struct per output frame expected from the encoder (`FrameEncodeSettings`). --- ## A note on references A frame for which all referenced frames have been decoded MUST itself be decodable. References do not only express dependencies on other frames used for inter-frame prediction, they express any form of dependency on prior state. What this means is that frames that have no references MUST be decodable without relying on any prior state.