# Corruption Detection

**Name:** "Corruption Detection"; "Extension for Automatic Detection of Video Corruptions"

**Formal name:**

**Status:** This extension is defined here to allow for experimentation.

**Contact:**

NOTE: This explainer is a work in progress and may change without notice.

The Corruption Detection (sometimes referred to as automatic corruption detection or ACD) extension is intended to be part of a system that estimates the likelihood that a video transmission is in a valid state. A valid state means that the input to the video encoder on the send side corresponds to the output of the video decoder on the receive side, with the only difference being the expected distortions from lossy compression.

The goal is to detect outright coding errors caused by things such as bugs in encoders/decoders, malformed packetization data, incorrect relay decisions in SFU-type servers, incorrect handling of packet loss/reordering, and so forth. We want to accomplish this with a high signal-to-noise ratio while consuming a minimum of resources in terms of bandwidth and/or computation. It should be noted that it is _not_ a goal to be able to e.g. gauge general video quality using this method.

This explainer contains two parts:

1) A definition of the RTP header extension itself and how it is to be parsed.
2) The intended usage and implementation details for a WebRTC sender and receiver respectively.

If this extension has been negotiated, all the client behavior outlined in this doc MUST be adhered to.

## RTP Header Extension Format

### Data Layout Overview

The message format of the header extension:

     0                   1                   2                   3
     0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    |B|  seq index  |    std dev    | Y err | UV err|    sample 0   |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    |    sample 1   |    sample 2   |   … up to sample <=12
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

### Data Layout Details

* B (1 bit): Indicates whether the seq index field holds the most significant bits (MSB) or the least significant bits (LSB) of the full 14-bit sequence index described in the next point.
* seq index (7 bits): The index into the Halton sequence (used to locate where the samples should be drawn from).
  * If B is set: the 7 most significant bits of the true index. The 7 least significant bits of the true index shall be interpreted as 0, because this is the point where we can guarantee that the sender and receiver have the same full index. B MUST be set on keyframes. On droppable frames B MUST NOT be set.
  * If B is not set: the 7 least significant bits of the true index. The 7 most significant bits should be inferred based on the most recent message.
* std dev (8 bits): The standard deviation of the Gaussian filter used to weigh the samples. The value is scaled using a linear map: 0 = 0.0 to 255 = 40.0. A std dev of 0 is interpreted as directly using just the sample value at the desired coordinate, without any weighting.
* Y err (4 bits): The allowed error for the luma channel.
* UV err (4 bits): The allowed error for the chroma channels.
* Sample N (8 bits): The N:th filtered sample from the input image. Each sample represents a new point in one of the image planes, with the plane and coordinates determined by the index into the Halton sequence (starting at seq index and incremented by one for each sample). Each sample has gone through a Gaussian filter with the std dev specified above. The samples have been floored to an integer.
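
The layout above could be parsed roughly as follows. This is a minimal sketch and not a reference implementation: the names `AcdMessage` and `parse_acd` are hypothetical, the payload is assumed to be the extension data with the RTP extension ID/length framing already removed, the two error fields are assumed to be used as raw 4-bit values, and a one-byte payload is treated as a message that carries only the first byte.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class AcdMessage:
    """Hypothetical container for one parsed message (names are illustrative)."""
    b: bool                    # True: seq_field holds the 7 MSB of the index
    seq_field: int             # raw 7-bit field, 0..127
    std_dev: float             # Gaussian std dev, 0.0..40.0
    y_err: int                 # allowed luma error, raw 4-bit value
    uv_err: int                # allowed chroma error, raw 4-bit value
    samples: Tuple[int, ...]   # filtered sample values, 0..255 each


def parse_acd(payload: bytes) -> AcdMessage:
    """Parse extension payload bytes laid out as in the diagram above."""
    if len(payload) == 0:
        raise ValueError("empty payload")
    b = bool(payload[0] & 0x80)      # first bit on the wire
    seq_field = payload[0] & 0x7F    # remaining 7 bits of the first byte
    if len(payload) == 1:
        # Only the first byte is present: no filter settings or samples.
        return AcdMessage(b, seq_field, 0.0, 0, 0, ())
    if len(payload) < 3:
        raise ValueError("truncated payload")
    std_dev = payload[1] * 40.0 / 255.0   # linear map: 0 -> 0.0, 255 -> 40.0
    y_err = payload[2] >> 4               # high nibble
    uv_err = payload[2] & 0x0F            # low nibble
    return AcdMessage(b, seq_field, std_dev, y_err, uv_err, tuple(payload[3:]))


msg = parse_acd(bytes([0x85, 255, 0x3A, 10, 20]))
print(msg)
```

The sketch deliberately does not enforce a maximum sample count, since the exact limit is only given in the diagram.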

A special case is the so-called "synchronization" message. Such a message contains only the first byte and is used to keep the sender and receiver in sync even if no "full" message has been received for a while. Such messages MUST NOT be sent on droppable frames.

### A note on encryption

Privacy and security are core parts of nearly every WebRTC-based application, which means that some sort of encryption needs to be present. The most common form of encryption is SRTP, defined in RFC 3711. However, as mentioned in section 9.4 of that RFC, RTP header extensions are considered part of the header and are thus not encrypted.

The automatic corruption detection header extension is different from most other header extensions in that it provides not only metadata about the media stream being transmitted but in practice comprises an extremely sparse representation of the actual video stream itself. Given a static scene and enough time, a crude image of the encrypted video can rather trivially be constructed.

As such, most applications should use this extension with SRTP only if additional security is present to protect it. That could be for example in the form of explicit header extension encryption provided by RFC 6904/RFC 9335, or by encapsulating the entire RTP stream in an additional layer such as IPsec.

## Usage & Guidelines

In this section we’ll first look at a general overview of the intended usage of this header extension, followed by more details around the expected implementation.

### Overview

The premise of the extension described here is that we can validate the state of the video pipeline by quasi-randomly selecting a few samples from the raw input frame to an encoder, and then checking them against the output of a decoder. Assuming that a lossless codec is used, we can follow these steps:

1) In an image that is to be encoded, quasi-randomly select N sampling positions and store the sample values for those positions from the raw input image.
2) Encode the image, and attach the selected sample values to the RTP packets containing the encoded bitstream of that image.
3) Transmit the RTP packets to a remote receiver.
4) At the receiver, collect the attached sample values from the RTP packets when assembling the frame, and then pass the bitstream to a decoder.
5) Using the same quasi-random sequence as in (1), calculate the corresponding N sampling positions.
6) Take the output of the decoder at those positions and compare it with the sample values from the RTP packets. If they differ significantly, it is likely that an image corruption has occurred.

Lossless encoding is however rarely used in practice, and that introduces problems for the above algorithm:

* Quantization causes values to be different from the desired value.
* Whole blocks of pixels might be shifted somewhat due to inaccuracies in motion vectors.
* Inaccuracies caused by in-loop or post-process filtering.
* etc.

We must therefore take these distortions into consideration, as they are merely a natural side-effect of the compression and their effect is not to be considered an “invalid state”. We aim to accomplish this using two tools.

First, instead of a sample being a single raw sample value, let it be a filtered one: a weighted average of samples in the vicinity of the desired location, with the weights being a 2D Gaussian centered at that location and the variance adjusted depending on the magnitude of the expected distortions (higher distortion => higher variance). This smooths out inaccuracies caused by both quantization and motion compensation.

Secondly, even with a very large filter kernel the new sample might not converge towards the exact desired value. For that reason, set an “allowed error threshold” that removes small magnitude differences. Since chroma and luma channels have different scales, separate error thresholds are available for them.

### Sequence Index Handling

The quasi-random sequence of choice for this extension is a 2D [Halton Sequence](https://en.wikipedia.org/wiki/Halton_sequence).

The index into the Halton Sequence is indicated by the header extension. It is a 14-bit unsigned integer which wraps around to 0 on overflow.

For each sample contained within the extension, the sequence index should be considered to be incremented by one. Thus the sequence index at the start of the header should be considered “the sequence index for the next sample to be drawn”.

The ACD extension may be sent containing either the 7 most significant bits (B = true) or the 7 least significant bits (B = false) of the sequence index.

Key-frames MUST be populated with the ACD extension, and those MUST use B = true, indicating that only the 7 most significant bits are transmitted.

The sender may choose any arbitrary starting point. The biggest reason not to always start with (B = true, seq index = 0) is that with frequent/periodic keyframes you might end up sampling the same small subset of image locations over and over.

If B = false and the LSB seq index + number of samples exceeds the capacity of the 7-bit field (i.e. > 0x7F), then the most significant bits of the 14-bit sequence counter should be considered to be implicitly incremented by the overflow.

Delta-frames may be encoded as “droppable” or “non-droppable”. Consider for example temporal layering using the [L1T3](https://www.w3.org/TR/webrtc-svc/#L1T3*) mode. In that scenario, key-frames and all T0 frames are non-droppable, while all T1 and T2 frames are droppable. For non-droppable frames, B MAY be set to true even though there is often little utility for it. For droppable frames B MUST NOT be set to true, since a receiver could otherwise easily end up out of sync with the sender.

A receiver must store a state containing the last sequence index used.
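
The index arithmetic in this section might be sketched as follows. It is an illustration under stated assumptions, not the reference implementation: the function names are hypothetical, the receiver is assumed to store the index of the next sample it expects, and when B is not set it is assumed to move that index forward (never backward) until its 7 least significant bits match the received field.

```python
INDEX_MASK = 0x3FFF  # 14-bit sequence index
FIELD_MASK = 0x7F    # 7-bit field carried in the first byte


def advance(index: int, num_samples: int) -> int:
    """Index of the next sample after consuming num_samples; wraps to 0."""
    return (index + num_samples) & INDEX_MASK


def first_byte(index: int, use_msb: bool) -> int:
    """Sender side: encode the first byte for a message starting at index."""
    if use_msb:
        # Assumption: the sender only uses B = true at an index whose 7 LSB
        # are zero, because the receiver interprets those bits as 0.
        if index & FIELD_MASK:
            raise ValueError("B = true requires the 7 LSB of the index to be 0")
        return 0x80 | (index >> 7)
    return index & FIELD_MASK


def resolve(expected: int, b: bool, field: int) -> int:
    """Receiver side: full index of the first sample in a received message."""
    if b:
        return (field & FIELD_MASK) << 7
    # Step forward from the expected index until the 7 LSB match the field.
    skipped = (field - expected) & FIELD_MASK
    return (expected + skipped) & INDEX_MASK


start = resolve(expected=0, b=True, field=3)   # MSB message: 3 << 7
next_expected = advance(start, 13)             # after 13 samples
print(start, next_expected)
```

Because the forward step in `resolve` is computed modulo 128, it can only bridge gaps of up to 0x7F indices, and a carry out of the 7-bit field propagates into the upper bits automatically.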
If an ACD extension is received with B = false but the LSB does not match the last known sequence index state, this indicates that an instrumented frame has been dropped. The receiver should recover from this by incrementing the last known sequence index until the 7 least significant bits match.

Because of this, the sender MUST send ACD messages on non-droppable frames such that the delta between their sequence indices (from the last sample of the previous packet to the first of the next) does not exceed 0x7F. A synchronization message may be used for this purpose if there is no wish to instrument the non-droppable frame.

It is not required to add the ACD extension to every frame. Indeed, for performance reasons it may be reasonable to only instrument a small subset of frames, for example using only one frame per second.

Additionally, when encoding a structure that has independent decode targets (e.g. L3T3_KEY), the sender should generate an independent ACD sequence per target resolution so that a receiver can validate the state of the sub-stream they receive.

// TODO: Add concrete examples.

### Sample Selection

As mentioned above, a Halton Sequence is used to generate sampling coordinates. Base 2 is used for selecting the rows, and base 3 is used for selecting columns.

Each sample in the ACD extension represents a single image sample, meaning it belongs to a single channel rather than e.g. being an RGB pixel.

The initial version of the ACD extension supports only the I420 chroma subsampling format.
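
For readers unfamiliar with the Halton sequence, each coordinate is the radical inverse of the index in the chosen base: the base-b digits of the index are mirrored around the radix point. A minimal sketch, assuming the standard definition with no offset or scrambling (whether the extension's implementation matches this exactly is not stated here, and the function name `halton` is hypothetical):

```python
def halton(index: int, base: int) -> float:
    """Radical inverse of index in the given base, a value in [0, 1)."""
    result = 0.0
    fraction = 1.0
    while index > 0:
        fraction /= base                      # weight of the next digit
        result += fraction * (index % base)   # least significant digit first
        index //= base
    return result


# Row fraction uses base 2 and column fraction uses base 3, as stated above.
for i in range(1, 5):
    print(i, halton(i, 2), halton(i, 3))
```

Multiplying these fractions by the image dimensions yields a position, so consecutive indices spread samples across the whole frame rather than clustering.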

When determining which plane a location belongs to, it is easiest to visualize it as the chroma planes being “stacked” to the side of the luma plane:

    +------+---+
    |      | U |
    +  Y   +---+
    |      | V |
    +------+---+

In pseudo code:

```
row = GetHaltonSequence(seq_index, /*base=*/2) * image_height;
col = GetHaltonSequence(seq_index, /*base=*/3) * image_width * 1.5;

if (col < image_width) {
  // Left part of the stacked layout: luma plane.
  HandleSample(Y_PLANE, row, col);
} else if (row < image_height / 2) {
  // Upper right part: U plane.
  HandleSample(U_PLANE, row, col - image_width);
} else {
  // Lower right part: V plane.
  HandleSample(V_PLANE, row - (image_height / 2), col - image_width);
}

seq_index++;
```

Support for other layout types may be added in later versions of this extension.

Note that the image dimensions are not explicitly a part of the ACD extension; they have to be inferred from the raw image itself.

### Sample Filtering

As mentioned above, when filtering a sample we create a weighted average around the desired location. Only samples in the same plane are considered. The weighting consists of a 2D Gaussian centered on the desired location, with the standard deviation specified in the ACD extension header.

If the standard deviation is specified as 0.0, we consider only the single sample at the desired location.

Otherwise, we first determine a cutoff distance beyond which the weights are considered too small to matter. For now, we have set the weight cutoff to 0.2, meaning the maximum distance from the center sample we need to consider is `max_d = ceil(sqrt(-2.0 * ln(0.2) * stddev^2)) - 1`.

Any samples outside the plane are considered to have weight 0.
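
The cutoff distance follows from solving the Gaussian weight for distance: requiring `exp(-d^2 / (2 * stddev^2)) >= 0.2` gives `d <= sqrt(-2 * ln(0.2) * stddev^2)`. A small sketch of that calculation, where the function name `max_distance` and the `weight_cutoff` parameter are hypothetical and only there for illustration:

```python
import math


def max_distance(std_dev: float, weight_cutoff: float = 0.2) -> int:
    """Largest integer offset from the center sample that is considered."""
    if std_dev <= 0.0:
        return 0  # std dev 0: only the sample at the location itself
    return math.ceil(math.sqrt(-2.0 * math.log(weight_cutoff) * std_dev ** 2)) - 1


for sigma in (1.0, 2.0, 40.0):
    print(sigma, max_distance(sigma))
```

At the largest representable std dev of 40.0 this gives a window of 143 x 143 samples (2 * 71 + 1 on each side), which gives a feel for the per-sample cost of large filter settings.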

In pseudo-code, the filtering looks as follows:

```
sample_sum = 0;
weight_sum = 0;
// Loop bounds are inclusive and clamped to the plane.
for (y = max(0, row - max_d) to min(plane_height - 1, row + max_d)) {
  for (x = max(0, col - max_d) to min(plane_width - 1, col + max_d)) {
    weight = e^(-1 * ((y - row)^2 + (x - col)^2) / (2 * stddev^2));
    sample_sum += SampleAt(x, y) * weight;
    weight_sum += weight;
  }
}
filtered_sample = sample_sum / weight_sum;
```

### Receive Side Considerations

When a frame has been decoded and an ACD message is present, the receiver performs the following steps:

* Update the sequence index so that it is consistent with the ACD message.
* Calculate the sample positions from the Halton sequence.
* Filter each sample of the decoded image using the standard deviation provided in the ACD message.

We then need to compare the actual samples present in the ACD message with the samples generated from the locally decoded frame, taking the allowed error into account:

```
for (i = 0 to num_samples - 1) {
  // Allowed error from ACD message, depending on which plane sample i is in.
  allowed_error = SampleType(i) == Y_PLANE ? Y_ERR : UV_ERR;
  delta_i = max(0, abs(RemoteSample(i) - LocalSample(i)) - allowed_error);
}
```

It is then up to the receiver how to interpret these deltas. A suggested method is to calculate a “corruption score” by calculating sum(delta(i)^2), where delta(i) is the delta for the i:th sample in the message, and then scaling and capping that result to a maximum of 1.0. By squaring the deltas, we make sure that even singular samples that are way outside their expected values cause a noticeable shift in the score. Another possible way is to calculate the distance and cap it using a sigmoid function.

This extension message format does not make recommendations about what a receiver should do with the corruption scores, but some possibilities are:

* Expose it as a statistic connected to the video receive stream. Let the application decide what to do with the information.
* Let the WebRTC application use a corruption signal to take proactive measures. E.g. request a key-frame in order to recover, or try to switch to another codec type or implementation.

### Determining Filter Settings & Error Thresholds

It is up to the sender to estimate how large the filter kernel and the allowed error thresholds should be.

One method to do this is to analyze example outputs from different encoders and map the average frame QP to suitable settings. Different mappings are of course needed for e.g. AV1 compared to VP8, but it’s also possible to get “tighter” values with knowledge of the exact implementation used, e.g. a mapping designed just for libaom encoder version X running with speed setting Y.

Another method is to use the actual reconstructed state from the encoder. That of course means the encoder has to expose that state, which is not common. A benefit of doing it that way is that the filter size and allowed error can be very small (really only post-processing could introduce distortions in that scenario). A drawback is that if the reconstructed state already contains corruption due to an encoder bug, we would not be able to detect that corruption at all.

There are also possibly more accurate but probably much more costly alternatives, such as training an ML model to determine the settings based on both the content of the source frame and any metadata present in the encoded bitstream.

Regardless of method, the implementation at the send side SHOULD strive to set the filter size and error thresholds such that 99.5% of filtered samples end up with a delta <= the error threshold for that plane, based on a representative set of test clips and bandwidth constraints.

Notes: The extension must not be present in more than 1 packet per video frame.
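
The 99.5% target from the filter settings discussion could be checked offline along these lines. This is only a sketch under assumptions: `pick_error_threshold` is a hypothetical helper, its input is assumed to be the absolute differences between filtered source samples and filtered decoded samples for one plane at one candidate std dev, and the 4-bit error field is assumed to be used as a raw value from 0 to 15.

```python
from typing import Optional, Sequence


def pick_error_threshold(abs_deltas: Sequence[float]) -> Optional[int]:
    """Smallest 4-bit threshold covering at least 99.5% of the deltas."""
    if not abs_deltas:
        raise ValueError("need at least one measured delta")
    total = len(abs_deltas)
    for threshold in range(16):  # the Y err / UV err fields are 4 bits wide
        covered = sum(1 for d in abs_deltas if d <= threshold)
        # Integer form of covered / total >= 0.995, avoiding float rounding.
        if covered * 1000 >= 995 * total:
            return threshold
    return None  # no 4-bit value suffices; a larger std dev may be needed


measured = [0] * 990 + [3] * 10   # made-up deltas, for illustration only
print(pick_error_threshold(measured))
```

A `None` result suggests repeating the measurement with a larger filter std dev, trading sensitivity for fewer false positives.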