# Implementation details of 2d-IDCT and reinterpreting-DCT First of all, note that we only need to generate Nx2N, NxN, 2NxN transforms for most sizes. The one exception are 8x32/32x8 IDCTs and 1x4/4x1 reinterpreting-DCTs. The 4x1/1x4 reinterpreting-DCTs are very small and don't need special considerations. Large transforms use the same implementation strategy, but avoid increasing code size by using size-generic code. ## Code generation Code is generated with python scripts. The following bash snippet generates the relevant files: ```bash for i in 2 4 8 16 32 do python3 gen_idct.py $i > src/idct$i.rs done for i in 2 4 8 16 32 do python3 gen_reinterpreting_dct.py $i > src/reinterpreting_dct$i.rs done python3 gen_idct2d.py > src/idct2d.rs python3 gen_reinterpreting_dct2d.py > src/reinterpreting_dct2d.rs cargo fmt ``` ## SIMD type selection The compiler generates suboptimal code when mixing different vector sizes. Thus, as a first step we "downgrade" to the largest vector size that divides both sizes of the transform. ## DCT/IDCT Implementation Both the DCT and the IDCT use the same recursive algorithm used in libjxl to compute a vector worth of DCTs/IDCTs. ## 2d transforms The code is written to minimize transposition cost while still ensuring we load full vectors at a time. We don't use any additional memory to store transposes. Let K be vector length (which divides both sides of the DCT as per above). ### N x 2N transforms and 8x32 IDCT For those transforms, the final output should be the same shape as the input. Thus, we logically need to transpose, DCT, transpose and DCT. However, we can instead first do a set of row-DCTs on K rows, transposing every KxK sub-matrix in place in advance, then do a column-DCT on the first K columns, and finally transpose the KxK sub-matrices in the columns again. ### N x N transforms Square transforms are easy: we can do column-DCTs, then swap KxK blocks between lower and upper triangular part of the block-matrix, going K columns by K columns and transposing during the swap, and do a column-DCT after each group of columns is complete. ### 2N x N IDCTs and 32x8 IDCT For these transforms, we have a special implementation of 1D-IDCT that does part of the transpose. In particular, we transpose NxN blocks as in the square case. We are then left with doing the row-DCT and interleaving blocks so that they go from stacked horizontally to stacked vertically. Since that can be done by just reshuffling individual columns of vectors, we merge that operation with the DCT. ### 2N x N DCTs This is basically the same as the IDCTs, but in reverse order. Thus, the "special" DCT applies a different permutation.