Description:

- 20230828-134459-affine-grid-sampler-PR
Torch version: 2.1.0a0+git52598e9
Torch config: PyTorch built with:
  - GCC 9.4
  - C++ Version: 201703
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - CPU capability usage: AVX2
  - CUDA Runtime 12.1
  - NVCC architecture flags: -gencode;arch=compute_89,code=sm_89;-gencode;arch=compute_61,code=sm_61
  - CuDNN 8.9
  - Build settings: BUILD_TYPE=Release, CUDA_VERSION=12.1, CUDNN_VERSION=8.9.0, CXX_COMPILER=/usr/lib/ccache/c++, CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=1 -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER -DUSE_PYTORCH_QNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=old-style-cast -Wno-invalid-partial-specialization -Wno-unused-private-field -Wno-aligned-allocation-unavailable -Wno-missing-braces -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Werror=cast-function-type -Wno-stringop-overflow, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_DISABLE_GPU_ASSERTS=ON, TORCH_VERSION=2.1.0, USE_CUDA=1, USE_CUDNN=1, USE_EIGEN_FOR_BLAS=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=OFF, USE_MKLDNN=0, USE_MPI=OFF, USE_NCCL=0, USE_NNPACK=0, USE_OPENMP=ON, USE_ROCM=OFF, 

Triton version: 2.1.0+e6216047b8

- 20230825-171504-affine-grid-sampler-Nightly
Torch version: 2.1.0a0+gitcf76938
Torch config: PyTorch built with:
  - GCC 9.4
  - C++ Version: 201703
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - CPU capability usage: AVX2
  - CUDA Runtime 12.1
  - NVCC architecture flags: -gencode;arch=compute_89,code=sm_89;-gencode;arch=compute_61,code=sm_61
  - CuDNN 8.9
  - Build settings: BUILD_TYPE=Release, CUDA_VERSION=12.1, CUDNN_VERSION=8.9.0, CXX_COMPILER=/usr/lib/ccache/c++, CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=1 -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER -DUSE_PYTORCH_QNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=old-style-cast -Wno-invalid-partial-specialization -Wno-unused-private-field -Wno-aligned-allocation-unavailable -Wno-missing-braces -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Werror=cast-function-type -Wno-stringop-overflow, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_DISABLE_GPU_ASSERTS=ON, TORCH_VERSION=2.1.0, USE_CUDA=1, USE_CUDNN=1, USE_EIGEN_FOR_BLAS=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=OFF, USE_MKLDNN=0, USE_MPI=OFF, USE_NCCL=0, USE_NNPACK=0, USE_OPENMP=ON, USE_ROCM=OFF, 

Triton version: 2.1.0+e6216047b8


[------------------------------------------------------------------------------------------------------------------------------- Affine grid sampling, cpu -------------------------------------------------------------------------------------------------------------------------------]
                                                                                                          |  Eager (2.1.0a0+git52598e9) PR  |  Compiled (2.1.0a0+git52598e9) PR  |  Compiled (2.1.0a0+gitcf76938) Nightly  |  speed-up PR vs Nightly  |  Eager (2.1.0a0+gitcf76938) Nightly
1 threads: --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
      Input: (8, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=True, mode=bilinear   |         38.010 (+-0.118)        |          51.466 (+-1.257)          |             47.867 (+-0.124)            |     0.930 (+-0.000)      |           33.654 (+-0.411)         
      Input: (8, 3, 345, 456) torch.float32, torch.channels_last, align_corners=True, mode=bilinear       |         35.532 (+-0.236)        |          52.189 (+-0.093)          |             58.979 (+-0.206)            |     1.130 (+-0.000)      |           32.543 (+-0.198)         
      Input: (8, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=False, mode=bilinear  |         38.187 (+-0.112)        |          47.892 (+-0.117)          |             45.833 (+-0.081)            |     0.957 (+-0.000)      |           33.752 (+-0.116)         
      Input: (8, 3, 345, 456) torch.float32, torch.channels_last, align_corners=False, mode=bilinear      |         36.708 (+-0.244)        |          51.680 (+-0.104)          |             58.360 (+-0.108)            |     1.129 (+-0.000)      |           32.576 (+-0.751)         
      Input: (8, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=True, mode=nearest    |         24.201 (+-0.088)        |          27.451 (+-0.059)          |             27.937 (+-0.081)            |     1.018 (+-0.000)      |           24.367 (+-0.074)         
      Input: (8, 3, 345, 456) torch.float32, torch.channels_last, align_corners=True, mode=nearest        |         19.266 (+-0.105)        |          26.070 (+-0.085)          |             26.092 (+-0.054)            |     1.001 (+-0.000)      |           20.144 (+-0.064)         
      Input: (8, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=False, mode=nearest   |         24.293 (+-0.125)        |          26.085 (+-0.064)          |             26.575 (+-0.061)            |     1.019 (+-0.000)      |           24.515 (+-0.095)         
      Input: (8, 3, 345, 456) torch.float32, torch.channels_last, align_corners=False, mode=nearest       |         19.440 (+-0.075)        |          25.252 (+-0.059)          |             25.259 (+-0.051)            |     1.000 (+-0.000)      |           19.770 (+-0.070)         
      Input: (8, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=True, mode=bicubic    |        114.900 (+-0.508)        |         113.416 (+-1.271)          |            248.679 (+-1.431)            |     2.193 (+-0.000)      |          114.609 (+-0.515)         
      Input: (8, 3, 345, 456) torch.float32, torch.channels_last, align_corners=True, mode=bicubic        |        115.973 (+-0.555)        |         124.711 (+-1.596)          |            282.187 (+-2.418)            |     2.263 (+-0.000)      |          115.368 (+-0.652)         
      Input: (8, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=False, mode=bicubic   |        111.730 (+-0.562)        |         110.914 (+-0.865)          |            253.899 (+-2.226)            |     2.289 (+-0.000)      |          111.285 (+-1.226)         
      Input: (8, 3, 345, 456) torch.float32, torch.channels_last, align_corners=False, mode=bicubic       |        112.859 (+-0.487)        |         131.696 (+-1.298)          |            294.124 (+-1.963)            |     2.233 (+-0.000)      |          110.910 (+-0.969)         

Times are in milliseconds (ms).

[------------------------------------------------------------------------------------------------------------------------------- Affine grid sampling, cuda ------------------------------------------------------------------------------------------------------------------------------]
                                                                                                          |  Eager (2.1.0a0+git52598e9) PR  |  Compiled (2.1.0a0+git52598e9) PR  |  Compiled (2.1.0a0+gitcf76938) Nightly  |  speed-up PR vs Nightly  |  Eager (2.1.0a0+gitcf76938) Nightly
1 threads: --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
      Input: (8, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=True, mode=bilinear   |        228.811 (+-0.037)        |          92.990 (+-0.446)          |             92.648 (+-0.286)            |     0.996 (+-0.000)      |          228.274 (+-0.067)         
      Input: (8, 3, 345, 456) torch.float32, torch.channels_last, align_corners=True, mode=bilinear       |        222.107 (+-0.076)        |          93.247 (+-0.387)          |             92.528 (+-0.423)            |     0.992 (+-0.000)      |          221.922 (+-0.297)         
      Input: (8, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=False, mode=bilinear  |        235.654 (+-0.055)        |          75.781 (+-0.566)          |            115.865 (+-0.419)            |     1.529 (+-0.000)      |          236.032 (+-0.111)         
      Input: (8, 3, 345, 456) torch.float32, torch.channels_last, align_corners=False, mode=bilinear      |        226.752 (+-0.088)        |          76.312 (+-0.328)          |            116.468 (+-0.477)            |     1.526 (+-0.000)      |          226.950 (+-0.027)         
      Input: (8, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=True, mode=nearest    |        225.540 (+-0.013)        |          75.638 (+-0.341)          |             72.621 (+-0.292)            |     0.960 (+-0.000)      |          225.937 (+-0.017)         
      Input: (8, 3, 345, 456) torch.float32, torch.channels_last, align_corners=True, mode=nearest        |        217.425 (+-0.024)        |          75.484 (+-0.545)          |             73.518 (+-0.296)            |     0.974 (+-0.000)      |          217.793 (+-0.008)         
      Input: (8, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=False, mode=nearest   |        231.474 (+-0.020)        |          75.972 (+-0.339)          |             73.030 (+-0.387)            |     0.961 (+-0.000)      |          231.991 (+-0.184)         
      Input: (8, 3, 345, 456) torch.float32, torch.channels_last, align_corners=False, mode=nearest       |        223.408 (+-0.016)        |          75.622 (+-0.279)          |             73.542 (+-0.336)            |     0.973 (+-0.000)      |          223.893 (+-0.021)         
      Input: (8, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=True, mode=bicubic    |        319.382 (+-0.023)        |         149.060 (+-0.190)          |            772.116 (+-0.266)            |     5.180 (+-0.000)      |          320.549 (+-0.387)         
      Input: (8, 3, 345, 456) torch.float32, torch.channels_last, align_corners=True, mode=bicubic        |        319.987 (+-0.134)        |         154.443 (+-0.014)          |            797.651 (+-0.232)            |     5.165 (+-0.000)      |          320.665 (+-0.397)         
      Input: (8, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=False, mode=bicubic   |        326.138 (+-0.439)        |         149.092 (+-0.036)          |            772.508 (+-0.259)            |     5.181 (+-0.000)      |          325.751 (+-0.398)         
      Input: (8, 3, 345, 456) torch.float32, torch.channels_last, align_corners=False, mode=bicubic       |        326.024 (+-0.118)        |         154.452 (+-0.209)          |            797.756 (+-0.229)            |     5.165 (+-0.000)      |          326.870 (+-0.372)         

Times are in microseconds (us).