{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Accessing File with GDS"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Prerequisite\n",
    "\n",
    "[NVIDIA® GPUDirect® Storage (GDS)](https://developer.nvidia.com/gpudirect-storage) must be installed to use the GDS feature. (The GDS client package has been available since CUDA Toolkit 11.4.)\n",
    "\n",
    "The file access APIs still work without the GDS host (kernel) packages, but you won't see the speedup.\n",
    "Please follow the [release note](https://docs.nvidia.com/gpudirect-storage/release-notes/index.html) or the [installation guide](https://docs.nvidia.com/gpudirect-storage/troubleshooting-guide/index.html#abstract) to install GDS on your host system.\n",
    "\n",
    "- Note: During the GDS prerequisite installation ([installation guide](https://docs.nvidia.com/gpudirect-storage/troubleshooting-guide/index.html)), you need MOFED (Mellanox OpenFabrics Enterprise Distribution) installed. MOFED is available at https://www.mellanox.com/products/infiniband-drivers/linux/mlnx_ofed.\n",
    "\n",
    "\n",
    "\n",
    "The following examples assume that the files being loaded reside on an NVMe storage device and that the CuPy and PyTorch packages are installed.\n",
    "\n",
    "Please execute the following commands to install the required libraries.\n",
    "\n",
    "```\n",
    "!conda install -c pytorch -c conda-forge pytorch==1.7.1 torchvision==0.8.2 torchaudio==0.7.2 cudatoolkit=11.0\n",
    "```\n",
    "\n",
    "(If executing `import torch; torch.cuda.is_available()` doesn't show `True`, use the `pip` installation method for PyTorch.)\n",
    "\n",
    "\n",
    "or\n",
    "```\n",
    "!pip install cupy-cuda110\n",
    "!pip install torch==1.7.1+cu110 torchvision==0.8.2+cu110 torchaudio==0.7.2 -f https://download.pytorch.org/whl/torch_stable.html\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "#!conda install -c pytorch -c conda-forge pytorch==1.7.1 torchvision==0.8.2 torchaudio==0.7.2 cudatoolkit=11.0\n",
    "\n",
    "# or\n",
    "\n",
    "#!pip install cupy-cuda110\n",
    "#!pip install torch==1.7.1+cu110 torchvision==0.8.2+cu110 torchaudio==0.7.2 -f https://download.pytorch.org/whl/torch_stable.html"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Open File\n",
    "\n",
    "You can use either the `CuFileDriver` class or the `open()` function in the `cucim.clara.filesystem` package."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Opening/Closing file with CuFileDriver\n",
    "\n",
    "A file descriptor is needed to create a `CuFileDriver` instance.\n",
    "\n",
    "To use GDS, the file needs to be opened with `os.O_DIRECT`. See [NVIDIA GPUDirect Storage O_DIRECT Requirements Guide](https://docs.nvidia.com/gpudirect-storage/o-direct-guide/index.html).\n",
    "\n",
    "Please also see [os.open()](https://docs.python.org/3/library/os.html#os.open) for the detailed options available.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Help on instancemethod in module cucim.clara._cucim.filesystem:\n",
      "\n",
      "__init__(...)\n",
      "    __init__(self: cucim.clara._cucim.filesystem.CuFileDriver, fd: int, no_gds: bool = False, use_mmap: bool = False, file_path: str = '') -> None\n",
      "    \n",
      "    Constructor of CuFileDriver.\n",
      "    \n",
      "    Args:\n",
      "        fd: A file descriptor (in `int` type) which is available through `os.open()` method.\n",
      "        no_gds: If True, use POSIX APIs only even when GDS can be supported for the file.\n",
      "        use_mmap: If True, use memory-mapped IO. This flag is supported only for the read-only file descriptor. Default value is `False`.\n",
      "        file_path: A file path for the file descriptor. It would retrieve the absolute file path of the file descriptor if not specified.\n",
      "\n"
     ]
    }
   ],
   "source": [
    "import os\n",
    "from cucim.clara.filesystem import CuFileDriver\n",
    "\n",
    "fno = os.open(\"input/image.tif\", os.O_RDONLY | os.O_DIRECT)\n",
    "fno2 = os.dup(fno)\n",
    "\n",
    "fd = CuFileDriver(fno, no_gds=False)  # use GDS when it is available for the file\n",
    "fd.close()\n",
    "os.close(fno)\n",
    "\n",
    "# Do not use GDS even when GDS can be supported for the file.\n",
    "fd2 = CuFileDriver(fno2, no_gds=True)\n",
    "fd2.close()\n",
    "os.close(fno2)\n",
    "\n",
    "help(CuFileDriver.__init__)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Opening a file with the `open()` function in the `cucim.clara.filesystem` package\n",
    "\n",
    "The `cucim.clara.filesystem.open()` function accepts three parameters (`file_path`, `flags`, `mode`).\n",
    "\n",
    "\n",
    "#### file_path\n",
    "\n",
    "A string for the file path.\n",
    "\n",
    "#### flags\n",
    "\n",
    "`flags` can be one of the following flag strings:\n",
    "\n",
    "- **\"r\"**  : `os.O_RDONLY`\n",
    "- **\"r+\"** : `os.O_RDWR`\n",
    "- **\"w\"**  : `os.O_RDWR`   | `os.O_CREAT` | `os.O_TRUNC`\n",
    "- **\"a\"**  : `os.O_RDWR`   | `os.O_CREAT`\n",
    "\n",
    "In addition to the flags above, the function adds `os.O_CLOEXEC` and `os.O_DIRECT` by default.\n",
    "\n",
    "The following optional flags can be appended to the strings above:\n",
    "\n",
    "- **'p'**: Use POSIX APIs only (first try to open with O_DIRECT). It does not use GDS.\n",
    "- **'n'**: Do not add the O_DIRECT flag.\n",
    "- **'m'**: Use a memory-mapped file. This flag is supported only for read-only file descriptors.\n",
    "\n",
    "When **'m'** is used, `PROT_READ` and `MAP_SHARED` are passed to the [mmap()](https://man7.org/linux/man-pages/man2/mmap.2.html) function.\n",
    "\n",
    "#### mode\n",
    "\n",
    "A file mode. Default value is `0o644`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "True"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import cucim.clara.filesystem as fs\n",
    "\n",
    "fd = fs.open(\"input/image.tif\", \"r\")\n",
    "fs.close(fd)  # same as fd.close()\n",
    "\n",
    "# Open file without using GDS\n",
    "fd2 = fs.open(\"input/image.tif\", \"rp\")\n",
    "fs.close(fd2)  # same as fd2.close()\n"
   ]
  },
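  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make the flag strings described above more concrete, here is a minimal sketch of how such a string could be translated into `os.open()` flags. The helper `flags_from_string` and the `BASE_FLAGS` table are hypothetical names used only for illustration; this is not cuCIM's implementation. It also assumes that **'p'** and **'m'** only select the I/O path and do not change the flags passed to `open()`.\n",
    "\n",
    "```python\n",
    "import os\n",
    "\n",
    "# Hypothetical table mirroring the flag list above; not cuCIM's actual code.\n",
    "BASE_FLAGS = {\n",
    "    \"r\": os.O_RDONLY,\n",
    "    \"r+\": os.O_RDWR,\n",
    "    \"w\": os.O_RDWR | os.O_CREAT | os.O_TRUNC,\n",
    "    \"a\": os.O_RDWR | os.O_CREAT,\n",
    "}\n",
    "\n",
    "# O_DIRECT is Linux-specific; fall back to 0 where the constant is missing.\n",
    "O_DIRECT = getattr(os, \"O_DIRECT\", 0)\n",
    "\n",
    "\n",
    "def flags_from_string(flags):\n",
    "    # Split the base mode (\"r\", \"r+\", \"w\", \"a\") from the optional letters.\n",
    "    base = \"r+\" if flags.startswith(\"r+\") else flags[:1]\n",
    "    options = flags[len(base):]\n",
    "    if base not in BASE_FLAGS or set(options) - set(\"pnm\"):\n",
    "        raise ValueError(f\"unsupported flag string: {flags!r}\")\n",
    "\n",
    "    result = BASE_FLAGS[base] | os.O_CLOEXEC\n",
    "    if \"n\" not in options:\n",
    "        result |= O_DIRECT  # added by default unless 'n' is given\n",
    "    # Assumption: 'p' and 'm' choose the I/O path and leave open() flags alone.\n",
    "    return result\n",
    "```\n",
    "\n",
    "Under these assumptions, `flags_from_string(\"rn\")` yields `os.O_RDONLY | os.O_CLOEXEC`, while `flags_from_string(\"r\")` additionally includes `os.O_DIRECT`."
   ]
  },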
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Read/Write File\n",
    "\n",
    "You can use the `pread()`/`pwrite()` methods of either the `CuFileDriver` class or the `cucim.clara.filesystem` package.\n",
    "\n",
    "These methods are similar to the POSIX [pread()](https://man7.org/linux/man-pages/man2/pread.2.html) and [pwrite()](https://man7.org/linux/man-pages/man2/pwrite.2.html) functions, which require `buf`, `count`, and `offset` (`file_offset`) parameters.\n",
    "\n",
    "For convenience, an optional `buf_offset` parameter (default value: `0`) is also available to specify an offset into the input/output buffer."
   ]
  },
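  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The roles of `count`, `file_offset`, and `buf_offset` can be illustrated with plain POSIX I/O. The sketch below uses only the Python standard library; `pread_into` is a hypothetical helper written for this illustration and is not part of cuCIM. It mimics the parameter order described above on a `bytearray`.\n",
    "\n",
    "```python\n",
    "import os\n",
    "import tempfile\n",
    "\n",
    "\n",
    "def pread_into(fd, buf, count, file_offset, buf_offset=0):\n",
    "    # Hypothetical helper: read up to `count` bytes at `file_offset`\n",
    "    # into `buf`, starting at `buf_offset`; returns the bytes actually read.\n",
    "    if buf_offset + count > len(buf):\n",
    "        raise ValueError(\"buffer is too small for the requested read\")\n",
    "    data = os.pread(fd, count, file_offset)\n",
    "    buf[buf_offset:buf_offset + len(data)] = data\n",
    "    return len(data)\n",
    "\n",
    "\n",
    "with tempfile.TemporaryDirectory() as tmp_dir:\n",
    "    path = os.path.join(tmp_dir, \"input.raw\")\n",
    "    with open(path, \"wb\") as f:\n",
    "        f.write(bytes(range(101, 111)))  # 10 bytes: 101..110\n",
    "\n",
    "    buf = bytearray(range(1, 11))  # 10 bytes: 1..10\n",
    "    fd = os.open(path, os.O_RDONLY)\n",
    "    try:\n",
    "        # 8 bytes from file offset 0 land at buffer offset 2.\n",
    "        first_count = pread_into(fd, buf, 8, 0, 2)\n",
    "        first_result = list(buf)\n",
    "        # Only 7 bytes remain after file offset 3, so fewer bytes are read.\n",
    "        second_count = pread_into(fd, buf, 10, 3)\n",
    "        second_result = list(buf)\n",
    "    finally:\n",
    "        os.close(fd)\n",
    "\n",
    "print(first_count, first_result)\n",
    "print(second_count, second_result)\n",
    "```\n",
    "\n",
    "As with POSIX `pread()`, the returned count can be smaller than the requested `count` when the read reaches the end of the file."
   ]
  },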
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Using CPU memory\n",
    "\n",
    "Any Python object supporting [\\_\\_array_interface__](https://numpy.org/doc/stable/reference/arrays.interface.html) (such as `numpy.ndarray`) can be used as the `buf` parameter.\n",
    "Alternatively, a pointer address (`int` type) can be passed as the `buf` parameter."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "np_arr     cnt: 8  content: [  1   2 101 102 103 104 105 106 107 108]\n",
      "np_arr     cnt: 10  content: [101 102 103 104 105 106 107 108 109 110]\n",
      "torch_arr  cnt: 7  content: tensor([104, 105, 106, 107, 108, 109, 110, 108, 109, 110], dtype=torch.uint8)\n",
      "output.raw cnt: 10  content: [0, 0, 0, 0, 0, 104, 105, 106, 107, 108, 109, 110, 108, 109, 110]\n",
      "\n",
      "np_arr     cnt: 10  content: [  0   0   0   0   0 104 105 106 107 108]\n",
      "np_arr     cnt: 10  content: [  0   0   0   0   0 104 105 106 107 108]\n"
     ]
    }
   ],
   "source": [
    "from cucim.clara.filesystem import CuFileDriver\n",
    "import cucim.clara.filesystem as fs\n",
    "\n",
    "import os\n",
    "import numpy as np\n",
    "import torch\n",
    "\n",
    "# Write a file with size 10 (in bytes)\n",
    "with open(\"input.raw\", \"wb\") as input_file:\n",
    "    input_file.write(bytearray([101, 102, 103, 104, 105, 106, 107, 108, 109, 110]))\n",
    "\n",
    "# Create an array with size 10 (in bytes)\n",
    "np_arr = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype=np.uint8)\n",
    "torch_arr = torch.from_numpy(np_arr) # Note: np_arr shares internal data with torch_arr\n",
    "\n",
    "# Using CuFileDriver\n",
    "fno = os.open(\"input.raw\", os.O_RDONLY)\n",
    "fd = CuFileDriver(fno)\n",
    "read_count = fd.pread(np_arr, 8, 0, 2)      # read 8 bytes starting from file offset 0 into buffer offset 2\n",
    "print(\"{:10} cnt: {}  content: {}\".format(\"np_arr\", read_count, np_arr))\n",
    "read_count = fd.pread(np_arr, 10, 0)      # read 10 bytes starting from file offset 0\n",
    "print(\"{:10} cnt: {}  content: {}\".format(\"np_arr\", read_count, np_arr))\n",
    "read_count = fd.pread(torch_arr.data_ptr(), 10, 3)      # request 10 bytes starting from file offset 3 (only 7 remain in the file)\n",
    "print(\"{:10} cnt: {}  content: {}\".format(\"torch_arr\", read_count, torch_arr))\n",
    "fd.close()\n",
    "os.close(fno)\n",
    "\n",
    "fno = os.open(\"output.raw\", os.O_RDWR | os.O_CREAT | os.O_TRUNC)\n",
    "fd = CuFileDriver(fno)\n",
    "write_count = fd.pwrite(np_arr, 10, 5)      # write 10 bytes from np_arr to the file starting from offset 5\n",
    "fd.close()\n",
    "os.close(fno)\n",
    "print(\"{:10} cnt: {}  content: {}\".format(\"output.raw\", write_count, list(open(\"output.raw\", \"rb\").read())))\n",
    "\n",
    "\n",
    "print()\n",
    "# Using filesystem package\n",
    "fd = fs.open(\"output.raw\", \"r\")\n",
    "read_count = fs.pread(fd, np_arr, 10, 0)  # read 10 bytes starting from offset 0\n",
    "print(\"{:10} cnt: {}  content: {}\".format(\"np_arr\", read_count, np_arr))\n",
    "fs.close(fd)                              # same as fd.close()\n",
    "\n",
    "# Using 'with' statement\n",
    "with fs.open(\"output.raw\", \"r\") as fd:\n",
    "    read_count = fd.pread(np_arr, 10, 0)  # read 10 bytes starting from offset 0\n",
    "    print(\"{:10} cnt: {}  content: {}\".format(\"np_arr\", read_count, np_arr))\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Using GPU memory\n",
    "\n",
    "Any Python object supporting [\\_\\_cuda_array_interface__](http://numba.pydata.org/numba-doc/latest/cuda/cuda_array_interface.html) (such as `cupy.ndarray` or a PyTorch CUDA tensor) can be used as the `buf` parameter.\n",
    "Alternatively, a pointer address (`int` type) can be passed as the `buf` parameter."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "np_arr               cnt: 8  content: [  1   2 101 102 103 104 105 106 107 108]\n",
      "cp_arr               cnt: 10  content: [  0   0   0   0   0 104 105 106 107 108]\n",
      "torch_arr            cnt: 7  content: tensor([104, 105, 106, 107, 108, 109, 110, 108, 109, 110], device='cuda:0',\n",
      "       dtype=torch.uint8)\n",
      "output.raw           cnt: 10  content: [0, 0, 0, 0, 0, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110]\n",
      "\n",
      "cp_arr               cnt: 10  content: [  0   0   0   0   0 101 102 103 104 105]\n",
      "np_arr     cnt: 10  content: [  0   0   0   0   0 101 102 103 104 105]\n"
     ]
    }
   ],
   "source": [
    "from cucim.clara.filesystem import CuFileDriver\n",
    "import cucim.clara.filesystem as fs\n",
    "\n",
    "import os\n",
    "import cupy as cp\n",
    "import torch\n",
    "\n",
    "# Write a file with size 10 (in bytes)\n",
    "with open(\"input.raw\", \"wb\") as input_file:\n",
    "    input_file.write(bytearray([101, 102, 103, 104, 105, 106, 107, 108, 109, 110]))\n",
    "\n",
    "# Create an array with size 10 (in bytes)\n",
    "cp_arr = cp.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype=cp.uint8)\n",
    "\n",
    "cuda0 = torch.device('cuda:0')\n",
    "torch_arr = torch.zeros(10, dtype=torch.uint8, device=cuda0)\n",
    "\n",
    "# Using CuFileDriver\n",
    "fno = os.open(\"input.raw\", os.O_RDONLY | os.O_DIRECT)\n",
    "fd = CuFileDriver(fno)\n",
    "\n",
    "read_count = fd.pread(cp_arr, 8, 0, 2)      # read 8 bytes starting from file offset 0 into buffer offset 2\n",
    "print(\"{:20} cnt: {}  content: {}\".format(\"cp_arr\", read_count, cp_arr))\n",
    "read_count = fd.pread(cp_arr, 10, 0)      # read 10 bytes starting from file offset 0\n",
    "print(\"{:20} cnt: {}  content: {}\".format(\"cp_arr\", read_count, cp_arr))\n",
    "read_count = fd.pread(torch_arr, 10, 3)      # request 10 bytes starting from file offset 3 (only 7 remain in the file)\n",
    "print(\"{:20} cnt: {}  content: {}\".format(\"torch_arr\", read_count, torch_arr))\n",
    "fd.close()\n",
    "os.close(fno)\n",
    "\n",
    "fno = os.open(\"output.raw\", os.O_RDWR | os.O_CREAT | os.O_TRUNC)\n",
    "fd = CuFileDriver(fno)\n",
    "write_count = fd.pwrite(cp_arr, 10, 5)      # write 10 bytes from cp_arr to the file starting from offset 5\n",
    "fd.close()\n",
    "os.close(fno)\n",
    "print(\"{:20} cnt: {}  content: {}\".format(\"output.raw\", write_count, list(open(\"output.raw\", \"rb\").read())))\n",
    "\n",
    "print()\n",
    "# Using filesystem package\n",
    "fd = fs.open(\"output.raw\", \"r\")\n",
    "read_count = fs.pread(fd, cp_arr, 10, 0)  # read 10 bytes starting from offset 0\n",
    "print(\"{:20} cnt: {}  content: {}\".format(\"cp_arr\", read_count, cp_arr))\n",
    "fs.close(fd)                              # same as fd.close()\n",
    "\n",
    "# Using 'with' statement\n",
    "with fs.open(\"output.raw\", \"r\") as fd:\n",
    "    read_count = fd.pread(cp_arr, 10, 0)  # read 10 bytes starting from offset 0\n",
    "    print(\"{:20} cnt: {}  content: {}\".format(\"cp_arr\", read_count, cp_arr))\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'shape': (10,),\n",
       " 'typestr': '|u1',\n",
       " 'descr': [('', '|u1')],\n",
       " 'stream': 1,\n",
       " 'version': 3,\n",
       " 'strides': None,\n",
       " 'data': (139779203137536, False)}"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "cp_arr.__cuda_array_interface__"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'typestr': '|u1',\n",
       " 'shape': (10,),\n",
       " 'strides': None,\n",
       " 'data': (139776277610496, False),\n",
       " 'version': 2}"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "torch_arr.__cuda_array_interface__"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Discarding system (page) cache for a file\n",
    "\n",
    "Use `discard_page_cache()` to discard the system (page) cache for a given file before measuring performance on that file.\n",
    "\n",
    "```python\n",
    "import cucim.clara.filesystem as fs\n",
    "\n",
    "fs.discard_page_cache(\"input/image.tif\")\n",
    "# ... file APIs on `input/image.tif`\n",
    "```\n",
    "\n",
    "Its implementation looks like the following:\n",
    "```C++\n",
    "bool discard_page_cache(const char* file_path)\n",
    "{\n",
    "    int fd = ::open(file_path, O_RDONLY);\n",
    "    if (fd < 0)\n",
    "    {\n",
    "        return false;\n",
    "    }\n",
    "    if (::fdatasync(fd) < 0)\n",
    "    {\n",
    "        return false;\n",
    "    }\n",
    "    if (::posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED) < 0)\n",
    "    {\n",
    "        return false;\n",
    "    }\n",
    "    if (::close(fd) < 0)\n",
    "    {\n",
    "        return false;\n",
    "    }\n",
    "    return true;\n",
    "}\n",
    "```\n",
    "\n",
    "This helps measure file access performance accurately, without the effect of the page cache."
   ]
  },
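  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For readers who want to experiment without cuCIM, the same sequence of system calls can be sketched with Python's `os` module on Linux. This is a hypothetical counterpart of the C++ function shown above, not the library's code. Unlike that listing, it closes the file descriptor on the error paths as well.\n",
    "\n",
    "```python\n",
    "import os\n",
    "\n",
    "\n",
    "def discard_page_cache(file_path):\n",
    "    # Hypothetical Python counterpart of the C++ function shown above.\n",
    "    try:\n",
    "        fd = os.open(file_path, os.O_RDONLY)\n",
    "    except OSError:\n",
    "        return False\n",
    "    try:\n",
    "        os.fdatasync(fd)  # flush dirty pages so they can be dropped\n",
    "        # offset=0 and len=0 cover the whole file.\n",
    "        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)\n",
    "    except OSError:\n",
    "        return False\n",
    "    finally:\n",
    "        os.close(fd)  # runs on both the success and the error path\n",
    "    return True\n",
    "```\n",
    "\n",
    "`POSIX_FADV_DONTNEED` is advisory, so the kernel may keep some pages cached; treat the result as a best-effort cache drop."
   ]
  },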
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Experiments (for a big file such as .mhd)\n",
    "\n",
    "The experiments were conducted on an Intel(R) Core(TM) i7-7800X CPU @ 3.50GHz with a Samsung SSD 970 PRO (NVMe SSD, 1TB).\n",
    "\n",
    "Results for reading 10GB of data:\n",
    "```\n",
    "  second method(posix + cudamemcpy)         : 5.031040154863149\n",
    "  second method(posix+odirect + cudamemcpy) : 4.7419630330987275\n",
    "  second method(gds)                        : 4.235773948952556\n",
    "```\n",
    "Performance gain (GDS over POSIX + cudaMemcpy): 1.19x\n",
    "\n",
    "\n",
    "Results for reading 2GB of data:\n",
    "\n",
    "```\n",
    "  second method(posix)         : 1.0681836600415409\n",
    "  second method(posix+odirect) : 0.9496012150775641\n",
    "  second method(gds)           : 0.8406150250229985\n",
    "```\n",
    "Performance gain (GDS over POSIX): 1.27x\n",
    "\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}