---
title: cuda issues
permalink: cuda-issues
date: 2024-07-23T08:33:43-04:00
tags: cuda
---

> First of all, apologies for the rushed writing, but I'm too happy that I
> finally resolved my CUDA problem after days of trying. Maybe after I calm down
> a bit I'll come back and make it more pleasant to read.

I'm trying to run Nvidia Docker containers on my Linux Mint 21.3 machine. This
post documents how I fixed the `Failed to initialize NVML: Unknown Error`
problem. However, contrary to the other posts about `Failed to initialize NVML:
Unknown Error`, where the GPU goes offline after a certain period of time, my
GPU wasn't detected at all! CUDA works great on bare metal, but as soon as it's
containerized the GPU simply isn't detected inside Docker.

For my solution, check the last entry. But first: neofetch

```
             ...-:::::-...                 user@computer
          .-MMMMMMMMMMMMMMM-.              ----------------
      .-MMMM`..-:::::::-..`MMMM-.          OS: Linux Mint 21.3 x86_64
    .:MMMM.:MMMMMMMMMMMMMMM:.MMMM:.        Host: MS-7D51 1.0
   -MMM-M---MMMMMMMMMMMMMMMMMMM.MMM-       Kernel: 5.15.0-113-generic
 `:MMM:MM`  :MMMM:....::-...-MMMM:MMM:`    Uptime: 52 mins
 :MMM:MMM`  :MM:`  ``    ``  `:MMM:MMM:    Packages: 3113 (dpkg)
.MMM.MMMM`  :MM.  -MM.  .MM-  `MMMM.MMM.   Shell: zsh 5.8.1
:MMM:MMMM`  :MM.  -MM-  .MM:  `MMMM-MMM:   Resolution: 2560x1440
:MMM:MMMM`  :MM.  -MM-  .MM:  `MMMM:MMM:   DE: Cinnamon 6.0.4
:MMM:MMMM`  :MM.  -MM-  .MM:  `MMMM-MMM:   WM: Mutter (Muffin)
.MMM.MMMM`  :MM:--:MM:--:MM:  `MMMM.MMM.   WM Theme: Mint-Y-Dark-Aqua (Mint-Y)
 :MMM:MMM-  `-MMMMMMMMMMMM-`  -MMM-MMM:    Theme: Mint-Y-Aqua [GTK2/3]
  :MMM:MMM:`                `:MMM:MMM:     Icons: Mint-Y-Sand [GTK2/3]
   .MMM.MMMM:--------------:MMMM.MMM.      Terminal: gnome-terminal
     '-MMMM.-MMMMMMMMMMMMMMM-.MMMM-'       CPU: AMD Ryzen 7 3700X (16) @ 3.600GHz
       '.-MMMM``--:::::--``MMMM-.'         GPU: NVIDIA GeForce RTX 3090
            '-MMMMMMMMMMMMM-'              Memory: 8751MiB / 32004MiB
               ``-:::::-``
```

## What we're trying to solve:

```sh
sudo docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi
>> Failed to initialize NVML: Unknown Error
```

Ideally, we want it to print out an nvidia-smi screen.

## My troubleshooting steps:

### Cgroupfs:

- This is probably not a problem anymore (it has since been fixed), but might as well do it anyway, because anything to do with CUDA is black magic
- Relevant links:
  - https://github.com/NVIDIA/nvidia-docker/issues/1730#issue-1573551271
  - https://github.com/NVIDIA/nvidia-docker/issues/1671#issuecomment-1740502744
- tl;dr: Set `/etc/docker/daemon.json` to the following (a quick way to verify the change is sketched right after this block):

```json
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "args": [],
            "path": "nvidia-container-runtime"
        }
    },
    "exec-opts": ["native.cgroupdriver=cgroupfs"]
}
```
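This isn't from the linked issues, but a quick sanity check I like: after editing `daemon.json`, restart the Docker daemon and ask it which cgroup driver it's using (this assumes a systemd-managed Docker install):

```sh
# Restart Docker so the new daemon.json is picked up
sudo systemctl restart docker

# Should print "cgroupfs" once the exec-opts setting is active
docker info --format '{{.CgroupDriver}}'
```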
### Passing in the devices in docker compose

- Relevant links:
  - https://github.com/AbdBarho/stable-diffusion-webui-docker/issues/389#issuecomment-1571340508
- tl;dr: Add the following to your `docker-compose.yaml`, under a service:

```yaml
services:
  ServiceName:
    # ...
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              capabilities: ["gpu", "compute", "utility"] # "gpu" may or may not be present depending on the video card
```

### Pass in the GPU using `--gpus all`

- When you run something in Docker, tack on `--gpus all`

```sh
sudo docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi
>> Failed to initialize NVML: Unknown Error
```

...drat, no change.

### Passing in Devices (my problem):

- My problem:
  - For some reason, the symlinks for all the different device nodes aren't created
  - I had to add all of `--device=/dev/nvidia-uvm-tools --device=/dev/nvidia-modeset --device=/dev/nvidiactl --device=/dev/nvidia0`, otherwise the GPU wouldn't be detected
  - Next up is to try the following (a fuller sketch is at the end of this post):

    ```sh
    sudo nvidia-ctk system create-dev-char-symlinks \
        --create-all
    ```

- Test if you have this problem:
  - `sudo docker run --rm --runtime=nvidia --gpus all --device=/dev/nvidia-uvm-tools --device=/dev/nvidia-modeset --device=/dev/nvidiactl --device=/dev/nvidia0 ubuntu nvidia-smi`
  - I only needed to add `--device=/dev/nvidiactl --device=/dev/nvidia0`, but ymmv
- It works~

```sh
sudo docker run --rm --runtime=nvidia --gpus all --device=/dev/nvidia-uvm-tools --device=/dev/nvidia-modeset --device=/dev/nvidiactl --device=/dev/nvidia0 ubuntu nvidia-smi
>> Tue Jul 23 14:08:38 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.01             Driver Version: 535.183.01   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3090        Off | 00000000:2D:00.0  On |                  N/A |
| 36%   34C    P2             106W / 350W |    587MiB / 24576MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                             |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
+---------------------------------------------------------------------------------------+
```

- Relevant links:
  - https://github.com/NVIDIA/nvidia-docker/issues/1730#issue-1573551271 (I used the repro steps here to finally figure out that the GPU was indeed getting detected)
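As a follow-up to the `nvidia-ctk` idea above, here is a rough sketch of what I plan to try so the `--device` flags aren't needed at all. It assumes the NVIDIA Container Toolkit is installed, and I haven't verified the symlink step end-to-end yet, so treat it as a sketch rather than a confirmed fix:

```sh
# Check which NVIDIA device nodes currently exist on the host
ls -l /dev/nvidia*

# Have the NVIDIA Container Toolkit (re)create the device-node symlinks
sudo nvidia-ctk system create-dev-char-symlinks --create-all

# Re-run the original test; if the missing nodes were the culprit,
# this should now print the nvidia-smi table without any --device flags
sudo docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi
```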