[TOC] | 日期 | 作者 | | | ---------- | ----------------- | ---- | | 2019-11-29 | xuchaoo@gmail.com | | | | | | ### 0. 环境 硬件:i58400 | 16G ddr4-2400 | 华硕P106-6G 矿卡,约等于GTX1060~GTX1070(大约350元,性能更好的P104约800,还有P102等) OS: xubuntu-18.04 ### 1. P106矿卡安装 设备断电情况下,矿卡插入PC的卡槽内,开机。 在命令行执行(依赖第2步,先配置下仓库) ```bash ubuntu-drivers devices ``` 如果没有任何输出那么说明矿卡硬件没有插好或者**硬件本身有问题**。 正常情况输出样例:(倒数第四行是推荐你安装的驱动) ```text cxu@cxu-pc:~$ ubuntu-drivers devices == /sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0 == modalias : pci:v000010DEd00001C07sv00001043sd000085FCbc03sc02i00 vendor : NVIDIA Corporation model : GP106 [P106-100] driver : nvidia-driver-430 - distro non-free driver : nvidia-driver-410 - third-party free driver : nvidia-driver-415 - third-party free driver : nvidia-driver-440 - third-party free recommended driver : nvidia-driver-390 - third-party free driver : nvidia-driver-435 - distro non-free driver : xserver-xorg-video-nouveau - distro free builtin ``` ### 2. 矿卡驱动仓库配置 ```bash sudo add-apt-repository ppa:graphics-drivers/ppa sudo apt-get update ``` 配置仓库目的是从命令行安装驱动。 当然也能从 下载打包好的官方驱动,搜索框按照下面的填写:产品类型:GeForce|产品系列:GeForce 10 Series|产品家族:GeForce GTX1060|.. 我搜出来的是: ### 3. 安装矿卡驱动 #### 安装驱动 ```bash apt install nvidia-driver-440 ``` 还有一种方法是 ```bash ubuntu-drivers autoinstall ``` 后面的`nvidia-driver-xxx`需要看`ubuntu-drivers devices`命令的输出,里面有可以安装的驱动。此处选择了recommended的,也就是nvidia-driver-440。 #### 测试驱动 ```bash nvidia-smi ``` 正常输出如下: ```text cxu@cxu-pc:~$ nvidia-smi Fri Nov 29 23:10:53 2019 +-----------------------------------------------------------------------------+ | NVIDIA-SMI 440.26 Driver Version: 440.26 CUDA Version: 10.2 | |-------------------------------+----------------------+----------------------+ | GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC | | Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. | |===============================+======================+======================| | 0 P106-100 Off | 00000000:01:00.0 Off | N/A | | 50% 23C P8 8W / 120W | 57MiB / 6080MiB | 0% Default | +-------------------------------+----------------------+----------------------+ +-----------------------------------------------------------------------------+ | Processes: GPU Memory | | GPU PID Type Process name Usage | |=============================================================================| | 0 867 G /usr/lib/xorg/Xorg 56MiB | +-----------------------------------------------------------------------------+ ``` #### 卸载驱动(如果要) ```bash sudo nvidia-uninstall sudo apt purge "*nvidia*" # 然后要重启一下,目的是把内核已经加载的驱动模块卸载掉 # ===以下是备注==== # apt clean 删除下载的缓存软件包 # apt remove 删除安装的软件包 # apt purge 删除安装的软件包和相关的配置文件 ``` #### 常见错误 > 错误1 NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running. > ——————————————————————————— > > 原因我遇到了2个: > > 1)硬件没插好,或者也许你买的卡本身就是坏的。 > > 2)安装CUDA时,没看清楚选择了安装display driver, 这个driver和这一步骤安装的驱动产生冲突。判断是否冲突可以运行nvidia-smi, 如果冲突了【NVIDIA-SMI 440.26 Driver Version: 440.26 】,这个地方的版本(440.26)会不一样。 > 错误2: Failed to initialize NVML: Driver/library version mismatch > > —————————————————————————— > > 原因是卸载驱动之后内核里的还没有卸载掉,特别是先安装了高版驱动,然后又安装低版本驱动的情况下。最简单的办法就是重启一下。 > > > 问题3:有的时候再安装CUDA的时候不小心又安装了一次驱动,会造成很奇怪的问题,运行nvidia-smi可能会出现nvidia-smi版本和驱动版本对不上 > > ———————— > > 查看驱动版本 cat /proc/driver/nvidia/version, 看下和自己手工安装的能不能对上,如果对不上就出现驱动版本混乱了。需要重装。 > > 需要用安装cuda时自带的命令 nvidia-uninstall来卸载,他会把内核里的东西全部清除。 > > 卸载cuda运行 cuda/bin里的uninstall-xxx.pl > > 如果你是从网站下载的驱动包,那么卸载还需要用这个驱动包来卸载 ./xxxxxx-driver.run --uninstall > > 然后重新按照流程从头安装。 ### 4. 安装CUDA(N卡的并行计算框架) #### CUDA介绍 CUDA的介绍 #### CUDA安装 CUDA的下载地址 。 我选择安装CUDA 10.0 ```bash wget https://developer.nvidia.com/compute/cuda/10.0/Prod/local_installers/cuda_10.0.130_410.48_linux # 10.0的补丁 wget http://developer.download.nvidia.com/compute/cuda/10.0/Prod/patches/1/cuda_10.0.130.1_linux.run chmod 755 cuda_10.0.130_410.48_linux chmod 755 cuda_10.0.130.1_linux.run sudo service lightdm stop #这行不能忽略 sudo ./cuda_10.0.130_410.48_linux sudo ./cuda_10.0.130.1_linux.run sudo service lightdm start ``` 在安装的时候一定不要再次安装驱动,看清楚这个选项,否则会造成执行`nvidia-smi`的时候出现错误。 ```bash Install NVIDIA Accelerated Graphics Driver for Linux-x86_64 410.48? (y)es/(n)o/(q)uit: n #这个地方一定要选n ``` #### CUDA配置 CUDA安装之后终端出现提示 ```text =========== = Summary = =========== Driver: Installed Toolkit: Installed in /usr/local/cuda-10.0 Samples: Installed in /home/cxu, but missing recommended libraries Please make sure that - PATH includes /usr/local/cuda-10.0/bin - LD_LIBRARY_PATH includes /usr/local/cuda-10.0/lib64, or, add /usr/local/cuda-10.0/lib64 to /etc/ld.so.conf and run ldconfig as root To uninstall the CUDA Toolkit, run the uninstall script in /usr/local/cuda-10.0/bin To uninstall the NVIDIA Driver, run nvidia-uninstall Please see CUDA_Installation_Guide_Linux.pdf in /usr/local/cuda-10.0/doc/pdf for detailed information on setting up CUDA. Logfile is /tmp/cuda_install_1788.log ``` 根据上面输出的提示做一些配置, 打开`/etc/profile` ```bash export PATH=${PATH}:/usr/local/cuda/bin: export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:/usr/local/cuda/lib64 ``` 由于我安装cuda时候选择创建了软链接cuda到cuda-10.0因此配置里可以直接用cuda目录。 #### 测试CUDA ##### 命令行测试 ```bash cxu@cxu-pc:~$ source /etc/profile cxu@cxu-pc:~$ nvcc --version nvcc: NVIDIA (R) Cuda compiler driver Copyright (c) 2005-2018 NVIDIA Corporation Built on Sat_Aug_25_21:08:01_CDT_2018 Cuda compilation tools, release 10.0, V10.0.130 ``` ##### 运行CUDA自带的例子 安装CUDA的时候会问你是否要安装例子,我选了y, 然后就再我家目录下生成了一个目录 `NVIDIA_CUDA-10.0_Samples` 先编译 ```bash cd NVIDIA_CUDA-10.0_Samples make all -j6 #编译完成生成了一个bin目录 cd bin/x86_64/linux/release/ ./deviceQuery #也可以执行其他的 ``` 输出 ```text ./deviceQuery Starting... CUDA Device Query (Runtime API) version (CUDART static linking) Detected 1 CUDA Capable device(s) Device 0: "P106-100" CUDA Driver Version / Runtime Version 10.2 / 10.0 CUDA Capability Major/Minor version number: 6.1 Total amount of global memory: 6081 MBytes (6375997440 bytes) (10) Multiprocessors, (128) CUDA Cores/MP: 1280 CUDA Cores GPU Max Clock rate: 1709 MHz (1.71 GHz) Memory Clock rate: 4004 Mhz Memory Bus Width: 192-bit L2 Cache Size: 1572864 bytes Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384) Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers Total amount of constant memory: 65536 bytes Total amount of shared memory per block: 49152 bytes Total number of registers available per block: 65536 Warp size: 32 Maximum number of threads per multiprocessor: 2048 Maximum number of threads per block: 1024 Max dimension size of a thread block (x,y,z): (1024, 1024, 64) Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535) Maximum memory pitch: 2147483647 bytes Texture alignment: 512 bytes Concurrent copy and kernel execution: Yes with 2 copy engine(s) Run time limit on kernels: Yes Integrated GPU sharing Host Memory: No Support host page-locked memory mapping: Yes Alignment requirement for Surfaces: Yes Device has ECC support: Disabled Device supports Unified Addressing (UVA): Yes Device supports Compute Preemption: Yes Supports Cooperative Kernel Launch: Yes Supports MultiDevice Co-op Kernel Launch: Yes Device PCI Domain ID / Bus ID / location ID: 0 / 1 / 0 Compute Mode: < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) > deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 10.2, CUDA Runtime Version = 10.0, NumDevs = 1 Result = PASS ``` #### 常见错误 > 错误1:It appears that an X server is running. Please exit X before installation. > > —————————————————————— > > 解决办法,安装前运行 sudo service lightdm stop > > 安装之后运行 sudo service lightdm start > > ### 5. 安装cuDNN #### cuDNN介绍 **cuDNN**(CUDA Deep Neural Network library):是NVIDIA打造的针对深度神经网络的加速库,是一个用于深层神经网络的GPU加速库。如果你要用GPU训练模型,cuDNN不是必须的,但是一般会采用这个加速库。 #### cuDNN版本选择与安装 cuDNN需要寻找和CUDA版本匹配的进行安装: 下载的时候要注册并登陆,下载会有多个选择,只需要安装 `cuDNN Library for Linux`, 我下载到的文件名字是`cudnn-10.0-linux-x64-v7.6.4.38.tgz` 解压这个文件内容如下: ```text cxu@cxu-pc:~$ tree cuda cuda ├── include │   └── cudnn.h ├── lib64 │   ├── libcudnn.so -> libcudnn.so.7 │   ├── libcudnn.so.7 -> libcudnn.so.7.6.4 │   ├── libcudnn.so.7.6.4 │   └── libcudnn_static.a └── NVIDIA_SLA_cuDNN_Support.txt 2 directories, 6 files ``` 然后执行 ```bash sudo cp cuda/include/cudnn.h /usr/local/cuda/include/ sudo cp -d cuda/lib64/lib* /usr/local/cuda/lib64 sudo chmod a+r /usr/local/cuda/include/cudnn.h /usr/local/cuda/lib64/libcudnn* ``` #### 删除cuDNN(如果要) 如果要删除cuDNN,请执行: ```bash sudo rm /usr/local/cuda/include/cudnn.h sudo rm -r /usr/local/cuda/lib64/libcudnn* ``` ### 6. 测试tensorFlow 至此GPU有关环境安装完毕,接下来测试一下tf能不能正常。 ```bash virtualenv -p /usr/bin/python3 venv source venv/bin/activate pip install tensorflow-gpu ``` 打开ipython/python控制台 ```python >> import tensorflow as tf >> tf.test.gpu_device_name() ``` 输出(看第7行): ```text 2019-11-29 22:51:19.282039: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1006] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2019-11-29 22:51:19.282540: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties: name: P106-100 major: 6 minor: 1 memoryClockRate(GHz): 1.7085 pciBusID: 0000:01:00.0 2019-11-29 22:51:19.282569: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0 2019-11-29 22:51:19.282578: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0 2019-11-29 22:51:19.282587: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0 2019-11-29 22:51:19.282596: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0 2019-11-29 22:51:19.282610: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0 2019-11-29 22:51:19.282619: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0 2019-11-29 22:51:19.285274: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7 2019-11-29 22:51:19.285360: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1006] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2019-11-29 22:51:19.285905: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1006] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2019-11-29 22:51:19.286288: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0 2019-11-29 22:51:19.287356: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0 2019-11-29 22:51:19.288043: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix: 2019-11-29 22:51:19.288056: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165] 0 2019-11-29 22:51:19.288061: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 0: N 2019-11-29 22:51:19.288716: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1006] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2019-11-29 22:51:19.289252: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1006] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2019-11-29 22:51:19.289638: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] Created TensorFlow device (/device:GPU:0 with 5658 MB memory) -> physical GPU (device: 0, name: P106-100, pci bus id: 0000:01:00.0, compute capability: 6.1) Out[3]: '/device:GPU:0' ``` ### 其他问题 - 安装CUDA时,它自带了一个驱动,一般不用,因为我安装了这个驱动之后执行nvidia-smi没有反应,可能配套的管理命令不如自己安装那么全面。 - 这里有个问题没有解决:我先安装了nvidia-driver-440, 然后卸载想安装nvidia-driver-410,但每次重启还是440,还没找到原因。