Installing a GPU environment on Linux

Prerequisites (adjust as needed)

  1. Set a static IP
$ cat /etc/netplan/00-installer-config.yaml
network:
  ethernets:
    ens33: # name of the network interface
      addresses: [192.168.2.84/24] # static IP address and prefix length
      dhcp4: false # disable DHCPv4
      optional: true
      gateway4: 192.168.2.1 # gateway address
      nameservers:
        addresses: [192.168.2.1,114.114.114.114] # DNS servers, comma-separated; optional
  version: 2

$ netplan apply
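
A quick sanity check before running netplan apply is to confirm that the gateway sits inside the subnet of the static address. The sketch below does this with Python's standard ipaddress module; the function name gateway_in_subnet is made up for illustration, and the addresses are the ones from the file above.

```python
import ipaddress


def gateway_in_subnet(address_with_prefix: str, gateway: str) -> bool:
    """Return True if the gateway belongs to the subnet of the static address."""
    interface = ipaddress.ip_interface(address_with_prefix)
    return ipaddress.ip_address(gateway) in interface.network


if __name__ == "__main__":
    # Values taken from the netplan file shown above.
    ok = gateway_in_subnet("192.168.2.84/24", "192.168.2.1")
    print("gateway reachable on-link" if ok else "gateway is outside the subnet")
```

A gateway outside the subnet is a common reason for losing connectivity right after applying a new static configuration.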
  1. Download packages and their dependencies
    apt install -y aptitude

    aptitude install sshpass openssh-server --download-only

    # on yum-based systems: yum install sshpass --downloadonly --downloaddir=/tmp

Configuration changes

  1. Allow root to log in to the desktop

Add the following to /etc/gdm3/daemon.conf:

[daemon]

AllowRoot=true

[security]

AllowRoot=true

Comment out this line in /etc/pam.d/gdm-password:

# auth  required        pam_succeed_if.so user != root quiet_success  # commented out
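
To double-check the daemon.conf edit, the file can be parsed with Python's standard configparser. This is a minimal sketch, assuming the file stays in plain INI form; root_login_allowed is an illustrative helper and not part of GDM.

```python
import configparser
from pathlib import Path


def root_login_allowed(conf_text: str) -> bool:
    """Return True if AllowRoot=true is set in both [daemon] and [security]."""
    # strict=False tolerates repeated keys, which hand-edited files often contain.
    parser = configparser.ConfigParser(strict=False)
    parser.read_string(conf_text)
    return all(
        parser.getboolean(section, "AllowRoot", fallback=False)
        for section in ("daemon", "security")
    )


if __name__ == "__main__":
    conf = Path("/etc/gdm3/daemon.conf")  # path used in this post
    if conf.is_file():
        print("root login allowed:", root_login_allowed(conf.read_text()))
```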

  1. Back up /etc/X11/xorg.conf

    cp  /etc/X11/xorg.conf  /etc/X11/xorg.conf.bak20240601
  2. Use Wayland (when X11 has problems)
    Edit /usr/lib/udev/rules.d/61-gdm.rules

Comment out the second of the lines below (shown here already commented out):

LABEL="gdm_disable_wayland"
#RUN+="/usr/libexec/gdm-runtime-config set daemon WaylandEnable false"
GOTO="gdm_end"

  1. Disable nouveau

    cat > /etc/modprobe.d/blacklist-nouveau.conf <<EOF
    blacklist nouveau
    options nouveau modeset=0
    EOF

    update-initramfs -u

    lsmod | grep nouveau # should print nothing (a reboot may be needed first)
  2. Remote desktop

    apt install xrdp

    # log files under /var/log: xrdp-sesman.log, xrdp.log
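
The nouveau check from the step above can also be scripted. The sketch below reads /proc/modules, which is where lsmod gets its data, and compares the module-name column exactly instead of matching a substring anywhere on the line. module_loaded is an illustrative name.

```python
from pathlib import Path


def module_loaded(modules_text: str, name: str) -> bool:
    """Check /proc/modules-style text; the module name is the first field of each line."""
    return any(
        line.split()[0] == name
        for line in modules_text.splitlines()
        if line.strip()
    )


if __name__ == "__main__":
    proc_modules = Path("/proc/modules")
    if proc_modules.exists():
        loaded = module_loaded(proc_modules.read_text(), "nouveau")
        print("nouveau is still loaded" if loaded else "nouveau is not loaded")
```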

Installing from the official downloads

Driver: https://www.nvidia.cn/Download/index.aspx?lang=zh-cn

CUDA: https://developer.nvidia.com/cuda-toolkit-archive

Example downloads:

wget https://developer.download.nvidia.com/compute/cuda/12.5.0/local_installers/cuda_12.5.0_555.42.02_linux.run

wget https://cn.download.nvidia.com/XFree86/Linux-x86_64/550.90.07/NVIDIA-Linux-x86_64-550.90.07.run

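In my tests, running the installers directly usually fails; the environment has to be prepared first. If you need CUDA, you can install CUDA directly, because its installer also installs the driver.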

Install these beforehand:

apt install gcc make

The linux-headers package in the repository may not match the running kernel, in which case the kernel needs upgrading as well:

apt upgrade
apt-get install linux-headers-$(uname -r)
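
The header packages unpack under /usr/src/linux-headers-<release>, as the ls -l /usr/src listing further down in this post shows. The sketch below, assuming that Debian/Ubuntu layout, reports whether headers for the running kernel are present; headers_dir is an illustrative helper.

```python
import platform
from pathlib import Path


def headers_dir(kernel_release: str, src_root: str = "/usr/src") -> Path:
    """Directory where the header package for a given kernel release is expected."""
    return Path(src_root) / f"linux-headers-{kernel_release}"


if __name__ == "__main__":
    release = platform.release()  # same string as `uname -r`
    path = headers_dir(release)
    state = "found" if path.is_dir() else "missing"
    print(f"{path}: {state}")
```

If the directory is missing right after apt upgrade pulled in a new kernel, the machine is probably still running the old kernel and needs a reboot before the versions line up.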

Running the CUDA installer

root@imwl:~/# ./cuda_12.5.0_555.42.02_linux.run

┌──────────────────────────────────────────────────────────────────────────────┐
│ CUDA Installer                                                               │
│ - [X] Driver                                                                 │
│      [X] 555.42.02                                                           │
│ + [X] CUDA Toolkit 12.5                                                      │
│   [X] CUDA Demo Suite 12.5                                                   │
│   [X] CUDA Documentation 12.5                                                │
│ - [ ] Kernel Objects                                                         │
│      [ ] nvidia-fs                                                           │
│   Options                                                                    │
│   Install                                                                    │
│                                                                              │
│                                                                              │
│                                                                              │
│                                                                              │
│                                                                              │
│                                                                              │
│                                                                              │
│                                                                              │
│                                                                              │
│                                                                              │
│                                                                              │
│                                                                              │
│ Up/Down: Move | Left/Right: Expand | 'Enter': Select | 'A': Advanced options │
└──────────────────────────────────────────────────────────────────────────────┘


===========
= Summary =
===========

Driver: Installed
Toolkit: Installed in /usr/local/cuda-12.5/

Please make sure that
- PATH includes /usr/local/cuda-12.5/bin
- LD_LIBRARY_PATH includes /usr/local/cuda-12.5/lib64, or, add /usr/local/cuda-12.5/lib64 to /etc/ld.so.conf and run ldconfig as root

To uninstall the CUDA Toolkit, run cuda-uninstaller in /usr/local/cuda-12.5/bin
To uninstall the NVIDIA Driver, run nvidia-uninstall
Logfile is /var/log/cuda-installer.log
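
The two environment requirements from the summary can be checked programmatically. This is a minimal sketch, assuming the default install prefix /usr/local/cuda-12.5; it inspects environment variables only and does not look at /etc/ld.so.conf. missing_cuda_paths is an illustrative name.

```python
import os


def missing_cuda_paths(environ, cuda_home="/usr/local/cuda-12.5"):
    """List the directories the installer summary asks for that the environment lacks."""
    wanted = {
        "PATH": f"{cuda_home}/bin",
        "LD_LIBRARY_PATH": f"{cuda_home}/lib64",
    }
    missing = []
    for variable, directory in wanted.items():
        # Split the colon-separated list and ignore trailing slashes.
        entries = [e.rstrip("/") for e in environ.get(variable, "").split(":") if e]
        if directory not in entries:
            missing.append(f"{variable} lacks {directory}")
    return missing


if __name__ == "__main__":
    for problem in missing_cuda_paths(os.environ):
        print(problem)
```

An empty result means both variables already contain the CUDA directories; otherwise each printed line names the variable to extend, for example in ~/.bashrc.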

Uninstalling

/usr/local/cuda-12.5/bin/cuda-uninstaller

nvidia-uninstall

Optionally add /usr/local/cuda-12.5/bin to PATH. Until then, call nvcc by its full path:

root@imwl:~/snap# /usr/local/cuda-12.5/bin/nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Wed_Apr_17_19:19:55_PDT_2024
Cuda compilation tools, release 12.5, V12.5.40
Build cuda_12.5.r12.5/compiler.34177558_0
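
When scripting, the release number can be pulled out of the nvcc -V output with a regular expression. This is a sketch, assuming the "release X.Y, VX.Y.Z" line keeps the form shown above; parse_nvcc_version is an illustrative helper.

```python
import re


def parse_nvcc_version(output: str):
    """Extract (release, full_version) from `nvcc -V` output, or None if absent."""
    match = re.search(r"release (\d+\.\d+), V(\d+(?:\.\d+)*)", output)
    return (match.group(1), match.group(2)) if match else None


if __name__ == "__main__":
    # Line copied from the output above.
    line = "Cuda compilation tools, release 12.5, V12.5.40"
    print(parse_nvcc_version(line))
```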

Installing only the driver, without CUDA

chmod +x NVIDIA-Linux-*.run
./NVIDIA-Linux-*.run

Verify

lspci |grep -i nvidia
nvidia-smi
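
For scripted checks, nvidia-smi can emit CSV through its --query-gpu and --format options. The sketch below parses that output into dictionaries; parse_gpu_rows is a made-up helper name, and the script simply prints nothing useful on a machine without the driver.

```python
import csv
import io
import shutil
import subprocess

QUERY = [
    "nvidia-smi",
    "--query-gpu=name,driver_version,memory.total",
    "--format=csv,noheader",
]


def parse_gpu_rows(csv_text: str):
    """Turn the CSV rows printed by QUERY into one dict per GPU."""
    reader = csv.reader(io.StringIO(csv_text), skipinitialspace=True)
    return [
        {"name": row[0], "driver": row[1], "memory_total": row[2]}
        for row in reader
        if len(row) == 3  # skips blank lines and error messages
    ]


if __name__ == "__main__":
    if shutil.which("nvidia-smi"):
        result = subprocess.run(QUERY, capture_output=True, text=True, check=False)
        for gpu in parse_gpu_rows(result.stdout):
            print(gpu)
    else:
        print("nvidia-smi not found on PATH")
```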

Uninstall

nvidia-uninstall

Problems encountered

ERROR: You appear to be running an X server; please exit X before installing

Log out of the desktop session and work from a text console (Ctrl+Alt+F3), or skip the check:
./NVIDIA-Linux-*.run --no-x-check # optionally also: --kernel-source-path=/usr/src/kernels/4.4.242-1.el7.elrepo.x86_64 -k $(uname -r)

Installing directly with apt

This is the recommended route; the packaged versions are usually not very old.

apt install nvidia-driver  

apt install nvidia-cuda-toolkit # recommended; also installs the NVIDIA driver
# verify
nvidia-smi
nvcc -V

Uninstall

apt remove --purge 'nvidia*' # quoted so the shell does not expand the glob against local files
apt autoremove

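Problems encountered

  1. NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

In my case Secure Boot was enabled; once Secure Boot was disabled, the driver worked.

In other cases the cause is an upgraded driver version or kernel. Then run the following, or reinstall the driver: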

apt install dkms 
dkms install -m nvidia -v 555.42.02 # use the version that matches the nvidia-<version> directory listed below


root@imwl:~# ls -l /usr/src/
total 12
drwxr-xr-x 25 root root 4096 Jun 30 11:53 linux-headers-5.15.0-113
drwxr-xr-x 7 root root 4096 Jun 30 11:53 linux-headers-5.15.0-113-generic
drwxr-xr-x 8 root root 4096 Jun 27 03:15 nvidia-555.42.02
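
The version passed to dkms install -v has to match the nvidia-<version> directory under /usr/src. The small sketch below extracts it from a directory listing; nvidia_source_versions is an illustrative helper.

```python
import os
import re


def nvidia_source_versions(dir_names):
    """Pick driver versions out of /usr/src entries named nvidia-<version>."""
    pattern = re.compile(r"nvidia-(\d+(?:\.\d+)+)")
    versions = []
    for name in dir_names:
        match = pattern.fullmatch(name)
        if match:
            versions.append(match.group(1))
    return versions


if __name__ == "__main__":
    src = "/usr/src"
    if os.path.isdir(src):
        for version in nvidia_source_versions(os.listdir(src)):
            print(f"dkms install -m nvidia -v {version}")
```

The script only prints the command for each version it finds; running dkms is left to the reader.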

  1. Files to delete when uninstalling

    sudo rm /etc/X11/xorg.conf
    sudo rm -rf /etc/X11/xorg.conf.d/10-nvidia.conf
    sudo rm -rf /usr/share/X11/xorg.conf.d/10-nvidia.conf
    sudo rm -rf /lib/modprobe.d/nvidia.conf
    sudo rm -rf /etc/modprobe.d/nvidia.conf
  2. Testing with Python

     pip3 install pycuda -i https://pypi.tuna.tsinghua.edu.cn/simple
    error: externally-managed-environment

    × This environment is externally managed
    ╰─> To install Python packages system-wide, try apt install
    python3-xyz, where xyz is the package you are trying to
    install.

    If you wish to install a non-Debian-packaged Python package,
    create a virtual environment using python3 -m venv path/to/venv.
    Then use path/to/venv/bin/python and path/to/venv/bin/pip. Make
    sure you have python3-full installed.

    If you wish to install a non-Debian packaged Python application,
    it may be easiest to use pipx install xyz, which will manage a
    virtual environment for you. Make sure you have pipx installed.

    See /usr/share/doc/python3.11/README.venv for more information.

    note: If you believe this is a mistake, please contact your Python installation or OS distribution provider. You can override this, at the risk of breaking your Python installation or OS, by passing --break-system-packages.
    hint: See PEP 668 for the detailed specification.

    apt install python3-pycuda # or install it inside a virtual environment

python3 test.py & # run several times; test.py is the script below

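import time

import pycuda.autoinit  # importing this module creates the CUDA context
from pycuda.compiler import SourceModule

# Compile a trivial kernel that prints from the device.
mod = SourceModule("""
#include <stdio.h>
__global__ void work()
{
    printf("Manage GPU success!\\n");
}
""")
func = mod.get_function("work")

# Launch a single block containing a single thread.
func(block=(1, 1, 1))
# Keep the process alive so that it shows up in nvidia-smi.
time.sleep(30)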

Verify

root@test:~# nvidia-smi
Sun Jun  9 14:04:51 2024
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.147.05   Driver Version: 525.147.05   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  On   | 00000000:01:00.0 Off |                  N/A |
| N/A   48C    P0    N/A /  N/A |    348MiB /  4096MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      4559      G   /usr/lib/xorg/Xorg                  2MiB |
|    0   N/A  N/A     35606      C   python3                            30MiB |
|    0   N/A  N/A     35622      C   python3                            30MiB |
|    0   N/A  N/A     35634      C   python3                            30MiB |
|    0   N/A  N/A     35648      C   python3                            30MiB |
|    0   N/A  N/A     35686      C   python3                            30MiB |
|    0   N/A  N/A     35705      C   python3                            30MiB |
|    0   N/A  N/A     35718      C   python3                            30MiB |
|    0   N/A  N/A     35730      C   python3                            30MiB |
|    0   N/A  N/A     35742      C   python3                            30MiB |
|    0   N/A  N/A     35756      C   python3                            30MiB |
|    0   N/A  N/A     35769      C   python3                            30MiB |
+-----------------------------------------------------------------------------+
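
Counting the test processes by eye gets tedious once many copies are running. nvidia-smi can list compute processes as CSV through --query-compute-apps; the sketch below tallies them per process name. count_compute_processes is a made-up helper name, and the field list is an assumption about which columns are wanted.

```python
import csv
import io
import shutil
import subprocess
from collections import Counter

QUERY = [
    "nvidia-smi",
    "--query-compute-apps=pid,process_name,used_memory",
    "--format=csv,noheader",
]


def count_compute_processes(csv_text: str) -> Counter:
    """Count compute processes per process name from the CSV printed by QUERY."""
    reader = csv.reader(io.StringIO(csv_text), skipinitialspace=True)
    # Each valid row is: pid, process name, used GPU memory.
    return Counter(row[1] for row in reader if len(row) == 3)


if __name__ == "__main__":
    if shutil.which("nvidia-smi"):
        result = subprocess.run(QUERY, capture_output=True, text=True, check=False)
        for name, count in count_compute_processes(result.stdout).items():
            print(f"{name}: {count}")
```

With the eleven python3 test processes from the table above still running, the tally for python3 would be expected to match the number of rows of type C.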
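  1. Failure caused by Nouveau still being enabled

Running the disable steps was not enough on its own; the installation only succeeded after rebooting the machine.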


Would you like nvidia-installer to attempt to create these modprobe configuration files for you? (Answer: Yes)
-> One or more modprobe configuration files to disable Nouveau have been written. You will need to reboot your system and possibly rebuild the initramfs before these changes can take effect. Note if you later wish to reenable Nouveau, you will need to delete these files: /usr/lib/modprobe.d/nvidia-installer-disable-nouveau.conf, /etc/modprobe.d/nvidia-installer-disable-nouveau.conf
-> nvidia-installer is not able to perform some of the sanity checks which detect potential installation problems while Nouveau is loaded. Would you like to continue installation without these sanity checks, or abort installation, confirm that Nouveau has been properly disabled, and attempt installation again later? (Answer: Abort installation)
-> Nouveau detected in initramfs
-> Initramfs scan complete.
-> The initramfs will likely need to be rebuilt due to the following condition(s):
* nvidia-installer attempted to disable Nouveau.
* Nouveau is present in the initramfs.

Would you like to rebuild the initramfs? (Answer: Rebuild initramfs)
-> /usr/sbin/update-initramfs requires a file path argument, but none was given.
-> Processing the initramfs:
-> Executing: /usr/sbin/update-initramfs -u
-> done
ERROR: Installation has failed. Please see the file '/var/log/nvidia-installer.log' for details. You may find suggestions on fixing installation problems in the README available on the Linux driver download page at www.nvidia.com.

Using DCGM

Official documentation: https://docs.nvidia.com/datacenter/dcgm/2.2/dcgm-user-guide/getting-started.html

Installation (current version)

architecture=x86_64
distribution=$(. /etc/os-release;echo $ID$VERSION_ID | sed -e 's/\.//g')
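
The same distribution string can be derived in Python, which makes the rule explicit: concatenate ID and VERSION_ID from /etc/os-release and drop the dots. This is a sketch; cuda_repo_distribution is an illustrative name.

```python
from pathlib import Path


def cuda_repo_distribution(os_release_text: str) -> str:
    """Build the repository directory name, e.g. ID=debian + VERSION_ID=12."""
    # Mirrors the shell one-liner: source the file, echo $ID$VERSION_ID, strip dots.
    fields = {}
    for line in os_release_text.splitlines():
        key, sep, value = line.partition("=")
        if sep:
            fields[key.strip()] = value.strip().strip("\"'")
    return (fields.get("ID", "") + fields.get("VERSION_ID", "")).replace(".", "")


if __name__ == "__main__":
    os_release = Path("/etc/os-release")
    if os_release.is_file():
        print(cuda_repo_distribution(os_release.read_text()))
```

Whether a directory with the resulting name exists in the NVIDIA repository still has to be checked by opening the repository URL.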

Currently this resolves to https://developer.download.nvidia.com/compute/cuda/repos/debian12/x86_64/

echo "deb http://developer.download.nvidia.com/compute/cuda/repos/$distribution/$architecture /" | sudo tee /etc/apt/sources.list.d/cuda.list

# the key file name may change; open the repository URL to check
apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/$distribution/$architecture/3bf863cc.pub

wget https://developer.download.nvidia.com/compute/cuda/repos/$distribution/$architecture/cuda-$distribution.pin
mv cuda-$distribution.pin /etc/apt/preferences.d/cuda-repository-pin-600

apt-get update
apt-get install -y datacenter-gpu-manager
systemctl --now enable nvidia-dcgm

Usage

root@imwl:/usr/local/dcgm/scripts# dcgmi discovery -l
1 GPU found.
+--------+----------------------------------------------------------------------+
| GPU ID | Device Information                                                   |
+--------+----------------------------------------------------------------------+
| 0      | Name: NVIDIA GeForce GTX 950M                                        |
|        | PCI Bus ID: 00000000:01:00.0                                         |
|        | Device UUID: GPU-8eef98eb-0dbc-be7e-99d4-5992e5ad79e0                |
+--------+----------------------------------------------------------------------+
0 NvSwitches found.
+-----------+
| Switch ID |
+-----------+
+-----------+
0 CPUs found.
+--------+----------------------------------------------------------------------+
| CPU ID | Device Information                                                   |
+--------+----------------------------------------------------------------------+
+--------+----------------------------------------------------------------------+
root@imwl:/usr/local/dcgm/scripts# dc
dcb dcgmi dcgmproftester10 dcgmproftester11 dcgmproftester12
root@imwl:/usr/local/dcgm/scripts# dcgmi diag -r 2
Successfully ran diagnostic for group.
+---------------------------+------------------------------------------------+
| Diagnostic                | Result                                         |
+===========================+================================================+
|-----  Metadata  ----------+------------------------------------------------|
| DCGM Version              | 3.3.5                                          |
| Driver Version Detected   | 555.42.02                                      |
| GPU Device IDs Detected   | 139a                                           |
|-----  Deployment  --------+------------------------------------------------|
| Denylist                  | Pass                                           |
| NVML Library              | Pass                                           |
| CUDA Main Library         | Pass                                           |
| Permissions and OS Blocks | Pass                                           |
| Persistence Mode          | Pass                                           |
| Info                      | Persistence mode for GPU 0 is disabled. Enabl  |
|                           | e persistence mode by running "nvidia-smi -i   |
|                           | <gpuId> -pm 1 " as root.                       |
| Environment Variables     | Pass                                           |
| Page Retirement/Row Remap | Pass                                           |
| Graphics Processes        | Pass                                           |
| Inforom                   | Skip                                           |
+-----  Integration  -------+------------------------------------------------+
| PCIe                      | Skip - All                                     |
+-----  Hardware  ----------+------------------------------------------------+
| GPU Memory                | Skip - All                                     |
+-----  Stress  ------------+------------------------------------------------+
+---------------------------+------------------------------------------------+

Enable persistence mode

root@test:~# nvidia-smi -i 0 -pm 1
Enabled persistence mode for GPU 00000000:01:00.0.
All done.