Preliminary setup (adjust as needed)
- Set a static IP
$ cat /etc/netplan/00-installer-config.yaml
- Download packages and their dependencies
apt install -y aptitude
aptitude install sshpass openssh-server --download-only
## RPM-based equivalent: yum install sshpass --downloadonly --downloaddir=/tmp
Configuration changes
- Allow root to log in to the desktop
Add the following to /etc/gdm3/daemon.conf:
[daemon]
AllowRoot=true
[security]
AllowRoot=true
Comment out this line in /etc/pam.d/gdm-password:
# auth required pam_succeed_if.so user != root quiet_success # commented out
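The manual edit above can also be scripted. The following is a minimal sketch, assuming the rule is written as `auth required pam_succeed_if.so user != root ...` in the file; the function name `comment_out_root_check` is hypothetical, and the `.bak` copy is an addition that is not part of the original steps.

```shell
# comment_out_root_check [FILE]  (hypothetical helper, not part of gdm or PAM)
# Comments out the "user != root" pam_succeed_if rule in FILE and keeps FILE.bak.
# FILE defaults to /etc/pam.d/gdm-password; pass another path to try it on a copy.
comment_out_root_check() {
    file="${1:-/etc/pam.d/gdm-password}"
    # Only an uncommented rule matches; a line that already starts with "#"
    # is left untouched, so running the function twice is harmless.
    sed -i.bak -E \
        's/^([[:space:]]*)(auth[[:space:]]+required[[:space:]]+pam_succeed_if\.so[[:space:]]+user[[:space:]]+!=[[:space:]]+root.*)$/\1# \2/' \
        "$file"
}
```

Trying it first on a copy, for example `cp /etc/pam.d/gdm-password /tmp/gdm-password && comment_out_root_check /tmp/gdm-password`, shows the result without touching the live file.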
Back up /etc/X11/xorg.conf:
cp /etc/X11/xorg.conf /etc/X11/xorg.conf.bak20240601
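The backup name above hard-codes a date. As a small sketch of the same idea with the date filled in automatically, the hypothetical helper `backup_with_date` below copies a file to `<file>.bakYYYYMMDD`; the suffix format simply mirrors the `.bak20240601` example.

```shell
# backup_with_date FILE  (hypothetical helper)
# Copies FILE to FILE.bakYYYYMMDD using today's date and prints the backup path.
backup_with_date() {
    src="$1"
    dest="${src}.bak$(date +%Y%m%d)"
    # -p preserves the mode and timestamps of the original file
    cp -p "$src" "$dest" && printf '%s\n' "$dest"
}
```

Usage would be `backup_with_date /etc/X11/xorg.conf`.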
Use Wayland (when X11 has problems)
In /usr/lib/udev/rules.d/61-gdm.rules, comment out the second of these three lines:
LABEL="gdm_disable_wayland"
#RUN+="/usr/libexec/gdm-runtime-config set daemon WaylandEnable false"
GOTO="gdm_end"
Disable nouveau
cat > /etc/modprobe.d/blacklist-nouveau.conf <<EOF
blacklist nouveau
options nouveau modeset=0
EOF
update-initramfs -u
lsmod | grep nouveau # should print nothing
Remote desktop
apt install xrdp
# Logs: /var/log/xrdp-sesman.log and /var/log/xrdp.log
Install from the official downloads
Driver: https://www.nvidia.cn/Download/index.aspx?lang=zh-cn
CUDA: https://developer.nvidia.com/cuda-toolkit-archive
Example downloads:
wget https://developer.download.nvidia.com/compute/cuda/12.5.0/local_installers/cuda_12.5.0_555.42.02_linux.run
wget https://cn.download.nvidia.com/XFree86/Linux-x86_64/550.90.07/NVIDIA-Linux-x86_64-550.90.07.run
In my tests, running the installers directly usually fails; the environment has to be prepared first. If you need CUDA, you can install CUDA directly, because its installer also installs the driver.
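Because the CUDA installer brings its own driver, it can help to know which driver version a given runfile carries before also downloading a standalone driver. The sketch below assumes the runfile name follows the pattern `cuda_<cuda version>_<driver version>_linux.run`, which is inferred only from the example file name; `bundled_driver_version` is a hypothetical helper.

```shell
# bundled_driver_version RUNFILE_OR_URL  (hypothetical helper)
# Prints the third underscore-separated field of a name shaped like
# cuda_<cuda version>_<driver version>_linux.run; prints nothing for other names.
bundled_driver_version() {
    name=$(basename "$1")
    printf '%s\n' "$name" | awk -F_ '$1 == "cuda" && NF >= 4 { print $3 }'
}
```

For the example download, `bundled_driver_version cuda_12.5.0_555.42.02_linux.run` prints `555.42.02`, which is newer than the separately downloaded 550.90.07 driver.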
Prerequisites:
apt install gcc make
The linux-headers package in the repository may not match the running kernel, so the kernel may need to be upgraded as well:
apt upgrade
apt-get install linux-headers-$(uname -r)
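A quick way to see whether headers for the running kernel are already present is to look for the matching directory under /usr/src. This is a minimal sketch, assuming the Debian/Ubuntu layout `/usr/src/linux-headers-<release>`; `headers_present` is a hypothetical helper name.

```shell
# headers_present [RELEASE] [SRC_DIR]  (hypothetical helper)
# Succeeds if SRC_DIR/linux-headers-RELEASE exists.
# RELEASE defaults to the running kernel (uname -r), SRC_DIR to /usr/src.
headers_present() {
    release="${1:-$(uname -r)}"
    src_dir="${2:-/usr/src}"
    [ -d "${src_dir}/linux-headers-${release}" ]
}
```

A typical use is `headers_present || apt-get install linux-headers-$(uname -r)`. Since `uname -r` reports the kernel that is currently running, reboot first if `apt upgrade` installed a newer kernel.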
Run the CUDA installer
root@imwl:~/# ./cuda_12.5.0_555.42.02_linux.run
To uninstall:
/usr/local/cuda-12.5/bin/cuda-uninstaller
You can add /usr/local/cuda-12.5/bin to PATH
root@imwl:~/snap# /usr/local/cuda-12.5/bin/nvcc -V
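One way to add the directory to PATH without creating duplicate entries is sketched below. `path_prepend` is a hypothetical helper, and placing the call in `~/.bashrc` is an assumption about the shell setup rather than something the original notes specify.

```shell
# path_prepend DIR  (hypothetical helper)
# Puts DIR at the front of PATH unless it is already listed there.
path_prepend() {
    case ":${PATH}:" in
        *":$1:"*) ;;                  # already present, nothing to do
        *) PATH="$1:${PATH}" ;;
    esac
    export PATH
}
```

After `path_prepend /usr/local/cuda-12.5/bin`, a plain `nvcc -V` should resolve without the full path.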
Install only the driver, without CUDA
chmod +x NVIDIA-Linux-*.run
Verify:
lspci | grep -i nvidia
nvidia-smi
Uninstall:
nvidia-uninstall
Problems encountered
ERROR: You appear to be running an X server; please exit X before installing
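Install directly with apt
This is the recommended route, and the packaged versions are usually not particularly old.
apt install nvidia-driver
apt install nvidia-cuda-toolkit # recommended; also installs the NVIDIA driver
# Verify
nvidia-smi
nvcc -V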
Uninstall:
apt remove --purge 'nvidia*'
apt autoremove
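To confirm that the purge left nothing behind, the package list can be filtered for NVIDIA entries. This is a small sketch that reads `dpkg -l` output from standard input; `installed_nvidia_packages` is a hypothetical helper, and matching on the substring "nvidia" is an assumption about package naming.

```shell
# installed_nvidia_packages  (hypothetical helper)
# Reads `dpkg -l` output on stdin and prints the names of packages that are
# fully installed (status "ii") and whose name contains "nvidia".
installed_nvidia_packages() {
    awk '$1 == "ii" && $2 ~ /nvidia/ { print $2 }'
}
```

Run it as `dpkg -l | installed_nvidia_packages`; empty output means no installed package matched.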
Problems encountered
- NVIDIA-SMI has failed because it couldn’t communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
In my case Secure Boot was enabled; after disabling it, everything worked.
In other cases the cause is a version upgrade, a kernel upgrade, or something similar; then run the following, or reinstall the driver:
apt install dkms
dkms install -m nvidia -v 555.42.02 # use the version that matches the nvidia-* directory listed below
root@imwl:~# ls -l /usr/src/
total 12
drwxr-xr-x 25 root root 4096 Jun 30 11:53 linux-headers-5.15.0-113
drwxr-xr-x 7 root root 4096 Jun 30 11:53 linux-headers-5.15.0-113-generic
drwxr-xr-x 8 root root 4096 Jun 27 03:15 nvidia-555.42.02
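The version passed to `dkms install -v` can be read from the directory name instead of being typed by hand. The sketch below assumes the driver sources live in `/usr/src/nvidia-<version>`, as in the listing above; `nvidia_src_versions` is a hypothetical helper.

```shell
# nvidia_src_versions [SRC_DIR]  (hypothetical helper)
# Prints the version part of every SRC_DIR/nvidia-<version> directory, one per line.
# SRC_DIR defaults to /usr/src.
nvidia_src_versions() {
    src_dir="${1:-/usr/src}"
    for d in "$src_dir"/nvidia-*/; do
        [ -d "$d" ] || continue           # the glob matched nothing
        d="${d%/}"                        # drop the trailing slash
        printf '%s\n' "${d##*/nvidia-}"   # keep only the version
    done
}
```

With a single driver source tree present, `dkms install -m nvidia -v "$(nvidia_src_versions)"` would be equivalent to the command above.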
Files to delete when uninstalling
sudo rm /etc/X11/xorg.conf
sudo rm -rf /etc/X11/xorg.conf.d/10-nvidia.conf
sudo rm -rf /usr/share/X11/xorg.conf.d/10-nvidia.conf
sudo rm -rf /lib/modprobe.d/nvidia.conf
sudo rm -rf /etc/modprobe.d/nvidia.conf
Python test
pip3 install pycuda -i https://pypi.tuna.tsinghua.edu.cn/simple
error: externally-managed-environment
× This environment is externally managed
╰─> To install Python packages system-wide, try apt install
python3-xyz, where xyz is the package you are trying to
install.
If you wish to install a non-Debian-packaged Python package,
create a virtual environment using python3 -m venv path/to/venv.
Then use path/to/venv/bin/python and path/to/venv/bin/pip. Make
sure you have python3-full installed.
If you wish to install a non-Debian packaged Python application,
it may be easiest to use pipx install xyz, which will manage a
virtual environment for you. Make sure you have pipx installed.
See /usr/share/doc/python3.11/README.venv for more information.
note: If you believe this is a mistake, please contact your Python installation or OS distribution provider. You can override this, at the risk of breaking your Python installation or OS, by passing --break-system-packages.
hint: See PEP 668 for the detailed specification.
apt install python3-pycuda # or install it inside a virtual environment
python3 test.py & # run several times
import pycuda.autoinit
Verify:
root@test:~# nvidia-smi
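Instead of reading the process table in the `nvidia-smi` output by eye, the compute processes can be listed as CSV and counted. This is a sketch under the assumption that the background `test.py` runs show up with "python" in their process name; `count_gpu_procs` is a hypothetical helper.

```shell
# count_gpu_procs PATTERN  (hypothetical helper)
# Reads "pid, process_name, used_memory" CSV lines on stdin and prints how many
# of them have a process name containing PATTERN.
count_gpu_procs() {
    awk -F', *' -v pat="$1" 'index($2, pat) > 0 { n++ } END { print n + 0 }'
}
```

It is meant to be fed from `nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv,noheader | count_gpu_procs python`.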
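- Failure caused by nouveau still being enabled
Running the disable steps alone did not help; it only worked after rebooting the machine.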
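Whether nouveau is still loaded, and therefore whether a reboot is still pending, can be checked from the `lsmod` output. The sketch below matches the module name exactly in the first column; `module_loaded` is a hypothetical helper.

```shell
# module_loaded NAME  (hypothetical helper)
# Reads `lsmod` output on stdin; succeeds if a module called NAME is listed.
# The header line is skipped and only the first column is compared.
module_loaded() {
    awk -v name="$1" 'NR > 1 && $1 == name { found = 1 } END { exit !found }'
}
```

For example: `lsmod | module_loaded nouveau && echo "nouveau is still loaded; reboot before installing the driver"`.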
Using DCGM
Official documentation: https://docs.nvidia.com/datacenter/dcgm/2.2/dcgm-user-guide/getting-started.html
Installation (current version):
architecture=x86_64
distribution=$(. /etc/os-release;echo $ID$VERSION_ID | sed -e 's/\.//g')
On this machine that currently resolves to https://developer.download.nvidia.com/compute/cuda/repos/debian12/x86_64/
echo "deb http://developer.download.nvidia.com/compute/cuda/repos/$distribution/$architecture /" | sudo tee /etc/apt/sources.list.d/cuda.list
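The `distribution` value is just the `ID` and `VERSION_ID` fields of os-release joined together with the dots removed. The sketch below wraps that in a hypothetical function, `distro_tag`, that takes the file as a parameter so the result can be checked before it goes into a repository URL.

```shell
# distro_tag [OS_RELEASE_FILE]  (hypothetical helper)
# Prints ID followed by VERSION_ID from an os-release file, with dots removed.
# OS_RELEASE_FILE defaults to /etc/os-release.
distro_tag() {
    os_release="${1:-/etc/os-release}"
    # Source the file in a subshell so its variables do not leak into the caller
    ( . "$os_release" && printf '%s%s\n' "$ID" "$VERSION_ID" | tr -d '.' )
}
```

On a Debian 12 system this yields `debian12`, matching the repository path shown above; an os-release with `ID=ubuntu` and `VERSION_ID="22.04"` would yield `ubuntu2204`.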
Usage
root@imwl:/usr/local/dcgm/scripts# dcgmi discovery -l
Enable persistence mode
root@test:~# nvidia-smi -i 0 -pm 1
Enabled persistence mode for GPU 00000000:01:00.0.
All done.
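To confirm the setting took effect on every GPU, the persistence mode can be queried in CSV form and checked line by line. This is a minimal sketch; `all_persistent` is a hypothetical helper and expects one `Enabled` or `Disabled` value per line.

```shell
# all_persistent  (hypothetical helper)
# Reads one persistence-mode value per line on stdin; succeeds only if there is
# at least one line and every line reads "Enabled".
all_persistent() {
    awk '{ n++ } $1 != "Enabled" { bad = 1 } END { exit (n > 0 && !bad) ? 0 : 1 }'
}
```

It is meant to be used as `nvidia-smi --query-gpu=persistence_mode --format=csv,noheader | all_persistent && echo "persistence mode is on for all GPUs"`.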