Using NVIDIA GPUs with Docker and Kubernetes

Prerequisites

https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html

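The NVIDIA Container Toolkit needs to be installed (the package below is nvidia-container-toolkit, not the CUDA Toolkit); see the official guide above for details, including how to add NVIDIA's apt repository.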

apt-get install -y nvidia-container-toolkit

Configuring the runtime

Configure the runtime with nvidia-ctk, which is the officially recommended approach:

# docker
nvidia-ctk runtime configure --runtime=docker --config=/etc/docker/daemon.json

# containerd
nvidia-ctk runtime configure --runtime=containerd

These commands modify /etc/docker/daemon.json and /etc/containerd/config.toml, respectively. In daemon.json:

"default-runtime": "nvidia",
"runtimes": {
    "nvidia": {
        "args": [],
        "path": "nvidia-container-runtime"
    }
},

and, in config.toml:


[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
  privileged_without_host_devices = false
  runtime_engine = ""
  runtime_root = ""
  runtime_type = "io.containerd.runc.v2"
  [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
    BinaryName = "/usr/bin/nvidia-container-runtime"
    CriuImagePath = ""
    CriuPath = ""
    CriuWorkPath = ""
    IoGid = 0
    IoUid = 0
    NoNewKeyring = false
    NoPivotRoot = false
    Root = ""
    ShimCgroup = ""
    SystemdCgroup = true

In my test, containerd's default_runtime_name was still runc after this, so I changed it to nvidia:

[plugins."io.containerd.grpc.v1.cri".containerd]
  default_runtime_name = "nvidia"
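containerd has to be restarted before the new default takes effect. As a quick sanity check beforehand, the value can be read back from the config file. The following is only a minimal sketch: the helper name containerd_default_runtime is hypothetical, and it assumes the key appears once, on its own line, as in the snippet above (it is a simple line match, not a TOML parser).

```shell
# Hypothetical helper: print the value of default_runtime_name from a
# containerd config file (simple line match, not a full TOML parser).
containerd_default_runtime() {
  awk -F'"' '/^[ \t]*default_runtime_name[ \t]*=/ { print $2; exit }' "$1"
}

# Example: only restart containerd once the file says "nvidia".
# [ "$(containerd_default_runtime /etc/containerd/config.toml)" = "nvidia" ] \
#   && systemctl restart containerd
```

nvidia-ctk also documents a --set-as-default flag for runtime configure that is meant to write this setting for you; whether it is available depends on the toolkit version installed.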

docker

Before nvidia-container-toolkit is installed:

root@debian:~# docker run --rm --gpus all nvidia/cuda:12.4.0-base-ubuntu22.04 nvidia-smi
docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]].

After installation:


root@debian:~# docker run --rm --gpus all nvidia/cuda:12.4.0-base-ubuntu22.04 nvidia-smi
Fri Jun 14 06:38:57 2024
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.147.05   Driver Version: 525.147.05   CUDA Version: 12.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  On   | 00000000:01:00.0 Off |                  N/A |
| N/A   45C    P8    N/A /  N/A |      5MiB /  4096MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

k8s

The NVIDIA device plugin must be installed first:

https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.15.0/deployments/static/nvidia-device-plugin.yml

Alternatively, use https://github.com/NVIDIA/gpu-operator (which can also install the driver automatically).

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin-ds
  updateStrategy:
    type: RollingUpdate
  template:
    metadata:
      labels:
        name: nvidia-device-plugin-ds
    spec:
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
      # Mark this pod as a critical add-on; when enabled, the critical add-on
      # scheduler reserves resources for critical add-on pods so that they can
      # be rescheduled after a failure.
      # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
      priorityClassName: "system-node-critical"
      containers:
      - image: nvcr.io/nvidia/k8s-device-plugin:v0.15.0
        name: nvidia-device-plugin-ctr
        env:
          - name: FAIL_ON_INIT_ERROR
            value: "false"
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop: ["ALL"]
        volumeMounts:
        - name: device-plugin
          mountPath: /var/lib/kubelet/device-plugins
      volumes:
      - name: device-plugin
        hostPath:
          path: /var/lib/kubelet/device-plugins

After deploying, I checked the logs and found an error:

#  kubectl logs -f -n kube-system nvidia-device-plugin-daemonset-fb7lc 
I0618 07:00:59.142415       1 main.go:178] Starting FS watcher.
I0618 07:00:59.142533       1 main.go:185] Starting OS watcher.
I0618 07:00:59.143158       1 main.go:200] Starting Plugins.
I0618 07:00:59.143180       1 main.go:257] Loading configuration.
I0618 07:00:59.144107       1 main.go:265] Updating config with default resource matching patterns.
I0618 07:00:59.144327       1 main.go:276] 
Running with config:
{
  "version": "v1",
  "flags": {
    "migStrategy": "none",
    "failOnInitError": false,
    "mpsRoot": "",
    "nvidiaDriverRoot": "/",
    "gdsEnabled": false,
    "mofedEnabled": false,
    "useNodeFeatureAPI": null,
    "plugin": {
      "passDeviceSpecs": false,
      "deviceListStrategy": [
        "envvar"
      ],
      "deviceIDStrategy": "uuid",
      "cdiAnnotationPrefix": "cdi.k8s.io/",
      "nvidiaCTKPath": "/usr/bin/nvidia-ctk",
      "containerDriverRoot": "/driver-root"
    }
  },
  "resources": {
    "gpus": [
      {
        "pattern": "*",
        "name": "nvidia.com/gpu"
      }
    ]
  },
  "sharing": {
    "timeSlicing": {}
  }
}
I0618 07:00:59.144338       1 main.go:279] Retrieving plugins.
I0618 07:00:59.145558       1 factory.go:104] Detected NVML platform: found NVML library
I0618 07:00:59.145613       1 factory.go:104] Detected non-Tegra platform: /sys/devices/soc0/family file not found
I0618 07:00:59.161587       1 server.go:216] Starting GRPC server for 'nvidia.com/gpu'
I0618 07:00:59.162167       1 server.go:147] Starting to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
I0618 07:00:59.176954       1 server.go:154] Registered device plugin for 'nvidia.com/gpu' with Kubelet

In my case the cause was that containerd's default runtime was still runc. After changing it to nvidia, I restarted containerd and the plugin pod:

 kubectl logs -f -n kube-system nvidia-device-plugin-daemonset-j4jdg 
I0619 07:42:06.034169       1 main.go:178] Starting FS watcher.
I0619 07:42:06.034445       1 main.go:185] Starting OS watcher.
I0619 07:42:06.035026       1 main.go:200] Starting Plugins.
I0619 07:42:06.035133       1 main.go:257] Loading configuration.
I0619 07:42:06.039835       1 main.go:265] Updating config with default resource matching patterns.
I0619 07:42:06.040882       1 main.go:276] 
Running with config:
{
  "version": "v1",
  "flags": {
    "migStrategy": "none",
    "failOnInitError": false,
    "mpsRoot": "",
    "nvidiaDriverRoot": "/",
    "gdsEnabled": false,
    "mofedEnabled": false,
    "useNodeFeatureAPI": null,
    "plugin": {
      "passDeviceSpecs": false,
      "deviceListStrategy": [
        "envvar"
      ],
      "deviceIDStrategy": "uuid",
      "cdiAnnotationPrefix": "cdi.k8s.io/",
      "nvidiaCTKPath": "/usr/bin/nvidia-ctk",
      "containerDriverRoot": "/driver-root"
    }
  },
  "resources": {
    "gpus": [
      {
        "pattern": "*",
        "name": "nvidia.com/gpu"
      }
    ]
  },
  "sharing": {
    "timeSlicing": {}
  }
}
I0619 07:42:06.040926       1 main.go:279] Retrieving plugins.
I0619 07:42:06.044786       1 factory.go:104] Detected NVML platform: found NVML library
I0619 07:42:06.044882       1 factory.go:104] Detected non-Tegra platform: /sys/devices/soc0/family file not found
I0619 07:42:06.091857       1 server.go:216] Starting GRPC server for 'nvidia.com/gpu'
I0619 07:42:06.102207       1 server.go:147] Starting to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
I0619 07:42:06.122055       1 server.go:154] Registered device plugin for 'nvidia.com/gpu' with Kubelet

Excerpt from kubectl describe node:

Allocatable:
  cpu:                4
  ephemeral-storage:  111498153800
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             3424344Ki
  nvidia.com/gpu:     1
  pods:               110
System Info:
  Machine ID:                 5ec87fb30f1b4ebc80cf9547a632720b
  System UUID:                36444335-3830-5133-544c-705a0f19f5b0
  Boot ID:                    c60b92f0-c8ea-4526-876a-109d4bf4fdf7
  Kernel Version:             6.1.0-21-amd64
  OS Image:                   Debian GNU/Linux 12 (bookworm)
  Operating System:           linux
  Architecture:               amd64
  Container Runtime Version:  containerd://1.6.23
  Kubelet Version:            v1.28.1
  Kube-Proxy Version:         v1.28.1
PodCIDR:                      172.20.0.0/24
PodCIDRs:                     172.20.0.0/24
Non-terminated Pods:          (8 in total)
  Namespace                   Name                                          CPU Requests  CPU Limits  Memory Requests  Memory Limits  Age
  ---------                   ----                                          ------------  ----------  ---------------  -------------  ---
        20h
  kube-system                 nvidia-device-plugin-daemonset-j4jdg          0 (0%)        0 (0%)      0 (0%)           0 (0%)         41s
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests    Limits
  --------           --------    ------
  cpu                475m (11%)  0 (0%)

  nvidia.com/gpu     0           0

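This machine has only one GPU, and Kubernetes can only allocate GPUs in whole numbers, so a pod cannot request a fraction of the card. Virtualization with vGPU or MIG is possible, but the card has to support it.

Testing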
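Because GPUs are handed out as whole units, the allocatable count decides how many GPU pods a node can run at once. Here is a small sketch for pulling that number out of kubectl describe node output. The function name allocatable_gpus is hypothetical, and the parsing assumes the layout shown in the excerpt above.

```shell
# Hypothetical helper: read `kubectl describe node` output on stdin and
# print the allocatable nvidia.com/gpu count from the Allocatable section.
allocatable_gpus() {
  awk '
    /^Allocatable:/ { in_section = 1; next }   # start of the section
    /^[^ \t]/       { in_section = 0 }         # any other top-level key ends it
    in_section && $1 == "nvidia.com/gpu:" { print $2 }
  '
}

# Example (needs cluster access; <node-name> is a placeholder):
# kubectl describe node <node-name> | allocatable_gpus
```

Restricting the match to the Allocatable section avoids picking up the nvidia.com/gpu rows that also appear under Capacity and Allocated resources.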

apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod 
spec:
  restartPolicy: Never
  containers:
    - name: cuda-container
      image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2
      resources:
        limits:
          nvidia.com/gpu: 1 # requesting 1 GPU
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule

root@debian:~# cat testgpu2.yaml 
apiVersion: v1
kind: Pod
metadata:
  name: gpu-demo-vectoradd
spec:
  restartPolicy: Never
  containers:
  - name: vectoradd
    image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2
    command:
    - bash
    - -c
    args:
    - |
        /tmp/vectorAdd
        nvidia-smi -L && nvidia-smi && sleep 60
    resources:
      limits:
        nvidia.com/gpu: 1

While this pod is Running, other pods that request a GPU stay Pending. Its log:


kubectl logs -f gpu-demo-vectoradd 
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
GPU 0: NVIDIA GeForce GTX 950M (UUID: GPU-8eef98eb-0dbc-be7e-99d4-5992e5ad79e0)
Wed Jun 19 07:45:41 2024       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.147.05   Driver Version: 525.147.05   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  On   | 00000000:01:00.0 Off |                  N/A |
| N/A   46C    P0    N/A /  N/A |      3MiB /  4096MiB |      2%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+
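As noted above, while one pod holds the only GPU, other GPU pods stay Pending. A minimal sketch for spotting them in plain kubectl get pods output follows. The helper name pending_pods is hypothetical, and it assumes the default column order (NAME, READY, STATUS, RESTARTS, AGE) without the -A or -o wide options.

```shell
# Hypothetical helper: read default `kubectl get pods` output on stdin and
# print the names of pods whose STATUS column is Pending.
pending_pods() {
  awk 'NR > 1 && $3 == "Pending" { print $1 }'
}

# Example (needs cluster access):
# kubectl get pods | pending_pods
```

kubectl can also do this filtering server-side with --field-selector=status.phase=Pending; the awk version is shown only because it works on output that has already been captured.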

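Using gpu-operator directly

On Kubernetes you can use gpu-operator directly; it installs the driver, the container toolkit, and the device plugin automatically.

The driver container supports only a limited set of operating systems, so check the available tags first: https://catalog.ngc.nvidia.com/orgs/nvidia/containers/driver/tags

This walkthrough uses gpu-operator v24.3.0 on a freshly installed Ubuntu Server 22.04.

The images used are listed below; in an air-gapped environment you may need to prepare them in advance. Note that the driver installed this way exists only inside the container: when the nvidia-driver-daemonset container is deleted and recreated, the driver is compiled again. It is therefore best to install the driver and plugin on the server beforehand.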

nvcr.io/nvidia/cloud-native/gpu-operator-validator   v24.3.0                 
nvcr.io/nvidia/cloud-native/k8s-driver-manager       v0.6.8                  
nvcr.io/nvidia/gpu-operator                          v24.3.0                 
# nvcr.io/nvidia/k8s-device-plugin                     v0.15.0                 
nvcr.io/nvidia/k8s-device-plugin                     v0.15.0-ubi8            
nvcr.io/nvidia/k8s/container-toolkit                 v1.15.0-ubuntu20.04     
# nvcr.io/nvidia/k8s/cuda-sample                       vectoradd-cuda10.2      
nvcr.io/nvidia/k8s/dcgm-exporter                     3.3.5-3.4.1-ubuntu22.04 
registry.k8s.io/nfd/node-feature-discovery           v0.15.4                 
nvcr.io/nvidia/driver                                550.54.15-ubuntu22.04
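For an air-gapped environment, the list above can be turned into image references to pull and export on a connected machine. This is a minimal sketch, assuming the list is saved as two whitespace-separated columns (repository, tag) and that commented-out lines should be skipped. The names image_refs and images.txt are hypothetical.

```shell
# Hypothetical helper: convert "repository  tag" lines on stdin into
# "repository:tag" references, ignoring commented-out and blank lines.
image_refs() {
  awk '!/^[ \t]*#/ && NF >= 2 { print $1 ":" $2 }'
}

# Example: pull every listed image on a machine with internet access.
# image_refs < images.txt | xargs -n1 docker pull
```

The pulled images can then be exported with docker save and loaded on the cluster nodes; on containerd nodes this is typically done with ctr -n k8s.io images import, though the exact procedure depends on your setup.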

Installing with helm

helm repo add nvidia https://nvidia.github.io/gpu-operator
helm repo update

helm install --wait --generate-name \
  -n gpu-operator --create-namespace \
  nvidia/gpu-operator
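If the driver (and possibly the toolkit) is already installed on the host, as recommended earlier, the corresponding operator components can be switched off at install time. The sketch below only prints the helm command so it can be reviewed before running. The function name is hypothetical, and driver.enabled and toolkit.enabled are assumed to be the chart values matching the driver and toolkit sections of the ClusterPolicy, so check the values of your chart version before relying on them.

```shell
# Hypothetical helper: print the helm command for installing gpu-operator
# with the driver and toolkit components enabled or disabled (true/false).
gpu_operator_install_cmd() {
  driver_enabled="$1"
  toolkit_enabled="$2"
  printf '%s\n' "helm install --wait --generate-name -n gpu-operator --create-namespace nvidia/gpu-operator --set driver.enabled=${driver_enabled} --set toolkit.enabled=${toolkit_enabled}"
}

# Example: driver and toolkit are both pre-installed on the host.
gpu_operator_install_cmd false false
```

Printing instead of executing is a deliberate choice here: it keeps the sketch side-effect free, and the output can be pasted into a shell once it looks right.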

Once it completes, the following pods are created:

gpu-operator   gpu-feature-discovery-zd5qc                                       1/1     Running     0          69s
gpu-operator   gpu-operator-1719300178-node-feature-discovery-gc-c4b5bb74pcs28   1/1     Running     0          88s
gpu-operator   gpu-operator-1719300178-node-feature-discovery-master-664c5pbsj   1/1     Running     0          88s
gpu-operator   gpu-operator-1719300178-node-feature-discovery-worker-q544b       1/1     Running     0          88s
gpu-operator   gpu-operator-ff9fb8679-w5xqk                                      1/1     Running     0          88s
gpu-operator   nvidia-container-toolkit-daemonset-vn9dv                          1/1     Running     0          69s
gpu-operator   nvidia-cuda-validator-fv8vc                                       0/1     Completed   0          42s
gpu-operator   nvidia-dcgm-exporter-kfth8                                        1/1     Running     0          69s
gpu-operator   nvidia-device-plugin-daemonset-bc9q9                              1/1     Running     0          69s
gpu-operator   nvidia-operator-validator-lfkl2                                   1/1     Running     0          69s
gpu-operator   nvidia-driver-daemonset-ts2bf                                   1/1     Running     0          20s

These pods are all part of gpu-operator. Each one performs a specific task so that the Kubernetes cluster can detect and use GPUs correctly:

gpu-feature-discovery:
- Pod name: gpu-feature-discovery-zd5qc
- Role: detects GPU-related features on the node and reports them to Kubernetes so the scheduler can make better-informed scheduling decisions.
node-feature-discovery:
- Pod names:
  - gpu-operator-1719300178-node-feature-discovery-gc-c4b5bb74pcs28
  - gpu-operator-1719300178-node-feature-discovery-master-664c5pbsj
  - gpu-operator-1719300178-node-feature-discovery-worker-q544b
- Role: detects the node's hardware and software features and adds them to the node as labels, again to help the scheduler.
gpu-operator:
- Pod name: gpu-operator-ff9fb8679-w5xqk
- Role: manages and deploys the GPU-related components and resources, including the driver, tools, and plugins.
nvidia-container-toolkit-daemonset:
- Pod name: nvidia-container-toolkit-daemonset-vn9dv
- Role: provides support for using NVIDIA GPUs inside containers; includes the NVIDIA Container Runtime.
nvidia-cuda-validator:
- Pod name: nvidia-cuda-validator-fv8vc
- Role: validates that CUDA is installed and working, so GPU resources can actually be used.
nvidia-dcgm-exporter:
- Pod name: nvidia-dcgm-exporter-kfth8
- Role: collects GPU monitoring data from NVIDIA Data Center GPU Manager (DCGM) and exports it in Prometheus format.
nvidia-device-plugin-daemonset:
- Pod name: nvidia-device-plugin-daemonset-bc9q9
- Role: the NVIDIA device plugin for Kubernetes, which makes GPU resources usable and manageable in the cluster.
nvidia-operator-validator:
- Pod name: nvidia-operator-validator-lfkl2
- Role: validates the installation and functionality of the GPU Operator components, making sure they all run correctly.
nvidia-driver-daemonset:
- Pod name: nvidia-driver-daemonset-ts2bf
- Role: installs and manages the NVIDIA GPU driver on each node.

Together, these pods ensure that GPU resources are correctly detected, used, and managed in the Kubernetes cluster.

Running kubectl edit clusterpolicies.nvidia.com shows the default component settings:

ccManager:
  defaultMode: "off"
  enabled: false
  env: []
  image: k8s-cc-manager
  imagePullPolicy: IfNotPresent
  repository: nvcr.io/nvidia/cloud-native
  version: v0.1.1
cdi:
  default: false
  enabled: false
daemonsets:
  labels:
    app.kubernetes.io/managed-by: gpu-operator
    helm.sh/chart: gpu-operator-v24.3.0
  priorityClassName: system-node-critical
  rollingUpdate:
    maxUnavailable: "1"
  tolerations:
  - effect: NoSchedule
    key: nvidia.com/gpu
    operator: Exists
  updateStrategy: RollingUpdate
dcgm:
  enabled: false
  hostPort: 5555
  image: dcgm
  imagePullPolicy: IfNotPresent
  repository: nvcr.io/nvidia/cloud-native
  version: 3.3.5-1-ubuntu22.04
dcgmExporter:
  enabled: true
  env:
  - name: DCGM_EXPORTER_LISTEN
    value: :9400
  - name: DCGM_EXPORTER_KUBERNETES
    value: "true"
  - name: DCGM_EXPORTER_COLLECTORS
    value: /etc/dcgm-exporter/dcp-metrics-included.csv
  image: dcgm-exporter
  imagePullPolicy: IfNotPresent
  repository: nvcr.io/nvidia/k8s
  serviceMonitor:
    additionalLabels: {}
    enabled: false
    honorLabels: false
    interval: 15s
    relabelings: []
  version: 3.3.5-3.4.1-ubuntu22.04
devicePlugin:
  enabled: true
  env:
  - name: PASS_DEVICE_SPECS
    value: "true"
  - name: FAIL_ON_INIT_ERROR
    value: "true"
  - name: DEVICE_LIST_STRATEGY
    value: envvar
  - name: DEVICE_ID_STRATEGY
    value: uuid
  - name: NVIDIA_VISIBLE_DEVICES
    value: all
  - name: NVIDIA_DRIVER_CAPABILITIES
    value: all
  image: k8s-device-plugin
  imagePullPolicy: IfNotPresent
  repository: nvcr.io/nvidia
  version: v0.15.0-ubi8
driver:
  certConfig:
    name: ""
  enabled: true
  image: driver
  imagePullPolicy: IfNotPresent
  kernelModuleConfig:
    name: ""
  licensingConfig:
    configMapName: ""
    nlsEnabled: true
  manager:
    env:
    - name: ENABLE_GPU_POD_EVICTION
      value: "true"
    - name: ENABLE_AUTO_DRAIN
      value: "false"
    - name: DRAIN_USE_FORCE
      value: "false"
    - name: DRAIN_POD_SELECTOR_LABEL
      value: ""
    - name: DRAIN_TIMEOUT_SECONDS
      value: 0s
    - name: DRAIN_DELETE_EMPTYDIR_DATA
      value: "false"
    image: k8s-driver-manager
    imagePullPolicy: IfNotPresent
    repository: nvcr.io/nvidia/cloud-native
    version: v0.6.8
  rdma:
    enabled: false
    useHostMofed: false
  repoConfig:
    configMapName: ""
  repository: nvcr.io/nvidia
  startupProbe:
    failureThreshold: 120
    initialDelaySeconds: 60
    periodSeconds: 10
    timeoutSeconds: 60
  upgradePolicy:
    autoUpgrade: true
    drain:
      deleteEmptyDir: false
      enable: false
      force: false
      timeoutSeconds: 300
    maxParallelUpgrades: 1
    maxUnavailable: 25%
    podDeletion:
      deleteEmptyDir: false
      force: false
      timeoutSeconds: 300
    waitForCompletion:
      timeoutSeconds: 0
  useNvidiaDriverCRD: false
  useOpenKernelModules: false
  usePrecompiled: false
  version: 550.54.15
  virtualTopology:
    config: ""
gdrcopy:
  enabled: false
  image: gdrdrv
  imagePullPolicy: IfNotPresent
  repository: nvcr.io/nvidia/cloud-native
  version: v2.4.1
gfd:
  enabled: true
  env:
  - name: GFD_SLEEP_INTERVAL
    value: 60s
  - name: GFD_FAIL_ON_INIT_ERROR
    value: "true"
  image: k8s-device-plugin
  imagePullPolicy: IfNotPresent
  repository: nvcr.io/nvidia
  version: v0.15.0-ubi8
kataManager:
  config:
    artifactsDir: /opt/nvidia-gpu-operator/artifacts/runtimeclasses
    runtimeClasses:
    - artifacts:
        pullSecret: ""
        url: nvcr.io/nvidia/cloud-native/kata-gpu-artifacts:ubuntu22.04-535.54.03
      name: kata-nvidia-gpu
      nodeSelector: {}
    - artifacts:
        pullSecret: ""
        url: nvcr.io/nvidia/cloud-native/kata-gpu-artifacts:ubuntu22.04-535.86.10-snp
      name: kata-nvidia-gpu-snp
      nodeSelector:
        nvidia.com/cc.capable: "true"
  enabled: false
  image: k8s-kata-manager
  imagePullPolicy: IfNotPresent
  repository: nvcr.io/nvidia/cloud-native
  version: v0.2.0
mig:
  strategy: single
migManager:
  config:
    default: all-disabled
    name: default-mig-parted-config
  enabled: true
  env:
  - name: WITH_REBOOT
    value: "false"
  gpuClientsConfig:
    name: ""
  image: k8s-mig-manager
  imagePullPolicy: IfNotPresent
  repository: nvcr.io/nvidia/cloud-native
  version: v0.7.0-ubuntu20.04
nodeStatusExporter:
  enabled: false
  image: gpu-operator-validator
  imagePullPolicy: IfNotPresent
  repository: nvcr.io/nvidia/cloud-native
  version: v24.3.0
operator:
  defaultRuntime: docker
  initContainer:
    image: cuda
    imagePullPolicy: IfNotPresent
    repository: nvcr.io/nvidia
    version: 12.4.1-base-ubi8
  runtimeClass: nvidia
psa:
  enabled: false
sandboxDevicePlugin:
  enabled: true
  image: kubevirt-gpu-device-plugin
  imagePullPolicy: IfNotPresent
  repository: nvcr.io/nvidia
  version: v1.2.7
sandboxWorkloads:
  defaultWorkload: container
  enabled: false
toolkit:
  enabled: true
  image: container-toolkit
  imagePullPolicy: IfNotPresent
  installDir: /usr/local/nvidia
  repository: nvcr.io/nvidia/k8s
  version: v1.15.0-ubuntu20.04
validator:
  image: gpu-operator-validator
  imagePullPolicy: IfNotPresent
  plugin:
    env:
    - name: WITH_WORKLOAD
      value: "false"
  repository: nvcr.io/nvidia/cloud-native
  version: v24.3.0
vfioManager:
  driverManager:
    env:
    - name: ENABLE_GPU_POD_EVICTION
      value: "false"
    - name: ENABLE_AUTO_DRAIN
      value: "false"
    image: k8s-driver-manager
    imagePullPolicy: IfNotPresent
    repository: nvcr.io/nvidia/cloud-native
    version: v0.6.8
  enabled: true
  image: cuda
  imagePullPolicy: IfNotPresent
  repository: nvcr.io/nvidia
  version: 12.4.1-base-ubi8
vgpuDeviceManager:
  config:
    default: default
    name: ""
  enabled: true
  image: vgpu-device-manager
  imagePullPolicy: IfNotPresent
  repository: nvcr.io/nvidia/cloud-native
  version: v0.2.6
vgpuManager:
  driverManager:
    env:
    - name: ENABLE_GPU_POD_EVICTION
      value: "false"
    - name: ENABLE_AUTO_DRAIN
      value: "false"
    image: k8s-driver-manager
    imagePullPolicy: IfNotPresent
    repository: nvcr.io/nvidia/cloud-native
    version: v0.6.8
  enabled: false
  image: vgpu-manager
  imagePullPolicy: IfNotPresent

Inspecting the node

kubectl describe node
Name:               192.168.2.149
Roles:              master
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/os=linux
                    feature.node.kubernetes.io/cpu-cpuid.ADX=true
                    feature.node.kubernetes.io/cpu-cpuid.AESNI=true
                    feature.node.kubernetes.io/cpu-cpuid.AVX=true
                    feature.node.kubernetes.io/cpu-cpuid.AVX2=true
                    feature.node.kubernetes.io/cpu-cpuid.CMPXCHG8=true
                    feature.node.kubernetes.io/cpu-cpuid.FLUSH_L1D=true
                    feature.node.kubernetes.io/cpu-cpuid.FMA3=true
                    feature.node.kubernetes.io/cpu-cpuid.FXSR=true
                    feature.node.kubernetes.io/cpu-cpuid.FXSROPT=true
                    feature.node.kubernetes.io/cpu-cpuid.IA32_ARCH_CAP=true
                    feature.node.kubernetes.io/cpu-cpuid.IBPB=true
                    feature.node.kubernetes.io/cpu-cpuid.LAHF=true
                    feature.node.kubernetes.io/cpu-cpuid.MD_CLEAR=true
                    feature.node.kubernetes.io/cpu-cpuid.MOVBE=true
                    feature.node.kubernetes.io/cpu-cpuid.MPX=true
                    feature.node.kubernetes.io/cpu-cpuid.OSXSAVE=true
                    feature.node.kubernetes.io/cpu-cpuid.RTM_ALWAYS_ABORT=true
                    feature.node.kubernetes.io/cpu-cpuid.SPEC_CTRL_SSBD=true
                    feature.node.kubernetes.io/cpu-cpuid.SRBDS_CTRL=true
                    feature.node.kubernetes.io/cpu-cpuid.STIBP=true
                    feature.node.kubernetes.io/cpu-cpuid.SYSCALL=true
                    feature.node.kubernetes.io/cpu-cpuid.SYSEE=true
                    feature.node.kubernetes.io/cpu-cpuid.VMX=true
                    feature.node.kubernetes.io/cpu-cpuid.X87=true
                    feature.node.kubernetes.io/cpu-cpuid.XGETBV1=true
                    feature.node.kubernetes.io/cpu-cpuid.XSAVE=true
                    feature.node.kubernetes.io/cpu-cpuid.XSAVEC=true
                    feature.node.kubernetes.io/cpu-cpuid.XSAVEOPT=true
                    feature.node.kubernetes.io/cpu-cpuid.XSAVES=true
                    feature.node.kubernetes.io/cpu-cstate.enabled=true
                    feature.node.kubernetes.io/cpu-hardware_multithreading=false
                    feature.node.kubernetes.io/cpu-model.family=6
                    feature.node.kubernetes.io/cpu-model.id=94
                    feature.node.kubernetes.io/cpu-model.vendor_id=Intel
                    feature.node.kubernetes.io/cpu-pstate.scaling_governor=powersave
                    feature.node.kubernetes.io/cpu-pstate.status=active
                    feature.node.kubernetes.io/cpu-pstate.turbo=true
                    feature.node.kubernetes.io/cpu-security.sgx.enabled=true
                    feature.node.kubernetes.io/kernel-config.NO_HZ_FULL=true
                    feature.node.kubernetes.io/kernel-version.full=6.1.0-21-amd64
                    feature.node.kubernetes.io/kernel-version.major=6
                    feature.node.kubernetes.io/kernel-version.minor=1
                    feature.node.kubernetes.io/kernel-version.revision=0
                    feature.node.kubernetes.io/pci-10de.present=true
                    feature.node.kubernetes.io/pci-10ec.present=true
                    feature.node.kubernetes.io/pci-8086.present=true
                    feature.node.kubernetes.io/storage-nonrotationaldisk=true
                    feature.node.kubernetes.io/system-os_release.ID=debian
                    feature.node.kubernetes.io/system-os_release.VERSION_ID=12
                    feature.node.kubernetes.io/system-os_release.VERSION_ID.major=12
                    feature.node.kubernetes.io/usb-ef_05c8_0379.present=true
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=192.168.2.149
                    kubernetes.io/os=linux
                    kubernetes.io/role=master
                    nvidia.com/cuda.driver-version.full=525.147.05
                    nvidia.com/cuda.driver-version.major=525
                    nvidia.com/cuda.driver-version.minor=147
                    nvidia.com/cuda.driver-version.revision=05
                    nvidia.com/cuda.driver.major=525
                    nvidia.com/cuda.driver.minor=147
                    nvidia.com/cuda.driver.rev=05
                    nvidia.com/cuda.runtime-version.full=12.0
                    nvidia.com/cuda.runtime-version.major=12
                    nvidia.com/cuda.runtime-version.minor=0
                    nvidia.com/cuda.runtime.major=12
                    nvidia.com/cuda.runtime.minor=0
                    nvidia.com/gfd.timestamp=1719300236
                    nvidia.com/gpu-driver-upgrade-state=upgrade-done
                    nvidia.com/gpu.compute.major=5
                    nvidia.com/gpu.compute.minor=0
                    nvidia.com/gpu.count=1
                    nvidia.com/gpu.deploy.container-toolkit=true
                    nvidia.com/gpu.deploy.dcgm=true
                    nvidia.com/gpu.deploy.dcgm-exporter=true
                    nvidia.com/gpu.deploy.device-plugin=true
                    nvidia.com/gpu.deploy.driver=pre-installed
                    nvidia.com/gpu.deploy.gpu-feature-discovery=true
                    nvidia.com/gpu.deploy.node-status-exporter=true
                    nvidia.com/gpu.deploy.operator-validator=true
                    nvidia.com/gpu.family=maxwell
                    nvidia.com/gpu.machine=HP-Pavilion-Gaming-Notebook
                    nvidia.com/gpu.memory=4096
                    nvidia.com/gpu.present=true
                    nvidia.com/gpu.product=NVIDIA-GeForce-GTX-950M
                    nvidia.com/gpu.replicas=1
                    nvidia.com/gpu.sharing-strategy=none
                    nvidia.com/mig.capable=false
                    nvidia.com/mig.strategy=single
                    nvidia.com/mps.capable=false
Annotations:        nfd.node.kubernetes.io/feature-labels:
                      cpu-cpuid.ADX,cpu-cpuid.AESNI,cpu-cpuid.AVX,cpu-cpuid.AVX2,cpu-cpuid.CMPXCHG8,cpu-cpuid.FLUSH_L1D,cpu-cpuid.FMA3,cpu-cpuid.FXSR,cpu-cpuid....
                    node.alpha.kubernetes.io/ttl: 0
                    nvidia.com/gpu-driver-upgrade-enabled: true
                    volumes.kubernetes.io/controller-managed-attach-detach: true

Verification

root@test-ThinkPad-L14-Gen-2:~# kubectl logs -f -n gpu-operator  nvidia-operator-validator-lh5n2 -c toolkit-validation
time="2024-06-26T16:00:53Z" level=info msg="version: 0fe1e8db, commit: 0fe1e8d"
Wed Jun 26 16:00:53 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce MX450           On  |   00000000:01:00.0 Off |                  N/A |
| N/A   47C    P8             N/A /   10W |       0MiB /   2048MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
root@test-ThinkPad-L14-Gen-2:~# kubectl logs -f -n gpu-operator  nvidia-operator-validator-lh5n2 -c  cuda-validation
time="2024-06-26T16:00:55Z" level=info msg="version: 0fe1e8db, commit: 0fe1e8d"
time="2024-06-26T16:00:55Z" level=info msg="pod nvidia-cuda-validator-zpxn7 is curently in Pending phase"
time="2024-06-26T16:01:00Z" level=info msg="pod nvidia-cuda-validator-zpxn7 is curently in Running phase"
time="2024-06-26T16:01:05Z" level=info msg="pod nvidia-cuda-validator-zpxn7 have run successfully"
root@test-ThinkPad-L14-Gen-2:~# kubectl logs -f -n gpu-operator  nvidia-operator-validator-lh5n2 -c  plugin-validation
time="2024-06-26T16:01:07Z" level=info msg="version: 0fe1e8db, commit: 0fe1e8d"
root@test-ThinkPad-L14-Gen-2:~# kubectl logs -f -n gpu-operator  nvidia-operator-validator-lh5n2
Defaulted container "nvidia-operator-validator" out of: nvidia-operator-validator, driver-validation (init), toolkit-validation (init), cuda-validation (init), plugin-validation (init)
all validations are successful
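The final line of the validator log is the one that matters. A minimal sketch for checking it in a script, assuming the success message is exactly the text shown above; the helper name validations_passed and the pod name placeholder are hypothetical.

```shell
# Hypothetical helper: succeed (exit 0) if the validator log on stdin
# contains the success message shown above, fail otherwise.
validations_passed() {
  grep -q 'all validations are successful'
}

# Example (needs cluster access; <validator-pod> is a placeholder):
# kubectl logs -n gpu-operator <validator-pod> | validations_passed && echo OK
```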

Running the same two test pods as above:

 kubectl logs  gpu-pod
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
root@test-ThinkPad-L14-Gen-2:~# kubectl logs  gpu-demo-vectoradd
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
GPU 0: NVIDIA GeForce MX450 (UUID: GPU-093c5145-c074-80d1-86ed-6433c461682d)
Wed Jun 26 15:49:07 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce MX450           On  |   00000000:01:00.0 Off |                  N/A |
| N/A   44C    P0             N/A /   10W |       0MiB /   2048MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

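Problems encountered

The gcc version in the driver container does not match the one used to build the kernel, so the driver installation fails: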

warning: the compiler differs from the one used to build the kernel
  The kernel was built by: x86_64-linux-gnu-gcc-12 (Ubuntu 12.3.0-1ubuntu1~22.04) 12.3.0
  You are using:           cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
cc: error: unrecognized command-line option '-ftrivial-auto-var-init=zero'
make[3]: *** [scripts/Makefile.build:251: /usr/src/nvidia-550.54.15/kernel/nvidia/nv.o] Error 1
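
The warning above already names both compilers: the kernel was built with gcc 12, while the container's cc is gcc 11, which does not recognize -ftrivial-auto-var-init=zero (that option is only accepted from GCC 12 onward). As a minimal sketch, the mismatch can be detected from a build log like this; the function name compiler_majors is hypothetical and the regular expressions only assume the two-line warning format shown above.

```python
import re


def compiler_majors(build_log):
    """Return (kernel_gcc_major, current_cc_major) from the warning, or None."""
    # Both lines end with "(<distro build>) <major>.<minor>.<patch>".
    built = re.search(r"The kernel was built by:.*?\)\s+(\d+)\.\d+\.\d+", build_log)
    using = re.search(r"You are using:.*?\)\s+(\d+)\.\d+\.\d+", build_log)
    if not (built and using):
        return None
    return int(built.group(1)), int(using.group(1))


LOG = """warning: the compiler differs from the one used to build the kernel
  The kernel was built by: x86_64-linux-gnu-gcc-12 (Ubuntu 12.3.0-1ubuntu1~22.04) 12.3.0
  You are using:           cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
"""

kernel_major, cc_major = compiler_majors(LOG)
if kernel_major != cc_major:
    # Tells us which gcc package the driver image is missing.
    print(f"install gcc-{kernel_major} (container has gcc {cc_major})")
```

For the log above this reports that gcc-12 is needed.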

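Solution

Build a custom driver image on top of the official one, with gcc-12 installed and registered as the default compiler: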

FROM nvcr.io/nvidia/driver:550.54.15-ubuntu22.04
RUN  apt-get update && \
    apt-get install -y --no-install-recommends gcc-12 g++-12 && \
    update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-12 12 && \
    update-alternatives --install /usr/bin/g++ g++ /usr/bin/g++-12 12 && \
    rm -rf /var/lib/apt/lists/*

ENTRYPOINT ["nvidia-driver", "init"]

Rebuild the image and import it into containerd's k8s.io namespace:

docker build -t mydriver .
docker save -o mydriver.tar mydriver
ctr -n k8s.io images import mydriver.tar

Then modify the driver daemonset to use the mydriver image.

Prometheus monitoring data

One dashboard that can be used: https://grafana.com/grafana/dashboards/21362-nvidia-dcgm-exporter-dashboard/

Dashboard ID: 21362

https://github.com/NVIDIA/dcgm-exporter

nvidia-dcgm-exporter data

curl nvidia-dcgm-exporter:port/metrics returns all metrics exposed by the current version:

# HELP DCGM_FI_DEV_SM_CLOCK SM clock frequency (in MHz).
# TYPE DCGM_FI_DEV_SM_CLOCK gauge
DCGM_FI_DEV_SM_CLOCK{gpu="0",UUID="GPU-093c5145-c074-80d1-86ed-6433c461682d",device="nvidia0",modelName="NVIDIA GeForce MX450",Hostname="192.168.2.250",DCGM_FI_DRIVER_VERSION="550.54.15"} 300
# HELP DCGM_FI_DEV_MEM_CLOCK Memory clock frequency (in MHz).
# TYPE DCGM_FI_DEV_MEM_CLOCK gauge
DCGM_FI_DEV_MEM_CLOCK{gpu="0",UUID="GPU-093c5145-c074-80d1-86ed-6433c461682d",device="nvidia0",modelName="NVIDIA GeForce MX450",Hostname="192.168.2.250",DCGM_FI_DRIVER_VERSION="550.54.15"} 405
# HELP DCGM_FI_DEV_MEMORY_TEMP Memory temperature (in C).
# TYPE DCGM_FI_DEV_MEMORY_TEMP gauge
DCGM_FI_DEV_MEMORY_TEMP{gpu="0",UUID="GPU-093c5145-c074-80d1-86ed-6433c461682d",device="nvidia0",modelName="NVIDIA GeForce MX450",Hostname="192.168.2.250",DCGM_FI_DRIVER_VERSION="550.54.15"} 0
# HELP DCGM_FI_DEV_GPU_TEMP GPU temperature (in C).
# TYPE DCGM_FI_DEV_GPU_TEMP gauge
DCGM_FI_DEV_GPU_TEMP{gpu="0",UUID="GPU-093c5145-c074-80d1-86ed-6433c461682d",device="nvidia0",modelName="NVIDIA GeForce MX450",Hostname="192.168.2.250",DCGM_FI_DRIVER_VERSION="550.54.15"} 43
# HELP DCGM_FI_DEV_PCIE_REPLAY_COUNTER Total number of PCIe retries.
# TYPE DCGM_FI_DEV_PCIE_REPLAY_COUNTER counter
DCGM_FI_DEV_PCIE_REPLAY_COUNTER{gpu="0",UUID="GPU-093c5145-c074-80d1-86ed-6433c461682d",device="nvidia0",modelName="NVIDIA GeForce MX450",Hostname="192.168.2.250",DCGM_FI_DRIVER_VERSION="550.54.15"} 0
# HELP DCGM_FI_DEV_GPU_UTIL GPU utilization (in %).
# TYPE DCGM_FI_DEV_GPU_UTIL gauge
DCGM_FI_DEV_GPU_UTIL{gpu="0",UUID="GPU-093c5145-c074-80d1-86ed-6433c461682d",device="nvidia0",modelName="NVIDIA GeForce MX450",Hostname="192.168.2.250",DCGM_FI_DRIVER_VERSION="550.54.15"} 0
# HELP DCGM_FI_DEV_MEM_COPY_UTIL Memory utilization (in %).
# TYPE DCGM_FI_DEV_MEM_COPY_UTIL gauge
DCGM_FI_DEV_MEM_COPY_UTIL{gpu="0",UUID="GPU-093c5145-c074-80d1-86ed-6433c461682d",device="nvidia0",modelName="NVIDIA GeForce MX450",Hostname="192.168.2.250",DCGM_FI_DRIVER_VERSION="550.54.15"} 0
# HELP DCGM_FI_DEV_ENC_UTIL Encoder utilization (in %).
# TYPE DCGM_FI_DEV_ENC_UTIL gauge
DCGM_FI_DEV_ENC_UTIL{gpu="0",UUID="GPU-093c5145-c074-80d1-86ed-6433c461682d",device="nvidia0",modelName="NVIDIA GeForce MX450",Hostname="192.168.2.250",DCGM_FI_DRIVER_VERSION="550.54.15"} 0
# HELP DCGM_FI_DEV_DEC_UTIL Decoder utilization (in %).
# TYPE DCGM_FI_DEV_DEC_UTIL gauge
DCGM_FI_DEV_DEC_UTIL{gpu="0",UUID="GPU-093c5145-c074-80d1-86ed-6433c461682d",device="nvidia0",modelName="NVIDIA GeForce MX450",Hostname="192.168.2.250",DCGM_FI_DRIVER_VERSION="550.54.15"} 0
# HELP DCGM_FI_DEV_XID_ERRORS Value of the last XID error encountered.
# TYPE DCGM_FI_DEV_XID_ERRORS gauge
DCGM_FI_DEV_XID_ERRORS{gpu="0",UUID="GPU-093c5145-c074-80d1-86ed-6433c461682d",device="nvidia0",modelName="NVIDIA GeForce MX450",Hostname="192.168.2.250",DCGM_FI_DRIVER_VERSION="550.54.15"} 0
# HELP DCGM_FI_DEV_FB_FREE Framebuffer memory free (in MiB).
# TYPE DCGM_FI_DEV_FB_FREE gauge
DCGM_FI_DEV_FB_FREE{gpu="0",UUID="GPU-093c5145-c074-80d1-86ed-6433c461682d",device="nvidia0",modelName="NVIDIA GeForce MX450",Hostname="192.168.2.250",DCGM_FI_DRIVER_VERSION="550.54.15"} 1863
# HELP DCGM_FI_DEV_FB_USED Framebuffer memory used (in MiB).
# TYPE DCGM_FI_DEV_FB_USED gauge
DCGM_FI_DEV_FB_USED{gpu="0",UUID="GPU-093c5145-c074-80d1-86ed-6433c461682d",device="nvidia0",modelName="NVIDIA GeForce MX450",Hostname="192.168.2.250",DCGM_FI_DRIVER_VERSION="550.54.15"} 0
# HELP DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL Total number of NVLink bandwidth counters for all lanes.
# TYPE DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL counter
DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL{gpu="0",UUID="GPU-093c5145-c074-80d1-86ed-6433c461682d",device="nvidia0",modelName="NVIDIA GeForce MX450",Hostname="192.168.2.250",DCGM_FI_DRIVER_VERSION="550.54.15"} 0
# HELP DCGM_FI_DEV_VGPU_LICENSE_STATUS vGPU License status
# TYPE DCGM_FI_DEV_VGPU_LICENSE_STATUS gauge
DCGM_FI_DEV_VGPU_LICENSE_STATUS{gpu="0",UUID="GPU-093c5145-c074-80d1-86ed-6433c461682d",device="nvidia0",modelName="NVIDIA GeForce MX450",Hostname="192.168.2.250",DCGM_FI_DRIVER_VERSION="550.54.15"} 0
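
Each sample line in the output above has the form name{labels} value, while the # HELP and # TYPE lines carry metadata. The following is a minimal sketch of splitting one such line into its parts. The helper name parse_sample is hypothetical, and the regular expressions only cover what appears in this output (no trailing timestamp, no exotic escaping); it is not a full parser for the Prometheus text format.

```python
import re

# name, optional {label="value",...} block, then a single value token
SAMPLE_RE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_sample(line):
    """Return (name, labels, value) for a sample line, or None otherwise."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None  # blank line or HELP/TYPE metadata
    match = SAMPLE_RE.match(line)
    if match is None:
        return None
    labels = dict(LABEL_RE.findall(match.group("labels") or ""))
    return match.group("name"), labels, float(match.group("value"))


LINE = (
    'DCGM_FI_DEV_GPU_TEMP{gpu="0",UUID="GPU-093c5145-c074-80d1-86ed-6433c461682d",'
    'device="nvidia0",modelName="NVIDIA GeForce MX450",Hostname="192.168.2.250",'
    'DCGM_FI_DRIVER_VERSION="550.54.15"} 43'
)

name, labels, value = parse_sample(LINE)
print(name, labels["modelName"], value)
```

Applied to every line of the curl output, this gives one record per GPU and metric, which is handy for quick checks before Prometheus is set up.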

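Metric explanations

nvidia-dcgm-exporter exposes many key GPU performance metrics that help monitor and diagnose GPU health and performance. Each metric in the output above is explained below: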

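DCGM_FI_DEV_SM_CLOCK
- Description: SM (Streaming Multiprocessor) clock frequency, in MHz.
- Type: gauge (the value measured at a single point in time)
- Example: 300 means the SM clock is running at 300 MHz.
DCGM_FI_DEV_MEM_CLOCK
- Description: Memory clock frequency, in MHz.
- Type: gauge
- Example: 405 means the memory clock is running at 405 MHz.
DCGM_FI_DEV_MEMORY_TEMP
- Description: GPU memory temperature, in degrees Celsius.
- Type: gauge
- Example: 0 is the reported memory temperature. Next to a GPU temperature of 43°C this is implausible as a real reading, so it probably means this GPU does not expose a memory temperature sensor.
DCGM_FI_DEV_GPU_TEMP
- Description: GPU temperature, in degrees Celsius.
- Type: gauge
- Example: 43 means the GPU is currently at 43°C.
DCGM_FI_DEV_PCIE_REPLAY_COUNTER
- Description: Total number of PCIe retries.
- Type: counter (a cumulative value that only increases over time)
- Example: 0 means no PCIe retries have occurred.
DCGM_FI_DEV_GPU_UTIL
- Description: GPU utilization, in percent.
- Type: gauge
- Example: 0 means the GPU is currently 0% utilized.
DCGM_FI_DEV_MEM_COPY_UTIL
- Description: Memory copy utilization, in percent.
- Type: gauge
- Example: 0 means memory copy utilization is currently 0%.
DCGM_FI_DEV_ENC_UTIL
- Description: Encoder utilization, in percent.
- Type: gauge
- Example: 0 means the encoder is currently 0% utilized.
DCGM_FI_DEV_DEC_UTIL
- Description: Decoder utilization, in percent.
- Type: gauge
- Example: 0 means the decoder is currently 0% utilized.
DCGM_FI_DEV_XID_ERRORS
- Description: Value of the last XID error encountered.
- Type: gauge
- Example: 0 means no XID error has been recorded.
DCGM_FI_DEV_FB_FREE
- Description: Free framebuffer (GPU) memory, in MiB.
- Type: gauge
- Example: 1863 means 1863 MiB of GPU memory is currently free.
DCGM_FI_DEV_FB_USED
- Description: Used framebuffer (GPU) memory, in MiB.
- Type: gauge
- Example: 0 means 0 MiB of GPU memory is currently in use.
DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL
- Description: Total NVLink bandwidth counter, summed over all lanes.
- Type: counter
- Example: 0 means no NVLink traffic has been counted, which is expected on a GPU without NVLink such as the MX450.
DCGM_FI_DEV_VGPU_LICENSE_STATUS
- Description: vGPU license status.
- Type: gauge
- Example: 0 means the vGPU license status is 0 (unlicensed).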

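Together these metrics describe GPU performance and health in enough detail to monitor and analyze GPU state in real time. Once Prometheus scrapes them, they can be charted in Grafana or another visualization tool.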
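
The gauge and counter types above are read differently: a gauge is used as-is or combined with other gauges, while a counter is only meaningful as a difference between two scrapes. A minimal sketch of both, with hypothetical helper names; the reset handling mirrors the general idea behind Prometheus' rate() and increase(), which treat a decreasing counter as a restart.

```python
def fb_used_percent(fb_used_mib, fb_free_mib):
    """Share of reported framebuffer memory in use, from the two FB gauges."""
    total = fb_used_mib + fb_free_mib
    if total == 0:
        return 0.0  # nothing reported; avoid dividing by zero
    return 100.0 * fb_used_mib / total


def counter_increase(previous, current):
    """Increase of a counter between two scrapes, tolerating a reset."""
    if current < previous:
        # The counter went backwards, so assume it restarted from zero.
        return current
    return current - previous


# Values taken from the scrape shown earlier in this section.
print(fb_used_percent(fb_used_mib=0, fb_free_mib=1863))
print(counter_increase(previous=0, current=0))
```

Note that used + free is 1863 MiB here, less than the 2048 MiB total that nvidia-smi shows, so the percentage is relative to the memory the exporter reports rather than the card's nominal size.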

Test setup

prometheus.yml

global:
  scrape_interval: 15s
  external_labels:
    monitor: 'codelab-monitor'
# Scrape target configuration
scrape_configs:
  - job_name: 'prometheus'
    scrape_interval: 5s  # overrides the global scrape interval (15s) with 5s for this job
    static_configs:
      - targets: ['localhost:9090']
  - job_name: 'nvidia-metrics'
    static_configs:
      - targets: ['192.168.2.31:30975']

docker

docker run -p 3000:3000 --name grafana \
-v /opt/prometheus/grafana/data:/var/lib/grafana \
-e "GF_SECURITY_ADMIN_PASSWORD=grafana123" \
-itd grafana/grafana



docker run -d \
  --name=prometheus \
  -v /opt/prometheus/config/prometheus.yml:/etc/prometheus/prometheus.yml \
  -v /opt/prometheus/data:/prometheus \
  -p 9090:9090 \
  prom/prometheus
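
Once both containers are running, a quick way to confirm that the nvidia-metrics job is being scraped is to ask Prometheus for one of the DCGM metrics through its HTTP API (/api/v1/query). The sketch below assumes Prometheus is reachable on the port published above; the function names are hypothetical, and keying the result by the UUID label is just one convenient choice.

```python
import json
import urllib.parse
import urllib.request


def build_query_url(base_url, promql):
    """Build the instant-query URL for a PromQL expression."""
    query = urllib.parse.urlencode({"query": promql})
    return f"{base_url.rstrip('/')}/api/v1/query?{query}"


def instant_values(payload):
    """Map each GPU UUID label to its current value from a query response."""
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    # Each result looks like {"metric": {...labels...}, "value": [timestamp, "43"]}.
    return {
        item["metric"].get("UUID", ""): float(item["value"][1])
        for item in payload["data"]["result"]
    }


def query(base_url, promql, timeout=5.0):
    """Run an instant query against a Prometheus server."""
    with urllib.request.urlopen(build_query_url(base_url, promql), timeout=timeout) as resp:
        return instant_values(json.load(resp))
```

For example, query("http://localhost:9090", "DCGM_FI_DEV_GPU_TEMP") should return one entry per GPU once the first scrape has completed; an empty dict usually means the target in prometheus.yml is not reachable.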

In Grafana, import the dashboard template listed above (ID 21362):

[Screenshot: nvidia-metrics]

Switching to a different dashboard

Project: https://github.com/utkuozdemir/nvidia_gpu_exporter

Dashboard: https://grafana.com/grafana/dashboards/14574-nvidia-gpu-metrics/

Dashboard ID: 14574 (Chinese version: 20622)

$ docker run -d \
--name nvidia_smi_exporter \
--restart unless-stopped \
--device /dev/nvidiactl:/dev/nvidiactl \
--device /dev/nvidia0:/dev/nvidia0 \
-v /usr/lib/x86_64-linux-gnu/libnvidia-ml.so:/usr/lib/x86_64-linux-gnu/libnvidia-ml.so \
-v /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1:/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1 \
-v /usr/bin/nvidia-smi:/usr/bin/nvidia-smi \
-p 9835:9835 \
utkuozdemir/nvidia_gpu_exporter:1.1.0

Result

[Screenshot: nvidia-metrics2]