
Deploying the GPU Operator

1. Pre-pull a Domestic Mirror Image (Fixes the NFD Image Pull Problem)

The node-feature-discovery (NFD) image is hosted on a registry outside mainland China by default, so pull it from a domestic mirror first and retag it to the upstream name:

shell
export CONTAINERD_ADDRESS=/run/k3s/containerd/containerd.sock
ctr -n k8s.io images pull registry.cn-hangzhou.aliyuncs.com/smallqi/node-feature-discovery:v0.17.2
ctr -n k8s.io images tag registry.cn-hangzhou.aliyuncs.com/smallqi/node-feature-discovery:v0.17.2 registry.k8s.io/nfd/node-feature-discovery:v0.17.2
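The retag only works if the mirror reference ends in exactly the image name and tag the chart expects (you can confirm both names exist afterwards with `ctr -n k8s.io images ls | grep node-feature-discovery`). A minimal self-contained sanity check of the name mapping:

```shell
# Sketch: verify the mirror reference and the upstream reference end in the
# same "name:tag", so the retagged image satisfies the expected image name.
MIRROR=registry.cn-hangzhou.aliyuncs.com/smallqi/node-feature-discovery:v0.17.2
UPSTREAM=registry.k8s.io/nfd/node-feature-discovery:v0.17.2
# ${VAR##*/} strips everything up to the last '/', leaving "name:tag".
if [ "${MIRROR##*/}" = "${UPSTREAM##*/}" ]; then
  echo "tags match: ${UPSTREAM##*/}"
fi
```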

2. Install Helm

shell
curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/master/scripts/get-helm-3 \
&& chmod 700 get_helm.sh \
&& ./get_helm.sh
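If raw.githubusercontent.com is also unreachable from your network, an alternative sketch is to fetch a pinned release tarball directly from get.helm.sh (the version pinned below is an assumption; substitute whichever release you have validated):

```shell
# Sketch: build the download URL for a pinned Helm release.
HELM_VERSION=v3.16.2   # assumed version; adjust as needed
HELM_URL="https://get.helm.sh/helm-${HELM_VERSION}-linux-amd64.tar.gz"
echo "$HELM_URL"
# Then, on a machine with network access:
#   curl -fsSL "$HELM_URL" | tar -xz
#   install linux-amd64/helm /usr/local/bin/helm
```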

3. Configure the GPU Operator Installation Values

Create gpu-values.yaml and point every component at a domestic mirror repository (to avoid failed pulls from registries outside mainland China):

shell
cat > gpu-values.yaml << EOF
toolkit:
  env:
  - name: CONTAINERD_SOCKET
    value: /run/k3s/containerd/containerd.sock
  - name: CONTAINERD_RUNTIME_CLASS
    value: nvidia
  - name: CONTAINERD_SET_AS_DEFAULT
    value: "true"
  version: v1.17.1-ubuntu20.04
  repository: registry.cn-hangzhou.aliyuncs.com/smallqi
validator:
  repository: registry.cn-hangzhou.aliyuncs.com/smallqi
operator:
  repository: registry.cn-hangzhou.aliyuncs.com/smallqi
  initContainer:
    repository: registry.cn-hangzhou.aliyuncs.com/smallqi
driver:
  repository: registry.cn-hangzhou.aliyuncs.com/smallqi
  manager:
    repository: registry.cn-hangzhou.aliyuncs.com/smallqi
devicePlugin:
  repository: registry.cn-hangzhou.aliyuncs.com/smallqi
dcgmExporter:
  repository: registry.cn-hangzhou.aliyuncs.com/smallqi
gfd:
  repository: registry.cn-hangzhou.aliyuncs.com/smallqi
migManager:
  repository: registry.cn-hangzhou.aliyuncs.com/smallqi
vgpuDeviceManager:
  repository: registry.cn-hangzhou.aliyuncs.com/smallqi
vfioManager:
  repository: registry.cn-hangzhou.aliyuncs.com/smallqi
  driverManager:
    repository: registry.cn-hangzhou.aliyuncs.com/smallqi
EOF

4. Install the GPU Operator

shell
# Add the NVIDIA Helm repository and update the index
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia && helm repo update
# Install the GPU Operator (pinned version and values file; no further parameters needed)
helm install gpu-operator -n gpu-operator --create-namespace \
  nvidia/gpu-operator --version=v25.3.0 -f gpu-values.yaml

Tip: wait about 3-5 minutes, then confirm that every Pod is in the Running state with kubectl get pod -n gpu-operator.
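The wait in the tip above can be scripted. The filter below is shown against sample `kubectl get pod -n gpu-operator --no-headers` output so the snippet is self-contained; in practice, replace the sample with the live command and loop until the count reaches zero:

```shell
# Sketch: count pods whose STATUS column (field 3) is neither Running nor Completed.
# The pod names below are illustrative sample output, not real cluster state.
sample='gpu-operator-5f8b6c7d9-abcde    1/1  Running    0  2m
nvidia-driver-daemonset-xk2pq   0/1  Init:0/1   0  1m
nvidia-cuda-validator-7hjw4     0/1  Completed  0  1m'
not_ready=$(printf '%s\n' "$sample" | awk '$3 != "Running" && $3 != "Completed"' | wc -l)
echo "$not_ready pod(s) still starting"
```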

5. Configure the RKE2 Default Container Runtime

Configure RKE2 to use nvidia as its default container runtime:

shell
cat > /etc/rancher/rke2/config.yaml << EOF
disable:
- rke2-ingress-nginx  # disable the default Ingress; it conflicts with the Rainbond gateway
system-default-registry: registry.cn-hangzhou.aliyuncs.com  # domestic image registry
default-runtime: nvidia  # make nvidia the default container runtime
EOF
shell
# Restart RKE2 so the configuration takes effect
$ systemctl restart rke2-server.service
# Wait about 5 minutes for all system Pods to finish restarting
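After the restart, RKE2 regenerates its containerd configuration, and the default runtime can be verified there. On the node the check would be `grep default_runtime_name /var/lib/rancher/rke2/agent/etc/containerd/config.toml`; below it is sketched against a sample fragment of that file so the snippet stands alone:

```shell
# Sketch: extract the default runtime name from a sample containerd config fragment.
sample='[plugins."io.containerd.grpc.v1.cri".containerd]
  default_runtime_name = "nvidia"'
printf '%s\n' "$sample" | grep -o 'default_runtime_name = "[a-z]*"'
```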

6. Verify GPU Scheduling

Create a test Pod to verify that Kubernetes recognizes and can use the GPU:

shell
# Generate the test YAML (runs a CUDA sample program)
cat > cuda-sample.yaml << EOF
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vectoradd
spec:
  restartPolicy: OnFailure
  containers:
  - name: cuda-vectoradd
    image: registry.cn-hangzhou.aliyuncs.com/zqqq/cuda-sample:vectoradd-cuda11.7.1-ubuntu20.04
    resources:
      limits:
        nvidia.com/gpu: 1  # request 1 GPU
EOF
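Before applying the Pod, it is worth confirming that the node actually advertises nvidia.com/gpu among its allocatable resources (the live command is `kubectl describe node | grep nvidia.com/gpu`). The extraction is sketched here against sample output so it runs anywhere:

```shell
# Sketch: pull the GPU count out of a sample "Allocatable" section
# from `kubectl describe node` (sample values are illustrative).
sample='Allocatable:
  cpu:             16
  memory:          32Gi
  nvidia.com/gpu:  1'
gpus=$(printf '%s\n' "$sample" | awk '/nvidia.com\/gpu/ {print $2}')
echo "allocatable GPUs: $gpus"
```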
shell
# Deploy the test Pod
$ kubectl apply -f cuda-sample.yaml

Check the logs (success criteria):

shell
$ kubectl logs -f cuda-vectoradd
# Output containing the following indicates GPU scheduling is working:
[Vector addition of 50000 elements]
...
Test PASSED  # the CUDA program ran successfully
Done
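The log check can be automated by grepping for the marker line (live: `kubectl logs cuda-vectoradd | grep -q 'Test PASSED'`); shown here against a captured copy of the expected output above:

```shell
# Sketch: detect the success marker in a captured log.
log='[Vector addition of 50000 elements]
Test PASSED
Done'
if printf '%s\n' "$log" | grep -q 'Test PASSED'; then
  echo "GPU scheduling verified"
fi
```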

At this point, GPU resources are available to the RKE2 cluster, and workloads can be scheduled onto GPUs from within Kubernetes.

This article is based on my own notes and material adapted from the web; if anything is wrong, corrections are welcome! Contact me