
A Practical Guide to K8s GPU Virtualization with HAMi

Reference: https://project-hami.io/zh/docs/get-started/nginx-example


1. Install HAMi with Helm

  • Label each GPU node (nodes without the label are not managed by HAMi):

```shell
kubectl label nodes {nodeid} gpu=on
```

  • Check the Kubernetes version (used to pick a matching scheduler image tag):

```shell
kubectl version
```

  • Add the HAMi Helm repository and update the index:

```shell
helm repo add hami-charts https://project-hami.github.io/HAMi/
helm repo update
```

  • Install HAMi (replace vX.Y.Z with your cluster's server version, e.g. v1.16.8):

```shell
helm install hami hami-charts/hami --set scheduler.kubeScheduler.imageTag=v1.16.8 -n kube-system
```

  • Verify the installation:

```shell
kubectl -n kube-system get pods | grep -E "vgpu-(device-plugin|scheduler)"
```

When everything is healthy, both vgpu-device-plugin and vgpu-scheduler should be in the Running state.
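
With the components running, scheduling can be smoke-tested with a small pod that requests vGPU resources, along the lines of the nginx example referenced above. This is a minimal sketch: the pod name, image, and memory figure are illustrative; the resource names nvidia.com/gpu and nvidia.com/gpumem match the scheduler configuration shown later in this guide.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  containers:
    - name: cuda-container
      image: nvidia/cuda:12.4.0-base-ubuntu22.04
      command: ["bash", "-c", "nvidia-smi && sleep 86400"]
      resources:
        limits:
          nvidia.com/gpu: 1        # one vGPU slice
          nvidia.com/gpumem: 3000  # 3000 MiB of device memory for this slice
```

If the pod reaches Running and nvidia-smi inside it reports roughly 3000 MiB of memory, the vGPU path is working.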


2. Deploy HAMi-WebUI with Helm

The WebUI is only reachable from localhost, so make sure your local ~/.kube/config points at the target cluster.

  • Prerequisites

    • kubectl (local)
    • HAMi >= 2.4.0
    • Prometheus > 2.8.0 (must be reachable from inside the cluster)
    • Helm > 3.0
  • Add the HAMi-WebUI Helm repository and update the index:

```shell
helm repo add hami-webui https://project-hami.github.io/HAMi-WebUI
helm repo update
```

  • Install HAMi-WebUI (replace the Prometheus address with one reachable inside the cluster):

```shell
helm install my-hami-webui hami-webui/hami-webui \
  --set externalPrometheus.enabled=true \
  --set externalPrometheus.address="http://prometheus-kube-prometheus-prometheus.monitoring.svc.cluster.local:9090" \
  -n kube-system
```

  • Verify the installation:

```shell
kubectl get pods -n kube-system | grep webui
```

On success, both hami-webui and hami-webui-dcgm-exporter should be in the Running state.

  • Access
    • The WebUI is local-only; use kubectl port-forward to expose its Service on your machine.
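
For example, assuming the release name my-hami-webui used above (the exact Service name and port can differ by chart version, so list the Services first):

```shell
# Find the WebUI Service created by the chart
kubectl get svc -n kube-system | grep webui

# Forward it to localhost; substitute the Service name and port printed above
kubectl port-forward svc/my-hami-webui-hami-webui 3000:3000 -n kube-system
```

Then open http://localhost:3000 in a browser.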

3. Switch HAMi to MIG Mode

Reference: https://project-hami.io/zh/docs/userguide/NVIDIA-device/dynamic-mig-support

3.1 List the UUIDs of the current GPUs

```shell
root@node1:~/k8s/gptsovits# nvidia-smi -L
GPU 0: NVIDIA H200 (UUID: GPU-95b9c498-4e81-d7d9-c503-dd11abfbec71)
GPU 1: NVIDIA H200 (UUID: GPU-38d8e445-dc62-6a21-9007-4c6adf00512a)
GPU 2: NVIDIA H200 (UUID: GPU-5b565fde-ea9f-1c75-cbd9-591a74f47bcd)
GPU 3: NVIDIA H200 (UUID: GPU-f4c19b2a-d93a-1e7c-b150-b899ec7f84bb)
GPU 4: NVIDIA H200 (UUID: GPU-523862b8-d665-ff53-df84-4312007eb3ef)
GPU 5: NVIDIA H200 (UUID: GPU-28b6a508-8ebb-65be-1d28-71f96b4a287c)
GPU 6: NVIDIA H200 (UUID: GPU-378f5120-f2b3-5e3e-6213-45788cb33771)
GPU 7: NVIDIA H200 (UUID: GPU-a7a56bd5-24a7-5642-e437-e0fb947a6576)
```

3.2 Edit the hami-device-plugin ConfigMap (enable MIG and filter devices)

Apply the following to the hami-device-plugin ConfigMap in the kube-system namespace.

Note that the value of config.json must be valid JSON, which does not allow inline `#` comments: here `name` is the node name (it must match the actual node), `operatingmode: "mig"` is the key switch that enables MIG mode, and `filterdevices.uuid` lists the UUIDs of the GPUs HAMi should manage. Server-managed metadata (creationTimestamp, managedFields, resourceVersion, uid) should not be included when re-applying the ConfigMap.

```yaml
apiVersion: v1
data:
  config.json: |
    {
      "nodeconfig": [
        {
          "name": "node1",
          "operatingmode": "mig",
          "migstrategy": "none",
          "filterdevices": {
            "uuid": [
              "GPU-5b565fde-ea9f-1c75-cbd9-591a74f47bcd",
              "GPU-f4c19b2a-d93a-1e7c-b150-b899ec7f84bb",
              "GPU-523862b8-d665-ff53-df84-4312007eb3ef",
              "GPU-28b6a508-8ebb-65be-1d28-71f96b4a287c",
              "GPU-378f5120-f2b3-5e3e-6213-45788cb33771",
              "GPU-a7a56bd5-24a7-5642-e437-e0fb947a6576"
            ],
            "index": []
          }
        }
      ]
    }
kind: ConfigMap
metadata:
  annotations:
    meta.helm.sh/release-name: hami
    meta.helm.sh/release-namespace: kube-system
  labels:
    app.kubernetes.io/component: hami-device-plugin
    app.kubernetes.io/instance: hami
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: hami
    app.kubernetes.io/version: 2.6.1
    helm.sh/chart: hami-2.6.1
  name: hami-device-plugin
  namespace: kube-system
```
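
One way to apply the edit (the local filename is just an example):

```shell
# Save the manifest above locally and apply it
kubectl apply -n kube-system -f hami-device-plugin-configmap.yaml

# Or edit the live ConfigMap directly
kubectl edit configmap hami-device-plugin -n kube-system
```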

3.3 Edit the hami-scheduler-device ConfigMap

Likewise edit hami-scheduler-device in kube-system. The knownMigGeometries entry under nvidia: declares the MIG layouts the scheduler may use on the H200 nodes; as above, server-managed metadata should not be included when re-applying.

```yaml
apiVersion: v1
data:
  device-config.yaml: |-
    nvidia:
      resourceCountName: nvidia.com/gpu
      resourceMemoryName: nvidia.com/gpumem
      resourceMemoryPercentageName: nvidia.com/gpumem-percentage
      resourceCoreName: nvidia.com/gpucores
      resourcePriorityName: nvidia.com/priority
      overwriteEnv: false
      defaultMemory: 0
      defaultCores: 0
      defaultGPUNum: 1
      deviceSplitCount: 10
      deviceMemoryScaling: 0.9
      deviceCoreScaling: 1
      gpuCorePolicy: default
      libCudaLogLevel: 1
      runtimeClassName: ""
      knownMigGeometries:
      - models: [ "NVIDIA H200" ]
        allowedGeometries:
          -
            - name: 1g.20gb
              memory: 20480
              count: 7
          -
            - name: 1g.20gb
              memory: 20480
              count: 7
            - name: 7g.70gb
              memory: 71680
              count: 2

    cambricon:
      resourceCountName: cambricon.com/vmlu
      resourceMemoryName: cambricon.com/mlu.smlu.vmemory
      resourceCoreName: cambricon.com/mlu.smlu.vcore
    hygon:
      resourceCountName: hygon.com/dcunum
      resourceMemoryName: hygon.com/dcumem
      resourceCoreName: hygon.com/dcucores
    metax:
      resourceCountName: "metax-tech.com/gpu"
      resourceVCountName: metax-tech.com/sgpu
      resourceVMemoryName: metax-tech.com/vmemory
      resourceVCoreName: metax-tech.com/vcore
    enflame:
      resourceCountName: "enflame.com/vgcu"
      resourcePercentageName: "enflame.com/vgcu-percentage"
    mthreads:
      resourceCountName: "mthreads.com/vgpu"
      resourceMemoryName: "mthreads.com/sgpu-memory"
      resourceCoreName: "mthreads.com/sgpu-core"
    iluvatar:
      resourceCountName: iluvatar.ai/vgpu
      resourceMemoryName: iluvatar.ai/vcuda-memory
      resourceCoreName: iluvatar.ai/vcuda-core
    vnpus:
    - chipName: 910B
      commonWord: Ascend910A
      resourceName: huawei.com/Ascend910A
      resourceMemoryName: huawei.com/Ascend910A-memory
      memoryAllocatable: 32768
      memoryCapacity: 32768
      aiCore: 30
      templates:
        - name: vir02
          memory: 2184
          aiCore: 2
        - name: vir04
          memory: 4369
          aiCore: 4
        - name: vir08
          memory: 8738
          aiCore: 8
        - name: vir16
          memory: 17476
          aiCore: 16
    - chipName: 910B2
      commonWord: Ascend910B2
      resourceName: huawei.com/Ascend910B2
      resourceMemoryName: huawei.com/Ascend910B2-memory
      memoryAllocatable: 65536
      memoryCapacity: 65536
      aiCore: 24
      aiCPU: 6
      templates:
        - name: vir03_1c_8g
          memory: 8192
          aiCore: 3
          aiCPU: 1
        - name: vir06_1c_16g
          memory: 16384
          aiCore: 6
          aiCPU: 1
        - name: vir12_3c_32g
          memory: 32768
          aiCore: 12
          aiCPU: 3
    - chipName: 910B3
      commonWord: Ascend910B
      resourceName: huawei.com/Ascend910B
      resourceMemoryName: huawei.com/Ascend910B-memory
      memoryAllocatable: 65536
      memoryCapacity: 65536
      aiCore: 20
      aiCPU: 7
      templates:
        - name: vir05_1c_16g
          memory: 16384
          aiCore: 5
          aiCPU: 1
        - name: vir10_3c_32g
          memory: 32768
          aiCore: 10
          aiCPU: 3
    - chipName: 910B4
      commonWord: Ascend910B4
      resourceName: huawei.com/Ascend910B4
      resourceMemoryName: huawei.com/Ascend910B4-memory
      memoryAllocatable: 32768
      memoryCapacity: 32768
      aiCore: 20
      aiCPU: 7
      templates:
        - name: vir05_1c_8g
          memory: 8192
          aiCore: 5
          aiCPU: 1
        - name: vir10_3c_16g
          memory: 16384
          aiCore: 10
          aiCPU: 3
    - chipName: 310P3
      commonWord: Ascend310P
      resourceName: huawei.com/Ascend310P
      resourceMemoryName: huawei.com/Ascend310P-memory
      memoryAllocatable: 21527
      memoryCapacity: 24576
      aiCore: 8
      aiCPU: 7
      templates:
        - name: vir01
          memory: 3072
          aiCore: 1
          aiCPU: 1
        - name: vir02
          memory: 6144
          aiCore: 2
          aiCPU: 2
        - name: vir04
          memory: 12288
          aiCore: 4
          aiCPU: 4
kind: ConfigMap
metadata:
  annotations:
    meta.helm.sh/release-name: hami
    meta.helm.sh/release-namespace: kube-system
  labels:
    app.kubernetes.io/component: hami-scheduler
    app.kubernetes.io/instance: hami
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: hami
    app.kubernetes.io/version: 2.6.1
    helm.sh/chart: hami-2.6.1
  name: hami-scheduler-device
  namespace: kube-system
```
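
After the switch, a pod can opt into MIG-backed allocation. The sketch below follows the dynamic-mig-support document referenced above: the nvidia.com/vgpu-mode annotation asks the scheduler for a MIG-backed slice, and the memory request is sized to fit one of the 1g.20gb geometries declared above. The pod name and image are illustrative; verify the annotation against your HAMi version.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: mig-test
  annotations:
    nvidia.com/vgpu-mode: "mig"   # request a MIG-backed slice
spec:
  containers:
    - name: cuda-container
      image: nvidia/cuda:12.4.0-base-ubuntu22.04
      command: ["bash", "-c", "sleep 86400"]
      resources:
        limits:
          nvidia.com/gpu: 1
          nvidia.com/gpumem: 20000   # fits a 1g.20gb instance (20480 MiB)
```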

3.4 Restart the HAMi components

```shell
# Restart the device plugin
kubectl rollout restart daemonset/hami-device-plugin -n kube-system

# Restart the scheduler
kubectl rollout restart deploy/hami-scheduler -n kube-system
```
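
To block until the restarted components are actually Ready, rather than just restarted:

```shell
kubectl rollout status daemonset/hami-device-plugin -n kube-system
kubectl rollout status deploy/hami-scheduler -n kube-system
```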

3.5 Verify component status

```shell
kubectl get pods -n kube-system | grep "hami-device-plugin\|hami-scheduler"
```
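
Beyond pod status, the node's advertised resources show whether the new configuration took effect (node1 is the example node from this cluster):

```shell
# The allocatable nvidia.com/gpu count should reflect the filtered devices
kubectl get node node1 -o jsonpath='{.status.allocatable.nvidia\.com/gpu}'

# Or inspect the full resource listing
kubectl describe node node1 | grep -A 10 "Allocatable:"
```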

This article is a mix of my own notes and material reposted from around the web; if anything here is wrong, corrections are very welcome!