K8s GPU Virtualization (HAMi) Practice Guide

Reference:
https://project-hami.io/zh/docs/get-started/nginx-example
1. Install HAMi with Helm

- Label the GPU nodes (nodes without this label are not managed by HAMi):

```shell
kubectl label nodes {nodeid} gpu=on
```

- Check the Kubernetes version (used to pick a matching scheduler image tag):

```shell
kubectl version
```

- Add the HAMi Helm repository and update the index:

```shell
helm repo add hami-charts https://project-hami.github.io/HAMi/
helm repo update
```

- Install HAMi (replace vX.Y.Z with your cluster's server version, e.g. v1.16.8):

```shell
helm install hami hami-charts/hami --set scheduler.kubeScheduler.imageTag=v1.16.8 -n kube-system
```

- Verify the installation:

```shell
kubectl -n kube-system get pods | grep -E "vgpu-(device-plugin|scheduler)"
```

When everything is healthy, both vgpu-device-plugin and vgpu-scheduler should be in the Running state.
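To smoke-test the vGPU scheduling path end to end, a minimal pod along the lines of the nginx example in the reference above can be submitted. This is a sketch: the image, pod name, and the gpumem/gpucores values are illustrative and should be sized for your workload.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test          # illustrative name
spec:
  containers:
    - name: ubuntu-container
      image: ubuntu:22.04
      command: ["bash", "-c", "sleep 86400"]
      resources:
        limits:
          nvidia.com/gpu: 1        # number of vGPUs
          nvidia.com/gpumem: 3000  # device memory in MiB (illustrative)
          nvidia.com/gpucores: 30  # % of SM cores (illustrative)
```

If the pod is scheduled and running, HAMi's device plugin and scheduler are wired up correctly.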
2. Deploy HAMi-WebUI with Helm

The WebUI is only reachable from the local host, so make sure your local ~/.kube/config points at the target cluster.

Prerequisites
- kubectl (local)
- HAMi >= 2.4.0
- Prometheus > 2.8.0 (must be reachable from inside the cluster)
- Helm > 3.0
- Add the HAMi-WebUI Helm repository and update the index:

```shell
helm repo add hami-webui https://project-hami.github.io/HAMi-WebUI
helm repo update
```

- Install HAMi-WebUI (replace the Prometheus address with one reachable inside your cluster):

```shell
helm install my-hami-webui hami-webui/hami-webui \
  --set externalPrometheus.enabled=true \
  --set externalPrometheus.address="http://prometheus-kube-prometheus-prometheus.monitoring.svc.cluster.local:9090" \
  -n kube-system
```

- Verify the installation:

```shell
kubectl get pods -n kube-system | grep webui
```

On success, both hami-webui and hami-webui-dcgm-exporter should be in the Running state.
- Access
  - The WebUI is local-only; use kubectl port-forward to expose its Service on your machine.
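For example (a sketch: the Service name and port below assume the my-hami-webui release installed above; check `kubectl get svc -n kube-system` for the actual values):

```shell
# Forward the WebUI Service to localhost:3000 (Service name and port are
# assumptions from the my-hami-webui release; verify with `kubectl get svc`)
kubectl port-forward svc/my-hami-webui -n kube-system 3000:3000
# then open http://localhost:3000 in a local browser
```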
3. Switch HAMi to MIG Mode

Reference:
https://project-hami.io/zh/docs/userguide/NVIDIA-device/dynamic-mig-support
3.1 List the UUIDs of the current GPUs

```shell
root@node1:~/k8s/gptsovits# nvidia-smi -L
GPU 0: NVIDIA H200 (UUID: GPU-95b9c498-4e81-d7d9-c503-dd11abfbec71)
GPU 1: NVIDIA H200 (UUID: GPU-38d8e445-dc62-6a21-9007-4c6adf00512a)
GPU 2: NVIDIA H200 (UUID: GPU-5b565fde-ea9f-1c75-cbd9-591a74f47bcd)
GPU 3: NVIDIA H200 (UUID: GPU-f4c19b2a-d93a-1e7c-b150-b899ec7f84bb)
GPU 4: NVIDIA H200 (UUID: GPU-523862b8-d665-ff53-df84-4312007eb3ef)
GPU 5: NVIDIA H200 (UUID: GPU-28b6a508-8ebb-65be-1d28-71f96b4a287c)
GPU 6: NVIDIA H200 (UUID: GPU-378f5120-f2b3-5e3e-6213-45788cb33771)
GPU 7: NVIDIA H200 (UUID: GPU-a7a56bd5-24a7-5642-e437-e0fb947a6576)
```
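The bare UUIDs needed for the next step can be pulled out of that listing mechanically. A small sketch using sed (the sample lines below are copied from the output above; on a real node, pipe `nvidia-smi -L` straight into the same expression):

```shell
# Extract bare GPU UUIDs from `nvidia-smi -L`-style output.
sed -n 's/.*UUID: \(GPU-[0-9a-f-]*\)).*/\1/p' <<'EOF'
GPU 0: NVIDIA H200 (UUID: GPU-95b9c498-4e81-d7d9-c503-dd11abfbec71)
GPU 1: NVIDIA H200 (UUID: GPU-38d8e445-dc62-6a21-9007-4c6adf00512a)
EOF
# → GPU-95b9c498-4e81-d7d9-c503-dd11abfbec71
# → GPU-38d8e445-dc62-6a21-9007-4c6adf00512a
```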
3.2 Edit the ConfigMap hami-device-plugin (enable MIG and filter devices)

Apply the following to the hami-device-plugin ConfigMap in the kube-system namespace. Note that config.json must be strict JSON, which does not allow comments: `name` must match the actual node name, `operatingmode: "mig"` is the key that enables MIG mode, and `filterdevices` filters the managed device set by UUID or index. (Server-populated fields such as managedFields, resourceVersion, and uid are omitted below.)

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: hami-device-plugin
  namespace: kube-system
  annotations:
    meta.helm.sh/release-name: hami
    meta.helm.sh/release-namespace: kube-system
  labels:
    app.kubernetes.io/component: hami-device-plugin
    app.kubernetes.io/instance: hami
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: hami
    app.kubernetes.io/version: 2.6.1
    helm.sh/chart: hami-2.6.1
data:
  config.json: |
    {
      "nodeconfig": [
        {
          "name": "node1",
          "operatingmode": "mig",
          "migstrategy": "none",
          "filterdevices": {
            "uuid": [
              "GPU-5b565fde-ea9f-1c75-cbd9-591a74f47bcd",
              "GPU-f4c19b2a-d93a-1e7c-b150-b899ec7f84bb",
              "GPU-523862b8-d665-ff53-df84-4312007eb3ef",
              "GPU-28b6a508-8ebb-65be-1d28-71f96b4a287c",
              "GPU-378f5120-f2b3-5e3e-6213-45788cb33771",
              "GPU-a7a56bd5-24a7-5642-e437-e0fb947a6576"
            ],
            "index": []
          }
        }
      ]
    }
```
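Because config.json is embedded as a string inside the ConfigMap, a stray comment or trailing comma breaks it silently. The payload can be sanity-checked as strict JSON before applying; a sketch using python3's stdlib json.tool with a trimmed-down copy of the payload:

```shell
# Validate the config.json payload as strict JSON before patching the ConfigMap.
python3 -m json.tool <<'EOF' > /dev/null && echo "valid JSON"
{
  "nodeconfig": [
    {
      "name": "node1",
      "operatingmode": "mig",
      "migstrategy": "none",
      "filterdevices": { "uuid": [], "index": [] }
    }
  ]
}
EOF
```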
3.3 Edit the ConfigMap hami-scheduler-device

The part that matters for MIG mode is the knownMigGeometries list under nvidia:, which declares, per GPU model, the MIG geometries HAMi is allowed to use. (Server-populated fields are again omitted.)

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: hami-scheduler-device
  namespace: kube-system
  annotations:
    meta.helm.sh/release-name: hami
    meta.helm.sh/release-namespace: kube-system
  labels:
    app.kubernetes.io/component: hami-scheduler
    app.kubernetes.io/instance: hami
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: hami
    app.kubernetes.io/version: 2.6.1
    helm.sh/chart: hami-2.6.1
data:
  device-config.yaml: |-
    nvidia:
      resourceCountName: nvidia.com/gpu
      resourceMemoryName: nvidia.com/gpumem
      resourceMemoryPercentageName: nvidia.com/gpumem-percentage
      resourceCoreName: nvidia.com/gpucores
      resourcePriorityName: nvidia.com/priority
      overwriteEnv: false
      defaultMemory: 0
      defaultCores: 0
      defaultGPUNum: 1
      deviceSplitCount: 10
      deviceMemoryScaling: 0.9
      deviceCoreScaling: 1
      gpuCorePolicy: default
      libCudaLogLevel: 1
      runtimeClassName: ""
      knownMigGeometries:
        - models: [ "NVIDIA H200" ]
          allowedGeometries:
            -
              - name: 1g.20gb
                memory: 20480
                count: 7
            -
              - name: 1g.20gb
                memory: 20480
                count: 7
              - name: 7g.70gb
                memory: 71680
                count: 2
    cambricon:
      resourceCountName: cambricon.com/vmlu
      resourceMemoryName: cambricon.com/mlu.smlu.vmemory
      resourceCoreName: cambricon.com/mlu.smlu.vcore
    hygon:
      resourceCountName: hygon.com/dcunum
      resourceMemoryName: hygon.com/dcumem
      resourceCoreName: hygon.com/dcucores
    metax:
      resourceCountName: "metax-tech.com/gpu"
      resourceVCountName: metax-tech.com/sgpu
      resourceVMemoryName: metax-tech.com/vmemory
      resourceVCoreName: metax-tech.com/vcore
    enflame:
      resourceCountName: "enflame.com/vgcu"
      resourcePercentageName: "enflame.com/vgcu-percentage"
    mthreads:
      resourceCountName: "mthreads.com/vgpu"
      resourceMemoryName: "mthreads.com/sgpu-memory"
      resourceCoreName: "mthreads.com/sgpu-core"
    iluvatar:
      resourceCountName: iluvatar.ai/vgpu
      resourceMemoryName: iluvatar.ai/vcuda-memory
      resourceCoreName: iluvatar.ai/vcuda-core
    vnpus:
      - chipName: 910B
        commonWord: Ascend910A
        resourceName: huawei.com/Ascend910A
        resourceMemoryName: huawei.com/Ascend910A-memory
        memoryAllocatable: 32768
        memoryCapacity: 32768
        aiCore: 30
        templates:
          - name: vir02
            memory: 2184
            aiCore: 2
          - name: vir04
            memory: 4369
            aiCore: 4
          - name: vir08
            memory: 8738
            aiCore: 8
          - name: vir16
            memory: 17476
            aiCore: 16
      - chipName: 910B2
        commonWord: Ascend910B2
        resourceName: huawei.com/Ascend910B2
        resourceMemoryName: huawei.com/Ascend910B2-memory
        memoryAllocatable: 65536
        memoryCapacity: 65536
        aiCore: 24
        aiCPU: 6
        templates:
          - name: vir03_1c_8g
            memory: 8192
            aiCore: 3
            aiCPU: 1
          - name: vir06_1c_16g
            memory: 16384
            aiCore: 6
            aiCPU: 1
          - name: vir12_3c_32g
            memory: 32768
            aiCore: 12
            aiCPU: 3
      - chipName: 910B3
        commonWord: Ascend910B
        resourceName: huawei.com/Ascend910B
        resourceMemoryName: huawei.com/Ascend910B-memory
        memoryAllocatable: 65536
        memoryCapacity: 65536
        aiCore: 20
        aiCPU: 7
        templates:
          - name: vir05_1c_16g
            memory: 16384
            aiCore: 5
            aiCPU: 1
          - name: vir10_3c_32g
            memory: 32768
            aiCore: 10
            aiCPU: 3
      - chipName: 910B4
        commonWord: Ascend910B4
        resourceName: huawei.com/Ascend910B4
        resourceMemoryName: huawei.com/Ascend910B4-memory
        memoryAllocatable: 32768
        memoryCapacity: 32768
        aiCore: 20
        aiCPU: 7
        templates:
          - name: vir05_1c_8g
            memory: 8192
            aiCore: 5
            aiCPU: 1
          - name: vir10_3c_16g
            memory: 16384
            aiCore: 10
            aiCPU: 3
      - chipName: 310P3
        commonWord: Ascend310P
        resourceName: huawei.com/Ascend310P
        resourceMemoryName: huawei.com/Ascend310P-memory
        memoryAllocatable: 21527
        memoryCapacity: 24576
        aiCore: 8
        aiCPU: 7
        templates:
          - name: vir01
            memory: 3072
            aiCore: 1
            aiCPU: 1
          - name: vir02
            memory: 6144
            aiCore: 2
            aiCPU: 2
          - name: vir04
            memory: 12288
            aiCore: 4
            aiCPU: 4
```
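With dynamic MIG enabled, workloads keep requesting the same generic HAMi resources (nvidia.com/gpu, nvidia.com/gpumem) and the scheduler picks a matching MIG geometry, per the dynamic-mig-support doc referenced above. A minimal sketch (pod name, image, and memory value are illustrative; 20480 MiB is chosen to line up with the 1g.20gb geometry declared in the ConfigMap):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: mig-test            # illustrative name
spec:
  containers:
    - name: ubuntu-container
      image: ubuntu:22.04
      command: ["bash", "-c", "sleep 86400"]
      resources:
        limits:
          nvidia.com/gpu: 1         # one instance
          nvidia.com/gpumem: 20480  # sized to fit the 1g.20gb geometry above
```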
3.4 Restart the HAMi components

```shell
# Restart the device plugin
kubectl rollout restart daemonset/hami-device-plugin -n kube-system
# Restart the scheduler
kubectl rollout restart deploy/hami-scheduler -n kube-system
# Wait for both rollouts to finish
kubectl rollout status daemonset/hami-device-plugin -n kube-system
kubectl rollout status deploy/hami-scheduler -n kube-system
```
3.5 Verify component status

```shell
kubectl get pods -n kube-system | grep "hami-device-plugin\|hami-scheduler"
```

After the restart, both hami-device-plugin and hami-scheduler pods should return to the Running state.