Configuring HPA for GPU agents based on GPU memory utilization comes down to one thing: making Kubernetes able to collect and understand GPU memory utilization as a custom metric, and letting that metric drive the Horizontal Pod Autoscaler's decisions. The detailed configuration and practical steps follow.

## 1. Overall Architecture and Components

An HPA driven by GPU memory utilization needs a complete monitoring and metric-translation pipeline:

- **Metric collection**: an `nvidia-dcgm-exporter` (or `nvidia-gpu-exporter`) DaemonSet on each GPU node collects GPU metrics, including memory usage.
- **Metric storage and aggregation**: the Prometheus Server periodically scrapes the exporters and stores the metrics.
- **Metric translation and adaptation**: `prometheus-adapter` translates the raw Prometheus metrics into the `custom.metrics.k8s.io` API format that Kubernetes understands.
- **Autoscaling decisions**: the HPA controller queries the `custom.metrics.k8s.io` API for the memory-utilization metric and computes the desired replica count from the configured target.

```mermaid
graph TD
    subgraph K8s Node
        A[nvidia-dcgm-exporter<br/>DaemonSet]
    end
    A -->|exposes GPU metrics| B(Prometheus Server)
    B -->|serves stored metrics| C[Prometheus Adapter]
    C -->|translates metric format| D[custom.metrics.k8s.io API]
    E[HPA Controller] -->|queries metrics| D
    E -->|scales| F[Deployment<br/>GPU agent]
```

## 2. Step-by-Step Configuration

### Step 1: Deploy the GPU metrics exporter

This example uses `nvidia-dcgm-exporter`, NVIDIA's officially recommended GPU monitoring component.

```yaml
# nvidia-dcgm-exporter-daemonset.yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-dcgm-exporter
  namespace: monitoring        # the monitoring namespace is recommended
spec:
  selector:
    matchLabels:
      name: nvidia-dcgm-exporter
  template:
    metadata:
      labels:
        name: nvidia-dcgm-exporter
    spec:
      tolerations:             # tolerate the GPU taint so the pod schedules onto GPU nodes
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
      containers:
      - name: nvidia-dcgm-exporter
        image: nvidia/dcgm-exporter:latest
        resources:
          limits:
            nvidia.com/gpu: 1  # request one GPU for monitoring (note: this reserves that GPU exclusively)
          requests:
            nvidia.com/gpu: 1
        securityContext:
          runAsNonRoot: false
          runAsUser: 0
        ports:
        - name: metrics
          containerPort: 9400
          protocol: TCP
        env:
        - name: DCGM_EXPORTER_INTERVAL    # collection interval, in milliseconds
          value: "30000"
        - name: DCGM_EXPORTER_KUBERNETES  # attach namespace/pod labels of the workload using each GPU
          value: "true"
        volumeMounts:
        - name: pod-resources             # kubelet socket used for the GPU-to-pod mapping
          mountPath: /var/lib/kubelet/pod-resources
      volumes:
      - name: pod-resources
        hostPath:
          path: /var/lib/kubelet/pod-resources
```

After deployment, confirm metrics are exposed at `http://<node-ip>:9400/metrics`. Key metrics include:

- `DCGM_FI_DEV_FB_USED` — GPU framebuffer (memory) used, in MiB
- `DCGM_FI_DEV_FB_FREE` — GPU framebuffer memory free, in MiB (total memory can be derived as used + free)
- `DCGM_FI_DEV_GPU_UTIL` — GPU compute utilization, in percent

### Step 2: Configure Prometheus scraping

Make sure the Prometheus Server configuration contains a job that scrapes `nvidia-dcgm-exporter`, either directly in `prometheus.yml` or, with the Prometheus Operator, via a `ServiceMonitor`.

Example `ServiceMonitor` (when using the Prometheus Operator):

```yaml
# service-monitor-gpu.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: nvidia-dcgm-exporter
  namespace: monitoring
  labels:
    release: prometheus-operator   # must match the Prometheus instance's selector
spec:
  selector:
    matchLabels:
      name: nvidia-dcgm-exporter   # matches the exporter Service's labels
  endpoints:
  - port: metrics
    interval: 30s
  namespaceSelector:
    matchNames:
    - monitoring
---
# A ServiceMonitor selects Services, not pods, so expose the DaemonSet
# through a headless Service carrying the matching label.
apiVersion: v1
kind: Service
metadata:
  name: nvidia-dcgm-exporter
  namespace: monitoring
  labels:
    name: nvidia-dcgm-exporter
spec:
  clusterIP: None
  selector:
    name: nvidia-dcgm-exporter
  ports:
  - name: metrics
    port: 9400
    targetPort: 9400
```

### Step 3: Deploy and configure the Prometheus Adapter

`prometheus-adapter` translates Prometheus metrics into the Kubernetes custom metrics API. Its core configuration file, `config.yaml`, defines the metric discovery and translation rules.

**1. Create the adapter configuration** — the key part is turning `DCGM_FI_DEV_FB_USED` and `DCGM_FI_DEV_FB_FREE` into a per-pod memory-utilization metric:

```yaml
# prometheus-adapter-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-adapter-config
  namespace: custom-metrics
data:
  config.yaml: |
    rules:
    - seriesQuery: 'DCGM_FI_DEV_FB_USED{namespace!="",pod!=""}'
      resources:
        overrides:
          namespace: {resource: namespace}
          pod: {resource: pod}
      name:
        matches: "DCGM_FI_DEV_FB_USED"
        as: "gpu_memory_used_mib"          # exposed metric name
      metricsQuery: sum(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)
    - seriesQuery: 'DCGM_FI_DEV_FB_FREE{namespace!="",pod!=""}'
      resources:
        overrides:
          namespace: {resource: namespace}
          pod: {resource: pod}
      name:
        matches: "DCGM_FI_DEV_FB_FREE"
        as: "gpu_memory_free_mib"
      metricsQuery: sum(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)
    # Key rule: define a utilization metric for the HPA to target
    - seriesQuery: 'DCGM_FI_DEV_FB_USED{namespace!="",pod!=""}'
      resources:
        template: "<<.Resource>>"
      name:
        matches: "DCGM_FI_DEV_FB_USED"
        as: "gpu_memory_utilization"       # the metric name the HPA will use
      metricsQuery: |
        sum(DCGM_FI_DEV_FB_USED{<<.LabelMatchers>>}) by (<<.GroupBy>>)
        / (
            sum(DCGM_FI_DEV_FB_USED{<<.LabelMatchers>>}) by (<<.GroupBy>>)
          + sum(DCGM_FI_DEV_FB_FREE{<<.LabelMatchers>>}) by (<<.GroupBy>>)
        )
        * 100
```
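Before deploying the adapter, it is worth sanity-checking the utilization expression against Prometheus directly. A minimal sketch — the service name `prometheus-server` in the `monitoring` namespace is an assumption and should match your actual install:

```bash
# Port-forward to the Prometheus service (name/namespace assumed; adjust to your install).
kubectl -n monitoring port-forward svc/prometheus-server 9090:9090 &

# Evaluate the same per-pod utilization expression that the adapter rule uses.
curl -sG 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=sum by (namespace, pod) (DCGM_FI_DEV_FB_USED)
    / (sum by (namespace, pod) (DCGM_FI_DEV_FB_USED) + sum by (namespace, pod) (DCGM_FI_DEV_FB_FREE))
    * 100' \
  | jq '.data.result'
```

An empty result usually means the DCGM series lack `namespace`/`pod` labels, i.e. the pod-mapping settings from Step 1 (`DCGM_EXPORTER_KUBERNETES` and the `pod-resources` mount) are not in effect.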
**2. Deploy the Prometheus Adapter** — either via its Helm chart, or directly with a Deployment:

```yaml
# prometheus-adapter-deployment.yaml (excerpt)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: prometheus-adapter
  namespace: custom-metrics
spec:
  template:
    spec:
      containers:
      - name: prometheus-adapter
        image: registry.k8s.io/prometheus-adapter/prometheus-adapter:v0.11.2
        args:
        - --secure-port=6443
        - --tls-cert-file=/var/run/serving-cert/tls.crt
        - --tls-private-key-file=/var/run/serving-cert/tls.key
        - --logtostderr=true
        - --prometheus-url=http://prometheus-server.monitoring.svc.cluster.local:9090  # Prometheus service address
        - --metrics-relist-interval=1m
        - --v=6
        - --config=/etc/adapter/config.yaml
        volumeMounts:
        - name: config
          mountPath: /etc/adapter
        - name: serving-cert
          mountPath: /var/run/serving-cert
      volumes:
      - name: config
        configMap:
          name: prometheus-adapter-config
      - name: serving-cert
        secret:
          secretName: prometheus-adapter-tls
---
apiVersion: v1
kind: Service
metadata:
  name: prometheus-adapter
  namespace: custom-metrics
spec:
  ports:
  - name: https
    port: 443
    targetPort: 6443
  selector:
    app: prometheus-adapter
```

After deployment, verify that the custom metrics API responds:

```bash
kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta2/namespaces/default/pods/*/gpu_memory_utilization" | jq .
```

### Step 4: Create the HPA on GPU memory utilization

Assume the GPU agent runs as a Deployment named `gpu-agent-deployment` in the `default` namespace:

```yaml
# hpa-gpu-memory.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: gpu-agent-hpa
  namespace: default
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: gpu-agent-deployment
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Pods
    pods:
      metric:
        name: gpu_memory_utilization   # must match the name defined in the adapter config
      target:
        type: AverageValue
        averageValue: "75"             # keep average GPU memory utilization across pods at 75%
  behavior:                            # scaling behavior, to damp flapping
    scaleDown:
      stabilizationWindowSeconds: 300  # 300 s cool-down window before scaling in
      policies:
      - type: Percent
        value: 10
        periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 60   # 60 s window before scaling out
      policies:
      - type: Percent
        value: 100
        periodSeconds: 60
```

## 3. Key Configuration Notes and Best Practices

- **Metric choice and computation**: using the ratio of memory used to memory total as the HPA metric is more robust than raw memory usage, because it cancels out differences between GPU models (e.g., A100-40GB vs V100-16GB) and makes a target such as `averageValue: "75"` portable across hardware.
- **Target value**: `averageValue: "75"` tells the HPA to keep the average GPU memory utilization across all pods at 75%. The right value depends on the actual workload and model characteristics and should be determined through load testing: too high (e.g., 90%) risks OOM kills and late scale-outs; too low (e.g., 50%) wastes resources.
- **Cool-downs and scaling policy**: the `behavior` field is critical. GPU applications start slowly (model loading takes time), so `scaleUp.stabilizationWindowSeconds` can be short (e.g., 60 s) to react quickly to rising load, while `scaleDown.stabilizationWindowSeconds` should be long (e.g., 300 s) so short dips don't evict pods prematurely and cause service churn.
- **Combine multiple metrics**: in production, pair GPU memory utilization with QPS (queries per second), request latency, or GPU compute utilization in a multi-metric HPA for more complete elasticity. For example, even when memory utilization is under target, sustained high QPS should still trigger a scale-out:

  ```yaml
  metrics:
  - type: Pods
    pods:
      metric:
        name: gpu_memory_utilization
      target:
        type: AverageValue
        averageValue: "75"
  - type: Object
    object:
      metric:
        name: requests_per_second    # assumes a QPS metric exposed by the application
      describedObject:
        apiVersion: v1
        kind: Service
        name: gpu-agent-service
      target:
        type: Value
        value: "100"
  ```

- **Accurate pod resource requests**: the HPA only adjusts the replica count; whether new pods actually schedule depends on their `resources.requests`. Always set `nvidia.com/gpu: 1` under `resources.limits` and an accurate `resources.requests.memory` for the GPU agent pods, so the scheduler can find nodes with enough capacity.
- **Monitoring and alerting**: add Prometheus alert rules on HPA state (e.g., `kube_horizontalpodautoscaler_status_current_replicas`), on failed scaling events, and on abnormal GPU memory utilization, so problems can be caught and handled promptly.

With these steps in place, Kubernetes runs a responsive and stable autoscaling loop driven by GPU memory utilization, improving both GPU cluster utilization and service reliability. A quick end-to-end check is sketched below.
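Once everything is applied, the loop can be observed end to end. A minimal check, assuming the names used in the manifests above (`gpu-agent-hpa` in the `default` namespace):

```bash
# Watch the current metric value vs. target and the replica count converge.
kubectl -n default get hpa gpu-agent-hpa --watch

# Inspect conditions and recent events (e.g. SuccessfulRescale, FailedGetPodsMetric)
# to catch a broken metrics pipeline early.
kubectl -n default describe hpa gpu-agent-hpa
```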