Pitfalls with Kubernetes Node Affinity and Model-Serving Scheduling
While recently deploying TensorFlow Serving on Kubernetes, I ran into a frustrating problem: the model-serving Pods kept being scheduled onto the wrong nodes, and inference performance suffered.
Background
We deploy the model service with TensorFlow Serving in Docker containers and expected node affinity to place the serving Pods onto specific GPU nodes. The actual behaviour was anything but that.
First Attempt
The first attempt used a simple nodeSelector keyed on a node label:
apiVersion: v1
kind: Pod
metadata:
  name: tensorflow-serving
  labels:
    app: tensorflow-serving
spec:
  # Schedule only onto nodes carrying the gpu-node=true label
  nodeSelector:
    gpu-node: "true"
  containers:
  - name: serving
    image: tensorflow/serving:latest
    ports:
    - containerPort: 8501
But the Pod kept being scheduled onto CPU nodes anyway.
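To see where a Pod actually landed, the wide output of kubectl get pods includes a NODE column with the assignment:

# List the serving Pods along with the node each one was scheduled to
kubectl get pods -l app=tensorflow-serving -o wide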
Root Cause
Troubleshooting with kubectl describe pod <pod-name> showed that the node labels had never actually been applied to the GPU nodes. They need to be set first:
# Label the node
kubectl label nodes node-01 gpu-node=true
kubectl label nodes node-01 tensorflow-serving=true
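Before re-deploying, it's worth confirming that the labels are really on the node:

# Verify the labels are present
kubectl get nodes -l gpu-node=true,tensorflow-serving=true
kubectl get node node-01 --show-labels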
Complete Solution
The final version uses a more precise node-affinity configuration:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tensorflow-serving
spec:
  replicas: 3
  selector:
    matchLabels:
      app: tensorflow-serving
  template:
    metadata:
      labels:
        app: tensorflow-serving
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: gpu-node
                operator: In
                values: ["true"]
              - key: tensorflow-serving
                operator: In
                values: ["true"]
      containers:
      - name: serving
        image: tensorflow/serving:latest
        ports:
        - containerPort: 8501
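One design note: requiredDuringSchedulingIgnoredDuringExecution is a hard constraint, so if no node carries both labels the Pods simply stay Pending. If a soft preference is enough, the same expressions can go under preferredDuringSchedulingIgnoredDuringExecution instead; a sketch, not what we ended up running:

# Soft preference variant: the scheduler favours matching nodes but can fall back
affinity:
  nodeAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      preference:
        matchExpressions:
        - key: gpu-node
          operator: In
          values: ["true"]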
Load Balancing
An Ingress controller provides load balancing in front of the serving replicas:
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: tf-serving-ingress
spec:
  rules:
  - host: model.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: tensorflow-serving-svc
            port:
              number: 8501
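The Ingress routes to a Service named tensorflow-serving-svc, which isn't shown above. A minimal sketch of what it could look like, assuming it selects the Deployment's app: tensorflow-serving label and exposes the REST port 8501:

apiVersion: v1
kind: Service
metadata:
  name: tensorflow-serving-svc
spec:
  # Match the Pods created by the Deployment above
  selector:
    app: tensorflow-serving
  ports:
  - port: 8501
    targetPort: 8501
    protocol: TCP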
With this in place, the service is finally scheduled onto the intended GPU nodes and load-balanced across the replicas.
