Handling Kubernetes Node Failures: From Monitoring to Automated Recovery
Node failures are an unavoidable reality of operating Kubernetes clusters. This article walks through a complete node-failure handling workflow, covering monitoring, automated draining, and recovery.
Failure Detection and Alerting
First, configure Prometheus to discover and scrape cluster nodes:
# prometheus.yml
scrape_configs:
  - job_name: 'kubernetes-nodes'
    kubernetes_sd_configs:
      - role: node
    relabel_configs:
      - source_labels: [__address__]
        action: replace
        target_label: instance
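On top of the scrape config, an alerting rule can fire when a node stops reporting Ready. A minimal sketch, assuming kube-state-metrics is deployed (it exposes the `kube_node_status_condition` metric); the file and group names are placeholders:

```yaml
# node-alerts.yml (Prometheus rule file; name is an assumption)
groups:
  - name: node-health
    rules:
      - alert: NodeNotReady
        # The series is 1 while the Ready condition is "true", 0 otherwise,
        # so == 0 matches nodes that are NotReady or Unknown.
        expr: kube_node_status_condition{condition="Ready",status="true"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Node {{ $labels.node }} has been NotReady for 5 minutes"
```

The `for: 5m` clause avoids paging on brief kubelet restarts or network blips.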
Automated Drain Script
#!/bin/bash
# auto-drain-node.sh -- cordon and drain a failed node.
NODE_NAME=$1
if [[ -z "$NODE_NAME" ]]; then
  echo "Usage: $0 <node-name>"
  exit 1
fi
# --delete-emptydir-data replaces the deprecated --delete-local-data flag.
kubectl drain "$NODE_NAME" --ignore-daemonsets --delete-emptydir-data
# Do NOT uncordon immediately after draining -- that would let pods be
# scheduled right back onto the failed node. Uncordon only once the node
# has actually recovered:
#   kubectl uncordon "$NODE_NAME"
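After a drained node recovers, it has to be uncordoned before it accepts workloads again. A small polling sketch, assuming `kubectl` is configured for the target cluster; the function names are mine, not part of the original scripts:

```shell
#!/bin/bash
# node_is_ready: succeed when the Ready-condition status string is exactly "True".
node_is_ready() {
  [[ "$1" == "True" ]]
}

# wait_and_uncordon: poll the node's Ready condition, then re-enable scheduling.
wait_and_uncordon() {
  local node=$1
  local status
  until status=$(kubectl get node "$node" \
      -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}') \
      && node_is_ready "$status"; do
    echo "waiting for $node to become Ready..."
    sleep 30
  done
  kubectl uncordon "$node"
}

# Only run the polling loop when a node name is passed on the command line.
if [[ -n "$1" ]]; then
  wait_and_uncordon "$1"
fi
```

Run it as `./wait-and-uncordon.sh <node-name>` after the drain script, or from a systemd timer on a management host.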
Cluster Self-Healing Configuration
# node-health.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: node-health-config
  namespace: kube-system
data:
  health-check.sh: |
    #!/bin/bash
    # NODE_NAME must be injected into the environment by whatever workload
    # mounts this ConfigMap (e.g. via an env var or the Downward API).
    if ! kubectl get node "$NODE_NAME" -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}' | grep -q True; then
      echo "Node $NODE_NAME is not Ready, initiating drain..."
      kubectl drain "$NODE_NAME" --ignore-daemonsets --delete-emptydir-data
    fi
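The ConfigMap alone does nothing until something runs the script. One option is a CronJob that mounts it; a sketch under the assumption that a `node-health` ServiceAccount with permissions to get nodes and evict pods already exists, and with the image and the `NODE_NAME` value as placeholders:

```yaml
# node-health-cronjob.yaml (sketch; ServiceAccount, RBAC, image and node name are assumptions)
apiVersion: batch/v1
kind: CronJob
metadata:
  name: node-health-check
  namespace: kube-system
spec:
  schedule: "*/5 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: node-health
          restartPolicy: OnFailure
          containers:
            - name: health-check
              image: bitnami/kubectl:latest
              command: ["/bin/bash", "/scripts/health-check.sh"]
              env:
                # The script reads NODE_NAME; a fixed example value here.
                - name: NODE_NAME
                  value: "worker-1"
              volumeMounts:
                - name: scripts
                  mountPath: /scripts
          volumes:
            - name: scripts
              configMap:
                name: node-health-config
```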
Automatic Recovery via Native Kubernetes Mechanisms
Configure taints and tolerations so that workloads are evicted from a failed node and rescheduled elsewhere:
# deployment.yaml (excerpt)
spec:
  template:
    spec:
      tolerations:
        - key: "node.kubernetes.io/unreachable"
          operator: "Exists"
          # The node lifecycle controller applies this taint with the
          # NoExecute effect, so the toleration must use NoExecute too.
          effect: "NoExecute"
          # Evict pods from an unreachable node after 60s instead of the
          # default 300s, so replacements are scheduled sooner.
          tolerationSeconds: 60
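Note that `kubectl drain` honours PodDisruptionBudgets, so pairing these tolerations with a PDB keeps a minimum number of replicas serving while a node is drained. A sketch, where the `app: my-app` selector is a placeholder for your workload's labels:

```yaml
# pdb.yaml (sketch; names and selector are placeholders)
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb
spec:
  # Drain-triggered evictions block until at least 2 replicas stay available.
  minAvailable: 2
  selector:
    matchLabels:
      app: my-app
```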
Together, these pieces reduce the need for manual intervention and improve overall cluster stability.
