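In large-model deployment, fault tolerance is key to keeping a system stable. This post records lessons learned the hard way while building fault-tolerance mechanisms for a production deployment.
Fault-tolerance design essentials
1. Timeouts and retries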

    import time
    import requests
    from functools import wraps

    def retry_with_backoff(max_retries=3, backoff_factor=2):
        """Retry the wrapped call on request errors, with exponential backoff."""
        def decorator(func):
            @wraps(func)
            def wrapper(*args, **kwargs):
                for attempt in range(max_retries):
                    try:
                        return func(*args, **kwargs)
                    except requests.exceptions.RequestException:
                        # Out of attempts: re-raise the last error to the caller.
                        if attempt == max_retries - 1:
                            raise
                        # Sleep backoff_factor ** attempt seconds before the next try.
                        time.sleep(backoff_factor ** attempt)
            return wrapper
        return decorator

    @retry_with_backoff(max_retries=3, backoff_factor=2)
    def model_inference(prompt):
        response = requests.post('http://localhost:8000/infer',
                                 json={'prompt': prompt},
                                 timeout=5)
        # Raise on HTTP 4xx/5xx so an error response is not returned as a
        # result; the resulting HTTPError is also covered by the retry.
        response.raise_for_status()
        return response.json()

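
To make the wait schedule of the decorator above concrete, here is a minimal sketch that computes the delays it would sleep for between attempts. The `backoff_schedule` helper, its `cap` limit, and the optional random jitter are assumptions added for illustration and are not part of the original code; jitter is a common way to keep many clients from retrying in lockstep.

```python
import random

def backoff_schedule(max_retries=3, backoff_factor=2, cap=None, jitter=False, rng=None):
    """Return the sleeps between attempts for an exponential backoff policy.

    Hypothetical helper: mirrors wait_time = backoff_factor ** attempt,
    with an optional upper bound (cap) and optional full jitter.
    """
    rng = rng or random.Random()
    delays = []
    # The final attempt re-raises instead of sleeping, so there are
    # max_retries - 1 waits in total.
    for attempt in range(max_retries - 1):
        delay = backoff_factor ** attempt
        if cap is not None:
            delay = min(delay, cap)
        if jitter:
            # Full jitter: pick a random wait between 0 and the computed delay.
            delay = rng.uniform(0, delay)
        delays.append(delay)
    return delays

print(backoff_schedule(max_retries=3, backoff_factor=2))         # [1, 2]
print(backoff_schedule(max_retries=5, backoff_factor=2, cap=4))  # [1, 2, 4, 4]
```

With the defaults used above (max_retries=3, backoff_factor=2), the call sleeps 1 s and then 2 s before the third and final attempt.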
2. Circuit breaker pattern
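
    from collections import deque
    import time

    class CircuitBreaker:
        def __init__(self, failure_threshold=5, timeout=60):
            self.failure_threshold = failure_threshold
            self.timeout = timeout
            self.failures = deque()  # timestamps of failures since the last success
            self.last_failure_time = None

        def call(self, func, *args, **kwargs):
            if self._is_open():
                raise Exception('Circuit breaker is OPEN')
            try:
                result = func(*args, **kwargs)
                self._record_success()
                return result
            except Exception:
                self._record_failure()
                raise

        def _record_success(self):
            # A successful call clears the failure history.
            self.failures.clear()
            self.last_failure_time = None

        def _record_failure(self):
            self.last_failure_time = time.time()
            self.failures.append(self.last_failure_time)

        def _is_open(self):
            # Not open if there are no failures, or the last one is older than timeout.
            if not self.failures or time.time() - self.last_failure_time > self.timeout:
                return False
            return len(self.failures) >= self.failure_threshold
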
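
The breaker above lets calls through again once `timeout` seconds have passed since the last failure. Many implementations name that window a "half-open" state, in which a failed trial call re-opens the breaker immediately. The following is a minimal, self-contained sketch of that variant. `HalfOpenBreaker`, `CircuitOpenError`, and the injectable `clock` parameter are hypothetical names introduced here, and the sketch is not thread-safe and does not limit half-open traffic to a single trial call.

```python
import time

class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the breaker is open."""

class HalfOpenBreaker:
    # Hypothetical sketch; not the CircuitBreaker class from the post.
    def __init__(self, failure_threshold=5, timeout=60, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.timeout = timeout
        self.clock = clock        # injectable so tests need not sleep
        self.failure_count = 0    # consecutive failures since the last success
        self.opened_at = None     # None means the breaker is closed

    @property
    def state(self):
        if self.opened_at is None:
            return 'closed'
        if self.clock() - self.opened_at >= self.timeout:
            return 'half-open'
        return 'open'

    def call(self, func, *args, **kwargs):
        state = self.state
        if state == 'open':
            raise CircuitOpenError('Circuit breaker is OPEN')
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failure_count += 1
            # A failed trial call re-opens immediately; otherwise open
            # only once the failure threshold is reached.
            if state == 'half-open' or self.failure_count >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        # Any success closes the breaker and clears the failure count.
        self.failure_count = 0
        self.opened_at = None
        return result
```

Raising a dedicated exception type instead of a bare `Exception` lets callers tell "rejected by the breaker" apart from a failure of the wrapped function, for example to return a cached or degraded response.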
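Deployment recommendations
In a real deployment, integrate the fault-tolerance mechanisms above at the API gateway layer and pair them with Prometheus metrics to automate failure detection and recovery. Also configure sensible health check probes so that service status is updated promptly.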
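
As one possible shape for the health check probes mentioned above, here is a minimal sketch of a probe endpoint built only on the Python standard library. The paths `/healthz` and `/readyz`, the port `8081`, and the `model_ready` flag are assumptions chosen for illustration; a real service would usually expose these from its existing web framework and point its orchestrator's liveness and readiness checks at them.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

# Hypothetical readiness flag: set it once the model has finished loading.
model_ready = threading.Event()

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == '/healthz':
            # Liveness: the process is up and able to answer requests.
            self._reply(200, {'status': 'alive'})
        elif self.path == '/readyz':
            # Readiness: only report ready once the model can serve traffic.
            if model_ready.is_set():
                self._reply(200, {'status': 'ready'})
            else:
                self._reply(503, {'status': 'loading'})
        else:
            self._reply(404, {'error': 'not found'})

    def _reply(self, code, payload):
        body = json.dumps(payload).encode('utf-8')
        self.send_response(code)
        self.send_header('Content-Type', 'application/json')
        self.send_header('Content-Length', str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # silence per-request logging in this sketch

def start_health_server(port=8081):
    """Start the probe endpoint on a background thread and return the server."""
    server = ThreadingHTTPServer(('127.0.0.1', port), HealthHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

Keeping liveness and readiness separate means a slow model load only takes the instance out of rotation (503 on `/readyz`) rather than triggering a restart.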
