Building Fault Tolerance into Large Model Deployments

FatBone · 2025-12-24T07:01:19 · Fault tolerance · Production

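In large-model deployment, fault tolerance is key to keeping the system stable. This post records lessons learned the hard way while building fault-tolerance mechanisms for a production deployment.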

Key Points of Fault-Tolerant Design

1. Timeouts and retries

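import time
import requests
from functools import wraps

def retry_with_backoff(max_retries=3, backoff_factor=2):
    """Retry on request errors, sleeping backoff_factor ** attempt seconds between tries."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_retries):
                try:
                    return func(*args, **kwargs)
                except requests.exceptions.RequestException:
                    # Out of attempts: propagate the last error to the caller.
                    if attempt == max_retries - 1:
                        raise
                    # Exponential backoff: 1s, then 2s, with the defaults.
                    time.sleep(backoff_factor ** attempt)
        return wrapper
    return decorator

@retry_with_backoff(max_retries=3, backoff_factor=2)
def model_inference(prompt):
    response = requests.post('http://localhost:8000/infer',
                             json={'prompt': prompt},
                             timeout=5)  # seconds; a timeout raises a RequestException
    # HTTP 4xx/5xx raise HTTPError (a RequestException), so they are retried too.
    response.raise_for_status()
    return response.json()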
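
To make the retry timing concrete, the sketch below lists the waits the decorator performs between attempts, and adds an optional cap and random jitter. The `backoff_delays` helper and its `cap`, `jitter`, and `rng` parameters are illustrative assumptions, not part of the original decorator; jitter is a common way to keep many clients from retrying in lockstep.

```python
import random

def backoff_delays(max_retries=3, backoff_factor=2, cap=None, jitter=False, rng=random):
    """Return the sleep times (seconds) between attempts.

    Hypothetical helper: mirrors `backoff_factor ** attempt` from the decorator,
    with an optional upper bound (`cap`) and "full jitter" (uniform in [0, delay]).
    """
    delays = []
    # The final attempt re-raises instead of sleeping, hence max_retries - 1 waits.
    for attempt in range(max(max_retries - 1, 0)):
        delay = backoff_factor ** attempt
        if cap is not None:
            delay = min(delay, cap)
        if jitter:
            delay = rng.uniform(0, delay)
        delays.append(delay)
    return delays

print(backoff_delays(3, 2))          # [1, 2]: waits used by the defaults above
print(backoff_delays(6, 2, cap=10))  # [1, 2, 4, 8, 10]: late retries are capped
```

With three attempts and a 5-second request timeout, the worst case for one call is roughly 3 × 5 s of waiting on the server plus 3 s of backoff, which is worth checking against the caller's own deadline.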

2. Circuit breaker pattern

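from collections import deque
import time

class CircuitBreaker:
    def __init__(self, failure_threshold=5, timeout=60):
        self.failure_threshold = failure_threshold  # failures needed to open the circuit
        self.timeout = timeout                      # seconds to stay open after the last failure
        self.failures = deque(maxlen=failure_threshold)  # timestamps of recent failures
        self.last_failure_time = None

    def call(self, func, *args, **kwargs):
        # Fail fast while the circuit is open instead of hitting the struggling service.
        if self._is_open():
            raise RuntimeError('Circuit breaker is OPEN')

        try:
            result = func(*args, **kwargs)
        except Exception:
            self._record_failure()
            raise
        self._record_success()
        return result

    def _is_open(self):
        # Closed if nothing has failed, or if the cool-down since the last failure has elapsed.
        if not self.failures or time.time() - self.last_failure_time > self.timeout:
            return False
        return len(self.failures) >= self.failure_threshold

    # Bookkeeping helpers used by call(); a minimal implementation,
    # assuming a success resets the failure history.
    def _record_success(self):
        self.failures.clear()
        self.last_failure_time = None

    def _record_failure(self):
        self.last_failure_time = time.time()
        self.failures.append(self.last_failure_time)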
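
The class above distinguishes only open from closed. A common refinement is an explicit half-open state that lets a single trial call through once the cool-down has passed. The following is a minimal sketch under that assumption; `ThreeStateBreaker`, `CircuitOpenError`, `reset_timeout`, and the injectable `clock` are hypothetical names, not taken from the original code.

```python
import time

class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""

class ThreeStateBreaker:
    def __init__(self, failure_threshold=5, reset_timeout=60, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock            # injectable so the timing can be tested
        self.state = 'closed'
        self.failure_count = 0
        self.opened_at = None

    def call(self, func, *args, **kwargs):
        if self.state == 'open':
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError('circuit is open')
            # Cool-down elapsed: allow one trial call.
            self.state = 'half_open'
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_success(self):
        self.state = 'closed'
        self.failure_count = 0

    def _on_failure(self):
        self.failure_count += 1
        # A failed trial call, or too many consecutive failures, opens the circuit.
        if self.state == 'half_open' or self.failure_count >= self.failure_threshold:
            self.state = 'open'
            self.opened_at = self.clock()

def flaky():
    raise ConnectionError('backend down')

breaker = ThreeStateBreaker(failure_threshold=2, reset_timeout=30)
for _ in range(3):
    try:
        breaker.call(flaky)
    except CircuitOpenError:
        print('rejected without calling the backend')
    except ConnectionError:
        print('backend call failed')
```

In the usage loop, the first two calls reach the backend and fail, and the third is rejected immediately because the threshold of 2 has been reached. Using a dedicated exception type lets callers tell "the breaker refused" apart from "the backend failed".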

Deployment Recommendations

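In a real deployment, integrate the fault-tolerance mechanisms above at the API gateway layer and pair them with Prometheus monitoring metrics to automate fault detection and recovery. Also configure sensible health-check probes so that service status is updated promptly.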
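
As one way to picture the health-check probe mentioned above, here is a minimal readiness endpoint built only on the Python standard library. The `/healthz` path, the `HealthState` class, the default port, and the rule "unhealthy after 3 consecutive failures" are assumptions for illustration; a real deployment would follow whatever the gateway or orchestrator expects.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

class HealthState:
    """Tracks consecutive inference failures (hypothetical helper)."""
    def __init__(self, max_consecutive_failures=3):
        self.max_consecutive_failures = max_consecutive_failures
        self.consecutive_failures = 0

    def record(self, success):
        # Any success resets the streak; a failure extends it.
        self.consecutive_failures = 0 if success else self.consecutive_failures + 1

    @property
    def healthy(self):
        return self.consecutive_failures < self.max_consecutive_failures

def probe_response(state):
    """Return (HTTP status, JSON body) for the probe."""
    status = 200 if state.healthy else 503
    body = json.dumps({'healthy': state.healthy,
                       'consecutive_failures': state.consecutive_failures})
    return status, body

def make_handler(state):
    class HealthHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path != '/healthz':
                self.send_error(404)
                return
            status, body = probe_response(state)
            payload = body.encode('utf-8')
            self.send_response(status)
            self.send_header('Content-Type', 'application/json')
            self.send_header('Content-Length', str(len(payload)))
            self.end_headers()
            self.wfile.write(payload)
    return HealthHandler

def make_server(state, host='127.0.0.1', port=8080):
    # Call .serve_forever() on the result to start answering probes.
    return HTTPServer((host, port), make_handler(state))
```

The inference path would call `state.record(True)` or `state.record(False)` after each request. A 503 from the probe signals that traffic should be routed elsewhere until a successful call resets the counter.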


Discussion

FierceDance · 2026-01-08T10:24:58
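Don't tune timeout retries by retry count alone; adjust the parameters to the business scenario, or you can easily bring down the downstream service. I suggest pairing retries with a circuit breaker to avoid cascading failures.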
后端思维 · 2026-01-08T10:24:58
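You really can't cut corners on fault-tolerant design in production. I once skipped the circuit breaker, and a single failed model endpoint took down the whole service. A hard lesson.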