监控系统性能瓶颈定位

瓶颈识别指标

CPU使用率监控

# 使用top命令查看CPU使用率
watch -n 1 'top -b -n 1 | grep "Cpu(s)"'

# 告警配置：当CPU使用率超过85%时触发
threshold_cpu=85

内存消耗监控

# 监控内存使用情况
free -m | awk 'NR==2{printf "%.2f%%", $3*100/$2 }'

# 告警配置：内存使用率超过90%触发
threshold_mem=90

模型推理延迟

import time
import requests

# 推理时间监控
start_time = time.time()
response = requests.post('http://localhost:8000/predict', json={'data': [1,2,3]})
end_time = time.time()
latency = (end_time - start_time) * 1000  # 转换为毫秒

# 告警阈值：延迟超过500ms触发
threshold_latency=500

实际定位步骤

采集指标：每分钟收集CPU、内存、延迟数据
异常检测：使用滑动窗口计算标准差，超过3σ则标记异常
日志追踪：当监控指标超过阈值时，自动记录相关服务日志
告警通知：通过钉钉机器人发送告警信息至运维群

# 告警脚本示例
if [ $cpu_usage -gt $threshold_cpu ]; then
  curl -X POST "https://oapi.dingtalk.com/robot/send?access_token=xxx" \
    -H "Content-Type: application/json" \
    -d '{"msgtype": "text", "text": {"content": "CPU使用率过高: $cpu_usage%"}}'
fi

监控系统性能瓶颈定位

监控系统性能瓶颈定位

瓶颈识别指标

实际定位步骤

讨论

选择表情