Building a Production-Grade MLOps Pipeline Step by Step: Python + Docker + MLflow + FastAPI

Latest recommended article published 2026-06-17 14:05:14

This content is licensed under CC 4.0 BY-SA.

Tags

#MLOps #DataDrift #ModelRegistry

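1. Project overview: why every data scientist should build an MLOps pipeline by hand at least once

Has this happened to you? You tune a model to an AUC of 0.92 in Jupyter and proudly send it to the product team, and they reply: "When can it go live? Where are the API docs?" You look at your local notebook, which has no logging, no exception handling, and a requirements.txt written from memory, and you realize this is not a production-ready model. It is a good-looking experiment report. Of the 7 data science teams I have led, 5 got stuck at exactly this point mid-project: between model development and engineering delivery lies a gap that is invisible but very hard to cross.

MLOps is not some arcane new technology. It is a practical methodology for taking a machine learning project from "it runs" to "deliverable, monitorable, and iterable". The term Artificial Intelligence here does not mean the algorithm itself; it means the ability of the whole AI system to keep running in a real business. That takes version control, automated testing, environment isolation, performance tracking, and drift alerting. None of these are add-ons. They are the basic conditions for an AI system to survive.

This article skips the abstract concepts and covers how I actually built an end-to-end MLOps pipeline from scratch with Python + Docker + MLflow + FastAPI on a MacBook Pro with 16 GB of RAM. It is written for three kinds of readers: data scientists who recently switched careers (and want to avoid the traps I fell into), technical leads running a team (who need a minimum viable setup they can actually deploy), and algorithm engineers who are being pushed to ship but do not know where to start. The finished pipeline does the following: a code commit triggers training → the model is registered automatically → the API service is deployed with one command → online predictions are logged automatically → metric anomalies raise alerts in real time. All tools are free and open source, and I include every configuration file so you can copy, paste, and run.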

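2. Overall design and reasoning: why this stack instead of Kubeflow or SageMaker

2.1 The core tension: the data scientist's agility vs the engineer's stability

The most common misunderstanding of MLOps is to treat it as "DevOps bolted onto machine learning". It is not. Traditional DevOps addresses the stability of code changes, while MLOps has to handle three moving parts at once: the code changes, the data changes, and the model changes. After a model goes live, if an upstream field quietly changes type from int to string, or user behavior shifts with the season, model performance can fall from 0.92 to 0.65 overnight without a single alert on your dashboard. The worst case I have seen was an e-commerce company whose recommendation model lost accuracy sharply during the Double 11 sale. It took three days to discover that the logistics status field had gained a new value, "已预约" (reserved), that the model had never seen in training. This is textbook data drift, and it is the first problem MLOps has to solve.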
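The failure in that story, a categorical field gaining a value the model never saw, can be caught with a very small check. A minimal sketch, assuming the set of values seen at training time was saved alongside the model; the function name and the sample status values are hypothetical.

```python
def find_unseen_categories(training_values, live_values):
    """Return live values that never appeared in the training data, sorted."""
    return sorted(set(live_values) - set(training_values))


# Hypothetical values of a logistics-status field
seen_in_training = {"shipped", "delivered", "returned"}
live_batch = ["shipped", "reserved", "delivered", "reserved"]

unseen = find_unseen_categories(seen_in_training, live_batch)
if unseen:
    # In a real pipeline this would raise an alert instead of printing
    print(f"ALERT: unseen category values: {unseen}")
```

Running a check like this on every incoming batch is cheap, and it turns a three-day investigation into a one-line alert.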

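So my design principles are explicit: close the loop first, then extend; make it observable first, then automate; make it reproducible locally first, then move to the cloud. Many teams go straight for Kubeflow, spend two months on the environment, and never get a first model running. My setup covers roughly 80% of the core scenarios with four open-source building blocks: MLflow for experiments and the model registry, Docker for environment consistency, FastAPI for serving, and Prometheus + Grafana for monitoring. They are loosely coupled, so each one can be replaced on its own. If you want to move training to SageMaker tomorrow, the main change is where MLflow's tracking server and backend store point; the rest of the pipeline stays as it is.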

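2.2 The reasoning behind the toolchain choices

Why not TensorFlow Extended (TFX)? It is too heavy. TFX requires you to express the whole pipeline as Component objects, which is very unfriendly to data scientists who are just starting out. I once tried to rebuild a simple sales forecasting project in TFX. Configuring CsvExampleGen and StatisticsGen alone took two days, whereas MLflow plus a Python script did the job in 30 lines.

Why not schedule training with Airflow? The Airflow UI is attractive, but a DAG definition is Python code, so every time the training script needs a new parameter you have to edit the DAG file and wait for the scheduler to pick it up. That works against the goal of letting data scientists iterate on their own. The MLflow CLI command mlflow run . --experiment-name sales-forecast , combined with a Git commit hook, gives you genuinely self-service triggering.

The most important choice is the model serving layer. Many people use MLflow's built-in mlflow models serve directly, but in the MLflow 1.x releases I used it is essentially a Flask wrapper, with little control over concurrency, health checks, or request logging. I insist on FastAPI for three reasons. First, it supports async IO natively, and in my experience a single instance handles 500 QPS comfortably. Second, Pydantic validation rejects illegal input automatically (for example, a string passed to a price field that expects a float), so bad requests never reach the model. Third, the OpenAPI documentation is generated automatically, so product colleagues can open a link and read the full API description, which saves the time of writing docs. This choice later saved me about 300 lines of glue code when integrating with a BI system.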

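2.3 The architecture was not drawn, it was learned the hard way

I drew this architecture 7 times. Version one was pure theory: data source → feature engineering → training → deployment → monitoring. Version two added version control: Git + DVC. In version three I found DVC's Windows compatibility poor and switched to Git LFS. Version four added a model registry, but the MLflow model registry on a local file system cannot be accessed from other machines, so I introduced PostgreSQL as the backend store. In version five I wanted A/B testing and found that FastAPI middleware is lighter than a dedicated traffic-splitting tool, so I implemented routing on a request header. In version six I tried to integrate Evidently for data drift detection, found that its HTML reports do not embed well in Grafana, and ended up using its Python API to extract the key metrics and write them to Prometheus. Version seven is the final one described here, and every component has been validated in at least 3 real projects. Nothing in it is there to show off; each module solves one concrete problem. MLflow answers "who trained which model on which data, and when". Docker answers "why does it run on your machine but fail on mine". FastAPI answers "how do we test and document the API". Prometheus answers "how is the model doing right now".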
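The header-based routing from the fifth revision can be sketched as a pure function. This is an illustration under assumptions, not the project's actual middleware: the header name X-Model-Variant and the version labels are hypothetical.

```python
DEFAULT_VERSION = "latest"
# Hypothetical mapping from header value to model version
VARIANTS = {"a": "v2.0.0", "b": "v2.1.0"}


def pick_model_version(headers):
    """Choose a model version from request headers (case-insensitive lookup).

    Unknown or missing variants fall back to the default version, so a
    malformed header can never route traffic to a non-existent model.
    """
    normalized = {key.lower(): value for key, value in headers.items()}
    variant = normalized.get("x-model-variant", "").strip().lower()
    return VARIANTS.get(variant, DEFAULT_VERSION)
```

Keeping the routing decision in a plain function like this makes it testable without starting the web server; the middleware then only has to call it and pass the result on.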

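3. Key details and practical notes: 12 decisions between a code commit and a working API

3.1 Experiment tracking: why MLflow autolog is not enough and you should call log_metric yourself

mlflow.sklearn.autolog() looks appealing: it records parameters, metrics, and the model automatically. In practice it has problems. It logs a broad set of sklearn evaluation results, including intermediate values you do not need (such as the tuple returned by precision_recall_fscore_support ), which clutters the metric list in the UI. More seriously, autolog cannot record business metrics, such as "GMV lift" in e-commerce or "percentage-point reduction in bad-debt rate" in finance. Those have to be computed with your own business logic, which autolog knows nothing about.

My solution is to drop autolog entirely and log by hand. In the training script train.py I write: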

import mlflow
from sklearn.metrics import roc_auc_score, f1_score

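# After the model is trained (model, X_test, y_test, order_data and
# calculate_gmv_lift are defined elsewhere in train.py)
y_pred = model.predict(X_test)
y_pred_proba = model.predict_proba(X_test)[:, 1]

# Core metrics are named explicitly; business metrics carry a business_ prefix
mlflow.log_metric("auc", roc_auc_score(y_test, y_pred_proba))
mlflow.log_metric("f1", f1_score(y_test, y_pred))
mlflow.log_metric("business_gmv_lift_pct", calculate_gmv_lift(y_test, y_pred_proba, order_data))

# Log key parameters by hand too, so custom parameters are never missed
mlflow.log_param("feature_version", "v2.1")
mlflow.log_param("data_window_days", 90)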

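A hard-won lesson here: the calculate_gmv_lift function has to live in the same directory as train.py , not in a separate utils package. The mlflow run command packages the current directory, so if the function lives in an external package, remote execution fails with ModuleNotFoundError . I restructured the project three times because of this and settled on a flat layout: all training code sits under src/ , with train.py , preprocess.py , and evaluate.py side by side and no cross-package imports.

3.2 Model registration: using the MLflow Model Registry for real model governance

The MLflow Model Registry is not just model storage. It is the traffic police of the model lifecycle. Many teams treat registration as archiving, end up with 20 registered models, and cannot tell which one belongs in production. I enforce a three-stage policy: Staging , Production , Archived . After a model is registered, the data scientist must request the transition, and it only enters Production after an MLOps engineer approves. The process does not run on email; it is triggered through the MLflow API: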

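# In the CI/CD script
import mlflow

# get_latest_staging_version, check_staging_metrics and check_drift_alerts
# are project helpers defined elsewhere
client = mlflow.tracking.MlflowClient()
model_name = "sales-forecast-xgboost"
version = get_latest_staging_version(model_name)

# Auto-approval rule: promote if Staging AUC stayed above 0.85 for 3 consecutive days with no drift alerts
if check_staging_metrics(model_name, version) and not check_drift_alerts():
    # Note: newer MLflow releases deprecate stages in favor of aliases
    client.transition_model_version_stage(
        name=model_name,
        version=version,
        stage="Production",
        archive_existing_versions=True
    )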

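Note the archive_existing_versions=True parameter. It makes sure that every time a new model is promoted, the previous Production version is archived automatically, so two versions are never live at once. This detail is buried deep in the official documentation, but it is the key switch for preventing production incidents.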
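The auto-approval rule itself (AUC above 0.85 on each of the last 3 days, and no drift alert) is easy to express as a pure function that can be unit tested without an MLflow server. A minimal sketch; the function name, its arguments, and the way daily AUC values are collected are assumptions, not part of MLflow.

```python
def should_promote(daily_auc, drift_alert_active, min_auc=0.85, required_days=3):
    """Decide whether a Staging model may move to Production.

    daily_auc is ordered oldest to newest. The model qualifies only if there
    is no active drift alert and each of the last `required_days` values is
    strictly above `min_auc`.
    """
    if drift_alert_active or len(daily_auc) < required_days:
        return False
    return all(auc > min_auc for auc in daily_auc[-required_days:])
```

Separating the decision from the MLflow call keeps the CI script thin: it gathers the inputs, asks this function, and only then performs the stage transition.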

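3.3 Environment consistency: 5 devils in the details of a Dockerfile

Docker is the foundation of MLOps, but a badly written Dockerfile can take the whole pipeline down. These are the 5 details I always hard-code:

Pin the base image by sha256 digest: FROM python:3.9-slim@sha256:abc123 . Without the digest, the tag can move to a new build one day, say Python 3.9.1 becoming 3.9.2, and that can trigger a numpy build failure.
Run pip with --no-cache-dir and upgrade pip first: --no-cache-dir keeps pip's download cache out of the image layer, which keeps the image small, and an outdated pip may fail to parse pyproject.toml. Reproducible dependency versions come from pinning them in requirements.txt, not from this flag.
WORKDIR must be an absolute path: WORKDIR /app , not WORKDIR app . A relative path is resolved against the previous WORKDIR, which makes the result depend on what came before.
COPY in layers: COPY requirements.txt first, run pip install, then COPY the source. As long as requirements.txt is unchanged, docker build reuses the cached layer; in my builds this made rebuilds more than 5x faster.
ENTRYPOINT must use the exec form: ENTRYPOINT ["uvicorn", "api:app", "--host", "0.0.0.0", "--port", "8000"] , not ENTRYPOINT uvicorn api:app --host 0.0.0.0 --port 8000 . The shell form starts a shell process, signals are not passed through correctly, and the container cannot shut down gracefully. Also note that --host takes only the address; the port goes in --port .

My standard Dockerfile looks like this (comments stripped; the sha256 digest shown is a placeholder, so substitute the real digest of the image you pull):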

FROM python:3.9-slim@sha256:4a5e5b5c5d5e5f5e5a5b5c5d5e5f5a5b5c5d5e5f5a5b5c5d5e5f5a5b5c5d5e5f

WORKDIR /app

COPY requirements.txt .
RUN pip install --no-cache-dir --upgrade pip && \
    pip install --no-cache-dir -r requirements.txt

COPY src/ .

EXPOSE 8000
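# python:3.9-slim does not ship curl, so probe with the standard library instead
HEALTHCHECK --interval=30s --timeout=3s --start-period=5s --retries=3 \
  CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:8000/health', timeout=2)" || exit 1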

ENTRYPOINT ["uvicorn", "api:app", "--host", "0.0.0.0", "--port", "8000", "--workers", "4"]

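Pay special attention to the HEALTHCHECK instruction. It lets Docker and Docker Swarm detect whether the service is actually ready, rather than treating an open port as healthy. Kubernetes ignores the Dockerfile HEALTHCHECK, so there you need liveness and readiness probes pointing at the same /health endpoint. I have seen too many teams skip the health check: the container starts, the API returns 500, the orchestrator still believes the service is fine, and traffic keeps arriving.

3.4 API design: 3 counterintuitive but essential decisions in FastAPI

@app.post("/predict") looks simple, but in production you have to deal with three counterintuitive issues:

First, input validation is not a nice-to-have; it is a line of defense. Never assume the data coming from the frontend is valid. When I define the Pydantic model I enforce field constraints: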

from fastapi import FastAPI
from pydantic import BaseModel, Field
from typing import List

app = FastAPI()

# Written for Pydantic v1; in Pydantic v2 use min_length/max_length/pattern
# instead of min_items/max_items/regex
class PredictionRequest(BaseModel):
    features: List[List[float]] = Field(
        ...,
        min_items=1,
        max_items=1000,  # at most 1000 rows per request, to cap payload size and abuse
        description="2D array of features, shape (n_samples, n_features)"
    )
    model_version: str = Field(
        default="latest",
        regex=r"^v\d+\.\d+\.\d+$|^latest$",  # only semantic versions or "latest"
        description="Model version to use, e.g., v1.2.0 or latest"
    )

@app.post("/predict")
def predict(request: PredictionRequest):
    # At this point request.features is guaranteed to be a valid List[List[float]]
    pass

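Second, the response body must carry metadata, not just predictions. The business side needs to know who produced the prediction, with which model, and how it performed. My response structure is: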

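from datetime import datetime
from typing import List

from pydantic import BaseModel, Field

class PredictionResponse(BaseModel):
    predictions: List[float]
    # Default metadata; in the real service these values come from the loaded model
    model_info: dict = Field(
        default_factory=lambda: {
            "name": "sales-forecast-xgboost",
            "version": "v2.1.0",
            "training_date": "2023-07-20"
        }
    )
    latency_ms: float
    timestamp: str = Field(default_factory=lambda: datetime.now().isoformat())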

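Third, error handling must be tiered. Not every error should return 500. I define three tiers:

HTTPException(status_code=400) : client error, such as semantically invalid input (for example a wrong feature count). Note that FastAPI answers Pydantic validation failures on its own with 422, not 400.
HTTPException(status_code=404) : the model does not exist, for example a request for v999.0.0
HTTPException(status_code=503) : service unavailable, for example the model failed to load or the GPU ran out of memory

This lets the frontend react specifically: on 400 or 422 ask the user to fix the input, on 404 suggest switching model version, on 503 show "Service temporarily unavailable, please try again later".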
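The three tiers are easier to keep consistent when the mapping from error type to status code lives in one place. A minimal sketch using only the standard library; the exception class names are hypothetical, and in a FastAPI app the resulting code would be passed to HTTPException.

```python
class InvalidInputError(Exception):
    """The client sent data the model cannot use."""


class ModelNotFoundError(Exception):
    """The requested model version does not exist."""


class ModelUnavailableError(Exception):
    """The model failed to load or ran out of resources."""


STATUS_BY_ERROR = {
    InvalidInputError: 400,
    ModelNotFoundError: 404,
    ModelUnavailableError: 503,
}


def status_code_for(exc):
    """Map a known error to its HTTP status; anything unexpected is a 500."""
    for error_type, status in STATUS_BY_ERROR.items():
        if isinstance(exc, error_type):
            return status
    return 500
```

With this in place, the endpoint's except block can stay generic, and adding a new tier means adding one class and one dictionary entry.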

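4. Hands-on implementation: building the end-to-end pipeline from scratch

4.1 Environment setup: initialize your MLOps workbench in 3 minutes

Do not rush into code; lay the foundation first. I use one shell script, setup.sh , for all initialization, so that a new team member has an identical environment within 3 minutes: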

#!/bin/bash
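# setup.sh - run once when setting up a new machine

# 1. Create the project directory structure
mkdir -p mlops-demo/{src/{data,models,api},notebooks,tests,deploy,docs}

# 2. Initialize the Git repository and ignore build and editor files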
cd mlops-demo
git init
cat > .gitignore << 'EOF'
__pycache__/
*.pyc
*.pyo
*.pyd
.Python
env/
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
*.log
.DS_Store
*.swp
*.swo
EOF

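# 3. Install core tools (macOS example; on Linux use apt instead of brew)
# The docker formula installs the CLI only; a running Docker daemon
# (e.g. Docker Desktop) is still required. mlflow is installed via pip below.
brew install docker docker-compose python@3.9
pip3 install --upgrade pip
# psycopg2-binary is needed for the PostgreSQL backend, pyarrow for read_parquet
pip3 install mlflow fastapi uvicorn pandas scikit-learn numpy psycopg2-binary pyarrow

# 4. Start a local MLflow server (PostgreSQL backend)
# POSTGRES_USER and POSTGRES_DB must match the user and database in the MLflow URI
docker run -d --name mlflow-postgres -e POSTGRES_USER=mlflow -e POSTGRES_PASSWORD=mlflow -e POSTGRES_DB=mlflow -p 5432:5432 -v $(pwd)/mlflow-data:/var/lib/postgresql/data postgres:13
sleep 5
# Run in the background so the script can finish
mlflow server --backend-store-uri postgresql://mlflow:mlflow@localhost:5432/mlflow --default-artifact-root ./mlruns --host 0.0.0.0 --port 5000 &

echo "✅ Environment ready. Open http://localhost:5000 for the MLflow UI"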

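After running the script you have:

A clear directory structure: src/data for raw data, src/models for training scripts, src/api for the FastAPI code
An initialized Git repository with a .gitignore preloaded with the usual entries
PostgreSQL running locally on port 5432 and the MLflow server on port 5000
One caveat: the script itself does not pin tool versions. To avoid "works on my machine" problems, pin them in requirements.txt (see section 6.1)

Tip: the .gitignore deliberately includes *.swp and *.swo , which are Vim swap files. I once forgot to ignore them and committed a 10 MB .swp file, which slowed cloning of the repository by a factor of 10. Details like this separate veterans from beginners.

4.2 Data versioning: why Git LFS suits small and mid-sized teams better than DVC

DVC is powerful, but for small and mid-sized teams the learning curve is too steep. Git LFS (Large File Storage) feels just like Git and still solves about 90% of data versioning needs. My practice is: raw data in Git LFS, processed data in MLflow artifacts.

Step one, install and initialize LFS: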

git lfs install
git lfs track "*.csv"
git lfs track "*.parquet"
git add .gitattributes

Step two, add the data file to LFS:

# Assuming you have sales_data_202307.csv
git add sales_data_202307.csv
git commit -m "add raw data for July 2023"
git push origin main

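Git LFS stores only a small text pointer in the Git repository; the real file lives on the LFS server. By default, a clone with LFS installed still downloads the files needed for the checked-out commit. If you set GIT_LFS_SKIP_SMUDGE=1 the clone skips the large files, and git lfs pull fetches them on demand. The key advantage is that data versions and code versions stay in lockstep. When you check out a given commit, git lfs pull gives you the raw data that belonged to it, with no separate mapping table to maintain.

Note: do not put the processed/ directory in LFS. Processed data should be generated by the training script and stored in the MLflow artifact store via mlflow.log_artifact("processed/train.parquet") . That preserves data lineage (which model used which processed data) and keeps LFS storage from exploding.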
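One practical consequence: if the LFS content has not been fetched, the file on disk is a small text pointer, and a training script that reads it as data fails with a confusing parse error. A minimal sketch of a guard, assuming the standard Git LFS pointer format (a text file whose first line is version https://git-lfs.github.com/spec/v1); the function name is hypothetical.

```python
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def is_lfs_pointer(path):
    """Return True if the file at `path` is a Git LFS pointer, not real data.

    Only the first few bytes are read, so this is cheap even for large files.
    """
    with open(path, "rb") as f:
        head = f.read(len(LFS_POINTER_PREFIX))
    return head == LFS_POINTER_PREFIX
```

A training script can call this before loading and stop with a clear message such as "run git lfs pull first" instead of a parser stack trace.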

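4.3 Training pipeline: a reproducible end-to-end training script

src/models/train.py is the heart of the pipeline. It has to guarantee that a single run is traceable end to end. My standard template: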

import subprocess
import sys

import mlflow
import mlflow.sklearn
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Set the MLflow tracking URI (a file: URI for local development, http for production)
mlflow.set_tracking_uri("http://localhost:5000")

def load_data(data_path: str) -> pd.DataFrame:
    """Load raw data from a Git LFS-tracked path"""
    return pd.read_parquet(data_path)

def preprocess(df: pd.DataFrame) -> tuple:
    """Feature engineering; must be reproducible"""
    # Every operation here must be deterministic; any randomness must use a fixed seed such as np.random.seed(42)
    df = df.dropna()
    X = df.drop("target", axis=1)
    y = df["target"]
    return X, y

def train_model(X_train, y_train, params: dict):
    """Train the model; parameters must be passed in explicitly"""
    model = RandomForestClassifier(**params)
    model.fit(X_train, y_train)
    return model

def evaluate_model(model, X_test, y_test):
    """Evaluate the model and return the metrics as a dict"""
    y_pred = model.predict(X_test)
    report = classification_report(y_test, y_pred, output_dict=True)
    # classification_report keys classes by str(label); "1" assumes the positive class label is 1
    return {
        "accuracy": report["accuracy"],
        "precision": report["1"]["precision"],
        "recall": report["1"]["recall"],
        "f1": report["1"]["f1-score"]
    }

def main():
    # 1. Parse command-line arguments (so CI/CD can pass parameters):
    #    train.py <data_path> <n_estimators> <max_depth>
    if len(sys.argv) > 1:
        data_path = sys.argv[1]
        model_params = {"n_estimators": int(sys.argv[2]), "max_depth": int(sys.argv[3])}
    else:
        data_path = "../data/sales_data_202307.parquet"
        model_params = {"n_estimators": 100, "max_depth": 5}

    # 2. Start the MLflow run
    with mlflow.start_run(run_name="sales-forecast-train"):
        # 3. Log all input parameters
        mlflow.log_param("data_path", data_path)
        mlflow.log_params(model_params)

        # 4. Load and preprocess the data
        df = load_data(data_path)
        X, y = preprocess(df)
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=0.2, random_state=42, stratify=y
        )

        # 5. Train the model
        model = train_model(X_train, y_train, model_params)

        # 6. Evaluate and log metrics
        metrics = evaluate_model(model, X_test, y_test)
        for k, v in metrics.items():
            mlflow.log_metric(k, v)

        # 7. Save the model (and the preprocessor, if there is one)
        mlflow.sklearn.log_model(model, "model")

        # 8. Save a snapshot of the raw data (important!)
        mlflow.log_artifact(data_path, "data_raw")

        # 9. Record the code version: tag the run with the current commit hash
        commit = subprocess.check_output(["git", "rev-parse", "HEAD"], text=True).strip()
        mlflow.set_tag("git_commit", commit)

if __name__ == "__main__":
    main()

The clever part of this script is steps 8 and 9: mlflow.log_artifact(data_path, "data_raw") stores the raw data in MLflow, and the git_commit tag records the current commit hash. The hash is read with git rev-parse HEAD because the .git/HEAD file itself usually holds only a ref such as ref: refs/heads/main . Open any run in the MLflow UI and you can see:

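Which data was used (click the artifact to download it)
Which code version was used (a commit hash you can look up on GitHub)
Which hyperparameters were used
Which metrics came out

That is real reproducibility. When a customer complained that "the model is inaccurate", this let me establish within 30 minutes that an upstream ETL script had renamed a field and that the model was not at fault.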
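A lighter-weight complement to storing the full data snapshot is to record a content hash of the input file, so two runs can be compared without downloading either artifact. A minimal sketch using only the standard library; the function name is hypothetical, and the digest would be passed to mlflow.log_param by the training script.

```python
import hashlib


def file_sha256(path, chunk_size=1 << 20):
    """Hash a file in 1 MiB chunks so large datasets need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # iter() with a sentinel keeps reading until read() returns b""
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

If two runs report the same digest, they trained on byte-identical input, which rules out the data as the source of a metric difference before anyone opens the files.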

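4.4 Model serving: production-grade deployment of the FastAPI service

src/api/api.py is the entry point of the model service. It has to start fast, respond reliably, and log everything.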

from fastapi import FastAPI, HTTPException, BackgroundTasks
from pydantic import BaseModel
from typing import List, Dict, Any
import joblib
import mlflow
import pandas as pd
import time
import logging
from datetime import datetime

# Configure logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

app = FastAPI(title="Sales Forecast API", version="1.0")

# Global model cache (avoids reloading the model on every request)
_model_cache = {}

class PredictionRequest(BaseModel):
    features: List[List[float]]
    model_version: str = "latest"

class PredictionResponse(BaseModel):
    predictions: List[float]
    model_info: Dict[str, Any]
    latency_ms: float
    timestamp: str

def load_model(version: str):
    """Load a model from the MLflow registry, with caching"""
    if version in _model_cache:
        return _model_cache[version]

    try:
        # "latest" maps to the Production stage; anything else is treated as a
        # registry version of the same model (MLflow versions are integers such as 3)
        model_name = "sales-forecast-xgboost"
        model_uri = f"models:/{model_name}/Production" if version == "latest" else f"models:/{model_name}/{version}"
        model = mlflow.sklearn.load_model(model_uri)
        _model_cache[version] = model
        logger.info(f"Loaded model {version} from MLflow")
        return model
    except Exception as e:
        logger.error(f"Failed to load model {version}: {e}")
        raise HTTPException(status_code=404, detail=f"Model {version} not found")

@app.get("/health")
def health_check():
    return {"status": "ok", "timestamp": datetime.now().isoformat()}

@app.post("/predict", response_model=PredictionResponse)
def predict(request: PredictionRequest, background_tasks: BackgroundTasks):
    start_time = time.time()
    
    try:
        # 1. Load the model
        model = load_model(request.model_version)

        # 2. Convert the input to a DataFrame (what sklearn expects)
        X = pd.DataFrame(request.features)

        # 3. Predict
        predictions = model.predict(X).tolist()

        # 4. Compute latency
        latency_ms = (time.time() - start_time) * 1000

        # 5. Build the response
        response = PredictionResponse(
            predictions=predictions,
            model_info={
                "name": "sales-forecast-xgboost",
                "version": request.model_version,
                "loaded_at": datetime.now().isoformat()
            },
            latency_ms=round(latency_ms, 2),
            timestamp=datetime.now().isoformat()
        )

        # 6. Record the prediction asynchronously for Prometheus (see the monitoring section)
        background_tasks.add_task(log_prediction, request, response)

        return response

    except HTTPException:
        # Keep deliberate status codes (e.g. the 404 from load_model) instead of turning them into 500
        raise
    except Exception as e:
        logger.error(f"Prediction error: {e}")
        raise HTTPException(status_code=500, detail=str(e))

def log_prediction(request: PredictionRequest, response: PredictionResponse):
    """Record the prediction in the background without blocking the request"""
    # Writes to Prometheus; code omitted here
    pass

For deployment I manage everything through one docker-compose.yml :

version: '3.8'
services:
  api:
    build: .
    ports:
      - "8000:8000"
    environment:
      - MLFLOW_TRACKING_URI=http://mlflow:5000
    depends_on:
      - mlflow
    healthcheck:
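      # python:3.9-slim has no curl, so probe /health with the standard library
      test: ["CMD", "python", "-c", "import urllib.request; urllib.request.urlopen('http://localhost:8000/health', timeout=5)"]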
      interval: 30s
      timeout: 10s
      retries: 3
  mlflow:
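    # Assumes an image tagged mlflow:1.30.0 whose entrypoint is the mlflow CLI
    # exists locally; adjust to the image you build or pull
    image: "mlflow:1.30.0"
    ports:
      - "5000:5000"
    # Database and artifacts both live under /mlruns so they persist on the mounted volume
    command: "server --backend-store-uri sqlite:////mlruns/mlflow.db --default-artifact-root /mlruns --host 0.0.0.0"
    volumes:
      - ./mlruns:/mlruns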

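Run docker-compose up -d , wait about 5 seconds, and open http://localhost:8000/docs . The Swagger UI is generated automatically, test buttons included.

5. Monitoring and alerting: let the model tell you when it is sick

5.1 Model performance monitoring: why AUC alone is not enough and business metrics are required

If AUC drops from 0.92 to 0.85, the business side may not notice. If the absolute error between predicted and actual sales widens from ±5% to ±20%, the purchasing department will be on the phone immediately. So my monitoring has three layers:

Layer 1: technical metrics (logged to MLflow)

auc , f1 , precision , recall : recorded after every training run

Layer 2: business metrics (injected by hand)

mape (mean absolute percentage error): abs((pred - actual) / actual).mean()
stockout_rate : share of orders where predicted sales < actual sales
overstock_cost : warehousing cost caused by predicted sales > actual sales

Layer 3: data quality metrics (computed by Evidently)

dataset_drift (dataset drift score): the share of drifted columns, from 0 to 1; above 0.5 triggers an alert
feature_drift (per-feature drift): the KS test p-value for each numeric feature
target_drift (label drift): change in the positive-class ratio for classification tasks

In src/monitoring/monitor.py I push all of these to Prometheus: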
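The two simplest business metrics in layer 2 follow directly from their definitions. A minimal sketch in plain Python, assuming predicted and actual sales arrive as equal-length sequences; skipping rows with zero actual sales in MAPE is a choice made here, not something the original specifies.

```python
def mape(predicted, actual):
    """Mean absolute percentage error; rows with actual == 0 are skipped."""
    pairs = [(p, a) for p, a in zip(predicted, actual) if a != 0]
    if not pairs:
        raise ValueError("MAPE is undefined when every actual value is 0")
    return sum(abs((p - a) / a) for p, a in pairs) / len(pairs)


def stockout_rate(predicted, actual):
    """Share of orders where predicted sales fell short of actual sales."""
    if not actual:
        raise ValueError("need at least one order")
    shortfalls = sum(1 for p, a in zip(predicted, actual) if p < a)
    return shortfalls / len(actual)
```

Both return a fraction between 0 and 1 (MAPE can exceed 1 for very poor forecasts), so they can be logged with mlflow.log_metric next to AUC and pushed to Prometheus unchanged.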

import json

import mlflow
from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

registry = CollectorRegistry()

# Define the metrics
auc_gauge = Gauge('model_auc', 'AUC score', ['model_name', 'version'], registry=registry)
mape_gauge = Gauge('model_mape', 'MAPE score', ['model_name', 'version'], registry=registry)
drift_gauge = Gauge('data_drift_score', 'Dataset drift score', ['model_name'], registry=registry)

def push_metrics_to_prometheus():
    # Fetch the metrics of the most recent training run from MLflow
    client = mlflow.tracking.MlflowClient()
    runs = client.search_runs(
        experiment_ids=["1"],  # experiment ID
        filter_string="tags.mlflow.runName = 'sales-forecast-train'",
        order_by=["attributes.start_time DESC"],
        max_results=1
    )

    if not runs:
        return

    run = runs[0]
    auc = run.data.metrics.get("auc", 0)
    mape = run.data.metrics.get("mape", 0)

    # Read the drift report from the run's artifacts
    # (assumes drift_report.json was logged as an artifact of this run)
    local_path = client.download_artifacts(run.info.run_id, "drift_report.json")
    with open(local_path) as f:
        drift_report = json.load(f)
    drift_score = drift_report["dataset_drift"]["drift_score"]

    # Set the gauge values
    version = run.data.params.get("model_version", "unknown")
    auc_gauge.labels(model_name="sales-forecast", version=version).set(auc)
    mape_gauge.labels(model_name="sales-forecast", version=version).set(mape)
    drift_gauge.labels(model_name="sales-forecast").set(drift_score)

    # Push to the Pushgateway
    push_to_gateway('localhost:9091', job='mlops-monitor', registry=registry)

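5.2 Alerting strategy: dynamic thresholds over a time window, not fixed numbers

Fixed-threshold alerts are an anti-pattern. An AUC of 0.85 might be excellent on the training set and a disaster online. My alert rules compare against a rolling window:

If the 7-day mean AUC is 0.90 with a standard deviation of 0.01, then an AUC below 0.87 today (mean - 3σ) triggers a P1 alert
If the 30-day mean stockout_rate is 5% and it suddenly jumps to 12% today, trigger a P2 alert
If data_drift_score exceeds 0.5 three times in a row, trigger a P1 alert

I implement this in Grafana with PromQL: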
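The first rule is a plain mean-minus-3σ threshold over a rolling window, and it helps to see the arithmetic once outside the monitoring stack. A minimal sketch with the standard library; the function names are hypothetical, and the population standard deviation is used so the result lines up with what PromQL's stddev_over_time computes.

```python
from statistics import mean, pstdev


def lower_alert_threshold(history, n_sigma=3.0):
    """Mean minus n_sigma population standard deviations of the window."""
    if not history:
        raise ValueError("need at least one observation")
    return mean(history) - n_sigma * pstdev(history)


def is_anomalous(today, history, n_sigma=3.0):
    """True if today's value falls below the rolling lower threshold."""
    return today < lower_alert_threshold(history, n_sigma)


# Seven daily AUC values averaging 0.90, as in the example above
last_week = [0.89, 0.90, 0.91, 0.90, 0.89, 0.91, 0.90]
print(round(lower_alert_threshold(last_week), 4))
```

Because the threshold moves with the window, a model whose AUC is naturally noisy gets a wider band, while a very stable model is flagged on a much smaller drop.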

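# AUC anomaly detection: fires when the current value drops below mean - 3σ of the last 7 days
model_auc{model_name="sales-forecast"}
<
(avg_over_time(model_auc{model_name="sales-forecast"}[7d]) - 3 * stddev_over_time(model_auc{model_name="sales-forecast"}[7d]))

# Stockout-rate spike detection. delta() is the gauge counterpart of increase(),
# which is meant for counters. Assumes a gauge named model_stockout_rate is
# exported the same way as model_auc.
delta(model_stockout_rate{model_name="sales-forecast"}[1h]) > 0.05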

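Practical note: every alert must come with root-cause suggestions. When a Grafana alert fires, my notification says: "AUC has dropped. Suggested checks: 1. Call the /drift-report endpoint to confirm data drift; 2. Check the upstream ETL logs to see whether the order_status field has new values; 3. Roll back to v2.0.0 as a temporary fix." The operations team does not have to guess; they just follow the steps.

5.3 Drift detection in practice: locating a data problem with Evidently in 5 minutes

Evidently is not a black box, and the reports it produces have to lead to action. I wrote a detect_drift.py script that runs automatically every night: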

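# Uses Evidently's legacy Report API; newer Evidently releases reorganized these imports
from evidently.report import Report
from evidently.metrics import DatasetDriftMetric, ColumnDriftMetric
from pandas import read_parquet
import json

# Load the current production data and the reference data
current_data = read_parquet("data/current.parquet")
reference_data = read_parquet("data/reference.parquet")  # usually the training data

# Build the drift report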
report = Report(metrics=[
    DatasetDriftMetric(),
    ColumnDriftMetric(column_name="user_age"),
    ColumnDriftMetric(column_name="order_amount"),
    ColumnDriftMetric(column_name="product_category")
])

report.run(reference_data=reference_data, current_data=current_data)

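# Extract the key information
report_dict = report.as_dict()
# "dataset_drift" in the result is a boolean flag; the 0-1 score is the share of drifted columns
drift_score = report_dict["metrics"][0]["result"]["share_of_drifted_columns"]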
drifted_columns = [
    m["result"]["column_name"] 
    for m in report_dict["metrics"][1:] 
    if m["result"]["drift_detected"]
]

# Write JSON for the monitoring script to read and push to Prometheus
with open("drift_report.json", "w") as f:
    json.dump({
        "dataset_drift": {"drift_score": drift_score},
        "drifted_columns": drifted_columns
    }, f)

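The key output of this script is the drifted_columns list. When it returns ["user_age", "product_category"] , I know right away that the user age distribution changed (perhaps a channel for senior users just launched) and that the product category distribution changed (perhaps a sale added home appliances). That is 100 times more useful than staring at a drift score of 0.62.

6. Common problems and troubleshooting: the hard lessons the docs leave out

6.1 "The model runs locally but Docker throws ImportError": the definitive fix for environment hell

This is the most frequent problem for MLOps beginners. The root cause is usually not a mistake in the Dockerfile but transitive dependencies that were never pinned. For example, you pin xgboost but not numpy , which xgboost pulls in. pip then resolves whichever numpy version is current at build time, so the image can end up with a different numpy than your laptop, or with one that is incompatible with another package.

My fix: always generate a lock file with pip-tools.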
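Per-column drift for numeric features is often decided by a two-sample test such as Kolmogorov-Smirnov. To make the drifted_columns output less of a black box, here is a sketch of the KS statistic itself in plain Python. It illustrates the idea only: it is not Evidently's implementation, and it omits the p-value.

```python
from bisect import bisect_right


def ks_statistic(reference, current):
    """Two-sample Kolmogorov-Smirnov statistic.

    Returns the largest gap between the empirical CDFs of the two samples:
    0.0 means the samples are distributed identically, 1.0 means they do
    not overlap at all.
    """
    if not reference or not current:
        raise ValueError("both samples must be non-empty")
    ref_sorted = sorted(reference)
    cur_sorted = sorted(current)
    max_gap = 0.0
    # The maximum gap is always reached at one of the observed values
    for x in ref_sorted + cur_sorted:
        ref_cdf = bisect_right(ref_sorted, x) / len(ref_sorted)
        cur_cdf = bisect_right(cur_sorted, x) / len(cur_sorted)
        max_gap = max(max_gap, abs(ref_cdf - cur_cdf))
    return max_gap
```

A statistic close to 0 for user_age means the live distribution still matches training; a value drifting toward 1 is the numeric counterpart of the column showing up in drifted_columns.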

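# 1. Write requirements.in with top-level dependencies only
echo "xgboost==1.7.5" > requirements.in
echo "pandas==1.5.3" >> requirements.in

# 2. Generate the fully pinned requirements.txt (with sha256 hashes)
pip-compile --generate-hashes requirements.in --output-file requirements.txt

# 3. Use this lock file in the Dockerfile (the two lines below are Dockerfile instructions, not shell)
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt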

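pip-compile resolves all dependencies recursively and, when run with --generate-hashes, writes a requirements.txt with sha256 checksums. That way, no matter which machine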