# Backup / Restore Runbook (Pre-Prod & Prod)

## 1. Scope

This runbook covers the following stateful data of `quyun_v2`:

- PostgreSQL (primary business data)
- Object storage (local storage directory or S3-compatible objects)
- Snapshots of critical runtime configuration (excluding plaintext secrets)

Goals of this runbook:

1. Backups can be run reliably
2. A restore can be completed in the pre-prod environment
3. RTO / RPO have explicit verification steps

---

## 2. Preconditions

- Database backup permissions (`pg_dump` / `psql`)
- Read/write access to object storage (local directory or S3 API)
- The pre-prod environment is available and compatible with the production version
- The following variables have been confirmed (example):

```bash
export QY_DB_HOST=127.0.0.1
export QY_DB_PORT=5432
export QY_DB_NAME=quyun_v2
export QY_DB_USER=postgres
export QY_DB_PASSWORD='***'
```

---

## 3. PostgreSQL Backup

### 3.1 Create the backup directory

```bash
mkdir -p /tmp/quyun-backup
```

### 3.2 Dump the database (custom format)

```bash
# -F c selects pg_dump's custom archive format, which pg_restore can read.
PGPASSWORD="$QY_DB_PASSWORD" \
pg_dump -h "$QY_DB_HOST" -p "$QY_DB_PORT" -U "$QY_DB_USER" \
  -F c -d "$QY_DB_NAME" \
  -f "/tmp/quyun-backup/${QY_DB_NAME}_$(date +%Y%m%d_%H%M%S).dump"
```

### 3.3 Verify backup integrity

```bash
# Listing the archive's table of contents fails if the dump is unreadable.
pg_restore -l /tmp/quyun-backup/.dump >/tmp/quyun-backup/restore.list
```

Acceptance criteria: the command exits with code 0 and `restore.list` is non-empty.

---

## 4. Object Storage Backup

### 4.1 Local storage (`Storage.Type=local`)

```bash
tar -czf "/tmp/quyun-backup/storage_$(date +%Y%m%d_%H%M%S).tar.gz" ./backend/storage
```

### 4.2 S3/MinIO (`Storage.Type=s3`)

Example using `mc` (MinIO Client):

```bash
mc alias set quyun-s3 http://127.0.0.1:9000 "$STORAGE_ACCESS_KEY" "$STORAGE_SECRET_KEY"
mc mirror quyun-s3/quyun-01 "/tmp/quyun-backup/s3_quyun-01_$(date +%Y%m%d_%H%M%S)"
```

Acceptance criteria: the target directory contains at least one file, and sampled objects are readable.

---

## 5. Restore Procedure (Pre-Prod Drill)

### 5.1 Prepare the pre-prod database

```bash
# Drop any leftover restore database from a previous drill, then recreate it.
PGPASSWORD="$QY_DB_PASSWORD" \
psql -h "$QY_DB_HOST" -p "$QY_DB_PORT" -U "$QY_DB_USER" -d postgres \
  -c "DROP DATABASE IF EXISTS ${QY_DB_NAME}_restore;"

PGPASSWORD="$QY_DB_PASSWORD" \
psql -h "$QY_DB_HOST" -p "$QY_DB_PORT" -U "$QY_DB_USER" -d postgres \
  -c "CREATE DATABASE ${QY_DB_NAME}_restore;"
```

### 5.2 Restore the database

```bash
PGPASSWORD="$QY_DB_PASSWORD" \
pg_restore -h "$QY_DB_HOST" -p "$QY_DB_PORT" -U "$QY_DB_USER" \
  -d "${QY_DB_NAME}_restore" --clean --if-exists \
  "/tmp/quyun-backup/.dump"
```

### 5.3 Post-restore validation

```bash
PGPASSWORD="$QY_DB_PASSWORD" \
psql -h "$QY_DB_HOST" -p "$QY_DB_PORT" -U "$QY_DB_USER" -d "${QY_DB_NAME}_restore" \
  -c "SELECT COUNT(*) FROM users;"

PGPASSWORD="$QY_DB_PASSWORD" \
psql -h "$QY_DB_HOST" -p "$QY_DB_PORT" -U "$QY_DB_USER" -d "${QY_DB_NAME}_restore" \
  -c "SELECT COUNT(*) FROM audit_logs;"
```

Acceptance criteria:

- Core tables (`users`, `orders`, `audit_logs`, `contents`) hold a plausible volume of data
- Sampled business queries return no syntax or permission errors

---

## 6. Service Verification After Restore

After starting the service, run:

```bash
curl -f -sS http://127.0.0.1:18080/healthz
curl -f -sS http://127.0.0.1:18080/readyz
```

Acceptance criteria: both endpoints return 2xx.

---

## 7. RTO / RPO Recording

Record for every drill:

- Backup start/end time
- Restore start/end time
- Data validation result
- Incident / blockers

Suggested targets:

- RTO <= 30 minutes
- RPO <= 24 hours (daily-backup baseline)

---

## 8. Failure Handling

- `pg_dump` fails: check network / permissions / disk space, then retry once
- `pg_restore` fails: keep the logs, roll back to the original pre-prod database, and do not proceed with an overwriting release
- Object restore fails: continue the drill only if the failure is on a non-blocking business path; otherwise abort

---

## 9. Evidence Requirement

Archive every drill under:

- `docs/release-evidence/.md`

At minimum, include:

1. Operator and time window
2. Commands and exit codes
3. Output of the core validation SQL
4. healthz/readyz results
5. Conclusion (PASS/FAIL)
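
The "commands and exit codes" and start/end time items above could be captured with a small wrapper around each drill step. The following is a minimal sketch, not part of `quyun_v2`: the function name `run_step` and the variable `EVIDENCE_FILE` are hypothetical, and the evidence file defaults to a temporary file because the real archive path is chosen per drill.

```bash
#!/usr/bin/env bash
# Sketch only: record each drill step's command, exit code, and duration.
# EVIDENCE_FILE and run_step are hypothetical names used for illustration.

# Default to a temp file; point this at the drill's evidence file when archiving.
EVIDENCE_FILE="${EVIDENCE_FILE:-$(mktemp)}"

# run_step LABEL CMD [ARGS...]
# Runs CMD, appends one Markdown list item to EVIDENCE_FILE,
# and returns CMD's own exit code so callers can still react to failures.
run_step() {
  local label="$1"; shift
  local start end rc
  start=$(date +%s)
  "$@"
  rc=$?
  end=$(date +%s)
  printf -- '- %s: `%s` exit=%d duration=%ds\n' \
    "$label" "$*" "$rc" "$((end - start))" >>"$EVIDENCE_FILE"
  return "$rc"
}
```

A step would then be invoked as, for example, `run_step "healthz" curl -f -sS http://127.0.0.1:18080/healthz`. Summing the recorded restore and verification durations gives a rough figure to compare against the RTO target. Note that only the command's arguments are logged, so secrets passed through environment variables such as `PGPASSWORD` stay out of the evidence file.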