Backup / Restore Runbook (Pre-Prod & Prod)
1. Scope
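This runbook covers the following stateful data of quyun_v2:
- PostgreSQL (primary business data)
- Object storage (local storage directory or S3-compatible objects)
- Snapshots of critical runtime configuration (no plaintext secrets)
Goals of this runbook:
- Backups can be run reliably
- A restore can be completed in the pre-prod environment
- RTO / RPO are verified through explicit steps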
2. Preconditions
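- Database backup privileges (pg_dump / psql)
- Read/write access to object storage (local directory or S3 API)
- A pre-prod environment that is available and version-compatible with production
- The following variables confirmed (example values):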
export QY_DB_HOST=127.0.0.1
export QY_DB_PORT=5432
export QY_DB_NAME=quyun_v2
export QY_DB_USER=postgres
export QY_DB_PASSWORD='***'
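Before running any backup step, it can help to fail fast when one of the variables above is missing. The following is a minimal sketch, not part of the original runbook; `require_vars` is a hypothetical helper name.

```shell
# Hypothetical helper: fail early if a required variable is unset or empty.
require_vars() {
  missing=0
  for name in "$@"; do
    # Indirect lookup via eval keeps this compatible with POSIX sh.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing required variable: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Usage:
# require_vars QY_DB_HOST QY_DB_PORT QY_DB_NAME QY_DB_USER QY_DB_PASSWORD || exit 1
```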
3. PostgreSQL Backup
3.1 Create the backup directory
mkdir -p /tmp/quyun-backup
3.2 Dump the database (custom format)
PGPASSWORD="$QY_DB_PASSWORD" \
pg_dump -h "$QY_DB_HOST" -p "$QY_DB_PORT" -U "$QY_DB_USER" \
-F c -d "$QY_DB_NAME" \
-f "/tmp/quyun-backup/${QY_DB_NAME}_$(date +%Y%m%d_%H%M%S).dump"
3.3 Verify backup integrity
pg_restore -l /tmp/quyun-backup/<backup-file>.dump >/tmp/quyun-backup/restore.list
Acceptance criteria: the command exits with code 0 and restore.list is non-empty.
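The "non-empty" criterion can be tightened slightly, because the listing written by `pg_restore -l` begins with comment lines prefixed by `;` and is therefore non-empty even for an archive with no objects. The sketch below counts only real table-of-contents entries; `toc_entry_count` and `check_restore_list` are hypothetical helper names, and the exit code of `pg_restore` itself still has to be checked separately.

```shell
# Hypothetical helper: count TOC entries, skipping ';' comment lines and blanks.
toc_entry_count() {
  grep -c -v -e '^;' -e '^[[:space:]]*$' "$1"
}

# Hypothetical helper: succeed only if the listing has at least one TOC entry.
check_restore_list() {
  list_file=$1
  [ -s "$list_file" ] || { echo "empty or missing: $list_file" >&2; return 1; }
  # grep -c exits non-zero when the count is 0, so tolerate that here.
  entries=$(toc_entry_count "$list_file") || true
  [ "$entries" -gt 0 ] || { echo "no TOC entries in $list_file" >&2; return 1; }
  echo "$entries"
}

# Usage:
# check_restore_list /tmp/quyun-backup/restore.list
```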
4. Object Storage Backup
4.1 Local storage (Storage.Type=local)
tar -czf "/tmp/quyun-backup/storage_$(date +%Y%m%d_%H%M%S).tar.gz" ./backend/storage
4.2 S3/MinIO (Storage.Type=s3)
Example using mc (MinIO Client):
mc alias set quyun-s3 http://127.0.0.1:9000 "$STORAGE_ACCESS_KEY" "$STORAGE_SECRET_KEY"
mc mirror quyun-s3/quyun-01 "/tmp/quyun-backup/s3_quyun-01_$(date +%Y%m%d_%H%M%S)"
Acceptance criteria: the target directory contains more than 0 files, and a sampled object is readable.
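The two checks in this criterion can be scripted against the mirrored directory. This is a rough sketch under the assumption that object names contain no newlines; `check_backup_dir` is a hypothetical helper, and it samples only the first file that `find` returns rather than a random one.

```shell
# Hypothetical helper: count regular files and read one sample end to end.
check_backup_dir() {
  dir=$1
  [ -d "$dir" ] || { echo "not a directory: $dir" >&2; return 1; }
  count=$(find "$dir" -type f | wc -l | tr -d ' ')
  [ "$count" -gt 0 ] || { echo "no files under $dir" >&2; return 1; }
  # Take the first file found as the sample; reading it fully shows it is readable.
  sample=$(find "$dir" -type f | head -n 1)
  cat "$sample" >/dev/null || { echo "unreadable sample: $sample" >&2; return 1; }
  echo "$count"
}

# Usage:
# check_backup_dir "/tmp/quyun-backup/s3_quyun-01_<timestamp>"
```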
5. Restore Procedure (Pre-Prod Drill)
5.1 Prepare the pre-prod database
PGPASSWORD="$QY_DB_PASSWORD" \
psql -h "$QY_DB_HOST" -p "$QY_DB_PORT" -U "$QY_DB_USER" -d postgres \
-c "DROP DATABASE IF EXISTS ${QY_DB_NAME}_restore;"
PGPASSWORD="$QY_DB_PASSWORD" \
psql -h "$QY_DB_HOST" -p "$QY_DB_PORT" -U "$QY_DB_USER" -d postgres \
-c "CREATE DATABASE ${QY_DB_NAME}_restore;"
5.2 Restore the database
PGPASSWORD="$QY_DB_PASSWORD" \
pg_restore -h "$QY_DB_HOST" -p "$QY_DB_PORT" -U "$QY_DB_USER" \
-d "${QY_DB_NAME}_restore" --clean --if-exists \
"/tmp/quyun-backup/<backup-file>.dump"
5.3 Post-restore validation
PGPASSWORD="$QY_DB_PASSWORD" \
psql -h "$QY_DB_HOST" -p "$QY_DB_PORT" -U "$QY_DB_USER" -d "${QY_DB_NAME}_restore" \
-c "SELECT COUNT(*) FROM users;"
PGPASSWORD="$QY_DB_PASSWORD" \
psql -h "$QY_DB_HOST" -p "$QY_DB_PORT" -U "$QY_DB_USER" -d "${QY_DB_NAME}_restore" \
-c "SELECT COUNT(*) FROM audit_logs;"
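Acceptance criteria:
- Core tables (users, orders, audit_logs, contents) contain a plausible number of rows
- Sampled business queries run without syntax or permission errors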
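The commands above count only users and audit_logs, while the criteria name four core tables. One possible way to cover all four in a single round trip is to generate one `UNION ALL` query; `core_count_sql` and `CORE_TABLES` below are hypothetical names introduced for illustration. If any of the tables is missing, the whole query fails and psql exits non-zero, which is the desired signal here.

```shell
# Table names are the four core tables listed in the acceptance criteria.
CORE_TABLES="users orders audit_logs contents"

# Hypothetical helper: print one row-count query covering every core table.
core_count_sql() {
  sep=""
  for t in $CORE_TABLES; do
    printf "%sSELECT '%s' AS table_name, COUNT(*) AS row_count FROM %s" "$sep" "$t" "$t"
    sep=" UNION ALL "
  done
  printf ';\n'
}

# Usage (same connection flags as the commands above):
# PGPASSWORD="$QY_DB_PASSWORD" \
# psql -h "$QY_DB_HOST" -p "$QY_DB_PORT" -U "$QY_DB_USER" -d "${QY_DB_NAME}_restore" \
#   -c "$(core_count_sql)"
```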
6. Service Verification After Restore
After starting the service, run:
curl -f -sS http://127.0.0.1:18080/healthz
curl -f -sS http://127.0.0.1:18080/readyz
Acceptance criteria: both endpoints return 2xx.
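Note that `curl -f` treats only HTTP responses of 400 and above as failures, so a 3xx redirect would still exit 0. If the 2xx criterion should be enforced strictly, the status code can be inspected directly. A minimal sketch follows; `check_service` is a hypothetical helper, and the 5-second timeout is an assumed value.

```shell
# Hypothetical helper: require a 2xx status from both endpoints.
check_service() {
  base_url=${1:-http://127.0.0.1:18080}
  for path in healthz readyz; do
    # -w prints only the status code; connection errors are mapped to 000.
    code=$(curl -sS -o /dev/null -w '%{http_code}' --max-time 5 "$base_url/$path") || code=000
    case "$code" in
      2??) echo "OK: $path ($code)" ;;
      *)   echo "FAIL: $path ($code)" >&2; return 1 ;;
    esac
  done
}

# Usage:
# check_service http://127.0.0.1:18080
```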
7. RTO / RPO Recording
Record for every drill:
- Backup start/end time
- Restore start/end time
- Data validation result
- Incident / blockers
Suggested targets:
- RTO <= 30 minutes
- RPO <= 24 hours (daily backup baseline)
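The recorded timestamps can be compared against these targets mechanically. The sketch below assumes timestamps are captured as epoch seconds (for example with `date +%s`); `within_target` and the two `*_TARGET_SECONDS` variables are hypothetical names. For RTO, pass the restore start and end times; for RPO, pass the end time of the last successful backup and the point in time being recovered to.

```shell
RTO_TARGET_SECONDS=$((30 * 60))        # RTO <= 30 minutes
RPO_TARGET_SECONDS=$((24 * 60 * 60))   # RPO <= 24 hours

# Hypothetical helper: compare an elapsed interval against a target.
# Arguments: start and end as epoch seconds, then the target in seconds.
within_target() {
  start=$1; end=$2; target=$3
  elapsed=$((end - start))
  if [ "$elapsed" -le "$target" ]; then
    echo "PASS ${elapsed}s"
  else
    echo "FAIL ${elapsed}s"
    return 1
  fi
}

# Usage:
# restore_start=$(date +%s); ...run the restore...; restore_end=$(date +%s)
# within_target "$restore_start" "$restore_end" "$RTO_TARGET_SECONDS"
```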
8. Failure Handling
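- pg_dump fails: check network / permissions / disk space, then retry once
- pg_restore fails: keep the logs, fall back to the original pre-prod database, and do not proceed with an overwriting release
- Object restore fails: continue the drill only if the failure is on a non-blocking business path; otherwise abort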
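The "retry once" rule for pg_dump can be expressed as a small wrapper. This is only a sketch: `retry_once` is a hypothetical helper, and it retries immediately, so the network, permission, and disk-space checks still have to be done by the operator if the second attempt also fails.

```shell
# Hypothetical helper: run a command, and retry exactly once if it fails.
retry_once() {
  if "$@"; then
    return 0
  fi
  echo "first attempt failed, retrying once: $1" >&2
  # The exit status of the second attempt becomes the function's status.
  "$@"
}

# Usage (assumes PGPASSWORD is exported in the environment):
# retry_once pg_dump -h "$QY_DB_HOST" -p "$QY_DB_PORT" -U "$QY_DB_USER" \
#   -F c -d "$QY_DB_NAME" -f "/tmp/quyun-backup/${QY_DB_NAME}.dump"
```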
9. Evidence Requirement
Archive every drill under:
docs/release-evidence/<date>.md
At minimum, include:
- Operator and time window
- Commands and exit codes
- Output of the core validation SQL
- healthz / readyz results
- Conclusion (PASS/FAIL)
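To keep evidence files consistent between drills, a skeleton with the required fields can be generated up front. The sketch below is one possible layout, not a prescribed template; `write_evidence` is a hypothetical helper, and the `YYYY-MM-DD` form used for `<date>` in the usage line is an assumption, since the runbook does not specify the date format.

```shell
# Hypothetical helper: write an evidence skeleton to <dir>/<date>.md.
# Arguments: directory, date string, operator, time window, conclusion.
write_evidence() {
  dir=$1; day=$2; operator=$3; window=$4; conclusion=$5
  mkdir -p "$dir" || return 1
  out="$dir/$day.md"
  cat > "$out" <<EOF
Backup / Restore Drill Evidence ($day)
- Operator: $operator
- Time window: $window
- Commands and exit codes: (paste here)
- Core validation SQL output: (paste here)
- healthz/readyz results: (paste here)
- Conclusion: $conclusion
EOF
  echo "$out"
}

# Usage:
# write_evidence docs/release-evidence "$(date +%Y-%m-%d)" "<operator>" "<time window>" PASS
```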