Backup / Restore Runbook (Pre-Prod & Prod)

1. Scope

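Applies to the following stateful data of quyun_v2:

  • PostgreSQL primary business data
  • Object storage contents (local storage directory or S3-compatible objects)
  • Snapshots of key runtime configuration (excluding plaintext secrets)

This runbook aims to ensure that:

  1. Backups run reliably
  2. A restore can be completed in the pre-production environment
  3. RTO / RPO are verified through explicit steps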

2. Preconditions

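  • Database backup privileges (pg_dump / psql)
  • Read/write access to object storage (local directory or S3 API)
  • A pre-production environment that is available and version-compatible with production
  • The following variables have been confirmed (example values):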
export QY_DB_HOST=127.0.0.1
export QY_DB_PORT=5432
export QY_DB_NAME=quyun_v2
export QY_DB_USER=postgres
export QY_DB_PASSWORD='***'
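
The preconditions above can be checked mechanically before a drill starts. The following is a minimal sketch, not part of the original runbook: `missing_vars` and `missing_tools` are hypothetical helper names, and the tool list in the usage note is an assumption based on the commands used later in this document.

```shell
# Hypothetical helper: print the name of every listed variable that is unset or empty.
missing_vars() {
  for name in "$@"; do
    # Look up the value of the variable whose name is stored in $name.
    eval "value=\${$name:-}"
    [ -n "$value" ] || printf '%s\n' "$name"
  done
}

# Hypothetical helper: print the name of every listed command not found on PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}
```

A drill could begin with `missing_vars QY_DB_HOST QY_DB_PORT QY_DB_NAME QY_DB_USER QY_DB_PASSWORD` and `missing_tools pg_dump pg_restore psql`, and stop if either prints anything.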

3. PostgreSQL Backup

3.1 Create the backup directory

mkdir -p /tmp/quyun-backup

3.2 Dump the database (custom format)

PGPASSWORD="$QY_DB_PASSWORD" \
pg_dump -h "$QY_DB_HOST" -p "$QY_DB_PORT" -U "$QY_DB_USER" \
  -F c -d "$QY_DB_NAME" \
  -f "/tmp/quyun-backup/${QY_DB_NAME}_$(date +%Y%m%d_%H%M%S).dump"

3.3 Verify backup integrity

pg_restore -l /tmp/quyun-backup/<backup-file>.dump >/tmp/quyun-backup/restore.list

Acceptance criterion: the command exits with code 0 and restore.list is non-empty.
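
The two conditions in this acceptance criterion can be combined into a single check. This is a sketch under the assumption that the dump was produced in custom format as in 3.2; `verify_dump` is a hypothetical function name.

```shell
# Hypothetical helper: succeed only when pg_restore can list the archive
# and the resulting table of contents is non-empty.
verify_dump() {
  dump_file=$1
  list_file=$2
  # pg_restore -l prints the archive's table of contents without restoring anything.
  pg_restore -l "$dump_file" >"$list_file" || return 1
  # -s is true when the file exists and has a size greater than zero.
  [ -s "$list_file" ]
}
```

For example, `verify_dump /tmp/quyun-backup/<backup-file>.dump /tmp/quyun-backup/restore.list && echo PASS`. Note that listing the archive confirms it is readable, not that every row restores cleanly; the drill in section 5 covers that.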


4. Object Storage Backup

4.1 Local storage (Storage.Type=local)

tar -czf "/tmp/quyun-backup/storage_$(date +%Y%m%d_%H%M%S).tar.gz" ./backend/storage

4.2 S3/MinIO (Storage.Type=s3)

Example using mc (the MinIO Client):

mc alias set quyun-s3 http://127.0.0.1:9000 "$STORAGE_ACCESS_KEY" "$STORAGE_SECRET_KEY"
mc mirror quyun-s3/quyun-01 "/tmp/quyun-backup/s3_quyun-01_$(date +%Y%m%d_%H%M%S)"

Acceptance criterion: the target directory contains more than 0 files and a sampled object is readable.
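
One way to check this criterion against a mirrored directory is sketched below. `check_mirror` is a hypothetical name, and reading only the first file found is an assumed, deliberately weak form of sampling.

```shell
# Hypothetical helper: succeed when the directory holds at least one regular
# file and the first file found can be read from start to end.
check_mirror() {
  dir=$1
  # tr strips the padding some wc implementations put around the count.
  count=$(find "$dir" -type f | wc -l | tr -d ' ')
  [ "$count" -gt 0 ] || return 1
  sample=$(find "$dir" -type f | head -n 1)
  # Reading the whole file surfaces permission and I/O errors.
  cat "$sample" >/dev/null || return 1
  printf 'files=%s sample=%s\n' "$count" "$sample"
}
```

Run it against the directory created by `mc mirror`, for example `check_mirror /tmp/quyun-backup/s3_quyun-01_<timestamp>`.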


5. Restore Procedure (Pre-Prod Drill)

5.1 Prepare the pre-production database

PGPASSWORD="$QY_DB_PASSWORD" \
psql -h "$QY_DB_HOST" -p "$QY_DB_PORT" -U "$QY_DB_USER" -d postgres \
  -c "DROP DATABASE IF EXISTS ${QY_DB_NAME}_restore;"

PGPASSWORD="$QY_DB_PASSWORD" \
psql -h "$QY_DB_HOST" -p "$QY_DB_PORT" -U "$QY_DB_USER" -d postgres \
  -c "CREATE DATABASE ${QY_DB_NAME}_restore;"

5.2 Restore the database

PGPASSWORD="$QY_DB_PASSWORD" \
pg_restore -h "$QY_DB_HOST" -p "$QY_DB_PORT" -U "$QY_DB_USER" \
  -d "${QY_DB_NAME}_restore" --clean --if-exists \
  "/tmp/quyun-backup/<backup-file>.dump"

5.3 Post-restore validation

PGPASSWORD="$QY_DB_PASSWORD" \
psql -h "$QY_DB_HOST" -p "$QY_DB_PORT" -U "$QY_DB_USER" -d "${QY_DB_NAME}_restore" \
  -c "SELECT COUNT(*) FROM users;"

PGPASSWORD="$QY_DB_PASSWORD" \
psql -h "$QY_DB_HOST" -p "$QY_DB_PORT" -U "$QY_DB_USER" -d "${QY_DB_NAME}_restore" \
  -c "SELECT COUNT(*) FROM audit_logs;"

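Acceptance criteria:

  • Core tables (users, orders, audit_logs, contents) contain a plausible number of rows
  • Sampled business queries run without syntax or permission errors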
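
The row-count check can be extended from the two tables queried above to all four core tables. The sketch below reuses the QY_DB_* variables from section 2; `count_core_tables` is a hypothetical name, and treating a zero count as a failure is an assumption, since the runbook does not define what a plausible row count is.

```shell
# Core tables named in the acceptance criteria.
CORE_TABLES="users orders audit_logs contents"

# Hypothetical helper: print "<table><TAB><rows>" for each core table in the
# restored database. Fails if a query errors or (assumption) a table is empty.
count_core_tables() {
  status=0
  for table in $CORE_TABLES; do
    # -A (unaligned) and -t (tuples only) make psql print just the number.
    rows=$(PGPASSWORD="$QY_DB_PASSWORD" psql -h "$QY_DB_HOST" -p "$QY_DB_PORT" \
      -U "$QY_DB_USER" -d "${QY_DB_NAME}_restore" -At \
      -c "SELECT COUNT(*) FROM ${table};") || return 1
    printf '%s\t%s\n' "$table" "$rows"
    [ "$rows" -gt 0 ] || status=1
  done
  return "$status"
}
```

The printed counts can be pasted directly into the drill evidence as the core validation SQL output.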

6. Service Verification After Restore

After starting the service, run:

curl -f -sS http://127.0.0.1:18080/healthz
curl -f -sS http://127.0.0.1:18080/readyz

Acceptance criterion: both endpoints return 2xx.
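
A freshly started service may need a few seconds before it answers, so a single curl call can fail spuriously. The sketch below polls an endpoint a bounded number of times; `wait_for_endpoint` is a hypothetical name, and the attempt count and delay are illustrative values rather than requirements from this runbook.

```shell
# Hypothetical helper: poll a URL until curl succeeds or attempts run out.
# Usage: wait_for_endpoint <url> <attempts> <delay-seconds>
wait_for_endpoint() {
  url=$1
  attempts=$2
  delay=$3
  i=1
  while [ "$i" -le "$attempts" ]; do
    # -f makes curl exit non-zero on HTTP responses with status >= 400.
    if curl -f -sS "$url" >/dev/null 2>&1; then
      return 0
    fi
    # Sleep only between attempts, not after the last one.
    [ "$i" -lt "$attempts" ] && sleep "$delay"
    i=$((i + 1))
  done
  return 1
}
```

For example: `wait_for_endpoint http://127.0.0.1:18080/healthz 10 3 && wait_for_endpoint http://127.0.0.1:18080/readyz 10 3`.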


7. RTO / RPO Recording

Record the following for every drill:

  • Backup start/end time
  • Restore start/end time
  • Data validation result
  • Incident / blockers

Recommended targets:

  • RTO <= 30 minutes
  • RPO <= 24 hours (daily backup baseline)
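
Measured restore time can be compared against the 30-minute RTO target with simple arithmetic on epoch timestamps. In this sketch `rto_status` and `RTO_TARGET_SECONDS` are hypothetical names, and using the restore start and end times as the RTO measurement window is an assumption.

```shell
# RTO target from this runbook: 30 minutes, expressed in seconds.
RTO_TARGET_SECONDS=$((30 * 60))

# Hypothetical helper: compare elapsed restore time with the RTO target.
# Usage: rto_status <start-epoch-seconds> <end-epoch-seconds>
rto_status() {
  start=$1
  end=$2
  elapsed=$((end - start))
  if [ "$elapsed" -le "$RTO_TARGET_SECONDS" ]; then
    echo "PASS ${elapsed}s"
  else
    echo "FAIL ${elapsed}s"
  fi
}
```

Capture `start=$(date +%s)` before the restore and call `rto_status "$start" "$(date +%s)"` once the service verification in section 6 passes.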

8. Failure Handling

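  • pg_dump fails: check network, permissions, and disk space, then retry once
  • pg_restore fails: keep the logs, fall back to the original pre-production database, and do not proceed with an overwriting release
  • Object restore fails: continue the drill only if the failure is on a non-blocking business path; otherwise abort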
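
The "retry once" rule for pg_dump can be wrapped in a small function so that the retry is bounded and the first failure is still logged. `retry_once` is a hypothetical name; the network, permission, and disk-space checks remain a manual step between attempts unless added separately.

```shell
# Hypothetical helper: run a command, and if it fails, run it exactly one more time.
# The exit status is that of the last attempt.
retry_once() {
  "$@" && return 0
  echo "first attempt failed, retrying once: $*" >&2
  "$@"
}
```

For example, `retry_once pg_dump -h "$QY_DB_HOST" -p "$QY_DB_PORT" -U "$QY_DB_USER" -F c -d "$QY_DB_NAME" -f "$dump_file"`, where `dump_file` is a placeholder for the path chosen in 3.2 and PGPASSWORD is exported beforehand.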

9. Evidence Requirement

Archive every drill under:

  • docs/release-evidence/<date>.md

At minimum, include:

  1. Operator and time window
  2. Commands and exit codes
  3. Output of the core validation SQL
  4. healthz/readyz results
  5. Conclusion (PASS/FAIL)
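
To keep evidence files consistent between drills, the five required items can be written from a fixed skeleton. This is only a sketch: `write_evidence` is a hypothetical name and the file layout is an assumption, since the runbook specifies the required contents but not a template.

```shell
# Hypothetical helper: write an evidence skeleton covering the five required items.
# Usage: write_evidence <output-file> <operator> <time-window> <PASS|FAIL>
write_evidence() {
  out=$1
  operator=$2
  window=$3
  verdict=$4
  # Reject anything other than the two conclusions the runbook allows.
  case "$verdict" in
    PASS|FAIL) ;;
    *) echo "verdict must be PASS or FAIL" >&2; return 1 ;;
  esac
  cat >"$out" <<EOF
Backup / Restore Drill Evidence

1. Operator: ${operator}
   Time window: ${window}
2. Commands and exit codes:
3. Core validation SQL output:
4. healthz/readyz results:
5. Conclusion: ${verdict}
EOF
}
```

Items 2 to 4 are left blank for the operator to paste in captured output after the drill.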