# Backup / Restore Runbook (Pre-Prod & Prod)
## 1. Scope
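This runbook covers the following stateful data for `quyun_v2`:
- PostgreSQL primary business data
- Object storage directory (local storage or S3-compatible objects)
- Snapshots of key runtime configuration (excluding plaintext secrets)

Goals of this runbook:
1. Backups can be executed reliably
2. A restore can be completed in the pre-prod environment
3. RTO / RPO verification steps are clearly defined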
---
## 2. Preconditions
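- Database backup permissions (`pg_dump` / `psql`)
- Read/write access to object storage (local directory or S3 API)
- A pre-prod environment that is available and version-compatible with production
- The following variables are confirmed (example values):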
```bash
export QY_DB_HOST=127.0.0.1
export QY_DB_PORT=5432
export QY_DB_NAME=quyun_v2
export QY_DB_USER=postgres
export QY_DB_PASSWORD='***'
```
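The preconditions above can be checked before any backup step runs. The following is a minimal sketch, not part of the original runbook: `check_vars` and `check_tools` are hypothetical helper names, and the tool list passed to `check_tools` is an assumption.

```bash
# Hypothetical helpers: fail early when a precondition is not met.

# Return non-zero if any of the QY_DB_* variables is unset or empty.
check_vars() {
  local missing=0 name value
  for name in QY_DB_HOST QY_DB_PORT QY_DB_NAME QY_DB_USER QY_DB_PASSWORD; do
    # Indirect lookup; the names come only from the fixed list above.
    eval "value=\${${name}:-}"
    if [ -z "$value" ]; then
      echo "missing variable: ${name}" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Return non-zero if any tool given as an argument is not on PATH.
check_tools() {
  local missing=0 tool
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing tool: ${tool}" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

A typical call would be `check_vars && check_tools pg_dump pg_restore psql`, adding `mc` when `Storage.Type=s3` is in use.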
---
## 3. PostgreSQL Backup
### 3.1 Create the backup directory
```bash
mkdir -p /tmp/quyun-backup
```
### 3.2 Dump the database (custom format)
```bash
PGPASSWORD="$QY_DB_PASSWORD" \
pg_dump -h "$QY_DB_HOST" -p "$QY_DB_PORT" -U "$QY_DB_USER" \
-F c -d "$QY_DB_NAME" \
-f "/tmp/quyun-backup/${QY_DB_NAME}_$(date +%Y%m%d_%H%M%S).dump"
```
### 3.3 Verify backup integrity
```bash
pg_restore -l /tmp/quyun-backup/<backup-file>.dump >/tmp/quyun-backup/restore.list
```
Acceptance criteria: the command exits with code 0 and `restore.list` is non-empty.
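Both acceptance criteria can be checked in one step. This is a sketch under assumptions: `verify_dump` is a hypothetical helper name, and the default list path simply mirrors the one used in 3.3.

```bash
# Hypothetical helper: list the archive contents, then check both criteria.
verify_dump() {
  local dump_file="$1"
  local list_file="${2:-/tmp/quyun-backup/restore.list}"
  # Criterion 1: pg_restore -l must exit with code 0.
  if ! pg_restore -l "$dump_file" >"$list_file"; then
    echo "FAIL: pg_restore -l exited non-zero for ${dump_file}" >&2
    return 1
  fi
  # Criterion 2: the generated list must be non-empty.
  if [ ! -s "$list_file" ]; then
    echo "FAIL: ${list_file} is empty" >&2
    return 1
  fi
  echo "PASS: ${dump_file}"
}
```

Called as `verify_dump /tmp/quyun-backup/<backup-file>.dump`, it returns 0 only when both criteria hold. Note that listing the archive confirms it is readable, not that every row restores cleanly; the restore drill in section 5 covers that.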
---
## 4. Object Storage Backup
### 4.1 Local storage (`Storage.Type=local`)
```bash
tar -czf "/tmp/quyun-backup/storage_$(date +%Y%m%d_%H%M%S).tar.gz" ./backend/storage
```
### 4.2 S3/MinIO (`Storage.Type=s3`)
Example using `mc` (MinIO Client):
```bash
mc alias set quyun-s3 http://127.0.0.1:9000 "$STORAGE_ACCESS_KEY" "$STORAGE_SECRET_KEY"
mc mirror quyun-s3/quyun-01 "/tmp/quyun-backup/s3_quyun-01_$(date +%Y%m%d_%H%M%S)"
```
Acceptance criteria: the target directory contains more than 0 files, and sampled objects are readable.
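A rough way to check these criteria against the mirrored directory is sketched below. `verify_mirror` is a hypothetical helper name, and reading only the first file found is a deliberately weak form of sampling, chosen here for brevity.

```bash
# Hypothetical helper: count mirrored files and read one sample object.
verify_mirror() {
  local dir="$1" count sample
  # A missing directory yields a count of 0 and therefore a failure.
  count=$(find "$dir" -type f 2>/dev/null | wc -l | tr -d ' ')
  if [ "$count" -eq 0 ]; then
    echo "FAIL: no files under ${dir}" >&2
    return 1
  fi
  # Sample: the first file that find reports (not a random sample).
  sample=$(find "$dir" -type f | head -n 1)
  if ! cat "$sample" >/dev/null; then
    echo "FAIL: cannot read ${sample}" >&2
    return 1
  fi
  echo "PASS: ${count} files, sample readable: ${sample}"
}
```

Pass the timestamped directory created by `mc mirror` as the only argument.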
---
## 5. Restore Procedure (Pre-Prod Drill)
### 5.1 Prepare the pre-prod database
```bash
PGPASSWORD="$QY_DB_PASSWORD" \
psql -h "$QY_DB_HOST" -p "$QY_DB_PORT" -U "$QY_DB_USER" -d postgres \
-c "DROP DATABASE IF EXISTS ${QY_DB_NAME}_restore;"
PGPASSWORD="$QY_DB_PASSWORD" \
psql -h "$QY_DB_HOST" -p "$QY_DB_PORT" -U "$QY_DB_USER" -d postgres \
-c "CREATE DATABASE ${QY_DB_NAME}_restore;"
```
### 5.2 Restore the database
```bash
PGPASSWORD="$QY_DB_PASSWORD" \
pg_restore -h "$QY_DB_HOST" -p "$QY_DB_PORT" -U "$QY_DB_USER" \
-d "${QY_DB_NAME}_restore" --clean --if-exists \
"/tmp/quyun-backup/<backup-file>.dump"
```
### 5.3 Post-restore verification
```bash
PGPASSWORD="$QY_DB_PASSWORD" \
psql -h "$QY_DB_HOST" -p "$QY_DB_PORT" -U "$QY_DB_USER" -d "${QY_DB_NAME}_restore" \
-c "SELECT COUNT(*) FROM users;"
PGPASSWORD="$QY_DB_PASSWORD" \
psql -h "$QY_DB_HOST" -p "$QY_DB_PORT" -U "$QY_DB_USER" -d "${QY_DB_NAME}_restore" \
-c "SELECT COUNT(*) FROM audit_logs;"
```
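Acceptance criteria:
- Core tables (`users`, `orders`, `audit_logs`, `contents`) contain a reasonable amount of data
- Sampled business queries run without syntax or permission errors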
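The commands in 5.3 count two of the four core tables. A loop over all four could look like the sketch below; `count_core_tables` is a hypothetical helper name, and what counts as "reasonable" is left to the operator, since the runbook does not define a threshold.

```bash
# Hypothetical helper: print a row count for each core table in the restored DB.
count_core_tables() {
  local table rows
  for table in users orders audit_logs contents; do
    # -A (unaligned) and -t (tuples only) make psql print just the number.
    rows=$(PGPASSWORD="$QY_DB_PASSWORD" \
      psql -h "$QY_DB_HOST" -p "$QY_DB_PORT" -U "$QY_DB_USER" \
        -d "${QY_DB_NAME}_restore" -At \
        -c "SELECT COUNT(*) FROM ${table};") || return 1
    echo "${table}: ${rows}"
  done
}
```

The function stops at the first failing query, so a missing table or a permission error surfaces as a non-zero exit code that can be recorded as evidence.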
---
## 6. Service Verification After Restore
After starting the service, run:
```bash
curl -f -sS http://127.0.0.1:18080/healthz
curl -f -sS http://127.0.0.1:18080/readyz
```
Acceptance criteria: both endpoints return 2xx.
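A freshly started service may need a few seconds before it answers. The runbook does not prescribe retries, so the following is only an optional sketch: `wait_for_endpoint` is a hypothetical helper name, and the default of 10 attempts with a 3 second pause is an assumption.

```bash
# Hypothetical helper: poll a URL until curl -f succeeds or attempts run out.
wait_for_endpoint() {
  local url="$1" attempts="${2:-10}" delay="${3:-3}" i=1
  while [ "$i" -le "$attempts" ]; do
    # curl -f makes HTTP status codes of 400 and above fail the command.
    if curl -f -sS "$url" >/dev/null 2>&1; then
      echo "OK: ${url} (attempt ${i})"
      return 0
    fi
    # Pause between attempts, but not after the last one.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  echo "FAIL: ${url} not healthy after ${attempts} attempts" >&2
  return 1
}
```

For example, `wait_for_endpoint http://127.0.0.1:18080/healthz && wait_for_endpoint http://127.0.0.1:18080/readyz` passes only when both endpoints eventually respond successfully.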
---
## 7. RTO / RPO Recording
Record the following for each drill:
- Backup start/end time
- Restore start/end time
- Data validation result
- Incident / blockers

Suggested targets:
- RTO <= 30 minutes
- RPO <= 24 hours (daily backup baseline)
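Recording epoch timestamps makes the RTO comparison mechanical. The sketch below assumes timestamps captured with `date +%s`; `within_limit` and `RTO_LIMIT_SECONDS` are hypothetical names.

```bash
# Suggested RTO target from above: 30 minutes, in seconds.
RTO_LIMIT_SECONDS=$((30 * 60))

# Hypothetical helper: succeed if (end - start) does not exceed the limit.
# All three arguments are integers in seconds.
within_limit() {
  local start="$1" end="$2" limit="$3" elapsed
  elapsed=$((end - start))
  echo "elapsed=${elapsed}s limit=${limit}s"
  [ "$elapsed" -le "$limit" ]
}
```

In a drill, capture `restore_start=$(date +%s)` before section 5.1 and call `within_limit "$restore_start" "$(date +%s)" "$RTO_LIMIT_SECONDS"` once the checks in section 6 pass. The same helper can compare the last backup's end time with the moment of a simulated incident against the 24 hour RPO target.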
---
## 8. Failure Handling
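- `pg_dump` failure: check network, permissions, and disk space, then retry once
- `pg_restore` failure: keep the logs, fall back to the original pre-prod database, and do not proceed with an overwrite release
- Object restore failure: continue the drill only if the failure is on a non-blocking business path; otherwise abort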
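The "retry once" rule for `pg_dump` can be expressed as a small wrapper. This is a sketch only; `retry_once` is a hypothetical helper name, and it retries immediately, so the network, permission, and disk checks still have to be done by the operator between attempts if the cause is not transient.

```bash
# Hypothetical helper: run a command, and if it fails, run it exactly one more time.
retry_once() {
  if "$@"; then
    return 0
  fi
  # Log the command line; the password is passed via PGPASSWORD, not as an argument.
  echo "first attempt failed, retrying once: $*" >&2
  "$@"
}
```

The wrapper returns the exit code of the second attempt, so a second failure still stops the drill.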
---
## 9. Evidence Requirement
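Archive the evidence for each drill to:
- `docs/release-evidence/<date>.md`

It must include at least:
1. Operator and time window
2. Commands and exit codes
3. Output of the core verification SQL
4. healthz/readyz results
5. Conclusion (PASS/FAIL)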
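A skeleton file with the five required sections can be generated up front and filled in during the drill. The sketch below makes two assumptions that the runbook does not state: `<date>` is rendered as `YYYY-MM-DD`, and the section headings are free-form. `write_evidence_skeleton` is a hypothetical helper name.

```bash
# Hypothetical helper: create an empty evidence file for today's drill.
write_evidence_skeleton() {
  local dir="${1:-docs/release-evidence}" day file
  day=$(date +%Y-%m-%d)   # assumed format for <date>
  file="${dir}/${day}.md"
  # Never overwrite evidence that already exists for this date.
  if [ -e "$file" ]; then
    echo "refusing to overwrite ${file}" >&2
    return 1
  fi
  mkdir -p "$dir"
  cat >"$file" <<EOF
# Backup / Restore Drill Evidence (${day})

## 1. Operator and time window

## 2. Commands and exit codes

## 3. Core verification SQL output

## 4. healthz/readyz results

## 5. Conclusion (PASS/FAIL)
EOF
  echo "$file"
}
```

Run it from the repository root so the default directory resolves to `docs/release-evidence`; the function prints the path of the file it created.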