chore: harden production readiness gates and runbooks

2026-02-09 11:27:23 +08:00
parent 05a0d07dbb
commit f1412a371d
15 changed files with 1001 additions and 322 deletions

@@ -0,0 +1,168 @@
# Backup / Restore Runbook (Pre-Prod & Prod)
## 1. Scope
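This runbook covers the following stateful data of `quyun_v2`:
- PostgreSQL (primary business data)
- Object storage (local storage directory or S3-compatible objects)
- Snapshots of critical runtime configuration (no plaintext secrets)
Goals of this runbook:
1. Backups can be executed reliably
2. A restore can be completed in the pre-production environment
3. RTO / RPO have explicit verification steps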
---
## 2. Preconditions
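- Database backup privileges (`pg_dump` / `psql`)
- Read/write access to object storage (local directory or S3 API)
- A pre-production environment that is available and version-compatible with production
- The following variables confirmed (example values):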
```bash
export QY_DB_HOST=127.0.0.1
export QY_DB_PORT=5432
export QY_DB_NAME=quyun_v2
export QY_DB_USER=postgres
export QY_DB_PASSWORD='***'
```
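Before running any backup command, it can help to confirm that the variables above are exported and the client tools are installed. The following is a minimal preflight sketch; the function names `missing_vars` and `missing_tools` are illustrative and not part of `quyun_v2`.

```bash
#!/usr/bin/env bash
# Preflight sketch (hypothetical helpers, not part of quyun_v2).

# Report every name in "$@" that is not an exported, non-empty variable.
# Returns 1 if at least one is missing, 0 otherwise.
missing_vars() {
  local name missing=0
  for name in "$@"; do
    # printenv prints the value of an exported variable, or nothing if unset.
    if [ -z "$(printenv "$name")" ]; then
      echo "missing variable: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Report every command in "$@" that cannot be found on PATH.
# Returns 1 if at least one is missing, 0 otherwise.
missing_tools() {
  local tool missing=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing tool: $tool" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

A possible invocation is `missing_vars QY_DB_HOST QY_DB_PORT QY_DB_NAME QY_DB_USER QY_DB_PASSWORD && missing_tools pg_dump pg_restore psql`, which stops the drill early if anything is absent.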
---
## 3. PostgreSQL Backup
### 3.1 Create the backup directory
```bash
mkdir -p /tmp/quyun-backup
```
### 3.2 Dump the database (custom format)
```bash
PGPASSWORD="$QY_DB_PASSWORD" \
pg_dump -h "$QY_DB_HOST" -p "$QY_DB_PORT" -U "$QY_DB_USER" \
-F c -d "$QY_DB_NAME" \
-f "/tmp/quyun-backup/${QY_DB_NAME}_$(date +%Y%m%d_%H%M%S).dump"
```
### 3.3 Verify backup integrity
```bash
pg_restore -l /tmp/quyun-backup/<backup-file>.dump >/tmp/quyun-backup/restore.list
```
Acceptance criteria: the command exits with code 0 and `restore.list` is non-empty.
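Both acceptance conditions of the integrity check (exit code 0 and a non-empty listing) can be checked in one step. This is a sketch only; `verify_dump` is a hypothetical helper name, and it relies on nothing beyond `pg_restore -l` as used in 3.3.

```bash
#!/usr/bin/env bash
# Sketch: verify_dump <dump-file> <list-file>
# Returns 0 only if `pg_restore -l` succeeds AND the listing is non-empty.
verify_dump() {
  local dump_file="$1" list_file="$2"
  if ! pg_restore -l "$dump_file" >"$list_file"; then
    echo "FAIL: pg_restore -l exited non-zero for $dump_file" >&2
    return 1
  fi
  # -s is true when the file exists and its size is greater than zero.
  if [ ! -s "$list_file" ]; then
    echo "FAIL: $list_file is empty" >&2
    return 1
  fi
  echo "PASS: $dump_file"
}
```

For example: `verify_dump /tmp/quyun-backup/<backup-file>.dump /tmp/quyun-backup/restore.list`.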
---
## 4. Object Storage Backup
### 4.1 Local storage (`Storage.Type=local`)
```bash
tar -czf "/tmp/quyun-backup/storage_$(date +%Y%m%d_%H%M%S).tar.gz" ./backend/storage
```
### 4.2 S3/MinIO (`Storage.Type=s3`)
Example using `mc` (MinIO Client):
```bash
mc alias set quyun-s3 http://127.0.0.1:9000 "$STORAGE_ACCESS_KEY" "$STORAGE_SECRET_KEY"
mc mirror quyun-s3/quyun-01 "/tmp/quyun-backup/s3_quyun-01_$(date +%Y%m%d_%H%M%S)"
```
Acceptance criteria: the target directory contains more than 0 files, and a sampled object can be read.
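The object storage criteria can be checked against a mirrored directory roughly as follows. This is a sketch: `verify_storage_backup` is a hypothetical helper, and "sampled object" is interpreted here as the first file that `find` returns, which is an assumption rather than a rule stated in this runbook.

```bash
#!/usr/bin/env bash
# Sketch: verify_storage_backup <mirror-dir>
# Returns 0 if the directory holds at least one regular file and the
# sampled file (the first one found) can be read end to end.
verify_storage_backup() {
  local dir="$1" count sample
  # tr strips the padding that some wc implementations add around the number.
  count=$(find "$dir" -type f 2>/dev/null | wc -l | tr -d ' ')
  if [ "$count" -eq 0 ]; then
    echo "FAIL: no files found under $dir" >&2
    return 1
  fi
  sample=$(find "$dir" -type f | head -n 1)
  if ! cat "$sample" >/dev/null; then
    echo "FAIL: cannot read sample object $sample" >&2
    return 1
  fi
  echo "PASS: $count files, sample $sample is readable"
}
```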
---
## 5. Restore Procedure (Pre-Prod Drill)
### 5.1 Prepare the pre-production database
```bash
PGPASSWORD="$QY_DB_PASSWORD" \
psql -h "$QY_DB_HOST" -p "$QY_DB_PORT" -U "$QY_DB_USER" -d postgres \
-c "DROP DATABASE IF EXISTS ${QY_DB_NAME}_restore;"
PGPASSWORD="$QY_DB_PASSWORD" \
psql -h "$QY_DB_HOST" -p "$QY_DB_PORT" -U "$QY_DB_USER" -d postgres \
-c "CREATE DATABASE ${QY_DB_NAME}_restore;"
```
### 5.2 Restore the database
```bash
PGPASSWORD="$QY_DB_PASSWORD" \
pg_restore -h "$QY_DB_HOST" -p "$QY_DB_PORT" -U "$QY_DB_USER" \
-d "${QY_DB_NAME}_restore" --clean --if-exists \
"/tmp/quyun-backup/<backup-file>.dump"
```
### 5.3 Post-restore validation
```bash
PGPASSWORD="$QY_DB_PASSWORD" \
psql -h "$QY_DB_HOST" -p "$QY_DB_PORT" -U "$QY_DB_USER" -d "${QY_DB_NAME}_restore" \
-c "SELECT COUNT(*) FROM users;"
PGPASSWORD="$QY_DB_PASSWORD" \
psql -h "$QY_DB_HOST" -p "$QY_DB_PORT" -U "$QY_DB_USER" -d "${QY_DB_NAME}_restore" \
-c "SELECT COUNT(*) FROM audit_logs;"
```
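Acceptance criteria:
- Core tables (`users`, `orders`, `audit_logs`, `contents`) hold a plausible number of rows
- Sampled business queries run without syntax or permission errors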
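The row-count check can be extended to all four core tables with a small loop. This is a sketch: `check_core_tables` and `MIN_ROWS` are hypothetical names, and the default threshold of 1 row is an assumption, since the runbook does not define what a plausible row count is.

```bash
#!/usr/bin/env bash
# Sketch: check_core_tables <table>...
# Prints "<table> <rows>" per table. Returns 1 if any query fails or any
# table has fewer rows than MIN_ROWS (assumed threshold, default 1).
check_core_tables() {
  local table rows failed=0
  for table in "$@"; do
    # -A (unaligned) and -t (tuples only) make psql print just the number.
    if ! rows=$(PGPASSWORD="$QY_DB_PASSWORD" psql -h "$QY_DB_HOST" \
        -p "$QY_DB_PORT" -U "$QY_DB_USER" -d "${QY_DB_NAME}_restore" \
        -At -c "SELECT COUNT(*) FROM ${table};"); then
      echo "FAIL: query on $table failed" >&2
      failed=1
      continue
    fi
    echo "$table $rows"
    if [ "$rows" -lt "${MIN_ROWS:-1}" ]; then
      echo "FAIL: $table has only $rows rows" >&2
      failed=1
    fi
  done
  return "$failed"
}
```

For example: `check_core_tables users orders audit_logs contents`. Set `MIN_ROWS` per environment if a higher floor is appropriate.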
---
## 6. Service Verification After Restore
After starting the service, run:
```bash
curl -f -sS http://127.0.0.1:18080/healthz
curl -f -sS http://127.0.0.1:18080/readyz
```
Acceptance criteria: both endpoints return 2xx.
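A freshly started service may need a few seconds before it reports ready, so a polling loop can be more practical than a single request. The sketch below checks the HTTP status code explicitly for the 2xx range; `wait_for_2xx` and its retry defaults (10 attempts, 3 seconds apart) are assumptions, not values taken from this runbook.

```bash
#!/usr/bin/env bash
# Sketch: wait_for_2xx <url> [attempts] [delay-seconds]
# Polls the URL until it answers with a 2xx status or attempts run out.
wait_for_2xx() {
  local url="$1" attempts="${2:-10}" delay="${3:-3}" i=1 code
  while [ "$i" -le "$attempts" ]; do
    # -w '%{http_code}' prints only the status code; the body is discarded.
    code=$(curl -sS -o /dev/null -w '%{http_code}' "$url") || code=000
    case "$code" in
      2??) echo "OK: $url returned $code (attempt $i)"; return 0 ;;
    esac
    i=$((i + 1))
    sleep "$delay"
  done
  echo "FAIL: $url did not return 2xx after $attempts attempts (last: $code)" >&2
  return 1
}
```

For example: `wait_for_2xx http://127.0.0.1:18080/healthz && wait_for_2xx http://127.0.0.1:18080/readyz`.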
---
## 7. RTO / RPO Recording
Record for every drill:
- Backup start/end time
- Restore start/end time
- Data validation result
- Incident / blockers
Suggested targets:
- RTO <= 30 minutes
- RPO <= 24 hours (daily backup baseline)
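The recorded timestamps can be compared against these targets with simple arithmetic on Unix epoch seconds. This is a sketch; `within_target`, `RTO_LIMIT`, and `RPO_LIMIT` are hypothetical names. It treats RTO as restore end minus restore start, and RPO as the gap between the last successful backup and the reference point (for example, the start of the drill).

```bash
#!/usr/bin/env bash
# Targets from this runbook, expressed in seconds.
RTO_LIMIT=$((30 * 60))        # 30 minutes
RPO_LIMIT=$((24 * 60 * 60))   # 24 hours

# Sketch: within_target <start-epoch> <end-epoch> <limit-seconds>
# Prints the elapsed time and returns 0 if it does not exceed the limit.
within_target() {
  local start="$1" end="$2" limit="$3"
  local elapsed=$((end - start))
  echo "elapsed=${elapsed}s limit=${limit}s"
  [ "$elapsed" -le "$limit" ]
}
```

In a drill, capture `restore_start=$(date +%s)` before section 5.1 and call `within_target "$restore_start" "$(date +%s)" "$RTO_LIMIT"` after section 6 passes.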
---
## 8. Failure Handling
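- `pg_dump` fails: check network / permissions / disk space, then retry once
- `pg_restore` fails: keep the logs, fall back to the original pre-production database, and do not proceed with an overwriting release
- Object restore fails: continue the drill only if the failure is on a non-blocking business path; otherwise abort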
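The "retry once" rule for `pg_dump` can be wrapped in a small helper so that a second failure still surfaces as a non-zero exit code. This is a sketch; `retry_once` and `RETRY_DELAY` are hypothetical names, and the 5-second default pause is an assumption. It only automates the retry, so the network, permission, and disk-space checks still need to happen between attempts.

```bash
#!/usr/bin/env bash
# Sketch: retry_once <command> [args...]
# Runs the command; on failure waits RETRY_DELAY seconds (default 5)
# and runs it exactly one more time. Returns the final exit code.
retry_once() {
  local delay="${RETRY_DELAY:-5}"
  if "$@"; then
    return 0
  fi
  echo "first attempt failed, retrying in ${delay}s: $*" >&2
  sleep "$delay"
  "$@"
}
```

For example, export `PGPASSWORD` first and then prefix the dump command from 3.2: `retry_once pg_dump -h "$QY_DB_HOST" ...`.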
---
## 9. Evidence Requirement
Archive every drill to:
- `docs/release-evidence/<date>.md`
Minimum contents:
1. Operator and time window
2. Commands and exit codes
3. Output of the core validation SQL
4. healthz/readyz results
5. Conclusion (PASS/FAIL)
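A skeleton evidence file with the five required items can be generated so that nothing is forgotten. This is a sketch; `write_evidence` is a hypothetical helper, and the `YYYY-MM-DD` file name is an assumption because the runbook does not specify the format of `<date>`.

```bash
#!/usr/bin/env bash
# Sketch: write_evidence <dir> <operator> <time-window> <PASS|FAIL>
# Creates <dir>/<YYYY-MM-DD>.md with the five required sections and
# prints the path of the file it wrote.
write_evidence() {
  local dir="$1" operator="$2" window="$3" result="$4"
  local file
  file="$dir/$(date +%Y-%m-%d).md"
  mkdir -p "$dir"
  cat >"$file" <<EOF
# Backup / Restore Drill Evidence

1. Operator: $operator, time window: $window
2. Commands and exit codes: (paste here)
3. Core validation SQL output: (paste here)
4. healthz/readyz results: (paste here)
5. Conclusion: $result
EOF
  echo "$file"
}
```

For example: `write_evidence docs/release-evidence "<operator>" "<start>-<end>" PASS`, then fill in items 2 to 4 from the drill logs.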