Merge download workflow into run script

2026-02-02 18:26:25 +08:00
parent 047e5f2766
commit 8831882462
3 changed files with 332 additions and 419 deletions
--- a/AGENTS.md
+++ b/AGENTS.md
@@ -6,7 +6,7 @@ This repository has no existing conventions. These guidelines apply to all work
 ## Network capture and assets
 - Resource scope is limited to: audio, video, images, and viewport screenshots; do not collect other asset types unless explicitly requested.
 - When a task asks to harvest page resources, prioritize what the user requests (e.g., only media or only core assets). Ask for scope if unclear.
- If the user provides a URL like `https://h5.rrx.cn/storeview/<page-id>.html`, extract `<page-id>`. Open a blank tab first, apply viewport override (width 390, height 844, devicePixelRatio 3, mobile: true, hasTouch: true), then navigate that tab to `https://ca.rrx.cn/v/<page-id>?rrxsrc=2&iframe=1&tpl=1`. Equivalent automation: call DevTools/Emulation to override device metrics with `{width:390,height:844,deviceScaleFactor:3, mobile:true, hasTouch:true}` before navigation to avoid double-loading assets.
+- If the user provides a URL like `https://h5.rrx.cn/storeview/<page-id>.html`, extract `<page-id>`. Open a blank tab first, apply viewport override (width 375, height 667, devicePixelRatio 3, mobile: true, hasTouch: true), then navigate that tab to `https://ca.rrx.cn/v/<page-id>?rrxsrc=2&iframe=1&tpl=1`. Equivalent automation: call DevTools/Emulation to override device metrics with `{width:375,height:667,deviceScaleFactor:3, mobile:true, hasTouch:true}` before navigation to avoid double-loading assets.
 - Use DevTools network captures to list requests; identify media by MIME or URL suffix.
 - Save assets under `downloads/<date>-<title>-<page-id>/media/` (title from current page; date format `YYYYMMDD`) with clean filenames (strip query strings and `@!` size suffixes; keep proper extensions). After download, rename any files still containing size tokens or missing extensions to the original base name + proper extension.
 - Also save the source page URL(s) provided by the user into the folder root as `downloads/<date>-<title>-<page-id>/urls.txt`.
@@ -15,16 +15,15 @@ This repository has no existing conventions. These guidelines apply to all work
 - After collecting all requested resources and screenshots, close any additional tabs/pages opened for capture. This is mandatory; do not leave capture tabs open.

 ## Download script usage
- Primary workflow: run `node run.mjs <page-url>` to capture network requests, screenshot, and download media in one step. This script uses Playwright Chromium to open a browser with mobile viewport (390×844 @ dpr 3), navigate to the page, capture audio/video/image URLs, take a viewport screenshot, then call `download.mjs` to batch download assets. For remote debugging, pass `--cdp ws://host:port/devtools/browser/<id>` or `--cdp http://host:port` (or set `ENV_CDP`) to resolve and connect to a Chrome DevTools endpoint.
- For manual control: use `python download.py --page-id <id> --title "<title>" --urls urls.txt --sources source_urls.txt` (Python) or `node download.mjs --page-id <id> --title "<title>" --urls urls.txt --sources source_urls.txt` (Node.js) to batch download assets. The script generates `<date>` using format `YYYYMMDD`.
+- Primary workflow: run `node run.mjs <page-url>` to capture network requests, screenshot, and download media in one step. This script uses Playwright Chromium to open a browser with mobile viewport (375×667 @ dpr 3), navigate to the page, capture audio/video/image URLs, take a viewport screenshot, then download assets directly (no separate `download.mjs`). For remote debugging, pass `--cdp ws://host:port/devtools/browser/<id>` or `--cdp http://host:port` (or set `ENV_CDP`) to resolve and connect to a Chrome DevTools endpoint.
+- For manual control: use `python download.py --page-id <id> --title "<title>" --urls urls.txt --sources source_urls.txt` to batch download assets. The script generates `<date>` using format `YYYYMMDD`.
 - `urls.txt` should list the target asset URLs (one per line) already filtered to the requested scope (e.g., media only).
 - Downloads go to `downloads/<date>-<title>-<page-id>/media/`; filenames are cleaned (query/`@!` removed) and extensions retained/guessed; duplicates get numeric suffixes.
 - After the batch finishes, the script deletes 0-byte files, compares against the planned list, retries missing items up to 2 times, and reports any still-missing resources.
 - `urls.txt` is written to `downloads/<date>-<title>-<page-id>/urls.txt` to record user-provided page URLs.
- The script also deletes the `--urls` input file upon completion.

 ## Screenshots
- Default viewport for screenshots: width 390, height 844, devicePixelRatio 3 (mobile portrait). Do not change unless the user explicitly requests another size.
+- Default viewport for screenshots: width 375, height 667, devicePixelRatio 3 (mobile portrait). Do not change unless the user explicitly requests another size.
 - Match the screenshot to the user’s requested viewport. If they mention a size, emulate it and verify with `window.innerWidth/innerHeight` and `devicePixelRatio`.
 - Capture screenshots with Chrome DevTools (device emulation per above) and save to `downloads/<date>-<title>-<page-id>/index.png` (title from current page; date format `YYYYMMDD`); use full-page only when explicitly asked.