Files
rrx-downloader/AGENTS.md
Rogee dc6cb1513b Add Node.js workflow implementation with Puppeteer integration
- Introduce run.mjs to orchestrate full workflow: Puppeteer opens a browser with mobile viewport (390x844 @ dpr 3), navigates to capture URL, records network requests for media, takes a viewport screenshot, then calls download.mjs to batch download assets.
- Implement download.mjs to handle asset downloading with same behavior as Python version (retry, zero-byte cleanup, unique naming, save urls.txt, delete input file).
- Update AGENTS.md to document the new primary workflow and keep manual download instructions for both Python and Node.js.
- Support title extraction from page title for folder naming.
2026-02-02 17:42:38 +08:00

37 lines
4.2 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# AGENTS
## Scope
This repository has no existing conventions. These guidelines apply to all work in this repo.
## Network capture and assets
- Resource scope is limited to: audio, video, images, and viewport screenshots; do not collect other asset types unless explicitly requested.
- When a task asks to harvest page resources, prioritize what the user requests (e.g., only media or only core assets). Ask for scope if unclear.
- If the user provides a URL like `https://h5.rrx.cn/storeview/<page-id>.html`, extract `<page-id>`. Open a blank tab first, apply viewport override (width 390, height 844, devicePixelRatio 3, mobile: true, hasTouch: true), then navigate that tab to `https://ca.rrx.cn/v/<page-id>?rrxsrc=2&iframe=1&tpl=1`. Equivalent automation: call DevTools/Emulation to override device metrics with `{width:390,height:844,deviceScaleFactor:3, mobile:true, hasTouch:true}` before navigation to avoid double-loading assets.
- Use DevTools network captures to list requests; identify media by MIME or URL suffix.
- Save assets under `downloads/<date>-<title>-<page-id>/media/` (title from current page; date format `YYYYMMDD`) with clean filenames (strip query strings and `@!` size suffixes; keep proper extensions). After download, rename any files still containing size tokens or missing extensions to the original base name + proper extension.
- Also save the source page URL(s) provided by the user into the folder root as `downloads/<date>-<title>-<page-id>/urls.txt`.
- Prefer direct downloads (e.g., curl) if DevTools bodies are unavailable or truncated.
- After batch downloading, delete any 0-byte files, verify against the planned download list, and retry missing items up to 2 times; if still failing, stop and report the missing resources.
- After collecting all requested resources and screenshots, close any additional tabs/pages opened for capture. This is mandatory; do not leave capture tabs open.
## Download script usage
- Primary workflow: run `node run.mjs <page-url>` to capture network requests, screenshot, and download media in one step. This script uses Puppeteer to open a browser with mobile viewport (390×844 @ dpr 3), navigate to the page, capture audio/video/image URLs, take a viewport screenshot, then call `download.mjs` to batch download assets.
- For manual control: use `python download.py --page-id <id> --title "<title>" --urls urls.txt --sources source_urls.txt` (Python) or `node download.mjs --page-id <id> --title "<title>" --urls urls.txt --sources source_urls.txt` (Node.js) to batch download assets. The script generates `<date>` using format `YYYYMMDD`.
- `urls.txt` should list the target asset URLs (one per line) already filtered to the requested scope (e.g., media only).
- Downloads go to `downloads/<date>-<title>-<page-id>/media/`; filenames are cleaned (query/`@!` removed) and extensions retained/guessed; duplicates get numeric suffixes.
- After the batch finishes, the script deletes 0-byte files, compares against the planned list, retries missing items up to 2 times, and reports any still-missing resources.
- `urls.txt` is written to `downloads/<date>-<title>-<page-id>/urls.txt` to record user-provided page URLs.
- The script also deletes the `--urls` input file upon completion.
## Screenshots
- Default viewport for screenshots: width 390, height 844, devicePixelRatio 3 (mobile portrait). Do not change unless the user explicitly requests another size.
- Match the screenshot to the users requested viewport. If they mention a size, emulate it and verify with `window.innerWidth/innerHeight` and `devicePixelRatio`.
- Capture screenshots with Chrome DevTools (device emulation per above) and save to `downloads/<date>-<title>-<page-id>/index.png` (title from current page; date format `YYYYMMDD`); use full-page only when explicitly asked.
## Communication and confirmation
- Do not ask for pre-work confirmation; proceed with default scope (media + viewport screenshot) unless the user explicitly specifies otherwise.
- After completion, briefly confirm collected assets (paths + key filenames); do not prompt for extra formats unless the user asks.
## Safety and precision
- Avoid downloading unrequested resources. If download failures occur, retry and report any missing items clearly.