- Introduce run.mjs to orchestrate the full workflow: Puppeteer opens a browser with a mobile viewport (390x844 @ dpr 3), navigates to the capture URL, records network requests for media, takes a viewport screenshot, then calls download.mjs to batch download assets.
- Implement download.mjs to handle asset downloading with the same behavior as the Python version (retry, zero-byte cleanup, unique naming, save urls.txt, delete input file).
- Update AGENTS.md to document the new primary workflow and keep manual download instructions for both Python and Node.js.
- Support title extraction from the page title for folder naming.
AGENTS
Scope
This repository has no existing conventions. These guidelines apply to all work in this repo.
Network capture and assets
- Resource scope is limited to: audio, video, images, and viewport screenshots; do not collect other asset types unless explicitly requested.
- When a task asks to harvest page resources, prioritize what the user requests (e.g., only media or only core assets). Ask for scope if unclear.
- If the user provides a URL like `https://h5.rrx.cn/storeview/<page-id>.html`, extract `<page-id>`. Open a blank tab first, apply viewport override (width 390, height 844, devicePixelRatio 3, mobile: true, hasTouch: true), then navigate that tab to `https://ca.rrx.cn/v/<page-id>?rrxsrc=2&iframe=1&tpl=1`. Equivalent automation: call DevTools/Emulation to override device metrics with `{width:390,height:844,deviceScaleFactor:3, mobile:true, hasTouch:true}` before navigation to avoid double-loading assets.
- Use DevTools network captures to list requests; identify media by MIME or URL suffix.
- Save assets under `downloads/<date>-<title>-<page-id>/media/` (title from current page; date format `YYYYMMDD`) with clean filenames (strip query strings and `@!` size suffixes; keep proper extensions). After download, rename any files still containing size tokens or missing extensions to the original base name + proper extension.
- Also save the source page URL(s) provided by the user into the folder root as `downloads/<date>-<title>-<page-id>/urls.txt`.
- Prefer direct downloads (e.g., curl) if DevTools bodies are unavailable or truncated.
- After batch downloading, delete any 0-byte files, verify against the planned download list, and retry missing items up to 2 times; if still failing, stop and report the missing resources.
- After collecting all requested resources and screenshots, close any additional tabs/pages opened for capture. This is mandatory; do not leave capture tabs open.
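The URL handling and media filtering described above can be sketched in Python as follows. This is a minimal illustration, not the code in `run.mjs` or `download.py`: the function names are hypothetical, and the suffix list is an assumption about which extensions count as audio, video, or image.

```python
import re
from urllib.parse import urlparse

# Assumed suffix and MIME lists; adjust to the scope the user actually requests.
MEDIA_SUFFIXES = (".mp3", ".m4a", ".wav", ".mp4", ".webm",
                  ".jpg", ".jpeg", ".png", ".gif", ".webp", ".svg")
MEDIA_MIME_PREFIXES = ("audio/", "video/", "image/")


def extract_page_id(page_url):
    """Pull <page-id> out of https://h5.rrx.cn/storeview/<page-id>.html."""
    match = re.fullmatch(r"/storeview/([^/]+)\.html", urlparse(page_url).path)
    if match is None:
        raise ValueError(f"not a storeview URL: {page_url}")
    return match.group(1)


def capture_url(page_id):
    """Build the URL the capture tab navigates to after the viewport override."""
    return f"https://ca.rrx.cn/v/{page_id}?rrxsrc=2&iframe=1&tpl=1"


def is_media(url, mime=""):
    """Classify a captured request as media by MIME type first, then URL suffix."""
    if mime.lower().startswith(MEDIA_MIME_PREFIXES):
        return True
    # urlparse drops the query string; the @! size suffix stays in the path.
    path = urlparse(url).path.lower().split("@!", 1)[0]
    return path.endswith(MEDIA_SUFFIXES)
```

Checking the MIME type before the suffix lets extensionless URLs still be recognized when the response declares a media content type.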
Download script usage
- Primary workflow: run `node run.mjs <page-url>` to capture network requests, screenshot, and download media in one step. This script uses Puppeteer to open a browser with mobile viewport (390×844 @ dpr 3), navigate to the page, capture audio/video/image URLs, take a viewport screenshot, then call `download.mjs` to batch download assets.
- For manual control: use `python download.py --page-id <id> --title "<title>" --urls urls.txt --sources source_urls.txt` (Python) or `node download.mjs --page-id <id> --title "<title>" --urls urls.txt --sources source_urls.txt` (Node.js) to batch download assets. The script generates `<date>` using format `YYYYMMDD`.
- `urls.txt` should list the target asset URLs (one per line) already filtered to the requested scope (e.g., media only).
- Downloads go to `downloads/<date>-<title>-<page-id>/media/`; filenames are cleaned (query/`@!` removed) and extensions retained/guessed; duplicates get numeric suffixes.
- After the batch finishes, the script deletes 0-byte files, compares against the planned list, retries missing items up to 2 times, and reports any still-missing resources.
- `urls.txt` is written to `downloads/<date>-<title>-<page-id>/urls.txt` to record user-provided page URLs.
- The script also deletes the `--urls` input file upon completion.
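The filename rules above (strip the query string and `@!` size suffix, keep or guess the extension, add a numeric suffix to duplicates) might look roughly like this. It is a sketch under assumptions, not the logic of `download.py` or `download.mjs`: the helper names, the `asset` fallback name, and the `-1`, `-2` suffix format are all hypothetical.

```python
import mimetypes
import posixpath
from urllib.parse import unquote, urlparse


def clean_filename(url, content_type=""):
    """Derive a clean filename from an asset URL."""
    path = urlparse(url).path                 # query string and fragment are dropped
    name = posixpath.basename(unquote(path))
    name = name.split("@!", 1)[0]             # drop the @! size suffix
    if not name:
        name = "asset"                        # assumed fallback for bare directory URLs
    stem, ext = posixpath.splitext(name)
    if not ext and content_type:
        # Guess an extension from the Content-Type header when the URL has none.
        mime = content_type.split(";", 1)[0].strip()
        ext = mimetypes.guess_extension(mime) or ""
    return stem + ext


def unique_name(name, taken):
    """Return a name not yet in `taken`, adding -1, -2, ... before the extension."""
    stem, ext = posixpath.splitext(name)
    candidate, n = name, 1
    while candidate in taken:
        candidate = f"{stem}-{n}{ext}"
        n += 1
    taken.add(candidate)
    return candidate
```

Keeping the set of names already used in `taken` means two URLs that clean to the same filename never overwrite each other within one batch.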
Screenshots
- Default viewport for screenshots: width 390, height 844, devicePixelRatio 3 (mobile portrait). Do not change unless the user explicitly requests another size.
- Match the screenshot to the user’s requested viewport. If they mention a size, emulate it and verify with `window.innerWidth`/`innerHeight` and `devicePixelRatio`.
- Capture screenshots with Chrome DevTools (device emulation per above) and save to `downloads/<date>-<title>-<page-id>/index.png` (title from current page; date format `YYYYMMDD`); use full-page only when explicitly asked.
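For reference, the output layout shared by the screenshot, media folder, and `urls.txt` can be derived as below. This is only a sketch: `capture_dir` is a hypothetical helper, and replacing whitespace and path-unsafe characters in the title with underscores is an assumption that the guidelines do not specify.

```python
import re
from datetime import date
from pathlib import Path


def capture_dir(title, page_id, today=None, root="downloads"):
    """Build downloads/<date>-<title>-<page-id> with <date> formatted as YYYYMMDD."""
    today = today or date.today()
    # Assumed sanitization: collapse whitespace and path-unsafe characters.
    safe_title = re.sub(r'[\\/:*?"<>|\s]+', "_", title).strip("_") or "untitled"
    return Path(root) / f"{today:%Y%m%d}-{safe_title}-{page_id}"


folder = capture_dir("Spring Sale", "abc123", date(2024, 5, 1))
screenshot_path = folder / "index.png"   # viewport screenshot
media_dir = folder / "media"             # downloaded assets
urls_file = folder / "urls.txt"          # user-provided page URLs
```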
Communication and confirmation
- Do not ask for pre-work confirmation; proceed with default scope (media + viewport screenshot) unless the user explicitly specifies otherwise.
- After completion, briefly confirm collected assets (paths + key filenames); do not prompt for extra formats unless the user asks.
Safety and precision
- Avoid downloading unrequested resources. If download failures occur, retry and report any missing items clearly.
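The retry-and-report behavior can be illustrated with the following sketch. It assumes a caller-supplied `fetch(url, dest)` callable and a `planned` mapping of URL to destination path; both names are hypothetical and do not reflect the internals of the actual scripts.

```python
import os


def download_with_retries(planned, fetch, max_retries=2):
    """Download every URL in `planned` (URL -> destination path).

    Returns the URLs that are still missing after the retries.
    """
    def missing():
        result = []
        for url, dest in planned.items():
            if os.path.exists(dest) and os.path.getsize(dest) == 0:
                os.remove(dest)               # delete 0-byte files
            if not os.path.exists(dest):
                result.append(url)
        return result

    def attempt(urls):
        for url in urls:
            try:
                fetch(url, planned[url])
            except OSError:
                pass                          # the verification pass reports it

    attempt(list(planned))
    pending = missing()                       # verify against the planned list
    for _ in range(max_retries):
        if not pending:
            break
        attempt(pending)
        pending = missing()
    return pending
```

Passing `urllib.request.urlretrieve` as `fetch` is one option, since it accepts a URL and a filename and its network errors derive from `OSError`. A non-empty return value is the list to report to the user as missing.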