Add Node.js workflow implementation with Puppeteer integration
- Introduce run.mjs to orchestrate the full workflow: Puppeteer opens a browser with a mobile viewport (390x844 @ dpr 3), navigates to the capture URL, records network requests for media, takes a viewport screenshot, then calls download.mjs to batch download assets.
- Implement download.mjs to handle asset downloading with the same behavior as the Python version (retry, zero-byte cleanup, unique naming, save urls.txt, delete input file).
- Update AGENTS.md to document the new primary workflow and keep manual download instructions for both Python and Node.js.
- Support extracting the page title for folder naming.
@@ -15,11 +15,13 @@ This repository has no existing conventions. These guidelines apply to all work
- After collecting all requested resources and screenshots, close any additional tabs/pages opened for capture. This is mandatory; do not leave capture tabs open.
## Download script usage
- Use `python download.py --page-id <id> --title "<title>" --urls urls.txt --sources source_urls.txt` to batch download assets. The script generates `<date>` using format `YYYYMMDD`.
- Primary workflow: run `node run.mjs <page-url>` to capture network requests, screenshot, and download media in one step. This script uses Puppeteer to open a browser with mobile viewport (390×844 @ dpr 3), navigate to the page, capture audio/video/image URLs, take a viewport screenshot, then call `download.mjs` to batch download assets.
- For manual control: use `python download.py --page-id <id> --title "<title>" --urls urls.txt --sources source_urls.txt` (Python) or `node download.mjs --page-id <id> --title "<title>" --urls urls.txt --sources source_urls.txt` (Node.js) to batch download assets. The script generates `<date>` using format `YYYYMMDD`.
- `urls.txt` should list the target asset URLs (one per line) already filtered to the requested scope (e.g., media only).
- Downloads go to `downloads/<date>-<title>-<page-id>/media/`; filenames are cleaned (query/`@!` removed) and extensions retained/guessed; duplicates get numeric suffixes.
- After the batch finishes, the script deletes 0-byte files, compares against the planned list, retries missing items up to 2 times, and reports any still-missing resources.
- `urls.txt` is written to `downloads/<date>-<title>-<page-id>/urls.txt` to record user-provided page URLs.
- The script also deletes the `--urls` input file upon completion.
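
The naming and verification rules above can be sketched as a few pure functions. This is only an illustration under stated assumptions, not the code in `download.mjs`: the helper names (`formatDate`, `mediaDir`, `cleanFileName`, `uniqueName`, `findMissing`), the `-1`, `-2` suffix style, and the `file` fallback name are all hypothetical.

```javascript
// Hypothetical helpers illustrating the documented rules.
// None of these names are taken from download.mjs.

// Format a Date as YYYYMMDD using local time.
function formatDate(d) {
  const y = d.getFullYear();
  const m = String(d.getMonth() + 1).padStart(2, "0");
  const day = String(d.getDate()).padStart(2, "0");
  return `${y}${m}${day}`;
}

// Build downloads/<date>-<title>-<page-id>/media.
function mediaDir(date, title, pageId) {
  return `downloads/${formatDate(date)}-${title}-${pageId}/media`;
}

// Derive a file name from a URL: the query string is dropped by taking
// only the path, and anything from "@!" onward is cut off.
function cleanFileName(rawUrl) {
  const { pathname } = new URL(rawUrl);
  const last = pathname.split("/").pop() || "";
  const name = last.split("@!")[0];
  return name || "file"; // fallback name is an assumption
}

// Return a name not yet in `taken`, adding a numeric suffix before the
// extension on collisions (suffix format "-N" is an assumption).
function uniqueName(name, taken) {
  if (!taken.has(name)) {
    taken.add(name);
    return name;
  }
  const dot = name.lastIndexOf(".");
  const stem = dot > 0 ? name.slice(0, dot) : name;
  const ext = dot > 0 ? name.slice(dot) : "";
  let n = 1;
  let candidate = `${stem}-${n}${ext}`;
  while (taken.has(candidate)) {
    n += 1;
    candidate = `${stem}-${n}${ext}`;
  }
  taken.add(candidate);
  return candidate;
}

// Given the planned file names and a Map of name -> size in bytes for
// files found on disk, list the names that are absent or 0 bytes.
function findMissing(planned, sizes) {
  return planned.filter((name) => !(sizes.get(name) > 0));
}
```

A retry loop could call `findMissing` after each pass and stop once the list is empty or two retries have been used, reporting whatever remains.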
## Screenshots
- Default viewport for screenshots: width 390, height 844, devicePixelRatio 3 (mobile portrait). Do not change unless the user explicitly requests another size.
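
As a rough sketch of how this default could be expressed for Puppeteer: `page.setViewport` takes `width`, `height`, and `deviceScaleFactor`, which is Puppeteer's name for devicePixelRatio. The constant and function names below are hypothetical, and the `isMobile` and `hasTouch` flags are assumptions that the guideline itself does not specify.

```javascript
// Default capture viewport from the guideline: 390x844 at devicePixelRatio 3.
const DEFAULT_VIEWPORT = Object.freeze({
  width: 390,
  height: 844,
  deviceScaleFactor: 3, // Puppeteer's option name for devicePixelRatio
  isMobile: true, // assumption: emulate a mobile device
  hasTouch: true, // assumption: enable touch events
});

// Apply the default viewport to a Puppeteer Page. `override` is for the
// case where the user explicitly requests another size.
function applyViewport(page, override = {}) {
  return page.setViewport({ ...DEFAULT_VIEWPORT, ...override });
}

// Pixel dimensions of a viewport screenshot taken with these settings.
function screenshotPixels(vp = DEFAULT_VIEWPORT) {
  return {
    width: vp.width * vp.deviceScaleFactor,
    height: vp.height * vp.deviceScaleFactor,
  };
}
```

With the default settings, a viewport screenshot comes out at 1170×2532 pixels, which is a quick way to check that the viewport was applied before capture.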