quiz-crawler
Extract all questions/answer options from marketing quiz funnels and record final redirect/product URL + screenshots. Use when a user gives a quiz URL and asks to collect questions/answers or capture the offer page. Use output format specified in assets/output.md
SKILL.md
| Name | quiz-crawler |
| Description | Extract all questions/answer options from marketing quiz funnels and record final redirect/product URL + screenshots. Use when a user gives a quiz URL and asks to collect questions/answers or capture the offer page. Use output format specified in assets/output.md |
name: quiz-crawler description: "Extract all questions/answer options from marketing quiz funnels and record final redirect/product URL + screenshots. Use when a user gives a quiz URL and asks to collect questions/answers or capture the offer page. Use output format specified in assets/output.md"
Workflow (reproducible pipeline)
This skill is for building a swipe database: store raw artifacts + a clean Notion page.
1) Crawl (deterministic path)
Run:
node scripts/crawl.js "<quiz_url>"
Popup handling:
- Auto-accepts JS dialogs (alert/confirm/prompt).
- Attempts to click modal CTAs like Continue/Submit/OK when a popup blocks progress.
Optional:
OUT_DIR=/path/to/outputMAX_STEPS=60(default 40)MAX_TIME_MS=420000(default 300000)JPG_QUALITY=80FILL_FORMS=true|false(default true)
Artifacts written to OUT_DIR:
results.json(steps + pickedAnswer + stopReason + finalUrl)step-XX.jpg(screenshot per interaction)step-XX.html(HTML snapshot when available)
2) Sync artifacts to R2 (deterministic)
Run:
SRC_DIR=<OUT_DIR> node scripts/r2_sync.js
Required env:
R2_ENDPOINTR2_BUCKETR2_PUBLIC_BASE(e.g.https://pub-....r2.dev)R2_ACCESS_KEY_IDR2_SECRET_ACCESS_KEY
Optional env:
R2_PREFIX(defaultquiz)RUN_ID(otherwise derived fromSRC_DIR)QUIZ_KEY(stable key for the funnel; otherwise derived fromSRC_DIR)INCLUDE_HTML=true|false(default true)
Output:
- writes
r2-manifest.jsonintoSRC_DIRwith public URLs.
Setup (minimal deps)
-
OCR uses the system
tesseractbinary (no JS OCR deps). -
If
tesseractis missing, install it:- Ubuntu/Debian:
sudo apt-get update && sudo apt-get install -y tesseract-ocr - macOS (brew):
brew install tesseract
- Ubuntu/Debian:
-
R2 sync requires these env vars. Recommended: store them in a skill-local
.envfile at~/skills/quiz-crawler/.env(the scripts auto-load it).R2_ENDPOINTR2_BUCKETR2_PUBLIC_BASER2_ACCESS_KEY_IDR2_SECRET_ACCESS_KEY
-
Notion publish requires:
NOTION_API_KEYNOTION_PARENT_PAGE_ID
.env notes:
- Lines like
KEY=valueorKEY="value". - Do not commit
.env(secrets). If you package/share the skill, exclude.env.
3) Extract Q/A via OCR (quick) + Publish Notion page
Run:
SRC_DIR=<OUT_DIR> node scripts/ocr_extract.js(writesocr.json)SRC_DIR=<OUT_DIR> node scripts/qa_extract.js(writesqa.json: OCR question + DOM options)SRC_DIR=<OUT_DIR> BRAND=<BrandName> NOTION_PARENT_PAGE_ID=<page_id> node scripts/publish_notion.js
Fast defaults:
- OCR is used only for the question (line containing
?). - Options come from the crawler’s captured button text (more reliable than OCR).
Required env:
NOTION_API_KEYNOTION_PARENT_PAGE_ID
Output:
- Creates 1 Notion child page under the parent with the format specified in
assets/output.md:Question: ...Answer: ...Screenshot: <external image>
Data handling
- Use sensible dummy data to progress:
- name/email/phone/dates as needed.
- Date picker policy: always pick a random date when a date-picker is present.
Notes:
- This skill currently captures a single deterministic path. Different answers can lead to different pages; add branching only when requested.