webcrawler
Documentation harvesting agent for crawling and extracting content from documentation websites. Use for crawling documentation sites and extracting all pages about a subject, building offline knowledge bases from online docs, harvesting API references, tutorials, or guides from documentation portals, creating structured markdown exports from multi-page documentation, and downloading and organizing technical docs for embedding or RAG pipelines. Supports recursive crawling with depth control, content filtering, and structured output.
SKILL.md
| Name | webcrawler |
| Description | Documentation harvesting agent for crawling and extracting content from documentation websites. Use for crawling documentation sites and extracting all pages about a subject, building offline knowledge bases from online docs, harvesting API references, tutorials, or guides from documentation portals, creating structured markdown exports from multi-page documentation, and downloading and organizing technical docs for embedding or RAG pipelines. Supports recursive crawling with depth control, content filtering, and structured output. |
name: webcrawler description: "Documentation harvesting agent for crawling and extracting content from documentation websites. Use for crawling documentation sites and extracting all pages about a subject, building offline knowledge bases from online docs, harvesting API references, tutorials, or guides from documentation portals, creating structured markdown exports from multi-page documentation, and downloading and organizing technical docs for embedding or RAG pipelines. Supports recursive crawling with depth control, content filtering, and structured output."
Webcrawler Skill
Intelligent documentation harvesting agent that recursively crawls documentation websites and extracts structured content about specific subjects.
Last Updated: 2026-01-23
Quick Start
# Crawl Python documentation about async/await
python skills/webcrawler/scripts/crawl_docs.py \
--url "https://docs.python.org/3/library/asyncio.html" \
--subject "asyncio" \
--depth 2 \
--output .tmp/docs/python-asyncio/
# Crawl React documentation
python skills/webcrawler/scripts/crawl_docs.py \
--url "https://react.dev/" \
--subject "React" \
--depth 3 \
--output .tmp/docs/react/
# Extract only API reference pages
python skills/webcrawler/scripts/crawl_docs.py \
--url "https://expressjs.com/en/4x/api.html" \
--subject "Express API" \
--filter "api" \
--output .tmp/docs/express-api/
Core Workflow
- Initialize Crawl — Provide base URL and subject focus
- Discover Pages — Recursively find all linked documentation pages
- Filter Content — Keep only pages matching the subject criteria
- Extract Content — Convert HTML to clean markdown
- Organize Output — Structure files in a navigable hierarchy
- Generate Index — Create a master index with all harvested pages
Scripts
crawl_docs.py — Main Documentation Crawler
The primary crawling script that handles recursive page discovery and content extraction.
python skills/webcrawler/scripts/crawl_docs.py \
--url <base-url> # Starting URL (required)
--subject <topic> # Subject focus for filtering (required)
--output <directory> # Output directory (default: .tmp/crawled/)
--depth <n> # Max crawl depth (default: 2)
--filter <pattern> # URL path filter pattern (optional)
--delay <seconds> # Delay between requests (default: 0.5)
--max-pages <n> # Maximum pages to crawl (default: 100)
--same-domain # Stay within same domain (default: true)
--include-code # Preserve code blocks (default: true)
--format <md|json|both> # Output format (default: both)
Outputs:
index.md— Master index with links to all pagespages/*.md— Individual markdown files per pagemetadata.json— Crawl metadata and page inventorycontent.json— Structured JSON with all extracted content
extract_page.py — Single Page Extractor
Extract content from a single documentation page.
python skills/webcrawler/scripts/extract_page.py \
--url <page-url> # Page to extract (required)
--output <file> # Output file (default: stdout)
--format <md|json> # Output format (default: md)
--include-links # Include internal links (default: true)
filter_docs.py — Post-Crawl Filtering
Filter already-crawled documentation by subject or pattern.
python skills/webcrawler/scripts/filter_docs.py \
--input <crawl-dir> # Crawled docs directory (required)
--subject <topic> # Subject to filter for (required)
--output <directory> # Filtered output directory (required)
--threshold <0.0-1.0> # Relevance threshold (default: 0.3)
Configuration
Rate Limiting & Politeness
The crawler respects robots.txt and implements polite crawling:
- Default delay: 0.5s between requests
- User-Agent: Identifies as documentation harvester
- robots.txt: Honored by default (disable with
--ignore-robots)
Domain Handling
| Mode | Behavior |
|---|---|
--same-domain | Only crawl pages on the starting domain |
--same-path | Only crawl pages under the starting URL path |
--allow-subdomains | Include subdomains (e.g., api.example.com) |
Content Extraction
The crawler uses intelligent content extraction:
- Main content detection — Finds
<main>,<article>, or content containers - Navigation removal — Strips headers, footers, sidebars
- Code preservation — Maintains code blocks with language hints
- Link normalization — Converts relative links to absolute
- Image handling — Optionally downloads and references images
Output Structure
.tmp/docs/<subject>/
├── index.md # Master index with TOC
├── metadata.json # Crawl metadata
├── content.json # Structured JSON export
└── pages/
├── getting-started.md
├── installation.md
├── api-reference.md
├── configuration/
│ ├── basic.md
│ └── advanced.md
└── troubleshooting.md
Index Format
# <Subject> Documentation
> Crawled from: <base-url>
> Pages: <count>
> Date: <timestamp>
## Table of Contents
- [Getting Started](pages/getting-started.md)
- [Installation](pages/installation.md)
- [API Reference](pages/api-reference.md)
- Configuration
- [Basic](pages/configuration/basic.md)
- [Advanced](pages/configuration/advanced.md)
- [Troubleshooting](pages/troubleshooting.md)
Common Workflows
1. Harvest API Documentation
# Crawl API docs with deep recursion
python skills/webcrawler/scripts/crawl_docs.py \
--url "https://api.example.com/docs" \
--subject "Example API" \
--depth 4 \
--filter "/api/" \
--output .tmp/docs/example-api/
2. Build RAG Knowledge Base
# Crawl and export as JSON for embedding
python skills/webcrawler/scripts/crawl_docs.py \
--url "https://docs.example.com" \
--subject "Example Docs" \
--depth 3 \
--format json \
--output .tmp/rag/example/
# The content.json can be fed directly to embedding pipelines
3. Offline Documentation Mirror
# Full documentation harvest
python skills/webcrawler/scripts/crawl_docs.py \
--url "https://docs.kubernetes.io/docs/concepts/" \
--subject "Kubernetes Concepts" \
--depth 5 \
--max-pages 500 \
--include-images \
--output .tmp/docs/k8s-concepts/
4. Focused Topic Extraction
# Crawl, then filter to specific topic
python skills/webcrawler/scripts/crawl_docs.py \
--url "https://developer.hashicorp.com/terraform/docs" \
--subject "Terraform" \
--depth 3 \
--output .tmp/docs/terraform-full/
# Filter to AWS provider only
python skills/webcrawler/scripts/filter_docs.py \
--input .tmp/docs/terraform-full/ \
--subject "AWS Provider" \
--output .tmp/docs/terraform-aws/
Best Practices
Crawling
- Start shallow — Begin with
--depth 1to test, then increase - Use filters — Narrow scope with
--filterpatterns - Set page limits — Use
--max-pagesto prevent runaway crawls - Respect rate limits — Increase
--delayfor slower servers
Content Quality
- Subject focus — Be specific with
--subjectfor better filtering - Review index — Check
index.mdto verify crawl coverage - Post-filter — Use
filter_docs.pyto refine results
Storage
- Use
.tmp/— Store crawled docs in the temp directory - Organize by subject — Create subdirectories per topic
- Version with dates — Add timestamps for recurring crawls
Troubleshooting
| Issue | Cause | Solution |
|---|---|---|
| 403 Forbidden | Blocked by server | Increase delay, check robots.txt |
| Empty pages | JavaScript-rendered content | Use --render-js (requires Playwright) |
| Too many pages | Unbounded crawl | Lower depth, use filters |
| Duplicate content | Same page via multiple URLs | Enabled by default (URL normalization) |
| Missing code blocks | Extraction issue | Check --include-code is enabled |
Dependencies
Required Python packages:
pip install requests beautifulsoup4 html2text lxml
# Optional for JavaScript rendering:
pip install playwright && playwright install
Related Skills
- qdrant-memory — Store crawled docs in vector database for RAG
- pdf-reader — Extract text from PDF documentation
External Resources
- Scrapy Documentation — For complex crawling needs
- html2text — HTML to Markdown conversion
- BeautifulSoup — HTML parsing
AGI Framework Integration
Qdrant Memory Integration
Before executing complex tasks with this skill:
python3 execution/memory_manager.py auto --query "<task summary>"
Decision Tree:
- Cache hit? Use cached response directly — no need to re-process.
- Memory match? Inject
context_chunksinto your reasoning. - No match? Proceed normally, then store results:
python3 execution/memory_manager.py store \
--content "Description of what was decided/solved" \
--type decision \
--tags webcrawler <relevant-tags>
Note: Storing automatically updates both Vector (Qdrant) and Keyword (BM25) indices.
Agent Team Collaboration- Strategy: This skill communicates via the shared memory system.
- Orchestration: Invoked by
orchestratorvia intelligent routing. - Context Sharing: Always read previous agent outputs from memory before starting.
Local LLM Support
When available, use local Ollama models for embedding and lightweight inference:
- Embeddings:
nomic-embed-textvia Qdrant memory system - Lightweight analysis: Local models reduce API costs for repetitive patterns