Agent Skill
2/7/2026

webcrawler

Documentation harvesting agent for crawling and extracting content from documentation websites. Use for crawling documentation sites and extracting all pages about a subject, building offline knowledge bases from online docs, harvesting API references, tutorials, or guides from documentation portals, creating structured markdown exports from multi-page documentation, and downloading and organizing technical docs for embedding or RAG pipelines. Supports recursive crawling with depth control, content filtering, and structured output.

T
techwavedev
2GitHub Stars
1Views
npx skills add techwavedev/agi-agent-kit

SKILL.md

Namewebcrawler
DescriptionDocumentation harvesting agent for crawling and extracting content from documentation websites. Use for crawling documentation sites and extracting all pages about a subject, building offline knowledge bases from online docs, harvesting API references, tutorials, or guides from documentation portals, creating structured markdown exports from multi-page documentation, and downloading and organizing technical docs for embedding or RAG pipelines. Supports recursive crawling with depth control, content filtering, and structured output.

name: webcrawler description: "Documentation harvesting agent for crawling and extracting content from documentation websites. Use for crawling documentation sites and extracting all pages about a subject, building offline knowledge bases from online docs, harvesting API references, tutorials, or guides from documentation portals, creating structured markdown exports from multi-page documentation, and downloading and organizing technical docs for embedding or RAG pipelines. Supports recursive crawling with depth control, content filtering, and structured output."

Webcrawler Skill

Intelligent documentation harvesting agent that recursively crawls documentation websites and extracts structured content about specific subjects.

Last Updated: 2026-01-23


Quick Start

# Crawl Python documentation about async/await
python skills/webcrawler/scripts/crawl_docs.py \
  --url "https://docs.python.org/3/library/asyncio.html" \
  --subject "asyncio" \
  --depth 2 \
  --output .tmp/docs/python-asyncio/

# Crawl React documentation
python skills/webcrawler/scripts/crawl_docs.py \
  --url "https://react.dev/" \
  --subject "React" \
  --depth 3 \
  --output .tmp/docs/react/

# Extract only API reference pages
python skills/webcrawler/scripts/crawl_docs.py \
  --url "https://expressjs.com/en/4x/api.html" \
  --subject "Express API" \
  --filter "api" \
  --output .tmp/docs/express-api/

Core Workflow

  1. Initialize Crawl — Provide base URL and subject focus
  2. Discover Pages — Recursively find all linked documentation pages
  3. Filter Content — Keep only pages matching the subject criteria
  4. Extract Content — Convert HTML to clean markdown
  5. Organize Output — Structure files in a navigable hierarchy
  6. Generate Index — Create a master index with all harvested pages

Scripts

crawl_docs.py — Main Documentation Crawler

The primary crawling script that handles recursive page discovery and content extraction.

python skills/webcrawler/scripts/crawl_docs.py \
  --url <base-url>           # Starting URL (required)
  --subject <topic>          # Subject focus for filtering (required)
  --output <directory>       # Output directory (default: .tmp/crawled/)
  --depth <n>                # Max crawl depth (default: 2)
  --filter <pattern>         # URL path filter pattern (optional)
  --delay <seconds>          # Delay between requests (default: 0.5)
  --max-pages <n>            # Maximum pages to crawl (default: 100)
  --same-domain              # Stay within same domain (default: true)
  --include-code             # Preserve code blocks (default: true)
  --format <md|json|both>    # Output format (default: both)

Outputs:

  • index.md — Master index with links to all pages
  • pages/*.md — Individual markdown files per page
  • metadata.json — Crawl metadata and page inventory
  • content.json — Structured JSON with all extracted content

extract_page.py — Single Page Extractor

Extract content from a single documentation page.

python skills/webcrawler/scripts/extract_page.py \
  --url <page-url>           # Page to extract (required)
  --output <file>            # Output file (default: stdout)
  --format <md|json>         # Output format (default: md)
  --include-links            # Include internal links (default: true)

filter_docs.py — Post-Crawl Filtering

Filter already-crawled documentation by subject or pattern.

python skills/webcrawler/scripts/filter_docs.py \
  --input <crawl-dir>        # Crawled docs directory (required)
  --subject <topic>          # Subject to filter for (required)
  --output <directory>       # Filtered output directory (required)
  --threshold <0.0-1.0>      # Relevance threshold (default: 0.3)

Configuration

Rate Limiting & Politeness

The crawler respects robots.txt and implements polite crawling:

  • Default delay: 0.5s between requests
  • User-Agent: Identifies as documentation harvester
  • robots.txt: Honored by default (disable with --ignore-robots)

Domain Handling

ModeBehavior
--same-domainOnly crawl pages on the starting domain
--same-pathOnly crawl pages under the starting URL path
--allow-subdomainsInclude subdomains (e.g., api.example.com)

Content Extraction

The crawler uses intelligent content extraction:

  1. Main content detection — Finds <main>, <article>, or content containers
  2. Navigation removal — Strips headers, footers, sidebars
  3. Code preservation — Maintains code blocks with language hints
  4. Link normalization — Converts relative links to absolute
  5. Image handling — Optionally downloads and references images

Output Structure

.tmp/docs/<subject>/
├── index.md              # Master index with TOC
├── metadata.json         # Crawl metadata
├── content.json          # Structured JSON export
└── pages/
    ├── getting-started.md
    ├── installation.md
    ├── api-reference.md
    ├── configuration/
    │   ├── basic.md
    │   └── advanced.md
    └── troubleshooting.md

Index Format

# <Subject> Documentation

> Crawled from: <base-url>
> Pages: <count>
> Date: <timestamp>

## Table of Contents

- [Getting Started](pages/getting-started.md)
- [Installation](pages/installation.md)
- [API Reference](pages/api-reference.md)
- Configuration
  - [Basic](pages/configuration/basic.md)
  - [Advanced](pages/configuration/advanced.md)
- [Troubleshooting](pages/troubleshooting.md)

Common Workflows

1. Harvest API Documentation

# Crawl API docs with deep recursion
python skills/webcrawler/scripts/crawl_docs.py \
  --url "https://api.example.com/docs" \
  --subject "Example API" \
  --depth 4 \
  --filter "/api/" \
  --output .tmp/docs/example-api/

2. Build RAG Knowledge Base

# Crawl and export as JSON for embedding
python skills/webcrawler/scripts/crawl_docs.py \
  --url "https://docs.example.com" \
  --subject "Example Docs" \
  --depth 3 \
  --format json \
  --output .tmp/rag/example/

# The content.json can be fed directly to embedding pipelines

3. Offline Documentation Mirror

# Full documentation harvest
python skills/webcrawler/scripts/crawl_docs.py \
  --url "https://docs.kubernetes.io/docs/concepts/" \
  --subject "Kubernetes Concepts" \
  --depth 5 \
  --max-pages 500 \
  --include-images \
  --output .tmp/docs/k8s-concepts/

4. Focused Topic Extraction

# Crawl, then filter to specific topic
python skills/webcrawler/scripts/crawl_docs.py \
  --url "https://developer.hashicorp.com/terraform/docs" \
  --subject "Terraform" \
  --depth 3 \
  --output .tmp/docs/terraform-full/

# Filter to AWS provider only
python skills/webcrawler/scripts/filter_docs.py \
  --input .tmp/docs/terraform-full/ \
  --subject "AWS Provider" \
  --output .tmp/docs/terraform-aws/

Best Practices

Crawling

  1. Start shallow — Begin with --depth 1 to test, then increase
  2. Use filters — Narrow scope with --filter patterns
  3. Set page limits — Use --max-pages to prevent runaway crawls
  4. Respect rate limits — Increase --delay for slower servers

Content Quality

  1. Subject focus — Be specific with --subject for better filtering
  2. Review index — Check index.md to verify crawl coverage
  3. Post-filter — Use filter_docs.py to refine results

Storage

  1. Use .tmp/ — Store crawled docs in the temp directory
  2. Organize by subject — Create subdirectories per topic
  3. Version with dates — Add timestamps for recurring crawls

Troubleshooting

IssueCauseSolution
403 ForbiddenBlocked by serverIncrease delay, check robots.txt
Empty pagesJavaScript-rendered contentUse --render-js (requires Playwright)
Too many pagesUnbounded crawlLower depth, use filters
Duplicate contentSame page via multiple URLsEnabled by default (URL normalization)
Missing code blocksExtraction issueCheck --include-code is enabled

Dependencies

Required Python packages:

pip install requests beautifulsoup4 html2text lxml
# Optional for JavaScript rendering:
pip install playwright && playwright install

Related Skills

  • qdrant-memory — Store crawled docs in vector database for RAG
  • pdf-reader — Extract text from PDF documentation

External Resources

AGI Framework Integration

Qdrant Memory Integration

Before executing complex tasks with this skill:

python3 execution/memory_manager.py auto --query "<task summary>"

Decision Tree:

  • Cache hit? Use cached response directly — no need to re-process.
  • Memory match? Inject context_chunks into your reasoning.
  • No match? Proceed normally, then store results:
python3 execution/memory_manager.py store \
  --content "Description of what was decided/solved" \
  --type decision \
  --tags webcrawler <relevant-tags>

Note: Storing automatically updates both Vector (Qdrant) and Keyword (BM25) indices.

Agent Team Collaboration- Strategy: This skill communicates via the shared memory system.

  • Orchestration: Invoked by orchestrator via intelligent routing.
  • Context Sharing: Always read previous agent outputs from memory before starting.

Local LLM Support

When available, use local Ollama models for embedding and lightweight inference:

  • Embeddings: nomic-embed-text via Qdrant memory system
  • Lightweight analysis: Local models reduce API costs for repetitive patterns
Skills Info
Original Name:webcrawlerAuthor:techwavedev