name: site-harvest
description: Extract complete website content, design system, and assets for rebuilding or migration. Uses Firecrawl for content/CSS extraction, Chrome for visual comparison. Generates theme skill file for rebuild. Triggers: harvest site, scrape website, extract design, clone website, migrate site, copy website design, grab design tokens.
updated: 2025-01-18
user-invocable: true
allowed-tools:
  - Read
  - Write
  - Edit
  - Bash
  - Glob
  - Grep
  - TodoWrite
  - WebFetch
  - mcp__firecrawl__firecrawl_scrape
  - mcp__firecrawl__firecrawl_map
  - mcp__firecrawl__firecrawl_crawl
  - mcp__firecrawl__firecrawl_extract

Site Harvest Skill

Purpose: Extract complete website content, design system, and assets for rebuilding or migration.

Trigger: /site-harvest [url] or "harvest [url]"

Architecture: Token-Efficient Hybrid

Task	Tool	Why
URL Discovery	Firecrawl `map`	Their tokens, instant, free 500 pages
Bulk Content	Firecrawl `crawl`	Their tokens, parallel extraction
Structured Data	Firecrawl `extract`	Schema-based, accurate
Sitemap Analysis	Claude + WebFetch	Quick XML parse, freshness check
Screaming Frog	CSV import	Pre-crawled, comprehensive
Design Screenshots	Firecrawl `scrape`	Branding format extracts design
Side-by-Side Compare	Chrome Tab	Only for old vs new comparison

Token Strategy: Firecrawl does the heavy lifting (their tokens); Chrome is used only for visual comparison (our tokens).

Prerequisites

Firecrawl MCP - Primary extraction engine
Chrome Tab MCP - Only for side-by-side comparison phase
URL sources (one or more):
- Site URL (Firecrawl discovers pages)
- Screaming Frog export (CSV)
- sitemap.xml (auto-detected)

Quick Start

/site-harvest https://example.com

9-Phase Workflow

Phase 1: URL Discovery & Site Structure (FOUNDATION)

⚠️ THIS IS THE BASIS FOR EVERYTHING. Get this wrong = rebuild has gaps.

Ask for Screaming Frog export first (most comprehensive)
Analyse sitemap.xml via WebFetch (freshness, coverage)
Run Firecrawl map (live discoverable pages)
Check AJAX/Pagination (load more, infinite scroll, WordPress API)
Four-way comparison: Sitemap vs Firecrawl vs Screaming Frog vs AJAX
Document site structure (page types, navigation, hierarchy)
Report findings and wait for confirmation

Detailed instructions: See references/url-discovery.md

Save: /url-discovery.json

Phase 2: Content Extraction (Firecrawl)

firecrawl_crawl({
  url: "https://example.com",
  limit: mergedUrlCount, // placeholder: total URLs after merging the Phase 1 sources
  scrapeOptions: {
    formats: ["markdown", "html", "links"],
    onlyMainContent: true
  }
})

For each page, save:

/pages/[slug].md (clean markdown)
/pages/[slug].json (structured: title, meta, headings)

Extract media references (images, videos, documents).

Save: /content-manifest.json
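
As a rough sketch of the per-page bookkeeping described above, the snippet below derives a file slug from a page URL and pulls media references out of the crawled markdown. The helper names, the extension list, and the "index" fallback for the homepage are assumptions made for illustration; the skill itself does not prescribe them.

```python
import re
from urllib.parse import urlsplit

# Assumed set of extensions that count as media or documents.
MEDIA_EXTENSIONS = (".jpg", ".jpeg", ".png", ".gif", ".webp", ".svg", ".mp4", ".pdf")


def slug_for(url: str) -> str:
    """Turn a page URL into a filesystem-safe slug ("index" for the root)."""
    path = urlsplit(url).path.strip("/")
    slug = re.sub(r"[^a-z0-9]+", "-", path.lower()).strip("-")
    return slug or "index"


def media_references(markdown: str) -> list:
    """Collect link and image targets that point at media files, in order."""
    targets = re.findall(r"\]\(\s*([^)\s]+)", markdown)
    found = []
    for target in targets:
        is_media = urlsplit(target).path.lower().endswith(MEDIA_EXTENSIONS)
        if is_media and target not in found:
            found.append(target)
    return found


sample = (
    "![Logo](/img/logo.svg) See the "
    "[brochure](https://example.com/files/brochure.pdf?v=2) or [about](/about)."
)
print(slug_for("https://example.com/Services/Web-Design/"))
print(media_references(sample))
```

Checking the extension against the URL path rather than the full string means a versioned asset such as `brochure.pdf?v=2` is still recognised.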

Phase 3: Design System Extraction

// Branding extraction
firecrawl_scrape({
  url: "https://example.com",
  formats: ["branding"]
})

// Full CSS capture
firecrawl_scrape({
  url: "https://example.com",
  formats: ["html", "rawHtml"]
})

Parse all stylesheet URLs and download CSS files
Extract CSS variables (--color-*, --font-*, --spacing-*)
Capture typography scale (h1-h6, p, small)

Save: /design-tokens.json
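
One way the CSS variable step could work is sketched below, assuming the stylesheets have already been downloaded as text. The grouping into `color`, `font`, `spacing`, and `other` is an assumption about how design-tokens.json might be organised, and a regex is only an approximation: values containing `;` or `}` (for example inside data URIs) would need a real CSS parser.

```python
import re

# A custom property declaration: "--name: value" not preceded by a word
# character, so BEM selectors such as ".btn--primary:hover" are ignored.
CUSTOM_PROPERTY = re.compile(r"(?<![\w-])(--[\w-]+)\s*:\s*([^;}]+)")


def extract_tokens(css: str) -> dict:
    """Group CSS custom properties by their --color-/--font-/--spacing- prefix."""
    css = re.sub(r"/\*.*?\*/", "", css, flags=re.DOTALL)  # strip comments
    groups = {"color": {}, "font": {}, "spacing": {}, "other": {}}
    for name, value in CUSTOM_PROPERTY.findall(css):
        for prefix in ("color", "font", "spacing"):
            if name.startswith(f"--{prefix}-"):
                groups[prefix][name] = value.strip()
                break
        else:
            groups["other"][name] = value.strip()
    return groups


sample_css = """
:root {
  --color-primary: #0055ff;
  --font-heading: "Inter", sans-serif;
  --spacing-lg: 2rem;
  /* --color-old: red; */
}
.btn--primary:hover { color: var(--color-primary); }
"""
tokens = extract_tokens(sample_css)
print(tokens["color"])
print(tokens["font"])
print(tokens["spacing"])
```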

Phase 4: Component Style Catalogue (EXHAUSTIVE)

Capture EVERY visual pattern - this prevents "footer links unstyled" problems.

Extract computed styles for:

Navigation (header, links, mobile menu, dropdowns)
Footer (container, columns, links, social icons)
Typography (headings, paragraphs, lists, blockquotes)
Buttons & CTAs (primary, secondary, ghost, hover states)
Sections (padding, backgrounds, alternating patterns)
Dividers (hr, borders, SVG waves, clip-path angles)
Cards (container, hover, image, content)
Icons (download exact SVGs, don't substitute!)
Forms (inputs, labels, error states)

Detailed element list: See references/component-styles.md

Save: /component-styles.json

Phase 5: Visual Capture (Screenshots)

firecrawl_scrape({
  url: "https://example.com",
  formats: [
    { type: "screenshot", fullPage: true, viewport: { width: 1920, height: 1080 } },
    { type: "screenshot", fullPage: true, viewport: { width: 768, height: 1024 } },
    { type: "screenshot", fullPage: true, viewport: { width: 375, height: 812 } }
  ]
})

Screenshot key pages (homepage, about, services, blog, contact) and components (header, footer, hero, cards, dividers).

Save to: /screenshots/

Phase 6: Asset Download

Images: Download all, maintain folder structure → /media/
Fonts: Parse @font-face, download all formats → /assets/fonts/
Icons: Extract inline SVGs exactly, download external SVGs → /assets/icons/
JavaScript: Download external JS, note inline scripts → /assets/scripts/
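
For the font step, parsing @font-face rules might look something like the sketch below. It assumes the CSS text and the URL it was fetched from are both at hand; the function name and the choice to skip embedded `data:` fonts are illustrative assumptions, and the actual download is left out.

```python
import re
from urllib.parse import urljoin

FONT_FACE = re.compile(r"@font-face\s*\{(.*?)\}", re.DOTALL | re.IGNORECASE)
URL_REF = re.compile(r"url\(\s*['\"]?([^'\")]+)['\"]?\s*\)", re.IGNORECASE)


def font_urls(css: str, stylesheet_url: str) -> list:
    """Return absolute URLs for every font file referenced in @font-face rules."""
    found = []
    for block in FONT_FACE.findall(css):
        for ref in URL_REF.findall(block):
            if ref.startswith("data:"):
                continue  # embedded font, nothing to download
            # Relative paths resolve against the stylesheet, not the page.
            absolute = urljoin(stylesheet_url, ref.strip())
            if absolute not in found:
                found.append(absolute)
    return found


sample_css = """
@font-face {
  font-family: "Inter";
  src: url("../fonts/inter.woff2") format("woff2"),
       url(../fonts/inter.woff) format("woff");
}
"""
fonts = font_urls(sample_css, "https://example.com/assets/css/main.css")
print(fonts)
```

Resolving against the stylesheet URL matters because `../fonts/` is relative to the CSS file's location; resolving against the page URL is a common source of 404s.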

Phase 7: Theme Skill Generation

Generate: /[project-name]-theme.md

Document:

Brand identity (colours with hex + Tailwind classes)
Typography (fonts, sizes, weights)
Section patterns (padding, backgrounds, dividers)
Component specs (buttons, cards, links)
Layout patterns (grids, flexbox)
Special elements (wave SVGs, icons)
Tailwind classes reference

Template: See references/theme-generation.md

This is the single source of truth during rebuild.

Phase 8: Manifest Generation

Generate comprehensive manifest.json with:

Harvest metadata (URL, date, tool version)
URL discovery results (all sources compared)
Pages list with files
Design assets references
Screenshots index
Warnings and flags

Example manifest: See references/output-structure.md
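
Purely as an illustration of how the listed items could be assembled, the sketch below builds a manifest as a plain dictionary. Every key name, the `"unknown"` tool version, and all sample values are hypothetical; the authoritative layout is the example in references/output-structure.md.

```python
import json


def build_manifest(site_url, harvest_date, discovery, pages, screenshots, warnings):
    """Assemble the manifest as a plain dict (all key names are hypothetical)."""
    return {
        "harvest": {"url": site_url, "date": harvest_date, "tool_version": "unknown"},
        "url_discovery": discovery,
        "pages": [
            {
                "url": page["url"],
                "files": [f"pages/{page['slug']}.md", f"pages/{page['slug']}.json"],
            }
            for page in pages
        ],
        "design": {
            "tokens": "design-tokens.json",
            "components": "component-styles.json",
        },
        "screenshots": sorted(screenshots),
        "warnings": warnings,
    }


# Illustrative values only, not real harvest results.
manifest = build_manifest(
    site_url="https://example.com",
    harvest_date="2025-01-18",
    discovery={"sitemap": 42, "firecrawl": 45, "screaming_frog": 47, "ajax": 3},
    pages=[{"url": "https://example.com/about", "slug": "about"}],
    screenshots=["screenshots/home-desktop.png"],
    warnings=["Font file blocked: manual download may be needed"],
)
print(json.dumps(manifest, indent=2))
```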

Phase 9: Side-by-Side Comparison (Chrome Tab)

Only runs when BOTH old and new sites exist.

Open both sites in Chrome tabs
Screenshot both at same viewport
Compare: header, hero, sections, footer, dividers, icons
Flag mismatches with specifics
Generate comparison report

Detailed workflow: See references/rebuild-workflow.md

Critical Rules

❌ DON'T

Substitute icons with similar ones from icon libraries
Ignore wave/angle dividers
Skip footer link styling
Assume section spacing without measuring
Miss hover/active/focus states

✅ DO

Extract EXACT SVG markup for all custom icons
Capture ALL divider types (hr, border, SVG, clip-path)
Document EVERY link style (nav, footer, inline, CTA)
Measure actual padding/margin values
Screenshot unusual patterns for reference

Error Handling

Error	Action
Firecrawl rate limit	Wait, retry with smaller batch
sitemap.xml missing	Continue with Firecrawl + Screaming Frog
CSS file 404	Log warning, check for inline styles
Font file blocked	Note in manifest, may need manual download
SVG divider complex	Screenshot + extract raw HTML
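
The "wait, retry with smaller batch" action for rate limits could be sketched along these lines. `RateLimitError` and the `fetch_batch` callable are hypothetical stand-ins for whatever the Firecrawl call raises and returns; the halving and exponential backoff policy is one reasonable choice, not something the skill specifies.

```python
import time


class RateLimitError(Exception):
    """Hypothetical error raised by fetch_batch when the API rate-limits."""


def crawl_in_batches(urls, fetch_batch, batch_size=20, max_retries=3,
                     base_delay=1.0, sleep=time.sleep):
    """Fetch URLs in batches; on a rate limit, back off and halve the batch."""
    results = []
    index = 0
    retries = 0
    while index < len(urls):
        batch = urls[index:index + batch_size]
        try:
            results.extend(fetch_batch(batch))
        except RateLimitError:
            if retries >= max_retries:
                raise  # give up after repeated failures on the same batch
            sleep(base_delay * 2 ** retries)  # 1s, 2s, 4s, ...
            retries += 1
            batch_size = max(1, batch_size // 2)
            continue
        index += len(batch)
        retries = 0
    return results


# Demo with a fake fetcher that rejects batches larger than two URLs.
def fake_fetch(batch):
    if len(batch) > 2:
        raise RateLimitError("too many URLs in one request")
    return [f"content of {url}" for url in batch]


delays = []
pages = crawl_in_batches(
    [f"https://example.com/page-{n}" for n in range(5)],
    fake_fetch,
    batch_size=4,
    sleep=delays.append,  # record delays instead of actually sleeping
)
print(len(pages), "pages fetched after delays", delays)
```

Passing `sleep` as a parameter keeps the demo instant and makes the backoff easy to verify without waiting.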

Example Usage

# Basic harvest
/site-harvest https://client-site.co.uk

# With Screaming Frog export
/site-harvest https://example.com --urls screaming-frog-export.csv

# Comparison mode (after rebuild)
/site-harvest compare https://old-site.com https://new-site.vercel.app

Output Structure

See references/output-structure.md for complete folder layout and manifest example.

/scraped-data/[site-name]/
├── manifest.json
├── [site-name]-theme.md
├── url-discovery.json
├── design-tokens.json
├── component-styles.json
├── pages/
├── screenshots/
├── media/
├── assets/
└── comparison/
