api-scraper
Scrape data from websites by inspecting and calling their frontend APIs. Use when asked to "scrape", "fetch data from", "extract data from", "get all X from" a website URL. Automatically discovers API endpoints, fetches data, and outputs JSON or CSV.
SKILL.md
name: api-scraper
description: Scrape data from websites by inspecting and calling their frontend APIs. Use when asked to "scrape", "fetch data from", "extract data from", "get all X from" a website URL. Automatically discovers API endpoints, fetches data, and outputs JSON or CSV.
allowed-tools: Read, Write, Bash(python:*), Grep
API Scraper
Scrape data from websites by reverse-engineering their frontend API calls.
Requires: Chrome DevTools MCP (mcp__chrome-devtools__*)
Setup (One-time)
- Use Chrome M144+ (Beta or newer)
- Navigate to chrome://inspect/#remote-debugging and enable remote debugging
- Add to your Claude Code MCP config:
  {
    "mcpServers": {
      "chrome-devtools": {
        "command": "npx",
        "args": ["chrome-devtools-mcp@latest", "--autoConnect"]
      }
    }
  }
- When prompted, authorize Claude to connect to your browser
Note: Uses your existing browser session - you stay logged in to all your sites.
Workflow
Step 1: Set up browser and navigate
1. Call mcp__chrome-devtools__new_page with the target URL
2. Wait for page to load (requests will be captured automatically)
Step 2: List and filter network requests
1. Call mcp__chrome-devtools__list_network_requests with resourceTypes: ["fetch", "xhr"]
   - This filters to only API calls, excluding static assets
2. Look for data API calls by URL pattern:
   - /api/, /v1/, /v2/, /graphql
   - algolia.net, search, query
   - POST requests returning JSON
3. Note the reqid of interesting requests
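The URL patterns above can be applied mechanically once the request list has been copied out of the tool result. The sketch below assumes each request is represented as a dict with `reqid`, `method`, and `url` keys; that shape, the `pick_candidates` name, and the sample URLs are hypothetical, and the real tool output may be formatted differently. It also keeps every POST, since the request list alone may not reveal whether a response is JSON.

```python
import re

# Hints taken from the URL patterns listed in Step 2.
API_HINTS = re.compile(r"/api/|/v\d+/|/graphql|algolia\.net|search|query", re.IGNORECASE)


def pick_candidates(requests_seen):
    """Return requests whose URL matches a data-API hint, or that are POSTs."""
    return [
        r for r in requests_seen
        if API_HINTS.search(r["url"]) or r["method"].upper() == "POST"
    ]


# Hypothetical sample of captured requests.
sample = [
    {"reqid": 1, "method": "GET", "url": "https://example.com/static/app.css"},
    {"reqid": 2, "method": "GET", "url": "https://example.com/api/v1/items?page=0"},
    {"reqid": 3, "method": "POST", "url": "https://example.com/graphql"},
]
print([r["reqid"] for r in pick_candidates(sample)])  # [2, 3]
```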
Step 3: Inspect API details
For each interesting request, call mcp__chrome-devtools__get_network_request with the reqid.
This returns the full details:
- Request Headers - Including auth tokens, API keys
- Request Body - The exact payload sent (for POST requests)
- Response Headers - Content-type, pagination info
- Response Body - The actual JSON data returned
Extract:
- Endpoint URL - The API endpoint
- Authentication - API keys in headers or URL params
- Request format - How to structure the request body
- Response structure - How to parse the data
- Pagination - Total count, page info, cursors
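When the API key or paging values travel as URL parameters, it can help to separate the endpoint from its query string before writing the fetcher. This is a minimal sketch using only the standard library; `split_endpoint` and the example URL are illustrative names, not part of the skill.

```python
from urllib.parse import parse_qsl, urlsplit, urlunsplit


def split_endpoint(captured_url):
    """Split a captured URL into (endpoint without query, dict of query params)."""
    parts = urlsplit(captured_url)
    params = dict(parse_qsl(parts.query, keep_blank_values=True))
    endpoint = urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))
    return endpoint, params


# Hypothetical captured URL; the key value is a placeholder.
endpoint, params = split_endpoint(
    "https://example.com/api/v1/items?api_key=PLACEHOLDER&page=0&limit=100"
)
print(endpoint)  # https://example.com/api/v1/items
print(params)    # {'api_key': 'PLACEHOLDER', 'page': '0', 'limit': '100'}
```

The resulting dict can be passed to `requests.get(endpoint, params=params)`, which makes it easy to change `page` or `limit` without editing the URL string.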
Step 4: Generate and run Python fetcher
Create a Python script using the exact request format discovered:
#!/usr/bin/env python3
import json

import requests

# API configuration extracted from network inspection
API_URL = "extracted_url"
HEADERS = {
    "Content-Type": "application/json",
    # Add auth headers exactly as seen in request
}


def fetch_data():
    # Use exact payload format from request body.
    # [...] is a placeholder: paste the captured request body here.
    payload = {"requests": [...]}
    response = requests.post(API_URL, headers=HEADERS, json=payload, timeout=30)
    response.raise_for_status()  # Fail on 4xx/5xx instead of parsing an error page
    return response.json()


if __name__ == "__main__":
    data = fetch_data()
    print(json.dumps(data, indent=2))
Step 5: Output results
Save the fetched data:
- JSON (default): {descriptive_name}.json
- CSV (if requested): Use csv module to flatten and export
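One possible way to do the CSV flattening is sketched below. The dotted column names and the choice to join list values with "; " are assumptions made for illustration; `flatten` and `write_csv` are hypothetical helpers, and real responses may need a different treatment of nested lists.

```python
import csv
import io


def flatten(record, prefix=""):
    """Flatten nested dicts into one level, joining keys with dots."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}."))
        elif isinstance(value, list):
            flat[name] = "; ".join(str(v) for v in value)
        else:
            flat[name] = value
    return flat


def write_csv(records, out):
    """Write a list of (possibly nested) dicts to a file-like object as CSV."""
    rows = [flatten(r) for r in records]
    # Union of all keys in first-seen order, so sparse records still fit.
    fieldnames = list(dict.fromkeys(k for row in rows for k in row))
    writer = csv.DictWriter(out, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)


buf = io.StringIO()
write_csv([{"name": "A", "meta": {"city": "X"}, "tags": ["a", "b"]}], buf)
print(buf.getvalue())
```

To write to disk instead, pass a file opened with `open(path, "w", newline="")` as `out`.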
Chrome DevTools MCP Tools
| Tool | Purpose |
|---|---|
| list_pages | See open browser pages |
| new_page | Open URL in new page |
| select_page | Switch to a page |
| navigate_page | Navigate current page |
| list_network_requests | List captured requests (filter by type) |
| get_network_request | Get full request/response details |
| evaluate_script | Run JavaScript in page |
| take_snapshot | Get DOM snapshot |
Filtering Network Requests
Use resourceTypes parameter to filter:
["fetch", "xhr"] # API calls only (recommended)
["document"] # HTML pages
["script"] # JavaScript files
["stylesheet"] # CSS files
Example: YC Companies
1. mcp__chrome-devtools__new_page
   url: "https://www.ycombinator.com/companies"
2. mcp__chrome-devtools__list_network_requests
   resourceTypes: ["fetch", "xhr"]
   Result: reqid=229 POST https://45bwzj1sgc-dsn.algolia.net/...
3. mcp__chrome-devtools__get_network_request
   reqid: 229
   Result shows:
   - Request Body: {"requests":[{"indexName":"YCCompany_production",...}]}
   - Response Body: {"results":[{"nbHits":5611,"hits":[...],...}]}
4. Generate Python script with exact API format
5. Output: yc_companies.json
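Before saving, it is worth comparing the number of hits received with the total the response reports, since a mismatch usually means more pages remain. The sketch below assumes the response keeps the `results[0].nbHits` / `results[0].hits` shape shown in the example; `summarize_results` is a hypothetical helper and the sample body is a tiny stand-in, not real data.

```python
def summarize_results(response_body):
    """Return (reported total, hits in this response, whether the set is complete)."""
    first = response_body["results"][0]
    total, hits = first["nbHits"], first["hits"]
    return total, hits, len(hits) >= total


# Stand-in for a response body: reports 3 hits but only carries 2.
body = {"results": [{"nbHits": 3, "hits": [{"name": "A"}, {"name": "B"}]}]}
total, hits, complete = summarize_results(body)
print(total, len(hits), complete)  # 3 2 False
```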
Common API Patterns
Algolia Search
ALGOLIA_URL = "https://{app_id}-dsn.algolia.net/1/indexes/*/queries"
headers = {
    "Content-Type": "application/json",
}
# API key often in URL params: x-algolia-api-key=...
payload = {
    "requests": [{
        "indexName": "YCCompany_production",
        "params": "query=&hitsPerPage=1000"
    }]
}
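If one request does not return every hit, the same payload can be rebuilt per page. This sketch assumes the index accepts Algolia's zero-based `page` search parameter alongside `hitsPerPage`; `build_payload` is a hypothetical helper, and a given site may cap how deep pagination can go.

```python
from urllib.parse import urlencode


def build_payload(index_name, page, hits_per_page=1000, query=""):
    """Build a multi-query payload requesting one page of one index."""
    params = urlencode({"query": query, "hitsPerPage": hits_per_page, "page": page})
    return {"requests": [{"indexName": index_name, "params": params}]}


print(build_payload("YCCompany_production", page=0))
```

Using `urlencode` rather than string concatenation keeps the `params` string valid when the query contains spaces or special characters.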
REST API with pagination
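import requests

# API_URL is the endpoint found in Step 3
all_items = []
page = 0
while True:
    response = requests.get(f"{API_URL}?page={page}&limit=100", timeout=30)
    response.raise_for_status()
    data = response.json()
    if not data["items"]:
        break  # An empty page means everything has been fetched
    all_items.extend(data["items"])
    page += 1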
GraphQL
query = """
query GetItems($first: Int, $after: String) {
  items(first: $first, after: $after) {
    edges { node { id, name } }
    pageInfo { hasNextPage, endCursor }
  }
}
"""
response = requests.post(API_URL, json={"query": query, "variables": {...}})
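The `pageInfo` fields in the query above are what drive cursor pagination. Here is a minimal sketch of that loop, assuming the response follows the usual `{"data": {"items": {...}}}` envelope matching the query's field names. The transport is passed in as a callable so the loop can be exercised without a network; `fetch_all_nodes` and `fake_post` are hypothetical names.

```python
def fetch_all_nodes(post, query, page_size=100):
    """Follow pageInfo.endCursor until hasNextPage is false.

    `post` is any callable that takes the request body (a dict) and returns
    the decoded JSON response, for example:
        lambda body: requests.post(API_URL, json=body).json()
    """
    nodes, cursor = [], None
    while True:
        body = {"query": query, "variables": {"first": page_size, "after": cursor}}
        connection = post(body)["data"]["items"]
        nodes.extend(edge["node"] for edge in connection["edges"])
        page_info = connection["pageInfo"]
        if not page_info["hasNextPage"]:
            return nodes
        cursor = page_info["endCursor"]


def fake_post(body):
    """Two-page stand-in for a real GraphQL endpoint."""
    if body["variables"]["after"] is None:
        return {"data": {"items": {
            "edges": [{"node": {"id": 1, "name": "a"}}],
            "pageInfo": {"hasNextPage": True, "endCursor": "c1"},
        }}}
    return {"data": {"items": {
        "edges": [{"node": {"id": 2, "name": "b"}}],
        "pageInfo": {"hasNextPage": False, "endCursor": None},
    }}}


print(fetch_all_nodes(fake_post, "query GetItems { ... }"))
```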
Fallback: DOM Scraping
If no API is found (server-rendered pages like WordPress):
- Use mcp__chrome-devtools__evaluate_script to extract data from DOM
- Click "Load More" buttons via mcp__chrome-devtools__click
- Use mcp__chrome-devtools__take_snapshot to get page structure
// Example: Extract data from DOM
// map() returns the collected records; forEach() would discard them
Array.from(document.querySelectorAll('.card')).map(card => ({
  name: card.querySelector('h3')?.textContent,
  url: card.querySelector('a')?.href,
  // ...
}));
Tips
- Filter by resourceTypes - Use ["fetch", "xhr"] to see only API calls
- Check response body - Contains actual data structure and pagination info
- Copy exact headers - Some APIs require specific headers to work
- Look for total count - Response often shows total items for pagination
- Handle rate limits - Add delays between requests if needed
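For the rate-limit tip, one common approach is to retry a failed call with an increasing delay. The sketch below is an illustration, not part of the skill: `call_with_backoff` is a hypothetical helper, the retry count and delays are arbitrary defaults, and `sleep` is injectable so the behavior can be checked without actually waiting.

```python
import time


def call_with_backoff(fetch, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(); on failure wait base_delay * 2**attempt, then try again."""
    for attempt in range(retries):
        try:
            return fetch()
        except Exception:
            if attempt == retries - 1:
                raise  # Out of retries: surface the last error
            sleep(base_delay * 2 ** attempt)


# Stand-in for a request that is rate limited twice, then succeeds.
calls = {"n": 0}


def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("429 Too Many Requests")
    return "ok"


waits = []
print(call_with_backoff(flaky, sleep=waits.append), waits)  # ok [1.0, 2.0]
```

In a real fetcher, `fetch` would wrap a `requests` call followed by `raise_for_status()`, so HTTP 429 responses raise and trigger the delay.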