Crawling a Full Documentation Site Using the Crawl Endpoint
Manually scraping hundreds of pages from a site like Supabase Docs is a non-starter. Building a custom crawler can be a multi-week infrastructure project. This guide shows you how to do it in about five minutes with a single API call using the Supacrawler Jobs API.
Note: If you’re using the SDKs (recommended), install them first; see our Install the SDKs guide.
Our goal is to crawl the entire https://supabase.com/docs website and get back a clean JSON object where each page's content is stored as markdown.
Step 1: Create the Crawl Job
First, we'll create an asynchronous crawl job. We give it the starting URL and, most importantly, use the include_patterns parameter to tell the crawler to only follow links that live under the /docs/ path. This is the key to an efficient and clean crawl: it ensures we don't accidentally scrape the main marketing site or blog.
Start a crawl job for Supabase Docs
curl https://api.supacrawler.com/api/v1/crawl \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://supabase.com/docs",
    "format": "markdown",
    "depth": 5,
    "link_limit": 500,
    "include_patterns": ["/docs/*"],
    "render_js": true,
    "include_subdomains": false
  }'
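If you'd rather kick the job off from code than from the shell, here is a minimal Python sketch using the requests library with the same endpoint and payload as the curl call above. The requests approach and the SUPACRAWLER_API_KEY environment variable are our own choices for illustration; the official SDKs remain the recommended path.

import os
import requests

API_KEY = os.environ["SUPACRAWLER_API_KEY"]  # assumed env var; use your own key management

# Same payload as the curl example above
payload = {
    "url": "https://supabase.com/docs",
    "format": "markdown",
    "depth": 5,
    "link_limit": 500,
    "include_patterns": ["/docs/*"],
    "render_js": True,
    "include_subdomains": False,
}

response = requests.post(
    "https://api.supacrawler.com/api/v1/crawl",
    headers={"Authorization": f"Bearer {API_KEY}", "Content-Type": "application/json"},
    json=payload,
    timeout=30,
)
response.raise_for_status()
job_id = response.json()["job_id"]
print(f"Crawl job started: {job_id}")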
The API immediately responds with a job_id. The crawl is now running on our infrastructure.
Step 2: Retrieve Your Documentation Dataset
Because a full documentation site can contain hundreds of pages, the job runs in the background. You can periodically poll the /v1/crawl/{id} endpoint to check the status.
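As a rough sketch, polling from Python might look like the snippet below. It reuses API_KEY and job_id from the previous step; the five-second interval and the job_result.json filename are our own choices, and the call simply mirrors the GET /v1/crawl/{id} endpoint described above.

import json
import time
import requests

# job_id and API_KEY come from the create-job step above
status_url = f"https://api.supacrawler.com/api/v1/crawl/{job_id}"
headers = {"Authorization": f"Bearer {API_KEY}"}

while True:
    result = requests.get(status_url, headers=headers, timeout=30).json()
    if result.get("status") in ("completed", "failed"):
        break
    time.sleep(5)  # wait a few seconds between polls

# Save the completed response so the Bonus step below can read it
with open("job_result.json", "w") as f:
    json.dump(result, f, indent=2)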
Once the status is completed, the response contains a comprehensive crawl_data object. It holds the clean markdown, title, and other metadata for every page of the Supabase documentation, neatly keyed by URL.
{"success": true,"job_id": "550e8400-e29b-41d4-a716-446655440000","type": "crawl","status": "completed","data": {"url": "https://supabase.com/docs","crawl_data": {"https://supabase.com/docs/guides/auth": {"markdown": "[Overview](/docs/guides/auth)\n\n# Auth\n\n## Use Supabase to authenticate and authorize your users.\n\n* * *\n\nSupabase Auth makes it easy to implement authentication and authorization in your app. We provide client SDKs and API endpoints to help you create and manage users.\n\nYour users can use many popular Auth methods, including password, magic link, one-time password (OTP), social login, and single sign-on (SSO).\n\n## About authentication and authorization [\\#](\\#about-authentication-and-authorization)\n\nAuthentication and authorization are the core responsibilities of any Auth system.","metadata": { "title": "Auth | Supabase Docs", "status_code": 200 }},"https://supabase.com/docs/guides/ai": {"markdown": "# AI & Vectors\n\n## The best vector database is the database you already have.\n\n* * *\n\nSupabase provides an open source toolkit for developing AI applications using Postgres and pgvector. Use the Supabase client libraries to store, index, and query your vector embeddings at scale.\n\nThe toolkit includes:\n\n- A [vector store](/docs/guides/ai/vector-columns) and embeddings support using Postgres and pgvector.\n- A [Python client](/docs/guides/ai/vecs-python-client) for managing unstructured embeddings.\n- An [embedding generation](/docs/guides/ai/quickstarts/generate-text-embeddings) process using open source models directly in Edge Functions.","metadata": { "title": "AI & Vectors | Supabase Docs", "status_code": 200 }},"...": "... many more pages"}}}
Bonus: Create a Searchable Dataset from the Results
With this clean JSON file, you can easily transform it into a format suitable for a local search index or for feeding to an LLM.
Here's a simple Python snippet to parse the result into a flat list of objects:
import json

# Assuming 'job_result.json' contains the completed job response
with open('job_result.json', 'r') as f:
    job_data = json.load(f)

crawl_results = job_data.get('data', {}).get('crawl_data', {})

search_dataset = []
for url, page_data in crawl_results.items():
    search_dataset.append({
        'url': url,
        'title': (page_data.get('metadata') or {}).get('title', ''),
        'content': page_data.get('markdown', '')
    })

# Save to a new JSON file
with open('supabase_docs_dataset.json', 'w') as f:
    json.dump(search_dataset, f, indent=2)
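As a quick sanity check, a naive keyword search over that flat list might look like this (the query string is just an example):

# Naive substring search over the flattened dataset
query = "row level security"
matches = [
    page for page in search_dataset
    if query.lower() in page["title"].lower() or query.lower() in page["content"].lower()
]

for page in matches[:5]:
    print(f"{page['title']} -> {page['url']}")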
Next Steps
In just a few minutes, you've turned an entire, complex documentation website into a clean, structured, and portable dataset. This same pattern works for crawling blogs, product catalogs, or any other sectioned website.
For a full list of all available options to further refine your crawls, check out the complete Jobs API documentation.