Crawling a Full Documentation Site Using the Crawl Endpoint
Manually scraping hundreds of pages from a site like Supabase Docs is a non-starter. Building a custom crawler can be a multi-week infrastructure project. This guide shows you how to do it in about five minutes with a single API call using the Supacrawler Jobs API.
Note: If you’re using the SDKs (recommended), install them first; see our Install the SDKs guide.
Our goal is to crawl the entire https://supabase.com/docs website and get back a clean JSON object where each page's content is stored as markdown.
Step 1: Create the Crawl Job
First, we'll create an asynchronous crawl job. We give it the starting URL and, most importantly, use the include_patterns parameter to tell the crawler to only follow links that live under the /docs/ path. This is the key to an efficient and clean crawl: it ensures we don't accidentally scrape the main marketing site or blog.
Start a crawl job for Supabase Docs
curl https://api.supacrawler.com/api/v1/crawl \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://supabase.com/docs",
    "format": "markdown",
    "depth": 5,
    "link_limit": 500,
    "include_patterns": ["/docs/*"],
    "render_js": true,
    "include_subdomains": false
  }'
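If you'd rather kick the job off from code than from the shell, here is a minimal Python sketch using the requests library with the same endpoint and payload as the curl call above. The requests approach and the SUPACRAWLER_API_KEY environment variable are our own choices for illustration; the official SDKs remain the recommended path.

import os
import requests

API_KEY = os.environ["SUPACRAWLER_API_KEY"]  # assumed env var; use your own key management

# Same payload as the curl example above
payload = {
    "url": "https://supabase.com/docs",
    "format": "markdown",
    "depth": 5,
    "link_limit": 500,
    "include_patterns": ["/docs/*"],
    "render_js": True,
    "include_subdomains": False,
}

response = requests.post(
    "https://api.supacrawler.com/api/v1/crawl",
    headers={"Authorization": f"Bearer {API_KEY}", "Content-Type": "application/json"},
    json=payload,
    timeout=30,
)
response.raise_for_status()
job_id = response.json()["job_id"]
print(f"Crawl job started: {job_id}")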
The API immediately responds with a job_id. The crawl is now running on our infrastructure.
Step 2: Retrieve Your Documentation Dataset
Because a full documentation site can contain hundreds of pages, the job runs in the background. You can periodically poll the /v1/crawl/{id} endpoint to check the status.
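As a rough sketch, polling from Python might look like the snippet below. It reuses API_KEY and job_id from the previous step; the five-second interval and the job_result.json filename are our own choices, and the call simply mirrors the GET /v1/crawl/{id} endpoint described above.

import json
import time
import requests

# job_id and API_KEY come from the create-job step above
status_url = f"https://api.supacrawler.com/api/v1/crawl/{job_id}"
headers = {"Authorization": f"Bearer {API_KEY}"}

while True:
    result = requests.get(status_url, headers=headers, timeout=30).json()
    if result.get("status") in ("completed", "failed"):
        break
    time.sleep(5)  # wait a few seconds between polls

# Save the completed response so the Bonus step below can read it
with open("job_result.json", "w") as f:
    json.dump(result, f, indent=2)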
Once the status is completed, the response contains a comprehensive crawl_data object. It holds the clean markdown, title, and other metadata for every page of the Supabase documentation, neatly keyed by URL.
{"success": true,"job_id": "550e8400-e29b-41d4-a716-446655440000","type": "crawl","status": "completed","data": {"url": "https://supabase.com/docs","crawl_data": {"https://supabase.com/docs/guides/auth": {"markdown": "[Overview](/docs/guides/auth)\n\n# Auth\n\n## Use Supabase to authenticate and authorize your users.\n\n* * *\n\nSupabase Auth makes it easy to implement authentication and authorization in your app. We provide client SDKs and API endpoints to help you create and manage users.\n\nYour users can use many popular Auth methods, including password, magic link, one-time password (OTP), social login, and single sign-on (SSO).\n\n## About authentication and authorization [\\#](\\#about-authentication-and-authorization)\n\nAuthentication and authorization are the core responsibilities of any Auth system.","metadata": { "title": "Auth | Supabase Docs", "status_code": 200 }},"https://supabase.com/docs/guides/ai": {"markdown": "# AI & Vectors\n\n## The best vector database is the database you already have.\n\n* * *\n\nSupabase provides an open source toolkit for developing AI applications using Postgres and pgvector. Use the Supabase client libraries to store, index, and query your vector embeddings at scale.\n\nThe toolkit includes:\n\n- A [vector store](/docs/guides/ai/vector-columns) and embeddings support using Postgres and pgvector.\n- A [Python client](/docs/guides/ai/vecs-python-client) for managing unstructured embeddings.\n- An [embedding generation](/docs/guides/ai/quickstarts/generate-text-embeddings) process using open source models directly in Edge Functions.","metadata": { "title": "AI & Vectors | Supabase Docs", "status_code": 200 }},"...": "... many more pages"}}}
Bonus: Create a Searchable Dataset from the Results
With this clean JSON file, you can easily transform it into a format suitable for a local search index or for feeding to an LLM.
Here's a simple Python snippet to parse the result into a flat list of objects:
import json

# Assuming 'job_result.json' contains the completed job response
with open('job_result.json', 'r') as f:
    job_data = json.load(f)

crawl_results = job_data.get('data', {}).get('crawl_data', {})

search_dataset = []
for url, page_data in crawl_results.items():
    search_dataset.append({
        'url': url,
        'title': (page_data.get('metadata') or {}).get('title', ''),
        'content': page_data.get('markdown', '')
    })

# Save to a new JSON file
with open('supabase_docs_dataset.json', 'w') as f:
    json.dump(search_dataset, f, indent=2)
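As a quick sanity check, a naive keyword search over that flat list might look like this (the query string is just an example):

# Naive substring search over the flattened dataset
query = "row level security"
matches = [
    page for page in search_dataset
    if query.lower() in page["title"].lower() or query.lower() in page["content"].lower()
]

for page in matches[:5]:
    print(f"{page['title']} -> {page['url']}")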
Next Steps
In just a few minutes, you've turned an entire, complex documentation website into a clean, structured, and portable dataset. This same pattern works for crawling blogs, product catalogs, or any other sectioned website.
For a full list of all available options to further refine your crawls, check out the complete Jobs API documentation.