How to Build a Web Crawler from Scratch: The Essential Guide
A web crawler does one thing: it starts at a URL, finds more URLs, and visits them. The concept is simple. The implementation reveals why most companies eventually buy rather than build.
The fundamental problem isn't finding links—it's deciding which links matter, how deep to go, and how to do it without crashing your infrastructure or getting blocked.
This guide starts from first principles: what makes crawling hard, and how do we solve it?
The Core Problem: The Link Graph is Infinite
A naive crawler follows every link. This fails immediately because:
- Links grow exponentially: A site with 10 links per page generates 100 pages at depth 2, 1,000 at depth 3, and 10,000 at depth 4.
- Most links are noise: Navigation menus, footers, and pagination create duplicate patterns that waste resources.
- You need to stop somewhere: Without limits, the crawler runs forever or runs out of money.
The solution requires three constraints:
- Depth limit: How many clicks from the start URL?
- Link limit: Maximum total pages to crawl?
- Pattern matching: Which URLs actually matter?
These aren't optimizations—they're requirements for a crawler that finishes.
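To make the three constraints concrete, here is a minimal sketch of how they might be grouped into a single configuration object. The CrawlConfig name and the defaults are illustrative assumptions, not part of the code later in this guide:

```python
from dataclasses import dataclass, field

@dataclass
class CrawlConfig:
    """Illustrative grouping of the three constraints (name and defaults are assumptions)."""
    max_depth: int = 2                              # how many clicks from the start URL
    max_pages: int = 100                            # total page budget before the crawl stops
    patterns: list = field(default_factory=list)    # URL patterns to allow; empty = allow all
```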
Starting Simple: Extract and Follow Links
The minimal crawler has two parts: extract links from a page, then visit them. Here's the foundation:
```python
from playwright.sync_api import sync_playwright
from urllib.parse import urljoin, urlparse

class SimpleCrawler:
    def __init__(self, start_url, max_pages=10):
        self.start_url = start_url
        self.max_pages = max_pages
        self.visited = set()
        self.to_visit = [start_url]

    def crawl(self):
        with sync_playwright() as p:
            browser = p.chromium.launch(headless=True)
            page = browser.new_page()

            while self.to_visit and len(self.visited) < self.max_pages:
                url = self.to_visit.pop(0)
                if url in self.visited:
                    continue
                self.visited.add(url)
                page.goto(url, wait_until="networkidle")

                # Extract links
                links = page.eval_on_selector_all(
                    "a[href]",
                    "elements => elements.map(e => e.href)"
                )

                # Add same-domain links to queue
                base_domain = urlparse(self.start_url).netloc
                for link in links:
                    if urlparse(link).netloc == base_domain:
                        if link not in self.visited:
                            self.to_visit.append(link)

            browser.close()
            return self.visited

# Usage
crawler = SimpleCrawler("https://example.com", max_pages=50)
pages = crawler.crawl()
print(f"Crawled {len(pages)} pages")
```
This works, but it's incomplete. Notice what's missing:
- No depth tracking: Can't limit how many clicks deep it goes
- No pattern filtering: Crawls everything, including pagination and navigation
- No concurrency: Visits one page at a time
- No error recovery: One failure stops everything
The Three Critical Improvements
1. Depth Control: Limit How Deep You Go
Some sites are shallow (10 pages), others are deep (10,000 pages). Depth limits prevent runaway crawls:
```python
# Track depth for each URL
self.depths = {self.start_url: 0}

# When adding new links:
current_depth = self.depths[current_url]
if current_depth < self.max_depth:
    self.depths[link] = current_depth + 1
    self.to_visit.append(link)
```
Depth 2 means "crawl the start page, the pages it links to, and the pages those pages link to, then stop." It's usually enough.
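Here is one way the depth bookkeeping could fold into the earlier SimpleCrawler. This is a sketch: the DepthLimitedCrawler name and the max_depth argument are assumptions, not part of the original class:

```python
from urllib.parse import urlparse
from playwright.sync_api import sync_playwright

class DepthLimitedCrawler(SimpleCrawler):
    """Sketch: the SimpleCrawler from above, plus a max_depth constraint (assumed names)."""

    def __init__(self, start_url, max_pages=10, max_depth=2):
        super().__init__(start_url, max_pages)
        self.max_depth = max_depth
        self.depths = {start_url: 0}  # distance in clicks from the start URL

    def crawl(self):
        with sync_playwright() as p:
            browser = p.chromium.launch(headless=True)
            page = browser.new_page()
            base_domain = urlparse(self.start_url).netloc

            while self.to_visit and len(self.visited) < self.max_pages:
                url = self.to_visit.pop(0)
                if url in self.visited:
                    continue
                self.visited.add(url)
                page.goto(url, wait_until="networkidle")

                links = page.eval_on_selector_all(
                    "a[href]", "elements => elements.map(e => e.href)"
                )
                current_depth = self.depths[url]
                for link in links:
                    if urlparse(link).netloc != base_domain:
                        continue
                    # Only queue links that stay within the depth budget
                    if current_depth < self.max_depth and link not in self.depths:
                        self.depths[link] = current_depth + 1
                        self.to_visit.append(link)

            browser.close()
            return self.visited
```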
2. Pattern Filtering: Crawl What Matters
Most links on a page are navigation noise. Pattern matching focuses the crawler:
```python
import re

def matches_pattern(url, patterns):
    """Check if URL matches any allowed pattern"""
    if not patterns:
        return True  # No patterns = allow all
    for pattern in patterns:
        if re.search(pattern, url):
            return True
    return False

# Only crawl blog posts and docs:
patterns = [r'/blog/', r'/docs/']
if matches_pattern(link, patterns):
    self.to_visit.append(link)
```
Common patterns:
- `/blog/*`: Only blog posts
- `/products/*`: Only product pages
- `^(?!.*admin).*$`: Exclude admin pages
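The matches_pattern helper takes regular expressions, so glob-style shorthands like /blog/* need regex equivalents. A usage sketch, assuming the helper defined above (the example URLs and exact expressions are illustrative assumptions):

```python
# Rough regex equivalents of the patterns above (illustrative only)
blog_only = [r'/blog/']              # any URL containing /blog/
products_only = [r'/products/']      # any URL containing /products/
no_admin = [r'^(?!.*admin).*$']      # negative lookahead: exclude anything containing "admin"

print(matches_pattern("https://example.com/blog/post-1", blog_only))    # True
print(matches_pattern("https://example.com/about", blog_only))          # False
print(matches_pattern("https://example.com/admin/login", no_admin))     # False
print(matches_pattern("https://example.com/docs/intro", no_admin))      # True
```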
3. Concurrency: Crawl Faster
Sequential crawling is slow. Modern sites need parallel requests:
```python
import asyncio
from playwright.async_api import async_playwright

async def crawl_url(url, browser, visited):
    if url in visited:
        return
    visited.add(url)
    page = await browser.new_page()
    await page.goto(url)
    # Extract and process...
    await page.close()

async def concurrent_crawl(urls):
    visited = set()
    async with async_playwright() as p:
        browser = await p.chromium.launch()
        # Crawl up to 5 pages simultaneously
        await asyncio.gather(*[
            crawl_url(url, browser, visited)
            for url in urls[:5]
        ])
        await browser.close()
```
Cap concurrency at 5-10 pages at a time; beyond that you mostly spend memory without gaining speed.
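One way to enforce that cap, assuming the crawl_url coroutine from the snippet above, is an asyncio.Semaphore that only lets a fixed number of pages be open at once. A sketch:

```python
import asyncio
from playwright.async_api import async_playwright

async def bounded_crawl(urls, max_concurrency=5):
    """Sketch: crawl a list of URLs with at most max_concurrency pages open at once."""
    visited = set()
    semaphore = asyncio.Semaphore(max_concurrency)

    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)

        async def worker(url):
            async with semaphore:  # blocks while max_concurrency pages are already open
                await crawl_url(url, browser, visited)

        await asyncio.gather(*(worker(url) for url in urls))
        await browser.close()

    return visited

# Usage (assumes crawl_url from the previous snippet is defined)
# asyncio.run(bounded_crawl(["https://example.com", "https://example.com/blog"]))
```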
The Hidden Complexity: What Production Actually Requires
The code above gets you started. Production crawling reveals new problems:
- Politeness: Add delays between requests (500ms-2s) to avoid overwhelming servers
- Robots.txt: Respect crawling rules (an ethical baseline with potential legal implications); see the sketch after this list
- User-Agent rotation: Avoid looking like a bot
- Retry logic: Handle network failures and timeouts
- Queue management: Track what failed, what succeeded, what's pending
- Data storage: Store results in a database, not memory
- Monitoring: Track crawl progress and errors
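As a rough sketch of three of these pieces: politeness delays, a robots.txt check using Python's standard library, and a simple retry loop with backoff. The function names, delays, and retry counts are illustrative assumptions:

```python
import time
import urllib.robotparser
from urllib.parse import urlparse

def allowed_by_robots(url, user_agent="MyCrawler/1.0"):
    """Check robots.txt with the standard library (illustrative; cache the parser in real use)."""
    robots_url = f"{urlparse(url).scheme}://{urlparse(url).netloc}/robots.txt"
    parser = urllib.robotparser.RobotFileParser()
    parser.set_url(robots_url)
    parser.read()
    return parser.can_fetch(user_agent, url)

def fetch_with_retries(page, url, retries=3, delay_seconds=1.0):
    """Politeness delay plus a simple retry loop around page.goto (numbers are illustrative)."""
    for attempt in range(retries):
        try:
            time.sleep(delay_seconds)  # politeness: pause before each request
            page.goto(url, wait_until="networkidle", timeout=30000)
            return True
        except Exception:
            delay_seconds *= 2         # back off before the next attempt
    return False
```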
Each of these is straightforward individually. Together, they're why most teams use managed services.
The Managed Alternative
Supacrawler provides distributed crawling as an API. The same depth control, pattern matching, and concurrency—without the infrastructure:
```bash
curl -X POST https://api.supacrawler.com/api/v1/crawl \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "link_limit": 100,
    "depth": 2,
    "patterns": ["/blog/*", "/docs/*"]
  }'
```
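If you prefer calling the API from Python, a minimal equivalent using the requests library might look like the sketch below. The request body mirrors the curl example above; the exact response shape is whatever the API returns:

```python
import requests

# Sketch: the same request as the curl example, sent with the requests library
response = requests.post(
    "https://api.supacrawler.com/api/v1/crawl",
    headers={
        "Authorization": "Bearer YOUR_API_KEY",
        "Content-Type": "application/json",
    },
    json={
        "url": "https://example.com",
        "link_limit": 100,
        "depth": 2,
        "patterns": ["/blog/*", "/docs/*"],
    },
    timeout=60,
)
response.raise_for_status()
print(response.json())  # structured crawl results (exact shape depends on the API)
```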
The difference between DIY and managed:
| Challenge | DIY Crawler | Supacrawler API |
|---|---|---|
| Link graph explosion | Manual depth/limit tracking | Built-in depth control |
| Pattern filtering | Write regex logic | Pattern parameter |
| Concurrency | Manage worker pools | Automatic parallelization |
| Retries & errors | Custom retry logic | Built-in with backoff |
| Politeness | Manual delays | Automatic rate limiting |
| Storage | Design data pipeline | Returns structured JSON |
Start with 1,000 free crawl requests • View API docs
What You've Learned
Web crawling comes down to three fundamentals:
- Constrain the graph: Limit depth and total pages, or the crawler never stops
- Filter intelligently: Most links are noise—patterns focus on what matters
- Production is different: The code is easy. Managing queues, retries, and infrastructure is hard.
Start with a basic crawler to understand how links connect. Move to managed services when maintaining the infrastructure distracts from building your product.
The crawling logic works the same way. The difference is who operates it.