
How to Build a Web Crawler from Scratch: The Essential Guide

A web crawler does one thing: it starts at a URL, finds more URLs, and visits them. The concept is simple. The implementation reveals why most companies eventually buy rather than build.

The fundamental problem isn't finding links—it's deciding which links matter, how deep to go, and how to do it without crashing your infrastructure or getting blocked.

This guide starts from first principles: what makes crawling hard, and how do we solve it?

The Core Problem: The Link Graph is Infinite

A naive crawler follows every link. This fails immediately because:

  1. Links grow exponentially: A site with 10 links per page generates 100 pages at depth 2, 1,000 at depth 3, and 10,000 at depth 4.

  2. Most links are noise: Navigation menus, footers, and pagination create duplicate patterns that waste resources.

  3. You need to stop somewhere: Without limits, the crawler runs forever or runs out of money.

The solution requires three constraints:

  • Depth limit: How many clicks from the start URL?
  • Link limit: Maximum total pages to crawl?
  • Pattern matching: Which URLs actually matter?

These aren't optimizations—they're requirements for a crawler that finishes.
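In code, these three constraints are just parameters you thread through the crawler. A minimal sketch of a configuration object (the CrawlConfig name and defaults are illustrative, not from any library):

from dataclasses import dataclass, field

@dataclass
class CrawlConfig:
    max_depth: int = 2      # Depth limit: clicks from the start URL
    max_pages: int = 100    # Link limit: total pages to crawl
    patterns: list[str] = field(default_factory=list)  # URL patterns worth following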

Starting Simple: Extract and Follow Links

The minimal crawler has two parts: extract links from a page, then visit them. Here's the foundation:

from playwright.sync_api import sync_playwright
from urllib.parse import urljoin, urlparse

class SimpleCrawler:
    def __init__(self, start_url, max_pages=10):
        self.start_url = start_url
        self.max_pages = max_pages
        self.visited = set()
        self.to_visit = [start_url]

    def crawl(self):
        with sync_playwright() as p:
            browser = p.chromium.launch(headless=True)
            page = browser.new_page()

            while self.to_visit and len(self.visited) < self.max_pages:
                url = self.to_visit.pop(0)
                if url in self.visited:
                    continue
                self.visited.add(url)

                page.goto(url, wait_until="networkidle")

                # Extract links
                links = page.eval_on_selector_all(
                    "a[href]",
                    "elements => elements.map(e => e.href)"
                )

                # Add same-domain links to queue
                base_domain = urlparse(self.start_url).netloc
                for link in links:
                    if urlparse(link).netloc == base_domain:
                        if link not in self.visited:
                            self.to_visit.append(link)

            browser.close()
            return self.visited

# Usage
crawler = SimpleCrawler("https://example.com", max_pages=50)
pages = crawler.crawl()
print(f"Crawled {len(pages)} pages")

This works, but it's incomplete. Notice what's missing:

  1. No depth tracking: Can't limit how many clicks deep it goes
  2. No pattern filtering: Crawls everything, including pagination and navigation
  3. No concurrency: Visits one page at a time
  4. No error recovery: One failure stops everything

The Three Critical Improvements

1. Depth Control: Limit How Deep You Go

Some sites are shallow (10 pages); others run deep (10,000 pages). Depth limits prevent runaway crawls:

# Track depth for each URL
self.depths = {self.start_url: 0}

# When adding new links:
current_depth = self.depths[current_url]
if current_depth < self.max_depth:
    self.depths[link] = current_depth + 1
    self.to_visit.append(link)

Depth 2 means "crawl the start page, the pages it links to, and the pages those link to, then stop." It's usually enough.
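The same idea works without a separate depths dict: queue (url, depth) pairs instead of bare URLs. A sketch, assuming max_depth is passed to SimpleCrawler's __init__ and the loop otherwise matches the code above:

# Queue (url, depth) pairs instead of bare URLs
self.to_visit = [(self.start_url, 0)]

# In the crawl loop:
url, depth = self.to_visit.pop(0)
# ... visit the page, extract links ...
if depth < self.max_depth:
    for link in links:
        if link not in self.visited:
            self.to_visit.append((link, depth + 1))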

2. Pattern Filtering: Crawl What Matters

Most links on a page are navigation noise. Pattern matching focuses the crawler:

import re

def matches_pattern(url, patterns):
    """Check if URL matches any allowed pattern"""
    if not patterns:
        return True  # No patterns = allow all
    for pattern in patterns:
        if re.search(pattern, url):
            return True
    return False

# Only crawl blog posts and docs:
patterns = [r'/blog/', r'/docs/']
if matches_pattern(link, patterns):
    self.to_visit.append(link)

Common patterns:

  • /blog/* - Only blog posts
  • /products/* - Only product pages
  • ^(?!.*admin).*$ - Exclude admin pages
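A quick check of the helper above against a handful of extracted links (the URLs are made up for illustration):

links = [
    "https://example.com/blog/how-crawlers-work",
    "https://example.com/about",
    "https://example.com/docs/quickstart",
]
patterns = [r'/blog/', r'/docs/']
allowed = [link for link in links if matches_pattern(link, patterns)]
# allowed keeps only the /blog/ and /docs/ URLs; /about is dropped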

3. Concurrency: Crawl Faster

Sequential crawling is slow. Modern sites need parallel requests:

import asyncio
from playwright.async_api import async_playwright

async def crawl_url(url, browser, visited, semaphore):
    if url in visited:
        return
    visited.add(url)
    async with semaphore:  # Only a few pages open at once
        page = await browser.new_page()
        await page.goto(url)
        # Extract and process...
        await page.close()

async def concurrent_crawl(urls, max_concurrency=5):
    visited = set()
    semaphore = asyncio.Semaphore(max_concurrency)
    async with async_playwright() as p:
        browser = await p.chromium.launch()
        # Crawl every URL, but at most max_concurrency pages at a time
        await asyncio.gather(*[
            crawl_url(url, browser, visited, semaphore)
            for url in urls
        ])
        await browser.close()

Limit concurrency to 5-10 pages at a time. Beyond that, you mostly spend memory without gaining speed.
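Calling the sketch above from a script looks like this (the URLs are placeholders):

# Usage: crawl a list of URLs with at most 5 pages open at once
urls = [f"https://example.com/page/{i}" for i in range(20)]
asyncio.run(concurrent_crawl(urls, max_concurrency=5))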

The Hidden Complexity: What Production Actually Requires

The code above gets you started. Production crawling reveals new problems:

  1. Politeness: Add delays between requests (500ms-2s) to avoid overwhelming servers
  2. Robots.txt: Respect each site's crawling rules (an expected practice; ignoring it invites blocks and legal trouble)
  3. User-Agent rotation: Avoid looking like a bot
  4. Retry logic: Handle network failures and timeouts
  5. Queue management: Track what failed, what succeeded, what's pending
  6. Data storage: Store results in a database, not memory
  7. Monitoring: Track crawl progress and errors

Each of these is straightforward individually. Together, they're why most teams use managed services.
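As an example of the first two items, here is a minimal sketch of robots.txt checking and per-request delays using Python's standard library. The helper names are illustrative; a production crawler would cache one parser per domain and handle fetch errors:

import time
import random
from urllib import robotparser
from urllib.parse import urljoin

def make_robots_checker(start_url, user_agent="MyCrawlerBot/1.0"):
    """Fetch robots.txt once and return a can_fetch(url) callable."""
    parser = robotparser.RobotFileParser()
    parser.set_url(urljoin(start_url, "/robots.txt"))
    parser.read()
    return lambda url: parser.can_fetch(user_agent, url)

can_fetch = make_robots_checker("https://example.com")

def polite_goto(page, url):
    """Skip disallowed URLs and pause 500ms-2s between requests."""
    if not can_fetch(url):
        return None
    time.sleep(random.uniform(0.5, 2.0))
    return page.goto(url)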

The Managed Alternative

Supacrawler provides distributed crawling as an API. The same depth control, pattern matching, and concurrency—without the infrastructure:

curl -X POST https://api.supacrawler.com/api/v1/crawl \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "link_limit": 100,
    "depth": 2,
    "patterns": ["/blog/*", "/docs/*"]
  }'
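The same request from Python, assuming the requests library is installed (replace YOUR_API_KEY with your key):

import requests

response = requests.post(
    "https://api.supacrawler.com/api/v1/crawl",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json={
        "url": "https://example.com",
        "link_limit": 100,
        "depth": 2,
        "patterns": ["/blog/*", "/docs/*"],
    },
)
print(response.json())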

The difference between DIY and managed:

Challenge               DIY Crawler                    Supacrawler API
---------               -----------                    ---------------
Link graph explosion    Manual depth/limit tracking    Built-in depth control
Pattern filtering       Write regex logic              Pattern parameter
Concurrency             Manage worker pools            Automatic parallelization
Retries & errors        Custom retry logic             Built-in with backoff
Politeness              Manual delays                  Automatic rate limiting
Storage                 Design data pipeline           Returns structured JSON

Start with 1,000 free crawl requests, or view the API docs.

What You've Learned

Web crawling comes down to three fundamentals:

  1. Constrain the graph: Limit depth and total pages, or the crawler never stops
  2. Filter intelligently: Most links are noise—patterns focus on what matters
  3. Production is different: The code is easy. Managing queues, retries, and infrastructure is hard.

Start with a basic crawler to understand how links connect. Move to managed services when maintaining the infrastructure distracts from building your product.

The crawling logic works the same way. The difference is who operates it.

By Supacrawler Team
Published on October 2, 2025