How to Build a Web Crawler from Scratch: The Essential Guide
A web crawler does one thing: it starts at a URL, finds more URLs, and visits them. The concept is simple. The implementation reveals why most companies eventually buy rather than build.
The fundamental problem isn't finding links—it's deciding which links matter, how deep to go, and how to do it without crashing your infrastructure or getting blocked.
This guide starts from first principles: what makes crawling hard, and how do we solve it?
The Core Problem: The Link Graph is Infinite
A naive crawler follows every link. This fails immediately because:
- Links grow exponentially: A site with 10 links per page generates 100 pages at depth 2, 1,000 at depth 3, and 10,000 at depth 4.
- Most links are noise: Navigation menus, footers, and pagination create duplicate patterns that waste resources.
- You need to stop somewhere: Without limits, the crawler runs forever or runs out of money.
The solution requires three constraints:
- Depth limit: How many clicks from the start URL?
- Link limit: Maximum total pages to crawl?
- Pattern matching: Which URLs actually matter?
These aren't optimizations—they're requirements for a crawler that finishes.
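To make the three constraints concrete, here is a minimal sketch of how they might be grouped into a single configuration object. The CrawlConfig name and the defaults are illustrative assumptions, not part of the code later in this guide:

```python
from dataclasses import dataclass, field

@dataclass
class CrawlConfig:
    """Illustrative grouping of the three constraints (name and defaults are assumptions)."""
    max_depth: int = 2                              # how many clicks from the start URL
    max_pages: int = 100                            # total page budget before the crawl stops
    patterns: list = field(default_factory=list)    # URL patterns to allow; empty = allow all
```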
Starting Simple: Extract and Follow Links
The minimal crawler has two parts: extract links from a page, then visit them. Here's the foundation:
```python
from playwright.sync_api import sync_playwright
from urllib.parse import urljoin, urlparse

class SimpleCrawler:
    def __init__(self, start_url, max_pages=10):
        self.start_url = start_url
        self.max_pages = max_pages
        self.visited = set()
        self.to_visit = [start_url]

    def crawl(self):
        with sync_playwright() as p:
            browser = p.chromium.launch(headless=True)
            page = browser.new_page()

            while self.to_visit and len(self.visited) < self.max_pages:
                url = self.to_visit.pop(0)
                if url in self.visited:
                    continue
                self.visited.add(url)
                page.goto(url, wait_until="networkidle")

                # Extract links
                links = page.eval_on_selector_all(
                    "a[href]",
                    "elements => elements.map(e => e.href)"
                )

                # Add same-domain links to queue
                base_domain = urlparse(self.start_url).netloc
                for link in links:
                    if urlparse(link).netloc == base_domain:
                        if link not in self.visited:
                            self.to_visit.append(link)

            browser.close()
            return self.visited

# Usage
crawler = SimpleCrawler("https://example.com", max_pages=50)
pages = crawler.crawl()
print(f"Crawled {len(pages)} pages")
```
This works, but it's incomplete. Notice what's missing:
- No depth tracking: Can't limit how many clicks deep it goes
- No pattern filtering: Crawls everything, including pagination and navigation
- No concurrency: Visits one page at a time
- No error recovery: One failure stops everything
The Three Critical Improvements
1. Depth Control: Limit How Deep You Go
Some sites are shallow (10 pages), others are deep (10,000 pages). Depth limits prevent runaway crawls:
```python
# Track depth for each URL
self.depths = {self.start_url: 0}

# When adding new links:
current_depth = self.depths[current_url]
if current_depth < self.max_depth:
    self.depths[link] = current_depth + 1
    self.to_visit.append(link)
```
Depth 2 means "crawl the start page, the pages it links to, and the pages those pages link to, then stop." It's usually enough.
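Here is one way the depth bookkeeping could fold into the earlier SimpleCrawler. This is a sketch: the DepthLimitedCrawler name and the max_depth argument are assumptions, not part of the original class:

```python
from urllib.parse import urlparse
from playwright.sync_api import sync_playwright

class DepthLimitedCrawler(SimpleCrawler):
    """Sketch: the SimpleCrawler from above, plus a max_depth constraint (assumed names)."""

    def __init__(self, start_url, max_pages=10, max_depth=2):
        super().__init__(start_url, max_pages)
        self.max_depth = max_depth
        self.depths = {start_url: 0}  # distance in clicks from the start URL

    def crawl(self):
        with sync_playwright() as p:
            browser = p.chromium.launch(headless=True)
            page = browser.new_page()
            base_domain = urlparse(self.start_url).netloc

            while self.to_visit and len(self.visited) < self.max_pages:
                url = self.to_visit.pop(0)
                if url in self.visited:
                    continue
                self.visited.add(url)
                page.goto(url, wait_until="networkidle")

                links = page.eval_on_selector_all(
                    "a[href]", "elements => elements.map(e => e.href)"
                )
                current_depth = self.depths[url]
                for link in links:
                    if urlparse(link).netloc != base_domain:
                        continue
                    # Only queue links that stay within the depth budget
                    if current_depth < self.max_depth and link not in self.depths:
                        self.depths[link] = current_depth + 1
                        self.to_visit.append(link)

            browser.close()
            return self.visited
```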
2. Pattern Filtering: Crawl What Matters
Most links on a page are navigation noise. Pattern matching focuses the crawler:
```python
import re

def matches_pattern(url, patterns):
    """Check if URL matches any allowed pattern"""
    if not patterns:
        return True  # No patterns = allow all
    for pattern in patterns:
        if re.search(pattern, url):
            return True
    return False

# Only crawl blog posts and docs:
patterns = [r'/blog/', r'/docs/']
if matches_pattern(link, patterns):
    self.to_visit.append(link)
```
Common patterns:
- `/blog/*`: Only blog posts
- `/products/*`: Only product pages
- `^(?!.*admin).*$`: Exclude admin pages
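The matches_pattern helper takes regular expressions, so glob-style shorthands like /blog/* need regex equivalents. A usage sketch, assuming the helper defined above (the example URLs and exact expressions are illustrative assumptions):

```python
# Rough regex equivalents of the patterns above (illustrative only)
blog_only = [r'/blog/']              # any URL containing /blog/
products_only = [r'/products/']      # any URL containing /products/
no_admin = [r'^(?!.*admin).*$']      # negative lookahead: exclude anything containing "admin"

print(matches_pattern("https://example.com/blog/post-1", blog_only))    # True
print(matches_pattern("https://example.com/about", blog_only))          # False
print(matches_pattern("https://example.com/admin/login", no_admin))     # False
print(matches_pattern("https://example.com/docs/intro", no_admin))      # True
```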
3. Concurrency: Crawl Faster
Sequential crawling is slow. Modern sites need parallel requests:
```python
import asyncio
from playwright.async_api import async_playwright

async def crawl_url(url, browser, visited):
    if url in visited:
        return
    visited.add(url)
    page = await browser.new_page()
    await page.goto(url)
    # Extract and process...
    await page.close()

async def concurrent_crawl(urls):
    visited = set()
    async with async_playwright() as p:
        browser = await p.chromium.launch()
        # Crawl up to 5 pages simultaneously
        await asyncio.gather(*[
            crawl_url(url, browser, visited)
            for url in urls[:5]
        ])
        await browser.close()
```
Cap concurrency at 5-10 pages at a time; beyond that you mostly spend memory without gaining speed.
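One way to enforce that cap, assuming the crawl_url coroutine from the snippet above, is an asyncio.Semaphore that only lets a fixed number of pages be open at once. A sketch:

```python
import asyncio
from playwright.async_api import async_playwright

async def bounded_crawl(urls, max_concurrency=5):
    """Sketch: crawl a list of URLs with at most max_concurrency pages open at once."""
    visited = set()
    semaphore = asyncio.Semaphore(max_concurrency)

    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)

        async def worker(url):
            async with semaphore:  # blocks while max_concurrency pages are already open
                await crawl_url(url, browser, visited)

        await asyncio.gather(*(worker(url) for url in urls))
        await browser.close()

    return visited

# Usage (assumes crawl_url from the previous snippet is defined)
# asyncio.run(bounded_crawl(["https://example.com", "https://example.com/blog"]))
```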
The Hidden Complexity: What Production Actually Requires
The code above gets you started. Production crawling reveals new problems:
- Politeness: Add delays between requests (500ms-2s) to avoid overwhelming servers
- Robots.txt: Respect crawling rules (an ethical baseline with potential legal implications); see the sketch after this list
- User-Agent rotation: Avoid looking like a bot
- Retry logic: Handle network failures and timeouts
- Queue management: Track what failed, what succeeded, what's pending
- Data storage: Store results in a database, not memory
- Monitoring: Track crawl progress and errors
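As a rough sketch of three of these pieces: politeness delays, a robots.txt check using Python's standard library, and a simple retry loop with backoff. The function names, delays, and retry counts are illustrative assumptions:

```python
import time
import urllib.robotparser
from urllib.parse import urlparse

def allowed_by_robots(url, user_agent="MyCrawler/1.0"):
    """Check robots.txt with the standard library (illustrative; cache the parser in real use)."""
    robots_url = f"{urlparse(url).scheme}://{urlparse(url).netloc}/robots.txt"
    parser = urllib.robotparser.RobotFileParser()
    parser.set_url(robots_url)
    parser.read()
    return parser.can_fetch(user_agent, url)

def fetch_with_retries(page, url, retries=3, delay_seconds=1.0):
    """Politeness delay plus a simple retry loop around page.goto (numbers are illustrative)."""
    for attempt in range(retries):
        try:
            time.sleep(delay_seconds)  # politeness: pause before each request
            page.goto(url, wait_until="networkidle", timeout=30000)
            return True
        except Exception:
            delay_seconds *= 2         # back off before the next attempt
    return False
```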
Each of these is straightforward individually. Together, they're why most teams use managed services.
The Managed Alternative
Supacrawler provides distributed crawling as an API. The same depth control, pattern matching, and concurrency—without the infrastructure:
```bash
curl -X POST https://api.supacrawler.com/api/v1/crawl \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "link_limit": 100,
    "depth": 2,
    "patterns": ["/blog/*", "/docs/*"]
  }'
```
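If you prefer calling the API from Python, a minimal equivalent using the requests library might look like the sketch below. The request body mirrors the curl example above; the exact response shape is whatever the API returns:

```python
import requests

# Sketch: the same request as the curl example, sent with the requests library
response = requests.post(
    "https://api.supacrawler.com/api/v1/crawl",
    headers={
        "Authorization": "Bearer YOUR_API_KEY",
        "Content-Type": "application/json",
    },
    json={
        "url": "https://example.com",
        "link_limit": 100,
        "depth": 2,
        "patterns": ["/blog/*", "/docs/*"],
    },
    timeout=60,
)
response.raise_for_status()
print(response.json())  # structured crawl results (exact shape depends on the API)
```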
The difference between DIY and managed:
| Challenge | DIY Crawler | Supacrawler API |
|---|---|---|
| Link graph explosion | Manual depth/limit tracking | Built-in depth control |
| Pattern filtering | Write regex logic | Pattern parameter |
| Concurrency | Manage worker pools | Automatic parallelization |
| Retries & errors | Custom retry logic | Built-in with backoff |
| Politeness | Manual delays | Automatic rate limiting |
| Storage | Design data pipeline | Returns structured JSON |
Start with 1,000 free crawl requests • View API docs
What You've Learned
Web crawling comes down to three fundamentals:
- Constrain the graph: Limit depth and total pages, or the crawler never stops
- Filter intelligently: Most links are noise—patterns focus on what matters
- Production is different: The code is easy. Managing queues, retries, and infrastructure is hard.
Start with a basic crawler to understand how links connect. Move to managed services when maintaining the infrastructure distracts from building your product.
The crawling logic works the same way. The difference is who operates it.