How to Scrape JavaScript Websites: A First-Principles Guide
Most web scraping tutorials start with HTTP requests and HTML parsing. This works great until you encounter a React or Vue app that renders an empty `<div id="app"></div>` and loads everything with JavaScript.
The fundamental problem: HTTP clients get HTML. JavaScript runs in browsers. To scrape JavaScript-rendered content, you need a browser.
This guide starts from first principles: why JavaScript breaks traditional scraping, and how headless browsers solve it.
The Core Problem: JavaScript Renders After Page Load
Try scraping a React app with an HTTP client:
```python
import requests
from bs4 import BeautifulSoup

response = requests.get("https://react-app.example.com")
soup = BeautifulSoup(response.text, 'html.parser')

print(soup.find('div', class_='product-list'))
# Returns: None
```
The HTML you receive looks like this:
```html
<html>
  <body>
    <div id="root"></div>
    <script src="/bundle.js"></script>
  </body>
</html>
```
The content loads after JavaScript executes. The HTTP response doesn't include it.
The Solution: Execute JavaScript
You need something that:
- Downloads the HTML
- Executes the JavaScript
- Waits for DOM updates
- Then extracts the content
This is what browsers do. Headless browsers do it without the GUI.
Starting Simple: Playwright with JavaScript Execution
Here's the minimal implementation that actually renders JavaScript:
```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()

    # Navigate and wait for JavaScript
    page.goto("https://react-app.example.com", wait_until="networkidle")

    # Now the content exists
    product_list = page.query_selector('.product-list')
    print(product_list.text_content())

    browser.close()
```
The key difference from plain HTTP is `wait_until="networkidle"`: Playwright waits until the network has been quiet for at least 500 ms, which in practice means the JavaScript has finished fetching and rendering its data. Without it, you get the empty shell.
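A quick way to see the difference is to load the same page with and without the flag and compare how much HTML comes back. This is a minimal sketch using the article's placeholder URL; the exact sizes will vary by site:

```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()

    # Default wait ("load"): returns once the initial HTML and assets load,
    # often before client-side rendering has produced any content
    page.goto("https://react-app.example.com")
    print(len(page.content()))  # small: roughly the empty shell

    # networkidle: returns after no network activity for at least 500 ms
    page.goto("https://react-app.example.com", wait_until="networkidle")
    print(len(page.content()))  # larger: includes the rendered DOM

    browser.close()
```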
The Three Wait Strategies
Not all JavaScript sites load the same way. You need different strategies:
1. Network Idle (Most Common)
Wait until network requests stop:
```python
page.goto(url, wait_until="networkidle")
```
Works for: Sites that load data via API calls
2. Specific Element (Most Reliable)
Wait for the content you need:
```python
page.goto(url)
page.wait_for_selector(".product-list", state="visible")
content = page.text_content(".product-list")
```
Works for: When you know exactly what element matters
3. Manual Delay (Last Resort)
Some sites never stop making requests:
```python
page.goto(url)
page.wait_for_timeout(2000)  # 2 seconds
content = page.content()
```
Works for: Poorly built sites with continuous polling
Most production scrapers use strategy #2, waiting for specific elements: it's faster and more reliable than network idle.
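In production you'll also want an explicit timeout so a missing element fails fast instead of hanging. Here's a sketch of strategy #2 with that refinement; `page` and `url` are set up as in the earlier snippets, and the selector and timeout are illustrative:

```python
from playwright.sync_api import TimeoutError as PlaywrightTimeout

try:
    page.goto(url)
    # Fail after 10 seconds instead of Playwright's 30-second default
    page.wait_for_selector(".product-list", state="visible", timeout=10_000)
    content = page.text_content(".product-list")
except PlaywrightTimeout:
    content = None  # the element never appeared; log and handle upstream
```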
Interactions: Clicking and Filling
Some content requires interaction. Common patterns:
# Click "Load More" buttonpage.click("button.load-more")page.wait_for_selector(".new-items", state="visible")# Fill search formpage.fill("input[name='query']", "search term")page.click("button[type='submit']")page.wait_for_url("**/search?**")# Select dropdownpage.select_option("select.category", "technology")
The pattern is always: interact, then wait for the result.
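Applied repeatedly, the same pattern handles paginated content. This is a hypothetical "Load More" loop; the selectors (`button.load-more`, `.item`) are placeholders for whatever the real site uses:

```python
# Keep clicking "Load More" until the button disappears
while page.is_visible("button.load-more"):
    before = page.locator(".item").count()
    page.click("button.load-more")
    # Wait for the result of the click: the item count must grow
    page.wait_for_function(
        "n => document.querySelectorAll('.item').length > n", arg=before
    )
```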
When Browsers Become Expensive
Playwright works. You can scrape JavaScript-heavy sites. But production reveals costs:
- Memory Usage: Each browser instance uses 200-500 MB. Running 10 concurrent browsers needs 5 GB+ of RAM.
- CPU Load: JavaScript execution is CPU-intensive. Rendering a React app is 10-100x slower than parsing static HTML.
- Deployment Complexity: Playwright requires system dependencies. Docker images are 1.5 GB+. Lambda requires custom layers.
- Reliability: Browsers crash and pages time out, so you need retry logic, error handling, and monitoring (see the retry sketch below).
- Anti-Detection: Many sites block headless browsers. You need user-agent rotation, proxy management, and fingerprint randomization.
You're not just scraping—you're operating a browser farm.
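To make the reliability point concrete, here is a minimal retry sketch. It relaunches the browser on each attempt so a crashed process can't poison later tries; the function name, timeouts, and backoff are all illustrative, not a prescribed pattern:

```python
import time

from playwright.sync_api import Error as PlaywrightError, sync_playwright

def scrape_with_retries(url: str, selector: str, attempts: int = 3) -> str | None:
    for attempt in range(attempts):
        try:
            with sync_playwright() as p:
                browser = p.chromium.launch(headless=True)
                page = browser.new_page()
                page.goto(url, timeout=30_000)
                page.wait_for_selector(selector, state="visible", timeout=10_000)
                return page.text_content(selector)
        except PlaywrightError:
            time.sleep(2 ** attempt)  # back off: 1 s, 2 s, 4 s ...
    return None
```

And this only covers crashes and timeouts; anti-detection and memory limits each need their own machinery on top.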
The Managed Alternative
Supacrawler handles JavaScript rendering without the browser farm:
```bash
curl -G https://api.supacrawler.com/api/v1/scrape \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d url="https://react-app.example.com" \
  -d render_js=true \
  -d format="markdown"
```
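The same call from Python is just an HTTP request; this sketch mirrors the curl command's parameters exactly, and the JSON response shape is an assumption to verify against the API docs:

```python
import requests

resp = requests.get(
    "https://api.supacrawler.com/api/v1/scrape",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    params={
        "url": "https://react-app.example.com",
        "render_js": "true",
        "format": "markdown",
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json())  # assumed JSON body; see the docs for the actual schema
```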
The operational difference:
| Concern | DIY Playwright | Supacrawler API |
|---|---|---|
| Browser management | Manual pool, restart logic | Automatic |
| Memory limits | Configured per instance | Built-in |
| Anti-detection | User-agent rotation, proxies | 99.9% success rate |
| JavaScript execution | Choose a wait strategy | Automatic detection |
| Deployment | Docker + dependencies | API call |
| Scaling | Provision servers | Automatic |
What You've Learned
JavaScript rendering breaks traditional HTTP scraping. The solutions:
- Use a headless browser: Execute JavaScript like a real browser
- Wait strategically: Network idle for API calls, specific elements for reliability
- Production is different: Browser farms need memory management, anti-detection, and retry logic
Start with Playwright to understand how browsers work. Move to managed services when infrastructure becomes more complex than your product.
The JavaScript execution works the same way. The difference is who manages the browsers.