How to Scrape JavaScript Websites: A First-Principles Guide
Most web scraping tutorials start with HTTP requests and HTML parsing. This works great until you encounter a React or Vue app that renders an empty `<div id="app"></div>` and loads everything with JavaScript.
The fundamental problem: HTTP clients get HTML. JavaScript runs in browsers. To scrape JavaScript-rendered content, you need a browser.
This guide starts from first principles: why JavaScript breaks traditional scraping, and how headless browsers solve it.
The Core Problem: JavaScript Renders After Page Load
Try scraping a React app with an HTTP client:
```python
import requests
from bs4 import BeautifulSoup

response = requests.get("https://react-app.example.com")
soup = BeautifulSoup(response.text, 'html.parser')

print(soup.find('div', class_='product-list'))
# Returns: None
```
The HTML you receive looks like this:
```html
<html>
  <body>
    <div id="root"></div>
    <script src="/bundle.js"></script>
  </body>
</html>
```
The content loads after JavaScript executes. The HTTP response doesn't include it.
The Solution: Execute JavaScript
You need something that:
- Downloads the HTML
- Executes the JavaScript
- Waits for DOM updates
- Then extracts the content
This is what browsers do. Headless browsers do it without the GUI.
Starting Simple: Playwright with JavaScript Execution
Here's the minimal implementation that actually renders JavaScript:
```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()

    # Navigate and wait for JavaScript
    page.goto("https://react-app.example.com", wait_until="networkidle")

    # Now the content exists
    product_list = page.query_selector('.product-list')
    print(product_list.text_content())

    browser.close()
```
The key difference from plain HTTP is `wait_until="networkidle"`: Playwright waits until the network has been quiet for at least 500 ms, which in practice means the JavaScript has finished fetching and rendering its data. Without it, you get the empty shell.
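A quick way to see the difference is to load the same page with and without the flag and compare how much HTML comes back. This is a minimal sketch using the article's placeholder URL; the exact sizes will vary by site:

```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()

    # Default wait ("load"): returns once the initial HTML and assets load,
    # often before client-side rendering has produced any content
    page.goto("https://react-app.example.com")
    print(len(page.content()))  # small: roughly the empty shell

    # networkidle: returns after no network activity for at least 500 ms
    page.goto("https://react-app.example.com", wait_until="networkidle")
    print(len(page.content()))  # larger: includes the rendered DOM

    browser.close()
```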
The Three Wait Strategies
Not all JavaScript sites load the same way. You need different strategies:
1. Network Idle (Most Common)
Wait until network requests stop:
```python
page.goto(url, wait_until="networkidle")
```
Works for: Sites that load data via API calls
2. Specific Element (Most Reliable)
Wait for the content you need:
```python
page.goto(url)
page.wait_for_selector(".product-list", state="visible")
content = page.text_content(".product-list")
```
Works for: When you know exactly what element matters
3. Manual Delay (Last Resort)
Some sites never stop making requests:
```python
page.goto(url)
page.wait_for_timeout(2000)  # 2 seconds
content = page.content()
```
Works for: Poorly built sites with continuous polling
Most production scrapers use strategy #2, waiting for specific elements: it's faster and more reliable than network idle.
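In production you'll also want an explicit timeout so a missing element fails fast instead of hanging. Here's a sketch of strategy #2 with that refinement; `page` and `url` are set up as in the earlier snippets, and the selector and timeout are illustrative:

```python
from playwright.sync_api import TimeoutError as PlaywrightTimeout

try:
    page.goto(url)
    # Fail after 10 seconds instead of Playwright's 30-second default
    page.wait_for_selector(".product-list", state="visible", timeout=10_000)
    content = page.text_content(".product-list")
except PlaywrightTimeout:
    content = None  # the element never appeared; log and handle upstream
```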
Interactions: Clicking and Filling
Some content requires interaction. Common patterns:
# Click "Load More" buttonpage.click("button.load-more")page.wait_for_selector(".new-items", state="visible")# Fill search formpage.fill("input[name='query']", "search term")page.click("button[type='submit']")page.wait_for_url("**/search?**")# Select dropdownpage.select_option("select.category", "technology")
The pattern is always: interact, then wait for the result.
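Applied repeatedly, the same pattern handles paginated content. This is a hypothetical "Load More" loop; the selectors (`button.load-more`, `.item`) are placeholders for whatever the real site uses:

```python
# Keep clicking "Load More" until the button disappears
while page.is_visible("button.load-more"):
    before = page.locator(".item").count()
    page.click("button.load-more")
    # Wait for the result of the click: the item count must grow
    page.wait_for_function(
        "n => document.querySelectorAll('.item').length > n", arg=before
    )
```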
When Browsers Become Expensive
Playwright works. You can scrape JavaScript-heavy sites. But production reveals costs:
- Memory Usage: Each browser instance uses 200-500 MB. Running 10 concurrent browsers needs 5 GB+ of RAM.
- CPU Load: JavaScript execution is CPU-intensive. Rendering a React app is 10-100x slower than parsing static HTML.
- Deployment Complexity: Playwright requires system dependencies. Docker images are 1.5 GB+. Lambda requires custom layers.
- Reliability: Browsers crash and pages time out, so you need retry logic, error handling, and monitoring (see the retry sketch below).
- Anti-Detection: Many sites block headless browsers. You need user-agent rotation, proxy management, and fingerprint randomization.
You're not just scraping—you're operating a browser farm.
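To make the reliability point concrete, here is a minimal retry sketch. It relaunches the browser on each attempt so a crashed process can't poison later tries; the function name, timeouts, and backoff are all illustrative, not a prescribed pattern:

```python
import time

from playwright.sync_api import Error as PlaywrightError, sync_playwright

def scrape_with_retries(url: str, selector: str, attempts: int = 3) -> str | None:
    for attempt in range(attempts):
        try:
            with sync_playwright() as p:
                browser = p.chromium.launch(headless=True)
                page = browser.new_page()
                page.goto(url, timeout=30_000)
                page.wait_for_selector(selector, state="visible", timeout=10_000)
                return page.text_content(selector)
        except PlaywrightError:
            time.sleep(2 ** attempt)  # back off: 1 s, 2 s, 4 s ...
    return None
```

And this only covers crashes and timeouts; anti-detection and memory limits each need their own machinery on top.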
The Managed Alternative
Supacrawler handles JavaScript rendering without the browser farm:
```bash
curl -G https://api.supacrawler.com/api/v1/scrape \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d url="https://react-app.example.com" \
  -d render_js=true \
  -d format="markdown"
```
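The same call from Python is just an HTTP request; this sketch mirrors the curl command's parameters exactly, and the JSON response shape is an assumption to verify against the API docs:

```python
import requests

resp = requests.get(
    "https://api.supacrawler.com/api/v1/scrape",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    params={
        "url": "https://react-app.example.com",
        "render_js": "true",
        "format": "markdown",
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json())  # assumed JSON body; see the docs for the actual schema
```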
The operational difference:
| Concern | DIY Playwright | Supacrawler API |
|---|---|---|
| Browser management | Manual pool, restart logic | Automatic |
| Memory limits | Configured per instance | Built-in |
| Anti-detection | User-agent rotation, proxies | 99.9% success rate |
| JavaScript execution | Choose a wait strategy | Automatic detection |
| Deployment | Docker + dependencies | API call |
| Scaling | Provision servers | Automatic |
What You've Learned
JavaScript rendering breaks traditional HTTP scraping. The solutions:
- Use a headless browser: Execute JavaScript like a real browser
- Wait strategically: Network idle for API calls, specific elements for reliability
- Production is different: Browser farms need memory management, anti-detection, and retry logic
Start with Playwright to understand how browsers work. Move to managed services when infrastructure becomes more complex than your product.
The JavaScript execution works the same way. The difference is who manages the browsers.