
Python Web Scraping Tutorial for Beginners: Complete Guide 2025

Web scraping is one of the most valuable skills for any Python developer. Whether you're building a data science project, monitoring competitors, or automating repetitive tasks, the ability to extract data from websites opens up countless possibilities.

This tutorial will take you from complete beginner to confident web scraper. We'll start with Python's basic libraries to understand the fundamentals, then show you how modern tools like Supacrawler can simplify the entire process.

By the end of this guide, you'll understand when to use each approach and be able to confidently extract data from most websites you encounter.

What You'll Learn

  • Python web scraping fundamentals using Requests and BeautifulSoup
  • How to handle different types of websites (static vs dynamic)
  • Common challenges and how to solve them
  • Best practices to avoid getting blocked
  • Modern alternatives that eliminate complexity
  • Complete working examples you can run immediately

Let's dive in!

Method 1: The Traditional Approach (BeautifulSoup + Requests)

First, let's learn web scraping the traditional way. This helps you understand what's happening under the hood and why modern solutions are so valuable.

Setting Up Your Environment

# Install required packages
pip install requests beautifulsoup4 lxml

Your First Python Web Scraper

Let's start by scraping a simple news website to extract headlines:

Basic web scraping with Python

import requests
from bs4 import BeautifulSoup
import time

def scrape_news_headlines(url):
    """
    Scrape headlines from a news website
    """
    try:
        # Step 1: Send HTTP request to get the page
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
        }
        response = requests.get(url, headers=headers)
        response.raise_for_status()  # Raise an exception for bad status codes

        # Step 2: Parse the HTML content
        soup = BeautifulSoup(response.content, 'html.parser')

        # Step 3: Find and extract headlines
        headlines = []

        # Look for common headline selectors
        headline_selectors = [
            'h1', 'h2', 'h3',            # Basic heading tags
            '.headline', '.title',       # Common CSS classes
            '[data-testid="headline"]'   # Modern data attributes
        ]

        for selector in headline_selectors:
            elements = soup.select(selector)
            for element in elements:
                text = element.get_text().strip()
                if text and len(text) > 10:  # Filter out short/empty text
                    headlines.append(text)

        # Remove duplicates while preserving order
        seen = set()
        unique_headlines = []
        for headline in headlines:
            if headline not in seen:
                seen.add(headline)
                unique_headlines.append(headline)

        return unique_headlines[:10]  # Return top 10 headlines

    except requests.RequestException as e:
        print(f"Error fetching the page: {e}")
        return []
    except Exception as e:
        print(f"Error parsing the page: {e}")
        return []

# Example usage
if __name__ == "__main__":
    # Try scraping from a news site
    news_sites = [
        "https://news.ycombinator.com",
        "https://techcrunch.com",
        "https://www.bbc.com/news"
    ]

    for site in news_sites:
        print(f"\n--- Headlines from {site} ---")
        headlines = scrape_news_headlines(site)

        if headlines:
            for i, headline in enumerate(headlines, 1):
                print(f"{i}. {headline}")
        else:
            print("No headlines found or error occurred")

        # Be polite - wait between requests
        time.sleep(2)

Understanding the Code

Let's break down what this code does:

  1. HTTP Request: We use requests to fetch the webpage, just like your browser does
  2. HTML Parsing: BeautifulSoup parses the HTML into a structure we can navigate
  3. Data Extraction: We use CSS selectors to find headline elements
  4. Data Cleaning: We remove duplicates and filter out short text
  5. Error Handling: We catch common errors that occur during scraping
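
If CSS selectors are new to you, here is a minimal, self-contained sketch of how select() and select_one() behave. The HTML string is made up purely for illustration:

from bs4 import BeautifulSoup

# A tiny, made-up HTML snippet just to illustrate selector behaviour
html = """
<article>
  <h2 class="headline">Python 3.13 released</h2>
  <h2 class="headline">New web framework announced</h2>
  <p class="title">Not a real headline</p>
</article>
"""

soup = BeautifulSoup(html, 'html.parser')

# select() returns every matching element; select_one() returns only the first (or None)
for tag in soup.select('h2.headline'):
    print(tag.get_text().strip())

first = soup.select_one('.title')
print(first.get_text().strip() if first else 'No match')

The same select() calls work identically on a full page fetched with requests, which is exactly what the scraper above does.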

Handling More Complex Scenarios

Real websites are messier than our example. Let's handle some common challenges:

Advanced BeautifulSoup techniques

import requests
from bs4 import BeautifulSoup
import time
import re
from urllib.parse import urljoin, urlparse

class WebScraper:
    def __init__(self):
        self.session = requests.Session()
        self.session.headers.update({
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
        })

    def scrape_product_details(self, product_url):
        """
        Scrape product information from an e-commerce page
        """
        try:
            response = self.session.get(product_url)
            response.raise_for_status()
            soup = BeautifulSoup(response.content, 'html.parser')

            # Extract product details using multiple strategies
            product = {
                'name': self._extract_product_name(soup),
                'price': self._extract_price(soup),
                'description': self._extract_description(soup),
                'images': self._extract_images(soup, product_url),
                'rating': self._extract_rating(soup)
            }
            return product

        except Exception as e:
            print(f"Error scraping {product_url}: {e}")
            return None

    def _extract_product_name(self, soup):
        """Try multiple selectors to find product name"""
        selectors = [
            'h1',
            '.product-title',
            '.product-name',
            '[data-testid="product-name"]',
            'title'
        ]
        for selector in selectors:
            element = soup.select_one(selector)
            if element:
                return element.get_text().strip()
        return "Name not found"

    def _extract_price(self, soup):
        """Extract price with multiple patterns"""
        # Look for price patterns
        price_patterns = [
            r'\$[\d,]+\.?\d*',  # $19.99, $1,299
            r'£[\d,]+\.?\d*',   # £19.99
            r'€[\d,]+\.?\d*',   # €19.99
        ]

        # Try specific selectors first
        price_selectors = [
            '.price',
            '.cost',
            '[data-testid="price"]',
            '.product-price'
        ]
        for selector in price_selectors:
            element = soup.select_one(selector)
            if element:
                text = element.get_text()
                for pattern in price_patterns:
                    match = re.search(pattern, text)
                    if match:
                        return match.group()

        # If specific selectors fail, search entire page
        page_text = soup.get_text()
        for pattern in price_patterns:
            match = re.search(pattern, page_text)
            if match:
                return match.group()

        return "Price not found"

    def _extract_description(self, soup):
        """Extract product description"""
        selectors = [
            '.product-description',
            '.description',
            '[data-testid="description"]',
            '.product-details p'
        ]
        for selector in selectors:
            element = soup.select_one(selector)
            if element:
                return element.get_text().strip()[:500]  # Limit length
        return "Description not found"

    def _extract_images(self, soup, base_url):
        """Extract product images and convert to absolute URLs"""
        images = []
        img_elements = soup.find_all('img')
        for img in img_elements:
            src = img.get('src') or img.get('data-src')  # Handle lazy loading
            if src:
                # Convert relative URLs to absolute
                absolute_url = urljoin(base_url, src)
                # Filter out tiny images (likely icons)
                if not any(word in src.lower() for word in ['icon', 'logo', 'sprite']):
                    images.append(absolute_url)
        return images[:5]  # Return first 5 images

    def _extract_rating(self, soup):
        """Extract product rating"""
        # Look for star ratings or numeric ratings
        rating_selectors = [
            '.rating',
            '.stars',
            '[data-testid="rating"]'
        ]
        for selector in rating_selectors:
            element = soup.select_one(selector)
            if element:
                text = element.get_text()
                # Look for patterns like "4.5 stars" or "4.5/5"
                rating_match = re.search(r'(\d+\.?\d*)', text)
                if rating_match:
                    return rating_match.group()
        return "Rating not found"

# Example usage
scraper = WebScraper()

# Test with a product URL (replace with a real one)
product_urls = [
    "https://example-store.com/product/123",
    # Add more URLs to test
]

for url in product_urls:
    print(f"\n--- Scraping {url} ---")
    product = scraper.scrape_product_details(url)
    if product:
        for key, value in product.items():
            print(f"{key.title()}: {value}")
    time.sleep(2)  # Be respectful with delays

The Challenges with Traditional Scraping

As you can see, even this "simple" approach requires:

  • Multiple fallback strategies for finding data
  • Complex regular expressions for extracting patterns
  • URL handling for images and links
  • Error handling for network issues
  • Rate limiting to avoid being blocked
  • User agent management to appear like a real browser
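
To make the last three points concrete, here is a minimal sketch of a requests.Session configured with automatic retries, exponential backoff, and a browser-like User-Agent. The URLs are placeholders and the retry settings are illustrative defaults, not tuned recommendations:

import time
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def build_polite_session():
    """Session with automatic retries, backoff, and a browser-like User-Agent."""
    retry = Retry(
        total=3,                                    # retry up to 3 times
        backoff_factor=1,                           # exponential backoff between attempts
        status_forcelist=[429, 500, 502, 503, 504]  # retry on these status codes
    )
    session = requests.Session()
    session.mount('https://', HTTPAdapter(max_retries=retry))
    session.mount('http://', HTTPAdapter(max_retries=retry))
    session.headers.update({
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
    })
    return session

session = build_polite_session()
for url in ["https://example.com/page1", "https://example.com/page2"]:  # placeholder URLs
    response = session.get(url, timeout=10)
    print(url, response.status_code)
    time.sleep(2)  # simple rate limiting between requests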

And we haven't even touched on the biggest challenge: JavaScript-rendered content.

The JavaScript Problem

Many modern websites load their content dynamically with JavaScript. Let's see what happens when we try to scrape a React-based site:

JavaScript scraping challenge

import requests
from bs4 import BeautifulSoup

def try_scraping_spa():
    """
    Attempt to scrape a Single Page Application (SPA)
    This will demonstrate why traditional scraping fails
    """
    # Try scraping a React/Vue app
    spa_url = "https://example-react-app.com"

    response = requests.get(spa_url)
    soup = BeautifulSoup(response.content, 'html.parser')

    print("HTML content received:")
    print(soup.prettify()[:500])

    # You'll likely see something like:
    # <div id="root"></div>
    # <script src="app.js"></script>
    #
    # The actual content is loaded by JavaScript after the page loads!

try_scraping_spa()

When you run this against a modern web app, you'll see mostly empty HTML with JavaScript files. The content you want is generated after the page loads, which requests and BeautifulSoup can't handle.
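
If you are unsure whether a site falls into this category, a rough heuristic (not a reliable test) is to compare how much visible text the raw HTML contains against how many script tags it ships. The thresholds below are arbitrary illustrations:

import requests
from bs4 import BeautifulSoup

def looks_javascript_rendered(url):
    """Rough heuristic: little visible text plus many <script> tags
    usually means the content is rendered client-side."""
    response = requests.get(url, timeout=10)
    soup = BeautifulSoup(response.content, 'html.parser')

    visible_text = soup.get_text(strip=True)
    script_count = len(soup.find_all('script'))

    # Thresholds are illustrative only; tune them for your own use case
    return len(visible_text) < 500 and script_count > 5

# Example (placeholder URL)
# print(looks_javascript_rendered("https://example-react-app.com"))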

Traditional Solution: Selenium (The Heavy Approach)

To scrape JavaScript-heavy sites, many developers turn to Selenium:

Selenium approach

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time

def setup_driver():
    """Setup Chrome driver with options"""
    chrome_options = Options()
    chrome_options.add_argument("--headless")  # Run in background
    chrome_options.add_argument("--no-sandbox")
    chrome_options.add_argument("--disable-dev-shm-usage")

    # You need to download ChromeDriver and add it to PATH
    driver = webdriver.Chrome(options=chrome_options)
    return driver

def scrape_with_selenium(url):
    """Scrape JavaScript-heavy sites with Selenium"""
    driver = setup_driver()
    try:
        # Navigate to the page
        driver.get(url)

        # Wait for content to load
        wait = WebDriverWait(driver, 10)
        wait.until(EC.presence_of_element_located((By.TAG_NAME, "body")))

        # Additional wait for dynamic content
        time.sleep(3)

        # Extract data
        elements = driver.find_elements(By.CSS_SELECTOR, ".content-item")
        data = []
        for element in elements:
            data.append(element.text)

        return data

    except Exception as e:
        print(f"Error: {e}")
        return []
    finally:
        driver.quit()

# Example usage
# data = scrape_with_selenium("https://dynamic-content-site.com")

Problems with Selenium

While Selenium works, it comes with significant challenges:

  • Complex setup: Need to install browser drivers
  • Resource intensive: Launches full browser instances
  • Slow: Much slower than HTTP requests
  • Brittle: Breaks when browser/driver versions change
  • Scaling issues: Difficult to run multiple instances
  • Detection: Easier for sites to detect and block

Method 2: The Modern Approach (Supacrawler API)

This is where Supacrawler shines. It handles all the complexity we just discussed with a simple API call. Let's see the difference:

Supacrawler: The simple solution

from supacrawler import SupacrawlerClient
import os

# Initialize the client
client = SupacrawlerClient(api_key=os.environ.get('SUPACRAWLER_API_KEY', 'YOUR_API_KEY'))

def scrape_with_supacrawler(url):
    """
    Scrape any website (static or JavaScript) with one API call
    """
    try:
        # Single API call handles everything
        response = client.scrape(
            url=url,
            render_js=True,     # Handle JavaScript automatically
            format="markdown"   # Get clean, structured content
        )

        return {
            'title': response.metadata.title if response.metadata else 'No title',
            'content': response.markdown,
            'url': url
        }

    except Exception as e:
        print(f"Error scraping {url}: {e}")
        return None

# Example: Scrape different types of sites
sites_to_scrape = [
    "https://news.ycombinator.com",          # Static content
    "https://react-app-example.com",         # JavaScript content
    "https://docs.python.org/3/tutorial/"    # Documentation
]

for site in sites_to_scrape:
    print(f"\n--- Scraping {site} ---")
    result = scrape_with_supacrawler(site)
    if result:
        print(f"Title: {result['title']}")
        print(f"Content length: {len(result['content'])} characters")
        print(f"First 200 chars: {result['content'][:200]}...")

Advanced Supacrawler Usage

For more complex scraping needs, Supacrawler provides powerful features:

Advanced Supacrawler features

from supacrawler import SupacrawlerClient
import os
import json

client = SupacrawlerClient(api_key=os.environ.get('SUPACRAWLER_API_KEY'))

def scrape_product_catalog(base_url):
    """
    Scrape an entire product catalog with structured data extraction
    """
    response = client.scrape(
        url=base_url,
        render_js=True,
        format="html",  # Get HTML for structured extraction
        selectors={
            "products": {
                "selector": ".product-card",
                "multiple": True,
                "fields": {
                    "name": ".product-name",
                    "price": ".price",
                    "image": "img@src",
                    "rating": ".rating@data-rating",
                    "link": "a@href"
                }
            }
        }
    )
    return response.data.get("products", []) if response.data else []

def scrape_news_with_metadata(news_url):
    """
    Scrape news articles with rich metadata
    """
    response = client.scrape(
        url=news_url,
        render_js=True,
        format="markdown",
        include_metadata=True
    )

    return {
        'title': response.metadata.title if response.metadata else None,
        'description': response.metadata.description if response.metadata else None,
        'author': response.metadata.author if response.metadata else None,
        'publish_date': response.metadata.publish_date if response.metadata else None,
        'content': response.markdown,
        'word_count': len(response.markdown.split()) if response.markdown else 0
    }

def monitor_competitor_prices(competitor_urls):
    """
    Monitor multiple competitor sites for price changes
    """
    price_data = {}
    for url in competitor_urls:
        response = client.scrape(
            url=url,
            render_js=True,
            selectors={
                "price": ".price, .cost, [data-price]",
                "product_name": "h1, .product-title"
            }
        )

        if response.data:
            price_data[url] = {
                'product': response.data.get('product_name', 'Unknown'),
                'price': response.data.get('price', 'Not found'),
                'timestamp': response.metadata.scraped_at if response.metadata else None
            }

    return price_data

# Example usage
if __name__ == "__main__":
    # Example 1: Scrape product catalog
    print("=== Product Catalog ===")
    products = scrape_product_catalog("https://example-store.com/products")
    for product in products[:3]:  # Show first 3 products
        print(f"Product: {product.get('name', 'N/A')}")
        print(f"Price: {product.get('price', 'N/A')}")
        print(f"Rating: {product.get('rating', 'N/A')}")
        print("---")

    # Example 2: Scrape news with metadata
    print("\n=== News Article ===")
    article = scrape_news_with_metadata("https://techcrunch.com/latest-article")
    print(f"Title: {article['title']}")
    print(f"Author: {article['author']}")
    print(f"Word count: {article['word_count']}")

    # Example 3: Monitor competitor prices
    print("\n=== Competitor Monitoring ===")
    competitors = [
        "https://competitor1.com/product",
        "https://competitor2.com/product"
    ]
    prices = monitor_competitor_prices(competitors)
    print(json.dumps(prices, indent=2))

Comparison: Traditional vs Modern Approach

Let's see the difference side by side:

| Aspect | BeautifulSoup + Requests | Selenium | Supacrawler API |
|---|---|---|---|
| Setup | pip install 2 packages | Complex driver management | pip install supacrawler |
| JavaScript | ❌ Not supported | ✅ Full support | ✅ Automatic handling |
| Speed | Fast for static content | Slow (2-5 seconds per page) | Fast (< 1 second) |
| Memory | Low (~10MB) | High (~100-300MB per instance) | Minimal (~5MB) |
| Scaling | Manual proxy/rate limiting | Complex orchestration | Built-in scaling |
| Maintenance | Constant updates needed | Driver version management | Zero maintenance |
| Error Handling | Manual implementation | Complex exception handling | Built-in retries |
| Code Complexity | 50-100 lines | 80-150 lines | 5-10 lines |

When to Use Each Approach

Use BeautifulSoup + Requests when:

  • Learning web scraping fundamentals
  • Scraping simple, static websites
  • You need maximum control over the process
  • Working with very high volume (thousands of pages)

Use Supacrawler when:

  • Scraping modern websites with JavaScript
  • Building production applications
  • You want reliability and minimal maintenance
  • Working with complex sites (SPAs, authentication, etc.)
  • Focusing on data use rather than scraping mechanics

Best Practices for Python Web Scraping

Regardless of which tool you choose, follow these best practices:

Best practices

import time
import random
from datetime import datetime
import logging

# Set up logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

class EthicalScraper:
    def __init__(self):
        self.request_delays = (1, 3)  # Random delay between 1-3 seconds
        self.max_retries = 3

    def respectful_scrape(self, urls):
        """
        Scrape multiple URLs while being respectful
        """
        results = []

        for i, url in enumerate(urls):
            logger.info(f"Scraping {i+1}/{len(urls)}: {url}")

            # Retry logic
            for attempt in range(self.max_retries):
                try:
                    # Your scraping code here (Supacrawler or other)
                    result = self.scrape_single_url(url)
                    results.append(result)
                    break
                except Exception as e:
                    logger.warning(f"Attempt {attempt+1} failed for {url}: {e}")
                    if attempt == self.max_retries - 1:
                        logger.error(f"All attempts failed for {url}")
                        results.append(None)
                    else:
                        time.sleep(2 ** attempt)  # Exponential backoff

            # Random delay between requests
            if i < len(urls) - 1:  # Don't wait after last URL
                delay = random.uniform(*self.request_delays)
                logger.info(f"Waiting {delay:.1f} seconds before next request")
                time.sleep(delay)

        return results

    def scrape_single_url(self, url):
        """
        Override this method with your actual scraping logic
        """
        # Example with Supacrawler
        from supacrawler import SupacrawlerClient
        client = SupacrawlerClient(api_key='YOUR_API_KEY')
        return client.scrape(url=url, render_js=True)

    def check_robots_txt(self, base_url):
        """
        Check robots.txt to see if scraping is allowed
        """
        from urllib.robotparser import RobotFileParser

        robots_url = f"{base_url}/robots.txt"
        try:
            rp = RobotFileParser()
            rp.set_url(robots_url)
            rp.read()

            # Check if our user agent can fetch the page
            can_fetch = rp.can_fetch('*', base_url)
            logger.info(f"Robots.txt allows scraping: {can_fetch}")
            return can_fetch
        except Exception as e:
            logger.warning(f"Could not read robots.txt: {e}")
            return True  # Assume allowed if can't read

# Additional utility functions
def save_scraped_data(data, filename=None):
    """Save scraped data with timestamp"""
    if filename is None:
        timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
        filename = f"scraped_data_{timestamp}.json"

    import json
    with open(filename, 'w') as f:
        json.dump(data, f, indent=2, default=str)

    logger.info(f"Data saved to {filename}")

def validate_url(url):
    """Basic URL validation"""
    from urllib.parse import urlparse
    try:
        result = urlparse(url)
        return all([result.scheme, result.netloc])
    except Exception:
        return False

# Example usage
scraper = EthicalScraper()

# Check if scraping is allowed
if scraper.check_robots_txt("https://example.com"):
    # Scrape respectfully
    urls = [
        "https://example.com/page1",
        "https://example.com/page2",
        "https://example.com/page3"
    ]
    results = scraper.respectful_scrape(urls)
    save_scraped_data(results)

Common Challenges and Solutions

Challenge 1: Getting Blocked

Problem: Website returns 403/429 errors or blocks your IP

Solutions:

import time
import random

# Add realistic headers
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
    'Accept-Encoding': 'gzip, deflate',
    'Connection': 'keep-alive'
}

# Add delays between requests
time.sleep(random.uniform(1, 3))

# With Supacrawler, these issues are handled automatically

Challenge 2: Dynamic Content Loading

Problem: Content loads after page renders

Traditional Solution: Complex Selenium setup

Modern Solution: Supacrawler handles it automatically with render_js=True
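
A minimal sketch, reusing the client setup and scrape() call shown earlier in this post (the URL is a placeholder):

from supacrawler import SupacrawlerClient
import os

client = SupacrawlerClient(api_key=os.environ.get('SUPACRAWLER_API_KEY', 'YOUR_API_KEY'))

# render_js=True asks the service to execute the page's JavaScript before returning content
response = client.scrape(
    url="https://example-react-app.com",  # placeholder URL from the earlier SPA example
    render_js=True,
    format="markdown"
)
print(response.markdown[:200] if response.markdown else "No content returned")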

Challenge 3: Complex Data Extraction

Problem: Data is scattered across multiple elements

Supacrawler Solution:

selectors = {
    "product": {
        "selector": ".product-card",
        "multiple": True,
        "fields": {
            "name": ".title",
            "price": ".price",
            "availability": ".stock-status",
            "reviews": {
                "selector": ".review",
                "multiple": True,
                "fields": {
                    "rating": ".stars@data-rating",
                    "comment": ".review-text"
                }
            }
        }
    }
}

Building Your First Real Project

Let's put everything together and build a practical project: a news aggregator that monitors multiple sources.

Complete news aggregator project

from supacrawler import SupacrawlerClient
import os
import json
from datetime import datetime
import time

class NewsAggregator:
    def __init__(self):
        self.client = SupacrawlerClient(
            api_key=os.environ.get('SUPACRAWLER_API_KEY', 'YOUR_API_KEY')
        )
        self.sources = {
            'TechCrunch': 'https://techcrunch.com',
            'Hacker News': 'https://news.ycombinator.com',
            'BBC Tech': 'https://www.bbc.com/news/technology',
            'The Verge': 'https://www.theverge.com'
        }

    def scrape_source(self, source_name, url):
        """Scrape headlines from a news source"""
        print(f"Scraping {source_name}...")
        try:
            response = self.client.scrape(
                url=url,
                render_js=True,
                format="markdown",
                selectors={
                    "headlines": {
                        "selector": "h1, h2, h3, .headline, .title, .story-title",
                        "multiple": True
                    }
                }
            )

            headlines = []
            if response.data and response.data.get('headlines'):
                for headline in response.data['headlines']:
                    if isinstance(headline, str) and len(headline.strip()) > 20:
                        headlines.append(headline.strip())

            return {
                'source': source_name,
                'url': url,
                'headlines': headlines[:10],  # Top 10 headlines
                'scraped_at': datetime.now().isoformat(),
                'total_found': len(headlines)
            }

        except Exception as e:
            print(f"Error scraping {source_name}: {e}")
            return {
                'source': source_name,
                'url': url,
                'headlines': [],
                'error': str(e),
                'scraped_at': datetime.now().isoformat()
            }

    def aggregate_all_news(self):
        """Scrape all news sources and aggregate results"""
        all_news = []
        for source_name, url in self.sources.items():
            news_data = self.scrape_source(source_name, url)
            all_news.append(news_data)

            # Be respectful - wait between requests
            time.sleep(2)

        return all_news

    def save_results(self, news_data, filename=None):
        """Save results to JSON file"""
        if filename is None:
            timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
            filename = f"news_aggregator_{timestamp}.json"

        with open(filename, 'w') as f:
            json.dump(news_data, f, indent=2)

        print(f"Results saved to {filename}")
        return filename

    def print_summary(self, news_data):
        """Print a summary of scraped news"""
        print("\n" + "=" * 50)
        print("NEWS AGGREGATOR SUMMARY")
        print("=" * 50)

        total_headlines = 0
        for source_data in news_data:
            source_name = source_data['source']
            headlines = source_data.get('headlines', [])
            error = source_data.get('error')

            print(f"\n{source_name}:")
            if error:
                print(f"  ❌ Error: {error}")
            else:
                print(f"  ✅ Found {len(headlines)} headlines")
                total_headlines += len(headlines)

                # Show first 3 headlines
                for i, headline in enumerate(headlines[:3], 1):
                    print(f"  {i}. {headline[:80]}...")

        print(f"\nTotal headlines collected: {total_headlines}")

# Example usage
if __name__ == "__main__":
    # Create aggregator
    aggregator = NewsAggregator()

    # Scrape all sources
    news_data = aggregator.aggregate_all_news()

    # Print summary
    aggregator.print_summary(news_data)

    # Save results
    filename = aggregator.save_results(news_data)
    print(f"\nDone! Check {filename} for complete results.")

Next Steps and Advanced Topics

Congratulations! You now understand Python web scraping from basic principles to modern solutions. Here are some areas to explore next:

1. Handling Authentication

# Supacrawler can handle login flows
response = client.scrape(
url="https://private-site.com/data",
authentication={
"type": "form",
"login_url": "https://private-site.com/login",
"username": "your_username",
"password": "your_password"
}
)

2. Large-Scale Crawling

# Use the Crawl API for entire websites
crawl_job = client.create_crawl_job(
url="https://documentation-site.com",
depth=3,
max_pages=500,
include_patterns=["/docs/*"]
)

3. Monitoring Changes

# Set up automated monitoring
watch_job = client.create_watch_job(
url="https://competitor.com/pricing",
frequency="daily",
notify_email="[email protected]"
)

Conclusion

Web scraping in Python has evolved dramatically. While understanding the fundamentals with BeautifulSoup and Requests is valuable, modern tools like Supacrawler eliminate most of the complexity while providing superior results.

Key Takeaways:

  1. Start simple - Understand the basics with traditional tools
  2. Recognize limitations - JavaScript content requires different approaches
  3. Choose the right tool - Supacrawler for production, BeautifulSoup for learning
  4. Be respectful - Follow rate limits and robots.txt
  5. Focus on value - Spend time on using data, not wrestling with scraping mechanics

The goal isn't to become an expert in browser automation - it's to get the data you need to build amazing projects. Choose the approach that lets you focus on what matters most.

Ready to start scraping?

Happy scraping! 🐍✨

By Supacrawler Team
Published on July 2, 2025