Complete Guide to Image Scraping: Websites, GitHub Repos, and Beyond (2025)
Modern websites load images dynamically through JavaScript, making traditional HTTP-only scrapers ineffective. Supacrawler solves this by driving a real browser for you — no Playwright/Puppeteer setup, no headless Chrome to manage, just a simple API call to extract the images you need.
Note: If you're using the SDKs (recommended), install them first; see our Install the SDKs guide.
Goal
Extract all images from a website, including those loaded by JavaScript, with their URLs, alt text, and dimensions. We'll also show how to target specific repositories on GitHub to extract images from README files and other content.
Scraping images from a website
```bash
curl -X POST https://api.supacrawler.com/api/v1/scrape \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://github.com/trending",
    "render_js": true,
    "format": "html",
    "selectors": {
      "images": {
        "selector": "img",
        "multiple": true,
        "attributes": ["src", "alt", "width", "height"]
      }
    }
  }'
```
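If you prefer the JavaScript SDK over raw HTTP, the same request looks roughly like this. This is a minimal sketch that assumes the SDK's `scrape()` options mirror the REST payload, as in the infinite scroll example later in this guide:

```javascript
import { SupacrawlerClient } from '@supacrawler/js'

const client = new SupacrawlerClient({ apiKey: process.env.SUPACRAWLER_API_KEY || 'YOUR_API_KEY' })

// Same request as the curl call above, expressed through the SDK
const response = await client.scrape({
  url: 'https://github.com/trending',
  render_js: true,
  format: 'html',
  selectors: {
    images: {
      selector: 'img',
      multiple: true,
      attributes: ['src', 'alt', 'width', 'height']
    }
  }
})

// The response is assumed to expose results under the selector keys
console.log(response.data?.images)
```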
Scraping Images from GitHub Repositories
GitHub repositories often contain valuable images in README files, documentation, and issues. Here's how to extract them:
Scraping images from GitHub repositories
```bash
curl -X POST https://api.supacrawler.com/api/v1/scrape \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://github.com/username/repository",
    "render_js": true,
    "format": "html",
    "selectors": {
      "readme_images": {
        "selector": "#readme img",
        "multiple": true,
        "attributes": ["src", "alt"]
      },
      "repo_name": {
        "selector": "strong[itemprop=\"name\"] a",
        "text": true
      }
    }
  }'
```
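README images are often referenced with repository-relative paths, so they need to be resolved before downloading. A minimal sketch, assuming the response exposes the scraped attributes under the `readme_images` selector key:

```javascript
// Resolve relative README image paths against the repository URL.
// Absolute src values (e.g. camo.githubusercontent.com proxies) pass through unchanged.
function resolveReadmeImages(repoUrl, images) {
  return images.map(img => ({
    ...img,
    src: new URL(img.src, repoUrl).href
  }))
}

// Hypothetical usage with a scrape response:
// const readmeImages = resolveReadmeImages('https://github.com/username/repository', response.data?.readme_images || [])
```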
Tips for Effective Image Scraping
- JavaScript Rendering: Always enable JavaScript rendering (`render_js: true` in the examples above) when scraping modern websites so dynamically loaded images are captured.
- CSS Selectors: Use specific CSS selectors to target the images you want:
  - `img` for all images
  - `.container img` for images within a specific container
  - `img[src*="large"]` for images with "large" in the src attribute
- Attribute Selection: Request specific attributes to get the information you need:
  - `src` for the image URL
  - `alt` for alternative text
  - `width` and `height` for dimensions
  - `data-*` attributes for custom metadata
- Image Filtering: Filter out icons, avatars, and other small images by checking dimensions or URL patterns.
- URL Handling: Convert relative URLs to absolute URLs before downloading (see the sketch after this list, which combines this with filtering).
- Rate Limiting: Implement delays between requests to avoid overloading the target server.
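Putting the filtering and URL-handling tips together, here is a minimal post-processing sketch. It assumes you already have the array of image objects returned for the `images` selector; the size threshold and URL patterns are illustrative:

```javascript
// Drop icons, avatars, and other small images, then resolve relative URLs
function cleanImages(pageUrl, images, minSize = 100) {
  return images
    .filter(img => {
      const w = parseInt(img.width, 10)
      const h = parseInt(img.height, 10)
      // Drop images with a known dimension below the threshold;
      // keep images whose dimensions are unknown
      if (!Number.isNaN(w) && w < minSize) return false
      if (!Number.isNaN(h) && h < minSize) return false
      // Skip common icon/avatar URL patterns (illustrative list)
      return !/avatar|icon|logo|sprite/i.test(img.src)
    })
    .map(img => ({ ...img, src: new URL(img.src, pageUrl).href }))
}
```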
Advanced: Scraping Images from Infinite Scroll Pages
Many modern websites load additional images as the user scrolls down the page. Here's how to handle this with Supacrawler:
Scraping images from infinite scroll pages
```javascript
import { SupacrawlerClient } from '@supacrawler/js'

const client = new SupacrawlerClient({ apiKey: process.env.SUPACRAWLER_API_KEY || 'YOUR_API_KEY' })

async function scrapeInfiniteScrollImages(url, scrollCount = 3) {
  // Scrape with scrolling simulation
  const response = await client.scrape({
    url: url,
    render_js: true,
    format: 'html',
    scroll_to_bottom: true,
    max_scroll_attempts: scrollCount,
    selectors: {
      images: {
        selector: 'img',
        multiple: true,
        attributes: ['src', 'alt', 'width', 'height']
      }
    }
  })

  const images = response.data?.images || []
  console.log(`Found ${images.length} images after ${scrollCount} scroll attempts`)
  return images
}

// Example usage
scrapeInfiniteScrollImages('https://unsplash.com/t/nature', 5)
  .then(images => console.log(images))
  .catch(console.error)
```
Best Practices for Image Scraping
- Respect Copyright: Only scrape and use images that you have the right to use. Check the website's terms of service and image licensing.
- User-Agent: Set an appropriate user agent to identify your scraper.
- Error Handling: Implement robust error handling for network issues, invalid images, and other potential problems.
- Caching: Cache results to avoid unnecessary repeated requests to the same URLs.
- Metadata Storage: Save image metadata (dimensions, alt text, etc.) along with the images for better organization.
- Pagination: For sites with pagination, iterate through the pages to collect all images.
- Throttling: Limit the rate of your requests to avoid overwhelming the target server. The sketch after this list combines throttling, caching, a custom user agent, and error handling.
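As a concrete example of several of these practices at once, here is a minimal download loop with a cache, a throttle delay, a descriptive user agent, and basic error handling. It uses the built-in `fetch` in Node 18+; the delay and user agent string are illustrative:

```javascript
const cache = new Map()
const sleep = ms => new Promise(resolve => setTimeout(resolve, ms))

async function downloadImages(urls, delayMs = 1000) {
  const results = []
  for (const url of urls) {
    // Caching: skip URLs we've already fetched
    if (cache.has(url)) {
      results.push(cache.get(url))
      continue
    }
    try {
      const res = await fetch(url, {
        // User-Agent: identify your scraper (illustrative value)
        headers: { 'User-Agent': 'my-image-scraper/1.0 (you@example.com)' }
      })
      if (!res.ok) throw new Error(`HTTP ${res.status}`)
      const buffer = Buffer.from(await res.arrayBuffer())
      const entry = { url, bytes: buffer.length, buffer }
      cache.set(url, entry)
      results.push(entry)
    } catch (err) {
      // Error handling: log the failure and continue with the remaining URLs
      console.error(`Failed to download ${url}: ${err.message}`)
    }
    await sleep(delayMs) // Throttling: delay between requests
  }
  return results
}
```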
Conclusion
With Supacrawler, you can efficiently extract images from any website, including JavaScript-heavy sites and GitHub repositories, without managing browser infrastructure. The API handles the complexity of rendering modern web pages, allowing you to focus on using the extracted images.
Whether you're building an image search engine, collecting dataset images for machine learning, or archiving visual content from GitHub repositories, Supacrawler provides a reliable and scalable solution.
Ready to start scraping images? Try Supacrawler for free with 1,000 API calls per month to simplify your web scraping projects.