Complete Guide to Image Scraping: Websites, GitHub Repos, and Beyond (2025)
Modern websites load images dynamically through JavaScript, making traditional HTTP-only scrapers ineffective. Supacrawler solves this by driving a real browser for you — no Playwright/Puppeteer setup, no headless Chrome to manage, just a simple API call to extract the images you need.
Note: If you're using the SDKs (recommended), install them first; see our Install the SDKs guide.
Goal
Extract all images from a website, including those loaded by JavaScript, with their URLs, alt text, and dimensions. We'll also show how to target specific repositories on GitHub to extract images from README files and other content.
Scraping images from a website
```bash
curl -X POST https://api.supacrawler.com/api/v1/scrape \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://github.com/trending",
    "render_js": true,
    "format": "html",
    "selectors": {
      "images": {
        "selector": "img",
        "multiple": true,
        "attributes": ["src", "alt", "width", "height"]
      }
    }
  }'
```
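If you prefer the JavaScript SDK over raw HTTP, the same request looks roughly like this. This is a minimal sketch that assumes the SDK's `scrape()` options mirror the REST payload, as in the infinite scroll example later in this guide:

```javascript
import { SupacrawlerClient } from '@supacrawler/js'

const client = new SupacrawlerClient({ apiKey: process.env.SUPACRAWLER_API_KEY || 'YOUR_API_KEY' })

// Same request as the curl call above, expressed through the SDK
const response = await client.scrape({
  url: 'https://github.com/trending',
  render_js: true,
  format: 'html',
  selectors: {
    images: {
      selector: 'img',
      multiple: true,
      attributes: ['src', 'alt', 'width', 'height']
    }
  }
})

// The response is assumed to expose results under the selector keys
console.log(response.data?.images)
```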
Scraping Images from GitHub Repositories
GitHub repositories often contain valuable images in README files, documentation, and issues. Here's how to extract them:
Scraping images from GitHub repositories
```bash
curl -X POST https://api.supacrawler.com/api/v1/scrape \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://github.com/username/repository",
    "render_js": true,
    "format": "html",
    "selectors": {
      "readme_images": {
        "selector": "#readme img",
        "multiple": true,
        "attributes": ["src", "alt"]
      },
      "repo_name": {
        "selector": "strong[itemprop=\"name\"] a",
        "text": true
      }
    }
  }'
```
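README images are often referenced with repository-relative paths, so they need to be resolved before downloading. A minimal sketch, assuming the response exposes the scraped attributes under the `readme_images` selector key:

```javascript
// Resolve relative README image paths against the repository URL.
// Absolute src values (e.g. camo.githubusercontent.com proxies) pass through unchanged.
function resolveReadmeImages(repoUrl, images) {
  return images.map(img => ({
    ...img,
    src: new URL(img.src, repoUrl).href
  }))
}

// Hypothetical usage with a scrape response:
// const readmeImages = resolveReadmeImages('https://github.com/username/repository', response.data?.readme_images || [])
```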
Tips for Effective Image Scraping
- JavaScript Rendering: Always enable JavaScript rendering (`render_js: true` in the examples above) when scraping modern websites so dynamically loaded images are captured.
- CSS Selectors: Use specific CSS selectors to target the images you want:
  - `img` for all images
  - `.container img` for images within a specific container
  - `img[src*="large"]` for images with "large" in the src attribute
- Attribute Selection: Request specific attributes to get the information you need:
  - `src` for the image URL
  - `alt` for alternative text
  - `width` and `height` for dimensions
  - `data-*` attributes for custom metadata
- Image Filtering: Filter out icons, avatars, and other small images by checking dimensions or URL patterns.
- URL Handling: Convert relative URLs to absolute URLs before downloading (see the sketch after this list, which combines this with filtering).
- Rate Limiting: Implement delays between requests to avoid overloading the target server.
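Putting the filtering and URL-handling tips together, here is a minimal post-processing sketch. It assumes you already have the array of image objects returned for the `images` selector; the size threshold and URL patterns are illustrative:

```javascript
// Drop icons, avatars, and other small images, then resolve relative URLs
function cleanImages(pageUrl, images, minSize = 100) {
  return images
    .filter(img => {
      const w = parseInt(img.width, 10)
      const h = parseInt(img.height, 10)
      // Drop images with a known dimension below the threshold;
      // keep images whose dimensions are unknown
      if (!Number.isNaN(w) && w < minSize) return false
      if (!Number.isNaN(h) && h < minSize) return false
      // Skip common icon/avatar URL patterns (illustrative list)
      return !/avatar|icon|logo|sprite/i.test(img.src)
    })
    .map(img => ({ ...img, src: new URL(img.src, pageUrl).href }))
}
```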
Advanced: Scraping Images from Infinite Scroll Pages
Many modern websites load additional images as the user scrolls down the page. Here's how to handle this with Supacrawler:
Scraping images from infinite scroll pages
```javascript
import { SupacrawlerClient } from '@supacrawler/js'

const client = new SupacrawlerClient({ apiKey: process.env.SUPACRAWLER_API_KEY || 'YOUR_API_KEY' })

async function scrapeInfiniteScrollImages(url, scrollCount = 3) {
  // Scrape with scrolling simulation
  const response = await client.scrape({
    url: url,
    render_js: true,
    format: 'html',
    scroll_to_bottom: true,
    max_scroll_attempts: scrollCount,
    selectors: {
      images: {
        selector: 'img',
        multiple: true,
        attributes: ['src', 'alt', 'width', 'height']
      }
    }
  })

  const images = response.data?.images || []
  console.log(`Found ${images.length} images after ${scrollCount} scroll attempts`)
  return images
}

// Example usage
scrapeInfiniteScrollImages('https://unsplash.com/t/nature', 5)
  .then(images => console.log(images))
  .catch(console.error)
```
Best Practices for Image Scraping
- Respect Copyright: Only scrape and use images that you have the right to use. Check the website's terms of service and image licensing.
- User-Agent: Set an appropriate user agent to identify your scraper.
- Error Handling: Implement robust error handling for network issues, invalid images, and other potential problems.
- Caching: Cache results to avoid unnecessary repeated requests to the same URLs.
- Metadata Storage: Save image metadata (dimensions, alt text, etc.) along with the images for better organization.
- Pagination: For sites with pagination, iterate through the pages to collect all images.
- Throttling: Limit the rate of your requests to avoid overwhelming the target server. The sketch after this list combines throttling, caching, a custom user agent, and error handling.
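As a concrete example of several of these practices at once, here is a minimal download loop with a cache, a throttle delay, a descriptive user agent, and basic error handling. It uses the built-in `fetch` in Node 18+; the delay and user agent string are illustrative:

```javascript
const cache = new Map()
const sleep = ms => new Promise(resolve => setTimeout(resolve, ms))

async function downloadImages(urls, delayMs = 1000) {
  const results = []
  for (const url of urls) {
    // Caching: skip URLs we've already fetched
    if (cache.has(url)) {
      results.push(cache.get(url))
      continue
    }
    try {
      const res = await fetch(url, {
        // User-Agent: identify your scraper (illustrative value)
        headers: { 'User-Agent': 'my-image-scraper/1.0 (you@example.com)' }
      })
      if (!res.ok) throw new Error(`HTTP ${res.status}`)
      const buffer = Buffer.from(await res.arrayBuffer())
      const entry = { url, bytes: buffer.length, buffer }
      cache.set(url, entry)
      results.push(entry)
    } catch (err) {
      // Error handling: log the failure and continue with the remaining URLs
      console.error(`Failed to download ${url}: ${err.message}`)
    }
    await sleep(delayMs) // Throttling: delay between requests
  }
  return results
}
```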
Conclusion
With Supacrawler, you can efficiently extract images from any website, including JavaScript-heavy sites and GitHub repositories, without managing browser infrastructure. The API handles the complexity of rendering modern web pages, allowing you to focus on using the extracted images.
Whether you're building an image search engine, collecting dataset images for machine learning, or archiving visual content from GitHub repositories, Supacrawler provides a reliable and scalable solution.
Ready to start scraping images? Try Supacrawler for free with 1,000 API calls per month to simplify your web scraping projects.