The Challenge of JavaScript-Rendered Content
Modern websites increasingly rely on JavaScript to render content. Single-page applications (SPAs) built with React, Vue, or Angular load a minimal HTML shell, then fetch and render data client-side. When you make a simple HTTP request to these sites, you get an empty or incomplete page because the content only exists after JavaScript execution.
Scraping JavaScript-heavy websites requires headless browsers — real browser engines running without a visible window that can execute JavaScript, render the DOM, and interact with page elements. Combined with proxies, headless browsers unlock data from even the most dynamic websites.
This guide is part of our Complete Guide to Web Scraping Proxies. For avoiding detection while using headless browsers, see How Anti-Bot Systems Detect Proxies.
When Do You Need a Headless Browser?
| Scenario | Simple HTTP | Headless Browser |
|---|---|---|
| Static HTML pages | Works perfectly | Overkill |
| Server-rendered pages with API | Works (hit the API directly) | Not needed |
| SPA (React, Vue, Angular) | Gets empty shell | Required |
| Infinite scroll / lazy loading | Cannot trigger | Required |
| Content behind login + JS | Difficult | Recommended |
| Pages with anti-bot JS checks | Fails detection | Required |
Always check if the site has an API or server-side rendering before reaching for a headless browser. Many "JavaScript-heavy" sites actually have API endpoints that return clean JSON — much faster and cheaper to scrape.
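As a quick illustration of that check: frameworks like Next.js embed the initial page state as JSON in a script tag, so a plain HTTP response often already contains the data. A minimal sketch, assuming a `__NEXT_DATA__`-style tag (the sample HTML and its structure below are hypothetical):

```python
import json
import re

def extract_embedded_json(html: str):
    """Look for server-embedded state (a __NEXT_DATA__-style script tag)
    in raw HTML fetched with a plain HTTP request."""
    match = re.search(
        r'<script[^>]*id="__NEXT_DATA__"[^>]*>(.*?)</script>',
        html,
        re.DOTALL,
    )
    if not match:
        return None  # No embedded state found; a headless browser may be needed
    return json.loads(match.group(1))

# Hypothetical response body from a simple proxied GET request
sample_html = (
    '<html><body><div id="__next"></div>'
    '<script id="__NEXT_DATA__" type="application/json">'
    '{"props": {"products": [{"name": "Widget", "price": "9.99"}]}}'
    '</script></body></html>'
)
state = extract_embedded_json(sample_html)
```

If `extract_embedded_json` returns data, you can skip the headless browser entirely for that site.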
Puppeteer + Proxies (Node.js)
Puppeteer controls Chrome/Chromium programmatically. It is the most mature headless browser tool for Node.js.
Basic Setup with ProxyHat
const puppeteer = require('puppeteer');
async function scrapeWithPuppeteer(url) {
const browser = await puppeteer.launch({
headless: 'new',
args: [
'--proxy-server=http://gate.proxyhat.com:8080',
'--no-sandbox',
'--disable-setuid-sandbox',
'--disable-dev-shm-usage',
],
});
const page = await browser.newPage();
// Authenticate with proxy
await page.authenticate({
username: 'USERNAME',
password: 'PASSWORD',
});
// Set realistic viewport and user agent
await page.setViewport({ width: 1920, height: 1080 });
await page.setUserAgent(
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 ' +
'(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36'
);
try {
await page.goto(url, { waitUntil: 'networkidle2', timeout: 60000 });
// Wait for specific content to render
await page.waitForSelector('.product-list', { timeout: 10000 });
const content = await page.content();
const data = await page.evaluate(() => {
return Array.from(document.querySelectorAll('.product-item')).map(el => ({
name: el.querySelector('.product-name')?.textContent?.trim(),
price: el.querySelector('.product-price')?.textContent?.trim(),
url: el.querySelector('a')?.href,
}));
});
return { html: content, data };
} finally {
await browser.close();
}
}
// Usage
const result = await scrapeWithPuppeteer('https://example.com/products');
console.log(`Found ${result.data.length} products`);
Optimized Multi-Page Scraping
const puppeteer = require('puppeteer');
class PuppeteerScraper {
constructor(concurrency = 3) {
this.concurrency = concurrency;
this.browser = null;
}
async init() {
this.browser = await puppeteer.launch({
headless: 'new',
args: [
'--proxy-server=http://gate.proxyhat.com:8080',
'--no-sandbox',
'--disable-setuid-sandbox',
'--disable-dev-shm-usage',
'--disable-gpu',
'--disable-extensions',
],
});
}
async scrapePage(url) {
const page = await this.browser.newPage();
await page.authenticate({ username: 'USERNAME', password: 'PASSWORD' });
await page.setViewport({ width: 1920, height: 1080 });
// Block unnecessary resources to speed up loading
await page.setRequestInterception(true);
page.on('request', (req) => {
const type = req.resourceType();
if (['image', 'stylesheet', 'font', 'media'].includes(type)) {
req.abort();
} else {
req.continue();
}
});
try {
await page.goto(url, { waitUntil: 'networkidle2', timeout: 30000 });
const content = await page.content();
return { url, status: 'success', html: content };
} catch (err) {
return { url, status: 'error', error: err.message };
} finally {
await page.close();
}
}
async scrapeMany(urls) {
const results = [];
for (let i = 0; i < urls.length; i += this.concurrency) {
const batch = urls.slice(i, i + this.concurrency);
const batchResults = await Promise.all(
batch.map(url => this.scrapePage(url))
);
results.push(...batchResults);
console.log(`Progress: ${results.length}/${urls.length}`);
}
return results;
}
async close() {
if (this.browser) await this.browser.close();
}
}
// Usage
const scraper = new PuppeteerScraper(3);
await scraper.init();
const results = await scraper.scrapeMany(urls);
await scraper.close();
Playwright + Proxies (Python)
Playwright is a newer alternative that supports Chromium, Firefox, and WebKit. Its Python API is clean and well-suited for scraping.
Basic Setup
from playwright.sync_api import sync_playwright
def scrape_with_playwright(url: str) -> dict:
"""Scrape a JavaScript-heavy page using Playwright with ProxyHat proxy."""
with sync_playwright() as p:
browser = p.chromium.launch(
headless=True,
proxy={
"server": "http://gate.proxyhat.com:8080",
"username": "USERNAME",
"password": "PASSWORD",
}
)
context = browser.new_context(
viewport={"width": 1920, "height": 1080},
user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
"AppleWebKit/537.36 (KHTML, like Gecko) "
"Chrome/120.0.0.0 Safari/537.36",
)
page = context.new_page()
try:
page.goto(url, wait_until="networkidle", timeout=60000)
# Wait for dynamic content
page.wait_for_selector(".product-list", timeout=10000)
# Extract data using page.evaluate
products = page.evaluate("""() => {
return Array.from(document.querySelectorAll('.product-item')).map(el => ({
name: el.querySelector('.product-name')?.textContent?.trim(),
price: el.querySelector('.product-price')?.textContent?.trim(),
url: el.querySelector('a')?.href,
}));
}""")
return {"url": url, "products": products, "html": page.content()}
finally:
browser.close()
Async Playwright for Parallel Scraping
import asyncio
from playwright.async_api import async_playwright
async def scrape_batch(urls: list[str], concurrency: int = 3) -> list[dict]:
"""Scrape multiple JS-heavy pages in parallel using Playwright."""
async with async_playwright() as p:
browser = await p.chromium.launch(
headless=True,
proxy={
"server": "http://gate.proxyhat.com:8080",
"username": "USERNAME",
"password": "PASSWORD",
}
)
semaphore = asyncio.Semaphore(concurrency)
async def scrape_one(url: str) -> dict:
async with semaphore:
context = await browser.new_context(
viewport={"width": 1920, "height": 1080},
)
page = await context.new_page()
# Block heavy resources
await page.route("**/*.{png,jpg,jpeg,gif,svg,css,woff,woff2}",
lambda route: route.abort())
try:
await page.goto(url, wait_until="networkidle", timeout=30000)
html = await page.content()
return {"url": url, "status": "success", "html": html}
except Exception as e:
return {"url": url, "status": "error", "error": str(e)}
finally:
await context.close()
tasks = [scrape_one(url) for url in urls]
results = await asyncio.gather(*tasks)
await browser.close()
return results
# Usage
urls = [f"https://example.com/product/{i}" for i in range(50)]
results = asyncio.run(scrape_batch(urls, concurrency=5))
Go: Using chromedp with Proxies
package main
import (
"context"
"fmt"
"log"
"time"
"github.com/chromedp/chromedp"
)
func scrapeJSPage(targetURL string) (string, error) {
// Configure proxy
opts := append(chromedp.DefaultExecAllocatorOptions[:],
chromedp.ProxyServer("http://gate.proxyhat.com:8080"),
chromedp.Flag("headless", true),
chromedp.Flag("disable-gpu", true),
chromedp.Flag("no-sandbox", true),
chromedp.UserAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "+
"AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"),
)
allocCtx, cancel := chromedp.NewExecAllocator(context.Background(), opts...)
defer cancel()
ctx, cancel := chromedp.NewContext(allocCtx)
defer cancel()
ctx, cancel = context.WithTimeout(ctx, 60*time.Second)
defer cancel()
var htmlContent string
err := chromedp.Run(ctx,
chromedp.Navigate(targetURL),
chromedp.WaitVisible(".product-list", chromedp.ByQuery),
chromedp.OuterHTML("html", &htmlContent),
)
if err != nil {
return "", fmt.Errorf("scrape failed: %w", err)
}
return htmlContent, nil
}
func main() {
html, err := scrapeJSPage("https://example.com/products")
if err != nil {
log.Fatal(err)
}
fmt.Printf("Got %d bytes of rendered HTML\n", len(html))
}
Note: Chromium ignores credentials embedded in the proxy URL, so unlike the Puppeteer and Playwright examples there is no in-page authentication step here. For authenticated proxies with chromedp, either whitelist your machine's IP with your proxy provider or answer the `fetch.EventAuthRequired` challenge via the DevTools fetch domain.
Performance Optimization Strategies
Headless browsers are 10-50x slower than simple HTTP requests. Here are strategies to minimize the performance gap:
1. Block Unnecessary Resources
Images, CSS, fonts, and media files are not needed for data extraction. Blocking them dramatically speeds up page loads:
# Playwright resource blocking
async def fast_scrape(page, url):
# Block images, CSS, fonts, media
await page.route("**/*.{png,jpg,jpeg,gif,svg,css,woff,woff2,mp4,webm}",
lambda route: route.abort())
# Also block tracking scripts
await page.route("**/*google-analytics*", lambda route: route.abort())
await page.route("**/*facebook*", lambda route: route.abort())
await page.goto(url, wait_until="domcontentloaded") # Faster than networkidle
return await page.content()
2. Use the Right Wait Strategy
| Strategy | Speed | Reliability | Use Case |
|---|---|---|---|
| `domcontentloaded` | Fast | May miss async data | Pages with inline data |
| `load` | Medium | Good | Most pages |
| `networkidle` | Slow | Highest | Heavy SPAs, infinite scroll |
| Specific selector | Variable | Highest | When you know the target element |
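In practice, the fastest reliable combination is often `domcontentloaded` plus an explicit selector wait. A small duck-typed sketch: the helper works with any object exposing Playwright's sync `goto`/`wait_for_selector` methods, and the `_RecordingPage` stub below is purely a stand-in to demonstrate the call order, not a real page:

```python
def goto_and_wait(page, url: str, selector: str, timeout_ms: int = 10_000):
    """Load fast with 'domcontentloaded', then wait only for the element you need."""
    page.goto(url, wait_until="domcontentloaded", timeout=timeout_ms)
    page.wait_for_selector(selector, timeout=timeout_ms)

class _RecordingPage:
    """Stand-in for a Playwright sync Page; records the calls it receives."""
    def __init__(self):
        self.calls = []

    def goto(self, url, wait_until=None, timeout=None):
        self.calls.append(("goto", url, wait_until))

    def wait_for_selector(self, selector, timeout=None):
        self.calls.append(("wait_for_selector", selector))

demo = _RecordingPage()
goto_and_wait(demo, "https://example.com/products", ".product-list")
```

With a real Playwright page, this skips waiting for analytics beacons and lazy images that `networkidle` would sit through, while still guaranteeing the target element has rendered.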
3. Reuse Browser Instances
Launching a browser takes 1-3 seconds. For batch scraping, launch once and create new pages/contexts for each URL:
from playwright.sync_api import sync_playwright
class BrowserPool:
"""Reusable browser pool for efficient headless scraping."""
def __init__(self, pool_size: int = 3):
self.pool_size = pool_size
self.playwright = None
self.browsers = []
def start(self):
self.playwright = sync_playwright().start()
for _ in range(self.pool_size):
browser = self.playwright.chromium.launch(
headless=True,
proxy={
"server": "http://gate.proxyhat.com:8080",
"username": "USERNAME",
"password": "PASSWORD",
}
)
self.browsers.append(browser)
def get_browser(self, index: int):
return self.browsers[index % self.pool_size]
def stop(self):
for browser in self.browsers:
browser.close()
self.playwright.stop()
# Usage
pool = BrowserPool(pool_size=3)
pool.start()
for i, url in enumerate(urls):
browser = pool.get_browser(i)
context = browser.new_context()
page = context.new_page()
page.goto(url, wait_until="networkidle")
html = page.content()
context.close()
pool.stop()
4. Intercept API Calls Instead of Parsing DOM
Many SPAs fetch data from APIs. Intercept those API calls directly — you get clean JSON without parsing HTML:
const puppeteer = require('puppeteer');
async function interceptAPIData(url) {
const browser = await puppeteer.launch({
headless: 'new',
args: ['--proxy-server=http://gate.proxyhat.com:8080'],
});
const page = await browser.newPage();
await page.authenticate({ username: 'USERNAME', password: 'PASSWORD' });
const apiResponses = [];
// Intercept XHR/fetch responses
page.on('response', async (response) => {
    // Use a distinct name to avoid shadowing the function's url parameter
    const respUrl = response.url();
    if (respUrl.includes('/api/') || respUrl.includes('/graphql')) {
      try {
        const json = await response.json();
        apiResponses.push({ url: respUrl, data: json });
} catch {
// Not JSON, skip
}
}
});
await page.goto(url, { waitUntil: 'networkidle2' });
await browser.close();
return apiResponses;
}
// Get clean API data instead of scraping DOM
const data = await interceptAPIData('https://example.com/products');
console.log(`Intercepted ${data.length} API calls`);
Headless Browser vs HTTP Comparison
| Metric | Simple HTTP + Proxy | Headless Browser + Proxy |
|---|---|---|
| Speed per page | 0.5-2 seconds | 3-15 seconds |
| Memory per instance | ~50 MB | 200-500 MB |
| CPU usage | Minimal | Significant |
| Bandwidth per page | 50-200 KB | 2-10 MB (with resources) |
| JavaScript rendering | No | Full |
| Anti-bot bypass | Limited | Better (real browser) |
| Concurrent pages | 100+ | 3-10 per machine |
Best Practices
- Always try HTTP first. Check for API endpoints, server-rendered content, or JSON embedded in the HTML before using a headless browser.
- Block unnecessary resources. Images, CSS, and fonts add load time without providing data.
- Use specific selectors for waiting. `networkidle` is safe but slow; wait for the specific element you need.
- Reuse browser instances. Launch once, create new contexts per page.
- Intercept API calls. Many SPAs load data via APIs — intercept the JSON directly.
- Limit concurrency. Headless browsers are memory-intensive. 3-5 concurrent pages per GB of RAM is a good rule.
- Use residential proxies. ProxyHat residential proxies provide the highest trust scores, reducing detection when running headless browsers.
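The concurrency rule of thumb above can be expressed as a small sizing helper. The defaults here (4 pages per GB, a hard cap of 10) are illustrative values taken from this guide's heuristics, not measured constants — tune them for your actual pages:

```python
def max_concurrent_pages(ram_gb: float, pages_per_gb: float = 4, hard_cap: int = 10) -> int:
    """Estimate a safe number of concurrent headless pages from available RAM.

    Applies the rough 3-5 pages per GB rule, capped so a single machine's
    CPU is not saturated, and floored at one page.
    """
    estimate = int(ram_gb * pages_per_gb)
    return max(1, min(estimate, hard_cap))

# max_concurrent_pages(8) -> 10 (hits the hard cap)
# max_concurrent_pages(1) -> 4
```

For higher throughput than a single machine's cap allows, distribute URLs across machines rather than raising the cap.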
For handling CAPTCHAs that headless browsers encounter, see Handling CAPTCHAs When Scraping. For scaling headless browser scraping, read How to Scale Scraping Infrastructure.
Get started with the Python SDK, Node SDK, or Go SDK for proxy integration, and explore ProxyHat for Web Scraping.
Frequently Asked Questions
Do I always need a headless browser for JavaScript sites?
No. Many JavaScript-heavy sites load data from API endpoints. Check the browser's Network tab for XHR/fetch requests — if the data comes from an API, you can call that API directly with simple HTTP requests through a proxy, which is much faster.
Puppeteer or Playwright — which is better for scraping?
Playwright is generally recommended for new projects. It supports multiple browser engines (Chromium, Firefox, WebKit), has better auto-waiting, native async support in Python, and built-in proxy configuration. Puppeteer is more mature and has a larger ecosystem if you are in the Node.js world.
How many headless browser pages can I run concurrently?
Each page consumes 200-500 MB of RAM. On a machine with 8 GB RAM, 3-10 concurrent pages is realistic. Use resource blocking (images, CSS) to reduce memory. For higher concurrency, distribute across multiple machines using a queue-based architecture.
Why use proxies with headless browsers?
Even with a real browser, repeated requests from the same IP get blocked. Proxies rotate your IP so each page load appears to come from a different user. Residential proxies through ProxyHat provide the highest trust scores, minimizing blocks and CAPTCHAs.