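Why can't traditional scrapers handle JavaScript-rendered websites?

Traditional scraping tools (such as Python requests or curl) only fetch the raw HTML returned by the server. Modern websites use frameworks such as React, Vue, and Angular, where JavaScript renders the page content dynamically on the client. The raw HTML may contain nothing but an empty div container and a set of JS files, with no actual data.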

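What methods are there for scraping JS-rendered websites?

There are three main approaches: 1) render the page in a headless browser (Puppeteer/Playwright) and then extract the content, which is the most reliable option but also the most resource-intensive; 2) intercept the browser's network requests, find the API endpoints that return the data, and request those APIs directly, which is the most efficient option; 3) use a prerendering service (such as Prerender.io) to obtain the rendered HTML, which falls between the two.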

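How do I discover the API endpoints a website uses?

Open the Network tab in your browser's developer tools, load the target page, and watch the XHR/Fetch requests. Find the API calls that return JSON data, then analyze the URL pattern, parameters, and authentication method of those requests. You can then request these API endpoints directly through a proxy to get structured data without rendering the whole page, which is far more efficient.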

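How resource-intensive is scraping JS websites with a headless browser and proxies?

Each headless browser instance (Chromium) consumes roughly 100-300 MB of memory, so running 10 instances at once needs 1-3 GB. CPU usage is also significantly higher than for simple HTTP requests. Optimization strategies: limit the number of concurrent instances, disable loading of unnecessary resources (images, fonts), reuse browser contexts, and restart instances periodically to guard against memory leaks.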

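How do I wait for JS rendering to finish?

Avoid fixed waits (such as sleeping for 5 seconds), which waste time and are unreliable. Use smart waiting strategies instead: wait for a specific DOM element to appear (page.waitForSelector), wait for network activity to stop (page.waitForLoadState("networkidle")), or wait for a specific API response to complete. Set a reasonable timeout (typically 30 seconds) to prevent waiting forever.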

How to Scrape JavaScript-Heavy Websites | ProxyHat

The Challenge of JavaScript-Rendered Content

Modern websites increasingly rely on JavaScript to render content. Single-page applications (SPAs) built with React, Vue, or Angular load a minimal HTML shell, then fetch and render data client-side. When you make a simple HTTP request to these sites, you get an empty or incomplete page because the content only exists after JavaScript execution.

Scraping JavaScript-heavy websites often requires a headless browser: a real browser engine running without a visible window that can execute JavaScript, render the DOM, and interact with page elements. Combined with proxies, headless browsers unlock data from even the most dynamic websites.

This guide is part of our Complete Guide to Web Scraping Proxies. For avoiding detection while using headless browsers, see How Anti-Bot Systems Detect Proxies.

When Do You Need a Headless Browser?

Scenario	Simple HTTP	Headless Browser
Static HTML pages	Works perfectly	Overkill
Server-rendered pages with API	Works (hit the API directly)	Not needed
SPA (React, Vue, Angular)	Gets empty shell	Required
Infinite scroll / lazy loading	Cannot trigger	Required
Content behind login + JS	Difficult	Recommended
Pages with anti-bot JS checks	Fails detection	Required

Always check if the site has an API or server-side rendering before reaching for a headless browser. Many "JavaScript-heavy" sites actually have API endpoints that return clean JSON — much faster and cheaper to scrape.
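
One quick check before reaching for a headless browser is to look for JSON that the server already embedded in the raw HTML. The sketch below is a minimal illustration, assuming the data sits in a script tag typed as application/json or application/ld+json (for example, the __NEXT_DATA__ tag that Next.js pages commonly ship). The function name extract_embedded_json is hypothetical, and real pages may embed data in other ways.

```python
import json
from html.parser import HTMLParser


class _JsonScriptCollector(HTMLParser):
    """Collects the text of <script> tags whose type marks them as JSON."""

    JSON_TYPES = {"application/json", "application/ld+json"}

    def __init__(self):
        super().__init__()
        self._in_json_script = False
        self._buffer = []
        self.blobs = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") in self.JSON_TYPES:
            self._in_json_script = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_json_script:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_json_script:
            self.blobs.append("".join(self._buffer))
            self._in_json_script = False


def extract_embedded_json(html: str) -> list:
    """Return every JSON document found in JSON-typed <script> tags."""
    collector = _JsonScriptCollector()
    collector.feed(html)
    collector.close()
    parsed = []
    for blob in collector.blobs:
        try:
            parsed.append(json.loads(blob))
        except json.JSONDecodeError:
            continue  # Tag was typed as JSON but did not parse; skip it
    return parsed
```

If this returns the data you need from a plain HTTP response, the page can likely be scraped without rendering it.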

Puppeteer + Proxies (Node.js)

Puppeteer controls Chrome/Chromium programmatically. It is the most mature headless browser tool for Node.js.

Basic Setup with ProxyHat

const puppeteer = require('puppeteer');
async function scrapeWithPuppeteer(url) {
  const browser = await puppeteer.launch({
    headless: 'new',
    args: [
      '--proxy-server=http://gate.proxyhat.com:8080',
      '--no-sandbox',
      '--disable-setuid-sandbox',
      '--disable-dev-shm-usage',
    ],
  });
  const page = await browser.newPage();
  // Authenticate with proxy
  await page.authenticate({
    username: 'USERNAME',
    password: 'PASSWORD',
  });
  // Set realistic viewport and user agent
  await page.setViewport({ width: 1920, height: 1080 });
  await page.setUserAgent(
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 ' +
    '(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36'
  );
  try {
    await page.goto(url, { waitUntil: 'networkidle2', timeout: 60000 });
    // Wait for specific content to render
    await page.waitForSelector('.product-list', { timeout: 10000 });
    const content = await page.content();
    const data = await page.evaluate(() => {
      return Array.from(document.querySelectorAll('.product-item')).map(el => ({
        name: el.querySelector('.product-name')?.textContent?.trim(),
        price: el.querySelector('.product-price')?.textContent?.trim(),
        url: el.querySelector('a')?.href,
      }));
    });
    return { html: content, data };
  } finally {
    await browser.close();
  }
}
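// Usage (wrapped in an async function: top-level await is unavailable in CommonJS modules)
(async () => {
  const result = await scrapeWithPuppeteer('https://example.com/products');
  console.log(`Found ${result.data.length} products`);
})();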

Optimized Multi-Page Scraping

const puppeteer = require('puppeteer');
class PuppeteerScraper {
  constructor(concurrency = 3) {
    this.concurrency = concurrency;
    this.browser = null;
  }
  async init() {
    this.browser = await puppeteer.launch({
      headless: 'new',
      args: [
        '--proxy-server=http://gate.proxyhat.com:8080',
        '--no-sandbox',
        '--disable-setuid-sandbox',
        '--disable-dev-shm-usage',
        '--disable-gpu',
        '--disable-extensions',
      ],
    });
  }
  async scrapePage(url) {
    const page = await this.browser.newPage();
    await page.authenticate({ username: 'USERNAME', password: 'PASSWORD' });
    await page.setViewport({ width: 1920, height: 1080 });
    // Block unnecessary resources to speed up loading
    await page.setRequestInterception(true);
    page.on('request', (req) => {
      const type = req.resourceType();
      if (['image', 'stylesheet', 'font', 'media'].includes(type)) {
        req.abort();
      } else {
        req.continue();
      }
    });
    try {
      await page.goto(url, { waitUntil: 'networkidle2', timeout: 30000 });
      const content = await page.content();
      return { url, status: 'success', html: content };
    } catch (err) {
      return { url, status: 'error', error: err.message };
    } finally {
      await page.close();
    }
  }
  async scrapeMany(urls) {
    const results = [];
    for (let i = 0; i < urls.length; i += this.concurrency) {
      const batch = urls.slice(i, i + this.concurrency);
      const batchResults = await Promise.all(
        batch.map(url => this.scrapePage(url))
      );
      results.push(...batchResults);
      console.log(`Progress: ${results.length}/${urls.length}`);
    }
    return results;
  }
  async close() {
    if (this.browser) await this.browser.close();
  }
}
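// Usage (wrapped in an async function: top-level await is unavailable in CommonJS modules)
(async () => {
  const urls = ['https://example.com/products']; // placeholder URL list
  const scraper = new PuppeteerScraper(3);
  await scraper.init();
  const results = await scraper.scrapeMany(urls);
  await scraper.close();
})();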

Playwright + Proxies (Python)

Playwright is a newer alternative that supports Chromium, Firefox, and WebKit. Its Python API is clean and well-suited for scraping.

Basic Setup

from playwright.sync_api import sync_playwright
def scrape_with_playwright(url: str) -> dict:
    """Scrape a JavaScript-heavy page using Playwright with ProxyHat proxy."""
    with sync_playwright() as p:
        browser = p.chromium.launch(
            headless=True,
            proxy={
                "server": "http://gate.proxyhat.com:8080",
                "username": "USERNAME",
                "password": "PASSWORD",
            }
        )
        context = browser.new_context(
            viewport={"width": 1920, "height": 1080},
            user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                       "AppleWebKit/537.36 (KHTML, like Gecko) "
                       "Chrome/120.0.0.0 Safari/537.36",
        )
        page = context.new_page()
        try:
            page.goto(url, wait_until="networkidle", timeout=60000)
            # Wait for dynamic content
            page.wait_for_selector(".product-list", timeout=10000)
            # Extract data using page.evaluate
            products = page.evaluate("""() => {
                return Array.from(document.querySelectorAll('.product-item')).map(el => ({
                    name: el.querySelector('.product-name')?.textContent?.trim(),
                    price: el.querySelector('.product-price')?.textContent?.trim(),
                    url: el.querySelector('a')?.href,
                }));
            }""")
            return {"url": url, "products": products, "html": page.content()}
        finally:
            browser.close()

Async Playwright for Parallel Scraping

import asyncio
from playwright.async_api import async_playwright
async def scrape_batch(urls: list[str], concurrency: int = 3) -> list[dict]:
    """Scrape multiple JS-heavy pages in parallel using Playwright."""
    results = []
    async with async_playwright() as p:
        browser = await p.chromium.launch(
            headless=True,
            proxy={
                "server": "http://gate.proxyhat.com:8080",
                "username": "USERNAME",
                "password": "PASSWORD",
            }
        )
        semaphore = asyncio.Semaphore(concurrency)
        async def scrape_one(url: str) -> dict:
            async with semaphore:
                context = await browser.new_context(
                    viewport={"width": 1920, "height": 1080},
                )
                page = await context.new_page()
                # Block heavy resources
                await page.route("**/*.{png,jpg,jpeg,gif,svg,css,woff,woff2}",
                                 lambda route: route.abort())
                try:
                    await page.goto(url, wait_until="networkidle", timeout=30000)
                    html = await page.content()
                    return {"url": url, "status": "success", "html": html}
                except Exception as e:
                    return {"url": url, "status": "error", "error": str(e)}
                finally:
                    await context.close()
        tasks = [scrape_one(url) for url in urls]
        results = await asyncio.gather(*tasks)
        await browser.close()
    return results
# Usage
urls = [f"https://example.com/product/{i}" for i in range(50)]
results = asyncio.run(scrape_batch(urls, concurrency=5))

Go: Using chromedp with Proxies

package main
import (
    "context"
    "fmt"
    "log"
    "time"
    "github.com/chromedp/chromedp"
)
func scrapeJSPage(targetURL string) (string, error) {
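    // Configure proxy. No credentials are supplied here, so this assumes the proxy
    // authorizes the client some other way (for example, by IP allowlisting).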
    opts := append(chromedp.DefaultExecAllocatorOptions[:],
        chromedp.ProxyServer("http://gate.proxyhat.com:8080"),
        chromedp.Flag("headless", true),
        chromedp.Flag("disable-gpu", true),
        chromedp.Flag("no-sandbox", true),
        chromedp.UserAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "+
            "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"),
    )
    allocCtx, cancel := chromedp.NewExecAllocator(context.Background(), opts...)
    defer cancel()
    ctx, cancel := chromedp.NewContext(allocCtx)
    defer cancel()
    ctx, cancel = context.WithTimeout(ctx, 60*time.Second)
    defer cancel()
    var htmlContent string
    err := chromedp.Run(ctx,
        chromedp.Navigate(targetURL),
        chromedp.WaitVisible(".product-list", chromedp.ByQuery),
        chromedp.OuterHTML("html", &htmlContent),
    )
    if err != nil {
        return "", fmt.Errorf("scrape failed: %w", err)
    }
    return htmlContent, nil
}
func main() {
    html, err := scrapeJSPage("https://example.com/products")
    if err != nil {
        log.Fatal(err)
    }
    fmt.Printf("Got %d bytes of rendered HTML\n", len(html))
}

Performance Optimization Strategies

Headless browsers can be 10-50x slower than simple HTTP requests. The following strategies narrow the gap:

1. Block Unnecessary Resources

Images, CSS, fonts, and media files are not needed for data extraction. Blocking them dramatically speeds up page loads:

# Playwright resource blocking
async def fast_scrape(page, url):
    # Block images, CSS, fonts, media
    await page.route("**/*.{png,jpg,jpeg,gif,svg,css,woff,woff2,mp4,webm}",
                     lambda route: route.abort())
    # Also block tracking scripts
    await page.route("**/*google-analytics*", lambda route: route.abort())
    await page.route("**/*facebook*", lambda route: route.abort())
    await page.goto(url, wait_until="domcontentloaded")  # Faster than networkidle
    return await page.content()
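
Extension-based patterns can miss assets served without a file extension or with a query string appended. As an alternative sketch, the handler below decides by the request's resource type instead, mirroring the Puppeteer example earlier in this guide. It assumes Playwright's async API; the names BLOCKED_RESOURCE_TYPES, should_block, and block_heavy_resources are hypothetical.

```python
# Resource types that carry no data for extraction (an assumption; adjust per site).
BLOCKED_RESOURCE_TYPES = {"image", "stylesheet", "font", "media"}


def should_block(resource_type: str) -> bool:
    """Return True for resource types that can be skipped during scraping."""
    return resource_type in BLOCKED_RESOURCE_TYPES


async def block_heavy_resources(route):
    """Playwright route handler: abort heavy resources, let the rest through."""
    if should_block(route.request.resource_type):
        await route.abort()
    else:
        await route.continue_()
```

It would be registered once per page with `await page.route("**/*", block_heavy_resources)` before calling `page.goto`.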

2. Use the Right Wait Strategy

Strategy	Speed	Reliability	Use Case
`domcontentloaded`	Fast	May miss async data	Pages with inline data
`load`	Medium	Good	Most pages
`networkidle`	Slow	Highest	Heavy SPAs, infinite scroll
Specific selector	Variable	Highest	When you know the target element
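
The last row of the table can be sketched as follows: navigate with the fast domcontentloaded strategy, then wait only for the element you actually need. This is a minimal sketch that assumes `page` is a Playwright sync Page; the function name wait_for_content and the 30-second default are illustrative choices.

```python
def wait_for_content(page, url: str, selector: str, timeout_ms: int = 30_000) -> bool:
    """Navigate, then wait for one specific element instead of sleeping.

    Returns True if the selector appeared, False if waiting failed.
    """
    page.goto(url, wait_until="domcontentloaded", timeout=timeout_ms)
    try:
        page.wait_for_selector(selector, timeout=timeout_ms)
        return True
    except Exception:
        # Playwright raises its own TimeoutError here; the broad except only
        # keeps this sketch free of a Playwright import.
        return False
```

A caller might retry on a False result with a different proxy, or fall back to `networkidle` for that URL.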

3. Reuse Browser Instances

Launching a browser takes 1-3 seconds. For batch scraping, launch once and create new pages/contexts for each URL:

from playwright.sync_api import sync_playwright
class BrowserPool:
    """Reusable browser pool for efficient headless scraping."""
    def __init__(self, pool_size: int = 3):
        self.pool_size = pool_size
        self.playwright = None
        self.browsers = []
    def start(self):
        self.playwright = sync_playwright().start()
        for _ in range(self.pool_size):
            browser = self.playwright.chromium.launch(
                headless=True,
                proxy={
                    "server": "http://gate.proxyhat.com:8080",
                    "username": "USERNAME",
                    "password": "PASSWORD",
                }
            )
            self.browsers.append(browser)
    def get_browser(self, index: int):
        return self.browsers[index % self.pool_size]
    def stop(self):
        for browser in self.browsers:
            browser.close()
        self.playwright.stop()
# Usage
urls = ["https://example.com/products"]  # placeholder URL list
pool = BrowserPool(pool_size=3)
pool.start()
try:
    for i, url in enumerate(urls):
        browser = pool.get_browser(i)
        context = browser.new_context()
        page = context.new_page()
        page.goto(url, wait_until="networkidle")
        html = page.content()
        context.close()
finally:
    pool.stop()  # Always release the browsers, even if a page fails

4. Intercept API Calls Instead of Parsing DOM

Many SPAs fetch data from APIs. Intercept those API calls directly — you get clean JSON without parsing HTML:

const puppeteer = require('puppeteer');
async function interceptAPIData(url) {
  const browser = await puppeteer.launch({
    headless: 'new',
    args: ['--proxy-server=http://gate.proxyhat.com:8080'],
  });
  const page = await browser.newPage();
  await page.authenticate({ username: 'USERNAME', password: 'PASSWORD' });
  const apiResponses = [];
  // Intercept XHR/fetch responses
  page.on('response', async (response) => {
    const url = response.url();
    if (url.includes('/api/') || url.includes('/graphql')) {
      try {
        const json = await response.json();
        apiResponses.push({ url, data: json });
      } catch {
        // Not JSON, skip
      }
    }
  });
  await page.goto(url, { waitUntil: 'networkidle2' });
  await browser.close();
  return apiResponses;
}
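// Get clean API data instead of scraping DOM
// (wrapped in an async function: top-level await is unavailable in CommonJS modules)
(async () => {
  const data = await interceptAPIData('https://example.com/products');
  console.log(`Intercepted ${data.length} API calls`);
})();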
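
The same technique can be sketched in Python. This version assumes `page` is a Playwright sync Page and reuses the '/api/' and '/graphql' URL markers from the example above; the names is_api_url and collect_api_json are hypothetical. The handler only records matching responses, and the bodies are parsed after navigation finishes.

```python
API_MARKERS = ("/api/", "/graphql")  # URL fragments that suggest a data endpoint


def is_api_url(url: str) -> bool:
    """Return True if the URL looks like an API call worth capturing."""
    return any(marker in url for marker in API_MARKERS)


def collect_api_json(page, url: str) -> list:
    """Load `url` and return the JSON payloads of matching API responses."""
    matched = []

    def remember(response):
        # Keep the handler cheap: just record responses that look like API calls.
        if is_api_url(response.url):
            matched.append(response)

    page.on("response", remember)
    page.goto(url, wait_until="networkidle")

    payloads = []
    for response in matched:
        try:
            payloads.append({"url": response.url, "data": response.json()})
        except Exception:
            continue  # Body was not JSON or could not be read; skip it
    return payloads
```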

Headless Browser vs HTTP Comparison

Metric	Simple HTTP + Proxy	Headless Browser + Proxy
Speed per page	0.5-2 seconds	3-15 seconds
Memory per instance	~50 MB	200-500 MB
CPU usage	Minimal	Significant
Bandwidth per page	50-200 KB	2-10 MB (with resources)
JavaScript rendering	No	Full
Anti-bot bypass	Limited	Better (real browser)
Concurrent pages	100+	3-10 per machine
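
To see what these ranges mean for a whole job, the sketch below turns per-page figures into a rough duration and bandwidth estimate. The function name estimate_job is hypothetical, and the inputs in the usage note are illustrative values picked from inside the table's ranges, not measurements.

```python
import math


def estimate_job(pages: int, seconds_per_page: float,
                 concurrency: int, mb_per_page: float) -> dict:
    """Rough job estimate: pages are processed in batches of `concurrency`."""
    batches = math.ceil(pages / concurrency)
    return {
        "hours": batches * seconds_per_page / 3600,
        "bandwidth_gb": pages * mb_per_page / 1024,
    }
```

For example, `estimate_job(10_000, 9, 5, 5)` models 10,000 pages in a headless browser at 9 seconds and 5 MB per page with 5 concurrent pages, while `estimate_job(10_000, 1, 100, 0.1)` models the same job over plain HTTP.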

Best Practices

Always try HTTP first. Check for API endpoints, server-rendered content, or JSON embedded in the HTML before using a headless browser.
Block unnecessary resources. Images, CSS, and fonts add load time without providing data.
Use specific selectors for waiting. networkidle is safe but slow. Wait for the specific element you need.
Reuse browser instances. Launch once, create new contexts per page.
Intercept API calls. Many SPAs load data via APIs — intercept the JSON directly.
Limit concurrency. Headless browsers are memory-intensive. 3-5 concurrent pages per GB of RAM is a good rule.
Use residential proxies. ProxyHat residential proxies provide the highest trust scores, reducing detection when running headless browsers.

For handling CAPTCHAs that headless browsers encounter, see Handling CAPTCHAs When Scraping. For scaling headless browser scraping, read How to Scale Scraping Infrastructure.

Get started with the Python SDK, Node SDK, or Go SDK for proxy integration, and explore ProxyHat for Web Scraping.

Frequently Asked Questions

Do I always need a headless browser for JavaScript sites?

No. Many JavaScript-heavy sites load data from API endpoints. Check the browser's Network tab for XHR/fetch requests — if the data comes from an API, you can call that API directly with simple HTTP requests through a proxy, which is much faster.
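
Calling a discovered endpoint directly might look like the sketch below, which uses only the Python standard library. The gateway address comes from the examples in this guide; the credentials are placeholders, and the endpoint and its query parameters in the usage note are hypothetical.

```python
import json
import urllib.parse
import urllib.request

# Gateway host and port as used elsewhere in this guide; credentials are placeholders.
PROXY_URL = "http://USERNAME:PASSWORD@gate.proxyhat.com:8080"


def build_api_url(base: str, **params) -> str:
    """Append query parameters to an endpoint found in the Network tab."""
    return f"{base}?{urllib.parse.urlencode(params)}" if params else base


def fetch_json(url: str, proxy_url: str = PROXY_URL, timeout: int = 30):
    """GET `url` through the proxy and decode the JSON response body."""
    proxy_handler = urllib.request.ProxyHandler(
        {"http": proxy_url, "https": proxy_url}
    )
    opener = urllib.request.build_opener(proxy_handler)
    request = urllib.request.Request(url, headers={"Accept": "application/json"})
    with opener.open(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

A call such as `fetch_json(build_api_url("https://example.com/api/products", page=2, limit=50))` would then return the parsed JSON, provided the endpoint accepts unauthenticated requests. Many APIs also expect the same headers or cookies the browser sent.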

Puppeteer or Playwright — which is better for scraping?

Playwright is generally recommended for new projects. It supports multiple browser engines (Chromium, Firefox, WebKit), has better auto-waiting, native async support in Python, and built-in proxy configuration. Puppeteer is more mature and has a larger ecosystem if you are in the Node.js world.

How many headless browser pages can I run concurrently?

Each page consumes 200-500 MB of RAM. On a machine with 8 GB RAM, 3-10 concurrent pages is realistic. Use resource blocking (images, CSS) to reduce memory. For higher concurrency, distribute across multiple machines using a queue-based architecture.

Why use proxies with headless browsers?

Even with a real browser, repeated requests from the same IP get blocked. Proxies rotate your IP so each page load appears to come from a different user. Residential proxies through ProxyHat provide the highest trust scores, minimizing blocks and CAPTCHAs.
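
The examples in this guide hardcode proxy credentials for brevity. A small sketch of loading them from the environment instead is shown below; the variable names PROXY_SERVER, PROXY_USERNAME, and PROXY_PASSWORD and the function name are assumptions, and the returned dict has the shape of the `proxy` argument used in the Playwright examples above.

```python
import os


def proxy_settings_from_env(environ=None) -> dict:
    """Build a Playwright-style proxy dict from environment variables."""
    env = os.environ if environ is None else environ
    settings = {"server": env.get("PROXY_SERVER", "http://gate.proxyhat.com:8080")}
    username = env.get("PROXY_USERNAME")
    password = env.get("PROXY_PASSWORD")
    # Only include credentials when both are present.
    if username and password:
        settings["username"] = username
        settings["password"] = password
    return settings
```

The result can be passed as `p.chromium.launch(headless=True, proxy=proxy_settings_from_env())`.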
