Why Headless Browsers Need Proxies
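Headless browsers (browser instances that run without a visible GUI) are essential for scraping JavaScript-heavy websites. However, running many headless browser sessions from a single IP address is an obvious automation signal. Combining headless browsers with proxy rotation solves this by distributing requests across thousands of residential IPs.
This guide covers proxy configuration for Puppeteer and Playwright, stealth plugins, and rotation strategies. For background on how detection works, see our guide to anti-bot detection systems.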
Puppeteer Proxy Setup
Basic Configuration
// Puppeteer: Basic proxy configuration
const puppeteer = require('puppeteer');
const browser = await puppeteer.launch({
headless: 'new',
args: [
'--proxy-server=http://gate.proxyhat.com:8080',
'--no-sandbox',
'--disable-setuid-sandbox'
]
});
const page = await browser.newPage();
// Authenticate with the proxy
await page.authenticate({
username: 'USERNAME',
password: 'PASSWORD'
});
await page.goto('https://example.com', {
waitUntil: 'networkidle2',
timeout: 30000
});
const content = await page.content();
console.log(content.substring(0, 200));
await browser.close();
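Hardcoded `USERNAME` and `PASSWORD` values are fine for a demo, but real credentials usually come from the environment. The following is a minimal sketch, assuming hypothetical variable names `PROXY_SERVER`, `PROXY_USERNAME`, and `PROXY_PASSWORD`; neither Puppeteer nor any proxy provider defines them.

```javascript
// Hypothetical helper: read proxy settings from environment variables.
// The variable names are assumptions chosen for this example.
function proxyConfigFromEnv(env = process.env) {
  const required = ['PROXY_SERVER', 'PROXY_USERNAME', 'PROXY_PASSWORD'];
  const missing = required.filter((name) => !env[name]);
  if (missing.length > 0) {
    throw new Error(`Missing proxy settings: ${missing.join(', ')}`);
  }
  return {
    server: env.PROXY_SERVER,
    username: env.PROXY_USERNAME,
    password: env.PROXY_PASSWORD,
  };
}

// Example with an explicit object instead of process.env
const config = proxyConfigFromEnv({
  PROXY_SERVER: 'http://gate.proxyhat.com:8080',
  PROXY_USERNAME: 'USERNAME',
  PROXY_PASSWORD: 'PASSWORD',
});
console.log(`--proxy-server=${config.server}`);
```

The `server` field would go into the `--proxy-server` launch argument, and the other two fields into `page.authenticate()`.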
Puppeteer with SOCKS5
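// Puppeteer: SOCKS5 proxy configuration
const puppeteer = require('puppeteer');
const browser = await puppeteer.launch({
headless: 'new',
args: [
'--proxy-server=socks5://gate.proxyhat.com:1080'
]
});
const page = await browser.newPage();
// Caveat: Chromium does not support username/password authentication for
// SOCKS proxies, so page.authenticate() only helps with HTTP(S) proxies.
// With SOCKS5, IP allowlisting on the provider side is the usual workaround.
await page.authenticate({
username: 'USERNAME',
password: 'PASSWORD'
});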
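Puppeteer Stealth Plugin
Puppeteer's default configuration exposes dozens of automation markers that anti-bot systems look for. The puppeteer-extra-plugin-stealth plugin patches these markers automatically.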
// Install: npm install puppeteer-extra puppeteer-extra-plugin-stealth
const puppeteer = require('puppeteer-extra');
const StealthPlugin = require('puppeteer-extra-plugin-stealth');
// Apply stealth plugin
puppeteer.use(StealthPlugin());
const browser = await puppeteer.launch({
headless: 'new',
args: [
'--proxy-server=http://gate.proxyhat.com:8080',
'--disable-blink-features=AutomationControlled',
'--window-size=1920,1080',
'--disable-dev-shm-usage'
]
});
const page = await browser.newPage();
await page.authenticate({
username: 'USERNAME',
password: 'PASSWORD'
});
// Set realistic viewport
await page.setViewport({ width: 1920, height: 1080 });
// Set extra headers for consistency
await page.setExtraHTTPHeaders({
'Accept-Language': 'en-US,en;q=0.9'
});
await page.goto('https://example.com');
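What the stealth plugin patches:
- navigator.webdriver: set to undefined instead of true
- chrome.runtime: adds the missing Chrome-specific objects
- WebGL vendor/renderer: realistic GPU strings
- Plugin and permissions arrays: match real Chrome
- Language and platform consistency
To learn more about what these patches address, see our article on browser fingerprinting.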
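To confirm that the patches listed above took effect, the relevant properties can be read inside the page and checked in Node. The sketch below is a rough heuristic, not what any anti-bot vendor actually runs; the snapshot shape and the function names `findAutomationSignals` and `takeSnapshot` are made up for this example.

```javascript
// Heuristic check over a plain snapshot of browser properties.
// The snapshot shape is an assumption made for this example.
function findAutomationSignals(snapshot) {
  const signals = [];
  if (snapshot.webdriver === true) signals.push('navigator.webdriver is true');
  if (!snapshot.hasChromeObject) signals.push('window.chrome is missing');
  if (snapshot.pluginCount === 0) signals.push('navigator.plugins is empty');
  if (!snapshot.languages || snapshot.languages.length === 0) {
    signals.push('navigator.languages is empty');
  }
  return signals;
}

// Meant to run inside the page, e.g. via page.evaluate(takeSnapshot)
function takeSnapshot() {
  return {
    webdriver: navigator.webdriver,
    hasChromeObject: typeof window.chrome !== 'undefined',
    pluginCount: navigator.plugins.length,
    languages: Array.from(navigator.languages || []),
  };
}

// Hand-written snapshots standing in for an unpatched and a patched browser
const unpatched = { webdriver: true, hasChromeObject: false, pluginCount: 0, languages: [] };
const patched = { webdriver: undefined, hasChromeObject: true, pluginCount: 3, languages: ['en-US', 'en'] };
console.log(findAutomationSignals(unpatched));
console.log(findAutomationSignals(patched));
```

In a real run, the snapshot would come from `await page.evaluate(takeSnapshot)` after the stealth plugin has been applied; an empty result only means these few markers look normal.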
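Playwright Proxy Setup
Playwright offers more elegant proxy configuration than Puppeteer, with support for per-context proxies and built-in device emulation.
Basic Configuration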
// Playwright: Basic proxy setup
const { chromium } = require('playwright');
const browser = await chromium.launch({
proxy: {
server: 'http://gate.proxyhat.com:8080',
username: 'USERNAME',
password: 'PASSWORD'
}
});
const context = await browser.newContext();
const page = await context.newPage();
await page.goto('https://example.com');
const title = await page.title();
console.log(`Page title: ${title}`);
await browser.close();
Per-Context Proxies (a Different IP for Each Context)
// Playwright: Different proxy per context
const { chromium } = require('playwright');
const browser = await chromium.launch();
// Context 1: US proxy session
const ctx1 = await browser.newContext({
proxy: {
server: 'http://gate.proxyhat.com:8080',
username: 'USERNAME-country-us-session-abc1',
password: 'PASSWORD'
},
locale: 'en-US',
timezoneId: 'America/New_York'
});
// Context 2: UK proxy session
const ctx2 = await browser.newContext({
proxy: {
server: 'http://gate.proxyhat.com:8080',
username: 'USERNAME-country-gb-session-abc2',
password: 'PASSWORD'
},
locale: 'en-GB',
timezoneId: 'Europe/London'
});
const page1 = await ctx1.newPage();
const page2 = await ctx2.newPage();
// Each page uses a different IP and locale
await page1.goto('https://example.com');
await page2.goto('https://example.com');
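The usernames in the example above encode the country and session as suffixes. A small helper can keep that string building in one place. This is a sketch that assumes the `-country-` and `-session-` suffix format shown in this guide; other providers use different conventions, and `buildProxyUsername` is a name invented here.

```javascript
// Builds a gateway username such as "USERNAME-country-us-session-abc1".
// The suffix format is copied from the examples in this guide and is an
// assumption for any other provider.
function buildProxyUsername(base, { country, session } = {}) {
  const parts = [base];
  if (country) parts.push('country', country.toLowerCase());
  if (session) parts.push('session', session);
  return parts.join('-');
}

console.log(buildProxyUsername('USERNAME', { country: 'US', session: 'abc1' }));
console.log(buildProxyUsername('USERNAME', { session: 'abc2' }));
```

The result would be passed as `proxy.username` in `browser.newContext()`.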
Device Emulation with a Proxy
// Playwright: Realistic device emulation + proxy
const { chromium, devices } = require('playwright');
const browser = await chromium.launch({
proxy: {
server: 'http://gate.proxyhat.com:8080',
username: 'USERNAME',
password: 'PASSWORD'
}
});
// Emulate a specific device with matching settings
const context = await browser.newContext({
...devices['Desktop Chrome'],
locale: 'en-US',
timezoneId: 'America/Chicago',
geolocation: { latitude: 41.8781, longitude: -87.6298 },
permissions: ['geolocation'],
colorScheme: 'light'
});
const page = await context.newPage();
await page.goto('https://example.com');
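Emulation settings are most convincing when the locale and time zone agree with the country of the proxy IP. One way to keep them aligned is a small lookup table. The sketch below is illustrative only: `REGION_PROFILES` and `contextOptionsFor` are hypothetical names, the two profiles reuse the locales and time zones from this guide, and the `-country-` username suffix is assumed from the earlier examples.

```javascript
// Illustrative region profiles; extend with whatever countries you target.
const REGION_PROFILES = {
  us: { locale: 'en-US', timezoneId: 'America/New_York' },
  gb: { locale: 'en-GB', timezoneId: 'Europe/London' },
};

// Returns options for browser.newContext() whose locale, time zone,
// and proxy country all point at the same region.
function contextOptionsFor(country, proxy) {
  const profile = REGION_PROFILES[country];
  if (!profile) {
    throw new Error(`No profile defined for country "${country}"`);
  }
  return {
    ...profile,
    proxy: { ...proxy, username: `${proxy.username}-country-${country}` },
  };
}

const options = contextOptionsFor('gb', {
  server: 'http://gate.proxyhat.com:8080',
  username: 'USERNAME',
  password: 'PASSWORD',
});
console.log(options);
```

Throwing on an unknown country is a deliberate choice here: silently falling back to a default locale would recreate the mismatch the table is meant to prevent.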
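Proxy Rotation with Headless Browsers
Strategy 1: A New Context per Request
Create a new browser context for each URL. This gives every request a fresh IP and clean cookies.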
// Playwright: Rotate proxy per request via new contexts
const { chromium } = require('playwright');
async function scrapeWithRotation(urls) {
const browser = await chromium.launch();
const results = [];
for (const url of urls) {
const sessionId = `sess-${Date.now()}-${Math.random().toString(36).slice(2, 8)}`;
const context = await browser.newContext({
proxy: {
server: 'http://gate.proxyhat.com:8080',
username: `USERNAME-session-${sessionId}`,
password: 'PASSWORD'
}
});
const page = await context.newPage();
try {
await page.goto(url, { waitUntil: 'domcontentloaded', timeout: 30000 });
const data = await page.evaluate(() => document.title);
results.push({ url, data });
} catch (error) {
console.error(`Failed: ${url} — ${error.message}`);
} finally {
await context.close();
}
// Natural delay between requests
await new Promise(r => setTimeout(r, 1000 + Math.random() * 2000));
}
await browser.close();
return results;
}
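The session IDs above combine `Date.now()` with a short `Math.random()` string, which is normally unique enough. If many workers generate IDs at the same moment, Node's built-in `crypto.randomUUID()` is a simple alternative. This is a sketch; the `sess` prefix and the 12-character length are arbitrary choices, and whether a provider accepts IDs of this shape is an assumption.

```javascript
const { randomUUID } = require('node:crypto');

// Returns an ID made of the prefix plus 12 lowercase hex characters.
function newSessionId(prefix = 'sess') {
  return `${prefix}-${randomUUID().replace(/-/g, '').slice(0, 12)}`;
}

console.log(newSessionId());
console.log(newSessionId('conc'));
```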
Strategy 2: Sticky Sessions for Multi-Page Flows
Use sticky sessions when you need to keep the same IP across multiple pages (pagination, login flows).
// Playwright: Sticky session for multi-page scraping
const { chromium } = require('playwright');
async function scrapeWithStickySession(baseUrl, pageCount) {
const sessionId = `sticky-${Date.now()}`;
const browser = await chromium.launch({
proxy: {
server: 'http://gate.proxyhat.com:8080',
username: `USERNAME-session-${sessionId}`,
password: 'PASSWORD'
}
});
const context = await browser.newContext();
const page = await context.newPage();
const results = [];
for (let i = 1; i <= pageCount; i++) {
await page.goto(`${baseUrl}?page=${i}`, { waitUntil: 'networkidle' });
const items = await page.$$eval('.item', els =>
els.map(el => el.textContent.trim())
);
results.push(...items);
// Natural delay between pages
await new Promise(r => setTimeout(r, 1500 + Math.random() * 1500));
}
await browser.close();
return results;
}
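A sticky session does not have to last forever. A common pattern is to keep one session ID for a limited number of page loads or a limited time, then switch to a new one. The class below is a minimal sketch of that bookkeeping; the name `StickySession` and the default limits are assumptions, not values recommended by any provider.

```javascript
// Tracks how long a sticky session ID has been in use and replaces it
// once a use-count or age limit is reached. The clock is injectable
// so the behavior can be tested without waiting.
class StickySession {
  constructor({ maxUses = 20, maxAgeMs = 10 * 60 * 1000, now = Date.now } = {}) {
    this.maxUses = maxUses;
    this.maxAgeMs = maxAgeMs;
    this.now = now;
    this.counter = 0;
    this.rotate();
  }

  // Starts a fresh session ID and resets the counters
  rotate() {
    this.counter += 1;
    this.uses = 0;
    this.createdAt = this.now();
    this.id = `sticky-${this.createdAt}-${this.counter}`;
  }

  // Returns the current session ID, rotating first if a limit was reached
  next() {
    const expired = this.now() - this.createdAt >= this.maxAgeMs;
    if (this.uses >= this.maxUses || expired) this.rotate();
    this.uses += 1;
    return this.id;
  }
}

const session = new StickySession({ maxUses: 2 });
console.log(session.next() === session.next()); // same ID for the first two uses
```

Because the session ID is part of the proxy username, a changed ID only takes effect on a context or browser created after the change.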
Strategy 3: Concurrent Scraping with a Pool
// Playwright: Concurrent scraping with proxy pool
const { chromium } = require('playwright');
async function concurrentScrape(urls, concurrency = 5) {
const browser = await chromium.launch();
const results = [];
// Process URLs in batches
for (let i = 0; i < urls.length; i += concurrency) {
const batch = urls.slice(i, i + concurrency);
const promises = batch.map(async (url) => {
const sessionId = `conc-${Date.now()}-${Math.random().toString(36).slice(2, 8)}`;
const context = await browser.newContext({
proxy: {
server: 'http://gate.proxyhat.com:8080',
username: `USERNAME-session-${sessionId}`,
password: 'PASSWORD'
}
});
const page = await context.newPage();
try {
await page.goto(url, { timeout: 30000, waitUntil: 'domcontentloaded' });
return { url, title: await page.title(), status: 'ok' };
} catch (e) {
return { url, error: e.message, status: 'error' };
} finally {
await context.close();
}
});
const batchResults = await Promise.all(promises);
results.push(...batchResults);
// Delay between batches
await new Promise(r => setTimeout(r, 2000));
}
await browser.close();
return results;
}
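Batching waits for the slowest URL in each batch before starting the next one. An alternative is a sliding window, where a new task starts as soon as any running task finishes. The helper below is a generic sketch of that idea and does not depend on Playwright; `mapWithConcurrency` is a name chosen for this example.

```javascript
// Runs worker(item) for every item with at most `limit` calls in flight.
// Results keep the input order. An error thrown by a worker rejects the
// whole run, so workers should catch their own errors.
async function mapWithConcurrency(items, limit, worker) {
  const results = new Array(items.length);
  let nextIndex = 0;

  // Each runner pulls the next unclaimed index until none are left
  async function runner() {
    while (nextIndex < items.length) {
      const index = nextIndex++;
      results[index] = await worker(items[index], index);
    }
  }

  const runnerCount = Math.min(limit, items.length);
  await Promise.all(Array.from({ length: runnerCount }, () => runner()));
  return results;
}

// Stand-in worker; a real one would open a context, load the URL, and close it
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));
mapWithConcurrency(['a', 'b', 'c', 'd'], 2, async (name) => {
  await sleep(10);
  return name.toUpperCase();
}).then((out) => console.log(out));
```

The batch version above adds a fixed pause between batches; with a sliding window, any per-request delay would need to go inside the worker.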
Playwright with Python (a Pyppeteer Alternative)
For Python developers, Playwright for Python provides the same capabilities with cleaner syntax.
# Python: Playwright with proxy and stealth settings
# pip install playwright
# playwright install chromium
import asyncio

from playwright.async_api import async_playwright


async def scrape_with_proxy():
    async with async_playwright() as p:
        browser = await p.chromium.launch(
            proxy={
                "server": "http://gate.proxyhat.com:8080",
                "username": "USERNAME",
                "password": "PASSWORD"
            }
        )
        context = await browser.new_context(
            viewport={"width": 1920, "height": 1080},
            locale="en-US",
            timezone_id="America/New_York",
            user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
        )
        page = await context.new_page()
        await page.goto("https://example.com")
        title = await page.title()
        print(f"Title: {title}")
        await browser.close()


asyncio.run(scrape_with_proxy())
Resource Optimization
Headless browsers consume significant memory and CPU. To optimize production workloads:
Blocking Resources
// Playwright: Block images, fonts, and CSS to save bandwidth
const context = await browser.newContext({
proxy: {
server: 'http://gate.proxyhat.com:8080',
username: 'USERNAME',
password: 'PASSWORD'
}
});
const page = await context.newPage();
// Block non-essential resources
await page.route('**/*.{png,jpg,jpeg,gif,svg,webp,woff,woff2,ttf,css}', route =>
route.abort()
);
// Block tracking and analytics
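// A regular expression matches these names anywhere in the URL, including hosts such as www.google-analytics.com
await page.route(/google-analytics|gtag|facebook/, route =>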
route.abort()
);
await page.goto('https://example.com');
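File-extension patterns miss images and fonts that are served without an extension. Playwright also exposes each request's resource type through `route.request().resourceType()`, so the decision can be made on that instead. The predicate below is a sketch; the blocked types and the tracker host list are illustrative choices, and `shouldBlock` is a hypothetical name.

```javascript
// Resource types to skip; the set is a judgment call for this sketch.
const BLOCKED_TYPES = new Set(['image', 'media', 'font', 'stylesheet']);
// Example tracker hosts, for illustration only.
const BLOCKED_HOSTS = ['google-analytics.com', 'googletagmanager.com'];

// Decides whether a request should be aborted, based on its type and host.
function shouldBlock(resourceType, url) {
  if (BLOCKED_TYPES.has(resourceType)) return true;
  const host = new URL(url).hostname;
  return BLOCKED_HOSTS.some((blocked) => host === blocked || host.endsWith(`.${blocked}`));
}

// With Playwright this would be wired up roughly as:
// await page.route('**/*', (route) => {
//   const request = route.request();
//   return shouldBlock(request.resourceType(), request.url())
//     ? route.abort()
//     : route.continue();
// });

console.log(shouldBlock('image', 'https://example.com/logo'));
console.log(shouldBlock('script', 'https://example.com/app.js'));
```

Blocking stylesheets saves bandwidth but can change what a page renders, so it is worth checking that the target data still appears.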
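Memory Management
- Close contexts after use: each open context consumes 50-150 MB. Always close a context when you are done with it.
- Limit concurrent contexts: keep 3-10 contexts per browser instance, depending on available memory.
- Restart the browser periodically: after 100-200 context cycles, restart the browser to prevent memory leaks.
- Use headless: 'new' (Puppeteer): the new headless mode uses less memory than the old one.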
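The periodic-restart rule from the list above needs a counter somewhere. A minimal sketch follows, assuming a threshold of 150 cycles picked from the middle of the 100-200 range; `BrowserRecycler` is a name invented for this example.

```javascript
// Counts completed context cycles and reports when the browser is due
// for a restart. The default threshold is an assumption for this sketch.
class BrowserRecycler {
  constructor(maxCycles = 150) {
    this.maxCycles = maxCycles;
    this.cycles = 0;
  }

  // Call once for every context that has been opened and closed
  recordCycle() {
    this.cycles += 1;
  }

  shouldRestart() {
    return this.cycles >= this.maxCycles;
  }

  // Call after the browser has been relaunched
  reset() {
    this.cycles = 0;
  }
}

const recycler = new BrowserRecycler(3);
for (let i = 0; i < 3; i++) recycler.recordCycle();
console.log(recycler.shouldRestart());
```

In a scraping loop, the check would sit right after `context.close()`: when `shouldRestart()` returns true, close the browser, launch a new one, and call `reset()`.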
Handling Common Issues
Proxy Authentication Failures
// Handle proxy auth errors gracefully
try {
const response = await page.goto(url, { timeout: 30000 });
if (response.status() === 407) {
console.error('Proxy authentication failed — check credentials');
}
} catch (error) {
if (error.message.includes('net::ERR_PROXY_CONNECTION_FAILED')) {
console.error('Proxy connection failed — check proxy server availability');
}
}
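Several Chromium network error codes can show up in `error.message` when a proxy misbehaves, and they call for different reactions. The sketch below maps a few of them to coarse categories. The category names and the retry suggestions are assumptions made for illustration, and the list of codes is not exhaustive.

```javascript
// Maps Chromium network error codes found in an error message to a
// coarse category. Categories and retry hints are illustrative.
const ERROR_CATEGORIES = [
  { code: 'net::ERR_PROXY_CONNECTION_FAILED', category: 'proxy-unreachable', retry: true },
  { code: 'net::ERR_TUNNEL_CONNECTION_FAILED', category: 'tunnel-failed', retry: true },
  { code: 'net::ERR_TIMED_OUT', category: 'timeout', retry: true },
  { code: 'net::ERR_INVALID_AUTH_CREDENTIALS', category: 'bad-credentials', retry: false },
];

function classifyNavigationError(message) {
  const match = ERROR_CATEGORIES.find((entry) => message.includes(entry.code));
  return match
    ? { category: match.category, retry: match.retry }
    : { category: 'unknown', retry: false };
}

console.log(classifyNavigationError('net::ERR_PROXY_CONNECTION_FAILED at https://example.com'));
```

Credential errors are marked as not retryable here because repeating the request with the same username and password is unlikely to help.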
Timeout Handling
// Retry with exponential backoff
async function gotoWithRetry(page, url, maxRetries = 3) {
for (let attempt = 1; attempt <= maxRetries; attempt++) {
try {
return await page.goto(url, {
waitUntil: 'domcontentloaded',
timeout: 30000
});
} catch (error) {
if (attempt === maxRetries) throw error;
const delay = 1000 * Math.pow(2, attempt) + Math.random() * 1000;
console.log(`Retry ${attempt}/${maxRetries} after ${delay}ms`);
await new Promise(r => setTimeout(r, delay));
}
}
}
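The delay formula in the retry loop grows without bound as `maxRetries` increases. Pulling it into its own function makes it easy to add a cap and to test. The sketch below keeps the same shape (base delay doubled per attempt, plus up to one base interval of jitter); the 30-second cap and the name `backoffDelay` are assumptions.

```javascript
// Exponential backoff with a cap and additive jitter.
// The random source is injectable so the result can be tested.
function backoffDelay(attempt, { baseMs = 1000, capMs = 30000, random = Math.random } = {}) {
  const exponential = Math.min(capMs, baseMs * 2 ** attempt);
  return exponential + random() * baseMs;
}

for (let attempt = 1; attempt <= 3; attempt++) {
  console.log(`attempt ${attempt}: ${Math.round(backoffDelay(attempt))} ms`);
}
```

With the defaults, the delay before jitter is 2000 ms for attempt 1 and 4000 ms for attempt 2, matching the loop above, and it stops growing once it reaches 30000 ms.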
Best Practices Checklist
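| Practice | Puppeteer | Playwright |
|---|---|---|
| Proxy configuration | Launch arguments only | Per-context proxy support |
| Stealth patches | puppeteer-extra-plugin-stealth | Built-in device emulation |
| Resource blocking | Request interception | page.route() |
| Multi-browser | Chromium only | Chromium, Firefox, WebKit |
| Session isolation | New browser per session | New context per session |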
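For most scraping tasks, Playwright is the recommended choice over Puppeteer because of its superior per-context proxy support, built-in device emulation, and multi-browser support. Pair it with ProxyHat's residential proxies for the best results.
For language-specific proxy setup without a headless browser, see our guides for Python, Node.js, and Go. For a comprehensive anti-detection strategy, see our detection reduction guide. Always follow ethical scraping practices and respect website access policies.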






