Why Cheerio + axios Is the Lightweight Scraping Stack
If you're scraping static HTML, you don't need a headless browser chewing through hundreds of megabytes of RAM per tab. Cheerio parses server-rendered HTML in milliseconds. axios handles the HTTP layer. Together they give you a scraping pipeline that's fast, memory-efficient, and easy to debug.
The catch? Any serious scraping volume will hit rate limits, geo-blocks, or IP bans. That's where proxy rotation comes in — and this guide shows you how to integrate it properly, not as a hack, but as a first-class part of your HTTP stack.
We'll cover everything from your first Cheerio script to a production pipeline that scrapes 10,000 URLs with rotating residential proxies, concurrency control, and circuit-breaking error handling.
Setting Up: axios + Cheerio for Server-Side HTML Parsing
Let's start with the basics. Install the dependencies:
```bash
npm install axios cheerio
```
Here's a minimal scraper that fetches a page and extracts data:
```js
const axios = require('axios');
const cheerio = require('cheerio');

async function scrapeProduct(url) {
  const { data: html } = await axios.get(url, {
    headers: {
      'User-Agent':
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
      'Accept': 'text/html,application/xhtml+xml',
    },
    timeout: 15_000,
  });

  const $ = cheerio.load(html);
  return {
    title: $('h1.product-title').text().trim(),
    price: $('span.price-current').text().trim(),
    availability: $('span.stock-status').text().trim(),
    breadcrumbs: $('.breadcrumb a')
      .map((_, el) => $(el).text().trim())
      .get(),
  };
}

scrapeProduct('https://example-shop.com/products/widget-123')
  .then(console.log)
  .catch(console.error);
```
Cheerio's jQuery-like API makes element selection intuitive, and `.map()` + `.get()` converts a Cheerio collection into a plain array. No browser, no layout engine, no rendering overhead.
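The same API covers attributes and nested traversal. A quick sketch against inline sample markup (the selectors here are placeholders, not from any real site):

```js
const cheerio = require('cheerio');

// Inline sample markup; swap the selectors for your target's structure
const $ = cheerio.load(`
  <div class="related-products">
    <a href="/products/a">Widget A</a>
    <a href="/products/b">Widget B</a>
  </div>
  <table class="specs">
    <tr><th>Weight</th><td>1.2 kg</td></tr>
    <tr><th>Color</th><td>Blue</td></tr>
  </table>
`);

// .attr() reads attributes; .map().get() flattens to a plain array
const related = $('.related-products a')
  .map((_, el) => ({ name: $(el).text().trim(), href: $(el).attr('href') }))
  .get();

// .find() scopes a query to each row; .each() is plain iteration
const specs = {};
$('table.specs tr').each((_, row) => {
  specs[$(row).find('th').text().trim()] = $(row).find('td').text().trim();
});

console.log(related); // [{ name: 'Widget A', href: '/products/a' }, ...]
console.log(specs);   // { Weight: '1.2 kg', Color: 'Blue' }
```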
Adding Proxy Support to axios
axios supports proxies natively via its `proxy` config option. For HTTP proxies:
```js
const axios = require('axios');

const client = axios.create({
  proxy: {
    host: 'gate.proxyhat.com',
    port: 8080,
    auth: {
      username: 'user-country-US',
      password: 'PASSWORD',
    },
  },
});

// Every request now routes through the residential proxy
// (run inside an async function):
const { data } = await client.get('https://example.com');
```
For HTTPS targets through an HTTP proxy, or when you need SOCKS5 support, use `https-proxy-agent` or `socks-proxy-agent` (`npm install https-proxy-agent socks-proxy-agent`):
```js
const { HttpsProxyAgent } = require('https-proxy-agent');
const axios = require('axios');

const agent = new HttpsProxyAgent(
  'http://user-country-US:PASSWORD@gate.proxyhat.com:8080'
);

const client = axios.create({
  httpsAgent: agent,
  proxy: false, // agent handles it; also stops axios picking up env proxy vars
});

const { data } = await client.get('https://example.com');
```
Use the native `proxy` config for simple HTTP proxying. Switch to `https-proxy-agent` when you need CONNECT tunneling or custom TLS settings, and to `socks-proxy-agent` for SOCKS5.
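If your provider exposes a SOCKS5 endpoint, the pattern is the same; only the agent class and URL scheme change. A minimal sketch, assuming a SOCKS5 gateway on port 1080:

```js
const { SocksProxyAgent } = require('socks-proxy-agent');
const axios = require('axios');

// socks5:// scheme; host and port are whatever your provider assigns
const agent = new SocksProxyAgent(
  'socks5://user-country-US:PASSWORD@gate.proxyhat.com:1080'
);

const client = axios.create({
  httpAgent: agent,  // SOCKS tunnels both plain HTTP...
  httpsAgent: agent, // ...and TLS traffic
  proxy: false,      // keep axios's own proxy logic out of the way
});
```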
When Cheerio Is Enough — And When It Isn't
Not every site requires a headless browser. Here's how to decide:
| Signal | Cheerio Works | Need Headless (Puppeteer/Playwright) |
|---|---|---|
| View page source has the data | ✅ | — |
| Data only appears after JS execution | — | ✅ |
| Site uses SSR (Next.js, Rails, Django) | ✅ | — |
| Site is a SPA (React, Vue client-render) | — | ✅ |
| Interaction needed (click, scroll, login) | — | ✅ |
| Anti-bot JS challenges (Cloudflare JS) | — | ✅ (or use residential proxies) |
| Need to scrape 10k+ pages fast | ✅ | — |
Quick test: `curl https://target.com | grep 'your-selector'`. If the data is there, Cheerio is enough. If it's not in the initial HTML response, you need a browser — or you need to find the underlying API endpoint the JS calls (often the better approach).
Many SPAs load data from a JSON API. Check the Network tab for XHR requests — you can often scrape the API directly with axios, skipping both Cheerio and Puppeteer.
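For instance, if the Network tab shows the page hydrating from a JSON endpoint, you can skip HTML parsing entirely. A sketch with a hypothetical endpoint (the `/api/products` path and response shape are invented; use whatever the Network tab actually shows):

```js
const axios = require('axios');

async function fetchFromApi(page = 1) {
  // Hypothetical endpoint discovered in the browser's Network tab
  const { data } = await axios.get('https://example-shop.com/api/products', {
    params: { page, per_page: 50 },
    headers: {
      // Some APIs check these; mirror what the browser sends
      'Accept': 'application/json',
      'X-Requested-With': 'XMLHttpRequest',
    },
  });
  return data.products; // shape depends on the actual API response
}
```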
Building a Rotating Proxy Interceptor
Hardcoding a single proxy isn't scalable. You need rotation — a different IP for each request, or sticky sessions when you need consistency. The cleanest way to implement this in axios is as a request interceptor that swaps proxy credentials per request.
Interceptor Design
The interceptor generates a unique session identifier per request. For per-request rotation (different IP every time), use a random session ID. For sticky sessions (same IP for a batch), reuse the session ID across requests. The code below pairs `https-proxy-agent` (HTTPS targets) with `http-proxy-agent` (plain HTTP targets); install both.
```js
const axios = require('axios');
const crypto = require('crypto');
const { HttpsProxyAgent } = require('https-proxy-agent');
const { HttpProxyAgent } = require('http-proxy-agent');

const PROXY_HOST = 'gate.proxyhat.com';
const PROXY_PORT = 8080;
const PROXY_USER = 'YOUR_USERNAME';
const PROXY_PASS = 'YOUR_PASSWORD';

/**
 * Create an axios instance with rotating residential proxy support.
 *
 * @param {object} options
 * @param {string} [options.country] - Geo-target country code (e.g., 'US')
 * @param {string} [options.sessionId] - Sticky session ID (omit for per-request rotation)
 * @param {number} [options.retries=3] - Max retries on proxy errors
 */
function createScraperClient({ country, sessionId, retries = 3 } = {}) {
  const client = axios.create({
    timeout: 20_000,
    proxy: false, // the agents handle proxying; don't let axios pick up env proxies
    headers: {
      'User-Agent':
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
      'Accept': 'text/html,application/xhtml+xml',
    },
  });

  // Request interceptor: inject rotating proxy agent
  client.interceptors.request.use((config) => {
    // Per-request rotation: random session = new IP
    // Sticky session: reuse sessionId = same IP
    const session = sessionId || crypto.randomUUID();

    let username = `${PROXY_USER}-session-${session}`;
    if (country) username += `-country-${country}`;

    const proxyUrl = `http://${username}:${PROXY_PASS}@${PROXY_HOST}:${PROXY_PORT}`;
    config.httpsAgent = new HttpsProxyAgent(proxyUrl); // CONNECT tunnel for HTTPS targets
    config.httpAgent = new HttpProxyAgent(proxyUrl);   // plain forwarding for HTTP targets

    // Attach metadata for logging
    config.metadata = { sessionId: session, attempt: config.metadata?.attempt ?? 1 };
    return config;
  });

  // Response interceptor: retry on proxy/auth errors
  client.interceptors.response.use(
    (response) => response,
    async (error) => {
      const config = error.config;
      const status = error.response?.status;
      const attempt = config.metadata?.attempt ?? 1;

      // Retry on 403 (blocked), 429 (rate-limited), or network errors
      const retryable = !status || [403, 429, 502, 503, 504].includes(status);

      if (retryable && attempt < retries) {
        config.metadata = { ...config.metadata, attempt: attempt + 1 };

        // Drop the old agents so the request interceptor builds fresh ones.
        // With per-request rotation, that means a new session (new IP);
        // with a fixed sessionId, the sticky IP is reused on retry.
        delete config.httpsAgent;
        delete config.httpAgent;

        const delay = Math.min(1000 * Math.pow(2, attempt - 1), 10_000);
        await new Promise((r) => setTimeout(r, delay));
        return client(config);
      }

      return Promise.reject(error);
    }
  );

  return client;
}

module.exports = { createScraperClient };
```
Usage is simple:
```js
const { createScraperClient } = require('./scraper-client');

// Per-request rotation (different IP each request)
const rotatingClient = createScraperClient({ country: 'US' });

// Sticky session (same IP for a batch of requests)
const stickyClient = createScraperClient({
  country: 'DE',
  sessionId: 'batch-42',
});

// Each request gets a fresh residential IP (inside an async function):
const page1 = await rotatingClient.get('https://example.com/page/1');
const page2 = await rotatingClient.get('https://example.com/page/2');
```
Concurrent Scraping with p-limit
Scraping 10,000 URLs sequentially would take hours, but unbounded concurrency gets you rate-limited or IP-banned. `p-limit` gives you controlled concurrency without the overhead of a full job queue. (Note: `p-limit` v4 and later are ESM-only; pin v3 if your project uses `require`.)
```bash
npm install p-limit
```
```js
const pLimit = require('p-limit');
const { createScraperClient } = require('./scraper-client');
const cheerio = require('cheerio');

const CONCURRENCY = 15; // 15 concurrent requests with rotating proxies
const limit = pLimit(CONCURRENCY);

async function scrapeProductPage(client, url) {
  try {
    const { data: html } = await client.get(url);
    const $ = cheerio.load(html);
    return {
      url,
      title: $('h1.product-title').text().trim(),
      price: $('span.price-current').text().trim(),
      inStock: $('.stock-status').text().includes('In Stock'),
    };
  } catch (err) {
    const status = err.response?.status;
    console.error(`[${status ?? 'ERR'}] Failed: ${url}`);
    return { url, error: status ?? err.code };
  }
}

async function scrapeProductCatalog(urls) {
  const client = createScraperClient({ country: 'US' });

  const tasks = urls.map((url) =>
    limit(() => scrapeProductPage(client, url))
  );

  const results = await Promise.all(tasks);

  const succeeded = results.filter((r) => !r.error);
  const failed = results.filter((r) => r.error);
  console.log(`Scraped ${succeeded.length}/${urls.length} pages`);
  console.log(`Failed: ${failed.length}`);

  return { succeeded, failed };
}

module.exports = { scrapeProductCatalog };
```
Concurrency tuning: With residential proxies and per-request rotation, 10–20 concurrent requests is a safe starting point. Datacenter proxies can handle 50+, but you'll hit blocks faster. Monitor your success rate and adjust.
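To make "monitor and adjust" concrete, here's a small rolling success-rate tracker you can feed from your result handler. The window size and delay thresholds are illustrative, not tuned values:

```js
// Rolling success-rate tracker: call record() after every request,
// and back off when the rate dips (thresholds are illustrative).
class SuccessMonitor {
  constructor(windowSize = 100) {
    this.windowSize = windowSize;
    this.outcomes = [];
  }

  record(ok) {
    this.outcomes.push(ok);
    if (this.outcomes.length > this.windowSize) this.outcomes.shift();
  }

  rate() {
    if (this.outcomes.length === 0) return 1;
    return this.outcomes.filter(Boolean).length / this.outcomes.length;
  }

  // Suggested inter-request delay: none while healthy, growing as the rate drops
  suggestedDelayMs() {
    const r = this.rate();
    if (r > 0.9) return 0;
    if (r > 0.75) return 500;
    return 2000;
  }
}

module.exports = { SuccessMonitor };
```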
Full Example: 10k-URL E-Commerce Scrape with Circuit Breaking
Let's put it all together. This script scrapes a static e-commerce catalog across 10,000 product URLs with proxy rotation, concurrency control, and circuit-breaking error handling.
Circuit Breaker
A circuit breaker stops all requests when the error rate exceeds a threshold — preventing wasted proxy bandwidth when the target site is blocking you globally.
```js
class CircuitBreaker {
  constructor({ threshold = 0.5, window = 50, cooldown = 30_000 } = {}) {
    this.threshold = threshold; // 50% failure rate trips the breaker
    this.window = window;       // Check last N requests
    this.cooldown = cooldown;   // 30s pause before retrying
    this.results = [];          // true = success, false = failure
    this.tripped = false;
    this.trippedAt = null;
  }

  record(success) {
    this.results.push(success);
    if (this.results.length > this.window) this.results.shift();
    this._check();
  }

  _check() {
    if (this.results.length < 10) return; // Need minimum data
    const failRate = this.results.filter((r) => !r).length / this.results.length;
    if (failRate >= this.threshold) {
      this.tripped = true;
      this.trippedAt = Date.now();
    }
  }

  async guard() {
    if (!this.tripped) return;

    const elapsed = Date.now() - this.trippedAt;
    if (elapsed < this.cooldown) {
      throw new Error(
        `Circuit breaker tripped. Retry after ${Math.ceil(
          (this.cooldown - elapsed) / 1000
        )}s`
      );
    }

    // Cooldown elapsed: reset and let traffic through again
    // (a simplified half-open state; the next burst of failures re-trips it)
    this.tripped = false;
    this.results = [];
  }
}

module.exports = { CircuitBreaker };
```
Main Pipeline
```js
const pLimit = require('p-limit');
const cheerio = require('cheerio');
const fs = require('fs');
const { createScraperClient } = require('./scraper-client');
const { CircuitBreaker } = require('./circuit-breaker');

const CONCURRENCY = 15;
const MAX_RETRIES = 3;
const OUTPUT_FILE = 'products.jsonl';

async function scrapeProduct(client, breaker, url, attempt = 1) {
  // If the circuit is open, skip this URL instead of letting the throw
  // escape and reject the whole Promise.all batch
  try {
    await breaker.guard();
  } catch (err) {
    return { url, error: err.message };
  }

  try {
    // axios rejects on 4xx/5xx by default, so HTTP errors land in the catch block
    const { data: html } = await client.get(url);

    const $ = cheerio.load(html);
    const product = {
      url,
      title: $('h1.product-title').text().trim(),
      price: $('span.price-current').text().trim(),
      availability: $('span.stock-status').text().trim(),
      breadcrumbs: $('.breadcrumb a').map((_, el) => $(el).text().trim()).get(),
    };

    breaker.record(true);
    return product;
  } catch (err) {
    breaker.record(false);

    if (attempt < MAX_RETRIES) {
      // Note: the client's interceptor already retries transient errors,
      // so these attempts compound with those retries
      const delay = 1000 * Math.pow(2, attempt - 1);
      await new Promise((r) => setTimeout(r, delay));
      return scrapeProduct(client, breaker, url, attempt + 1);
    }
    return { url, error: err.message };
  }
}

async function main() {
  // Load URL list (one URL per line)
  const urls = fs.readFileSync('urls.txt', 'utf-8')
    .split('\n')
    .map((u) => u.trim())
    .filter(Boolean);

  console.log(`Loaded ${urls.length} URLs`);

  const client = createScraperClient({
    country: 'US',
    retries: MAX_RETRIES,
  });

  const breaker = new CircuitBreaker({
    threshold: 0.5,
    window: 50,
    cooldown: 30_000,
  });

  const limit = pLimit(CONCURRENCY);
  const stream = fs.createWriteStream(OUTPUT_FILE, { flags: 'a' });

  const tasks = urls.map((url) =>
    limit(async () => {
      const result = await scrapeProduct(client, breaker, url);
      if (result && !result.error) {
        stream.write(JSON.stringify(result) + '\n');
      }
      return result;
    })
  );

  const results = await Promise.all(tasks);
  stream.end();

  const succeeded = results.filter((r) => r && !r.error);
  const failed = results.filter((r) => r && r.error);

  console.log(`✅ Succeeded: ${succeeded.length}`);
  console.log(`❌ Failed: ${failed.length}`);
  console.log(`📊 Success rate: ${((succeeded.length / urls.length) * 100).toFixed(1)}%`);

  if (failed.length > 0) {
    fs.writeFileSync('failed-urls.txt',
      failed.map((r) => r.url).join('\n'));
    console.log('Failed URLs saved to failed-urls.txt for retry');
  }
}

main().catch(console.error);
```
Why JSONL? Writing each result as a single line means you don't lose data if the process crashes mid-run. You can resume from the last successful entry.
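Resuming is then just a diff between output and input. A sketch, assuming the `urls.txt` and `products.jsonl` files from the pipeline above:

```js
const fs = require('fs');

// Collect URLs already scraped from the JSONL output...
const done = new Set(
  fs.existsSync('products.jsonl')
    ? fs.readFileSync('products.jsonl', 'utf-8')
        .split('\n')
        .filter(Boolean)
        .map((line) => JSON.parse(line).url)
    : []
);

// ...and only queue the remainder
const remaining = fs.readFileSync('urls.txt', 'utf-8')
  .split('\n')
  .map((u) => u.trim())
  .filter((u) => u && !done.has(u));

console.log(`${remaining.length} URLs left to scrape`);
```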
Error Handling Strategy: 403, 429, and Beyond
Different HTTP errors require different responses:
- 403 Forbidden: Your IP is blocked. Rotate to a new proxy session immediately. If the block persists across sessions, the site may be fingerprinting your request headers — diversify your User-Agent and Accept-Language headers.
- 429 Too Many Requests: You're hitting a rate limit. Add exponential backoff. Reduce concurrency. Consider sticky sessions so you spread requests across fewer IPs (some sites rate-limit per IP, not globally).
- 502/503/504: Transient server errors. Retry with backoff. These are usually temporary.
- CAPTCHAs (detected by response content): Check whether the response HTML contains CAPTCHA indicators. If so, rotate IP and retry. For persistent CAPTCHAs, consider residential proxies, which present as real user traffic.
- Timeouts: Increase timeout or retry. If consistent, the site may be intentionally slow-loading for bots.
Pro tip: Detect blocks by content, not just status codes. Some sites return 200 with a CAPTCHA page. Always validate that your selectors find the expected elements.
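Here's one way to implement that validation. The indicator strings and the `h1.product-title` check reuse this guide's example selectors; a real target will need its own markers:

```js
const cheerio = require('cheerio');

// Heuristic block detection: the status code alone isn't enough.
// Indicator strings are examples; inspect a real blocked response for yours.
function looksBlocked(html) {
  const $ = cheerio.load(html);
  const text = $('body').text().toLowerCase();

  const captchaMarkers = ['captcha', 'verify you are human', 'access denied'];
  if (captchaMarkers.some((m) => text.includes(m))) return true;

  // A 200 response that's missing the element we scrape is also suspect
  if ($('h1.product-title').length === 0) return true;

  return false;
}

module.exports = { looksBlocked };
```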
Scaling Beyond a Single Process
For very large scraping jobs (100k+ URLs), a single Node.js process becomes a bottleneck. Here's how to scale:
Containerization
Package your scraper as a Docker container and run multiple instances, each with a different geo-target or URL partition:
```yaml
# docker-compose.yml
services:
  scraper-us:
    build: .
    environment:
      - COUNTRY=US
      - URL_FILE=urls-us.txt
  scraper-de:
    build: .
    environment:
      - COUNTRY=DE
      - URL_FILE=urls-de.txt
```
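Inside each container, the entry point reads those variables and feeds them into the pipeline from earlier. A minimal sketch, assuming the same `createScraperClient` module:

```js
const fs = require('fs');
const { createScraperClient } = require('./scraper-client');

// Per-instance config comes from the compose file's environment block
const COUNTRY = process.env.COUNTRY || 'US';
const URL_FILE = process.env.URL_FILE || 'urls.txt';

const urls = fs.readFileSync(URL_FILE, 'utf-8').split('\n').filter(Boolean);
const client = createScraperClient({ country: COUNTRY });

console.log(`Instance geo=${COUNTRY}, ${urls.length} URLs from ${URL_FILE}`);
// ...hand `client` and `urls` to the scraping pipeline from the previous section
```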
Job Queue Pattern
For distributed scraping, use Redis or BullMQ as a job queue. Each worker pops URLs from the queue, scrapes them, and pushes results. This gives you horizontal scaling, automatic retries, and dead-letter queues for failed URLs.
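As a sketch of that pattern with BullMQ (the queue name, Redis connection, and selector are placeholders):

```js
const { Worker } = require('bullmq');
const cheerio = require('cheerio');
const { createScraperClient } = require('./scraper-client');

const client = createScraperClient({ country: 'US' });

// Each job carries one URL; BullMQ handles retries and keeps failed jobs for inspection
const worker = new Worker(
  'scrape-jobs',
  async (job) => {
    const { data: html } = await client.get(job.data.url);
    const $ = cheerio.load(html);
    return { url: job.data.url, title: $('h1.product-title').text().trim() };
  },
  {
    connection: { host: 'localhost', port: 6379 }, // placeholder Redis
    concurrency: 10,
  }
);

worker.on('failed', (job, err) => {
  console.error(`Job ${job?.id} failed: ${err.message}`);
});
```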
Headless Fleet
When you need Puppeteer at scale, run a fleet of headless browsers behind a load balancer. Tools like browserless or playwright-cluster manage browser lifecycle for you. Route requests through residential proxies to avoid detection.
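With a self-hosted browserless instance, for example, Puppeteer connects over WebSocket rather than launching Chromium itself. A sketch, with a placeholder endpoint:

```js
const puppeteer = require('puppeteer');

async function renderPage(url) {
  // Connect to a remote browserless/Chrome instance instead of launching locally
  const browser = await puppeteer.connect({
    browserWSEndpoint: 'ws://localhost:3000', // placeholder browserless endpoint
  });

  try {
    const page = await browser.newPage();
    await page.goto(url, { waitUntil: 'networkidle2' });
    return await page.content(); // rendered HTML, ready for Cheerio if needed
  } finally {
    await browser.disconnect(); // leave the remote browser running for other workers
  }
}
```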
Key Takeaways
- Cheerio + axios is the right stack for server-rendered HTML — 10–50× faster and more memory-efficient than headless browsers.
- Check the page source first. If the data is in the initial HTML, you don't need Puppeteer.
- Proxy rotation belongs in your HTTP layer, not scattered across your codebase. An axios interceptor is the idiomatic place for it.
- Per-request rotation (random session IDs) gives you a different IP every time. Sticky sessions keep the same IP for a batch — choose based on your target site's behavior.
- Control concurrency with p-limit. Start at 10–15 concurrent requests with residential proxies and adjust based on success rates.
- Circuit breakers prevent cascading failures. When your error rate spikes, pause — don't burn through proxy bandwidth hitting a wall.
- Write results as JSONL so you don't lose progress on crashes. Separate failed URLs for retry runs.
- Detect blocks by content, not just status codes. A 200 response with a CAPTCHA page is still a block.
Frequently Asked Questions
Does Cheerio execute JavaScript?
No. Cheerio parses static HTML only. If the data you need is rendered by client-side JavaScript, you need a headless browser like Puppeteer or Playwright — or you can find the underlying API endpoint the JS calls and request that directly with axios.
How do I use SOCKS5 proxies with axios?
Use the `socks-proxy-agent` package instead of `https-proxy-agent`. Construct the agent with `socks5://USERNAME:PASSWORD@gate.proxyhat.com:1080` and assign it to `config.httpsAgent` and `config.httpAgent` on your axios instance.
What's the difference between per-request rotation and sticky sessions?
Per-request rotation assigns a random session ID to each request, giving you a different residential IP every time. Sticky sessions reuse the same session ID across multiple requests, keeping the same IP for a configurable duration (typically up to 30 minutes). Use sticky sessions when the target site requires session continuity (e.g., paginated browsing, logged-in states).
How many concurrent requests can I run with residential proxies?
With per-request IP rotation, 10–20 concurrent requests is a safe starting point. The proxy pool handles the IP diversity; your concurrency limit is about not overwhelming the target site. Monitor your success rate — if it drops below 90%, reduce concurrency or add delays between requests.
Should I randomize User-Agent headers when scraping?
Yes. Sites often block requests with identical headers across different IPs. Use a library like `user-agents` or maintain a pool of real browser User-Agent strings and rotate them per request. Pair this with matching `Accept`, `Accept-Language`, and `Accept-Encoding` headers for consistency.
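As a sketch, header rotation fits the same interceptor pattern used for proxies above (the two UA strings are a toy pool; keep a larger, current one in practice):

```js
const axios = require('axios');

// Tiny sample pool; in practice, use many current, real browser UAs
const UA_POOL = [
  'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36',
  'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.4 Safari/605.1.15',
];

const client = axios.create();
client.interceptors.request.use((config) => {
  // Pick a random UA per request, with consistent companion headers
  config.headers['User-Agent'] = UA_POOL[Math.floor(Math.random() * UA_POOL.length)];
  config.headers['Accept-Language'] = 'en-US,en;q=0.9';
  return config;
});
```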