Node.js Scraping with Cheerio & Proxies: A Code-First Guide

Build production-grade scrapers with Cheerio and axios. Covers proxy rotation as an interceptor, concurrency with p-limit, error handling for 403/429, and a full 10k-URL scraping pipeline.

Why Cheerio + axios Is the Lightweight Scraping Stack

If you're scraping static HTML, you don't need a headless browser chewing through hundreds of megabytes of RAM per tab. Cheerio parses server-rendered HTML in milliseconds. axios handles the HTTP layer. Together they give you a scraping pipeline that's fast, memory-efficient, and easy to debug.

The catch? Any serious scraping volume will hit rate limits, geo-blocks, or IP bans. That's where proxy rotation comes in — and this guide shows you how to integrate it properly, not as a hack, but as a first-class part of your HTTP stack.

We'll cover everything from your first Cheerio script to a production pipeline that scrapes 10,000 URLs with rotating residential proxies, concurrency control, and circuit-breaking error handling.

Setting Up: axios + Cheerio for Server-Side HTML Parsing

Let's start with the basics. Install the dependencies:

npm install axios cheerio

Here's a minimal scraper that fetches a page and extracts data:

const axios = require('axios');
const cheerio = require('cheerio');

async function scrapeProduct(url) {
  const { data: html } = await axios.get(url, {
    headers: {
      'User-Agent':
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
      'Accept': 'text/html,application/xhtml+xml',
    },
    timeout: 15_000,
  });

  const $ = cheerio.load(html);

  return {
    title: $('h1.product-title').text().trim(),
    price: $('span.price-current').text().trim(),
    availability: $('span.stock-status').text().trim(),
    breadcrumbs: $('.breadcrumb a')
      .map((_, el) => $(el).text().trim())
      .get(),
  };
}

scrapeProduct('https://example-shop.com/products/widget-123')
  .then(console.log)
  .catch(console.error);

Cheerio's jQuery-like API makes element selection intuitive. .map() + .get() converts a Cheerio collection into a plain array. No browser, no DOM, no rendering overhead.

Adding Proxy Support to axios

axios supports proxies natively via its proxy config option. For HTTP proxies:

const axios = require('axios');

const client = axios.create({
  proxy: {
    host: 'gate.proxyhat.com',
    port: 8080,
    auth: {
      username: 'user-country-US',
      password: 'PASSWORD',
    },
  },
});

// Every request now routes through the residential proxy
const { data } = await client.get('https://example.com');

For HTTPS targets through an HTTP proxy, or when you need SOCKS5 support, use https-proxy-agent or socks-proxy-agent:

const { HttpsProxyAgent } = require('https-proxy-agent');
const axios = require('axios');

const agent = new HttpsProxyAgent(
  'http://user-country-US:PASSWORD@gate.proxyhat.com:8080'
);

const client = axios.create({
  httpsAgent: agent,
  proxy: false, // Disable axios's built-in proxy handling — the agent does the tunneling
});

const { data } = await client.get('https://example.com');

Use the native proxy config for simple HTTP proxying. Switch to https-proxy-agent when you need CONNECT tunneling, SOCKS5, or custom TLS settings.
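For SOCKS5, the setup is nearly identical — here's a minimal sketch using socks-proxy-agent (the socks5:// URL and port 1080 are placeholders; substitute your provider's endpoint):

const { SocksProxyAgent } = require('socks-proxy-agent');
const axios = require('axios');

// Placeholder credentials and gateway — use your provider's SOCKS5 endpoint
const agent = new SocksProxyAgent(
  'socks5://user-country-US:PASSWORD@gate.proxyhat.com:1080'
);

const client = axios.create({
  httpAgent: agent,
  httpsAgent: agent,
  proxy: false, // The agent handles all proxying
});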

When Cheerio Is Enough — And When It Isn't

Not every site requires a headless browser. Here's how to decide:

Signal                                     | Cheerio works | Need headless (Puppeteer/Playwright)
View page source has the data              | ✅            |
Data only appears after JS execution       |               | ✅
Site uses SSR (Next.js, Rails, Django)     | ✅            |
Site is a SPA (React, Vue client-render)   |               | ✅
Interaction needed (click, scroll, login)  |               | ✅
Anti-bot JS challenges (Cloudflare JS)     |               | ✅ (or use residential proxies)
Need to scrape 10k+ pages fast             | ✅            |

Quick test: curl https://target.com | grep 'your-selector'. If the data is there, Cheerio is enough. If it's not in the initial HTML response, you need a browser — or you need to find the underlying API endpoint the JS calls (often the better approach).

Many SPAs load data from a JSON API. Check the Network tab for XHR requests — you can often scrape the API directly with axios, skipping both Cheerio and Puppeteer.
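A minimal sketch of that approach — the endpoint below is hypothetical; use whatever URL actually appears in your Network tab:

const axios = require('axios');

// Hypothetical JSON endpoint discovered via the browser's Network tab
async function fetchProductFromApi(productId) {
  const { data } = await axios.get(
    `https://example-shop.com/api/products/${productId}`,
    { headers: { Accept: 'application/json' } }
  );
  // The response is already structured — no HTML parsing required
  return { title: data.title, price: data.price };
}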

Building a Rotating Proxy Interceptor

Hardcoding a single proxy isn't scalable. You need rotation — a different IP for each request, or sticky sessions when you need consistency. The cleanest way to implement this in axios is as a request interceptor that swaps proxy credentials per request.

Interceptor Design

The interceptor generates a unique session identifier per request. For per-request rotation (different IP every time), use a random session ID. For sticky sessions (same IP for a batch), reuse the session ID across requests.

const axios = require('axios');
const crypto = require('crypto');
const { HttpsProxyAgent } = require('https-proxy-agent');
const { HttpProxyAgent } = require('http-proxy-agent');

const PROXY_HOST = 'gate.proxyhat.com';
const PROXY_PORT = 8080;
const PROXY_USER = 'YOUR_USERNAME';
const PROXY_PASS = 'YOUR_PASSWORD';

/**
 * Create an axios instance with rotating residential proxy support.
 *
 * @param {object} options
 * @param {string} [options.country] - Geo-target country code (e.g., 'US')
 * @param {string} [options.sessionId] - Sticky session ID (omit for per-request rotation)
 * @param {number} [options.retries=3] - Max retries on proxy errors
 */
function createScraperClient({ country, sessionId, retries = 3 } = {}) {
  const client = axios.create({
    timeout: 20_000,
    proxy: false, // Agents injected per request handle all proxying
    headers: {
      'User-Agent':
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
      'Accept': 'text/html,application/xhtml+xml',
    },
  });

  // Request interceptor: inject rotating proxy agent
  client.interceptors.request.use((config) => {
    // Per-request rotation: random session = new IP
    // Sticky session: reuse sessionId = same IP
    const session = sessionId || crypto.randomUUID();

    let username = `${PROXY_USER}-session-${session}`;
    if (country) username += `-country-${country}`;

    const proxyUrl = `http://${username}:${PROXY_PASS}@${PROXY_HOST}:${PROXY_PORT}`;
    config.httpsAgent = new HttpsProxyAgent(proxyUrl); // CONNECT tunnel for HTTPS targets
    config.httpAgent = new HttpProxyAgent(proxyUrl);   // Plain forwarding for HTTP targets

    // Attach metadata for logging
    config.metadata = { sessionId: session, attempt: config.metadata?.attempt ?? 1 };

    return config;
  });

  // Response interceptor: retry on proxy/auth errors
  client.interceptors.response.use(
    (response) => response,
    async (error) => {
      const config = error.config;
      const status = error.response?.status;
      const attempt = config.metadata?.attempt ?? 1;

      // Retry on 403 (blocked), 429 (rate-limited), or network errors
      const retryable = !status || [403, 429, 502, 503, 504].includes(status);

      if (retryable && attempt < retries) {
        config.metadata = { ...config.metadata, attempt: attempt + 1 };
        // Drop the stale agents so the request interceptor injects fresh ones —
        // a new random session (and IP) unless a sticky sessionId was set
        delete config.httpsAgent;
        delete config.httpAgent;
        const delay = Math.min(1000 * Math.pow(2, attempt - 1), 10_000);
        await new Promise((r) => setTimeout(r, delay));
        return client(config);
      }

      return Promise.reject(error);
    }
  );

  return client;
}

module.exports = { createScraperClient };

Usage is simple:

const { createScraperClient } = require('./scraper-client');

// Per-request rotation (different IP each request)
const rotatingClient = createScraperClient({ country: 'US' });

// Sticky session (same IP for a batch of requests)
const stickyClient = createScraperClient({
  country: 'DE',
  sessionId: 'batch-42',
});

// Each request gets a fresh residential IP
const page1 = await rotatingClient.get('https://example.com/page/1');
const page2 = await rotatingClient.get('https://example.com/page/2');

Concurrent Scraping with p-limit

Scraping 10,000 URLs sequentially would take hours. But unbounded concurrency gets you rate-limited or IP-banned. p-limit gives you controlled concurrency without the overhead of a full job queue.

npm install p-limit

// Note: p-limit v4+ is ESM-only. Install p-limit@3 to use require(), or switch to import.
const pLimit = require('p-limit');
const { createScraperClient } = require('./scraper-client');
const cheerio = require('cheerio');

const CONCURRENCY = 15; // 15 concurrent requests with rotating proxies
const limit = pLimit(CONCURRENCY);

async function scrapeProductPage(client, url) {
  try {
    const { data: html } = await client.get(url);
    const $ = cheerio.load(html);

    return {
      url,
      title: $('h1.product-title').text().trim(),
      price: $('span.price-current').text().trim(),
      inStock: $('.stock-status').text().includes('In Stock'),
    };
  } catch (err) {
    const status = err.response?.status;
    console.error(`[${status ?? 'ERR'}] Failed: ${url}`);
    return { url, error: status ?? err.code };
  }
}

async function scrapeProductCatalog(urls) {
  const client = createScraperClient({ country: 'US' });

  const tasks = urls.map((url) =>
    limit(() => scrapeProductPage(client, url))
  );

  const results = await Promise.all(tasks);

  const succeeded = results.filter((r) => !r.error);
  const failed = results.filter((r) => r.error);

  console.log(`Scraped ${succeeded.length}/${urls.length} pages`);
  console.log(`Failed: ${failed.length}`);

  return { succeeded, failed };
}

module.exports = { scrapeProductCatalog };

Concurrency tuning: With residential proxies and per-request rotation, 10–20 concurrent requests is a safe starting point. Datacenter proxies can handle 50+, but you'll hit blocks faster. Monitor your success rate and adjust.

Full Example: 10k-URL E-Commerce Scrape with Circuit Breaking

Let's put it all together. This script scrapes a static e-commerce catalog across 10,000 product URLs with proxy rotation, concurrency control, and circuit-breaking error handling.

Circuit Breaker

A circuit breaker stops all requests when the error rate exceeds a threshold — preventing wasted proxy bandwidth when the target site is blocking you globally.

class CircuitBreaker {
  constructor({ threshold = 0.5, window = 50, cooldown = 30_000 } = {}) {
    this.threshold = threshold; // 50% failure rate trips the breaker
    this.window = window;       // Check last N requests
    this.cooldown = cooldown;   // 30s pause before retrying
    this.results = [];          // true = success, false = failure
    this.tripped = false;
    this.trippedAt = null;
  }

  record(success) {
    this.results.push(success);
    if (this.results.length > this.window) this.results.shift();
    this._check();
  }

  _check() {
    if (this.results.length < 10) return; // Need minimum data
    const failRate = this.results.filter((r) => !r).length / this.results.length;
    if (failRate >= this.threshold) {
      this.tripped = true;
      this.trippedAt = Date.now();
    }
  }

  async guard() {
    if (!this.tripped) return;
    const elapsed = Date.now() - this.trippedAt;
    if (elapsed < this.cooldown) {
      throw new Error(
        `Circuit breaker tripped. Retry after ${Math.ceil(
          (this.cooldown - elapsed) / 1000
        )}s`
      );
    }
    // Cooldown elapsed — reset the breaker and let traffic flow again
    this.tripped = false;
    this.results = [];
  }
}

module.exports = { CircuitBreaker };

Main Pipeline

const pLimit = require('p-limit');
const cheerio = require('cheerio');
const fs = require('fs');
const { createScraperClient } = require('./scraper-client');
const { CircuitBreaker } = require('./circuit-breaker');

const CONCURRENCY = 15;
const MAX_RETRIES = 3;
const OUTPUT_FILE = 'products.jsonl';

async function scrapeProduct(client, breaker, url, attempt = 1) {
  await breaker.guard(); // Throws if the circuit is open

  try {
    // axios throws on 4xx/5xx by default, so HTTP errors land in the catch block
    const { data: html } = await client.get(url);

    const $ = cheerio.load(html);
    const product = {
      url,
      title: $('h1.product-title').text().trim(),
      price: $('span.price-current').text().trim(),
      availability: $('span.stock-status').text().trim(),
      breadcrumbs: $('.breadcrumb a').map((_, el) => $(el).text().trim()).get(),
    };

    breaker.record(true);
    return product;
  } catch (err) {
    breaker.record(false);

    if (attempt < MAX_RETRIES) {
      const delay = 1000 * Math.pow(2, attempt - 1);
      await new Promise((r) => setTimeout(r, delay));
      return scrapeProduct(client, breaker, url, attempt + 1);
    }

    return { url, error: err.message };
  }
}

async function main() {
  // Load URL list (one URL per line)
  const urls = fs.readFileSync('urls.txt', 'utf-8')
    .split('\n')
    .map((u) => u.trim())
    .filter(Boolean);

  console.log(`Loaded ${urls.length} URLs`);

  const client = createScraperClient({
    country: 'US',
    retries: MAX_RETRIES,
  });
  const breaker = new CircuitBreaker({
    threshold: 0.5,
    window: 50,
    cooldown: 30_000,
  });
  const limit = pLimit(CONCURRENCY);

  const stream = fs.createWriteStream(OUTPUT_FILE, { flags: 'a' });

  const tasks = urls.map((url) =>
    limit(async () => {
      const result = await scrapeProduct(client, breaker, url);
      if (result && !result.error) {
        stream.write(JSON.stringify(result) + '\n');
      }
      return result;
    })
  );

  const results = await Promise.all(tasks);
  stream.end();

  const succeeded = results.filter((r) => r && !r.error);
  const failed = results.filter((r) => r && r.error);

  console.log(`✅ Succeeded: ${succeeded.length}`);
  console.log(`❌ Failed: ${failed.length}`);
  console.log(`📊 Success rate: ${((succeeded.length / urls.length) * 100).toFixed(1)}%`);

  if (failed.length > 0) {
    fs.writeFileSync('failed-urls.txt',
      failed.map((r) => r.url).join('\n'));
    console.log('Failed URLs saved to failed-urls.txt for retry');
  }
}

main().catch(console.error);

Why JSONL? Writing each result as a single line means you don't lose data if the process crashes mid-run. You can resume from the last successful entry.
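One way to implement that resume — assuming each line carries the url field, as in the pipeline above:

const fs = require('fs');

// Sketch: collect URLs already written to products.jsonl and skip them on restart
function loadCompletedUrls(file = 'products.jsonl') {
  if (!fs.existsSync(file)) return new Set();
  return new Set(
    fs.readFileSync(file, 'utf-8')
      .split('\n')
      .filter(Boolean)
      .map((line) => JSON.parse(line).url)
  );
}

// urls is the full list loaded from urls.txt, as in main()
const done = loadCompletedUrls();
const remaining = urls.filter((u) => !done.has(u));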

Error Handling Strategy: 403, 429, and Beyond

Different HTTP errors require different responses:

  • 403 Forbidden: Your IP is blocked. Rotate to a new proxy session immediately. If the block persists across sessions, the site may be fingerprinting your request headers — diversify your User-Agent and Accept-Language headers.
  • 429 Too Many Requests: You're hitting a rate limit. Add exponential backoff. Reduce concurrency. Consider sticky sessions so you spread requests across fewer IPs (some sites rate-limit per IP, not globally).
  • 502/503/504: Transient server errors. Retry with backoff. These are usually temporary.
  • CAPTCHAs (detected by response content): Check if the response HTML contains CAPTCHA indicators. If so, rotate IP and retry. For persistent CAPTCHAs, consider residential proxies which appear as real user traffic.
  • Timeouts: Increase timeout or retry. If consistent, the site may be intentionally slow-loading for bots.

Pro tip: Detect blocks by content, not just status codes. Some sites return 200 with a CAPTCHA page. Always validate that your selectors find the expected elements.
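A sketch of content-based detection — the indicator strings and the selector here are assumptions to adapt per target site:

// Heuristic block detection: run after cheerio.load() on every response
function looksBlocked($, html) {
  const indicators = ['captcha', 'access denied', 'unusual traffic'];
  const text = html.toLowerCase();
  if (indicators.some((s) => text.includes(s))) return true;
  // A 200 response missing the expected selector is also suspect
  return $('h1.product-title').length === 0;
}

If it returns true, record a failure with the circuit breaker and retry on a fresh proxy session.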

Scaling Beyond a Single Process

For very large scraping jobs (100k+ URLs), a single Node.js process becomes a bottleneck. Here's how to scale:

Containerization

Package your scraper as a Docker container and run multiple instances, each with a different geo-target or URL partition:

# docker-compose.yml
services:
  scraper-us:
    build: .
    environment:
      - COUNTRY=US
      - URL_FILE=urls-us.txt
  scraper-de:
    build: .
    environment:
      - COUNTRY=DE
      - URL_FILE=urls-de.txt
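Each container then reads its partition from the environment — a minimal sketch matching the variables above:

const fs = require('fs');
const { createScraperClient } = require('./scraper-client');

// Per-container configuration (COUNTRY and URL_FILE are set in docker-compose)
const country = process.env.COUNTRY ?? 'US';
const urlFile = process.env.URL_FILE ?? 'urls.txt';

const client = createScraperClient({ country });
const urls = fs.readFileSync(urlFile, 'utf-8').split('\n').filter(Boolean);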

Job Queue Pattern

For distributed scraping, use Redis or BullMQ as a job queue. Each worker pops URLs from the queue, scrapes them, and pushes results. This gives you horizontal scaling, automatic retries, and dead-letter queues for failed URLs.
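A minimal BullMQ sketch, assuming a Redis instance on localhost; the queue and job names are illustrative:

const { Queue, Worker } = require('bullmq');
const { createScraperClient } = require('./scraper-client');

const connection = { host: 'localhost', port: 6379 }; // Assumes a local Redis

// Producer: enqueue URLs with built-in retries and backoff
async function enqueueUrls(urls) {
  const queue = new Queue('scrape', { connection });
  for (const url of urls) {
    await queue.add('product', { url }, {
      attempts: 3,
      backoff: { type: 'exponential', delay: 2000 },
    });
  }
}

// Worker: each process pulls jobs independently — scale by adding workers
const client = createScraperClient({ country: 'US' });
new Worker('scrape', async (job) => {
  const { data: html } = await client.get(job.data.url);
  // Parse with Cheerio as in the earlier sections and return the record
  return { url: job.data.url, length: html.length };
}, { connection, concurrency: 10 });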

Headless Fleet

When you need Puppeteer at scale, run a fleet of headless browsers behind a load balancer. Tools like browserless or playwright-cluster manage browser lifecycle for you. Route requests through residential proxies to avoid detection.
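Routing Puppeteer through the same gateway is a one-flag change — a sketch using the same placeholder credentials as earlier:

const puppeteer = require('puppeteer');

(async () => {
  // Point the whole browser at the proxy gateway
  const browser = await puppeteer.launch({
    args: ['--proxy-server=http://gate.proxyhat.com:8080'],
  });
  const page = await browser.newPage();
  // Proxy credentials are supplied per page
  await page.authenticate({
    username: 'user-country-US',
    password: 'PASSWORD',
  });
  await page.goto('https://example.com', { waitUntil: 'networkidle2' });
  console.log(await page.title());
  await browser.close();
})();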

Key Takeaways

  • Cheerio + axios is the right stack for server-rendered HTML — 10–50× faster and more memory-efficient than headless browsers.
  • Check the page source first. If the data is in the initial HTML, you don't need Puppeteer.
  • Proxy rotation belongs in your HTTP layer, not scattered across your codebase. An axios interceptor is the idiomatic place for it.
  • Per-request rotation (random session IDs) gives you a different IP every time. Sticky sessions keep the same IP for a batch — choose based on your target site's behavior.
  • Control concurrency with p-limit. Start at 10–15 concurrent requests with residential proxies and adjust based on success rates.
  • Circuit breakers prevent cascading failures. When your error rate spikes, pause — don't burn through proxy bandwidth hitting a wall.
  • Write results as JSONL so you don't lose progress on crashes. Separate failed URLs for retry runs.
  • Detect blocks by content, not just status codes. A 200 response with a CAPTCHA page is still a block.

Frequently Asked Questions

Does Cheerio execute JavaScript?

No. Cheerio parses static HTML only. If the data you need is rendered by client-side JavaScript, you need a headless browser like Puppeteer or Playwright — or you can find the underlying API endpoint the JS calls and request that directly with axios.

How do I use SOCKS5 proxies with axios?

Use the socks-proxy-agent package instead of https-proxy-agent. Construct the agent with socks5://USERNAME:PASSWORD@gate.proxyhat.com:1080 and assign it to config.httpsAgent and config.httpAgent on your axios instance.

What's the difference between per-request rotation and sticky sessions?

Per-request rotation assigns a random session ID to each request, giving you a different residential IP every time. Sticky sessions reuse the same session ID across multiple requests, keeping the same IP for a configurable duration (typically up to 30 minutes). Use sticky sessions when the target site requires session continuity (e.g., paginated browsing, logged-in states).

How many concurrent requests can I run with residential proxies?

With per-request IP rotation, 10–20 concurrent requests is a safe starting point. The proxy pool handles the IP diversity; your concurrency limit is about not overwhelming the target site. Monitor your success rate — if it drops below 90%, reduce concurrency or add delays between requests.

Should I randomize User-Agent headers when scraping?

Yes. Sites often block requests with identical headers across different IPs. Use a library like user-agents or maintain a pool of real browser User-Agent strings and rotate them per request. Pair this with matching Accept, Accept-Language, and Accept-Encoding headers for consistency.
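A simple version of that rotation as another request interceptor — the two header sets below are truncated examples; use full, current browser strings in practice:

const { createScraperClient } = require('./scraper-client');

// Small header pool rotated per request (truncated example strings)
const HEADER_POOL = [
  {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
    'Accept-Language': 'en-US,en;q=0.9',
  },
  {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36',
    'Accept-Language': 'en-GB,en;q=0.8',
  },
];

const client = createScraperClient({ country: 'US' });
client.interceptors.request.use((config) => {
  const pick = HEADER_POOL[Math.floor(Math.random() * HEADER_POOL.length)];
  Object.assign(config.headers, pick);
  return config;
});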

Ready to get started?

Access 50M+ residential IPs across 148+ countries with AI-powered filtering.
