Node.js Scraping with Cheerio & Proxies: A Code-First Guide

Build production-grade scrapers with Cheerio and axios. Covers proxy rotation as an interceptor, concurrency with p-limit, error handling for 403/429, and a full 10k-URL scraping pipeline.

Why Cheerio + axios Is the Lightweight Scraping Stack

If you're scraping static HTML, you don't need a headless browser chewing through hundreds of megabytes of RAM per tab. Cheerio parses server-rendered HTML in milliseconds. axios handles the HTTP layer. Together they give you a scraping pipeline that's fast, memory-efficient, and easy to debug.

The catch? Any serious scraping volume will hit rate limits, geo-blocks, or IP bans. That's where proxy rotation comes in — and this guide shows you how to integrate it properly, not as a hack, but as a first-class part of your HTTP stack.

We'll cover everything from your first Cheerio script to a production pipeline that scrapes 10,000 URLs with rotating residential proxies, concurrency control, and circuit-breaking error handling.

Setting Up: axios + Cheerio for Server-Side HTML Parsing

Let's start with the basics. Install the dependencies:

npm install axios cheerio

Here's a minimal scraper that fetches a page and extracts data:

const axios = require('axios');
const cheerio = require('cheerio');

async function scrapeProduct(url) {
  const { data: html } = await axios.get(url, {
    headers: {
      'User-Agent':
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
      'Accept': 'text/html,application/xhtml+xml',
    },
    timeout: 15_000,
  });

  const $ = cheerio.load(html);

  return {
    title: $('h1.product-title').text().trim(),
    price: $('span.price-current').text().trim(),
    availability: $('span.stock-status').text().trim(),
    breadcrumbs: $('.breadcrumb a')
      .map((_, el) => $(el).text().trim())
      .get(),
  };
}

scrapeProduct('https://example-shop.com/products/widget-123')
  .then(console.log)
  .catch(console.error);

Cheerio's jQuery-like API makes element selection intuitive. .map() + .get() converts a Cheerio collection into a plain array. No browser, no DOM, no rendering overhead.

Adding Proxy Support to axios

axios supports proxies natively via its proxy config option. For HTTP proxies:

const axios = require('axios');

const client = axios.create({
  proxy: {
    host: 'gate.proxyhat.com',
    port: 8080,
    auth: {
      username: 'user-country-US',
      password: 'PASSWORD',
    },
  },
});

// Every request now routes through the residential proxy
const { data } = await client.get('https://example.com');

For HTTPS targets through an HTTP proxy, or when you need SOCKS5 support, use https-proxy-agent or socks-proxy-agent:

const { HttpsProxyAgent } = require('https-proxy-agent');
const axios = require('axios');

const agent = new HttpsProxyAgent(
  'http://user-country-US:PASSWORD@gate.proxyhat.com:8080'
);

const client = axios.create({
  httpsAgent: agent,
  proxy: false, // Disable axios's built-in proxy handling — the agent does the tunneling
});

const { data } = await client.get('https://example.com');

Use the native proxy config for simple HTTP proxying. Switch to https-proxy-agent when you need CONNECT tunneling, SOCKS5, or custom TLS settings.
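For SOCKS5, the setup is nearly identical — here's a minimal sketch using socks-proxy-agent (the socks5:// URL and port 1080 are placeholders; substitute your provider's endpoint):

const { SocksProxyAgent } = require('socks-proxy-agent');
const axios = require('axios');

// Placeholder credentials and gateway — use your provider's SOCKS5 endpoint
const agent = new SocksProxyAgent(
  'socks5://user-country-US:PASSWORD@gate.proxyhat.com:1080'
);

const client = axios.create({
  httpAgent: agent,
  httpsAgent: agent,
  proxy: false, // The agent handles all proxying
});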

When Cheerio Is Enough — And When It Isn't

Not every site requires a headless browser. Here's how to decide:

Signal                                     | Cheerio works | Need headless (Puppeteer/Playwright)
View page source has the data              | ✅            |
Data only appears after JS execution       |               | ✅
Site uses SSR (Next.js, Rails, Django)     | ✅            |
Site is a SPA (React, Vue client-render)   |               | ✅
Interaction needed (click, scroll, login)  |               | ✅
Anti-bot JS challenges (Cloudflare JS)     |               | ✅ (or use residential proxies)
Need to scrape 10k+ pages fast             | ✅            |

Quick test: curl https://target.com | grep 'your-selector'. If the data is there, Cheerio is enough. If it's not in the initial HTML response, you need a browser — or you need to find the underlying API endpoint the JS calls (often the better approach).

Many SPAs load data from a JSON API. Check the Network tab for XHR requests — you can often scrape the API directly with axios, skipping both Cheerio and Puppeteer.
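A minimal sketch of that approach — the endpoint below is hypothetical; use whatever URL actually appears in your Network tab:

const axios = require('axios');

// Hypothetical JSON endpoint discovered via the browser's Network tab
async function fetchProductFromApi(productId) {
  const { data } = await axios.get(
    `https://example-shop.com/api/products/${productId}`,
    { headers: { Accept: 'application/json' } }
  );
  // The response is already structured — no HTML parsing required
  return { title: data.title, price: data.price };
}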

Building a Rotating Proxy Interceptor

Hardcoding a single proxy isn't scalable. You need rotation — a different IP for each request, or sticky sessions when you need consistency. The cleanest way to implement this in axios is as a request interceptor that swaps proxy credentials per request.

Interceptor Design

The interceptor generates a unique session identifier per request. For per-request rotation (different IP every time), use a random session ID. For sticky sessions (same IP for a batch), reuse the session ID across requests.

const axios = require('axios');
const crypto = require('crypto');
const { HttpsProxyAgent } = require('https-proxy-agent');
const { HttpProxyAgent } = require('http-proxy-agent');

const PROXY_HOST = 'gate.proxyhat.com';
const PROXY_PORT = 8080;
const PROXY_USER = 'YOUR_USERNAME';
const PROXY_PASS = 'YOUR_PASSWORD';

/**
 * Create an axios instance with rotating residential proxy support.
 *
 * @param {object} options
 * @param {string} [options.country] - Geo-target country code (e.g., 'US')
 * @param {string} [options.sessionId] - Sticky session ID (omit for per-request rotation)
 * @param {number} [options.retries=3] - Max retries on proxy errors
 */
function createScraperClient({ country, sessionId, retries = 3 } = {}) {
  const client = axios.create({
    timeout: 20_000,
    proxy: false, // Agents injected per request handle all proxying
    headers: {
      'User-Agent':
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
      'Accept': 'text/html,application/xhtml+xml',
    },
  });

  // Request interceptor: inject rotating proxy agent
  client.interceptors.request.use((config) => {
    // Per-request rotation: random session = new IP
    // Sticky session: reuse sessionId = same IP
    const session = sessionId || crypto.randomUUID();

    let username = `${PROXY_USER}-session-${session}`;
    if (country) username += `-country-${country}`;

    const proxyUrl = `http://${username}:${PROXY_PASS}@${PROXY_HOST}:${PROXY_PORT}`;
    config.httpsAgent = new HttpsProxyAgent(proxyUrl); // CONNECT tunnel for HTTPS targets
    config.httpAgent = new HttpProxyAgent(proxyUrl);   // Plain forwarding for HTTP targets

    // Attach metadata for logging
    config.metadata = { sessionId: session, attempt: config.metadata?.attempt ?? 1 };

    return config;
  });

  // Response interceptor: retry on proxy/auth errors
  client.interceptors.response.use(
    (response) => response,
    async (error) => {
      const config = error.config;
      const status = error.response?.status;
      const attempt = config.metadata?.attempt ?? 1;

      // Retry on 403 (blocked), 429 (rate-limited), or network errors
      const retryable = !status || [403, 429, 502, 503, 504].includes(status);

      if (retryable && attempt < retries) {
        config.metadata = { ...config.metadata, attempt: attempt + 1 };
        // Drop the stale agents so the request interceptor injects fresh ones —
        // a new random session (and IP) unless a sticky sessionId was set
        delete config.httpsAgent;
        delete config.httpAgent;
        const delay = Math.min(1000 * Math.pow(2, attempt - 1), 10_000);
        await new Promise((r) => setTimeout(r, delay));
        return client(config);
      }

      return Promise.reject(error);
    }
  );

  return client;
}

module.exports = { createScraperClient };

Usage is simple:

const { createScraperClient } = require('./scraper-client');

// Per-request rotation (different IP each request)
const rotatingClient = createScraperClient({ country: 'US' });

// Sticky session (same IP for a batch of requests)
const stickyClient = createScraperClient({
  country: 'DE',
  sessionId: 'batch-42',
});

// Each request gets a fresh residential IP
const page1 = await rotatingClient.get('https://example.com/page/1');
const page2 = await rotatingClient.get('https://example.com/page/2');

Concurrent Scraping with p-limit

Scraping 10,000 URLs sequentially would take hours. But unbounded concurrency gets you rate-limited or IP-banned. p-limit gives you controlled concurrency without the overhead of a full job queue.

npm install p-limit

// Note: p-limit v4+ is ESM-only. Install p-limit@3 to use require(), or switch to import.
const pLimit = require('p-limit');
const { createScraperClient } = require('./scraper-client');
const cheerio = require('cheerio');

const CONCURRENCY = 15; // 15 concurrent requests with rotating proxies
const limit = pLimit(CONCURRENCY);

async function scrapeProductPage(client, url) {
  try {
    const { data: html } = await client.get(url);
    const $ = cheerio.load(html);

    return {
      url,
      title: $('h1.product-title').text().trim(),
      price: $('span.price-current').text().trim(),
      inStock: $('.stock-status').text().includes('In Stock'),
    };
  } catch (err) {
    const status = err.response?.status;
    console.error(`[${status ?? 'ERR'}] Failed: ${url}`);
    return { url, error: status ?? err.code };
  }
}

async function scrapeProductCatalog(urls) {
  const client = createScraperClient({ country: 'US' });

  const tasks = urls.map((url) =>
    limit(() => scrapeProductPage(client, url))
  );

  const results = await Promise.all(tasks);

  const succeeded = results.filter((r) => !r.error);
  const failed = results.filter((r) => r.error);

  console.log(`Scraped ${succeeded.length}/${urls.length} pages`);
  console.log(`Failed: ${failed.length}`);

  return { succeeded, failed };
}

module.exports = { scrapeProductCatalog };

Concurrency tuning: With residential proxies and per-request rotation, 10–20 concurrent requests is a safe starting point. Datacenter proxies can handle 50+, but you'll hit blocks faster. Monitor your success rate and adjust.

Full Example: 10k-URL E-Commerce Scrape with Circuit Breaking

Let's put it all together. This script scrapes a static e-commerce catalog across 10,000 product URLs with proxy rotation, concurrency control, and circuit-breaking error handling.

Circuit Breaker

A circuit breaker stops all requests when the error rate exceeds a threshold — preventing wasted proxy bandwidth when the target site is blocking you globally.

class CircuitBreaker {
  constructor({ threshold = 0.5, window = 50, cooldown = 30_000 } = {}) {
    this.threshold = threshold; // 50% failure rate trips the breaker
    this.window = window;       // Check last N requests
    this.cooldown = cooldown;   // 30s pause before retrying
    this.results = [];          // true = success, false = failure
    this.tripped = false;
    this.trippedAt = null;
  }

  record(success) {
    this.results.push(success);
    if (this.results.length > this.window) this.results.shift();
    this._check();
  }

  _check() {
    if (this.results.length < 10) return; // Need minimum data
    const failRate = this.results.filter((r) => !r).length / this.results.length;
    if (failRate >= this.threshold) {
      this.tripped = true;
      this.trippedAt = Date.now();
    }
  }

  async guard() {
    if (!this.tripped) return;
    const elapsed = Date.now() - this.trippedAt;
    if (elapsed < this.cooldown) {
      throw new Error(
        `Circuit breaker tripped. Retry after ${Math.ceil(
          (this.cooldown - elapsed) / 1000
        )}s`
      );
    }
    // Cooldown elapsed — reset the breaker and let traffic flow again
    this.tripped = false;
    this.results = [];
  }
}

module.exports = { CircuitBreaker };

Main Pipeline

const pLimit = require('p-limit');
const cheerio = require('cheerio');
const fs = require('fs');
const { createScraperClient } = require('./scraper-client');
const { CircuitBreaker } = require('./circuit-breaker');

const CONCURRENCY = 15;
const MAX_RETRIES = 3;
const OUTPUT_FILE = 'products.jsonl';

async function scrapeProduct(client, breaker, url, attempt = 1) {
  await breaker.guard(); // Throws if the circuit is open

  try {
    // axios throws on 4xx/5xx by default, so HTTP errors land in the catch block
    const { data: html } = await client.get(url);

    const $ = cheerio.load(html);
    const product = {
      url,
      title: $('h1.product-title').text().trim(),
      price: $('span.price-current').text().trim(),
      availability: $('span.stock-status').text().trim(),
      breadcrumbs: $('.breadcrumb a').map((_, el) => $(el).text().trim()).get(),
    };

    breaker.record(true);
    return product;
  } catch (err) {
    breaker.record(false);

    if (attempt < MAX_RETRIES) {
      const delay = 1000 * Math.pow(2, attempt - 1);
      await new Promise((r) => setTimeout(r, delay));
      return scrapeProduct(client, breaker, url, attempt + 1);
    }

    return { url, error: err.message };
  }
}

async function main() {
  // Load URL list (one URL per line)
  const urls = fs.readFileSync('urls.txt', 'utf-8')
    .split('\n')
    .map((u) => u.trim())
    .filter(Boolean);

  console.log(`Loaded ${urls.length} URLs`);

  const client = createScraperClient({
    country: 'US',
    retries: MAX_RETRIES,
  });
  const breaker = new CircuitBreaker({
    threshold: 0.5,
    window: 50,
    cooldown: 30_000,
  });
  const limit = pLimit(CONCURRENCY);

  const stream = fs.createWriteStream(OUTPUT_FILE, { flags: 'a' });

  const tasks = urls.map((url) =>
    limit(async () => {
      const result = await scrapeProduct(client, breaker, url);
      if (result && !result.error) {
        stream.write(JSON.stringify(result) + '\n');
      }
      return result;
    })
  );

  const results = await Promise.all(tasks);
  stream.end();

  const succeeded = results.filter((r) => r && !r.error);
  const failed = results.filter((r) => r && r.error);

  console.log(`✅ Succeeded: ${succeeded.length}`);
  console.log(`❌ Failed: ${failed.length}`);
  console.log(`📊 Success rate: ${((succeeded.length / urls.length) * 100).toFixed(1)}%`);

  if (failed.length > 0) {
    fs.writeFileSync('failed-urls.txt',
      failed.map((r) => r.url).join('\n'));
    console.log('Failed URLs saved to failed-urls.txt for retry');
  }
}

main().catch(console.error);

Why JSONL? Writing each result as a single line means you don't lose data if the process crashes mid-run. You can resume from the last successful entry.
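One way to implement that resume — assuming each line carries the url field, as in the pipeline above:

const fs = require('fs');

// Sketch: collect URLs already written to products.jsonl and skip them on restart
function loadCompletedUrls(file = 'products.jsonl') {
  if (!fs.existsSync(file)) return new Set();
  return new Set(
    fs.readFileSync(file, 'utf-8')
      .split('\n')
      .filter(Boolean)
      .map((line) => JSON.parse(line).url)
  );
}

// urls is the full list loaded from urls.txt, as in main()
const done = loadCompletedUrls();
const remaining = urls.filter((u) => !done.has(u));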

Error Handling Strategy: 403, 429, and Beyond

Different HTTP errors require different responses:

  • 403 Forbidden: Your IP is blocked. Rotate to a new proxy session immediately. If the block persists across sessions, the site may be fingerprinting your request headers — diversify your User-Agent and Accept-Language headers.
  • 429 Too Many Requests: You're hitting a rate limit. Add exponential backoff. Reduce concurrency. Consider sticky sessions so you spread requests across fewer IPs (some sites rate-limit per IP, not globally).
  • 502/503/504: Transient server errors. Retry with backoff. These are usually temporary.
  • CAPTCHAs (detected by response content): Check if the response HTML contains CAPTCHA indicators. If so, rotate IP and retry. For persistent CAPTCHAs, consider residential proxies which appear as real user traffic.
  • Timeouts: Increase timeout or retry. If consistent, the site may be intentionally slow-loading for bots.

Pro tip: Detect blocks by content, not just status codes. Some sites return 200 with a CAPTCHA page. Always validate that your selectors find the expected elements.
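A sketch of content-based detection — the indicator strings and the selector here are assumptions to adapt per target site:

// Heuristic block detection: run after cheerio.load() on every response
function looksBlocked($, html) {
  const indicators = ['captcha', 'access denied', 'unusual traffic'];
  const text = html.toLowerCase();
  if (indicators.some((s) => text.includes(s))) return true;
  // A 200 response missing the expected selector is also suspect
  return $('h1.product-title').length === 0;
}

If it returns true, record a failure with the circuit breaker and retry on a fresh proxy session.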

Scaling Beyond a Single Process

For very large scraping jobs (100k+ URLs), a single Node.js process becomes a bottleneck. Here's how to scale:

Containerization

Package your scraper as a Docker container and run multiple instances, each with a different geo-target or URL partition:

# docker-compose.yml
services:
  scraper-us:
    build: .
    environment:
      - COUNTRY=US
      - URL_FILE=urls-us.txt
  scraper-de:
    build: .
    environment:
      - COUNTRY=DE
      - URL_FILE=urls-de.txt
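Each container then reads its partition from the environment — a minimal sketch matching the variables above:

const fs = require('fs');
const { createScraperClient } = require('./scraper-client');

// Per-container configuration (COUNTRY and URL_FILE are set in docker-compose)
const country = process.env.COUNTRY ?? 'US';
const urlFile = process.env.URL_FILE ?? 'urls.txt';

const client = createScraperClient({ country });
const urls = fs.readFileSync(urlFile, 'utf-8').split('\n').filter(Boolean);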

Job Queue Pattern

For distributed scraping, use Redis or BullMQ as a job queue. Each worker pops URLs from the queue, scrapes them, and pushes results. This gives you horizontal scaling, automatic retries, and dead-letter queues for failed URLs.
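A minimal BullMQ sketch, assuming a Redis instance on localhost; the queue and job names are illustrative:

const { Queue, Worker } = require('bullmq');
const { createScraperClient } = require('./scraper-client');

const connection = { host: 'localhost', port: 6379 }; // Assumes a local Redis

// Producer: enqueue URLs with built-in retries and backoff
async function enqueueUrls(urls) {
  const queue = new Queue('scrape', { connection });
  for (const url of urls) {
    await queue.add('product', { url }, {
      attempts: 3,
      backoff: { type: 'exponential', delay: 2000 },
    });
  }
}

// Worker: each process pulls jobs independently — scale by adding workers
const client = createScraperClient({ country: 'US' });
new Worker('scrape', async (job) => {
  const { data: html } = await client.get(job.data.url);
  // Parse with Cheerio as in the earlier sections and return the record
  return { url: job.data.url, length: html.length };
}, { connection, concurrency: 10 });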

Headless Fleet

When you need Puppeteer at scale, run a fleet of headless browsers behind a load balancer. Tools like browserless or playwright-cluster manage browser lifecycle for you. Route requests through residential proxies to avoid detection.
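Routing Puppeteer through the same gateway is a one-flag change — a sketch using the same placeholder credentials as earlier:

const puppeteer = require('puppeteer');

(async () => {
  // Point the whole browser at the proxy gateway
  const browser = await puppeteer.launch({
    args: ['--proxy-server=http://gate.proxyhat.com:8080'],
  });
  const page = await browser.newPage();
  // Proxy credentials are supplied per page
  await page.authenticate({
    username: 'user-country-US',
    password: 'PASSWORD',
  });
  await page.goto('https://example.com', { waitUntil: 'networkidle2' });
  console.log(await page.title());
  await browser.close();
})();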

Key Takeaways

  • Cheerio + axios is the right stack for server-rendered HTML — 10–50× faster and more memory-efficient than headless browsers.
  • Check the page source first. If the data is in the initial HTML, you don't need Puppeteer.
  • Proxy rotation belongs in your HTTP layer, not scattered across your codebase. An axios interceptor is the idiomatic place for it.
  • Per-request rotation (random session IDs) gives you a different IP every time. Sticky sessions keep the same IP for a batch — choose based on your target site's behavior.
  • Control concurrency with p-limit. Start at 10–15 concurrent requests with residential proxies and adjust based on success rates.
  • Circuit breakers prevent cascading failures. When your error rate spikes, pause — don't burn through proxy bandwidth hitting a wall.
  • Write results as JSONL so you don't lose progress on crashes. Separate failed URLs for retry runs.
  • Detect blocks by content, not just status codes. A 200 response with a CAPTCHA page is still a block.

Frequently Asked Questions

Does Cheerio execute JavaScript?

No. Cheerio parses static HTML only. If the data you need is rendered by client-side JavaScript, you need a headless browser like Puppeteer or Playwright — or you can find the underlying API endpoint the JS calls and request that directly with axios.

How do I use SOCKS5 proxies with axios?

Use the socks-proxy-agent package instead of https-proxy-agent. Construct the agent with socks5://USERNAME:PASSWORD@gate.proxyhat.com:1080 and assign it to config.httpsAgent and config.httpAgent on your axios instance.

What's the difference between per-request rotation and sticky sessions?

Per-request rotation assigns a random session ID to each request, giving you a different residential IP every time. Sticky sessions reuse the same session ID across multiple requests, keeping the same IP for a configurable duration (typically up to 30 minutes). Use sticky sessions when the target site requires session continuity (e.g., paginated browsing, logged-in states).

How many concurrent requests can I run with residential proxies?

With per-request IP rotation, 10–20 concurrent requests is a safe starting point. The proxy pool handles the IP diversity; your concurrency limit is about not overwhelming the target site. Monitor your success rate — if it drops below 90%, reduce concurrency or add delays between requests.

Should I randomize User-Agent headers when scraping?

Yes. Sites often block requests with identical headers across different IPs. Use a library like user-agents or maintain a pool of real browser User-Agent strings and rotate them per request. Pair this with matching Accept, Accept-Language, and Accept-Encoding headers for consistency.
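A simple version of that rotation as another request interceptor — the two header sets below are truncated examples; use full, current browser strings in practice:

const { createScraperClient } = require('./scraper-client');

// Small header pool rotated per request (truncated example strings)
const HEADER_POOL = [
  {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
    'Accept-Language': 'en-US,en;q=0.9',
  },
  {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36',
    'Accept-Language': 'en-GB,en;q=0.8',
  },
];

const client = createScraperClient({ country: 'US' });
client.interceptors.request.use((config) => {
  const pick = HEADER_POOL[Math.floor(Math.random() * HEADER_POOL.length)];
  Object.assign(config.headers, pick);
  return config;
});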

Ready to get started?

Access 50M+ residential IPs across 148+ countries with AI-powered filtering.
