Why Scraping Infrastructure Needs Dedicated Architecture
A single-threaded script hitting one website works fine for small tasks. But when you need to scrape millions of pages daily across dozens of targets, that script becomes a bottleneck. Scaling scraping infrastructure requires moving from linear scripts to distributed, queue-based architectures that handle failures gracefully, manage proxy rotation, and maximize throughput.
This guide covers the architecture patterns, queue systems, horizontal scaling strategies, and proxy management techniques that power production-grade scraping at scale.
This article builds on concepts from our Complete Guide to Web Scraping Proxies. For proxy pool sizing, see How Many Proxies Do You Need for Scraping?
Architecture Patterns for Scalable Scraping
Pattern 1: Queue-Based Scraping
The foundation of scalable scraping is a message queue that decouples URL discovery from data fetching. Workers pull tasks from the queue, fetch pages through proxies, and push results to storage.
```
# Architecture overview:
#
# URL Source → [Message Queue] → Worker 1 → [Results Store]
#                    │         → Worker 2 →
#                    │         → Worker N →
#                    ↓
#          [Dead Letter Queue]
#           (failed requests)
```
Benefits of this pattern:
- Horizontal scaling: Add or remove workers without changing the system
- Fault tolerance: Failed tasks return to the queue for retry
- Rate control: Adjust worker count to control overall throughput
- Visibility: Queue depth shows backlog; completion rate shows health
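A quick way to reason about the rate-control point: required concurrency follows Little's law (in-flight requests = throughput × latency). A minimal sketch with illustrative numbers — the function name and figures are ours, not a library API:

```python
import math

def workers_needed(target_rps: float, avg_latency_s: float) -> int:
    """Little's law: concurrency = throughput x time-in-system.
    Assumes each worker holds one in-flight request at a time."""
    return math.ceil(target_rps * avg_latency_s)

# 100 requests/sec at ~1.2 s average latency through a proxy
print(workers_needed(100, 1.2))   # → 120
# 50 requests/sec at ~2 s latency
print(workers_needed(50, 2.0))    # → 100
```

The same arithmetic works in reverse: if you cap workers at 20 and latency averages 2 seconds, expect roughly 10 requests per second, no matter how fast your parser is.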
Python Implementation with Redis Queue
```python
import redis
import requests
import json
import time
from dataclasses import dataclass
from concurrent.futures import ThreadPoolExecutor

PROXY = "http://USERNAME:PASSWORD@gate.proxyhat.com:8080"
REDIS_URL = "redis://localhost:6379/0"


@dataclass
class ScrapeTask:
    url: str
    target: str
    priority: int = 0
    retries: int = 0
    max_retries: int = 3


class ScrapingQueue:
    def __init__(self, redis_url: str):
        self.redis = redis.from_url(redis_url)
        self.queue_key = "scrape:tasks"
        self.results_key = "scrape:results"
        self.dlq_key = "scrape:dead_letter"

    def enqueue(self, task: ScrapeTask):
        self.redis.lpush(self.queue_key, json.dumps(task.__dict__))

    def dequeue(self, timeout: int = 5) -> ScrapeTask | None:
        result = self.redis.brpop(self.queue_key, timeout=timeout)
        if result:
            data = json.loads(result[1])
            return ScrapeTask(**data)
        return None

    def store_result(self, url: str, data: dict):
        self.redis.hset(self.results_key, url, json.dumps(data))

    def send_to_dlq(self, task: ScrapeTask, error: str):
        task_data = {**task.__dict__, "error": error}
        self.redis.lpush(self.dlq_key, json.dumps(task_data))

    @property
    def pending_count(self) -> int:
        return self.redis.llen(self.queue_key)


class ScrapingWorker:
    def __init__(self, queue: ScrapingQueue, worker_id: int):
        self.queue = queue
        self.worker_id = worker_id
        self.session = requests.Session()
        self.session.proxies = {"http": PROXY, "https": PROXY}

    def process_task(self, task: ScrapeTask) -> bool:
        try:
            resp = self.session.get(task.url, timeout=30)
            if resp.status_code == 200:
                self.queue.store_result(task.url, {
                    "status": 200,
                    "body": resp.text[:10000],  # Truncate for storage
                    "target": task.target,
                })
                return True
            elif resp.status_code in (429, 503):
                # Retry with exponential backoff
                if task.retries < task.max_retries:
                    task.retries += 1
                    time.sleep(2 ** task.retries)
                    self.queue.enqueue(task)
                else:
                    self.queue.send_to_dlq(task, f"HTTP {resp.status_code}")
            else:
                self.queue.send_to_dlq(task, f"HTTP {resp.status_code}")
        except Exception as e:
            if task.retries < task.max_retries:
                task.retries += 1
                self.queue.enqueue(task)
            else:
                self.queue.send_to_dlq(task, str(e))
        return False

    def run(self):
        print(f"Worker {self.worker_id} started")
        while True:
            task = self.queue.dequeue(timeout=5)
            if task is None:
                break  # Queue drained: exit so the executor can shut down
            self.process_task(task)


# Enqueue tasks
queue = ScrapingQueue(REDIS_URL)
for i in range(10000):
    queue.enqueue(ScrapeTask(
        url=f"https://example.com/product/{i}",
        target="example.com",
    ))

# Launch 10 workers
with ThreadPoolExecutor(max_workers=10) as executor:
    workers = [ScrapingWorker(queue, i) for i in range(10)]
    for worker in workers:
        executor.submit(worker.run)
```
Node.js Implementation with Bull Queue
```javascript
const Queue = require('bull');
const HttpsProxyAgent = require('https-proxy-agent'); // v5 API; v7 exports { HttpsProxyAgent }
const fetch = require('node-fetch'); // v2: supports the `timeout` option

const agent = new HttpsProxyAgent('http://USERNAME:PASSWORD@gate.proxyhat.com:8080');

// Create queue with Redis backend
const scrapeQueue = new Queue('scraping', 'redis://localhost:6379');

// Define the worker processor (10 concurrent jobs per process)
scrapeQueue.process(10, async (job) => {
  const { url, target } = job.data;
  const res = await fetch(url, { agent, timeout: 30000 });
  if (res.ok) {
    const body = await res.text();
    return { url, status: res.status, body: body.slice(0, 10000) };
  }
  if (res.status === 429 || res.status === 503) {
    // Throwing makes Bull retry the job per its `attempts`/`backoff` options
    throw new Error(`Rate limited: HTTP ${res.status}`);
  }
  return { url, status: res.status, body: null };
});

// Enqueue tasks with retry options
async function enqueueTasks(urls, target) {
  for (const url of urls) {
    await scrapeQueue.add(
      { url, target },
      {
        attempts: 3,
        backoff: { type: 'exponential', delay: 2000 },
        removeOnComplete: 1000,
        removeOnFail: false,
      }
    );
  }
}

// Monitor progress
scrapeQueue.on('completed', (job, result) => {
  console.log(`Done: ${result.url} (${result.status})`);
});

scrapeQueue.on('failed', (job, err) => {
  console.error(`Failed: ${job.data.url} - ${err.message}`);
});
```
Pattern 2: Pipeline Architecture
For complex scraping workflows, use a pipeline where each stage handles a different concern:
```
# Pipeline stages:
#
# Stage 1: URL Discovery    → finds pages to scrape
# Stage 2: Content Fetching → downloads pages via proxies
# Stage 3: Data Extraction  → parses HTML, extracts data
# Stage 4: Data Validation  → checks quality, deduplicates
# Stage 5: Storage          → saves to database/warehouse
```
Go Implementation
```go
package main

import (
    "encoding/json"
    "fmt"
    "io"
    "net/http"
    "net/url"
    "sync"
    "time"
)

type ScrapeResult struct {
    URL    string `json:"url"`
    Status int    `json:"status"`
    Body   string `json:"body"`
}

func fetcher(urls <-chan string, results chan<- ScrapeResult, wg *sync.WaitGroup) {
    defer wg.Done()
    proxyURL, _ := url.Parse("http://USERNAME:PASSWORD@gate.proxyhat.com:8080")
    client := &http.Client{
        Transport: &http.Transport{Proxy: http.ProxyURL(proxyURL)},
        Timeout:   30 * time.Second,
    }
    for u := range urls {
        resp, err := client.Get(u)
        if err != nil {
            results <- ScrapeResult{URL: u, Status: 0}
            continue
        }
        body, _ := io.ReadAll(resp.Body)
        resp.Body.Close()
        results <- ScrapeResult{URL: u, Status: resp.StatusCode, Body: string(body)}
    }
}

func processor(results <-chan ScrapeResult, done chan<- bool) {
    for result := range results {
        if result.Status == 200 {
            // Extract and store data
            data, _ := json.Marshal(result)
            fmt.Printf("Processed: %s (%d bytes)\n", result.URL, len(data))
        }
    }
    done <- true
}

func main() {
    urls := make(chan string, 1000)
    results := make(chan ScrapeResult, 1000)
    done := make(chan bool)

    // Start 20 fetcher workers
    var wg sync.WaitGroup
    for i := 0; i < 20; i++ {
        wg.Add(1)
        go fetcher(urls, results, &wg)
    }

    // Start processor
    go processor(results, done)

    // Feed URLs
    go func() {
        for i := 0; i < 10000; i++ {
            urls <- fmt.Sprintf("https://example.com/product/%d", i)
        }
        close(urls)
    }()

    // Wait for all fetchers, then close results
    wg.Wait()
    close(results)
    <-done
}
```
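The Go example covers fetching and processing (Stages 2-3); the validation stage can live in its own pipeline step. A minimal Python sketch of Stage 4 — the required fields and `url` dedup key are illustrative, not a fixed schema:

```python
# Hypothetical schema for a scraped product record
REQUIRED_FIELDS = {"url": str, "title": str, "price": float}

def validate_record(record: dict) -> list[str]:
    """Return a list of validation errors; an empty list means the record passes."""
    errors = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in record or record[field] is None:
            errors.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            errors.append(f"wrong type for {field}: {type(record[field]).__name__}")
    return errors

def dedupe(records: list[dict], key: str = "url") -> list[dict]:
    """Drop records whose key was already seen, keeping the first occurrence."""
    seen, unique = set(), []
    for r in records:
        if r.get(key) not in seen:
            seen.add(r.get(key))
            unique.append(r)
    return unique
```

Records that fail validation belong in a re-parse queue rather than storage, so a parser regression never silently pollutes your dataset.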
Horizontal Scaling Strategies
Strategy 1: Multi-Machine Deployment
Distribute workers across multiple machines. The queue acts as the coordination point:
| Component | Deployment | Scaling |
|---|---|---|
| Queue (Redis/RabbitMQ) | Dedicated server or managed service | Vertical (more RAM) |
| Workers | Multiple machines or containers | Horizontal (add instances) |
| Results storage | Database or object store | Vertical + sharding |
| Monitoring | Centralized dashboard | Single instance |
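Because machines can join or leave the pool at any time, each worker should drain cleanly when its host is decommissioned: stop pulling new tasks, finish in-flight work, then exit. A minimal sketch — the `dequeue` and `process` callables are stand-ins for your actual queue client:

```python
import signal
import threading

shutdown = threading.Event()

def handle_sigterm(signum, frame):
    # Stop pulling new tasks; in-flight tasks finish, then the loop exits.
    shutdown.set()

signal.signal(signal.SIGTERM, handle_sigterm)

def worker_loop(dequeue, process):
    """Run until SIGTERM: unfinished queue tasks stay in Redis for other machines."""
    while not shutdown.is_set():
        task = dequeue(timeout=5)
        if task is not None:
            process(task)
```

Since unacknowledged tasks remain in the queue, a drained machine's backlog is simply picked up by workers elsewhere.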
Strategy 2: Container-Based Scaling
Use Docker and Kubernetes for elastic scaling. Each worker runs in a container that can be replicated:
```yaml
# docker-compose.yml for scraping workers
version: '3.8'

services:
  redis:
    image: redis:7-alpine
    ports:
      - "6379:6379"

  scraper-worker:
    build: .
    environment:
      - PROXY_URL=http://USERNAME:PASSWORD@gate.proxyhat.com:8080
      - REDIS_URL=redis://redis:6379/0
      - WORKER_CONCURRENCY=10
    deploy:
      replicas: 5  # 5 containers × 10 concurrency = 50 parallel requests
      resources:
        limits:
          memory: 512M
          cpus: '0.5'
    depends_on:
      - redis
```
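On Kubernetes, the rough equivalent of `deploy.replicas` is a Deployment, whose replica count can be changed live or driven by a HorizontalPodAutoscaler. An illustrative fragment — the image name, labels, and secret are placeholders:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: scraper-worker
spec:
  replicas: 5     # scale with: kubectl scale deploy/scraper-worker --replicas=20
  selector:
    matchLabels:
      app: scraper-worker
  template:
    metadata:
      labels:
        app: scraper-worker
    spec:
      containers:
        - name: worker
          image: registry.example.com/scraper-worker:latest  # placeholder image
          envFrom:
            - secretRef:
                name: scraper-secrets  # holds PROXY_URL, REDIS_URL
          resources:
            limits:
              memory: 512Mi
              cpu: 500m
```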
Proxy Management at Scale
At scale, proxy management becomes a critical system component. Key considerations:
Connection Pooling
Reuse connections to the proxy gateway instead of creating new ones per request. This reduces latency and connection overhead:
```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


def create_optimized_session() -> requests.Session:
    """Create a session with connection pooling and retry logic."""
    session = requests.Session()

    # Connection pool: keep 20 connections, allow up to 50
    adapter = HTTPAdapter(
        pool_connections=20,
        pool_maxsize=50,
        max_retries=Retry(
            total=3,
            backoff_factor=1,
            status_forcelist=[500, 502, 503],
        ),
    )
    session.mount("http://", adapter)
    session.mount("https://", adapter)

    session.proxies = {
        "http": "http://USERNAME:PASSWORD@gate.proxyhat.com:8080",
        "https": "http://USERNAME:PASSWORD@gate.proxyhat.com:8080",
    }
    return session


# Reuse one session across many requests
session = create_optimized_session()
for url in urls:
    resp = session.get(url, timeout=30)
```
Health Monitoring
Monitor your proxy performance in real time to detect issues early:
```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class ProxyMetrics:
    """Track proxy health metrics for monitoring."""
    requests_total: int = 0
    requests_success: int = 0
    requests_blocked: int = 0
    requests_timeout: int = 0
    latency_samples: list = field(default_factory=list)
    status_codes: dict = field(default_factory=lambda: defaultdict(int))

    def record_request(self, status_code: int, latency_ms: float):
        self.requests_total += 1
        self.status_codes[status_code] += 1
        self.latency_samples.append(latency_ms)
        if status_code == 200:
            self.requests_success += 1
        elif status_code in (403, 429):
            self.requests_blocked += 1
        # Keep only the last 1000 samples
        if len(self.latency_samples) > 1000:
            self.latency_samples = self.latency_samples[-1000:]

    @property
    def success_rate(self) -> float:
        return self.requests_success / self.requests_total if self.requests_total else 0

    @property
    def avg_latency(self) -> float:
        return sum(self.latency_samples) / len(self.latency_samples) if self.latency_samples else 0

    def report(self) -> str:
        return (
            f"Total: {self.requests_total}, "
            f"Success: {self.success_rate:.1%}, "
            f"Blocked: {self.requests_blocked}, "
            f"Avg Latency: {self.avg_latency:.0f}ms"
        )
```
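Averages hide tail latency; for alerting, percentiles are usually more actionable. A small helper that could sit alongside the metrics class above (the function is our addition, using the nearest-rank method):

```python
import math

def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile; pct in (0, 100]. Returns 0.0 for no samples."""
    if not samples:
        return 0.0
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

# One slow outlier barely moves the median but dominates p95
latencies = [120, 130, 125, 900, 140, 135, 128, 132, 131, 127]
print(percentile(latencies, 50))  # → 130
print(percentile(latencies, 95))  # → 900
```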
Data Storage at Scale
| Storage Type | Best For | Scale |
|---|---|---|
| PostgreSQL | Structured product/pricing data | Millions of rows |
| MongoDB | Semi-structured/variable schemas | Billions of documents |
| S3/Object storage | Raw HTML archives | Petabytes |
| Elasticsearch | Full-text search over scraped data | Billions of documents |
| ClickHouse | Analytics over large datasets | Trillions of rows |
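Whichever store you choose, writing results one row at a time becomes the bottleneck long before the database does; buffer and flush in batches instead. A storage-agnostic sketch — the `flush_fn` callable stands in for your actual bulk insert (e.g. psycopg2's `execute_values` or a MongoDB `insert_many`):

```python
class BatchWriter:
    """Buffer records and flush in batches to cut per-row round trips."""

    def __init__(self, flush_fn, batch_size: int = 500):
        self.flush_fn = flush_fn
        self.batch_size = batch_size
        self.buffer = []

    def add(self, record: dict):
        self.buffer.append(record)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        """Write any buffered records; call once more after the last add()."""
        if self.buffer:
            self.flush_fn(self.buffer)
            self.buffer = []

# Demo with a fake sink that records batch sizes
batches = []
writer = BatchWriter(lambda rows: batches.append(len(rows)), batch_size=3)
for i in range(7):
    writer.add({"id": i})
writer.flush()  # flush the partial last batch
print(batches)  # → [3, 3, 1]
```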
Scaling Checklist
- Decouple URL discovery from fetching. Use a message queue between stages.
- Implement proper retry logic. Exponential backoff with dead letter queues for persistent failures.
- Monitor everything. Queue depth, success rates, latency, error rates per target domain.
- Use connection pooling. Reuse proxy connections instead of creating new ones per request.
- Plan for failure. Workers crash, proxies get blocked, targets change their structure. Build resilience into every layer.
- Test at scale before launch. A system that works at 100 RPM may fail at 10,000 RPM due to memory, connection limits, or queue bottlenecks.
For proxy rotation strategies that complement your scaling architecture, read Proxy Rotation Strategies for Large-Scale Scraping. To handle rate limits as you scale, see Scraping Rate Limits Explained.
Use the Python SDK, Node SDK, or Go SDK for production proxy integration, and explore ProxyHat plans for high-volume scraping.
Frequently Asked Questions
What queue system is best for scraping at scale?
Redis with Bull (Node.js) or RQ (Python) works well up to millions of tasks per day. For larger scale, Apache Kafka or RabbitMQ provides better durability and throughput. Choose based on your existing infrastructure and team expertise.
How many concurrent workers should I run?
Start with 10-20 workers and scale based on your proxy capacity and target site tolerance. Monitor success rates — if they drop below 90%, reduce concurrency before adding more workers. Each worker through ProxyHat gets automatic IP rotation.
Should I use async or threading for workers?
For I/O-bound scraping (most cases), async (Python asyncio, Node.js) provides better resource efficiency than threading. Use threading or multiprocessing only when you need CPU-heavy parsing alongside fetching. Go goroutines excel at both patterns.
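As a sketch of the async approach, a semaphore bounds in-flight requests. The `fetch` coroutine here is a stand-in — with aiohttp it would be a real HTTP call through the proxy:

```python
import asyncio

async def fetch(url: str) -> str:
    # Stand-in for an aiohttp request through the proxy gateway
    await asyncio.sleep(0.01)
    return f"body of {url}"

async def bounded_scrape(urls, concurrency: int = 10):
    sem = asyncio.Semaphore(concurrency)  # cap in-flight requests

    async def worker(url):
        async with sem:
            return await fetch(url)

    # gather preserves input order in its results
    return await asyncio.gather(*(worker(u) for u in urls))

results = asyncio.run(bounded_scrape([f"https://example.com/p/{i}" for i in range(100)]))
print(len(results))  # → 100
```

One event loop replaces a pool of threads, so memory stays flat even at thousands of concurrent requests.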
How do I handle target site structure changes?
Implement data validation in your pipeline. When parsed data fails validation (missing fields, wrong types), alert your team and queue affected URLs for re-processing with updated parsers. Version your parsers so you can roll back if needed.
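One way to make "version your parsers" concrete is to stamp every stored record with the parser version that produced it; after a parser update, any row with an older stamp can be re-queued. The version string and field name here are illustrative:

```python
PARSER_VERSION = "2024-06-01"  # bump whenever extraction logic changes

def stamp(record: dict) -> dict:
    """Tag a parsed record with the parser version that produced it."""
    return {**record, "parser_version": PARSER_VERSION}

def needs_reprocessing(stored: dict) -> bool:
    """True for rows written by an older parser — re-queue their URLs."""
    return stored.get("parser_version") != PARSER_VERSION

print(needs_reprocessing(stamp({"title": "Widget"})))                        # → False
print(needs_reprocessing({"title": "Old", "parser_version": "2024-01-01"}))  # → True
```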