Scrapy Proxy Middleware: A Code-First Guide to Residential Proxy Rotation

Build production-grade Scrapy proxy rotation with custom downloader middleware, retry logic, headless browser integration, and per-IP monitoring. Full code examples included.

Why Scrapy Proxy Middleware Matters

If you're running Scrapy in production, you already know the wall: rate limits, CAPTCHAs, and outright IP bans. A single datacenter IP scraping at scale is a recipe for 403s. The fix isn't just more proxies — it's intelligent proxy rotation baked into Scrapy's middleware stack, where every request can be routed through the right IP at the right time.

This guide walks through building Scrapy proxy middleware from first principles: how Scrapy's downloader pipeline works, where proxies plug in, and how to implement rotation, failure handling, and monitoring that actually holds up under production load.

Scrapy's Downloader Middleware Model

Scrapy processes every request through a chain of downloader middlewares before it hits the wire, and processes every response through the same chain in reverse. The key lifecycle hooks are:

  • process_request — mutate or replace the request before it's sent. This is where you assign a proxy.
  • process_response — inspect or replace the response after it arrives. This is where you detect bans.
  • process_exception — handle connection-level failures (timeouts, DNS errors). This is where you retry with a different proxy.
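
As a skeleton, those three hooks look like this (the proxy URL is a placeholder, not a real endpoint):

```python
class MinimalProxyMiddleware:
    """Skeleton showing where each downloader middleware hook fires."""

    def process_request(self, request, spider):
        # Runs before the request hits the wire; returning None
        # lets the rest of the chain (and the download) proceed
        request.meta.setdefault("proxy", "http://127.0.0.1:8080")
        return None

    def process_response(self, request, response, spider):
        # Runs after the response arrives; return the response to pass
        # it downstream, or a new Request to reschedule instead
        return response

    def process_exception(self, request, exception, spider):
        # Runs on connection-level failures; return a Request to retry,
        # or None to let other middlewares handle the exception
        return None
```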

The middleware order in DOWNLOADER_MIDDLEWARES matters. Scrapy's built-in HttpProxyMiddleware (priority 750 by default) reads request.meta['proxy'], and when the proxy URL carries credentials it sets the Proxy-Authorization header that the download handler needs. If you assign request.meta['proxy'] in a middleware that runs before priority 750, the built-in middleware will wire up authentication for you. Assign it after, and you're on your own.

For custom rotation, you typically disable the built-in HttpProxyMiddleware and replace it with your own at a similar priority, so you control proxy assignment, retries, and ban handling in one place. One caveat: with the built-in disabled, credentials embedded in the proxy URL are never converted into a Proxy-Authorization header, so a custom middleware has to set that header itself.

Building a Custom Proxy Rotation Middleware

Let's build a full working middleware class that rotates residential proxies from ProxyHat's pool on every request, supports sticky sessions, and integrates geo-targeting.

The Middleware Class

import base64
import random
import string
from scrapy import signals


class ProxyHatRotationMiddleware:
    """Rotate residential proxies from ProxyHat on each request."""

    def __init__(self, username, password, country=None, max_retries=3):
        self.username = username
        self.password = password
        self.country = country
        self.max_retries = max_retries
        self.gateway = "gate.proxyhat.com"
        self.http_port = 8080
        self.socks5_port = 1080

    @classmethod
    def from_crawler(cls, crawler):
        s = crawler.settings
        if not s.get("PROXYHAT_USERNAME") or not s.get("PROXYHAT_PASSWORD"):
            # Fail fast at startup instead of crawling unproxied; an exception
            # raised here aborts the crawl before any request is sent
            raise ValueError("PROXYHAT_USERNAME and PROXYHAT_PASSWORD must be set")
        mw = cls(
            username=s.get("PROXYHAT_USERNAME"),
            password=s.get("PROXYHAT_PASSWORD"),
            country=s.get("PROXYHAT_COUNTRY"),
            max_retries=s.getint("PROXYHAT_MAX_RETRIES", 3),
        )
        crawler.signals.connect(mw.spider_opened, signal=signals.spider_opened)
        return mw

    def spider_opened(self, spider):
        spider.logger.info(f"ProxyHat middleware active, country={self.country}")

    def _build_proxy_user(self, session_id=None, country=None, city=None):
        """Construct the ProxyHat username with optional geo and session flags."""
        user_parts = [self.username]
        target_country = country or self.country
        if target_country:
            user_parts.append(f"country-{target_country}")
        if city:
            user_parts.append(f"city-{city}")
        if session_id:
            user_parts.append(f"session-{session_id}")
        return "-".join(user_parts)

    def process_request(self, request, spider):
        """Assign a rotating proxy to every outgoing request."""
        # Only proxy plain HTTP(S) requests
        if not request.url.startswith(("http://", "https://")):
            return None

        # Use a sticky session if the meta flag is set
        session_id = request.meta.get("proxyhat_session")
        country = request.meta.get("proxyhat_country")
        city = request.meta.get("proxyhat_city")

        if not session_id:
            # Random session ID = rotating IP per request
            session_id = "".join(random.choices(string.ascii_lowercase + string.digits, k=8))

        user_str = self._build_proxy_user(session_id=session_id, country=country, city=city)
        request.meta["proxy"] = f"http://{self.gateway}:{self.http_port}"
        # With the built-in HttpProxyMiddleware disabled, credentials embedded
        # in the proxy URL are never turned into a Proxy-Authorization header,
        # so set it explicitly
        token = base64.b64encode(f"{user_str}:{self.password}".encode()).decode()
        request.headers["Proxy-Authorization"] = f"Basic {token}"
        request.meta["proxyhat_session"] = session_id
        request.meta["proxyhat_retries"] = request.meta.get("proxyhat_retries", 0)
        spider.logger.debug(f"Proxy assigned: session={session_id} country={country}")

    def process_response(self, request, response, spider):
        """Detect bans and retry with a fresh proxy."""
        retry_count = request.meta.get("proxyhat_retries", 0)

        if response.status in (403, 429, 503):
            if retry_count < self.max_retries:
                spider.logger.warning(
                    f"Ban detected ({response.status}), rotating proxy "
                    f"(retry {retry_count + 1}/{self.max_retries})"
                )
                # Force a new session on retry
                request.meta["proxyhat_session"] = None
                request.meta["proxyhat_retries"] = retry_count + 1
                request.dont_filter = True
                return request
            else:
                spider.logger.error(
                    f"Max retries ({self.max_retries}) reached for {request.url}"
                )
        return response

    def process_exception(self, request, exception, spider):
        """Handle connection failures by rotating proxy."""
        retry_count = request.meta.get("proxyhat_retries", 0)
        if retry_count < self.max_retries:
            spider.logger.warning(
                f"Proxy connection failed ({exception}), rotating "
                f"(retry {retry_count + 1}/{self.max_retries})"
            )
            request.meta["proxyhat_session"] = None
            request.meta["proxyhat_retries"] = retry_count + 1
            request.dont_filter = True
            return request
        return None
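
To sanity-check the username-flag scheme in isolation, the same composition logic can be written as a standalone function (flag names follow the pattern assumed above):

```python
def build_proxy_user(username, country=None, city=None, session_id=None):
    """Compose a ProxyHat-style username with optional geo/session flags."""
    parts = [username]
    if country:
        parts.append(f"country-{country}")
    if city:
        parts.append(f"city-{city}")
    if session_id:
        parts.append(f"session-{session_id}")
    return "-".join(parts)


print(build_proxy_user("alice", country="DE", city="berlin", session_id="a1b2c3"))
# alice-country-DE-city-berlin-session-a1b2c3
```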

Settings Configuration

# settings.py

# Disable the built-in proxy middleware
DOWNLOADER_MIDDLEWARES = {
    "scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware": None,
    "myproject.middlewares.ProxyHatRotationMiddleware": 560,
}

# ProxyHat credentials
PROXYHAT_USERNAME = "your_username"
PROXYHAT_PASSWORD = "your_password"
PROXYHAT_COUNTRY = "US"  # Default country; override per-request via meta
PROXYHAT_MAX_RETRIES = 3

# Concurrency — don't overwhelm a single exit IP
CONCURRENT_REQUESTS = 32
CONCURRENT_REQUESTS_PER_DOMAIN = 8
DOWNLOAD_TIMEOUT = 30
RETRY_ENABLED = False  # We handle retries inside our middleware
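
Hard-coding credentials in settings.py is convenient for examples but a liability in version control. A common pattern is to read them from the environment instead (same setting names as above):

```python
# settings.py: pull credentials from the environment instead of hard-coding
import os

PROXYHAT_USERNAME = os.environ.get("PROXYHAT_USERNAME", "")
PROXYHAT_PASSWORD = os.environ.get("PROXYHAT_PASSWORD", "")
PROXYHAT_COUNTRY = os.environ.get("PROXYHAT_COUNTRY", "US")
```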

Using Sticky Sessions and Geo-Targeting in Spiders

import scrapy


class EcommerceSpider(scrapy.Spider):
    name = "ecommerce"

    def start_requests(self):
        urls = [
            "https://example.com/product/1",
            "https://example.com/product/2",
        ]
        for url in urls:
            yield scrapy.Request(
                url,
                callback=self.parse,
                # Rotating IP (new session per request)
                meta={
                    "proxyhat_country": "DE",
                    "proxyhat_city": "berlin",
                },
            )

    def parse(self, response):
        # For checkout flows that need session persistence
        if response.css(".checkout-btn"):
            yield scrapy.Request(
                response.urljoin("/checkout"),
                callback=self.parse_checkout,
                meta={
                    # Reuse the same session/IP for the entire checkout
                    "proxyhat_session": response.meta["proxyhat_session"],
                    "proxyhat_country": "DE",
                },
            )
        else:
            for item in response.css(".product"):
                yield {"name": item.css("::text").get()}

Community Middleware vs. Rolling Your Own

The scrapy-rotating-proxies package is a popular open-source option. It maintains a proxy pool, tracks which proxies are banned, and rotates automatically. It's a solid starting point, but it has real limitations for production residential proxy workflows.

| Feature | scrapy-rotating-proxies | Custom middleware |
| --- | --- | --- |
| Proxy source | Static list in settings | Dynamic API or gateway URL |
| Residential proxy support | No built-in concept | Native (gateway-based rotation) |
| Geo-targeting | Not supported | Per-request country/city |
| Sticky sessions | Limited (session key) | Full control via username flags |
| Ban detection | Status code + regex | Custom logic per site |
| Pool health tracking | Basic (dead/alive lists) | Statsd / Prometheus integration |
| Proxy pool refresh | Manual / cron | On-demand (residential pool) |
| Maintenance burden | Low (community-maintained) | You own it |

When to use scrapy-rotating-proxies: You have a flat list of datacenter proxies and need basic rotation with ban tracking. Quick prototype, low complexity.

When to roll your own: You're using residential proxies from a gateway like ProxyHat, need per-request geo-targeting, sticky sessions, or want to integrate monitoring with your observability stack. The middleware above took ~100 lines and gives you full control.

Retry Middleware with Proxy Rotation

Scrapy's built-in RetryMiddleware retries failed requests, but it doesn't rotate proxies — it just re-sends the same request through the same IP. That's useless when the failure is a ban on that IP.

Our custom middleware already handles this by clearing proxyhat_session on retry, which forces a new IP. Keeping Scrapy's RetryMiddleware enabled alongside it is risky: a retried request carries its old proxyhat_session in meta, so it is re-sent through the same (likely banned) IP. The simplest safe setup disables the built-in retries entirely and lets the proxy middleware own them:

# settings.py: one middleware owns all retries

DOWNLOADER_MIDDLEWARES = {
    "scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware": None,
    "scrapy.downloadermiddlewares.retry.RetryMiddleware": None,  # Disable built-in retries
    "myproject.middlewares.ProxyHatRotationMiddleware": 560,
}

PROXYHAT_MAX_RETRIES = 3  # Retry budget used by our middleware

With Scrapy's RetryMiddleware disabled, every retry flows through our proxy middleware and gets a fresh IP. The process_exception and process_response methods in the middleware above handle this already: they clear the session, bump the retry counter, and return the modified request, which Scrapy reschedules.

JavaScript-Heavy Sites: scrapy-splash and scrapy-playwright

Many modern sites require a real browser to render content. Scrapy has two popular integrations for this: scrapy-splash (for Splash, a lightweight headless browser service) and scrapy-playwright (for Playwright, a full Chromium/Firefox/WebKit driver). Both support proxies, but the configuration differs.

scrapy-playwright with ProxyHat

scrapy-playwright drives a real browser, so proxies are configured at the browser context level, not via Scrapy's request.meta['proxy'].

# settings.py
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"

PLAYWRIGHT_BROWSER_TYPE = "chromium"
PLAYWRIGHT_LAUNCH_OPTIONS = {"headless": True}

# ProxyHat proxy for Playwright contexts
PROXYHAT_USERNAME = "your_username"
PROXYHAT_PASSWORD = "your_password"

Then in your spider, create a new browser context per request with the proxy configured:

import random
import string
import scrapy


class JSSpider(scrapy.Spider):
    name = "js_heavy"

    def start_requests(self):
        # Random session ID = a fresh exit IP for this browser context
        session_id = "".join(random.choices(string.ascii_lowercase + string.digits, k=8))
        yield scrapy.Request(
            "https://spa-example.com/products",
            callback=self.parse,
            meta={
                "playwright": True,
                "playwright_include_page": True,
                # Name a new context; scrapy-playwright creates it with these kwargs
                "playwright_context": f"proxyhat-{session_id}",
                "playwright_context_kwargs": {
                    "proxy": {
                        "server": "http://gate.proxyhat.com:8080",
                        "username": f"your_username-country-US-session-{session_id}",
                        "password": "your_password",
                    },
                },
            },
        )

    async def parse(self, response):
        page = response.meta["playwright_page"]
        # Wait for dynamic content
        await page.wait_for_selector(".product-list")
        content = await page.content()
        await page.close()
        # Parse the rendered HTML
        selector = scrapy.Selector(text=content)
        for product in selector.css(".product"):
            yield {"name": product.css("::text").get()}

scrapy-splash with ProxyHat

For Splash, the proxy is passed as a Splash request argument. Splash runs as a separate Docker service, so the proxy configuration happens in the Splash API call, not in Scrapy's download layer:

import scrapy
from scrapy_splash import SplashRequest


class SplashSpider(scrapy.Spider):
    name = "splash_spider"

    def start_requests(self):
        yield SplashRequest(
            "https://spa-example.com/products",
            callback=self.parse,
            args={
                "wait": 3,
                "proxy": "http://your_username-country-US:your_password@gate.proxyhat.com:8080",
            },
            endpoint="render.html",
        )

    def parse(self, response):
        for product in response.css(".product"):
            yield {"name": product.css("::text").get()}

Deployment: Scrapyd, Docker, or Managed

Docker + Cron (Simple and Reliable)

For most teams, a Docker container with a cron schedule is the simplest production setup:

# Dockerfile
FROM python:3.11-slim

WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .

# Entrypoint runs the spider on schedule
CMD ["scrapy", "crawl", "ecommerce"]

# docker-compose.yml
version: "3.8"
services:
  scraper:
    build: .
    environment:
      - PROXYHAT_USERNAME=${PROXYHAT_USERNAME}
      - PROXYHAT_PASSWORD=${PROXYHAT_PASSWORD}
    restart: unless-stopped
    # For scheduled runs, use a cron wrapper or an external scheduler
    # such as Ofelia, which reads job-exec labels from the target container
    labels:
      - "ofelia.enabled=true"
      - "ofelia.job-exec.crawl.schedule=@every 6h"
      - "ofelia.job-exec.crawl.command=scrapy crawl ecommerce"

Scrapyd

Scrapyd is Scrapy's official deployment daemon. It exposes an HTTP API to schedule, cancel, and list spider runs. It's great when you have multiple spiders and need programmatic control, but it doesn't handle recurring schedules natively: pair it with ScrapydWeb or a custom scheduler.
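
Scheduling a run is a plain HTTP POST to Scrapyd's schedule.json endpoint (default port 6800). As a small sketch, with placeholder project and spider names, here is a helper that builds the call without sending it:

```python
import urllib.parse


def scrapyd_schedule_call(host, project, spider, **spider_args):
    """Build the URL and form body for Scrapyd's schedule.json endpoint."""
    url = f"http://{host}:6800/schedule.json"
    body = urllib.parse.urlencode({"project": project, "spider": spider, **spider_args})
    return url, body


url, body = scrapyd_schedule_call("localhost", "myproject", "ecommerce")
# POST the body to the URL, e.g. requests.post(url, data=body)
```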

ScrapeOps / Zyte Scrapy Cloud

Managed platforms handle deployment, scheduling, proxy rotation, and monitoring out of the box. The trade-off is cost and less control over the proxy layer. If you're already using ProxyHat for residential proxies, you can still configure these platforms to route through ProxyHat's gateway.

Monitoring: Per-IP Success Rates and Ban Detection

Running proxies blind is how you waste budget and miss data gaps. You need to track success rates, detect bans early, and correlate failures to specific IPs or sessions.

Statsd Integration

Scrapy's stats collection is extensible. Add a downloader middleware that records metrics on every response:

from scrapy import signals


class ProxyStatsMiddleware:
    """Track per-IP success rates and ban patterns."""

    def __init__(self, stats):
        self.stats = stats

    @classmethod
    def from_crawler(cls, crawler):
        mw = cls(crawler.stats)
        crawler.signals.connect(mw.spider_closed, signal=signals.spider_closed)
        return mw

    def process_response(self, request, response, spider):
        session_id = request.meta.get("proxyhat_session", "unknown")
        country = request.meta.get("proxyhat_country", "unknown")

        # Increment total requests
        self.stats.inc_value(f"proxy/total/{country}")

        if response.status == 200:
            self.stats.inc_value(f"proxy/success/{country}")
        elif response.status in (403, 429, 503):
            self.stats.inc_value(f"proxy/banned/{country}")
            self.stats.inc_value(f"proxy/banned_status/{response.status}/{country}")
            # Record the session so bans can be traced to a specific exit IP
            spider.logger.debug(f"Ban: session={session_id} status={response.status}")
        else:
            self.stats.inc_value(f"proxy/error/{country}")

        self.stats.set_value(
            f"proxy/success_rate/{country}",
            self.stats.get_value(f"proxy/success/{country}", 0)
            / max(self.stats.get_value(f"proxy/total/{country}", 1), 1),
        )

        return response

    def spider_closed(self, spider, reason):
        spider.logger.info(f"Proxy stats: {dict(self.stats.get_stats())}")

Ban Detection Heuristics

Not all bans return 403. Some return 200 with a CAPTCHA page. Others redirect to a challenge page. Build detection into your middleware:

  • Status codes: 403, 429, 503 are obvious bans.
  • Response size: If the response is significantly smaller than expected (e.g., < 5KB for a product page that's normally 50KB), it's likely a CAPTCHA or block page.
  • Content patterns: Check for known CAPTCHA markers — recaptcha, cf-challenge, access-denied in the response body.
  • Redirect chains: If a product URL redirects to a login page, that's a soft ban.

Integrate these checks into the process_response method of your middleware and log them as structured events. Feed the data into Grafana, Datadog, or even a simple SQLite dashboard to spot degradation before it kills your scrape rate.
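
As a starting point, the heuristics above can be collapsed into a single predicate; the thresholds and marker strings are illustrative and should be tuned per target site:

```python
def looks_banned(status, body, final_url, expected_min_bytes=5000):
    """Combine status, content, size, and redirect heuristics into one check."""
    # Hard bans: unambiguous status codes
    if status in (403, 429, 503):
        return True
    # Soft bans: known challenge markers in an otherwise 200 response
    lowered = body.lower()
    if any(m in lowered for m in ("recaptcha", "cf-challenge", "access-denied")):
        return True
    # Suspiciously small response for a normally heavy page
    if len(body) < expected_min_bytes:
        return True
    # Redirected to a login page
    if "/login" in final_url:
        return True
    return False
```

Call it from process_response with response.status, response.text, and response.url, and log a structured event whenever it fires.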

Scaling Patterns

When you move from a single spider to a fleet, you need to think about concurrency differently:

  • Per-IP rate limiting: Don't send 32 concurrent requests through one residential IP. Use ProxyHat's rotating mode (new session per request) and set CONCURRENT_REQUESTS based on your target site's tolerance.
  • Headless fleet: For JS-heavy targets, run multiple Playwright instances behind a task queue (Celery, Dramatiq). Each worker gets its own proxy session.
  • Containerization: One Scrapy instance per Docker container, each with its own proxy credentials. Use Kubernetes or Docker Compose to scale horizontally.
  • Backoff: When ban rates spike above a threshold, reduce concurrency automatically. Don't keep hammering a site that's actively blocking you.
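
The backoff idea can be sketched as a small controller that halves concurrency when the ban rate over a sliding window crosses a threshold; the window size and threshold here are arbitrary starting values:

```python
class AdaptiveBackoff:
    """Halve concurrency when the recent ban rate crosses a threshold."""

    def __init__(self, start_concurrency=16, floor=2, ban_threshold=0.2, window=50):
        self.concurrency = start_concurrency
        self.floor = floor
        self.ban_threshold = ban_threshold
        self.window = window
        self.results = []  # rolling record of outcomes: True = banned

    def record(self, banned):
        """Record one response outcome and return the current concurrency."""
        self.results.append(banned)
        self.results = self.results[-self.window:]
        if len(self.results) == self.window:
            ban_rate = sum(self.results) / self.window
            if ban_rate > self.ban_threshold and self.concurrency > self.floor:
                # Back off instead of hammering a site that is blocking us
                self.concurrency = max(self.floor, self.concurrency // 2)
        return self.concurrency
```

Wire the returned value into your crawler's concurrency controls (for example via Scrapy's AutoThrottle extension or a custom scheduler) rather than changing settings mid-crawl by hand.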

Key Takeaways

  • Disable Scrapy's built-in HttpProxyMiddleware and replace it with a custom middleware that controls proxy assignment, retries, and ban detection in one place.
  • Use gateway-style residential proxies (like ProxyHat) instead of managing static IP lists. The rotation happens in the username, not in your code.
  • Rotate on retry: When a request fails due to a ban, clear the session ID so the retry uses a fresh IP. This is the single most impactful pattern for success rate.
  • For JS-heavy sites, configure proxies at the browser context level (Playwright) or as Splash arguments — not via request.meta['proxy'].
  • Monitor per-country success rates and ban patterns. Without visibility, you're flying blind on proxy spend.
  • Start with Docker + cron, graduate to Scrapyd or managed platforms only when you need programmatic scheduling.

Conclusion

Scrapy proxy middleware isn't a bolt-on — it's the backbone of any production scraping pipeline. The middleware pattern gives you a clean, testable place to handle rotation, retries, geo-targeting, and monitoring. Start with the ProxyHatRotationMiddleware above, adapt the ban detection heuristics to your target sites, and instrument from day one.

Ready to put it into practice? Get started with ProxyHat residential proxies and configure your Scrapy project with the gateway settings above. For more on web scraping patterns, check out our web scraping use case guide and SERP tracking walkthrough.
