Why Scrapy Proxy Middleware Matters
If you're running Scrapy in production, you already know the wall: rate limits, CAPTCHAs, and outright IP bans. A single datacenter IP scraping at scale is a recipe for 403s. The fix isn't just more proxies — it's intelligent proxy rotation baked into Scrapy's middleware stack, where every request can be routed through the right IP at the right time.
This guide walks through building Scrapy proxy middleware from first principles: how Scrapy's downloader pipeline works, where proxies plug in, and how to implement rotation, failure handling, and monitoring that actually holds up under production load.
Scrapy's Downloader Middleware Model
Scrapy processes every request through a chain of downloader middlewares before it hits the wire, and processes every response through the same chain in reverse. The key lifecycle hooks are:
- process_request — mutate or replace the request before it's sent. This is where you assign a proxy.
- process_response — inspect or replace the response after it arrives. This is where you detect bans.
- process_exception — handle connection-level failures (timeouts, DNS errors). This is where you retry with a different proxy.
The middleware order in DOWNLOADER_MIDDLEWARES matters. Scrapy's built-in HttpProxyMiddleware (priority 750 in DOWNLOADER_MIDDLEWARES_BASE) reads proxy settings from environment variables and from request.meta['proxy'], and sets the Proxy-Authorization header when the proxy URL carries credentials. If you assign request.meta['proxy'] in a middleware that runs before priority 750, the built-in middleware will wire up authentication for you. Assign it afterward, and you're on your own.
For custom rotation, you typically disable the built-in HttpProxyMiddleware and replace it with your own at a similar or higher priority, so you control both proxy assignment and the actual connection routing.
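Before the full implementation, the three hooks are easiest to see in isolation. This toy sketch uses a hypothetical FakeRequest stand-in so the flow can be run outside Scrapy; in a real project the methods receive scrapy.Request objects and a spider:

```python
class MinimalProxyMiddleware:
    """Bare-bones shape of a downloader middleware's proxy hooks."""

    def process_request(self, request, spider):
        # Runs on the way out: assign a proxy before the request is sent.
        request.meta.setdefault("proxy", "http://user:pass@proxy.example.com:8080")
        return None  # None means "continue down the middleware chain"

    def process_response(self, request, response, spider):
        # Runs on the way back: returning a Request here reschedules it.
        return response

    def process_exception(self, request, exception, spider):
        # Runs on connection-level failures: return a Request to retry.
        return None


# Toy stand-in so the skeleton can be exercised without Scrapy installed.
class FakeRequest:
    def __init__(self, url):
        self.url = url
        self.meta = {}


req = FakeRequest("https://example.com")
MinimalProxyMiddleware().process_request(req, spider=None)
print(req.meta["proxy"])  # http://user:pass@proxy.example.com:8080
```

The return-value contract is the whole API: process_request returning None lets the chain continue, while returning a Response or Request short-circuits it.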
Building a Custom Proxy Rotation Middleware
Let's build a full working middleware class that rotates residential proxies from ProxyHat's pool on every request, supports sticky sessions, and integrates geo-targeting.
The Middleware Class
```python
import random
import string

from scrapy import signals
from scrapy.exceptions import CloseSpider


class ProxyHatRotationMiddleware:
    """Rotate residential proxies from ProxyHat on each request."""

    def __init__(self, username, password, country=None, max_retries=3):
        self.username = username
        self.password = password
        self.country = country
        self.max_retries = max_retries
        self.gateway = "gate.proxyhat.com"
        self.http_port = 8080
        self.socks5_port = 1080

    @classmethod
    def from_crawler(cls, crawler):
        s = crawler.settings
        mw = cls(
            username=s.get("PROXYHAT_USERNAME"),
            password=s.get("PROXYHAT_PASSWORD"),
            country=s.get("PROXYHAT_COUNTRY"),
            max_retries=s.getint("PROXYHAT_MAX_RETRIES", 3),
        )
        crawler.signals.connect(mw.spider_opened, signal=signals.spider_opened)
        return mw

    def spider_opened(self, spider):
        if not self.username or not self.password:
            raise CloseSpider("PROXYHAT_USERNAME and PROXYHAT_PASSWORD must be set")
        spider.logger.info(f"ProxyHat middleware active, country={self.country}")

    def _build_proxy_url(self, session_id=None, country=None, city=None):
        """Construct the ProxyHat proxy URL with optional geo and session params."""
        user_parts = [self.username]
        target_country = country or self.country
        if target_country:
            user_parts.append(f"country-{target_country}")
        if city:
            user_parts.append(f"city-{city}")
        if session_id:
            user_parts.append(f"session-{session_id}")
        user_str = "-".join(user_parts)
        return f"http://{user_str}:{self.password}@{self.gateway}:{self.http_port}"

    def process_request(self, request, spider):
        """Assign a rotating proxy to every outgoing request."""
        # Skip proxying for non-HTTP(S) schemes (data:, ftp:, s3:, ...)
        if not request.url.startswith(("http://", "https://")):
            return None
        # Use a sticky session if the meta flag is set
        session_id = request.meta.get("proxyhat_session")
        country = request.meta.get("proxyhat_country")
        city = request.meta.get("proxyhat_city")
        if not session_id:
            # Random session = rotating IP per request
            session_id = "".join(
                random.choices(string.ascii_lowercase + string.digits, k=8)
            )
        proxy_url = self._build_proxy_url(
            session_id=session_id,
            country=country,
            city=city,
        )
        request.meta["proxy"] = proxy_url
        request.meta["proxyhat_session"] = session_id
        request.meta["proxyhat_retries"] = request.meta.get("proxyhat_retries", 0)
        spider.logger.debug(f"Proxy assigned: session={session_id} country={country}")

    def process_response(self, request, response, spider):
        """Detect bans and retry with a fresh proxy."""
        retry_count = request.meta.get("proxyhat_retries", 0)
        if response.status in (403, 429, 503):
            if retry_count < self.max_retries:
                spider.logger.warning(
                    f"Ban detected ({response.status}), rotating proxy "
                    f"(retry {retry_count + 1}/{self.max_retries})"
                )
                # Force a new session on retry
                request.meta["proxyhat_session"] = None
                request.meta["proxyhat_retries"] = retry_count + 1
                request.dont_filter = True
                return request
            else:
                spider.logger.error(
                    f"Max retries ({self.max_retries}) reached for {request.url}"
                )
        return response

    def process_exception(self, request, exception, spider):
        """Handle connection failures by rotating the proxy."""
        retry_count = request.meta.get("proxyhat_retries", 0)
        if retry_count < self.max_retries:
            spider.logger.warning(
                f"Proxy connection failed ({exception}), rotating "
                f"(retry {retry_count + 1}/{self.max_retries})"
            )
            request.meta["proxyhat_session"] = None
            request.meta["proxyhat_retries"] = retry_count + 1
            request.dont_filter = True
            return request
        return None
```
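To make the username-flag scheme concrete, here is the same URL-building logic as a standalone function you can run directly. The gateway host and flag names mirror the middleware above and are ProxyHat-specific assumptions; adapt them to your provider's scheme:

```python
def build_proxy_url(username, password, gateway="gate.proxyhat.com", port=8080,
                    country=None, city=None, session_id=None):
    """Encode geo and session targeting into the proxy username."""
    parts = [username]
    if country:
        parts.append(f"country-{country}")
    if city:
        parts.append(f"city-{city}")
    if session_id:
        parts.append(f"session-{session_id}")
    return f"http://{'-'.join(parts)}:{password}@{gateway}:{port}"


print(build_proxy_url("alice", "s3cret", country="DE", city="berlin",
                      session_id="ab12cd34"))
# http://alice-country-DE-city-berlin-session-ab12cd34:s3cret@gate.proxyhat.com:8080
```

Because the session ID lives in the username, "rotating" vs. "sticky" is purely a matter of whether you reuse the same ID across requests.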
Settings Configuration
```python
# settings.py

# Disable the built-in proxy middleware
DOWNLOADER_MIDDLEWARES = {
    "scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware": None,
    "myproject.middlewares.ProxyHatRotationMiddleware": 560,
}

# ProxyHat credentials
PROXYHAT_USERNAME = "your_username"
PROXYHAT_PASSWORD = "your_password"
PROXYHAT_COUNTRY = "US"  # Default country; override per-request via meta
PROXYHAT_MAX_RETRIES = 3

# Concurrency — don't overwhelm a single exit IP
CONCURRENT_REQUESTS = 32
CONCURRENT_REQUESTS_PER_DOMAIN = 8
DOWNLOAD_TIMEOUT = 30
RETRY_ENABLED = False  # We handle retries inside our middleware
```
Using Sticky Sessions and Geo-Targeting in Spiders
```python
import scrapy


class EcommerceSpider(scrapy.Spider):
    name = "ecommerce"

    def start_requests(self):
        urls = [
            "https://example.com/product/1",
            "https://example.com/product/2",
        ]
        for url in urls:
            yield scrapy.Request(
                url,
                callback=self.parse,
                # Rotating IP (new session per request)
                meta={
                    "proxyhat_country": "DE",
                    "proxyhat_city": "berlin",
                },
            )

    def parse(self, response):
        # For checkout flows that need session persistence
        if response.css(".checkout-btn"):
            yield scrapy.Request(
                response.urljoin("/checkout"),
                callback=self.parse_checkout,
                meta={
                    # Reuse the same session/IP for the entire checkout
                    "proxyhat_session": response.meta["proxyhat_session"],
                    "proxyhat_country": "DE",
                },
            )
        else:
            for item in response.css(".product"):
                yield {"name": item.css("::text").get()}
```
Community Middleware vs. Rolling Your Own
The scrapy-rotating-proxies package is a popular open-source option. It maintains a proxy pool, tracks which proxies are banned, and rotates automatically. It's a solid starting point, but it has real limitations for production residential proxy workflows.
| Feature | scrapy-rotating-proxies | Custom Middleware |
|---|---|---|
| Proxy source | Static list in settings | Dynamic API or gateway URL |
| Residential proxy support | No built-in concept | Native (gateway-based rotation) |
| Geo-targeting | Not supported | Per-request country/city |
| Sticky sessions | Limited (session key) | Full control via username flags |
| Ban detection | Status code + regex | Custom logic per site |
| Pool health tracking | Basic (dead/alive lists) | Statsd / Prometheus integration |
| Proxy pool refresh | Manual / cron | On-demand (residential pool) |
| Maintenance burden | Low (community-maintained) | You own it |
When to use scrapy-rotating-proxies: You have a flat list of datacenter proxies and need basic rotation with ban tracking. Quick prototype, low complexity.
When to roll your own: You're using residential proxies from a gateway like ProxyHat, need per-request geo-targeting, sticky sessions, or want to integrate monitoring with your observability stack. The middleware above took ~100 lines and gives you full control.
Retry Middleware with Proxy Rotation
Scrapy's built-in RetryMiddleware retries failed requests, but it doesn't rotate proxies — it just re-sends the same request through the same IP. That's useless when the failure is a ban on that IP.
Our custom middleware already handles this by clearing proxyhat_session on retry, which forces a new IP. But if you want to keep Scrapy's RetryMiddleware for non-proxy errors (DNS failures, timeouts), you need to make them coexist:
```python
# settings.py — using both middlewares safely
DOWNLOADER_MIDDLEWARES = {
    "scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware": None,
    "scrapy.downloadermiddlewares.retry.RetryMiddleware": None,  # Disable built-in
    "myproject.middlewares.ProxyHatRotationMiddleware": 560,
    "myproject.middlewares.CustomRetryMiddleware": 550,  # Runs before proxy assignment
}

RETRY_TIMES = 3
RETRY_HTTP_CODES = [500, 502, 503, 504, 408, 429]
```
Whether you handle retries entirely inside the proxy middleware (as above) or in a separate CustomRetryMiddleware, the rule is the same: every retry must clear proxyhat_session so the request goes out through a fresh IP. The process_exception and process_response methods in the middleware above already do this — they return the modified request with the session cleared, and Scrapy reschedules it.
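The settings example references a CustomRetryMiddleware that isn't shown. One plausible sketch, written standalone (not subclassing Scrapy's RetryMiddleware) so it can run anywhere, retries transport-level errors while forcing a fresh session; the proxyhat_* meta keys are the ones used by the middleware above:

```python
class CustomRetryMiddleware:
    """Retry on transport errors, forcing a fresh proxy session each time.

    A hypothetical sketch assuming the proxyhat_session / proxyhat_retries
    meta keys set by ProxyHatRotationMiddleware.
    """

    RETRYABLE = (TimeoutError, ConnectionError, OSError)

    def __init__(self, max_retries=3):
        self.max_retries = max_retries

    def process_exception(self, request, exception, spider):
        if not isinstance(exception, self.RETRYABLE):
            return None  # let other middlewares or errbacks handle it
        retries = request.meta.get("proxyhat_retries", 0)
        if retries >= self.max_retries:
            return None  # give up; the failure propagates
        # Clearing the session forces process_request to mint a new one,
        # which routes the retry through a different exit IP.
        request.meta["proxyhat_session"] = None
        request.meta["proxyhat_retries"] = retries + 1
        request.dont_filter = True
        return request
```

In practice you would also map Twisted's connection exceptions (e.g. twisted.internet.error.TimeoutError) into RETRYABLE, since that is what Scrapy's downloader actually raises.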
JavaScript-Heavy Sites: scrapy-splash and scrapy-playwright
Many modern sites require a real browser to render content. Scrapy has two popular integrations for this: scrapy-splash (for Splash, a lightweight headless browser service) and scrapy-playwright (for Playwright, a full Chromium/Firefox/WebKit driver). Both support proxies, but the configuration differs.
scrapy-playwright with ProxyHat
scrapy-playwright drives a real browser for each rendered request, so proxies are configured at the browser or context level — via PLAYWRIGHT_LAUNCH_OPTIONS or per-request context kwargs — not via Scrapy's request.meta['proxy'].
```python
# settings.py
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"

PLAYWRIGHT_BROWSER_TYPE = "chromium"
PLAYWRIGHT_LAUNCH_OPTIONS = {"headless": True}

# ProxyHat credentials for Playwright contexts
PROXYHAT_USERNAME = "your_username"
PROXYHAT_PASSWORD = "your_password"
```
Then in your spider, create a new browser context per request with the proxy configured:
```python
import random

import scrapy


class JSSpider(scrapy.Spider):
    name = "js_heavy"

    def start_requests(self):
        yield scrapy.Request(
            "https://spa-example.com/products",
            callback=self.parse,
            meta={
                "playwright": True,
                "playwright_include_page": True,
                # Name a context so the kwargs below are applied when it's created
                "playwright_context": f"ctx-{random.randint(1000, 9999)}",
                "playwright_context_kwargs": {
                    "proxy": {
                        "server": "http://gate.proxyhat.com:8080",
                        "username": f"your_username-country-US-session-{random.randint(1000, 9999)}",
                        "password": "your_password",
                    },
                },
            },
        )

    async def parse(self, response):
        page = response.meta["playwright_page"]
        # Wait for dynamic content
        await page.wait_for_selector(".product-list")
        content = await page.content()
        await page.close()
        # Parse the rendered HTML
        selector = scrapy.Selector(text=content)
        for product in selector.css(".product"):
            yield {"name": product.css("::text").get()}
```
scrapy-splash with ProxyHat
For Splash, the proxy is passed as a Splash request argument. Splash runs as a separate Docker service, so the proxy configuration happens in the Splash API call, not in Scrapy's download layer:
```python
import scrapy
from scrapy_splash import SplashRequest


class SplashSpider(scrapy.Spider):
    name = "splash_spider"

    def start_requests(self):
        yield SplashRequest(
            "https://spa-example.com/products",
            callback=self.parse,
            args={
                "wait": 3,
                "proxy": "http://your_username-country-US:your_password@gate.proxyhat.com:8080",
            },
            endpoint="render.html",
        )

    def parse(self, response):
        for product in response.css(".product"):
            yield {"name": product.css("::text").get()}
```
Deployment: Scrapyd, Docker, or Managed
Docker + Cron (Simple and Reliable)
For most teams, a Docker container with a cron schedule is the simplest production setup:
```dockerfile
# Dockerfile
FROM python:3.11-slim

WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .

# Entrypoint runs the spider on schedule
CMD ["scrapy", "crawl", "ecommerce"]
```
```yaml
# docker-compose.yml
version: "3.8"
services:
  scraper:
    build: .
    environment:
      - PROXYHAT_USERNAME=${PROXYHAT_USERNAME}
      - PROXYHAT_PASSWORD=${PROXYHAT_PASSWORD}
    restart: unless-stopped
    # For scheduled runs, use a cron wrapper or external scheduler
    labels:
      - "ofelia.enabled=true"
      - "ofelia.schedule=0 */6 * * *"
      - "ofelia.command=scrapy crawl ecommerce"
```
Scrapyd
Scrapyd is Scrapy's official deployment daemon. It exposes an HTTP API to schedule, cancel, and list spider runs. It's great when you have multiple spiders and need programmatic control, but it doesn't handle recurring schedules natively — pair it with ScrapydWeb or a custom scheduler.
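Scrapyd's HTTP API is simple enough to drive from a few lines of stdlib Python. This sketch assumes a Scrapyd instance on the default port 6800; schedule.json is the real endpoint for kicking off a run:

```python
from urllib import parse, request


def build_schedule_call(project, spider, base_url="http://localhost:6800", **spider_args):
    """Build the URL and form body for Scrapyd's schedule.json endpoint."""
    body = parse.urlencode({"project": project, "spider": spider, **spider_args}).encode()
    return f"{base_url}/schedule.json", body


def schedule_spider(project, spider, **kwargs):
    """POST the scheduling call; Scrapyd replies with JSON containing a jobid."""
    url, body = build_schedule_call(project, spider, **kwargs)
    with request.urlopen(url, data=body, timeout=10) as resp:
        return resp.read().decode()


# schedule_spider("myproject", "ecommerce")  # requires a running Scrapyd
```

Extra keyword arguments are passed through as spider arguments, so per-run settings (e.g. a country override) can ride along in the same call.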
ScrapeOps / Zyte Scrapy Cloud
Managed platforms handle deployment, scheduling, proxy rotation, and monitoring out of the box. The trade-off is cost and less control over the proxy layer. If you're already using ProxyHat for residential proxies, you can still configure these platforms to route through ProxyHat's gateway.
Monitoring: Per-IP Success Rates and Ban Detection
Running proxies blind is how you waste budget and miss data gaps. You need to track success rates, detect bans early, and correlate failures to specific IPs or sessions.
Statsd Integration
Scrapy's stats collection is extensible. Add a downloader middleware that records metrics on every response:
```python
from scrapy import signals


class ProxyStatsMiddleware:
    """Track per-country success rates and ban patterns."""

    def __init__(self, stats):
        self.stats = stats

    @classmethod
    def from_crawler(cls, crawler):
        mw = cls(crawler.stats)
        crawler.signals.connect(mw.spider_closed, signal=signals.spider_closed)
        return mw

    def process_response(self, request, response, spider):
        country = request.meta.get("proxyhat_country", "unknown")
        # Increment total requests
        self.stats.inc_value(f"proxy/total/{country}")
        if response.status == 200:
            self.stats.inc_value(f"proxy/success/{country}")
        elif response.status in (403, 429, 503):
            self.stats.inc_value(f"proxy/banned/{country}")
            self.stats.inc_value(f"proxy/banned_status/{response.status}/{country}")
        else:
            self.stats.inc_value(f"proxy/error/{country}")
        self.stats.set_value(
            f"proxy/success_rate/{country}",
            self.stats.get_value(f"proxy/success/{country}", 0)
            / max(self.stats.get_value(f"proxy/total/{country}", 1), 1),
        )
        return response

    def spider_closed(self, spider, reason):
        spider.logger.info(f"Proxy stats: {self.stats.get_stats()}")
```
Ban Detection Heuristics
Not all bans return 403. Some return 200 with a CAPTCHA page. Others redirect to a challenge page. Build detection into your middleware:
- Status codes: 403, 429, and 503 are obvious bans.
- Response size: if the response is significantly smaller than expected (e.g., under 5 KB for a product page that is normally 50 KB), it's likely a CAPTCHA or block page.
- Content patterns: check for known CAPTCHA markers (recaptcha, cf-challenge, access-denied) in the response body.
- Redirect chains: if a product URL redirects to a login page, that's a soft ban.
Integrate these checks into the process_response method of your middleware and log them as structured events. Feed the data into Grafana, Datadog, or even a simple SQLite dashboard to spot degradation before it kills your scrape rate.
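The heuristics above fold naturally into a single helper that process_response can call before deciding whether to rotate. The thresholds and marker strings here are illustrative and should be tuned per target site:

```python
BAN_STATUSES = {403, 429, 503}
CAPTCHA_MARKERS = (b"recaptcha", b"cf-challenge", b"access-denied")


def looks_like_ban(status, body, expected_min_bytes=5_000):
    """Return True if a response looks like a ban or challenge page."""
    if status in BAN_STATUSES:
        return True
    if status == 200 and len(body) < expected_min_bytes:
        # Suspiciously small page: possibly a CAPTCHA or block interstitial.
        lowered = body.lower()
        return any(marker in lowered for marker in CAPTCHA_MARKERS)
    return False


print(looks_like_ban(429, b""))                       # True
print(looks_like_ban(200, b"<div class=recaptcha>"))  # True
print(looks_like_ban(200, b"x" * 50_000))             # False
```

In a middleware you would call it as looks_like_ban(response.status, response.body) and treat a True result exactly like a 403: clear the session and reschedule.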
Scaling Patterns
When you move from a single spider to a fleet, you need to think about concurrency differently:
- Per-IP rate limiting: don't send 32 concurrent requests through one residential IP. Use ProxyHat's rotating mode (new session per request) and set CONCURRENT_REQUESTS based on your target site's tolerance.
- Headless fleet: for JS-heavy targets, run multiple Playwright instances behind a task queue (Celery, Dramatiq). Each worker gets its own proxy session.
- Containerization: one Scrapy instance per Docker container, each with its own proxy credentials. Use Kubernetes or Docker Compose to scale horizontally.
- Backoff: when ban rates spike above a threshold, reduce concurrency automatically. Don't keep hammering a site that's actively blocking you.
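The backoff rule in the last bullet can be as simple as a pure function mapping the recent ban rate to a concurrency ceiling; the 10% threshold and halving factor here are illustrative defaults, and wiring the result into a running crawler (e.g. from a periodic extension that reads the proxy/banned stats) is left out:

```python
def adjusted_concurrency(current, ban_rate, floor=2, threshold=0.1, factor=0.5):
    """Halve concurrency whenever the ban rate crosses the threshold."""
    if ban_rate <= threshold:
        return current
    return max(floor, int(current * factor))


print(adjusted_concurrency(32, 0.05))  # 32  (healthy, no change)
print(adjusted_concurrency(32, 0.25))  # 16  (bans spiking, back off)
print(adjusted_concurrency(2, 0.90))   # 2   (already at the floor)
```

Keeping it a pure function makes the policy trivially unit-testable, separate from however you apply it to the downloader.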
Key Takeaways
- Disable Scrapy's built-in HttpProxyMiddleware and replace it with a custom middleware that controls proxy assignment, retries, and ban detection in one place.
- Use gateway-style residential proxies (like ProxyHat) instead of managing static IP lists. The rotation happens in the username, not in your code.
- Rotate on retry: when a request fails due to a ban, clear the session ID so the retry uses a fresh IP. This is the single most impactful pattern for success rate.
- For JS-heavy sites, configure proxies at the browser context level (Playwright) or as Splash arguments, not via request.meta['proxy'].
- Monitor per-country success rates and ban patterns. Without visibility, you're flying blind on proxy spend.
- Start with Docker + cron; graduate to Scrapyd or managed platforms only when you need programmatic scheduling.
Conclusion
Scrapy proxy middleware isn't a bolt-on — it's the backbone of any production scraping pipeline. The middleware pattern gives you a clean, testable place to handle rotation, retries, geo-targeting, and monitoring. Start with the ProxyHatRotationMiddleware above, adapt the ban detection heuristics to your target sites, and instrument from day one.
Ready to put it into practice? Get started with ProxyHat residential proxies and configure your Scrapy project with the gateway settings above. For more on web scraping patterns, check out our web scraping use case guide and SERP tracking walkthrough.