Scraping Rate Limits Explained

How rate limits work, how sites detect scrapers, and practical strategies for staying within the limits. Includes adaptive-throttling code and distributed rate-limiting patterns.


What Are Scraping Rate Limits?

Rate limits are the invisible walls that websites build to control how fast any single client can make requests. When you scrape a site too aggressively, you hit these walls — and the consequences range from temporary slowdowns to permanent IP bans. Understanding how rate limits work, how they detect you, and how to stay under them is fundamental to building scrapers that deliver data reliably.

This guide explains the mechanics behind rate limiting, the detection signals websites use, and practical strategies for adaptive throttling that keep your scrapers running smoothly.

For a broader overview of scraping with proxies, see our Complete Guide to Web Scraping Proxies. For avoiding blocks in general, read How to Scrape Websites Without Getting Blocked.

How Rate Limiting Works

Websites implement rate limits at multiple layers, each with different detection granularity:

Layer 1: IP-Based Rate Limits

The most common approach. The server tracks requests per IP address within a time window. Exceed the threshold and you receive HTTP 429 (Too Many Requests) or 503 responses.

# Typical rate limit behavior
Request 1-50:    HTTP 200 (normal)
Request 51:      HTTP 429 (rate limited)
Wait 60 seconds...
Request 52:      HTTP 200 (reset)
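Server-side, this layer can be as simple as a fixed-window counter per IP. A minimal standalone sketch (the 50-request/60-second window mirrors the example above and is purely illustrative):

```python
import time
from collections import defaultdict

class FixedWindowLimiter:
    """Per-IP fixed-window counter, as a server might implement it."""
    def __init__(self, max_requests: int = 50, window_seconds: int = 60):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.counters = defaultdict(lambda: [0.0, 0])  # ip -> [window_start, count]

    def allow(self, ip: str) -> bool:
        now = time.time()
        window_start, count = self.counters[ip]
        if now - window_start >= self.window_seconds:
            self.counters[ip] = [now, 1]    # start a fresh window
            return True
        if count < self.max_requests:
            self.counters[ip][1] += 1
            return True
        return False                        # would be answered with HTTP 429

limiter = FixedWindowLimiter(max_requests=50, window_seconds=60)
results = [limiter.allow("203.0.113.7") for _ in range(51)]
# first 50 allowed, request 51 rejected until the window resets
```

Real implementations usually use sliding windows or token buckets to avoid the burst at each window boundary, but the counting principle is the same.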

Layer 2: Session/Cookie-Based Limits

Tracks request frequency per session or browser cookie. Even if you rotate IPs, the same session token hitting the server fast will trigger limits.

Layer 3: Account-Based Limits

For sites requiring login, limits are tied to the user account regardless of IP. Common on APIs and SaaS platforms.

Layer 4: Behavioral Analysis

Advanced systems like Cloudflare, PerimeterX, and Akamai analyze behavioral patterns: request timing, navigation flow, mouse movements (in browser contexts). This layer is the hardest to bypass because it does not rely on simple counters.

Common Rate Limit Detection Signals

Websites use multiple signals simultaneously to detect automated scraping:

| Signal | What It Detects | Difficulty to Evade |
|---|---|---|
| Requests per IP per minute | Raw speed | Easy (use proxies) |
| Requests per IP per hour/day | Sustained volume | Medium (rotate IPs) |
| Request timing regularity | Machine-like intervals | Medium (add jitter) |
| Missing/wrong headers | Non-browser clients | Easy (set proper headers) |
| Sequential URL patterns | Systematic crawling | Medium (randomize order) |
| TLS fingerprint | Library vs browser | Hard (use real browsers) |
| JavaScript execution | Headless browser | Hard (advanced config) |
| Mouse/keyboard events | Bot behavior | Very hard |
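Of these, the header signal is the cheapest to fix. A sketch of a browser-like header set for `requests` (the header values are illustrative examples, not a guaranteed fingerprint match):

```python
import requests

# Headers resembling what a real Chrome browser sends; values are illustrative
BROWSER_HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/124.0.0.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
    "Accept-Encoding": "gzip, deflate, br",
    "Connection": "keep-alive",
}

session = requests.Session()
session.headers.update(BROWSER_HEADERS)
# session.get("https://example.com")  # every request now carries browser-like headers
```

Note that headers alone will not beat TLS fingerprinting: the TLS handshake of `requests` still differs from Chrome's regardless of what the HTTP headers claim.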

Learn more about detection mechanisms in our guide on How Anti-Bot Systems Detect Proxies.

HTTP Response Codes That Signal Rate Limiting

Knowing which HTTP codes indicate rate limiting helps you build proper retry logic:

| Code | Meaning | Action |
|---|---|---|
| 200 (with CAPTCHA) | Soft block, challenge page served | Rotate IP, slow down |
| 403 Forbidden | IP or session blocked | Rotate IP immediately |
| 429 Too Many Requests | Explicit rate limit hit | Wait and retry with backoff |
| 503 Service Unavailable | Server overload or block | Back off, check if blocked |
| 302/307 to CAPTCHA URL | Challenge redirect | Rotate IP, reduce speed |

Strategy 1: Respectful Throttling

The simplest approach — keep your request rate well below what the target allows. This means fewer failures, less wasted bandwidth, and more sustainable scraping.

import requests
import time
import random
PROXY = "http://USERNAME:PASSWORD@gate.proxyhat.com:8080"
def respectful_scrape(urls: list[str], rpm_limit: int = 10) -> list[str]:
    """Scrape URLs while respecting a requests-per-minute limit."""
    delay = 60.0 / rpm_limit
    results = []
    for url in urls:
        try:
            resp = requests.get(
                url,
                proxies={"http": PROXY, "https": PROXY},
                timeout=30
            )
            results.append(resp.text if resp.status_code == 200 else None)
        except requests.RequestException:
            results.append(None)
        # Add delay with random jitter (±30%) to look less robotic
        jitter = delay * random.uniform(0.7, 1.3)
        time.sleep(jitter)
    return results

Strategy 2: Adaptive Throttling

Instead of a fixed rate, dynamically adjust your speed based on the responses you receive. Speed up when everything works, slow down when you see warning signs.

Python Implementation

import requests
import time
import random
from dataclasses import dataclass, field
PROXY = "http://USERNAME:PASSWORD@gate.proxyhat.com:8080"
@dataclass
class AdaptiveThrottle:
    """Automatically adjusts request rate based on server responses."""
    base_delay: float = 2.0      # seconds between requests
    min_delay: float = 0.5
    max_delay: float = 30.0
    current_delay: float = 2.0
    success_streak: int = 0
    warning_codes: set = field(default_factory=lambda: {429, 403, 503})
    def on_success(self):
        self.success_streak += 1
        # Speed up after 10 consecutive successes
        if self.success_streak >= 10:
            self.current_delay = max(self.current_delay * 0.85, self.min_delay)
            self.success_streak = 0
    def on_rate_limit(self):
        self.success_streak = 0
        # Double the delay on rate limit
        self.current_delay = min(self.current_delay * 2.0, self.max_delay)
    def on_block(self):
        self.success_streak = 0
        # Aggressive backoff on block
        self.current_delay = min(self.current_delay * 3.0, self.max_delay)
    def wait(self):
        jitter = self.current_delay * random.uniform(0.7, 1.3)
        time.sleep(jitter)
def scrape_adaptive(urls: list[str]) -> list[dict]:
    throttle = AdaptiveThrottle()
    results = []
    for url in urls:
        try:
            resp = requests.get(
                url,
                proxies={"http": PROXY, "https": PROXY},
                timeout=30
            )
            if resp.status_code == 200:
                throttle.on_success()
                results.append({"url": url, "status": 200, "body": resp.text})
            elif resp.status_code == 429:
                throttle.on_rate_limit()
                # Honor Retry-After when present (may be seconds or an HTTP date)
                retry_after = resp.headers.get("Retry-After", "")
                if retry_after.isdigit():
                    time.sleep(int(retry_after))
                results.append({"url": url, "status": 429, "body": None})
            elif resp.status_code == 403:
                throttle.on_block()
                results.append({"url": url, "status": 403, "body": None})
            else:
                results.append({"url": url, "status": resp.status_code, "body": resp.text})
        except requests.RequestException as e:
            throttle.on_block()
            results.append({"url": url, "status": 0, "error": str(e)})
        throttle.wait()
        print(f"Current delay: {throttle.current_delay:.1f}s")
    return results

Node.js Implementation

const { HttpsProxyAgent } = require('https-proxy-agent');
const fetch = require('node-fetch'); // node-fetch v2 (CommonJS); v3+ is ESM-only
class AdaptiveThrottle {
  constructor() {
    this.currentDelay = 2000; // ms
    this.minDelay = 500;
    this.maxDelay = 30000;
    this.successStreak = 0;
  }
  onSuccess() {
    this.successStreak++;
    if (this.successStreak >= 10) {
      this.currentDelay = Math.max(this.currentDelay * 0.85, this.minDelay);
      this.successStreak = 0;
    }
  }
  onRateLimit() {
    this.successStreak = 0;
    this.currentDelay = Math.min(this.currentDelay * 2, this.maxDelay);
  }
  onBlock() {
    this.successStreak = 0;
    this.currentDelay = Math.min(this.currentDelay * 3, this.maxDelay);
  }
  async wait() {
    const jitter = this.currentDelay * (0.7 + Math.random() * 0.6);
    return new Promise(resolve => setTimeout(resolve, jitter));
  }
}
async function scrapeAdaptive(urls) {
  const throttle = new AdaptiveThrottle();
  const agent = new HttpsProxyAgent('http://USERNAME:PASSWORD@gate.proxyhat.com:8080');
  const results = [];
  for (const url of urls) {
    try {
      const res = await fetch(url, { agent, timeout: 30000 });
      if (res.ok) {
        throttle.onSuccess();
        results.push({ url, status: res.status, body: await res.text() });
      } else if (res.status === 429) {
        throttle.onRateLimit();
        const retryAfter = parseInt(res.headers.get('retry-after') || '0');
        if (retryAfter) await new Promise(r => setTimeout(r, retryAfter * 1000));
        results.push({ url, status: 429, body: null });
      } else if (res.status === 403) {
        throttle.onBlock();
        results.push({ url, status: 403, body: null });
      } else {
        results.push({ url, status: res.status, body: await res.text() });
      }
    } catch (err) {
      throttle.onBlock();
      results.push({ url, status: 0, error: err.message });
    }
    await throttle.wait();
    console.log(`Current delay: ${throttle.currentDelay.toFixed(0)}ms`);
  }
  return results;
}

Strategy 3: Distributed Rate Limiting

When running multiple scraper instances in parallel, coordinate the rate limit across all workers. Without coordination, each worker respects its own limit but the combined traffic still overwhelms the target.

import requests
import time
import threading
class DistributedRateLimiter:
    """Thread-safe rate limiter for multiple scraper workers."""
    def __init__(self, max_rpm: int):
        self.min_interval = 60.0 / max_rpm
        self.lock = threading.Lock()
        self.last_request_time = 0.0
    def acquire(self):
        """Block until it is safe to make the next request."""
        with self.lock:
            now = time.time()
            elapsed = now - self.last_request_time
            if elapsed < self.min_interval:
                time.sleep(self.min_interval - elapsed)
            self.last_request_time = time.time()
# Shared limiter across all threads
limiter = DistributedRateLimiter(max_rpm=30)
PROXY = "http://USERNAME:PASSWORD@gate.proxyhat.com:8080"
def worker(urls: list[str], results: list):
    for url in urls:
        limiter.acquire()
        try:
            resp = requests.get(
                url,
                proxies={"http": PROXY, "https": PROXY},
                timeout=30
            )
            results.append({"url": url, "status": resp.status_code})
        except Exception as e:
            results.append({"url": url, "error": str(e)})
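To see the coordination work end to end, the following standalone sketch repeats the limiter, runs four threads against it, and times the combined output (the 600 RPM setting and dummy "request" just speed up the demo):

```python
import threading
import time

class DistributedRateLimiter:
    """Thread-safe limiter shared by all workers (same logic as above)."""
    def __init__(self, max_rpm: int):
        self.min_interval = 60.0 / max_rpm
        self.lock = threading.Lock()
        self.last_request_time = 0.0

    def acquire(self):
        with self.lock:
            now = time.time()
            elapsed = now - self.last_request_time
            if elapsed < self.min_interval:
                time.sleep(self.min_interval - elapsed)
            self.last_request_time = time.time()

limiter = DistributedRateLimiter(max_rpm=600)  # 0.1 s spacing, sped up for the demo
timestamps: list[float] = []

def worker(n: int):
    for _ in range(n):
        limiter.acquire()
        timestamps.append(time.time())  # stand-in for the real HTTP request

threads = [threading.Thread(target=worker, args=(3,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

span = max(timestamps) - min(timestamps)
print(f"12 requests from 4 threads took {span:.2f}s")  # ~1.1 s at 600 RPM combined
```

Whether work is split across 2 threads or 20, the shared lock keeps the combined rate at the configured RPM. For workers in separate processes or machines, the same pattern needs an external store (e.g. Redis) instead of an in-process lock.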

Strategy 4: Request Queue with Priority

For complex scraping projects, pair a priority queue of pending URLs with a limiter that tracks rate limits per target domain. The per-domain limiter looks like this:

import requests
import time
import threading
from collections import defaultdict
PROXY = "http://USERNAME:PASSWORD@gate.proxyhat.com:8080"
class DomainRateLimiter:
    """Per-domain rate limiting; pair with a priority queue of pending URLs."""
    def __init__(self, default_rpm: int = 10):
        self.default_rpm = default_rpm
        self.domain_limits = {}          # domain -> max RPM
        self.domain_last = defaultdict(float)  # domain -> last request time
        self.lock = threading.Lock()
    def set_limit(self, domain: str, rpm: int):
        self.domain_limits[domain] = rpm
    def wait_for_domain(self, domain: str):
        rpm = self.domain_limits.get(domain, self.default_rpm)
        min_interval = 60.0 / rpm
        with self.lock:
            now = time.time()
            elapsed = now - self.domain_last[domain]
            if elapsed < min_interval:
                time.sleep(min_interval - elapsed)
            self.domain_last[domain] = time.time()
# Configure per-domain limits
limiter = DomainRateLimiter(default_rpm=10)
limiter.set_limit("amazon.com", 3)        # Very conservative for Amazon
limiter.set_limit("example.com", 30)      # Lenient for simple sites
limiter.set_limit("google.com", 5)        # Moderate for Google
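URLs must be mapped to consistent domain keys before calling `wait_for_domain`. A minimal sketch (the `domain_of` helper is hypothetical and ignores multi-level public suffixes like `.co.uk`):

```python
from urllib.parse import urlparse

def domain_of(url: str) -> str:
    """Extract the hostname used as the rate-limit key (hypothetical helper)."""
    return urlparse(url).netloc.removeprefix("www.")

urls = [
    "https://www.amazon.com/dp/B000TEST",
    "https://example.com/page/1",
    "https://www.google.com/search?q=proxies",
]
print([domain_of(u) for u in urls])
# ['amazon.com', 'example.com', 'google.com']

# In the crawl loop, pair it with the limiter above:
# for url in urls:
#     limiter.wait_for_domain(domain_of(url))
#     ...fetch url...
```

Without the normalization, `www.amazon.com` and `amazon.com` would be throttled as two separate domains and each get the full per-domain budget.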

Reading Robots.txt for Rate Hints

Many sites publish their crawl preferences in robots.txt. The Crawl-delay directive tells you the minimum seconds between requests:

import requests
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser
def get_crawl_delay(base_url: str, user_agent: str = "*") -> float | None:
    """Extract Crawl-delay from robots.txt."""
    parsed = urlparse(base_url)
    robots_url = f"{parsed.scheme}://{parsed.netloc}/robots.txt"
    try:
        resp = requests.get(robots_url, timeout=10)
        if resp.status_code != 200:
            return None
        rp = RobotFileParser()
        rp.parse(resp.text.splitlines())
        delay = rp.crawl_delay(user_agent)
        return delay
    except Exception:
        return None
# Check before scraping
delay = get_crawl_delay("https://example.com")
if delay:
    print(f"Site requests {delay}s between requests")
else:
    print("No crawl-delay specified")

Common Rate Limit Mistakes

  • Ignoring 429 responses. Many scrapers treat all non-200 responses the same. A 429 tells you exactly what happened — use the Retry-After header and back off.
  • Fixed delays without jitter. A request exactly every 2.000 seconds looks robotic. Add random variation (jitter) to your delays.
  • Not coordinating parallel workers. Five workers each doing 10 RPM equals 50 RPM total. Use a shared rate limiter.
  • Rotating IPs without slowing down. IP rotation buys you time, but if each new IP immediately hammers the site, advanced detection will still catch you. Combine rotation with proper throttling.
  • Scraping during peak hours. Sites are more aggressive with rate limiting during high-traffic periods. Schedule heavy crawls during off-peak hours for the target's timezone.

To calculate how many proxies you need to support your rate-limited scraping, see How Many Proxies Do You Need for Scraping?. For proxy rotation strategies that complement rate limiting, read Proxy Rotation Strategies for Large-Scale Scraping.

Get started with properly rate-limited scraping using the ProxyHat Python SDK or explore pricing plans for your project.

Frequently Asked Questions

What happens when I exceed a rate limit?

The response depends on the site. Most return HTTP 429 with a Retry-After header. Some serve CAPTCHAs. Aggressive sites immediately block the IP with a 403 response. In the worst case, repeated violations lead to permanent IP bans.

How do I find a site's rate limit?

Start slow and increase gradually while monitoring response codes. Check robots.txt for Crawl-delay directives. Observe response headers for X-RateLimit-Limit and X-RateLimit-Remaining fields. Some APIs publish their limits in documentation.
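Reading those headers can be wrapped in a small helper, sketched here (the `read_rate_limit_headers` function is hypothetical, and header names vary by site):

```python
def read_rate_limit_headers(headers: dict) -> dict:
    """Pull common rate-limit fields out of response headers (hypothetical helper)."""
    def to_int(value):
        return int(value) if value is not None and str(value).isdigit() else None
    return {
        "limit": to_int(headers.get("X-RateLimit-Limit")),
        "remaining": to_int(headers.get("X-RateLimit-Remaining")),
        "reset": to_int(headers.get("X-RateLimit-Reset")),
    }

# After resp = requests.get(...), pass resp.headers; shown here with a literal dict:
info = read_rate_limit_headers({
    "X-RateLimit-Limit": "100",
    "X-RateLimit-Remaining": "4",
    "X-RateLimit-Reset": "1700000000",
})
print(info)  # {'limit': 100, 'remaining': 4, 'reset': 1700000000}
if info["remaining"] is not None and info["remaining"] < 5:
    print("approaching the limit, slow down")
```

When `remaining` drops near zero, proactively backing off is cheaper than waiting for the 429.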

Does using proxies bypass rate limits?

Proxies distribute requests across multiple IPs, so each IP stays under the per-IP limit. However, sophisticated sites also track sessions, fingerprints, and behavioral patterns. Proxies are necessary but not sufficient — combine them with proper throttling and realistic request patterns.

What is the safest request rate for scraping?

There is no universal answer. For aggressive targets like Google or Amazon, 1-5 requests per minute per IP is safe. For lightly protected sites, 20-60 RPM per IP may work. Always start conservative and increase based on observed success rates.

Ready to get started?

Access 50M+ residential IPs in 148+ countries with AI-based filtering.
