Handling CAPTCHAs While Scraping

CAPTCHA types, prevention strategies that are more effective than solving, and the critical role proxies play in CAPTCHA avoidance. Includes code examples for detection and routing.


Why CAPTCHAs Are the Scraper's Biggest Obstacle

CAPTCHAs exist to distinguish humans from bots, and they are increasingly effective at it. When your scraper encounters a CAPTCHA, it means the target site has detected automated behavior — your request frequency was too high, your IP has low trust, or your browser fingerprint looked suspicious. The best CAPTCHA strategy is prevention, not solving.

This guide covers the types of CAPTCHAs you will encounter, why prevention is more effective and cheaper than solving, and how proxies play a critical role in avoiding CAPTCHAs entirely.

This article is part of our Complete Guide to Web Scraping Proxies series. For understanding detection systems, see How Anti-Bot Systems Detect Proxies.

Types of CAPTCHAs in 2026

| Type | How It Works | Difficulty to Bypass |
| --- | --- | --- |
| reCAPTCHA v2 (checkbox) | Click "I'm not a robot" + optional image challenge | Medium |
| reCAPTCHA v3 (invisible) | Scores behavior 0.0-1.0 without user interaction | Hard |
| hCaptcha | Image selection challenges (similar to reCAPTCHA v2) | Medium |
| Cloudflare Turnstile | Browser challenge, usually invisible | Hard |
| Custom image CAPTCHAs | Site-specific challenges (distorted text, puzzles) | Variable |
| Proof of Work | Browser must compute a hash (Cloudflare Under Attack Mode) | Medium |

Invisible CAPTCHAs Are the Real Threat

The most dangerous CAPTCHAs for scrapers are the ones you never see. reCAPTCHA v3 and Cloudflare Turnstile run in the background, analyzing mouse movements, scroll behavior, typing patterns, and browser environment. They assign a trust score without showing any challenge — and if the score is too low, the request is silently blocked or redirected.
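Because these systems fail silently, a scraper cannot wait for a visible challenge page; it has to sanity-check every response against a baseline instead. A rough heuristic sketch; the status codes, length threshold, and `expected_marker` are illustrative values you would tune per target:

```python
def looks_silently_blocked(status: int, body: str,
                           expected_marker: str,
                           min_length: int = 2000) -> bool:
    """Heuristic check for silent blocks from invisible CAPTCHA systems.

    Thresholds are illustrative; tune them per target site.
    """
    if status in (403, 429, 503):
        return True
    if len(body) < min_length:        # Challenge/interstitial pages are tiny
        return True
    if expected_marker not in body:   # Real pages contain known content
        return True
    return False

# A normal product page: expected status, long body, known marker present
ok = looks_silently_blocked(200, "<html>" + "x" * 5000 + "product-grid</html>",
                            expected_marker="product-grid")
print(ok)  # False
```

The marker check is the important one: a 200 response with the wrong content is exactly how low-score requests surface under reCAPTCHA v3 and Turnstile.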

Prevention vs Solving: Why Prevention Wins

| Approach | Cost | Speed | Reliability | Scalability |
| --- | --- | --- | --- | --- |
| Prevention (no CAPTCHAs triggered) | $0 | Instant | Highest | Excellent |
| CAPTCHA solving services | $1-3 per 1,000 | 10-60 seconds | 85-95% | Moderate |
| AI-based auto-solving | $2-5 per 1,000 | 5-30 seconds | 70-90% | Limited |

At scale, prevention saves both money and time. Solving 100,000 CAPTCHAs per day costs $100-500 and adds hours of latency. Preventing them costs nothing extra beyond proper proxy and request management.
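That arithmetic can be sanity-checked in a few lines (the $2 per 1,000 rate and the 30-second average solve time are mid-range assumptions taken from the table above):

```python
captchas_per_day = 100_000
cost_per_1000 = 2.0          # mid-range solving-service rate from the table
avg_solve_seconds = 30       # midpoint of the 10-60 second range

daily_cost = captchas_per_day / 1000 * cost_per_1000
added_latency_hours = captchas_per_day * avg_solve_seconds / 3600

print(f"${daily_cost:.0f} per day")                                 # $200 per day
print(f"{added_latency_hours:.0f} hours of cumulative solve time")  # 833 hours
```

Note that the latency figure is cumulative across all requests; it is spread over however many workers run in parallel, but it is still time your pipeline spends waiting instead of scraping.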

Prevention Strategy 1: Use High-Quality Residential Proxies

The single most effective CAPTCHA prevention measure is using residential proxies with high trust scores. Residential IPs are assigned to real households by ISPs, so websites cannot easily distinguish your requests from genuine user traffic.

import requests
# Residential proxy — high trust score, fewer CAPTCHAs
PROXY = "http://USERNAME:PASSWORD@gate.proxyhat.com:8080"
def scrape_with_residential(url: str) -> str:
    """Use residential proxies to avoid triggering CAPTCHAs."""
    session = requests.Session()
    session.proxies = {"http": PROXY, "https": PROXY}
    session.headers.update({
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                      "AppleWebKit/537.36 (KHTML, like Gecko) "
                      "Chrome/120.0.0.0 Safari/537.36",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
        "Accept-Encoding": "gzip, deflate, br",
    })
    resp = session.get(url, timeout=30)
    return resp.text

ProxyHat's residential pool provides IPs from real ISPs in 190+ countries, giving each request the highest possible trust score. See Residential vs Datacenter Proxies for Scraping for a detailed comparison.

Prevention Strategy 2: Realistic Request Patterns

CAPTCHAs are often triggered by robotic behavior patterns, not just IP reputation. Make your scraper look human:

Python Implementation

import requests
import random
import time
PROXY = "http://USERNAME:PASSWORD@gate.proxyhat.com:8080"
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:121.0) Gecko/20100101 Firefox/121.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 Safari/605.1.15",
]
REFERRERS = [
    "https://www.google.com/",
    "https://www.bing.com/",
    "https://duckduckgo.com/",
    None,  # Direct visit
]
def human_like_scrape(urls: list[str]) -> list[str]:
    """Scrape with realistic human behavior patterns."""
    results = []
    session = requests.Session()
    session.proxies = {"http": PROXY, "https": PROXY}
    for url in urls:
        # Randomize headers per request
        headers = {
            "User-Agent": random.choice(USER_AGENTS),
            "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
            "Accept-Language": "en-US,en;q=0.9",
            "Accept-Encoding": "gzip, deflate, br",
            "Connection": "keep-alive",
            "Upgrade-Insecure-Requests": "1",
        }
        referrer = random.choice(REFERRERS)
        if referrer:
            headers["Referer"] = referrer
        try:
            resp = session.get(url, headers=headers, timeout=30)
            results.append(resp.text)
        except requests.RequestException:
            results.append(None)
        # Human-like delays: 1-5 seconds with occasional longer pauses
        if random.random() < 0.1:
            time.sleep(random.uniform(5, 15))  # 10% chance of long pause
        else:
            time.sleep(random.uniform(1, 4))
    return results

Node.js Implementation

const { HttpsProxyAgent } = require('https-proxy-agent'); // v7+ exports the class by name
const fetch = require('node-fetch');
const agent = new HttpsProxyAgent('http://USERNAME:PASSWORD@gate.proxyhat.com:8080');
const USER_AGENTS = [
  'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 Chrome/120.0.0.0 Safari/537.36',
  'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 Chrome/120.0.0.0 Safari/537.36',
  'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:121.0) Gecko/20100101 Firefox/121.0',
];
function randomDelay() {
  const isLongPause = Math.random() < 0.1;
  const ms = isLongPause
    ? 5000 + Math.random() * 10000
    : 1000 + Math.random() * 3000;
  return new Promise(r => setTimeout(r, ms));
}
async function humanLikeScrape(urls) {
  const results = [];
  for (const url of urls) {
    const headers = {
      'User-Agent': USER_AGENTS[Math.floor(Math.random() * USER_AGENTS.length)],
      'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
      'Accept-Language': 'en-US,en;q=0.9',
    };
    try {
      const res = await fetch(url, { agent, headers, timeout: 30000 });
      results.push(await res.text());
    } catch {
      results.push(null);
    }
    await randomDelay();
  }
  return results;
}

Prevention Strategy 3: Smart IP Rotation

The way you rotate IPs directly affects CAPTCHA rates. Aggressive rotation (new IP every request) can actually increase CAPTCHAs on some sites, because a series of requests from different IPs accessing the same session path looks suspicious.

import requests
import uuid
def create_session_for_site(site_id: str):
    """Create a sticky session that maintains the same IP per site.
    This avoids the suspicious pattern of different IPs accessing the same flow."""
    session_id = uuid.uuid5(uuid.NAMESPACE_URL, site_id).hex[:8]
    proxy = f"http://USERNAME-session-{session_id}:PASSWORD@gate.proxyhat.com:8080"
    session = requests.Session()
    session.proxies = {"http": proxy, "https": proxy}
    return session
# Same IP for all requests to a specific product section
session = create_session_for_site("example.com-electronics")
page1 = session.get("https://example.com/electronics?page=1")
page2 = session.get("https://example.com/electronics?page=2")
page3 = session.get("https://example.com/electronics?page=3")
# Different IP for a different section
session2 = create_session_for_site("example.com-clothing")
clothes1 = session2.get("https://example.com/clothing?page=1")

For more rotation patterns, see Proxy Rotation Strategies for Large-Scale Scraping.

Prevention Strategy 4: Respect Rate Limits

CAPTCHAs are often the escalation after rate limiting. If you handle rate limit signals properly, you rarely see CAPTCHAs:

import requests
import time
PROXY = "http://USERNAME:PASSWORD@gate.proxyhat.com:8080"
CAPTCHA_INDICATORS = [
    "captcha",
    "recaptcha",
    "hcaptcha",
    "challenge",
    "verify you are human",
    "please complete the security check",
]
def is_captcha_page(html: str) -> bool:
    """Detect if the response is a CAPTCHA challenge page."""
    html_lower = html.lower()
    return any(indicator in html_lower for indicator in CAPTCHA_INDICATORS)
def scrape_with_captcha_detection(urls: list[str]) -> list[dict]:
    results = []
    session = requests.Session()
    session.proxies = {"http": PROXY, "https": PROXY}
    captcha_count = 0
    backoff = 2.0
    for url in urls:
        try:
            resp = session.get(url, timeout=30)
            captcha = is_captcha_page(resp.text)  # Check once, reuse below
            if resp.status_code == 200 and not captcha:
                results.append({"url": url, "status": "success", "body": resp.text})
                captcha_count = 0
                backoff = max(backoff * 0.9, 1.0)  # Reduce backoff on success
            elif captcha or resp.status_code == 403:
                captcha_count += 1
                results.append({"url": url, "status": "captcha"})
                if captcha_count >= 3:
                    # Too many CAPTCHAs — increase backoff significantly
                    backoff = min(backoff * 3, 60)
                    print(f"CAPTCHA streak: {captcha_count}. Backing off to {backoff:.0f}s")
                else:
                    backoff = min(backoff * 1.5, 30)
            else:
                # Other statuses (429, 5xx): record so no URL is silently dropped
                results.append({"url": url, "status": f"http_{resp.status_code}"})
        except requests.RequestException as e:
            results.append({"url": url, "status": "error", "error": str(e)})
        time.sleep(backoff)
    return results

For comprehensive rate limit strategies, see Scraping Rate Limits Explained.

When You Must Handle CAPTCHAs: Detection and Routing

Even with perfect prevention, some CAPTCHAs are unavoidable. Build detection into your pipeline so you can route CAPTCHA pages for special handling:

import requests
from enum import Enum
PROXY = "http://USERNAME:PASSWORD@gate.proxyhat.com:8080"
class ResponseType(Enum):
    SUCCESS = "success"
    CAPTCHA = "captcha"
    BLOCKED = "blocked"
    ERROR = "error"
def classify_response(resp: requests.Response) -> ResponseType:
    """Classify a response to determine next action."""
    if resp.status_code in (403, 429):
        return ResponseType.BLOCKED
    if resp.status_code == 200:
        html = resp.text.lower()
        captcha_signals = ["captcha", "recaptcha", "hcaptcha", "cf-challenge"]
        if any(s in html for s in captcha_signals):
            return ResponseType.CAPTCHA
        return ResponseType.SUCCESS
    return ResponseType.ERROR
def scrape_with_routing(urls: list[str]) -> dict:
    """Scrape URLs and route based on response classification."""
    session = requests.Session()
    session.proxies = {"http": PROXY, "https": PROXY}
    results = {"success": [], "captcha": [], "blocked": [], "error": []}
    for url in urls:
        try:
            resp = session.get(url, timeout=30)
            response_type = classify_response(resp)
            results[response_type.value].append(url)
            if response_type == ResponseType.CAPTCHA:
                # Route to CAPTCHA queue for manual or service-based solving
                print(f"CAPTCHA detected: {url}")
            elif response_type == ResponseType.BLOCKED:
                # Rotate IP and retry
                print(f"Blocked: {url}")
        except requests.RequestException:
            results["error"].append(url)
    print(f"Success: {len(results['success'])}, "
          f"CAPTCHAs: {len(results['captcha'])}, "
          f"Blocked: {len(results['blocked'])}")
    return results

CAPTCHA Prevention Checklist

  • Use residential proxies. They have the highest trust scores and trigger the fewest CAPTCHAs. ProxyHat residential proxies provide millions of clean IPs.
  • Set realistic headers. Always send User-Agent, Accept, Accept-Language, and other standard browser headers.
  • Add human-like delays. Random 1-5 second delays between requests with occasional longer pauses.
  • Maintain sessions properly. Use cookies and consistent IPs for related requests via sticky sessions.
  • Respect robots.txt. Sites that detect robots.txt violations escalate to CAPTCHAs faster.
  • Monitor CAPTCHA rates. If your CAPTCHA rate exceeds 5%, something in your approach needs fixing.
  • Avoid scraping during peak hours. Anti-bot systems are more aggressive during high-traffic periods.
  • Rotate User-Agents properly. Use recent, realistic browser strings. Do not mix mobile and desktop UAs on the same session.
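The "monitor CAPTCHA rates" item is easy to automate with a rolling window. This sketch (the window size and minimum sample count are arbitrary choices) flags when the rate over recent responses crosses the 5% threshold:

```python
from collections import deque

class CaptchaRateMonitor:
    """Rolling CAPTCHA-rate tracker over the last `window` responses."""

    def __init__(self, window: int = 200, threshold: float = 0.05):
        self.outcomes = deque(maxlen=window)  # True = CAPTCHA seen
        self.threshold = threshold

    def record(self, was_captcha: bool) -> None:
        self.outcomes.append(was_captcha)

    @property
    def rate(self) -> float:
        return sum(self.outcomes) / len(self.outcomes) if self.outcomes else 0.0

    def needs_attention(self) -> bool:
        # Only alert once the window holds a meaningful sample
        return len(self.outcomes) >= 50 and self.rate > self.threshold

monitor = CaptchaRateMonitor()
for i in range(100):
    monitor.record(i % 10 == 0)  # Simulate a 10% CAPTCHA rate
print(monitor.rate, monitor.needs_attention())  # 0.1 True
```

Call `record()` after each response classification; when `needs_attention()` fires, slow down, rotate sessions, or switch proxy pools before the CAPTCHA streak escalates into an outright block.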

For proxy setup in your preferred language, see Using Proxies in Python, Using Proxies in Node.js, or Using Proxies in Go. Explore ProxyHat for Web Scraping to get started.

Frequently Asked Questions

Can proxies help avoid CAPTCHAs?

Yes, significantly. High-quality residential proxies have clean IP reputations that rarely trigger CAPTCHAs. Datacenter IPs are flagged more often because they are known automated sources. The combination of residential proxies with proper request patterns virtually eliminates CAPTCHAs for most targets.

What is the cheapest way to handle CAPTCHAs at scale?

Prevention. Investing in residential proxies and proper scraping patterns costs far less than CAPTCHA solving services at scale. If you must solve CAPTCHAs, third-party services cost $1-3 per 1,000 but add 10-60 seconds of latency per request.

Do headless browsers help with CAPTCHAs?

They help with invisible CAPTCHAs (reCAPTCHA v3, Turnstile) by providing a real browser environment with JavaScript execution. However, they are slower and more resource-intensive. Use them only for targets that specifically require browser-level verification.

How do I know if I am getting CAPTCHA pages?

Check response HTML for CAPTCHA indicators: "captcha", "recaptcha", "hcaptcha", "challenge", or "verify you are human". Also watch for unexpected 403 responses and redirects to challenge URLs. Build automated detection into your scraping pipeline.

Why do I still get CAPTCHAs with residential proxies?

Usually because of request patterns, not IP quality. Common causes: too many requests per minute, missing browser headers, cookie handling issues, or scraping patterns that are too systematic. Slow down, add jitter, and use sticky sessions for related requests.
