How to Reduce Detection When Scraping: A Complete Guide

A comprehensive, multi-layered guide to avoiding detection while web scraping, covering IP rotation, HTTP headers, TLS fingerprints, browser fingerprints, behavioral patterns, and session management.

Why Detection Happens

Web scraping detection is a multi-layered process. Anti-bot systems do not rely on a single signal — they combine IP reputation, HTTP headers, TLS fingerprints, browser fingerprints, and behavioral analysis to calculate a risk score. When that score exceeds a threshold, you get blocked, served a CAPTCHA, or fed misleading data.

This guide provides a comprehensive approach to reducing detection across all layers. For an overview of how these systems work, see our pillar article on how anti-bot systems detect proxies.

Layer 1: IP Reputation and Proxy Selection

Your IP address is the first thing a server sees. Anti-bot systems maintain databases that score IP addresses by type, history, and behavior.

Proxy Type Selection

| Proxy Type | Detection Risk | Best For |
|---|---|---|
| Residential | Low | Most scraping tasks, protected sites |
| ISP (static residential) | Low-Medium | Long sessions, accounts |
| Datacenter | High | Unprotected sites, high-volume tasks |
| Mobile | Very Low | Highest-protection sites, social media |

For most scraping projects, ProxyHat's residential proxies offer the best balance of low detection risk and cost efficiency. See our detailed proxy type comparison for guidance.

IP Rotation Strategy

# Python: Rotating proxy per request using ProxyHat
import requests
proxy_url = "http://USERNAME:PASSWORD@gate.proxyhat.com:8080"
proxies = {
    "http": proxy_url,
    "https": proxy_url
}
# Each request through the gateway gets a different IP
for url in urls_to_scrape:
    response = requests.get(url, proxies=proxies, timeout=30)
    process(response)  # urls_to_scrape and process() are application-defined

  • Rotate per request for listing pages and search results.
  • Use sticky sessions for multi-page flows (pagination, login sequences).
  • Geo-target your IPs to match the site's expected audience using ProxyHat's location targeting.
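For sticky sessions, most residential gateways pin the exit IP by adding a session ID to the proxy username. The `-session-<id>` suffix below is a common gateway convention used for illustration, not a documented ProxyHat format, so verify the exact syntax in your provider's dashboard:

```python
import uuid

def make_sticky_proxies(username, password, gateway="gate.proxyhat.com:8080"):
    """Build a proxies dict pinned to one exit IP via a session ID.

    The '-session-<id>' username suffix is an assumed convention here --
    check your provider's documentation for the real format.
    """
    session_id = uuid.uuid4().hex[:8]
    proxy_url = f"http://{username}-session-{session_id}:{password}@{gateway}"
    return {"http": proxy_url, "https": proxy_url}

# Reuse the same proxies dict for every page of a multi-step flow
# (login -> dashboard -> paginated results) so all requests share one IP.
proxies = make_sticky_proxies("USERNAME", "PASSWORD")
```

Generating a fresh session ID per flow gives you a new sticky IP for each logical visitor while keeping each flow on a single IP.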

Layer 2: HTTP Headers

Incorrect or missing HTTP headers are one of the easiest signals for anti-bot systems to detect. A real browser sends 15-20 headers in a specific order; a default Python script sends 3-4.

Essential Headers

# Python: Realistic header set
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
    "Accept-Encoding": "gzip, deflate, br, zstd",
    "Cache-Control": "max-age=0",
    "Sec-Ch-Ua": '"Chromium";v="131", "Not_A Brand";v="24"',
    "Sec-Ch-Ua-Mobile": "?0",
    "Sec-Ch-Ua-Platform": '"Windows"',
    "Sec-Fetch-Dest": "document",
    "Sec-Fetch-Mode": "navigate",
    "Sec-Fetch-Site": "none",
    "Sec-Fetch-User": "?1",
    "Upgrade-Insecure-Requests": "1",
    "Connection": "keep-alive"
}
response = requests.get(url, headers=headers, proxies=proxies)

Header Consistency Rules

  • Match Sec-Ch-Ua with User-Agent: If you claim Chrome 131, your Sec-Ch-Ua must reference version 131.
  • Include all Sec-Fetch headers: Modern Chrome sends these on every navigation. Missing them is a strong bot signal.
  • Set Accept-Language to match your proxy geo: A US proxy with Accept-Language: ja-JP is suspicious.
  • Maintain header order: Some anti-bot systems check header ordering. Use libraries that preserve insertion order.
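The first rule is easy to enforce automatically before sending a request. This is a minimal sketch that handles Chrome-style strings only; `headers_consistent` is a helper name introduced here:

```python
import re

def headers_consistent(headers):
    """Return True if the Chrome major version in User-Agent matches Sec-Ch-Ua.

    Minimal sketch: covers Chrome-style header strings only.
    """
    ua_match = re.search(r"Chrome/(\d+)", headers.get("User-Agent", ""))
    ch_match = re.search(r'"Chromium";v="(\d+)"', headers.get("Sec-Ch-Ua", ""))
    if not ua_match or not ch_match:
        return False
    return ua_match.group(1) == ch_match.group(1)

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36",
    "Sec-Ch-Ua": '"Chromium";v="131", "Not_A Brand";v="24"',
}
assert headers_consistent(headers)  # both claim Chrome 131
```

Running this check in CI or at startup catches stale `Sec-Ch-Ua` values after a User-Agent update.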

Layer 3: TLS and HTTP/2 Fingerprinting

Your HTTP client library produces a unique TLS fingerprint that anti-bot systems check against your claimed user-agent. A Chrome user-agent with a Python TLS fingerprint is immediately flagged.

Mitigation by Language

| Language | Default Library | Detection Risk | Browser-Grade Alternative |
|---|---|---|---|
| Python | requests/urllib3 | Very High | curl_cffi with impersonate |
| Node.js | axios/got | High | got-scraping |
| Go | net/http | Very High | uTLS + custom transport |

# Python: Browser-grade TLS with curl_cffi
from curl_cffi import requests as curl_requests
response = curl_requests.get(
    "https://example.com",
    impersonate="chrome",
    proxies={
        "http": "http://USERNAME:PASSWORD@gate.proxyhat.com:8080",
        "https": "http://USERNAME:PASSWORD@gate.proxyhat.com:8080"
    }
)

Layer 4: Browser Fingerprinting

If you are using a headless browser, anti-bot JavaScript probes your browser fingerprint — Canvas, WebGL, AudioContext, navigator properties. The key principle is internal consistency:

  • All fingerprint signals must agree with each other
  • The fingerprint must match your user-agent claims
  • The fingerprint should change when you rotate proxies
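One way to keep these rules is to bundle every consistency-sensitive value into a single profile and rotate the whole profile together with the proxy session. The profiles and the `new_session_identity` helper below are illustrative only; a production profile set also needs matching Canvas/WebGL data:

```python
import random

# Each profile bundles values that must agree with one another for the
# whole life of a session. Illustrative examples, not a validated set.
PROFILES = [
    {
        "user_agent": ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                       "AppleWebKit/537.36 (KHTML, like Gecko) "
                       "Chrome/131.0.0.0 Safari/537.36"),
        "platform": "Windows",
        "viewport": (1920, 1080),
    },
    {
        "user_agent": ("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
                       "AppleWebKit/537.36 (KHTML, like Gecko) "
                       "Chrome/131.0.0.0 Safari/537.36"),
        "platform": "macOS",
        "viewport": (1440, 900),
    },
]

def new_session_identity():
    """Pick one coherent profile per session; never mix fields across profiles."""
    return random.choice(PROFILES)
```

Picking a whole profile at session start (rather than randomizing each field independently) is what keeps the signals internally consistent.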

Stealth Configuration

// Node.js: Puppeteer with stealth and proxy
const puppeteer = require('puppeteer-extra');
const StealthPlugin = require('puppeteer-extra-plugin-stealth');
puppeteer.use(StealthPlugin());
const browser = await puppeteer.launch({
  headless: 'new',
  args: [
    '--proxy-server=http://gate.proxyhat.com:8080',
    '--disable-blink-features=AutomationControlled',
    '--window-size=1920,1080'
  ]
});
const page = await browser.newPage();
await page.authenticate({
  username: 'USERNAME',
  password: 'PASSWORD'
});
await page.setViewport({ width: 1920, height: 1080 });

Layer 5: Behavioral Patterns

Even with perfect technical mimicry, bot-like behavior patterns will trigger detection. Anti-bot systems analyze timing, navigation patterns, and interaction signatures.

Request Timing

  • Add random delays: Humans do not make requests at exact intervals. Add 1-5 seconds of random delay between requests.
  • Vary delays by page type: Content pages deserve longer "reading" pauses than listing pages.
  • Avoid burst patterns: Do not make 50 rapid requests then pause. Distribute requests evenly with natural variance.

# Python: Natural request timing
import time
import random
def scrape_with_natural_timing(urls, proxies):
    for url in urls:
        response = requests.get(url, proxies=proxies, headers=headers)
        process(response)
        # Normally distributed delay (mean ~2.5 s), clamped to at least 0.5 s
        delay = max(0.5, random.gauss(2.5, 0.8))
        time.sleep(delay)

Navigation Patterns

  • Follow natural paths: Visit the homepage first, then category pages, then detail pages, rather than jumping directly to deep URLs.
  • Set proper Referer headers: Each page should reference the previous page as its referer.
  • Handle redirects: Follow HTTP redirects naturally rather than retrying the original URL.

Session Management

  • Maintain cookie jars: Accept and return cookies within a session — discarding all cookies is a bot signal.
  • Limit session length: After 50-100 requests, start a new session with a fresh IP and cookies.
  • Respect rate limits: If you receive 429 responses, back off exponentially rather than retrying immediately.
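The back-off rule can be sketched with requests; `backoff_delay` and `get_with_backoff` are helper names introduced here, and a real implementation should be tuned to the target site:

```python
import random
import time
import requests

def backoff_delay(attempt, retry_after=None):
    """Exponential backoff with jitter; honor Retry-After when the server sends it."""
    if retry_after and str(retry_after).isdigit():
        return int(retry_after)
    return (2 ** attempt) + random.uniform(0, 1)  # 1s, 2s, 4s, 8s ... plus jitter

def get_with_backoff(url, proxies=None, max_retries=5):
    """Retry on 429/503 responses, backing off exponentially between attempts."""
    response = None
    for attempt in range(max_retries):
        response = requests.get(url, proxies=proxies, timeout=30)
        if response.status_code not in (429, 503):
            break
        time.sleep(backoff_delay(attempt, response.headers.get("Retry-After")))
    return response
```

Honoring an explicit `Retry-After` header first, and only falling back to exponential delays, keeps you aligned with what the server is actually asking for.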

Layer 6: Response Validation

Detection does not always result in a block. Sites may serve different content, inject misleading data, or return soft blocks. Always validate your responses:

  • Check status codes: 200 does not always mean success — some sites return 200 with CAPTCHA pages or empty content.
  • Validate content structure: Ensure the response contains expected elements (product prices, article text, etc.).
  • Monitor for honeypots: Hidden links or form fields designed to catch automated crawlers.
  • Track success rates: If your success rate drops below 90%, something has changed and needs investigation.
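These checks can be wrapped in a small validator run on every response. The marker list and the `looks_valid` helper are illustrative; extend both per target site:

```python
# Heuristic soft-block markers; extend per target site
SOFT_BLOCK_MARKERS = ("captcha", "access denied", "unusual traffic", "are you a robot")

def looks_valid(status_code, body, required_snippet):
    """Return True only if the response passes all checks.

    Sketch: a 200 carrying CAPTCHA text, or missing the expected
    content marker, is treated as a soft block.
    """
    if status_code != 200:
        return False
    lowered = body.lower()
    if any(marker in lowered for marker in SOFT_BLOCK_MARKERS):
        return False
    return required_snippet in body

# A 200 serving a CAPTCHA page fails validation:
assert not looks_valid(200, "<h1>Please solve this CAPTCHA</h1>", "product-price")
assert looks_valid(200, '<span class="product-price">$9.99</span>', "product-price")
```

Feeding the boolean result into a rolling success-rate counter is a simple way to implement the 90% alerting threshold mentioned above.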

Comprehensive Anti-Detection Checklist

| Layer | Action | Priority |
|---|---|---|
| IP | Use residential proxies with geo-targeting | Critical |
| IP | Rotate IPs per request or session | Critical |
| Headers | Send complete, realistic header sets | Critical |
| Headers | Match Accept-Language to proxy location | High |
| TLS | Use a browser-grade TLS library | Critical |
| TLS | Match TLS fingerprint to claimed browser | Critical |
| Browser | Use stealth plugins for headless browsers | High |
| Browser | Maintain consistent fingerprint profiles | High |
| Behavior | Add random delays between requests | High |
| Behavior | Follow natural navigation paths | Medium |
| Behavior | Maintain cookies within sessions | Medium |
| Validation | Check response content, not just status codes | High |

Example: Full Anti-Detection Scraper

# Python: Complete anti-detection scraper setup
from curl_cffi import requests as curl_requests
import time
import random
class StealthScraper:
    def __init__(self, proxy_user, proxy_pass):
        self.proxy = f"http://{proxy_user}:{proxy_pass}@gate.proxyhat.com:8080"
        self._new_session()
    def _new_session(self):
        self.session = curl_requests.Session(impersonate="chrome")
        self.session.proxies = {
            "http": self.proxy,
            "https": self.proxy
        }
        self.request_count = 0
        # Pick the rotation threshold once per session; re-rolling it on
        # every request would skew rotation toward the low end of the range
        self.rotate_at = random.randint(50, 80)
    def get(self, url, referer=None):
        headers = {}
        if referer:
            headers["Referer"] = referer
        response = self.session.get(url, headers=headers, timeout=30)
        self.request_count += 1
        # Rotate session (fresh TLS state and cookies) every 50-80 requests
        if self.request_count >= self.rotate_at:
            self._new_session()
        # Natural delay
        time.sleep(max(0.5, random.gauss(2.0, 0.6)))
        return response
# Usage
scraper = StealthScraper("USERNAME", "PASSWORD")
home = scraper.get("https://example.com")
listing = scraper.get("https://example.com/products", referer="https://example.com")
detail = scraper.get("https://example.com/products/123", referer="https://example.com/products")

When to Escalate Your Approach

Start with the simplest approach and escalate only when needed:

  1. Level 1 — HTTP client + headers + proxy: Works for most sites. Use curl_cffi or got-scraping with ProxyHat proxies.
  2. Level 2 — Add browser-grade TLS: Required when the site checks JA3/JA4 fingerprints.
  3. Level 3 — Headless browser + stealth: Necessary for JavaScript-rendered content and sophisticated anti-bot systems.
  4. Level 4 — Full browser automation with behavioral mimicry: Reserved for the most protected sites (Cloudflare Enterprise, PerimeterX, etc.).

For implementation patterns in specific languages, refer to our guides: Python, Node.js, and Go.

Ethical Guidelines

Anti-detection techniques are tools — their ethical use depends on context. Always:

  • Respect robots.txt and terms of service
  • Scrape only publicly available data
  • Limit request rates to avoid server impact
  • Comply with data protection regulations (GDPR, CCPA)
  • Use ethical scraping practices as your baseline

The goal of anti-detection is not to bypass legitimate security. It is to ensure your automated access to public data is not incorrectly flagged as malicious.
