Why Detection Happens
Web scraping detection is a multi-layered process. Anti-bot systems do not rely on a single signal — they combine IP reputation, HTTP headers, TLS fingerprints, browser fingerprints, and behavioral analysis to calculate a risk score. When that score exceeds a threshold, you get blocked, served a CAPTCHA, or fed misleading data.
This guide provides a comprehensive approach to reducing detection across all layers. For an overview of how these systems work, see our pillar article on how anti-bot systems detect proxies.
Layer 1: IP Reputation and Proxy Selection
Your IP address is the first thing a server sees. Anti-bot systems maintain databases that score IP addresses by type, history, and behavior.
Proxy Type Selection
| Proxy Type | Detection Risk | Best For |
|---|---|---|
| Residential | Low | Most scraping tasks, protected sites |
| ISP (Static Residential) | Low-Medium | Long sessions, accounts |
| Datacenter | High | Unprotected sites, high-volume tasks |
| Mobile | Very Low | Highest protection sites, social media |
For most scraping projects, ProxyHat's residential proxies offer the best balance of low detection risk and cost efficiency. See our detailed proxy type comparison for guidance.
IP Rotation Strategy
```python
# Python: Rotating proxy per request using ProxyHat
import requests

proxy_url = "http://USERNAME:PASSWORD@gate.proxyhat.com:8080"
proxies = {
    "http": proxy_url,
    "https": proxy_url
}

# Each request through the gateway gets a different IP
for url in urls_to_scrape:
    response = requests.get(url, proxies=proxies, timeout=30)
    process(response)
```
- Rotate per request for listing pages and search results.
- Use sticky sessions for multi-page flows (pagination, login sequences).
- Geo-target your IPs to match the site's expected audience using ProxyHat's location targeting.
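For multi-page flows, a sticky session is typically requested by embedding a session ID in the proxy username. The exact syntax varies by provider — the `USERNAME-session-<id>` convention below is an assumption for illustration; check ProxyHat's documentation for the actual format.

```python
# Sketch: sticky sessions via a session ID in the proxy username.
# The "-session-<id>" username convention is assumed, not confirmed.
import uuid

def sticky_proxies(username, password):
    """Build a proxies dict pinned to one exit IP for a multi-page flow."""
    session_id = uuid.uuid4().hex[:8]  # a new ID starts a new sticky session
    proxy_url = f"http://{username}-session-{session_id}:{password}@gate.proxyhat.com:8080"
    return {"http": proxy_url, "https": proxy_url}

# One sticky session for an entire login-plus-pagination flow:
proxies = sticky_proxies("USERNAME", "PASSWORD")
```

Generate a fresh session ID whenever you want a new IP, and reuse the same dict for every request that belongs to the same logical session.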
Layer 2: HTTP Headers
Incorrect or missing HTTP headers are one of the easiest signals for anti-bot systems to detect. A real browser sends 15-20 headers in a specific order; a default Python script sends 3-4.
Essential Headers
```python
# Python: Realistic header set
import requests

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
    "Accept-Encoding": "gzip, deflate, br, zstd",
    "Cache-Control": "max-age=0",
    "Sec-Ch-Ua": '"Chromium";v="131", "Not_A Brand";v="24"',
    "Sec-Ch-Ua-Mobile": "?0",
    "Sec-Ch-Ua-Platform": '"Windows"',
    "Sec-Fetch-Dest": "document",
    "Sec-Fetch-Mode": "navigate",
    "Sec-Fetch-Site": "none",
    "Sec-Fetch-User": "?1",
    "Upgrade-Insecure-Requests": "1",
    "Connection": "keep-alive"
}
response = requests.get(url, headers=headers, proxies=proxies)
```
Header Consistency Rules
- Match Sec-Ch-Ua with User-Agent: If you claim Chrome 131, your Sec-Ch-Ua must reference version 131.
- Include all Sec-Fetch headers: Modern Chrome sends these on every navigation. Missing them is a strong bot signal.
- Set Accept-Language to match your proxy geo: A US proxy with Accept-Language: ja-JP is suspicious.
- Maintain header order: Some anti-bot systems check header ordering. Use libraries that preserve insertion order.
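A consistency mistake is easy to make when rotating user-agents, so it is worth checking programmatically before sending. A minimal sketch (helper name illustrative) that verifies the Chrome major version claimed in User-Agent matches Sec-Ch-Ua:

```python
# Sketch: verify the Chrome major version agrees across User-Agent
# and Sec-Ch-Ua before sending a request.
import re

def headers_consistent(headers):
    """Return True if User-Agent and Sec-Ch-Ua claim the same Chrome version."""
    ua_match = re.search(r"Chrome/(\d+)", headers.get("User-Agent", ""))
    ch_match = re.search(r'"Chromium";v="(\d+)"', headers.get("Sec-Ch-Ua", ""))
    if not ua_match or not ch_match:
        return False  # one of the headers is missing or malformed
    return ua_match.group(1) == ch_match.group(1)
```

Running this check against every header profile in your rotation pool catches version drift before the target site does.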
Layer 3: TLS and HTTP/2 Fingerprinting
Your HTTP client library produces a unique TLS fingerprint that anti-bot systems check against your claimed user-agent. A Chrome user-agent with a Python TLS fingerprint is immediately flagged.
Mitigation by Language
| Language | Default Library | Detection Risk | Browser-Grade Alternative |
|---|---|---|---|
| Python | requests/urllib3 | Very High | curl_cffi with impersonate |
| Node.js | axios/got | High | got-scraping |
| Go | net/http | Very High | uTLS + custom transport |
```python
# Python: Browser-grade TLS with curl_cffi
from curl_cffi import requests as curl_requests

response = curl_requests.get(
    "https://example.com",
    impersonate="chrome",
    proxies={
        "http": "http://USERNAME:PASSWORD@gate.proxyhat.com:8080",
        "https": "http://USERNAME:PASSWORD@gate.proxyhat.com:8080"
    }
)
```
Layer 4: Browser Fingerprinting
If you are using a headless browser, anti-bot JavaScript probes your browser fingerprint — Canvas, WebGL, AudioContext, navigator properties. The key principle is internal consistency:
- All fingerprint signals must agree with each other
- The fingerprint must match your user-agent claims
- The fingerprint should change when you rotate proxies
Stealth Configuration
```javascript
// Node.js: Puppeteer with stealth and proxy
const puppeteer = require('puppeteer-extra');
const StealthPlugin = require('puppeteer-extra-plugin-stealth');

puppeteer.use(StealthPlugin());

(async () => {
  const browser = await puppeteer.launch({
    headless: 'new',
    args: [
      '--proxy-server=http://gate.proxyhat.com:8080',
      '--disable-blink-features=AutomationControlled',
      '--window-size=1920,1080'
    ]
  });

  const page = await browser.newPage();
  await page.authenticate({
    username: 'USERNAME',
    password: 'PASSWORD'
  });
  await page.setViewport({ width: 1920, height: 1080 });
})();
```
Layer 5: Behavioral Patterns
Even with perfect technical mimicry, bot-like behavior patterns will trigger detection. Anti-bot systems analyze timing, navigation patterns, and interaction signatures.
Request Timing
- Add random delays: Humans do not make requests at exact intervals. Add 1-5 seconds of random delay between requests.
- Vary delays by page type: Content pages deserve longer "reading" pauses than listing pages.
- Avoid burst patterns: Do not make 50 rapid requests then pause. Distribute requests evenly with natural variance.
```python
# Python: Natural request timing
import random
import time

import requests

def scrape_with_natural_timing(urls, proxies):
    for url in urls:
        response = requests.get(url, proxies=proxies, headers=headers)
        process(response)
        # Random delay: 1-4 seconds with normal distribution
        delay = max(0.5, random.gauss(2.5, 0.8))
        time.sleep(delay)
```
Navigation Patterns
- Follow natural paths: Visit the homepage first, then category pages, then detail pages, rather than jumping directly to deep URLs.
- Set proper Referer headers: Each page should reference the previous page as its referer.
- Handle redirects: Follow HTTP redirects naturally rather than retrying the original URL.
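The referer-chain rule above can be sketched as a small helper (name illustrative) that pairs each URL in a crawl path with the page that should have linked to it:

```python
# Sketch: pair each URL with the previous one as its Referer,
# so the crawl path looks like a user clicking through the site.
def with_referers(urls, start_referer=None):
    """Yield (url, referer) pairs; the first page has no referer by default."""
    referer = start_referer
    for url in urls:
        yield url, referer
        referer = url

# for url, referer in with_referers(crawl_path):
#     headers = {"Referer": referer} if referer else {}
#     requests.get(url, headers=headers, proxies=proxies)
```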
Session Management
- Maintain cookie jars: Accept and return cookies within a session — discarding all cookies is a bot signal.
- Limit session length: After 50-100 requests, start a new session with a fresh IP and cookies.
- Respect rate limits: If you receive 429 responses, back off exponentially rather than retrying immediately.
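The back-off rule can be sketched as follows. The function names and the choice of honoring a Retry-After header when present are illustrative, not a prescribed API:

```python
# Sketch: exponential backoff with jitter on 429 responses.
import random
import time

def backoff_delay(attempt, retry_after=None):
    """Delay before the next retry: honor Retry-After when the server
    sends one, otherwise exponential backoff (1s, 2s, 4s, ...) plus jitter."""
    if retry_after is not None:
        return float(retry_after)
    return (2 ** attempt) + random.uniform(0, 1)

def fetch_with_backoff(fetch, url, max_retries=5):
    """fetch(url) must return an object with .status_code and .headers,
    e.g. functools.partial(requests.get, proxies=proxies, timeout=30)."""
    for attempt in range(max_retries):
        response = fetch(url)
        if response.status_code != 429:
            return response
        time.sleep(backoff_delay(attempt, response.headers.get("Retry-After")))
    return response  # still rate-limited after max_retries
```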
Layer 6: Response Validation
Detection does not always result in a block. Sites may serve different content, inject misleading data, or return soft blocks. Always validate your responses:
- Check status codes: 200 does not always mean success — some sites return 200 with CAPTCHA pages or empty content.
- Validate content structure: Ensure the response contains expected elements (product prices, article text, etc.).
- Monitor for honeypots: Hidden links or form fields designed to catch automated crawlers.
- Track success rates: If your success rate drops below 90%, something has changed and needs investigation.
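These checks can be combined into a single soft-block detector. The CAPTCHA markers and the minimum-length threshold below are illustrative values to tune per target site:

```python
# Sketch: flag responses that look like soft blocks rather than real content.
# Marker strings and the 500-character floor are example thresholds.
CAPTCHA_MARKERS = ("captcha", "cf-challenge", "are you a robot")

def looks_like_soft_block(status_code, body):
    """Return True if the response is probably a block page, not content."""
    if status_code in (403, 429):
        return True
    lowered = body.lower()
    if any(marker in lowered for marker in CAPTCHA_MARKERS):
        return True  # 200 OK wrapping a challenge page
    return len(body) < 500  # suspiciously short for a content URL
```

Wire this into your pipeline so flagged responses are retried with a fresh IP instead of being parsed as data.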
Comprehensive Anti-Detection Checklist
| Layer | Action | Priority |
|---|---|---|
| IP | Use residential proxies with geo-targeting | Critical |
| IP | Rotate IPs per request or session | Critical |
| Headers | Send complete, realistic header sets | Critical |
| Headers | Match Accept-Language to proxy location | High |
| TLS | Use browser-grade TLS library | Critical |
| TLS | Match TLS fingerprint to claimed browser | Critical |
| Browser | Use stealth plugins for headless browsers | High |
| Browser | Maintain consistent fingerprint profiles | High |
| Behavior | Add random delays between requests | High |
| Behavior | Follow natural navigation paths | Medium |
| Behavior | Maintain cookies within sessions | Medium |
| Validation | Check response content, not just status codes | High |
Example: Full Anti-Detection Scraper
```python
# Python: Complete anti-detection scraper setup
import random
import time

from curl_cffi import requests as curl_requests

class StealthScraper:
    def __init__(self, proxy_user, proxy_pass):
        self.proxy = f"http://{proxy_user}:{proxy_pass}@gate.proxyhat.com:8080"
        self.session = curl_requests.Session(impersonate="chrome")
        self.session.proxies = {
            "http": self.proxy,
            "https": self.proxy
        }
        self.request_count = 0

    def get(self, url, referer=None):
        headers = {}
        if referer:
            headers["Referer"] = referer
        response = self.session.get(url, headers=headers, timeout=30)
        self.request_count += 1
        # Rotate session every 50-80 requests
        if self.request_count >= random.randint(50, 80):
            self._rotate_session()
        # Natural delay
        time.sleep(max(0.5, random.gauss(2.0, 0.6)))
        return response

    def _rotate_session(self):
        self.session = curl_requests.Session(impersonate="chrome")
        self.session.proxies = {
            "http": self.proxy,
            "https": self.proxy
        }
        self.request_count = 0

# Usage
scraper = StealthScraper("USERNAME", "PASSWORD")
home = scraper.get("https://example.com")
listing = scraper.get("https://example.com/products", referer="https://example.com")
detail = scraper.get("https://example.com/products/123", referer="https://example.com/products")
```
When to Escalate Your Approach
Start with the simplest approach and escalate only when needed:
- Level 1 — HTTP client + headers + proxy: Works for most sites. Use curl_cffi or got-scraping with ProxyHat proxies.
- Level 2 — Add browser-grade TLS: Required when the site checks JA3/JA4 fingerprints.
- Level 3 — Headless browser + stealth: Necessary for JavaScript-rendered content and sophisticated anti-bot systems.
- Level 4 — Full browser automation with behavioral mimicry: Reserved for the most protected sites (Cloudflare Enterprise, PerimeterX, etc.).
For implementation patterns in specific languages, refer to our guides: Python, Node.js, and Go.
Ethical Guidelines
Anti-detection techniques are tools — their ethical use depends on context. Always:
- Respect robots.txt and terms of service
- Scrape only publicly available data
- Limit request rates to avoid server impact
- Comply with data protection regulations (GDPR, CCPA)
- Use ethical scraping practices as your baseline
The goal of anti-detection is not to bypass legitimate security. It is to ensure your automated access to public data is not incorrectly flagged as malicious.