How to Scrape Websites Without Getting Blocked

Learn proven techniques for scraping websites without getting blocked. Covers proxy rotation, header management, rate limiting, and code examples in Python, Node.js, and Go.


Every serious web scraping project eventually hits the same wall: your requests start returning CAPTCHAs, 403 errors, or empty pages. Websites have become remarkably good at detecting automated traffic, and the arms race between scrapers and anti-bot systems is more intense than ever. Whether you are collecting pricing data, monitoring competitor content, or building datasets for AI training, learning to scrape websites without getting blocked is no longer optional — it is fundamental to any reliable data pipeline.

This guide covers the technical reasons behind blocks, the detection signals modern anti-bot systems look for, and proven strategies to keep your scrapers running smoothly. We include working code examples using residential proxies to show how these concepts translate into production-ready implementations.

Why Websites Block Scrapers

Before solving the problem, it helps to understand what you are up against. Websites deploy anti-bot measures for several legitimate reasons:

  • Infrastructure protection — Aggressive scraping can overwhelm servers, degrade performance for real users, and inflate hosting costs.
  • Content protection — Publishers, e-commerce sites, and data providers want to prevent competitors from copying their data at scale.
  • Security — Automated traffic patterns overlap with credential stuffing, DDoS attacks, and vulnerability scanning.
  • Regulatory compliance — Sites handling personal data may restrict automated access to comply with privacy regulations.

Modern websites rely on specialized anti-bot services like Cloudflare Bot Management, Akamai Bot Manager, PerimeterX, and DataDome. These services analyze traffic in real time using a combination of signals, and they share intelligence across their networks — meaning a pattern flagged on one site can trigger blocks across thousands of others.

Detection Signals That Get You Blocked

Anti-bot systems rarely rely on a single indicator. They build a risk score from multiple signals and block requests that exceed a threshold. Here are the key detection vectors:

IP Address Reputation

This is the most fundamental signal. Datacenter IP ranges are well-documented and carry inherently higher risk scores. If your requests originate from AWS, Google Cloud, or any known hosting provider, many anti-bot systems will challenge or block them immediately. Even with residential IPs, sending too many requests from a single address will get it flagged. IP reputation databases are updated in real time, and a burned IP can stay blacklisted for weeks.

Request Rate and Pattern Analysis

Humans do not request 50 pages per second with perfectly uniform intervals. Anti-bot systems track request frequency, timing patterns, and navigation flow. Scraping that follows a perfectly sequential path through paginated results — with identical delays between requests — looks mechanical even if the rate is conservative.

HTTP Fingerprinting

Every HTTP client has a distinctive fingerprint built from the way it speaks the protocol: header order and values, TLS handshake characteristics (JA3/JA4 fingerprints), and HTTP/2 SETTINGS frames. Python's requests library has a completely different fingerprint from Chrome. Anti-bot systems maintain databases of known browser fingerprints and flag anything that does not match.

Browser Fingerprinting and JavaScript Challenges

Advanced anti-bot systems serve JavaScript challenges that inspect the browser environment: canvas rendering, WebGL capabilities, installed fonts, screen resolution, timezone, language preferences, and hundreds of other signals. Headless browsers like Puppeteer and Playwright can be detected through subtle differences — missing browser plugins, incorrect property descriptors on navigator objects, or the absence of expected rendering behaviors.

Behavioral Analysis

Some systems track mouse movements, scroll patterns, and click behavior. A session that navigates directly to data-heavy pages without visiting the homepage first, or that never moves the mouse, signals automation.

Detection Signal              | Risk Level | Mitigation Difficulty | Primary Defense
Datacenter IP range           | Critical   | Easy                  | Use residential proxies
High request rate             | High       | Easy                  | Rate limiting + random delays
Missing/wrong headers         | High       | Medium                | Realistic header profiles
TLS fingerprint mismatch      | High       | Hard                  | TLS fingerprint spoofing libraries
JavaScript challenge failure  | Critical   | Hard                  | Real browser (Playwright/Puppeteer)
Behavioral anomalies          | Medium     | Hard                  | Human-like interaction simulation
Cookie/session anomalies      | Medium     | Medium                | Proper session management

Strategies to Scrape Without Getting Blocked

1. Use Residential Proxies for IP Rotation

The single most effective defense against IP-based blocking is routing your requests through residential proxies. Residential IPs belong to real ISPs and carry the same reputation as regular household internet connections. Anti-bot systems cannot blanket-block residential ranges without affecting legitimate users.

Effective proxy rotation means assigning a different IP to each request or small batch of requests. For session-dependent scraping (where you need to maintain login state or navigate multi-page flows), use sticky sessions that keep the same IP for a defined duration before rotating.

ProxyHat provides automatic rotation with configurable session control. You can target IPs from specific countries, states, or cities to access geo-restricted content while maintaining residential-grade trust scores.

2. Craft Realistic HTTP Headers

Default headers from scraping libraries are a dead giveaway. A request from Python's requests library sends User-Agent: python-requests/2.31.0 — which immediately flags it as automated. Build header profiles that exactly match real browsers:

  • Set a current, complete User-Agent string matching a real browser version
  • Include Accept, Accept-Language, Accept-Encoding, and Sec-CH-UA headers
  • Match the header order to the browser you are impersonating
  • Rotate between multiple browser profiles to avoid a single fingerprint
  • Include a plausible Referer header (e.g., a search engine results page)
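One way to put these rules into practice is to keep a small pool of complete browser profiles and pick one per session. A minimal Python sketch follows; the two profiles below are illustrative, and in production each should be a full, internally consistent header set captured from a real browser:

```python
import random

# Illustrative profile pool: each dict is one coherent browser identity.
HEADER_PROFILES = [
    {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
        "Sec-CH-UA-Platform": '"Windows"',
    },
    {
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
        "Sec-CH-UA-Platform": '"macOS"',
    },
]

def pick_profile():
    """Choose one whole profile per session and reuse it for every request
    in that session; mixing fields across profiles creates a fingerprint
    that matches no real browser."""
    headers = dict(random.choice(HEADER_PROFILES))
    headers["Referer"] = "https://www.google.com/"  # plausible entry point
    return headers
```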

3. Implement Smart Rate Limiting

Uniform delays are nearly as suspicious as no delays at all. Implement randomized delays that follow a realistic distribution:

  • Base delay of 2-5 seconds between requests
  • Add random jitter of plus or minus 30-50%
  • Insert longer pauses (15-30 seconds) every 20-50 requests
  • Reduce concurrency per domain — 2-3 parallel requests maximum
  • Implement exponential backoff when you receive rate-limit signals (429 status codes)
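The schedule above can be sketched as a small delay function; the constants are illustrative starting points, not tuned values:

```python
import random

def polite_delay(request_count):
    """Return a human-looking delay in seconds: a 2-5 s base with up to
    +/-50% multiplicative jitter, plus an occasional longer 'reading'
    pause roughly every 20-50 requests."""
    delay = random.uniform(2.0, 5.0)
    delay *= random.uniform(0.5, 1.5)  # +/-50% jitter
    # Occasional long pause; the interval itself is randomized so the
    # pauses do not land on a fixed period.
    if request_count > 0 and request_count % random.randint(20, 50) == 0:
        delay += random.uniform(15.0, 30.0)
    return delay
```

Call `time.sleep(polite_delay(i))` between requests instead of a fixed `sleep(3)`.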

4. Manage Sessions and Cookies Properly

Many websites assign tracking cookies on the first visit and expect them on subsequent requests. A scraper that never sends cookies, or that sends fresh cookies on every request, triggers anomaly detection. Maintain a cookie jar per session, and carry cookies across requests within a logical browsing session.
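A minimal illustration using Python's requests library (not the ProxyHat SDK) shows how a session cookie jar replays cookies automatically. The cookie here is set by hand to stand in for a server's Set-Cookie response:

```python
import requests

# One Session object per logical browsing session: cookies returned by the
# server are stored in the session's cookie jar and replayed automatically.
session = requests.Session()
session.headers.update({"Accept-Language": "en-US,en;q=0.9"})

# Simulate what a Set-Cookie response does: put a cookie in the jar...
session.cookies.set("visitor_id", "abc123", domain="example.com")

# ...and every later request to that domain carries it automatically.
req = requests.Request("GET", "https://example.com/products")
prepared = session.prepare_request(req)
print(prepared.headers.get("Cookie"))  # visitor_id=abc123
```

When you rotate to a new proxy identity, discard the Session and start a fresh one so cookies and IP change together.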

5. Handle JavaScript-Rendered Content

For sites that require JavaScript execution, use a real browser engine through Playwright or Puppeteer. But running headless browsers without precautions is easily detected. Key hardening steps include:

  • Use playwright-extra or puppeteer-extra with stealth plugins
  • Set a realistic viewport size (not the default 800x600)
  • Enable WebGL and inject consistent GPU renderer strings
  • Set timezone and locale to match your proxy's geographic location
  • Add random mouse movements and scroll actions before extracting data
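A sketch of the consistency step: building hardened context options for Playwright for Python. The keys are real `browser.new_context()` parameters; the country-to-timezone mapping and the example values are assumptions you should replace with data from your proxy provider:

```python
def hardened_context_options(proxy_country="US"):
    """Return keyword arguments for browser.new_context() so that the
    browser's locale/timezone match the proxy's geography.
    The geo table below is a hypothetical stand-in."""
    geo = {
        "US": {"timezone_id": "America/New_York", "locale": "en-US"},
        "DE": {"timezone_id": "Europe/Berlin", "locale": "de-DE"},
    }[proxy_country]
    return {
        "viewport": {"width": 1366, "height": 768},  # common real size, not the default 800x600
        "locale": geo["locale"],
        "timezone_id": geo["timezone_id"],
        "user_agent": (
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
            "(KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36"
        ),
    }

# Usage (assuming Playwright is installed and `browser` is launched):
#   context = browser.new_context(**hardened_context_options("DE"))
#   page = context.new_page()
#   page.mouse.move(240, 180)  # small human-like gestures before extraction
#   page.mouse.wheel(0, 600)
```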

6. Respect robots.txt and Implement Backoff

While robots.txt is not legally binding in all jurisdictions, respecting it demonstrates good faith. More practically, sites that see you ignoring robots.txt are more likely to implement aggressive blocking. Always implement automatic backoff when you receive 429 (Too Many Requests) or 503 (Service Unavailable) responses — these are explicit signals to slow down.
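A generic backoff wrapper might look like the sketch below; `fetch` stands in for whatever client call you use (the ProxyHat client's `get`, plain requests, etc.), and is assumed to return an object with a `status_code` attribute:

```python
import random
import time

def backoff_request(fetch, url, max_retries=5, base=1.0):
    """Retry with exponential backoff on explicit slow-down signals.
    Waits base*1, base*2, base*4, ... seconds plus jitter between attempts."""
    for attempt in range(max_retries):
        response = fetch(url)
        if response.status_code not in (429, 503):
            return response
        wait = base * (2 ** attempt) + random.uniform(0, 1)
        time.sleep(wait)
    return response  # give the caller the last (still failing) response
```

On a 429/503 you should usually also rotate to a new proxy session before retrying, not just wait.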

Code Examples: Scraping with ProxyHat Residential Proxies

The following examples demonstrate how to configure residential proxy rotation with realistic headers. Each example uses the ProxyHat SDK for the respective language. For full API documentation, see the ProxyHat docs.

Python Example

Install the SDK: pip install proxyhat

import time
import random
from proxyhat import ProxyHatClient
client = ProxyHatClient(
    api_key="your_api_key",
    country="US",
    session_type="rotating",  # New IP per request
)
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
    "Accept-Encoding": "gzip, deflate, br",
    "Sec-CH-UA": '"Chromium";v="131", "Not_A Brand";v="24"',
    "Sec-CH-UA-Mobile": "?0",
    "Sec-CH-UA-Platform": '"Windows"',
}
urls = [
    "https://example.com/page/1",
    "https://example.com/page/2",
    "https://example.com/page/3",
]
for url in urls:
    response = client.get(url, headers=headers)
    print(f"{response.status_code} - {url} via {response.proxy_ip}")
    # Randomized delay: 2-5 seconds with jitter
    delay = random.uniform(2.0, 5.0)
    time.sleep(delay)

Node.js Example

Install the SDK: npm install @proxyhat/sdk

const { ProxyHatClient } = require("@proxyhat/sdk");
const client = new ProxyHatClient({
  apiKey: "your_api_key",
  country: "US",
  sessionType: "rotating",
});
const headers = {
  "User-Agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36",
  Accept:
    "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8",
  "Accept-Language": "en-US,en;q=0.9",
};
const urls = [
  "https://example.com/page/1",
  "https://example.com/page/2",
  "https://example.com/page/3",
];
async function scrape() {
  for (const url of urls) {
    const response = await client.get(url, { headers });
    console.log(`${response.status} - ${url} via ${response.proxyIp}`);
    // Randomized delay between requests
    const delay = 2000 + Math.random() * 3000;
    await new Promise((r) => setTimeout(r, delay));
  }
}
scrape();

Go Example

Install the SDK: go get github.com/ProxyHatCom/go-sdk

package main
import (
    "fmt"
    "math/rand"
    "time"
    proxyhat "github.com/ProxyHatCom/go-sdk"
)
func main() {
    client := proxyhat.NewClient(&proxyhat.Config{
        APIKey:      "your_api_key",
        Country:     "US",
        SessionType: proxyhat.Rotating,
    })
    headers := map[string]string{
        "User-Agent":      "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36",
        "Accept":          "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
    }
    urls := []string{
        "https://example.com/page/1",
        "https://example.com/page/2",
        "https://example.com/page/3",
    }
    for _, url := range urls {
        resp, err := client.Get(url, proxyhat.WithHeaders(headers))
        if err != nil {
            fmt.Printf("Error: %v\n", err)
            continue
        }
        fmt.Printf("%d - %s via %s\n", resp.StatusCode, url, resp.ProxyIP)
        // Randomized delay: 2-5 seconds
        delay := time.Duration(2000+rand.Intn(3000)) * time.Millisecond
        time.Sleep(delay)
    }
}

Sticky Sessions for Multi-Page Flows

Some scraping tasks require maintaining the same IP address across multiple requests — for example, navigating a paginated product listing, maintaining a logged-in session, or completing a multi-step form. ProxyHat supports sticky sessions that hold the same residential IP for a configurable duration.

# Python: Sticky session example
from proxyhat import ProxyHatClient
client = ProxyHatClient(
    api_key="your_api_key",
    country="DE",
    session_type="sticky",
    session_ttl=300,  # Same IP for 5 minutes
)
# All requests within the session use the same IP.
# headers is the profile dict from the rotating example above;
# credentials is your login form payload dict.
response1 = client.get("https://example.com/login", headers=headers)
response2 = client.post("https://example.com/login", data=credentials, headers=headers)
response3 = client.get("https://example.com/dashboard", headers=headers)
print(f"Session IP: {response1.proxy_ip}")  # Same IP for all three requests

Common Mistakes That Trigger Blocks

Even experienced developers make these errors. Each one can burn through proxy bandwidth and get IPs flagged unnecessarily:

  • Using default library headers — The python-requests User-Agent string is on every blocklist. Always set custom headers.
  • Ignoring TLS fingerprints — Your headers might say "Chrome" but your TLS handshake says "Python." Use libraries like curl_cffi or tls-client that impersonate real browser TLS fingerprints.
  • Scraping too fast on initial launch — Start slow. Ramp up request rates gradually over hours, not minutes.
  • Not handling errors gracefully — Retrying blocked requests immediately with the same configuration wastes bandwidth and confirms you are a bot. Implement backoff and switch proxy sessions on errors.
  • Reusing burned IPs — If a request returns a CAPTCHA or block page, that IP is compromised for that target. Rotate to a new session immediately.
  • Ignoring geographic consistency — Sending requests from a US IP with Accept-Language: ja and a timezone offset of +9 looks suspicious. Match your headers and browser settings to your proxy's location.
  • Not monitoring success rates — Without tracking your block rate, you cannot tell if your strategy is working. Log every response status and alert on success rate drops.
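The monitoring point above can be sketched as a sliding-window success tracker; the window size, the 95% threshold, and the minimum-body-size heuristic are illustrative and should be tuned per target:

```python
from collections import deque

class BlockRateMonitor:
    """Track success rate over a sliding window of recent responses and
    flag when it drops below a threshold. Wire should_alert() into your
    own logging or paging system."""

    def __init__(self, window=200, threshold=0.95):
        self.results = deque(maxlen=window)
        self.threshold = threshold

    def record(self, status_code, body_length):
        # Treat explicit blocks (403/429) and suspiciously tiny bodies
        # (often challenge pages) as failures.
        ok = status_code == 200 and body_length > 1024
        self.results.append(ok)
        return ok

    @property
    def success_rate(self):
        return sum(self.results) / len(self.results) if self.results else 1.0

    def should_alert(self):
        # Require a minimum sample so one early failure does not page you.
        return len(self.results) >= 20 and self.success_rate < self.threshold
```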

Advanced Techniques for High-Value Targets

Fingerprint Randomization

For heavily protected sites, rotate not just IPs but entire browser fingerprint profiles. Each session should have a consistent combination of User-Agent, screen resolution, timezone, language, and platform — and these should match realistic combinations. A Windows User-Agent with a Linux platform string is an obvious red flag.
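A sketch of profile-level rotation: each profile is picked whole, never mixed field by field. The pool below is illustrative; in practice, build it from fingerprints captured from real browsers:

```python
import random

# Illustrative pool: each entry is an internally consistent bundle.
FINGERPRINT_PROFILES = [
    {
        "user_agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36",
        "platform": "Win32",
        "screen": (1920, 1080),
        "timezone": "America/Chicago",
        "language": "en-US",
    },
    {
        "user_agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36",
        "platform": "MacIntel",
        "screen": (1440, 900),
        "timezone": "America/Los_Angeles",
        "language": "en-US",
    },
]

def new_session_fingerprint():
    """Pick one whole profile per session; never recombine fields across
    profiles, since a Windows User-Agent with a Mac platform string is an
    instant red flag."""
    return random.choice(FINGERPRINT_PROFILES)
```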

Request Chain Simulation

Real users do not jump directly to product pages. They arrive from search engines, browse category pages, and follow internal links. Build your scraper to simulate realistic navigation paths: load the homepage, follow links to category pages, then access the target data. This generates a believable session pattern.

SERP Scraping Considerations

Search engine scraping has unique challenges because Google, Bing, and others have particularly aggressive bot detection. Residential proxies are essential for reliable SERP tracking, and you should distribute requests across multiple geographic locations to avoid triggering rate limits from any single region.

Choosing the Right Proxy Type

Not every scraping job requires residential proxies. The right choice depends on your target's defenses and your budget. See our detailed comparison of proxy types for a deep dive. Here is a quick decision matrix:

Use Case                     | Recommended Proxy Type    | Reason
General web scraping         | Residential rotating      | Best balance of trust and cost
E-commerce price monitoring  | Residential rotating      | High anti-bot protection on most retailers
SERP tracking                | Residential geo-targeted  | Search engines block datacenter IPs aggressively
Social media scraping        | Mobile proxies            | Highest trust for platforms that expect mobile traffic
Public API access            | Datacenter                | Low anti-bot risk, cheapest option
Sneaker/ticket sites         | Residential sticky        | Session persistence with residential trust

For most scraping projects, residential rotating proxies offer the best combination of reliability and cost-effectiveness. ProxyHat pricing is based on bandwidth consumption, so you only pay for successful data transfer.

Key Takeaways

  • Residential proxies are the foundation — Datacenter IPs get blocked immediately on most protected sites. Residential IPs carry natural trust.
  • Headers matter as much as IPs — A residential IP with default Python headers still gets blocked. Build complete, realistic header profiles.
  • Randomize everything — Delays, header combinations, navigation paths. Predictable patterns are detectable patterns.
  • Monitor and adapt — Track your success rate. When blocks increase, investigate and adjust before burning through your proxy pool.
  • Match your fingerprint — Every signal should tell a consistent story: User-Agent, TLS fingerprint, timezone, language, and geographic location must align.
  • Start slow, scale gradually — Begin with conservative rate limits and increase only after confirming your setup works reliably.
  • Use sticky sessions for stateful flows — Login sequences and multi-page navigation need IP consistency. Use sticky sessions with appropriate TTLs.

Frequently Asked Questions

How do I know if my scraper is being blocked?

Common signs include receiving HTTP 403 or 429 status codes, being redirected to CAPTCHA pages, getting empty response bodies where you expect HTML content, or receiving different content than what you see in a regular browser. Monitor your response status codes and content length — a sudden drop in average response size often indicates soft blocks where the site returns a challenge page instead of the actual content.
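These signals can be folded into a simple heuristic check; the block-page markers and the size threshold below are illustrative and should be tuned per target:

```python
def looks_blocked(status_code, body):
    """Heuristic soft-block detector: explicit block codes, challenge-page
    markers in the body, or a body far smaller than a real content page."""
    if status_code in (403, 429, 503):
        return True
    markers = ("captcha", "are you a robot", "access denied", "unusual traffic")
    lowered = body.lower()
    if any(m in lowered for m in markers):
        return True
    # Real product or article pages are rarely this small.
    return len(body) < 2048
```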

Are residential proxies enough to avoid all blocks?

Residential proxies eliminate IP-based blocking, which is the most common detection method, but they are not a complete solution on their own. You still need realistic headers, proper rate limiting, and session management. Think of residential proxies as the foundation — they solve the hardest problem (IP reputation), but the other layers of your scraping stack must also be solid. For the most protected sites, combine residential proxies with browser fingerprint impersonation using tools like curl_cffi or stealth-configured Playwright.

How many requests per second can I send without getting blocked?

There is no universal answer because it depends on the target website's defenses. As a conservative starting point, limit yourself to 1 request every 2-5 seconds per domain with rotating IPs. For less protected sites, you can gradually increase to 5-10 concurrent requests. For heavily protected sites like Google or Amazon, stay under 1 request per 3 seconds even with residential proxies. Always ramp up gradually and monitor your success rate — if it drops below 95%, you are going too fast.

What is the difference between rotating and sticky proxy sessions?

Rotating sessions assign a new IP address to each request, which is ideal for scraping independent pages where no state needs to persist between requests. Sticky sessions maintain the same IP for a configured duration (typically 1-30 minutes), which is necessary for login flows, paginated navigation, or any multi-step process where the server tracks your IP. Use rotating sessions by default and switch to sticky only when your use case specifically requires session continuity.

Is web scraping legal?

Web scraping legality varies by jurisdiction, the type of data being collected, and how it is used. In the United States, the 2022 hiQ Labs v. LinkedIn ruling established that scraping publicly available data does not violate the Computer Fraud and Abuse Act. In the EU, the GDPR applies to personal data regardless of how it is collected. As a general rule: scraping publicly available, non-personal data for legitimate business purposes is broadly accepted. Always review a website's Terms of Service, respect robots.txt as a courtesy, and consult legal counsel for your specific use case.

Ready to Get Started?

Access 50M+ residential IPs across 148+ countries with AI-powered filtering.
