How to Scrape Amazon Product Data with Proxies

Learn how to scrape Amazon product data using proxies. This guide covers practical code examples and strategies for parsing product pages, collecting reviews, monitoring prices, and avoiding blocks.


Why Scrape Amazon Product Data?

Amazon hosts over 350 million products across dozens of marketplaces worldwide. For e-commerce businesses, this data is invaluable: competitor pricing, product descriptions, customer reviews, Best Sellers Rank, and inventory signals can all drive smarter decisions. Whether you are building a price monitoring tool, conducting market research, or training an AI model, Amazon product data is one of the highest-value targets on the web.

The challenge is that Amazon invests heavily in anti-bot defenses. Without the right proxy strategy, your scrapers will hit CAPTCHAs, IP blocks, and misleading responses within minutes. This guide walks you through the architecture, code, and proxy configuration needed to scrape Amazon reliably at scale.

Amazon's Anti-Bot Protections

Before writing a single line of code, you need to understand what you are up against. Amazon employs a layered detection system that analyzes every incoming request.

Request Fingerprinting

Amazon inspects HTTP headers, TLS fingerprints, and request ordering. Requests missing standard browser headers or using known bot signatures are flagged immediately. The Accept-Language, Accept-Encoding, and User-Agent headers must be consistent and realistic.

Behavioral Analysis

Requests arriving at a rate no human could achieve, or following predictable patterns (e.g., sequential ASINs), trigger rate limiting. Amazon tracks session behavior across multiple requests, so each IP needs to behave like a genuine shopper.

CAPTCHA Challenges

When Amazon suspects automated traffic, it serves a CAPTCHA page instead of product data. Residential IPs receive far fewer CAPTCHAs than datacenter IPs because they share the same IP pools as real Amazon shoppers. For a deeper look at detection methods, see our guide on how anti-bot systems detect proxies.
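Before parsing, a scraper should check whether the response is actually product HTML or a CAPTCHA interstitial. A minimal detection sketch; the marker strings below are based on text Amazon's block page has commonly contained, so treat them as assumptions that may need updating:

```python
def is_captcha_page(html: str) -> bool:
    """Heuristically detect Amazon's CAPTCHA interstitial in a response body."""
    markers = (
        "Enter the characters you see below",  # CAPTCHA prompt text
        "api-services-support@amazon.com",     # support address shown on the block page
        "/errors/validateCaptcha",             # form action on the CAPTCHA page
    )
    return any(marker in html for marker in markers)
```

If `is_captcha_page(response.text)` returns True, discard the response, back off, and retry through a fresh IP rather than parsing the page.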

Key takeaway: Residential proxies with proper rotation are essential for sustained Amazon scraping. Datacenter proxies will get blocked within hours.

Data You Can Extract from Amazon

Data Point | Source Page | Use Case
Product title, images, description | Product detail page | Catalog building, content analysis
Current price, deal price, list price | Product detail / offer listing | Price monitoring, repricing
Customer reviews and ratings | Review pages | Sentiment analysis, product research
Best Sellers Rank (BSR) | Product detail page | Market demand estimation
Buy Box seller, shipping info | Product detail page | Competitor tracking
Search result rankings | Search results pages | SEO and advertising optimization
Category hierarchy | Browse nodes | Taxonomy mapping

Setting Up Your Proxy Configuration

ProxyHat's residential proxy gateway provides the IP diversity and geo-targeting needed for Amazon scraping. Connect through the gateway to rotate IPs automatically on every request, or maintain sticky sessions when needed.

Basic Connection

# HTTP proxy
http://USERNAME:PASSWORD@gate.proxyhat.com:8080
# With geo-targeting (US Amazon)
http://USERNAME-country-US:PASSWORD@gate.proxyhat.com:8080
# With sticky session (maintain same IP for a browsing session)
http://USERNAME-session-amz001:PASSWORD@gate.proxyhat.com:8080

For Amazon scraping, we recommend targeting the country matching the marketplace you are scraping. Scraping amazon.de? Use German IPs. Scraping amazon.co.jp? Use Japanese IPs. Check available locations for the full list.
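The marketplace-to-country mapping can live in a small helper so the matching geo-targeted proxy URL is built automatically. A sketch assuming the `USERNAME-country-XX` credential format shown above; `proxy_for_marketplace` is an illustrative name, not part of any SDK:

```python
# Map Amazon marketplace TLDs to proxy country codes
MARKETPLACE_COUNTRY = {
    "com": "US", "de": "DE", "co.jp": "JP", "co.uk": "GB",
    "fr": "FR", "it": "IT", "es": "ES", "ca": "CA",
}

def proxy_for_marketplace(marketplace: str, username: str, password: str) -> str:
    """Build a geo-targeted proxy URL matching the Amazon marketplace TLD."""
    country = MARKETPLACE_COUNTRY[marketplace]
    return f"http://{username}-country-{country}:{password}@gate.proxyhat.com:8080"
```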

Python Implementation

Here is a complete Python scraper for Amazon product data using requests and BeautifulSoup, routed through ProxyHat's gateway.

Basic Product Scraper

import requests
from bs4 import BeautifulSoup
import random
import time
import json

PROXY_URL = "http://USERNAME:PASSWORD@gate.proxyhat.com:8080"

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:125.0) Gecko/20100101 Firefox/125.0",
]

def get_amazon_product(asin, marketplace="com"):
    """Scrape product data from Amazon by ASIN."""
    url = f"https://www.amazon.{marketplace}/dp/{asin}"
    headers = {
        "User-Agent": random.choice(USER_AGENTS),
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
        "Accept-Encoding": "gzip, deflate, br",
        "Connection": "keep-alive",
    }
    proxies = {
        "http": PROXY_URL,
        "https": PROXY_URL,
    }
    response = requests.get(url, headers=headers, proxies=proxies, timeout=30)
    if response.status_code != 200:
        return None
    soup = BeautifulSoup(response.text, "html.parser")
    product = {
        "asin": asin,
        "title": extract_title(soup),
        "price": extract_price(soup),
        "rating": extract_rating(soup),
        "review_count": extract_review_count(soup),
        "bsr": extract_bsr(soup),
        "availability": extract_availability(soup),
    }
    return product

def extract_title(soup):
    el = soup.find("span", {"id": "productTitle"})
    return el.get_text(strip=True) if el else None

def extract_price(soup):
    el = soup.find("span", {"class": "a-price-whole"})
    if el:
        fraction = soup.find("span", {"class": "a-price-fraction"})
        price = el.get_text(strip=True).rstrip(".")
        if fraction:
            price += "." + fraction.get_text(strip=True)
        return price
    return None

def extract_rating(soup):
    el = soup.find("span", {"class": "a-icon-alt"})
    if el and "out of" in el.get_text():
        return el.get_text(strip=True).split(" ")[0]
    return None

def extract_review_count(soup):
    el = soup.find("span", {"id": "acrCustomerReviewCount"})
    return el.get_text(strip=True) if el else None

def extract_bsr(soup):
    el = soup.find("th", string=lambda t: t and "Best Sellers Rank" in t)
    if el:
        td = el.find_next("td")
        return td.get_text(strip=True) if td else None
    return None

def extract_availability(soup):
    el = soup.find("div", {"id": "availability"})
    return el.get_text(strip=True) if el else None

# Example usage
if __name__ == "__main__":
    asins = ["B0CHX3QBCH", "B0D5BKRY4R", "B0CRMZHDG7"]
    for asin in asins:
        product = get_amazon_product(asin)
        if product:
            print(json.dumps(product, indent=2))
        time.sleep(random.uniform(2, 5))  # Random delay between requests

Handling Pagination for Search Results

from urllib.parse import quote_plus

def scrape_search_results(keyword, max_pages=5):
    """Scrape Amazon search results with pagination."""
    results = []
    for page in range(1, max_pages + 1):
        # quote_plus URL-encodes the keyword (spaces, special characters)
        url = f"https://www.amazon.com/s?k={quote_plus(keyword)}&page={page}"
        headers = {
            "User-Agent": random.choice(USER_AGENTS),
            "Accept-Language": "en-US,en;q=0.9",
        }
        # A sticky-session proxy URL keeps the same IP across pages (see rotation strategies below)
        proxies = {"http": PROXY_URL, "https": PROXY_URL}
        response = requests.get(url, headers=headers, proxies=proxies, timeout=30)
        if response.status_code != 200:
            print(f"Page {page}: status {response.status_code}")
            break
        soup = BeautifulSoup(response.text, "html.parser")
        items = soup.find_all("div", {"data-component-type": "s-search-result"})
        for item in items:
            asin = item.get("data-asin", "")
            title_el = item.find("h2")
            price_el = item.find("span", {"class": "a-price-whole"})
            results.append({
                "asin": asin,
                "title": title_el.get_text(strip=True) if title_el else None,
                "price": price_el.get_text(strip=True) if price_el else None,
                "page": page,
            })
        time.sleep(random.uniform(3, 7))
    return results

Node.js Implementation

For Node.js projects, use axios with https-proxy-agent for the proxy connection and cheerio for parsing.

const axios = require("axios");
const cheerio = require("cheerio");
const { HttpsProxyAgent } = require("https-proxy-agent");
const PROXY_URL = "http://USERNAME:PASSWORD@gate.proxyhat.com:8080";
const agent = new HttpsProxyAgent(PROXY_URL);
const USER_AGENTS = [
  "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
  "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
];
async function scrapeProduct(asin, marketplace = "com") {
  const url = `https://www.amazon.${marketplace}/dp/${asin}`;
  const { data } = await axios.get(url, {
    httpsAgent: agent,
    headers: {
      "User-Agent": USER_AGENTS[Math.floor(Math.random() * USER_AGENTS.length)],
      "Accept-Language": "en-US,en;q=0.9",
      Accept: "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    },
    timeout: 30000,
  });
  const $ = cheerio.load(data);
  return {
    asin,
    title: $("#productTitle").text().trim() || null,
    price: $(".a-price-whole").first().text().trim() || null,
    rating: $(".a-icon-alt").first().text().trim().split(" ")[0] || null,
    reviewCount: $("#acrCustomerReviewCount").text().trim() || null,
    availability: $("#availability").text().trim() || null,
  };
}
async function scrapeMultiple(asins) {
  const results = [];
  for (const asin of asins) {
    try {
      const product = await scrapeProduct(asin);
      results.push(product);
      console.log(`Scraped: ${product.title}`);
    } catch (err) {
      console.error(`Failed ${asin}: ${err.message}`);
    }
    // Random delay 2-5 seconds
    await new Promise((r) => setTimeout(r, 2000 + Math.random() * 3000));
  }
  return results;
}
// Usage
scrapeMultiple(["B0CHX3QBCH", "B0D5BKRY4R"]).then((results) => {
  console.log(JSON.stringify(results, null, 2));
});

Proxy Rotation Strategies for Amazon

Amazon's detection becomes more aggressive the more requests come from a single IP. Here are the rotation strategies that work best.

Per-Request Rotation

For bulk product lookups where each request is independent, rotate IPs on every request. This is the default behavior with ProxyHat's gateway: each new connection gets a fresh residential IP.

Session-Based Rotation

When scraping search results across multiple pages, maintain the same IP for the entire session. Switching IPs mid-pagination looks suspicious to Amazon. Use ProxyHat's sticky sessions:

# Maintain same IP for up to 10 minutes
http://USERNAME-session-search001:PASSWORD@gate.proxyhat.com:8080

Geo-Targeted Rotation

Match your proxy location to the Amazon marketplace. Accessing amazon.de from a US IP raises flags. Target specific countries:

# German IPs for amazon.de
http://USERNAME-country-DE:PASSWORD@gate.proxyhat.com:8080
# Japanese IPs for amazon.co.jp
http://USERNAME-country-JP:PASSWORD@gate.proxyhat.com:8080
# UK IPs for amazon.co.uk
http://USERNAME-country-GB:PASSWORD@gate.proxyhat.com:8080

For more on rotation techniques, read our detailed guide on the best proxies for web scraping in 2026.

Best Practices for Amazon Scraping

  • Randomize delays: Use random intervals of 2-7 seconds between requests. Never scrape at a fixed rate.
  • Rotate User-Agents: Maintain a pool of at least 10 realistic browser User-Agent strings and rotate them.
  • Handle CAPTCHAs gracefully: If you receive a CAPTCHA response, back off for 30-60 seconds and retry with a new IP.
  • Respect robots.txt: While not legally binding in most jurisdictions, following robots.txt directives demonstrates good faith.
  • Use residential proxies: Datacenter IPs are easily identified and blocked by Amazon. Residential proxies share the same IP ranges as real shoppers.
  • Monitor success rates: Track your HTTP 200 rate. If it drops below 90%, reduce concurrency or adjust your rotation strategy.
  • Cache responses: Never scrape the same URL twice if the data has not changed. Cache product data and set refresh intervals based on how frequently prices change.
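The back-off and retry advice above can be sketched as a small wrapper. `fetch_with_backoff` and its parameters are illustrative names, not part of any SDK; the fetch function is injected so any HTTP client can be used, and each retry should go out through a fresh proxy connection:

```python
import random
import time

def fetch_with_backoff(fetch, url, max_retries=3, base_delay=30):
    """Call fetch(url), retrying with randomized back-off on failure.

    fetch should return the response body on success and raise on a
    block or CAPTCHA response.
    """
    for attempt in range(max_retries):
        try:
            return fetch(url)
        except Exception:
            if attempt == max_retries - 1:
                raise
            # 30-60 second back-off as recommended above, growing per attempt
            time.sleep(random.uniform(base_delay, base_delay * 2) * (attempt + 1))
    return None
```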

Scaling Your Amazon Scraper

When moving from hundreds to millions of products, architecture matters.

Queue-Based Architecture

Use a message queue (Redis, RabbitMQ, or SQS) to manage your ASIN list. Worker processes pull ASINs from the queue, scrape them, and push results to a data store. This decouples scheduling from scraping and lets you scale workers independently.
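The worker pattern can be sketched with the standard library. `queue.Queue` with threads stands in here for Redis, RabbitMQ, or SQS, but the shape is the same: a task queue, independent workers, and a result sink (all names below are illustrative):

```python
import queue
import threading

def run_workers(asins, scrape, store, num_workers=4):
    """Queue-based scraping: workers pull ASINs, scrape, and push results."""
    tasks = queue.Queue()
    for asin in asins:
        tasks.put(asin)

    def worker():
        while True:
            try:
                asin = tasks.get_nowait()
            except queue.Empty:
                return  # queue drained, worker exits
            store(scrape(asin))  # push the parsed result to the data store
            tasks.task_done()

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

With an external queue, the scheduler that fills the queue and the workers that drain it become separate processes, which is what lets you scale them independently.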

Concurrency Control

Start with 5-10 concurrent requests and increase gradually while monitoring success rates. With ProxyHat's residential pool, you can typically run 20-50 concurrent sessions without issues. See our web scraping use case page for recommended configurations.
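A thread pool gives you a single knob for concurrency to tune while you watch success rates. A sketch using only the standard library (`scrape_concurrently` is an illustrative name):

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def scrape_concurrently(asins, scrape, max_workers=5):
    """Scrape ASINs with a bounded number of concurrent requests."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(scrape, asin): asin for asin in asins}
        for future in as_completed(futures):
            asin = futures[future]
            try:
                results[asin] = future.result()
            except Exception:
                results[asin] = None  # log and requeue for a later retry
    return results
```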

Data Pipeline

Store raw HTML in an object store (S3) for reprocessing, and parsed data in PostgreSQL or a data warehouse. This separation lets you fix parsing bugs without re-scraping.
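The separation can be sketched like this. Local files and SQLite stand in for S3 and PostgreSQL, since the point is the split between raw HTML and parsed rows, not the specific stores:

```python
import json
import sqlite3
from pathlib import Path

def store_snapshot(asin, raw_html, parsed, html_dir, db_path):
    """Persist raw HTML (for later re-parsing) and parsed fields separately."""
    # Raw HTML: object-store stand-in, keyed by ASIN
    html_dir = Path(html_dir)
    html_dir.mkdir(parents=True, exist_ok=True)
    (html_dir / f"{asin}.html").write_text(raw_html, encoding="utf-8")

    # Parsed data: relational stand-in, one row per ASIN
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS products (asin TEXT PRIMARY KEY, data TEXT)"
    )
    conn.execute(
        "INSERT OR REPLACE INTO products VALUES (?, ?)",
        (asin, json.dumps(parsed)),
    )
    conn.commit()
    conn.close()
```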

Pro tip: Amazon product pages change structure frequently. Store raw HTML so you can re-extract data when selectors change, without hitting Amazon again.

Legal and Ethical Considerations

Web scraping is legal in most jurisdictions for publicly available data, but responsible practices matter. Only collect data that is publicly displayed. Do not attempt to access authenticated pages, seller accounts, or private data. Rate-limit your requests to avoid degrading Amazon's service for other users. Store only the data you need and handle it in compliance with applicable privacy laws.

Key Takeaways

  • Amazon's anti-bot system requires residential proxies with geo-targeting to match the target marketplace.
  • Rotate IPs per request for bulk lookups; use sticky sessions for paginated browsing.
  • Randomize delays, User-Agents, and request patterns to avoid detection.
  • Build a queue-based architecture for scaling beyond a few thousand products.
  • Store raw HTML for resilience against selector changes.
  • Use ProxyHat's residential proxies for high success rates across all Amazon marketplaces.

Ready to start scraping Amazon data? Our e-commerce data scraping guide covers the full strategy, and you can explore ProxyHat's proxy infrastructure on our pricing page.

Ready to get started?

Access 50M+ residential IPs across 148+ countries with AI-powered filtering.
