How to Scrape Product Reviews at Scale with Proxies

Learn how to scrape product reviews from Amazon and other platforms at scale. Python and Node.js code for multi-platform review collection, pagination handling, and sentiment-analysis preparation.


Why Scrape Product Reviews at Scale?

Product reviews are one of the most valuable data sources in e-commerce. They reveal customer sentiment, product quality issues, feature requests, and competitive positioning — information that no other data source can provide. At scale, review data enables:

  • Sentiment analysis: Track how customers feel about your products and competitors' products over time.
  • Product development: Identify recurring complaints and feature requests across thousands of reviews.
  • Competitive intelligence: Understand competitor strengths and weaknesses from their customers' own words.
  • Market research: Discover unmet needs and emerging trends by analyzing review patterns across categories.
  • Quality monitoring: Detect product quality issues early by monitoring review sentiment trends.

The challenge is that review data is spread across multiple platforms (Amazon, Walmart, Best Buy, Trustpilot, Google), each with different structures and anti-bot protections. Scraping reviews at scale requires platform-specific strategies and robust proxy infrastructure. For foundational e-commerce scraping patterns, see our e-commerce data scraping guide.

Review Data Structure Across Platforms

| Platform | Review Fields | Pagination | Anti-Bot Level |
| --- | --- | --- | --- |
| Amazon | Rating, title, text, date, verified, helpful votes | Page-based (10/page) | High |
| Walmart | Rating, title, text, date, submission source | Offset-based API | Medium |
| Best Buy | Rating, title, text, date, helpful/unhelpful | Page-based API | Medium |
| Trustpilot | Rating, title, text, date, reply | Page-based | Low-Medium |
| Google Shopping | Rating, text, date, source | Scroll-based | High |

Proxy Configuration for Review Scraping

Review scraping involves paginated navigation, which means maintaining sessions across multiple requests. ProxyHat's sticky sessions are ideal for this pattern.

ProxyHat Setup

# Per-request rotation for initial product lookups
http://USERNAME:PASSWORD@gate.proxyhat.com:8080
# Sticky session for paginating through reviews of one product
http://USERNAME-session-rev001:PASSWORD@gate.proxyhat.com:8080
# Geo-targeted for region-specific review pages
http://USERNAME-country-US:PASSWORD@gate.proxyhat.com:8080

For review scraping, use sticky sessions when paginating through all reviews for a single product, and per-request rotation when moving between different products. This mimics natural browsing behavior where a user reads multiple pages of reviews for one product before moving to the next.
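The two-mode strategy above can be captured in a small helper. This is a minimal sketch using the same placeholder `USERNAME`/`PASSWORD` credentials and `gate.proxyhat.com:8080` gateway as the configuration block; the function names are illustrative, not part of any SDK.

```python
import random

GATEWAY = "gate.proxyhat.com:8080"  # placeholder gateway from the setup above

def rotating_proxy(user, password):
    """Per-request rotation: each request may exit from a different IP."""
    return f"http://{user}:{password}@{GATEWAY}"

def sticky_proxy(user, password, product_id):
    """Sticky session: pin one IP for all review pages of a single product."""
    session_id = f"rev-{product_id}-{random.randint(1000, 9999)}"
    return f"http://{user}-session-{session_id}:{password}@{GATEWAY}"
```

Call `rotating_proxy()` when discovering products and `sticky_proxy()` once per product before paginating, then build a new session ID for the next product.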

Python Implementation

Here is a multi-platform review scraper in Python using requests and BeautifulSoup, routed through the ProxyHat gateway.

Amazon Review Scraper

import requests
from bs4 import BeautifulSoup
import time
import random
from dataclasses import dataclass

# Default rotating endpoint for one-off product lookups
PROXY_URL = "http://USERNAME:PASSWORD@gate.proxyhat.com:8080"
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 Chrome/124.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 Chrome/124.0.0.0 Safari/537.36",
]
@dataclass
class Review:
    platform: str
    product_id: str
    rating: float
    title: str
    text: str
    date: str
    author: str
    verified: bool
    helpful_votes: int
def scrape_amazon_reviews(asin, max_pages=10):
    """Scrape all reviews for an Amazon product."""
    reviews = []
    session_id = f"rev-{asin}-{random.randint(1000, 9999)}"
    proxy = f"http://USERNAME-session-{session_id}:PASSWORD@gate.proxyhat.com:8080"
    session = requests.Session()
    session.proxies = {"http": proxy, "https": proxy}
    session.headers.update({
        "User-Agent": random.choice(USER_AGENTS),
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
    })
    for page in range(1, max_pages + 1):
        url = (f"https://www.amazon.com/product-reviews/{asin}"
               f"?pageNumber={page}&sortBy=recent")
        try:
            response = session.get(url, timeout=30)
            if response.status_code != 200:
                break
            if "captcha" in response.text.lower():
                print(f"CAPTCHA detected on page {page}, stopping this session")
                break
            soup = BeautifulSoup(response.text, "html.parser")
            review_divs = soup.find_all("div", {"data-hook": "review"})
            if not review_divs:
                break
            for div in review_divs:
                review = parse_amazon_review(div, asin)
                if review:
                    reviews.append(review)
            print(f"Page {page}: {len(review_divs)} reviews (total: {len(reviews)})")
            time.sleep(random.uniform(2, 5))
        except requests.RequestException as e:
            print(f"Error on page {page}: {e}")
            break
    return reviews
def parse_amazon_review(div, asin):
    """Parse a single Amazon review element."""
    try:
        rating_el = div.find("i", {"data-hook": "review-star-rating"})
        rating = float(rating_el.get_text().split(" ")[0]) if rating_el else None
        title_el = div.find("a", {"data-hook": "review-title"})
        title = title_el.get_text(strip=True) if title_el else ""
        body_el = div.find("span", {"data-hook": "review-body"})
        text = body_el.get_text(strip=True) if body_el else ""
        date_el = div.find("span", {"data-hook": "review-date"})
        date_str = date_el.get_text(strip=True) if date_el else ""
        author_el = div.find("span", {"class": "a-profile-name"})
        author = author_el.get_text(strip=True) if author_el else ""
        verified = bool(div.find("span", {"data-hook": "avp-badge"}))
        helpful_el = div.find("span", {"data-hook": "helpful-vote-statement"})
        helpful = 0
        if helpful_el:
            text_h = helpful_el.get_text()
            if "one" in text_h.lower():
                helpful = 1
            else:
                nums = [int(s) for s in text_h.split() if s.isdigit()]
                helpful = nums[0] if nums else 0
        return Review(
            platform="amazon",
            product_id=asin,
            rating=rating,
            title=title,
            text=text,
            date=date_str,
            author=author,
            verified=verified,
            helpful_votes=helpful,
        )
    except Exception:
        return None

Multi-Platform Review Collector

class ReviewCollector:
    """Collect reviews from multiple platforms for a product."""
    def __init__(self):
        self.scrapers = {
            "amazon": scrape_amazon_reviews,
        }
    def collect_all(self, product_ids: dict) -> list[Review]:
        """
        Collect reviews from all platforms.
        product_ids: {"amazon": "B0CHX3QBCH", "walmart": "12345"}
        """
        all_reviews = []
        for platform, product_id in product_ids.items():
            if platform in self.scrapers:
                print(f"\nScraping {platform} reviews for {product_id}")
                reviews = self.scrapers[platform](product_id)
                all_reviews.extend(reviews)
                print(f"Collected {len(reviews)} reviews from {platform}")
                time.sleep(random.uniform(5, 10))
        return all_reviews
    def to_dataframe(self, reviews: list[Review]):
        """Convert reviews to a pandas DataFrame for analysis."""
        import pandas as pd
        return pd.DataFrame([vars(r) for r in reviews])
# Usage
collector = ReviewCollector()
reviews = collector.collect_all({
    "amazon": "B0CHX3QBCH",
})
print(f"\nTotal reviews collected: {len(reviews)}")

Node.js Implementation

A Node.js review scraper using axios, cheerio, and https-proxy-agent, routed through the ProxyHat gateway.

const axios = require("axios");
const cheerio = require("cheerio");
const { HttpsProxyAgent } = require("https-proxy-agent");
function getProxy(sessionId = null) {
  if (sessionId) {
    return `http://USERNAME-session-${sessionId}:PASSWORD@gate.proxyhat.com:8080`;
  }
  return "http://USERNAME:PASSWORD@gate.proxyhat.com:8080";
}
async function scrapeAmazonReviews(asin, maxPages = 10) {
  const reviews = [];
  const sessionId = `rev-${asin}-${Math.floor(Math.random() * 9000 + 1000)}`;
  const agent = new HttpsProxyAgent(getProxy(sessionId));
  for (let page = 1; page <= maxPages; page++) {
    const url = `https://www.amazon.com/product-reviews/${asin}?pageNumber=${page}&sortBy=recent`;
    try {
      const { data } = await axios.get(url, {
        httpsAgent: agent,
        headers: {
          "User-Agent":
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 Chrome/124.0.0.0 Safari/537.36",
          "Accept-Language": "en-US,en;q=0.9",
        },
        timeout: 30000,
      });
      if (data.toLowerCase().includes("captcha")) {
        console.log(`CAPTCHA on page ${page}`);
        break;
      }
      const $ = cheerio.load(data);
      const reviewDivs = $('[data-hook="review"]');
      if (reviewDivs.length === 0) break;
      reviewDivs.each((_, el) => {
        const $el = $(el);
        const ratingText = $el.find('[data-hook="review-star-rating"]').text();
        const rating = parseFloat(ratingText.split(" ")[0]) || null;
        reviews.push({
          platform: "amazon",
          productId: asin,
          rating,
          title: $el.find('[data-hook="review-title"]').text().trim(),
          text: $el.find('[data-hook="review-body"]').text().trim(),
          date: $el.find('[data-hook="review-date"]').text().trim(),
          author: $el.find(".a-profile-name").text().trim(),
          verified: $el.find('[data-hook="avp-badge"]').length > 0,
        });
      });
      console.log(`Page ${page}: ${reviewDivs.length} reviews (total: ${reviews.length})`);
      await new Promise((r) => setTimeout(r, 2000 + Math.random() * 3000));
    } catch (err) {
      console.error(`Error page ${page}: ${err.message}`);
      break;
    }
  }
  return reviews;
}
// Usage
scrapeAmazonReviews("B0CHX3QBCH", 5).then((reviews) => {
  console.log(`Collected ${reviews.length} reviews`);
  console.log(JSON.stringify(reviews.slice(0, 2), null, 2));
});

Handling Pagination at Scale

Review pagination is one of the biggest challenges in large-scale review scraping.

Amazon Pagination Strategy

Amazon limits review pages to 10 reviews each and typically shows up to 500 pages (5,000 reviews). For products with more reviews, use filter parameters to segment:

# Filter by star rating to segment reviews beyond the page cap
star_filters = [
    "one_star", "two_star", "three_star",
    "four_star", "five_star",
]
for star in star_filters:
    for page in range(1, max_pages + 1):
        url = (f"https://www.amazon.com/product-reviews/{asin}"
               f"?filterByStar={star}&pageNumber={page}")
        # Each star filter has its own page cap, so segmenting
        # lets you reach more reviews per product

Session Management for Pagination

Each product's review pagination should use its own sticky session. When you finish one product and move to the next, create a new session with a different IP.

| Phase | Proxy Strategy | Reason |
| --- | --- | --- |
| Finding products | Per-request rotation | Independent lookups, no session needed |
| Paginating reviews | Sticky session per product | Same IP across pages looks natural |
| Between products | New session/IP | Fresh identity for each product |

Preparing Data for Sentiment Analysis

Raw review text needs preprocessing before sentiment analysis.

import re
from collections import Counter
def clean_review_text(text):
    """Clean review text for analysis."""
    # Remove HTML entities
    text = re.sub(r'&\w+;', ' ', text)
    # Remove excessive whitespace
    text = re.sub(r'\s+', ' ', text).strip()
    # Remove very short reviews (likely not useful)
    if len(text) < 20:
        return None
    return text
def extract_key_phrases(reviews, min_frequency=3):
    """Extract frequently mentioned bigrams from reviews."""
    bigrams = []
    for review in reviews:
        if review.text:
            # Simple bigram extraction
            tokens = re.findall(r'\b\w+\b', review.text.lower())
            for i in range(len(tokens) - 1):
                bigrams.append(f"{tokens[i]} {tokens[i+1]}")
    # Keep the top phrases that appear at least min_frequency times
    return [(phrase, count)
            for phrase, count in Counter(bigrams).most_common(50)
            if count >= min_frequency]
def aggregate_sentiment(reviews):
    """Calculate aggregate sentiment metrics."""
    if not reviews:
        return {}
    ratings = [r.rating for r in reviews if r.rating]
    return {
        "total_reviews": len(reviews),
        "avg_rating": sum(ratings) / len(ratings) if ratings else 0,
        "rating_distribution": {
            str(i): len([r for r in reviews if r.rating == i])
            for i in range(1, 6)
        },
        "verified_pct": (
            len([r for r in reviews if r.verified]) / len(reviews) * 100
            if reviews else 0
        ),
    }

Scaling to Millions of Reviews

When your target list grows to thousands of products across multiple platforms, architecture matters.

Queue-Based Architecture

  • Use a message queue (Redis, RabbitMQ) to manage the product list and distribute work across workers.
  • Each worker handles one product at a time: paginate through all reviews, store results, move to the next product.
  • Separate queues per platform to respect different rate limits.
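The worker pattern above can be sketched in-process with Python's standard library. This is a simplified, single-machine stand-in: the `platform_queues` and `worker` names are illustrative, and a production deployment would replace `queue.Queue` with Redis or RabbitMQ so workers on multiple machines can share the backlog.

```python
import queue
import threading

# One queue per platform, so each can have its own rate limits and workers
platform_queues = {
    "amazon": queue.Queue(),
    "walmart": queue.Queue(),
}

results = []

def worker(platform, q):
    """Drain one platform's queue, one product at a time."""
    while True:
        try:
            product_id = q.get_nowait()
        except queue.Empty:
            return  # backlog exhausted for this platform
        # Placeholder for the real scraper call, e.g. scrape_amazon_reviews(product_id)
        results.append((platform, product_id))
        q.task_done()

# Enqueue the product backlog, then run one worker thread per platform
for pid in ["B0CHX3QBCH", "B0EXAMPLE1"]:
    platform_queues["amazon"].put(pid)

threads = [
    threading.Thread(target=worker, args=(name, q))
    for name, q in platform_queues.items()
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```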

Storage Strategy

  • Store raw HTML in object storage (S3) for reprocessing when parsers change.
  • Store parsed reviews in PostgreSQL with full-text search for analysis.
  • Use deduplication based on review ID or hash to avoid storing duplicates on re-scrapes.
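When a platform exposes no stable review ID, a content hash works as the dedup key. A minimal sketch (the `review_fingerprint` and `store_if_new` helpers are illustrative; in practice the `seen` set would be a unique index or constraint in PostgreSQL):

```python
import hashlib

def review_fingerprint(platform, product_id, author, date, text):
    """Stable hash used as a dedup key when no review ID is available."""
    raw = "|".join([platform, product_id, author, date, text])
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()

seen = set()

def store_if_new(review, sink):
    """Append the review to sink only if its fingerprint is unseen."""
    fp = review_fingerprint(review["platform"], review["product_id"],
                            review["author"], review["date"], review["text"])
    if fp in seen:
        return False  # duplicate from a re-scrape, skip it
    seen.add(fp)
    sink.append(review)
    return True
```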

Incremental Scraping

For ongoing monitoring, you do not need to re-scrape all reviews every time. Sort by most recent and stop when you hit a review you have already collected. This dramatically reduces proxy usage and speeds up collection.

Key takeaway: Sort reviews by newest first and stop scraping when you hit previously collected content. This turns a full re-scrape into an incremental update.
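The stop-at-seen logic is a short early return. A sketch, assuming pages arrive newest-first as lists of review IDs (the `pages`/`known_ids` shape here is hypothetical; adapt it to whatever your parser yields):

```python
def collect_incremental(pages, known_ids):
    """Walk newest-first pages and stop at the first already-seen review.

    pages: iterable of lists of review IDs, newest first.
    known_ids: set of IDs collected on previous runs.
    """
    new_ids = []
    for page in pages:
        for review_id in page:
            if review_id in known_ids:
                return new_ids  # everything older is already stored
            new_ids.append(review_id)
    return new_ids
```

On a mature product this typically turns hundreds of page fetches into one or two per run.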

Best Practices

  • Use sticky sessions for pagination: Maintain the same IP across review pages for a single product to avoid triggering anti-bot detection.
  • Respect rate limits: 2-5 second delays between pages, longer delays between products. Different platforms have different tolerances.
  • Handle empty pages: An empty review page means you have reached the end. Do not keep trying more pages.
  • Validate data quality: Check for CAPTCHA pages, empty content, and duplicate reviews in your pipeline.
  • Use residential proxies: Essential for Amazon and other heavily protected platforms.
  • Store incrementally: Process and store reviews as you scrape them, not in one batch at the end.
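For the CAPTCHA-handling practice above, retrying the same page under a fresh sticky session often recovers without losing the product. A minimal sketch; `fetch_page` is a hypothetical callable standing in for your own request function, and the session-ID format follows the gateway examples earlier in the article:

```python
import random
import time

def fetch_with_session_retry(fetch_page, asin, page, max_attempts=3):
    """Retry a page with a fresh sticky session when a CAPTCHA is detected.

    fetch_page(asin, page, session_id) -> html  (hypothetical callable)
    """
    for attempt in range(max_attempts):
        # New session ID means a new IP from the proxy pool
        session_id = f"rev-{asin}-{random.randint(1000, 9999)}"
        html = fetch_page(asin, page, session_id)
        if "captcha" not in html.lower():
            return html
        time.sleep(2 ** attempt)  # back off before trying a new identity
    return None  # give up on this page after max_attempts
```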

Key Takeaways

  • Review data provides unique competitive intelligence that no other data source offers.
  • Different platforms require different scraping strategies — build modular scrapers per platform.
  • Use sticky sessions for review pagination and per-request rotation between products.
  • Sort by newest first and stop at previously collected reviews for efficient incremental scraping.
  • Preprocess review text for sentiment analysis: clean, deduplicate, and extract key phrases.
  • Use ProxyHat's residential proxies with geo-targeting for reliable access to review pages across all platforms.

Ready to start collecting review data? See our Amazon scraping guide for platform-specific details and our e-commerce data scraping guide for the full strategy. Check using proxies in Python and using proxies in Node.js for implementation patterns.
