How to Scrape Product Reviews at Scale with Proxies

Learn how to scrape product reviews from Amazon and other platforms at scale. Python and Node.js code for multi-platform review collection, pagination handling, and sentiment-analysis preparation.


Why Scrape Product Reviews at Scale?

Product reviews are one of the most valuable data sources in e-commerce. They reveal customer sentiment, product quality issues, feature requests, and competitive positioning — information that no other data source can provide. At scale, review data enables:

  • Sentiment analysis: Track how customers feel about your products and competitors' products over time.
  • Product development: Identify recurring complaints and feature requests across thousands of reviews.
  • Competitive intelligence: Understand competitor strengths and weaknesses from their customers' own words.
  • Market research: Discover unmet needs and emerging trends by analyzing review patterns across categories.
  • Quality monitoring: Detect product quality issues early by monitoring review sentiment trends.

The challenge is that review data is spread across multiple platforms (Amazon, Walmart, Best Buy, Trustpilot, Google), each with different structures and anti-bot protections. Scraping reviews at scale requires platform-specific strategies and robust proxy infrastructure. For foundational e-commerce scraping patterns, see our e-commerce data scraping guide.

Review Data Structure Across Platforms

| Platform | Review Fields | Pagination | Anti-Bot Level |
| --- | --- | --- | --- |
| Amazon | Rating, title, text, date, verified, helpful votes | Page-based (10/page) | High |
| Walmart | Rating, title, text, date, submission source | Offset-based API | Medium |
| Best Buy | Rating, title, text, date, helpful/unhelpful | Page-based API | Medium |
| Trustpilot | Rating, title, text, date, reply | Page-based | Low-Medium |
| Google Shopping | Rating, text, date, source | Scroll-based | High |

Proxy Configuration for Review Scraping

Review scraping involves paginated navigation, which means maintaining sessions across multiple requests. ProxyHat's sticky sessions are ideal for this pattern.

ProxyHat Setup

# Per-request rotation for initial product lookups
http://USERNAME:PASSWORD@gate.proxyhat.com:8080
# Sticky session for paginating through reviews of one product
http://USERNAME-session-rev001:PASSWORD@gate.proxyhat.com:8080
# Geo-targeted for region-specific review pages
http://USERNAME-country-US:PASSWORD@gate.proxyhat.com:8080

For review scraping, use sticky sessions when paginating through all reviews for a single product, and per-request rotation when moving between different products. This mimics natural browsing behavior where a user reads multiple pages of reviews for one product before moving to the next.
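The two-mode strategy above can be captured in a small helper. This is a minimal sketch using the same placeholder `USERNAME`/`PASSWORD` credentials and `gate.proxyhat.com:8080` gateway as the configuration block; the function names are illustrative, not part of any SDK.

```python
import random

GATEWAY = "gate.proxyhat.com:8080"  # placeholder gateway from the setup above

def rotating_proxy(user, password):
    """Per-request rotation: each request may exit from a different IP."""
    return f"http://{user}:{password}@{GATEWAY}"

def sticky_proxy(user, password, product_id):
    """Sticky session: pin one IP for all review pages of a single product."""
    session_id = f"rev-{product_id}-{random.randint(1000, 9999)}"
    return f"http://{user}-session-{session_id}:{password}@{GATEWAY}"
```

Call `rotating_proxy()` when discovering products and `sticky_proxy()` once per product before paginating, then build a new session ID for the next product.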

Python Implementation

Here is a multi-platform review scraper in Python using requests and BeautifulSoup, routed through the ProxyHat gateway.

Amazon Review Scraper

import requests
from bs4 import BeautifulSoup
import time
import random
from dataclasses import dataclass

# Default rotating endpoint for one-off product lookups
PROXY_URL = "http://USERNAME:PASSWORD@gate.proxyhat.com:8080"
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 Chrome/124.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 Chrome/124.0.0.0 Safari/537.36",
]
@dataclass
class Review:
    platform: str
    product_id: str
    rating: float
    title: str
    text: str
    date: str
    author: str
    verified: bool
    helpful_votes: int
def scrape_amazon_reviews(asin, max_pages=10):
    """Scrape all reviews for an Amazon product."""
    reviews = []
    session_id = f"rev-{asin}-{random.randint(1000, 9999)}"
    proxy = f"http://USERNAME-session-{session_id}:PASSWORD@gate.proxyhat.com:8080"
    session = requests.Session()
    session.proxies = {"http": proxy, "https": proxy}
    session.headers.update({
        "User-Agent": random.choice(USER_AGENTS),
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
    })
    for page in range(1, max_pages + 1):
        url = (f"https://www.amazon.com/product-reviews/{asin}"
               f"?pageNumber={page}&sortBy=recent")
        try:
            response = session.get(url, timeout=30)
            if response.status_code != 200:
                break
            if "captcha" in response.text.lower():
                print(f"CAPTCHA detected on page {page}, stopping this session")
                break
            soup = BeautifulSoup(response.text, "html.parser")
            review_divs = soup.find_all("div", {"data-hook": "review"})
            if not review_divs:
                break
            for div in review_divs:
                review = parse_amazon_review(div, asin)
                if review:
                    reviews.append(review)
            print(f"Page {page}: {len(review_divs)} reviews (total: {len(reviews)})")
            time.sleep(random.uniform(2, 5))
        except requests.RequestException as e:
            print(f"Error on page {page}: {e}")
            break
    return reviews
def parse_amazon_review(div, asin):
    """Parse a single Amazon review element."""
    try:
        rating_el = div.find("i", {"data-hook": "review-star-rating"})
        rating = float(rating_el.get_text().split(" ")[0]) if rating_el else None
        title_el = div.find("a", {"data-hook": "review-title"})
        title = title_el.get_text(strip=True) if title_el else ""
        body_el = div.find("span", {"data-hook": "review-body"})
        text = body_el.get_text(strip=True) if body_el else ""
        date_el = div.find("span", {"data-hook": "review-date"})
        date_str = date_el.get_text(strip=True) if date_el else ""
        author_el = div.find("span", {"class": "a-profile-name"})
        author = author_el.get_text(strip=True) if author_el else ""
        verified = bool(div.find("span", {"data-hook": "avp-badge"}))
        helpful_el = div.find("span", {"data-hook": "helpful-vote-statement"})
        helpful = 0
        if helpful_el:
            text_h = helpful_el.get_text()
            if "one" in text_h.lower():
                helpful = 1
            else:
                nums = [int(s) for s in text_h.split() if s.isdigit()]
                helpful = nums[0] if nums else 0
        return Review(
            platform="amazon",
            product_id=asin,
            rating=rating,
            title=title,
            text=text,
            date=date_str,
            author=author,
            verified=verified,
            helpful_votes=helpful,
        )
    except Exception:
        return None

Multi-Platform Review Collector

class ReviewCollector:
    """Collect reviews from multiple platforms for a product."""
    def __init__(self):
        self.scrapers = {
            "amazon": scrape_amazon_reviews,
        }
    def collect_all(self, product_ids: dict) -> list[Review]:
        """
        Collect reviews from all platforms.
        product_ids: {"amazon": "B0CHX3QBCH", "walmart": "12345"}
        """
        all_reviews = []
        for platform, product_id in product_ids.items():
            if platform in self.scrapers:
                print(f"\nScraping {platform} reviews for {product_id}")
                reviews = self.scrapers[platform](product_id)
                all_reviews.extend(reviews)
                print(f"Collected {len(reviews)} reviews from {platform}")
                time.sleep(random.uniform(5, 10))
        return all_reviews
    def to_dataframe(self, reviews: list[Review]):
        """Convert reviews to a pandas DataFrame for analysis."""
        import pandas as pd
        return pd.DataFrame([vars(r) for r in reviews])
# Usage
collector = ReviewCollector()
reviews = collector.collect_all({
    "amazon": "B0CHX3QBCH",
})
print(f"\nTotal reviews collected: {len(reviews)}")

Node.js Implementation

A Node.js review scraper using axios, cheerio, and https-proxy-agent, routed through the ProxyHat gateway.

const axios = require("axios");
const cheerio = require("cheerio");
const { HttpsProxyAgent } = require("https-proxy-agent");
function getProxy(sessionId = null) {
  if (sessionId) {
    return `http://USERNAME-session-${sessionId}:PASSWORD@gate.proxyhat.com:8080`;
  }
  return "http://USERNAME:PASSWORD@gate.proxyhat.com:8080";
}
async function scrapeAmazonReviews(asin, maxPages = 10) {
  const reviews = [];
  const sessionId = `rev-${asin}-${Math.floor(Math.random() * 9000 + 1000)}`;
  const agent = new HttpsProxyAgent(getProxy(sessionId));
  for (let page = 1; page <= maxPages; page++) {
    const url = `https://www.amazon.com/product-reviews/${asin}?pageNumber=${page}&sortBy=recent`;
    try {
      const { data } = await axios.get(url, {
        httpsAgent: agent,
        headers: {
          "User-Agent":
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 Chrome/124.0.0.0 Safari/537.36",
          "Accept-Language": "en-US,en;q=0.9",
        },
        timeout: 30000,
      });
      if (data.toLowerCase().includes("captcha")) {
        console.log(`CAPTCHA on page ${page}`);
        break;
      }
      const $ = cheerio.load(data);
      const reviewDivs = $('[data-hook="review"]');
      if (reviewDivs.length === 0) break;
      reviewDivs.each((_, el) => {
        const $el = $(el);
        const ratingText = $el.find('[data-hook="review-star-rating"]').text();
        const rating = parseFloat(ratingText.split(" ")[0]) || null;
        reviews.push({
          platform: "amazon",
          productId: asin,
          rating,
          title: $el.find('[data-hook="review-title"]').text().trim(),
          text: $el.find('[data-hook="review-body"]').text().trim(),
          date: $el.find('[data-hook="review-date"]').text().trim(),
          author: $el.find(".a-profile-name").text().trim(),
          verified: $el.find('[data-hook="avp-badge"]').length > 0,
        });
      });
      console.log(`Page ${page}: ${reviewDivs.length} reviews (total: ${reviews.length})`);
      await new Promise((r) => setTimeout(r, 2000 + Math.random() * 3000));
    } catch (err) {
      console.error(`Error page ${page}: ${err.message}`);
      break;
    }
  }
  return reviews;
}
// Usage
scrapeAmazonReviews("B0CHX3QBCH", 5).then((reviews) => {
  console.log(`Collected ${reviews.length} reviews`);
  console.log(JSON.stringify(reviews.slice(0, 2), null, 2));
});

Handling Pagination at Scale

Review pagination is one of the biggest challenges in large-scale review scraping.

Amazon Pagination Strategy

Amazon limits review pages to 10 reviews each and typically shows up to 500 pages (5,000 reviews). For products with more reviews, use filter parameters to segment:

# Filter by star rating to segment reviews beyond the page cap
star_filters = [
    "one_star", "two_star", "three_star",
    "four_star", "five_star",
]
for star in star_filters:
    for page in range(1, max_pages + 1):
        url = (f"https://www.amazon.com/product-reviews/{asin}"
               f"?filterByStar={star}&pageNumber={page}")
        # Each star filter has its own page cap, so segmenting
        # lets you reach more reviews per product

Session Management for Pagination

Each product's review pagination should use its own sticky session. When you finish one product and move to the next, create a new session with a different IP.

| Phase | Proxy Strategy | Reason |
| --- | --- | --- |
| Finding products | Per-request rotation | Independent lookups, no session needed |
| Paginating reviews | Sticky session per product | Same IP across pages looks natural |
| Between products | New session/IP | Fresh identity for each product |

Preparing Data for Sentiment Analysis

Raw review text needs preprocessing before sentiment analysis.

import re
from collections import Counter
def clean_review_text(text):
    """Clean review text for analysis."""
    # Remove HTML entities
    text = re.sub(r'&\w+;', ' ', text)
    # Remove excessive whitespace
    text = re.sub(r'\s+', ' ', text).strip()
    # Remove very short reviews (likely not useful)
    if len(text) < 20:
        return None
    return text
def extract_key_phrases(reviews, min_frequency=3):
    """Extract frequently mentioned bigrams from reviews."""
    bigrams = []
    for review in reviews:
        if review.text:
            # Simple bigram extraction
            tokens = re.findall(r'\b\w+\b', review.text.lower())
            for i in range(len(tokens) - 1):
                bigrams.append(f"{tokens[i]} {tokens[i+1]}")
    # Keep the top phrases that appear at least min_frequency times
    return [(phrase, count)
            for phrase, count in Counter(bigrams).most_common(50)
            if count >= min_frequency]
def aggregate_sentiment(reviews):
    """Calculate aggregate sentiment metrics."""
    if not reviews:
        return {}
    ratings = [r.rating for r in reviews if r.rating]
    return {
        "total_reviews": len(reviews),
        "avg_rating": sum(ratings) / len(ratings) if ratings else 0,
        "rating_distribution": {
            str(i): len([r for r in reviews if r.rating == i])
            for i in range(1, 6)
        },
        "verified_pct": (
            len([r for r in reviews if r.verified]) / len(reviews) * 100
            if reviews else 0
        ),
    }

Scaling to Millions of Reviews

When your target list grows to thousands of products across multiple platforms, architecture matters.

Queue-Based Architecture

  • Use a message queue (Redis, RabbitMQ) to manage the product list and distribute work across workers.
  • Each worker handles one product at a time: paginate through all reviews, store results, move to the next product.
  • Separate queues per platform to respect different rate limits.
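The worker pattern above can be sketched in-process with Python's standard library. This is a simplified, single-machine stand-in: the `platform_queues` and `worker` names are illustrative, and a production deployment would replace `queue.Queue` with Redis or RabbitMQ so workers on multiple machines can share the backlog.

```python
import queue
import threading

# One queue per platform, so each can have its own rate limits and workers
platform_queues = {
    "amazon": queue.Queue(),
    "walmart": queue.Queue(),
}

results = []

def worker(platform, q):
    """Drain one platform's queue, one product at a time."""
    while True:
        try:
            product_id = q.get_nowait()
        except queue.Empty:
            return  # backlog exhausted for this platform
        # Placeholder for the real scraper call, e.g. scrape_amazon_reviews(product_id)
        results.append((platform, product_id))
        q.task_done()

# Enqueue the product backlog, then run one worker thread per platform
for pid in ["B0CHX3QBCH", "B0EXAMPLE1"]:
    platform_queues["amazon"].put(pid)

threads = [
    threading.Thread(target=worker, args=(name, q))
    for name, q in platform_queues.items()
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```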

Storage Strategy

  • Store raw HTML in object storage (S3) for reprocessing when parsers change.
  • Store parsed reviews in PostgreSQL with full-text search for analysis.
  • Use deduplication based on review ID or hash to avoid storing duplicates on re-scrapes.
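When a platform exposes no stable review ID, a content hash works as the dedup key. A minimal sketch (the `review_fingerprint` and `store_if_new` helpers are illustrative; in practice the `seen` set would be a unique index or constraint in PostgreSQL):

```python
import hashlib

def review_fingerprint(platform, product_id, author, date, text):
    """Stable hash used as a dedup key when no review ID is available."""
    raw = "|".join([platform, product_id, author, date, text])
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()

seen = set()

def store_if_new(review, sink):
    """Append the review to sink only if its fingerprint is unseen."""
    fp = review_fingerprint(review["platform"], review["product_id"],
                            review["author"], review["date"], review["text"])
    if fp in seen:
        return False  # duplicate from a re-scrape, skip it
    seen.add(fp)
    sink.append(review)
    return True
```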

Incremental Scraping

For ongoing monitoring, you do not need to re-scrape all reviews every time. Sort by most recent and stop when you hit a review you have already collected. This dramatically reduces proxy usage and speeds up collection.

Key takeaway: Sort reviews by newest first and stop scraping when you hit previously collected content. This turns a full re-scrape into an incremental update.
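The stop-at-seen logic is a short early return. A sketch, assuming pages arrive newest-first as lists of review IDs (the `pages`/`known_ids` shape here is hypothetical; adapt it to whatever your parser yields):

```python
def collect_incremental(pages, known_ids):
    """Walk newest-first pages and stop at the first already-seen review.

    pages: iterable of lists of review IDs, newest first.
    known_ids: set of IDs collected on previous runs.
    """
    new_ids = []
    for page in pages:
        for review_id in page:
            if review_id in known_ids:
                return new_ids  # everything older is already stored
            new_ids.append(review_id)
    return new_ids
```

On a mature product this typically turns hundreds of page fetches into one or two per run.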

Best Practices

  • Use sticky sessions for pagination: Maintain the same IP across review pages for a single product to avoid triggering anti-bot detection.
  • Respect rate limits: 2-5 second delays between pages, longer delays between products. Different platforms have different tolerances.
  • Handle empty pages: An empty review page means you have reached the end. Do not keep trying more pages.
  • Validate data quality: Check for CAPTCHA pages, empty content, and duplicate reviews in your pipeline.
  • Use residential proxies: Essential for Amazon and other heavily protected platforms.
  • Store incrementally: Process and store reviews as you scrape them, not in one batch at the end.
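For the CAPTCHA-handling practice above, retrying the same page under a fresh sticky session often recovers without losing the product. A minimal sketch; `fetch_page` is a hypothetical callable standing in for your own request function, and the session-ID format follows the gateway examples earlier in the article:

```python
import random
import time

def fetch_with_session_retry(fetch_page, asin, page, max_attempts=3):
    """Retry a page with a fresh sticky session when a CAPTCHA is detected.

    fetch_page(asin, page, session_id) -> html  (hypothetical callable)
    """
    for attempt in range(max_attempts):
        # New session ID means a new IP from the proxy pool
        session_id = f"rev-{asin}-{random.randint(1000, 9999)}"
        html = fetch_page(asin, page, session_id)
        if "captcha" not in html.lower():
            return html
        time.sleep(2 ** attempt)  # back off before trying a new identity
    return None  # give up on this page after max_attempts
```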

Key Takeaways

  • Review data provides unique competitive intelligence that no other data source offers.
  • Different platforms require different scraping strategies — build modular scrapers per platform.
  • Use sticky sessions for review pagination and per-request rotation between products.
  • Sort by newest first and stop at previously collected reviews for efficient incremental scraping.
  • Preprocess review text for sentiment analysis: clean, deduplicate, and extract key phrases.
  • Use ProxyHat's residential proxies with geo-targeting for reliable access to review pages across all platforms.

Ready to start collecting review data? See our Amazon scraping guide for platform-specific details and our e-commerce data scraping guide for the full strategy. Check using proxies in Python and using proxies in Node.js for implementation patterns.
