Why Product Teams Need Review Data at Scale
Your competitors' customers are telling you exactly what they hate, what they love, and what they wish existed. Right now. Publicly. The question isn't whether that data is valuable — it's whether you can collect and analyze it fast enough to act on it.
Product managers who rely on quarterly NPS surveys or manual review browsing are working with a fraction of the signal available online. Amazon alone hosts over 200 million reviews. Trustpilot holds 50 million and counting. G2 and Capterra catalog detailed B2B software feedback. Apple's App Store and Google Play are sentiment goldmines with version-by-version granularity.
Scraping product reviews for sentiment analysis lets you move from anecdotal understanding to systematic insight — tracking perception across markets, catching emerging complaints before they trend, and quantifying competitor weaknesses your roadmap can exploit.
This guide walks through the full pipeline: which sources to target, what data you can actually extract, how to pick the right proxy infrastructure, how to process reviews into sentiment, and the legal boundaries you must respect.
Target Sources and What You'll Find
Not all review platforms are created equal. Each serves a different market segment and exposes different data fields. Here's a breakdown of the five source categories that matter most.
Amazon Product Reviews
Amazon is the largest single repository of consumer product sentiment on the internet. For any physical product or digital tool sold on Amazon, you'll find:
- Star ratings (1–5) and review counts per product
- Review text, often detailed and multi-paragraph
- Verified purchase flags — critical for filtering out fake or incentivized reviews
- Helpful-vote counts — a proxy for review influence and representativeness
- Review date and sometimes product variant (size, color)
- Reviewer metadata — anonymized profile names, review history counts
Amazon aggressively rate-limits and blocks scraping. Residential proxies are essential. We'll cover proxy strategy in detail below.
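Downstream, it helps to normalize each scraped review into a structured record as early as possible. A minimal sketch of such a schema, covering the fields listed above; the field names are illustrative, not Amazon's own:

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class AmazonReview:
    """Normalized review record; field names are illustrative, not Amazon's."""
    product_id: str              # ASIN, e.g. "B0EXAMPLE"
    rating: int                  # 1-5 stars
    text: str                    # full review body, often multi-paragraph
    verified_purchase: bool      # key filter against fake/incentivized reviews
    helpful_votes: int           # proxy for review influence
    review_date: date
    variant: str | None = None   # size, color, etc., when present
    reviewer_hash: str = ""      # hashed profile name -- never store raw PII
```

Normalizing at ingest time also makes the PII rule easy to enforce: the raw reviewer name simply has no field to live in.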
Trustpilot
Trustpilot dominates for brand-level and service-level sentiment, especially in Europe. Data available includes star ratings, review text, reviewer country, and whether the reviewer was invited. Trustpilot's anti-scraping is moderate — datacenter proxies can work for smaller volumes, though residential is safer at scale.
Google Reviews
Google Reviews are attached to Google Maps business listings. They're invaluable for local businesses, hospitality, and any company with a physical presence. You get star ratings, review text, reviewer names (often anonymized), and review timestamps. Google's infrastructure is sophisticated — residential proxies with geo-targeting are required.
G2 and Capterra (B2B SaaS)
For B2B product teams, G2 and Capterra are where your buyers research. You'll find:
- Detailed pros/cons sections
- Star ratings broken into categories (ease of use, support, features)
- Reviewer role and company size — crucial for segment-level analysis
- Review text with specific feature callouts
These platforms have lighter anti-bot measures than Amazon or Google. Datacenter proxies are generally sufficient for moderate-scale collection.
App Store and Google Play Store
Mobile app reviews offer version-by-version sentiment tracking — you can correlate a release with a spike in negative reviews. Both stores expose star ratings, review text, reviewer names (anonymized), and app version metadata. Apple provides an official RSS-like endpoint for reviews, but it's limited. Google Play requires scraping. Residential proxies are recommended for Play Store at volume.
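Apple's feed can be queried directly, with no HTML parsing. A minimal sketch; the URL shape and JSON structure below reflect how the feed has historically been exposed, so treat both as assumptions to verify against the live endpoint:

```python
import requests

APP_ID = "123456789"  # hypothetical app ID
# Historically-exposed RSS-style review feed; verify the URL shape before relying on it.
url = (
    f"https://itunes.apple.com/us/rss/customerreviews/"
    f"page=1/id={APP_ID}/sortby=mostrecent/json"
)

resp = requests.get(url, timeout=30)
resp.raise_for_status()
# Entry structure is an assumption; inspect the live feed before parsing at scale.
for entry in resp.json().get("feed", {}).get("entry", []):
    rating = entry["im:rating"]["label"]    # star rating as a string
    version = entry["im:version"]["label"]  # app version reviewed
    text = entry["content"]["label"]        # review body
    print(rating, version, text[:80])
```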
What Data Is Actually Accessible
Before you build a pipeline, understand what each platform realistically gives you. Here's a comparison:
| Data Point | Amazon | Trustpilot | Google Reviews | G2/Capterra | App Stores |
|---|---|---|---|---|---|
| Star rating | Yes | Yes | Yes | Yes | Yes |
| Review text | Yes | Yes | Yes | Yes | Yes |
| Verified purchase | Yes | Partial | No | Yes (verified user) | Yes (verified download) |
| Helpful votes | Yes | No | No | Yes (upvotes) | No |
| Reviewer metadata | Anonymized | Anonymized | Anonymized | Role + company size | Anonymized |
| Date / version | Yes | Yes | Yes | Yes | Yes + app version |
| Anti-scraping strength | High | Low–Medium | High | Low | Medium–High |
Key principle: never collect or store personally identifiable information (PII). Reviewer display names, email addresses, and profile photos should be excluded from your pipeline or immediately hashed. More on this in the legal section below.
Proxy Selection: Matching Infrastructure to Source Difficulty
Proxy choice is the single biggest infrastructure decision in a review-scraping project. Pick wrong and you'll burn time on blocked requests, CAPTCHAs, and incomplete datasets. Here's the framework.
Residential Proxies: Required for Amazon and Google
Amazon and Google operate some of the most sophisticated bot-detection systems on the internet. They fingerprint connection patterns, check IP reputation, and flag datacenter IP ranges aggressively. If you send requests from a datacenter IP to Amazon, you'll get CAPTCHAs or silent blocks within dozens of requests.
Residential proxies route your requests through real ISP-assigned IPs, making each request appear to come from a legitimate home or mobile connection. This is non-negotiable for Amazon and Google review scraping.
For Amazon, you'll also want sticky sessions (maintaining the same IP for 10–30 minutes) so you can paginate through multiple review pages without triggering session-consistency checks.
Example request with ProxyHat residential proxies, geo-targeted to the US:
```bash
curl -x "http://user-country-US-session-rv42:password@gate.proxyhat.com:8080" \
  "https://www.amazon.com/product-reviews/B0EXAMPLE/ref=cm_cr_dp_d_show_all_btm"
```
Datacenter Proxies: Sufficient for Trustpilot and G2
Platforms with lighter anti-bot measures — Trustpilot, G2, Capterra — can be scraped reliably with datacenter proxies at a fraction of the cost. These IPs are faster and more stable, which matters when you're collecting hundreds of thousands of reviews across many products.
The tradeoff: datacenter IPs are more easily identified as non-residential. If you need very high volumes from Trustpilot (millions of reviews), you may still want to blend in residential requests to avoid rate-limiting.
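One simple way to blend the two pools is to route a fixed fraction of requests through residential IPs. A sketch, where the gateway URLs are hypothetical and the 20% residential share is a starting point to tune, not a rule:

```python
import random

# Hypothetical gateway URLs for the two pools.
DATACENTER = "http://dc-user:password@dc.proxyhat.com:8080"
RESIDENTIAL = "http://user-country-US:password@gate.proxyhat.com:8080"

def pick_proxy(residential_share: float = 0.2) -> dict:
    """Route most traffic through cheap datacenter IPs, the rest residential."""
    proxy = RESIDENTIAL if random.random() < residential_share else DATACENTER
    return {"http": proxy, "https": proxy}
```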
Mobile Proxies: For App Store Scraping
Google Play Store scraping benefits from mobile proxy IPs because Google expects Play Store traffic from mobile devices. Mobile proxies carry mobile carrier IP ranges, reducing the chance of blocks and improving data fidelity.
Proxy Decision Matrix
| Source | Proxy Type | Rotation Strategy | Estimated Cost Impact |
|---|---|---|---|
| Amazon | Residential (sticky) | Rotate every 10–30 min | High |
| Google Reviews | Residential (rotating) | Per-request or every 5 min | High |
| Trustpilot | Datacenter (rotating) | Per-request | Low |
| G2 / Capterra | Datacenter (rotating) | Per-request | Low |
| App Stores | Mobile or Residential | Rotate every 10 min | Medium–High |
The Downstream Pipeline: From Raw HTML to Sentiment
Scraping is only 20% of the work. The other 80% is turning messy, multilingual review text into structured sentiment you can act on. Here's the pipeline that works.
Step 1: Deduplication
Reviews get duplicated across scraping runs, and some cross-post across platforms (the same person reviewing on Amazon and Trustpilot). Deduplicate on a composite key: platform + product_id + reviewer_hash + review_date + first_50_chars. This catches exact duplicates and near-duplicates from re-runs.
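A sketch of that composite key in Python; hashing the concatenation keeps keys compact and avoids carrying raw reviewer names through the pipeline:

```python
import hashlib

def dedup_key(platform: str, product_id: str, reviewer_hash: str,
              review_date: str, text: str) -> str:
    """Composite key: platform + product_id + reviewer_hash + date + first 50 chars."""
    raw = "|".join([platform, product_id, reviewer_hash, review_date, text[:50]])
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()

seen: set[str] = set()

def is_duplicate(review: dict) -> bool:
    key = dedup_key(review["platform"], review["product_id"],
                    review["reviewer_hash"], review["review_date"], review["text"])
    if key in seen:
        return True
    seen.add(key)
    return False
```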
For cross-platform dedup, use fuzzy matching on review text (Levenshtein distance or MinHash) to identify the same person posting the same review on multiple sites. This matters for accurate sentiment counts — you don't want to double-count a vocal complainer.
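For the fuzzy pass, a library such as rapidfuzz (one option among several; MinHash with LSH scales better past a few hundred thousand reviews) keeps pairwise comparison cheap:

```python
from rapidfuzz import fuzz, utils  # pip install rapidfuzz

def is_near_duplicate(a: str, b: str, threshold: float = 90.0) -> bool:
    """Token-sort similarity is robust to reordered phrasing across platforms."""
    score = fuzz.token_sort_ratio(a, b, processor=utils.default_process)
    return score >= threshold

# Example: the same complaint posted with reshuffled wording scores as a duplicate.
print(is_near_duplicate("Battery dies in 4 hours, avoid.",
                        "Avoid - battery dies in 4 hours."))  # True
```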
Step 2: Language Detection and Translation
If you're analyzing global products, reviews will come in dozens of languages. Use a fast language-detection library (like langdetect or fastText's language model) to tag each review's language. Then translate non-English reviews using a high-quality translation API (DeepL, Google Cloud Translation, or a local LLM).
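A sketch of the tagging step with langdetect; the translation call is left as a stub, since the right backend depends on your volume and budget:

```python
from langdetect import detect  # pip install langdetect

def tag_language(review: dict) -> dict:
    """Tag each review with a detected ISO language code."""
    try:
        review["lang"] = detect(review["text"])
    except Exception:  # very short or emoji-only reviews can fail detection
        review["lang"] = "unknown"
    return review

def needs_translation(review: dict) -> bool:
    return review["lang"] not in ("en", "unknown")

# Translation stub: wire up DeepL, Google Cloud Translation, or a local LLM here.
def translate_to_english(text: str, source_lang: str) -> str:
    raise NotImplementedError("choose a translation backend")
```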
Why translate instead of analyzing in-language? Most sentiment models are trained on English. Translating first gives you higher accuracy than running multilingual models on low-resource languages. Budget roughly $2–5 per 10,000 reviews for translation.
Step 3: LLM-Based Sentiment and Theme Extraction
This is where the pipeline becomes genuinely useful. Traditional sentiment analysis (TextBlob, VADER) gives you a polarity score but not the why. Modern LLM-based extraction gives you both sentiment and the specific themes driving it.
Design your extraction prompt to output structured JSON:
```json
{
  "sentiment": "negative",
  "sentiment_score": 0.15,
  "themes": ["battery life", "charging speed"],
  "complaints": ["dies within 4 hours", "takes 3+ hours to charge"],
  "praises": [],
  "competitor_mentions": []
}
```
Running this over 50,000 reviews of your competitor's flagship product and aggregating themes by frequency gives you a ranked list of customer pain points — your product roadmap's best input.
Step 4: Aggregation and Visualization
Aggregate sentiment by theme, time period, product variant, and geography. Build dashboards that let product managers filter to: "Show me all negative themes about battery life for Product X in Germany over the last 90 days." Tools like Metabase, Looker, or even a well-structured Google Sheet can work for this.
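Before the dashboard layer, the aggregation itself is a few lines of pandas. A sketch assuming one row per review with a list-valued themes column, as produced by the extraction step:

```python
import pandas as pd

# Tiny sample; in practice this comes from your extraction step.
df = pd.DataFrame([
    {"product": "X", "country": "DE", "review_date": "2024-05-01",
     "sentiment_score": 0.15, "themes": ["battery life", "charging speed"]},
    {"product": "X", "country": "DE", "review_date": "2024-05-03",
     "sentiment_score": 0.20, "themes": ["battery life"]},
])
df["review_date"] = pd.to_datetime(df["review_date"])

exploded = df.explode("themes")  # one row per (review, theme) pair
theme_summary = (
    exploded[exploded["sentiment_score"] < 0.4]   # negative-leaning reviews only
    .groupby(["product", "country", "themes"])
    .agg(mentions=("themes", "size"), avg_score=("sentiment_score", "mean"))
    .sort_values("mentions", ascending=False)
)
print(theme_summary.head(10))  # top negative themes per product and geography
```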
Strategic Use Cases with Real Numbers
Use Case 1: Pre-Launch Market Research
A SaaS company planning to launch a project-management tool scrapes 30,000 reviews across 15 competing products on G2 and Capterra. Theme extraction reveals that 37% of negative reviews mention "reporting" as a pain point, and 24% of 1-star reviews specifically complain about "inability to customize dashboards." The product team uses this to justify prioritizing custom dashboards in v1, with a clear quantified TAM argument.
ROI: The scraping and analysis cost approximately $800 in proxy fees and LLM API costs. The insight led to a feature prioritization that the team estimates will improve trial conversion by 15–20%, representing ~$300K ARR impact.
Use Case 2: Post-Launch Sentiment Tracking
After a major release, a mobile app team scrapes their own App Store and Play Store reviews daily, running theme extraction on each batch. Within 48 hours of release, negative sentiment spikes around "login crashes on Android 14." The theme appears in 31% of 1-star reviews, up from 2% the previous week. The team ships a hotfix within 72 hours, and negative sentiment drops back to baseline within 10 days.
ROI: Without automated review monitoring, this issue would have been caught through support tickets 5–7 days later. The faster detection saved an estimated 2,000 churned users, worth ~$120K in annual revenue.
Use Case 3: Competitor Weakness Detection
An e-commerce brand scrapes 100,000 Amazon reviews across 5 competitors in their category. They discover that Competitor B's top complaint theme is "packaging damage" (mentioned in 18% of negative reviews), while their own product's top complaint is "price." They launch a marketing campaign highlighting their superior packaging, targeting Competitor B's dissatisfied customers. The campaign achieves a 3.2× higher click-through rate than generic competitor-comparison ads.
Build vs. Buy: Infrastructure Decisions
Every product team faces this question: should you build the scraping and analysis pipeline in-house or use an off-the-shelf solution?
Build In-House When:
- You need continuous, high-volume collection (millions of reviews per quarter)
- You have engineering bandwidth to maintain scrapers across 5+ platforms
- Your analysis requires custom LLM prompts or proprietary taxonomy
- Data privacy requires on-premise processing
Buy (Use Existing Tools) When:
- You need one-time or quarterly snapshots
- Your team lacks scraping expertise or maintenance bandwidth
- Speed to insight matters more than pipeline control
- You're validating a hypothesis before committing to ongoing monitoring
Most teams start with a hybrid: self-managed proxies (like ProxyHat) for the infrastructure layer, custom scrapers for their primary sources, and off-the-shelf sentiment APIs for the analysis. This gives you control over data freshness and cost while avoiding reinventing the wheel on NLP.
Legal and Ethical Boundaries
Review data sits in a legal gray zone. Reviews are publicly visible, but platforms' Terms of Service almost universally prohibit scraping. Here's how to navigate this responsibly.
What's Generally Acceptable
- Collecting publicly visible review text, ratings, and dates — this information is freely accessible to any browser user
- Analyzing aggregated sentiment trends — this is transformative use, not reproduction
- Storing review data for internal analysis — not republishing it verbatim
What's Not Acceptable
- Collecting or storing reviewer PII — real names, email addresses, social profiles linked from reviews
- Republishing scraped reviews — creating a review aggregation site with scraped content
- Circumventing authentication — scraping reviews behind login walls without authorization
- Ignoring robots.txt — at minimum, check and respect the platform's crawling directives (a quick programmatic check follows this list)
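That robots.txt check takes only the Python standard library:

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://www.trustpilot.com/robots.txt")  # example target
rp.read()

url = "https://www.trustpilot.com/review/example.com"
if rp.can_fetch("*", url):
    print("Crawling this path is permitted by robots.txt")
else:
    print("Disallowed: skip it or find another data source")
```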
Practical Guidelines
- Hash or drop reviewer names immediately in your pipeline (see the sketch after this list)
- Rate-limit your scraping to avoid degrading the platform's service
- Use the data for internal decision-making, not public redistribution
- Consult legal counsel if you're scraping at very large scale or in regulated industries
- Be transparent internally about data sources and collection methods
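The hashing guideline above is a one-liner with a keyed hash. A sketch; the environment variable holding the key is hypothetical, and keying the hash prevents trivial reversal of common display names:

```python
import hashlib
import hmac
import os

# Keep the key out of source control (e.g. a secrets manager); env var is hypothetical.
HASH_KEY = os.environ.get("REVIEWER_HASH_KEY", "change-me").encode("utf-8")

def anonymize_reviewer(display_name: str) -> str:
    """Deterministic keyed hash: stable across runs, not reversible without the key."""
    return hmac.new(HASH_KEY, display_name.encode("utf-8"), hashlib.sha256).hexdigest()

record = {"reviewer_hash": anonymize_reviewer("Jane D."), "rating": 2}
# The raw display name is never stored anywhere downstream.
```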
Key principle: The ethical line isn't about whether you can technically access the data — it's about whether your use respects the reviewers who shared their opinions and the platforms that host them. Anonymize, aggregate, and act on insights without reproducing the raw data.
Calculating ROI on Review Intelligence
Before investing in a scraping and analysis pipeline, quantify the expected return. Here's a simple framework:
- Cost side: Proxy costs ($200–800/month for residential at scale) + LLM API costs ($50–200/month for sentiment extraction) + engineering time (40–80 hours for initial build)
- Value side: Revenue protected by faster issue detection + revenue gained from better feature prioritization + marketing efficiency from targeted competitor campaigns
A conservative estimate: if review intelligence helps you avoid one major feature misstep per year (worth ~$100K in wasted development) and catch one post-launch crisis 3 days earlier (worth ~$50K in reduced churn), the annual value is $150K against a cost of $5–15K. That's a 10–30× ROI.
Key Takeaways
- Match your proxy type to the source: residential for Amazon and Google, datacenter for Trustpilot and G2, mobile for app stores.
- Deduplication and translation are pipeline essentials, not afterthoughts — budget time and compute for both.
- LLM-based theme extraction turns reviews from noise into a ranked list of customer pain points and feature requests.
- The highest-ROI use cases are pre-launch competitor analysis, post-launch issue detection, and targeted competitor-weakness campaigns.
- Anonymize reviewer data, respect platform ToS, and use insights internally — don't republish scraped content.
- Start small: scrape one competitor on one platform, run sentiment analysis, and prove value before scaling the pipeline.
Ready to start collecting review data? ProxyHat's residential and datacenter proxy plans give you the infrastructure to scrape Amazon, Google, Trustpilot, and app stores reliably — with geo-targeting and sticky sessions built in. For a broader overview of scraping use cases, see our web scraping guide.