Scrape Walmart: API vs HTML — Pick Your Battle
If you're building a competitive intelligence pipeline for CPG or retail, Walmart.com is the crown jewel of product data. But before you write a single line of code, you need to answer one question: are you calling Walmart's Affiliate API, or are you scraping the HTML?
Walmart's Affiliate Product API returns structured JSON for products, prices, and availability — but it requires an affiliate key, returns limited fields, and throttles at roughly 5 requests per second. For price monitoring, SERP tracking, or marketplace analysis at scale, that ceiling is a non-starter.
That leaves HTML scraping. Walmart's storefront runs on Next.js, which means every product page ships a rich JSON payload inside __NEXT_DATA__. This is the easiest, most reliable way to scrape Walmart product data — if you can get past the anti-bot wall.
Walmart's Catalog Structure: URLs That Matter
Walmart organizes its catalog across three page types. Understanding these URL patterns is the first step to building a scraper that doesn't break on day two.
Product pages
Every item lives at a URL like:
https://www.walmart.com/ip/Ozark-Trail-4-Person-Dome-Tent/553491704

The format is /ip/{slug}/{itemId}. The itemId (here 553491704) is the canonical identifier — the slug can change over time, but the itemId stays constant. Always key your database on the itemId.
You can also load a product using only the itemId:
https://www.walmart.com/ip/item/553491704

This redirect-safe format is preferred for automated crawls because it avoids stale slugs.
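If you ingest product URLs from sitemaps or search results, a small helper can normalize them to the canonical itemId. A minimal sketch, assuming only the /ip/{slug}/{itemId} and /ip/item/{itemId} patterns described above:

```python
import re
from typing import Optional

def extract_item_id(url: str) -> Optional[str]:
    """Pull the numeric itemId out of a Walmart /ip/ URL."""
    # Matches both /ip/{slug}/{itemId} and /ip/item/{itemId}
    match = re.search(r"/ip/(?:[^/]+/)?(\d+)", url)
    return match.group(1) if match else None

print(extract_item_id("https://www.walmart.com/ip/Ozark-Trail-4-Person-Dome-Tent/553491704"))  # 553491704
print(extract_item_id("https://www.walmart.com/ip/item/553491704"))  # 553491704
```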
Search results
https://www.walmart.com/search?q=laptop&sort=price_low

Search pages return up to 40 items per page. You can paginate with the page query parameter. The sort and facet parameters let you filter by price, brand, and category — useful for focused crawls.
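Generating the URLs for a paginated search crawl is then mechanical. A sketch, assuming the q, sort, and page parameters behave as described above:

```python
from urllib.parse import urlencode

def search_urls(query: str, pages: int, sort: str = "price_low") -> list:
    """Build paginated Walmart search URLs for one query."""
    return [
        f"https://www.walmart.com/search?{urlencode({'q': query, 'sort': sort, 'page': page})}"
        for page in range(1, pages + 1)
    ]

for url in search_urls("laptop", pages=3):
    print(url)
```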
Category and department pages
https://www.walmart.com/cp/electronics/3944
https://www.walmart.com/cp/health/976760

Category pages use /cp/{name}/{categoryId}. They're the best starting point when you need to crawl an entire vertical — electronics, grocery, home — rather than specific search terms.
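Category crawls pair naturally with the same pagination approach. A sketch, assuming category pages accept the same page parameter as search (worth verifying on your target categories):

```python
def category_urls(name: str, category_id: str, pages: int) -> list:
    """Build paginated URLs for a Walmart category (/cp/) crawl."""
    base = f"https://www.walmart.com/cp/{name}/{category_id}"
    return [f"{base}?page={page}" for page in range(1, pages + 1)]

for url in category_urls("electronics", "3944", pages=2):
    print(url)
```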
The Anti-Bot Wall: Akamai + PerimeterX (HUMAN)
Walmart doesn't serve its product pages to just anyone. The site runs a two-layer anti-bot stack that blocks the majority of naive scrapers:
- Akamai Bot Manager — fingerprints your TLS handshake, HTTP/2 frame order, and cipher suite list. Datacenter IP ranges are flagged at the edge. Akamai also injects a sensor script that collects behavioral signals (mouse movements, keystroke timing, scroll patterns).
- PerimeterX (now HUMAN) — a client-side challenge that runs JavaScript to validate browser environments. If the challenge fails, you get a 403 or a CAPTCHA wall. PerimeterX is particularly aggressive on search and category pages.
The result: a plain requests.get() call from a datacenter IP will almost always receive a 403 or a CAPTCHA page. This is why a Walmart proxy strategy built on residential IPs is non-negotiable.
Why residential proxies — not datacenter
| Proxy Type | Walmart Success Rate | Speed | Best For |
|---|---|---|---|
| Datacenter | < 10% | Fast | Not recommended for Walmart |
| Residential (rotating) | 85–95% | Medium | Bulk product crawling, price monitoring |
| Residential (sticky session) | 90–98% | Medium | Login-protected pages, cart-level data |
| Mobile | 95%+ | Slower | Mobile-specific endpoints, app data |
Datacenter IPs are on Akamai's known-bot lists. Residential IPs blend in with real consumer traffic, making them far harder to fingerprint. For Walmart specifically, rotating residential proxies with per-request IP rotation give you the best balance of speed and stealth. Check ProxyHat pricing for residential proxy plans that fit crawl volumes from 1,000 to 10 million requests.
Parse the Hidden JSON: __NEXT_DATA__
Here's the insight that saves you hours: Walmart's product pages are Next.js apps, and Next.js hydrates its pages by embedding a JSON blob in a script tag:
```html
<script id="__NEXT_DATA__" type="application/json">
{ ... massive JSON payload ... }
</script>
```

This blob contains everything — price, inventory status, ratings, seller info, variant data, shipping options. Instead of parsing brittle CSS selectors, you extract this JSON and navigate it like a dictionary. It's the single most important trick for anyone who wants to scrape Walmart reliably.
What's inside __NEXT_DATA__
The payload is nested under props.pageProps.initialData.data. A truncated example:
```json
{
  "product": {
    "itemId": "553491704",
    "name": "Ozark Trail 4-Person Dome Tent",
    "priceInfo": {
      "currentPrice": {
        "price": 49.97,
        "currencyUnit": "USD"
      },
      "priceRangeString": "$49.97"
    },
    "availabilityStatus": "IN_STOCK",
    "rating": { "averageRating": 4.2, "numberOfReviews": 1847 },
    "sellerId": "0",
    "sellerName": "Walmart.com",
    "variantCategories": []
  }
}
```

Note the sellerId field — "0" means 1P (Walmart itself). Any other value is a 3P Marketplace seller. We'll dig into that distinction shortly.
Python: Fetch and Parse Walmart Product Data
Below is a production-ready snippet that fetches a Walmart product page through ProxyHat residential proxies, extracts __NEXT_DATA__, and returns structured fields.
```python
import requests
import json

PROXY_USER = "user-country-US"  # geo-target to US
PROXY_PASS = "your_password"
PROXY_URL = f"http://{PROXY_USER}:{PROXY_PASS}@gate.proxyhat.com:8080"

HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/125.0.0.0 Safari/537.36"
    ),
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
}

def fetch_product(item_id: str) -> str:
    url = f"https://www.walmart.com/ip/item/{item_id}"
    proxies = {"http": PROXY_URL, "https": PROXY_URL}
    resp = requests.get(url, headers=HEADERS, proxies=proxies, timeout=15)
    resp.raise_for_status()
    return resp.text

def extract_next_data_json(html: str) -> str:
    """Locate the raw __NEXT_DATA__ JSON string in Walmart HTML."""
    marker = '<script id="__NEXT_DATA__" type="application/json">'
    start = html.find(marker)
    if start == -1:
        raise ValueError("__NEXT_DATA__ not found — possible CAPTCHA page")
    start += len(marker)
    end = html.find("</script>", start)
    return html[start:end]

def parse_next_data(html: str) -> dict:
    """Extract structured product fields from __NEXT_DATA__."""
    payload = json.loads(extract_next_data_json(html))
    product = payload["props"]["pageProps"]["initialData"]["data"]["product"]
    return {
        "item_id": product["itemId"],
        "name": product["name"],
        "price": product["priceInfo"]["currentPrice"]["price"],
        "currency": product["priceInfo"]["currentPrice"]["currencyUnit"],
        "availability": product["availabilityStatus"],
        "rating": product["rating"]["averageRating"],
        "review_count": product["rating"]["numberOfReviews"],
        "seller_id": product.get("sellerId", "0"),
        "seller_name": product.get("sellerName", "Walmart.com"),
    }

# --- Usage ---
html = fetch_product("553491704")
data = parse_next_data(html)
print(json.dumps(data, indent=2))
```

A few notes on this code:
- The user-country-US flag in the ProxyHat username routes your request through a US residential IP — critical for Walmart, which serves different catalogs and prices to non-US visitors.
- If __NEXT_DATA__ is missing, you likely hit a CAPTCHA or block page. Retry with a fresh IP by omitting the session flag (per-request rotation is the default).
- For sticky sessions — say you need to maintain cookies across paginated requests — use user-session-abc123-country-US in the username string, as in the sketch below.
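For example, a sticky-session variant of fetch_product might look like this, reusing the HEADERS and PROXY_PASS defined earlier and embedding a caller-chosen session token in the username:

```python
def fetch_with_session(item_id: str, session_id: str) -> str:
    """Fetch a product page while pinning all requests to one residential IP."""
    # Session token embedded in the ProxyHat username, per the format above
    proxy_url = f"http://user-session-{session_id}-country-US:{PROXY_PASS}@gate.proxyhat.com:8080"
    proxies = {"http": proxy_url, "https": proxy_url}
    url = f"https://www.walmart.com/ip/item/{item_id}"
    resp = requests.get(url, headers=HEADERS, proxies=proxies, timeout=15)
    resp.raise_for_status()
    return resp.text

# Both requests exit through the same IP as long as the session token matches
for iid in ["553491704", "193640502"]:
    fetch_with_session(iid, session_id="abc123")
```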
Marketplace (3P) vs First-Party (1P) Catalog
Walmart Marketplace has grown to over 100,000 third-party sellers. When you scrape Walmart product data, you need to know whether you're looking at a 1P or 3P listing — the pricing dynamics, fulfillment, and competitive implications are entirely different.
How to identify 1P vs 3P in the data
- sellerId = "0" → Walmart 1P (sold and shipped by Walmart).
- sellerId ≠ "0" → 3P Marketplace seller. The sellerName field gives you the seller's display name.
- fulfillmentType → Look for "WFS" (Walmart Fulfillment Services) or "SELLER_FULFILLED". WFS items are stored and shipped by Walmart on behalf of the 3P seller.
For competitive intel, 3P data is often more valuable: it reveals MAP violations, seller proliferation, and fulfillment strategies. Make sure your pipeline tags every record with is_1p and seller_id so you can segment your analysis downstream.
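That tagging can happen at parse time. A minimal sketch that builds on the parse_next_data output and the html fetched in the earlier usage example:

```python
def tag_listing(record: dict) -> dict:
    """Annotate a parsed product record with 1P/3P segmentation fields."""
    record["is_1p"] = record["seller_id"] == "0"  # "0" = sold by Walmart itself
    return record

record = tag_listing(parse_next_data(html))
print(record["is_1p"], record["seller_name"])
```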
Extracting all offers for a product
A single product page may show multiple sellers under "Other sellers on Walmart.com." The __NEXT_DATA__ payload includes these in the offers array:
```python
offers = product.get("offers", [])
for offer in offers:
    print({
        "seller_id": offer.get("sellerId"),
        "seller_name": offer.get("sellerName"),
        "price": offer.get("priceInfo", {}).get("currentPrice", {}).get("price"),
        "fulfillment": offer.get("fulfillmentType"),
    })
```

This is gold for price monitoring: you can track every seller's price for a given SKU over time.
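To turn those offers into a time series, append each observation with a timestamp. A sketch using SQLite; the schema and filename are illustrative, not prescriptive:

```python
import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect("walmart_prices.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS price_history "
    "(item_id TEXT, seller_id TEXT, price REAL, observed_at TEXT)"
)

def record_offers(item_id: str, offers: list) -> None:
    """Append one timestamped row per seller offer for this SKU."""
    now = datetime.now(timezone.utc).isoformat()
    rows = [
        (item_id, o.get("sellerId"), o.get("priceInfo", {}).get("currentPrice", {}).get("price"), now)
        for o in offers
    ]
    conn.executemany("INSERT INTO price_history VALUES (?, ?, ?, ?)", rows)
    conn.commit()
```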
Rate-Limit-Aware Scheduling
Even with residential proxies, Walmart will rate-limit IPs that make too many requests. From extensive testing, here are the practical thresholds:
- Product pages: ~30 requests per minute per IP before you see soft blocks (CAPTCHA challenges).
- Search pages: ~15 requests per minute per IP — PerimeterX is more aggressive here.
- Category pages: ~20 requests per minute per IP.
With per-request IP rotation (the default on ProxyHat residential proxies), each request goes out from a different IP, so these per-IP limits don't constrain your overall throughput. But you should still pace your total request volume to avoid pattern detection across the proxy pool.
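If you do pin requests to a single IP with sticky sessions, you can turn those thresholds into per-request delays. A small helper, using the estimates above with some headroom:

```python
# Approximate per-IP request budgets (requests per minute), per the thresholds above
MAX_RPM = {"product": 30, "search": 15, "category": 20}

def min_delay(page_type: str, safety_factor: float = 1.5) -> float:
    """Seconds to wait between requests to one IP, staying below the soft-block limit."""
    return 60.0 / MAX_RPM[page_type] * safety_factor

print(min_delay("search"))  # 6.0 seconds between search requests per IP
```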
A rate-limited crawl scheduler
```python
import time
import random
from concurrent.futures import ThreadPoolExecutor, as_completed

BASE_DELAY = 2.0  # seconds between requests per thread
JITTER = 0.5      # randomize to avoid clockwork patterns

def crawl_item(item_id: str) -> dict:
    delay = BASE_DELAY + random.uniform(-JITTER, JITTER)
    time.sleep(delay)
    html = fetch_product(item_id)
    return parse_next_data(html)

item_ids = ["553491704", "193640502", "44981213", "843761921"]

# Use 3-5 concurrent threads; each request gets its own rotating IP
with ThreadPoolExecutor(max_workers=4) as pool:
    futures = {pool.submit(crawl_item, iid): iid for iid in item_ids}
    for future in as_completed(futures):
        try:
            result = future.result()
            print(f"Done {result['item_id']}: ${result['price']}")
        except Exception as e:
            print(f"Failed {futures[future]}: {e}")
```

For large-scale crawls (10,000+ SKUs), consider a job queue like Celery or Temporal with retry logic and dead-letter handling. Store raw HTML in object storage (S3, GCS) so you can re-parse without re-fetching if your extraction logic changes.
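Archiving raw HTML is a few lines with boto3. A sketch assuming an existing bucket named walmart-raw-html; the bucket name and key layout are illustrative:

```python
import boto3
from datetime import datetime, timezone

s3 = boto3.client("s3")

def archive_html(item_id: str, html: str) -> None:
    """Store the raw page so extraction logic can be re-run without re-crawling."""
    date = datetime.now(timezone.utc).strftime("%Y-%m-%d")
    key = f"walmart/{date}/{item_id}.html"  # illustrative key layout
    s3.put_object(Bucket="walmart-raw-html", Key=key, Body=html.encode("utf-8"))
```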
Search and Category Scraping
Product pages are straightforward, but search and category pages require extra care. The __NEXT_DATA__ on these pages contains a searchResult object with summary data for each item.
```python
def parse_search_page(html: str) -> list:
    payload = json.loads(extract_next_data_json(html))
    results = (
        payload["props"]["pageProps"]
        ["initialData"]["searchResult"]
        ["itemStacks"][0]["items"]
    )
    return [
        {
            "item_id": item["itemId"],
            "name": item["name"],
            "price": item["priceInfo"]["linePrice"],
            "availability": item.get("availabilityStatus"),
        }
        for item in results
    ]
```

Search pages are more heavily protected. If you're getting blocked, try SERP tracking proxies with mobile user agents — Walmart's mobile endpoints are sometimes less aggressive with bot detection.
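Swapping in a mobile user agent only requires changing the headers. A sketch reusing the HEADERS dict from earlier; the UA string below is a generic Chrome-on-Android identifier, not anything Walmart-specific:

```python
MOBILE_HEADERS = {
    **HEADERS,
    # Generic Chrome-on-Android user agent (illustrative)
    "User-Agent": (
        "Mozilla/5.0 (Linux; Android 14; Pixel 8) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/125.0.0.0 Mobile Safari/537.36"
    ),
}

resp = requests.get(
    "https://www.walmart.com/search?q=laptop",
    headers=MOBILE_HEADERS,
    proxies={"http": PROXY_URL, "https": PROXY_URL},
    timeout=15,
)
```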
Error Handling and Retry Strategy
Even with the best proxies, some requests will fail. Build your pipeline to handle these gracefully:
- HTTP 403: IP was flagged. Retry immediately with a new rotating IP (the default on ProxyHat).
- HTTP 429: Rate limit hit. Back off exponentially — start at 10 seconds, double each retry, cap at 5 minutes.
- CAPTCHA page: Detected as bot. The page HTML will contain "cf-captcha" or "px-captcha" markers. Discard the response, rotate IP, and retry after a delay.
- Missing __NEXT_DATA__: The page loaded but didn't render the JSON blob. This can happen on redirect pages or out-of-stock items that redirect to search. Check for a redirect and log the URL.
Here's a compact retry wrapper that implements this strategy:
```python
import time
import requests

def fetch_with_retry(item_id: str, max_retries: int = 3) -> dict:
    for attempt in range(max_retries):
        try:
            html = fetch_product(item_id)
            return parse_next_data(html)
        except ValueError as e:
            # CAPTCHA page or missing __NEXT_DATA__: rotate IP and retry
            if "CAPTCHA" in str(e) or "not found" in str(e):
                wait = 5 * (2 ** attempt)  # 5s, 10s, 20s
                print(f"Retry {attempt+1}/{max_retries} for {item_id}, waiting {wait}s")
                time.sleep(wait)
            else:
                raise
        except requests.HTTPError as e:
            if e.response.status_code == 429:
                # Rate limited: exponential backoff from 10s, capped at 5 minutes
                time.sleep(min(10 * (2 ** attempt), 300))
            elif e.response.status_code == 403:
                pass  # flagged IP: retry immediately, rotation supplies a fresh IP
            else:
                raise
    raise RuntimeError(f"Failed after {max_retries} retries: {item_id}")
```

Key Takeaways
1. Skip the API, scrape __NEXT_DATA__. Walmart's Affiliate API is throttled and limited. The Next.js JSON blob embedded in every product page gives you richer data with zero authentication overhead.
2. Residential proxies are mandatory. Akamai and PerimeterX block datacenter IPs outright. Use rotating residential proxies with US geo-targeting to blend in with real shoppers.
3. Key on itemId, not the slug. Slugs change; itemIds don't. Build your database around the numeric ID.
4. Tag 1P vs 3P on every record. The sellerId field tells you whether a listing is Walmart first-party or a Marketplace seller. This distinction is critical for competitive analysis.
5. Pace your requests. Even with rotating IPs, add random delays and limit concurrency to avoid behavioral fingerprinting.
6. Store raw HTML. Archive responses in object storage so you can re-parse without re-crawling when your extraction logic evolves.
Ready to start crawling? ProxyHat's residential proxy network covers every US state and 190+ countries — exactly the geographic diversity you need for Walmart at scale. See available locations or get started with a plan.






