Scrape Walmart: API vs HTML — Pick Your Battle
If you're building a competitive intelligence pipeline for CPG or retail, Walmart.com is the crown jewel of product data. But before you write a single line of code, you need to answer one question: are you calling Walmart's Affiliate API, or are you scraping the HTML?
Walmart's Affiliate Product API returns structured JSON for products, prices, and availability — but it requires an affiliate key, returns limited fields, and throttles at roughly 5 requests per second. For price monitoring, SERP tracking, or marketplace analysis at scale, that ceiling is a non-starter.
That leaves HTML scraping. Walmart's storefront runs on Next.js, which means every product page ships a rich JSON payload inside __NEXT_DATA__. This is the easiest, most reliable way to scrape Walmart product data — if you can get past the anti-bot wall.
Walmart's Catalog Structure: URLs That Matter
Walmart organizes its catalog across three page types. Understanding these URL patterns is the first step to building a scraper that doesn't break on day two.
Product pages
Every item lives at a URL like:
https://www.walmart.com/ip/Ozark-Trail-4-Person-Dome-Tent/553491704

The format is /ip/{slug}/{itemId}. The itemId (here 553491704) is the canonical identifier — the slug can change over time, but the itemId stays constant. Always key your database on the itemId.
You can also load a product using only the itemId:
https://www.walmart.com/ip/item/553491704

This redirect-safe format is preferred for automated crawls because it avoids stale slugs.
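If you ingest product URLs from sitemaps or search results, a small helper can normalize them to the canonical itemId. A minimal sketch, assuming only the /ip/{slug}/{itemId} and /ip/item/{itemId} patterns described above:

```python
import re
from typing import Optional

def extract_item_id(url: str) -> Optional[str]:
    """Pull the numeric itemId out of a Walmart /ip/ URL."""
    # Matches both /ip/{slug}/{itemId} and /ip/item/{itemId}
    match = re.search(r"/ip/(?:[^/]+/)?(\d+)", url)
    return match.group(1) if match else None

print(extract_item_id("https://www.walmart.com/ip/Ozark-Trail-4-Person-Dome-Tent/553491704"))  # 553491704
print(extract_item_id("https://www.walmart.com/ip/item/553491704"))  # 553491704
```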
Search results
https://www.walmart.com/search?q=laptop&sort=price_low

Search pages return up to 40 items per page. You can paginate with the page query parameter. The sort and facet parameters let you filter by price, brand, and category — useful for focused crawls.
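Generating the URLs for a paginated search crawl is then mechanical. A sketch, assuming the q, sort, and page parameters behave as described above:

```python
from urllib.parse import urlencode

def search_urls(query: str, pages: int, sort: str = "price_low") -> list:
    """Build paginated Walmart search URLs for one query."""
    return [
        f"https://www.walmart.com/search?{urlencode({'q': query, 'sort': sort, 'page': page})}"
        for page in range(1, pages + 1)
    ]

for url in search_urls("laptop", pages=3):
    print(url)
```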
Category and department pages
https://www.walmart.com/cp/electronics/3944
https://www.walmart.com/cp/health/976760

Category pages use /cp/{name}/{categoryId}. They're the best starting point when you need to crawl an entire vertical — electronics, grocery, home — rather than specific search terms.
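Category crawls pair naturally with the same pagination approach. A sketch, assuming category pages accept the same page parameter as search (worth verifying on your target categories):

```python
def category_urls(name: str, category_id: str, pages: int) -> list:
    """Build paginated URLs for a Walmart category (/cp/) crawl."""
    base = f"https://www.walmart.com/cp/{name}/{category_id}"
    return [f"{base}?page={page}" for page in range(1, pages + 1)]

for url in category_urls("electronics", "3944", pages=2):
    print(url)
```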
The Anti-Bot Wall: Akamai + PerimeterX (HUMAN)
Walmart doesn't serve its product pages to just anyone. The site runs a two-layer anti-bot stack that blocks the majority of naive scrapers:
- Akamai Bot Manager — fingerprints your TLS handshake, HTTP/2 frame order, and cipher suite list. Datacenter IP ranges are flagged at the edge. Akamai also injects a sensor script that collects behavioral signals (mouse movements, keystroke timing, scroll patterns).
- PerimeterX (now HUMAN) — a client-side challenge that runs JavaScript to validate browser environments. If the challenge fails, you get a 403 or a CAPTCHA wall. PerimeterX is particularly aggressive on search and category pages.
The result: a plain requests.get() call from a datacenter IP will almost always receive a 403 or a CAPTCHA page. This is why a Walmart proxy strategy built on residential IPs is non-negotiable.
Why residential proxies — not datacenter
| Proxy Type | Walmart Success Rate | Speed | Best For |
|---|---|---|---|
| Datacenter | < 10% | Fast | Not recommended for Walmart |
| Residential (rotating) | 85–95% | Medium | Bulk product crawling, price monitoring |
| Residential (sticky session) | 90–98% | Medium | Login-protected pages, cart-level data |
| Mobile | 95%+ | Slower | Mobile-specific endpoints, app data |
Datacenter IPs are on Akamai's known-bot lists. Residential IPs blend in with real consumer traffic, making them far harder to fingerprint. For Walmart specifically, rotating residential proxies with per-request IP rotation give you the best balance of speed and stealth. Check ProxyHat pricing for residential proxy plans that fit crawl volumes from 1,000 to 10 million requests.
Parse the Hidden JSON: __NEXT_DATA__
Here's the insight that saves you hours: Walmart's product pages are Next.js apps, and Next.js hydrates its pages by embedding a JSON blob in a script tag:
```html
<script id="__NEXT_DATA__" type="application/json">
{ ... massive JSON payload ... }
</script>
```

This blob contains everything — price, inventory status, ratings, seller info, variant data, shipping options. Instead of parsing brittle CSS selectors, you extract this JSON and navigate it like a dictionary. It's the single most important trick for anyone who wants to scrape Walmart reliably.
What's inside __NEXT_DATA__
The payload is nested under props.pageProps.initialData.data. A truncated example:
```json
{
  "product": {
    "itemId": "553491704",
    "name": "Ozark Trail 4-Person Dome Tent",
    "priceInfo": {
      "currentPrice": {
        "price": 49.97,
        "currencyUnit": "USD"
      },
      "priceRangeString": "$49.97"
    },
    "availabilityStatus": "IN_STOCK",
    "rating": { "averageRating": 4.2, "numberOfReviews": 1847 },
    "sellerId": "0",
    "sellerName": "Walmart.com",
    "variantCategories": []
  }
}
```

Note the sellerId field — "0" means 1P (Walmart itself). Any other value is a 3P Marketplace seller. We'll dig into that distinction shortly.
Python: Fetch and Parse Walmart Product Data
Below is a production-ready snippet that fetches a Walmart product page through ProxyHat residential proxies, extracts __NEXT_DATA__, and returns structured fields.
```python
import requests
import json

PROXY_USER = "user-country-US"  # geo-target to US
PROXY_PASS = "your_password"
PROXY_URL = f"http://{PROXY_USER}:{PROXY_PASS}@gate.proxyhat.com:8080"

HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/125.0.0.0 Safari/537.36"
    ),
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
}

def fetch_product(item_id: str) -> str:
    url = f"https://www.walmart.com/ip/item/{item_id}"
    proxies = {"http": PROXY_URL, "https": PROXY_URL}
    resp = requests.get(url, headers=HEADERS, proxies=proxies, timeout=15)
    resp.raise_for_status()
    return resp.text

def extract_next_data_json(html: str) -> str:
    """Locate the raw __NEXT_DATA__ JSON string in Walmart HTML."""
    marker = '<script id="__NEXT_DATA__" type="application/json">'
    start = html.find(marker)
    if start == -1:
        raise ValueError("__NEXT_DATA__ not found — possible CAPTCHA page")
    start += len(marker)
    end = html.find("</script>", start)
    return html[start:end]

def parse_next_data(html: str) -> dict:
    """Extract structured product fields from __NEXT_DATA__."""
    payload = json.loads(extract_next_data_json(html))
    product = payload["props"]["pageProps"]["initialData"]["data"]["product"]
    return {
        "item_id": product["itemId"],
        "name": product["name"],
        "price": product["priceInfo"]["currentPrice"]["price"],
        "currency": product["priceInfo"]["currentPrice"]["currencyUnit"],
        "availability": product["availabilityStatus"],
        "rating": product["rating"]["averageRating"],
        "review_count": product["rating"]["numberOfReviews"],
        "seller_id": product.get("sellerId", "0"),
        "seller_name": product.get("sellerName", "Walmart.com"),
    }

# --- Usage ---
html = fetch_product("553491704")
data = parse_next_data(html)
print(json.dumps(data, indent=2))
```

A few notes on this code:
- The user-country-US flag in the ProxyHat username routes your request through a US residential IP — critical for Walmart, which serves different catalogs and prices to non-US visitors.
- If __NEXT_DATA__ is missing, you likely hit a CAPTCHA or block page. Retry with a fresh IP by omitting the session flag (per-request rotation is the default).
- For sticky sessions — say you need to maintain cookies across paginated requests — use user-session-abc123-country-US in the username string, as in the sketch below.
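For example, a sticky-session variant of fetch_product might look like this, reusing the HEADERS and PROXY_PASS defined earlier and embedding a caller-chosen session token in the username:

```python
def fetch_with_session(item_id: str, session_id: str) -> str:
    """Fetch a product page while pinning all requests to one residential IP."""
    # Session token embedded in the ProxyHat username, per the format above
    proxy_url = f"http://user-session-{session_id}-country-US:{PROXY_PASS}@gate.proxyhat.com:8080"
    proxies = {"http": proxy_url, "https": proxy_url}
    url = f"https://www.walmart.com/ip/item/{item_id}"
    resp = requests.get(url, headers=HEADERS, proxies=proxies, timeout=15)
    resp.raise_for_status()
    return resp.text

# Both requests exit through the same IP as long as the session token matches
for iid in ["553491704", "193640502"]:
    fetch_with_session(iid, session_id="abc123")
```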
Marketplace (3P) vs First-Party (1P) Catalog
Walmart Marketplace has grown to over 100,000 third-party sellers. When you scrape Walmart product data, you need to know whether you're looking at a 1P or 3P listing — the pricing dynamics, fulfillment, and competitive implications are entirely different.
How to identify 1P vs 3P in the data
- sellerId = "0" → Walmart 1P (sold and shipped by Walmart).
- sellerId ≠ "0" → 3P Marketplace seller. The sellerName field gives you the seller's display name.
- fulfillmentType → Look for "WFS" (Walmart Fulfillment Services) or "SELLER_FULFILLED". WFS items are stored and shipped by Walmart on behalf of the 3P seller.
For competitive intel, 3P data is often more valuable: it reveals MAP violations, seller proliferation, and fulfillment strategies. Make sure your pipeline tags every record with is_1p and seller_id so you can segment your analysis downstream.
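That tagging can happen at parse time. A minimal sketch that builds on the parse_next_data output and the html fetched in the earlier usage example:

```python
def tag_listing(record: dict) -> dict:
    """Annotate a parsed product record with 1P/3P segmentation fields."""
    record["is_1p"] = record["seller_id"] == "0"  # "0" = sold by Walmart itself
    return record

record = tag_listing(parse_next_data(html))
print(record["is_1p"], record["seller_name"])
```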
Extracting all offers for a product
A single product page may show multiple sellers under "Other sellers on Walmart.com." The __NEXT_DATA__ payload includes these in the offers array:
```python
offers = product.get("offers", [])
for offer in offers:
    print({
        "seller_id": offer.get("sellerId"),
        "seller_name": offer.get("sellerName"),
        "price": offer.get("priceInfo", {}).get("currentPrice", {}).get("price"),
        "fulfillment": offer.get("fulfillmentType"),
    })
```

This is gold for price monitoring: you can track every seller's price for a given SKU over time.
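To turn those offers into a time series, append each observation with a timestamp. A sketch using SQLite; the schema and filename are illustrative, not prescriptive:

```python
import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect("walmart_prices.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS price_history "
    "(item_id TEXT, seller_id TEXT, price REAL, observed_at TEXT)"
)

def record_offers(item_id: str, offers: list) -> None:
    """Append one timestamped row per seller offer for this SKU."""
    now = datetime.now(timezone.utc).isoformat()
    rows = [
        (item_id, o.get("sellerId"), o.get("priceInfo", {}).get("currentPrice", {}).get("price"), now)
        for o in offers
    ]
    conn.executemany("INSERT INTO price_history VALUES (?, ?, ?, ?)", rows)
    conn.commit()
```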
Rate-Limit-Aware Scheduling
Even with residential proxies, Walmart will rate-limit IPs that make too many requests. From extensive testing, here are the practical thresholds:
- Product pages: ~30 requests per minute per IP before you see soft blocks (CAPTCHA challenges).
- Search pages: ~15 requests per minute per IP — PerimeterX is more aggressive here.
- Category pages: ~20 requests per minute per IP.
With per-request IP rotation (the default on ProxyHat residential proxies), each request goes out from a different IP, so these per-IP limits don't constrain your overall throughput. But you should still pace your total request volume to avoid pattern detection across the proxy pool.
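If you do pin requests to a single IP with sticky sessions, you can turn those thresholds into per-request delays. A small helper, using the estimates above with some headroom:

```python
# Approximate per-IP request budgets (requests per minute), per the thresholds above
MAX_RPM = {"product": 30, "search": 15, "category": 20}

def min_delay(page_type: str, safety_factor: float = 1.5) -> float:
    """Seconds to wait between requests to one IP, staying below the soft-block limit."""
    return 60.0 / MAX_RPM[page_type] * safety_factor

print(min_delay("search"))  # 6.0 seconds between search requests per IP
```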
A rate-limited crawl scheduler
```python
import time
import random
from concurrent.futures import ThreadPoolExecutor, as_completed

BASE_DELAY = 2.0  # seconds between requests per thread
JITTER = 0.5      # randomize to avoid clockwork patterns

def crawl_item(item_id: str) -> dict:
    delay = BASE_DELAY + random.uniform(-JITTER, JITTER)
    time.sleep(delay)
    html = fetch_product(item_id)
    return parse_next_data(html)

item_ids = ["553491704", "193640502", "44981213", "843761921"]

# Use 3-5 concurrent threads; each request gets its own rotating IP
with ThreadPoolExecutor(max_workers=4) as pool:
    futures = {pool.submit(crawl_item, iid): iid for iid in item_ids}
    for future in as_completed(futures):
        try:
            result = future.result()
            print(f"Done {result['item_id']}: ${result['price']}")
        except Exception as e:
            print(f"Failed {futures[future]}: {e}")
```

For large-scale crawls (10,000+ SKUs), consider a job queue like Celery or Temporal with retry logic and dead-letter handling. Store raw HTML in object storage (S3, GCS) so you can re-parse without re-fetching if your extraction logic changes.
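Archiving raw HTML is a few lines with boto3. A sketch assuming an existing bucket named walmart-raw-html; the bucket name and key layout are illustrative:

```python
import boto3
from datetime import datetime, timezone

s3 = boto3.client("s3")

def archive_html(item_id: str, html: str) -> None:
    """Store the raw page so extraction logic can be re-run without re-crawling."""
    date = datetime.now(timezone.utc).strftime("%Y-%m-%d")
    key = f"walmart/{date}/{item_id}.html"  # illustrative key layout
    s3.put_object(Bucket="walmart-raw-html", Key=key, Body=html.encode("utf-8"))
```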
Search and Category Scraping
Product pages are straightforward, but search and category pages require extra care. The __NEXT_DATA__ on these pages contains a searchResult object with summary data for each item.
```python
def parse_search_page(html: str) -> list:
    payload = json.loads(extract_next_data_json(html))
    results = (
        payload["props"]["pageProps"]
        ["initialData"]["searchResult"]
        ["itemStacks"][0]["items"]
    )
    return [
        {
            "item_id": item["itemId"],
            "name": item["name"],
            "price": item["priceInfo"]["linePrice"],
            "availability": item.get("availabilityStatus"),
        }
        for item in results
    ]
```

Search pages are more heavily protected. If you're getting blocked, try SERP tracking proxies with mobile user agents — Walmart's mobile endpoints are sometimes less aggressive with bot detection.
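Swapping in a mobile user agent only requires changing the headers. A sketch reusing the HEADERS dict from earlier; the UA string below is a generic Chrome-on-Android identifier, not anything Walmart-specific:

```python
MOBILE_HEADERS = {
    **HEADERS,
    # Generic Chrome-on-Android user agent (illustrative)
    "User-Agent": (
        "Mozilla/5.0 (Linux; Android 14; Pixel 8) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/125.0.0.0 Mobile Safari/537.36"
    ),
}

resp = requests.get(
    "https://www.walmart.com/search?q=laptop",
    headers=MOBILE_HEADERS,
    proxies={"http": PROXY_URL, "https": PROXY_URL},
    timeout=15,
)
```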
Error Handling and Retry Strategy
Even with the best proxies, some requests will fail. Build your pipeline to handle these gracefully:
- HTTP 403: IP was flagged. Retry immediately with a new rotating IP (the default on ProxyHat).
- HTTP 429: Rate limit hit. Back off exponentially — start at 10 seconds, double each retry, cap at 5 minutes.
- CAPTCHA page: Detected as bot. The page HTML will contain "cf-captcha" or "px-captcha" markers. Discard the response, rotate IP, and retry after a delay.
- Missing __NEXT_DATA__: The page loaded but didn't render the JSON blob. This can happen on redirect pages or out-of-stock items that redirect to search. Check for a redirect and log the URL.
Here's a compact retry wrapper that implements this strategy:
```python
import time
import requests

def fetch_with_retry(item_id: str, max_retries: int = 3) -> dict:
    for attempt in range(max_retries):
        try:
            html = fetch_product(item_id)
            return parse_next_data(html)
        except ValueError as e:
            # CAPTCHA page or missing __NEXT_DATA__: rotate IP and retry
            if "CAPTCHA" in str(e) or "not found" in str(e):
                wait = 5 * (2 ** attempt)  # 5s, 10s, 20s
                print(f"Retry {attempt+1}/{max_retries} for {item_id}, waiting {wait}s")
                time.sleep(wait)
            else:
                raise
        except requests.HTTPError as e:
            if e.response.status_code == 429:
                # Rate limited: exponential backoff from 10s, capped at 5 minutes
                time.sleep(min(10 * (2 ** attempt), 300))
            elif e.response.status_code == 403:
                pass  # flagged IP: retry immediately, rotation supplies a fresh IP
            else:
                raise
    raise RuntimeError(f"Failed after {max_retries} retries: {item_id}")
```

Key Takeaways
1. Skip the API, scrape __NEXT_DATA__. Walmart's Affiliate API is throttled and limited. The Next.js JSON blob embedded in every product page gives you richer data with zero authentication overhead.
2. Residential proxies are mandatory. Akamai and PerimeterX block datacenter IPs outright. Use rotating residential proxies with US geo-targeting to blend in with real shoppers.
3. Key on itemId, not the slug. Slugs change; itemIds don't. Build your database around the numeric ID.
4. Tag 1P vs 3P on every record. The sellerId field tells you whether a listing is Walmart first-party or a Marketplace seller. This distinction is critical for competitive analysis.
5. Pace your requests. Even with rotating IPs, add random delays and limit concurrency to avoid behavioral fingerprinting.
6. Store raw HTML. Archive responses in object storage so you can re-parse without re-crawling when your extraction logic evolves.
Ready to start crawling? ProxyHat's residential proxy network covers every US state and 190+ countries — exactly the geographic diversity you need for Walmart at scale. See available locations or get started with a plan.






