How to Scrape Shopify Stores with Proxies: A Complete Guide

Learn how to scrape product data, pricing, and inventory status from Shopify stores using proxies. Covers Shopify's page structure, API endpoints, proxy strategies, and anti-detection techniques.

Why Scrape Shopify Stores?

Shopify powers over 4 million online stores worldwide, from small independent brands to major retailers. This makes it one of the richest sources of e-commerce intelligence. By scraping Shopify stores, you can track competitor pricing, monitor product launches, analyze market trends, and build comprehensive product databases.

The good news is that Shopify has a predictable structure that makes scraping more systematic than most e-commerce platforms. Every Shopify store exposes certain data through standardized endpoints, which means a single scraper architecture can work across thousands of different stores. For a broader overview of e-commerce scraping strategies, see our e-commerce data scraping guide.

Understanding Shopify's Store Structure

Every Shopify store follows the same URL and data patterns, regardless of the theme or customization.

Public JSON Endpoints

Shopify exposes product data through JSON endpoints that do not require authentication. These are the most efficient way to scrape Shopify stores because you get structured data without HTML parsing.

  • /products.json: all products with variants, prices, and images (paginate with ?page=N&limit=250)
  • /products/{handle}.json: a single product's detail (no pagination)
  • /collections.json: all collections (paginate with ?page=N)
  • /collections/{handle}/products.json: products in a collection (paginate with ?page=N&limit=250)
  • /meta.json: store metadata such as name and description (no pagination)

Product Data Structure

Each product object from the JSON API includes:

  • Basic info: title, handle (slug), body_html (description), vendor, product_type, tags
  • Variants: Each variant has its own price, compare_at_price, SKU, inventory status, and option values (size, color, etc.)
  • Images: URLs for all product images with alt text
  • Dates: created_at, updated_at, published_at
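To see how these fields fit together, here is a hedged sketch that flattens one raw product object into per-variant rows. The field names match the JSON structure above; the `flatten_variants` helper itself is illustrative:

```python
def flatten_variants(product: dict) -> list[dict]:
    """Turn one product JSON object into one row per variant."""
    rows = []
    for v in product.get("variants", []):
        rows.append({
            "product_id": product["id"],
            "title": product["title"],
            "vendor": product.get("vendor", ""),
            "variant_title": v.get("title", ""),
            "sku": v.get("sku"),
            # Prices arrive as strings in the JSON; convert for analysis
            "price": float(v["price"]) if v.get("price") else None,
            "compare_at_price": (float(v["compare_at_price"])
                                 if v.get("compare_at_price") else None),
            "available": v.get("available", False),
        })
    return rows
```

Variant-level rows like these are usually what you want for price tracking, since each size or color combination carries its own price and stock status.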

Rate Limiting

Shopify applies rate limits to protect store performance. The public JSON endpoints typically allow 2-4 requests per second per IP before throttling kicks in. This is where residential proxies become essential — spreading requests across multiple IPs lets you maintain throughput without hitting rate limits on any single IP.
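When a request is throttled anyway (an HTTP 429 response), waiting with exponential backoff plus jitter before retrying is a common mitigation. A minimal sketch; the base and cap values here are illustrative choices, not documented Shopify thresholds:

```python
import random

def backoff_delay(attempt: int, base: float = 1.0, cap: float = 30.0) -> float:
    """Exponential backoff with full jitter.

    Returns a random delay in [0, min(cap, base * 2**attempt)], so retry
    storms from many workers do not synchronize on the same schedule.
    """
    return random.uniform(0, min(cap, base * (2 ** attempt)))
```

In a retry loop you would sleep for `backoff_delay(attempt)` after each 429 before re-issuing the request, optionally rotating to a fresh proxy IP at the same time.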

Proxy Configuration for Shopify

Shopify's rate limiting is IP-based, making proxy rotation the primary strategy for scraping at scale.

ProxyHat Setup

# Rotating residential proxy (new IP per request)
http://USERNAME:PASSWORD@gate.proxyhat.com:8080
# Geo-targeted for region-specific stores
http://USERNAME-country-US:PASSWORD@gate.proxyhat.com:8080
# Sticky session for paginated scraping of one store
http://USERNAME-session-shopify001:PASSWORD@gate.proxyhat.com:8080

For Shopify scraping, use per-request rotation when scraping different stores, and sticky sessions when paginating through a single store's product catalog. This pattern mimics natural browsing behavior.
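The sticky-session username format shown in the setup block can be generated per store, so each catalog crawl keeps its own IP. A sketch assuming the `USERNAME-session-<label>` convention above; the label derivation and the `sticky_proxy_url` helper are illustrative:

```python
def sticky_proxy_url(username: str, password: str, store_domain: str,
                     gateway: str = "gate.proxyhat.com:8080") -> str:
    """Build a sticky-session proxy URL keyed on the store's domain."""
    # Derive a short alphanumeric session label from the domain
    label = store_domain.split(".")[0].replace("-", "")[:16]
    return f"http://{username}-session-{label}:{password}@{gateway}"
```

Pass the resulting URL into your HTTP client's proxy settings in place of the rotating gateway URL whenever you paginate through one store.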

Python Implementation

Here is a Shopify scraper built on the requests library, with traffic routed through ProxyHat's proxy gateway.

JSON API Scraper

import requests
import json
import time
import random
from dataclasses import dataclass
PROXY_URL = "http://USERNAME:PASSWORD@gate.proxyhat.com:8080"
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 Chrome/124.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 Chrome/124.0.0.0 Safari/537.36",
]
@dataclass
class ShopifyProduct:
    id: int
    title: str
    handle: str
    vendor: str
    product_type: str
    tags: list[str]
    variants: list[dict]
    images: list[str]
    min_price: float
    max_price: float
    created_at: str
    updated_at: str
def get_session(store_domain: str) -> requests.Session:
    """Create a session with proxy and headers configured.

    When paginating a single store, swap PROXY_URL for a sticky-session
    proxy URL keyed on store_domain (see the ProxyHat setup above) so
    every page of that store's catalog comes from the same IP.
    """
    session = requests.Session()
    session.proxies = {"http": PROXY_URL, "https": PROXY_URL}
    session.headers.update({
        "User-Agent": random.choice(USER_AGENTS),
        "Accept": "application/json",
        "Accept-Language": "en-US,en;q=0.9",
    })
    return session
def scrape_all_products(store_domain: str) -> list[ShopifyProduct]:
    """Scrape all products from a Shopify store via JSON API."""
    products = []
    page = 1
    session = get_session(store_domain)
    while True:
        url = f"https://{store_domain}/products.json?page={page}&limit=250"
        try:
            response = session.get(url, timeout=30)
            response.raise_for_status()
        except requests.RequestException as e:
            print(f"Error on page {page}: {e}")
            break
        try:
            data = response.json()
        except ValueError:
            # Password-protected stores return an HTML page instead of JSON
            print(f"Non-JSON response from {store_domain}, stopping")
            break
        page_products = data.get("products", [])
        if not page_products:
            break
        for p in page_products:
            prices = [float(v["price"]) for v in p.get("variants", [])
                      if v.get("price")]
            product = ShopifyProduct(
                id=p["id"],
                title=p["title"],
                handle=p["handle"],
                vendor=p.get("vendor", ""),
                product_type=p.get("product_type", ""),
                tags=p.get("tags", "").split(", ") if p.get("tags") else [],
                variants=[{
                    "id": v["id"],
                    "title": v["title"],
                    "price": v["price"],
                    "compare_at_price": v.get("compare_at_price"),
                    "sku": v.get("sku"),
                    "available": v.get("available", False),
                } for v in p.get("variants", [])],
                images=[img["src"] for img in p.get("images", [])],
                min_price=min(prices) if prices else 0,
                max_price=max(prices) if prices else 0,
                created_at=p.get("created_at", ""),
                updated_at=p.get("updated_at", ""),
            )
            products.append(product)
        print(f"Page {page}: {len(page_products)} products (total: {len(products)})")
        page += 1
        time.sleep(random.uniform(1, 3))
    return products
def scrape_collections(store_domain: str) -> list[dict]:
    """Scrape all collections from a Shopify store."""
    collections = []
    page = 1
    session = get_session(store_domain)
    while True:
        url = f"https://{store_domain}/collections.json?page={page}"
        try:
            response = session.get(url, timeout=30)
            response.raise_for_status()
        except requests.RequestException:
            break
        data = response.json()
        page_collections = data.get("collections", [])
        if not page_collections:
            break
        collections.extend(page_collections)
        page += 1
        time.sleep(random.uniform(1, 2))
    return collections
# Example: Scrape multiple Shopify stores
if __name__ == "__main__":
    stores = [
        "example-store-1.myshopify.com",
        "example-store-2.com",
        "example-store-3.com",
    ]
    for store in stores:
        print(f"\nScraping: {store}")
        products = scrape_all_products(store)
        print(f"Found {len(products)} products")
        # Save to JSON
        with open(f"{store.replace('.', '_')}_products.json", "w") as f:
            json.dump([vars(p) for p in products], f, indent=2)
        time.sleep(random.uniform(3, 7))

Monitoring Price Changes Across Stores

def compare_prices(store_domain: str, previous_data: dict) -> list[dict]:
    """Compare current prices with previously stored data."""
    changes = []
    products = scrape_all_products(store_domain)
    for product in products:
        prev = previous_data.get(product.handle)
        if not prev:
            changes.append({
                "type": "new_product",
                "handle": product.handle,
                "title": product.title,
                "price": product.min_price,
            })
            continue
        if product.min_price != prev.get("min_price"):
            changes.append({
                "type": "price_change",
                "handle": product.handle,
                "title": product.title,
                "old_price": prev["min_price"],
                "new_price": product.min_price,
                "change_pct": ((product.min_price - prev["min_price"])
                               / prev["min_price"] * 100)
                              if prev["min_price"] else 0,
            })
    return changes
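To produce the `previous_data` mapping this function expects, index a scraped product list by handle before persisting it. A small sketch; `build_snapshot` is an illustrative helper that works with any object exposing `handle`, `min_price`, and `title` attributes, such as the ShopifyProduct dataclass above:

```python
def build_snapshot(products: list) -> dict:
    """Index products by handle with just the fields compare_prices reads."""
    return {
        p.handle: {"min_price": p.min_price, "title": p.title}
        for p in products
    }
```

Serialize the snapshot to JSON after each run, then load it and pass it as `previous_data` on the next run to get the change list.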

Node.js Implementation

A Node.js version using axios and https-proxy-agent to route requests through ProxyHat.

const axios = require("axios");
const { HttpsProxyAgent } = require("https-proxy-agent");
const fs = require("fs");
const PROXY_URL = "http://USERNAME:PASSWORD@gate.proxyhat.com:8080";
const agent = new HttpsProxyAgent(PROXY_URL);
async function scrapeShopifyProducts(storeDomain) {
  const products = [];
  let page = 1;
  while (true) {
    const url = `https://${storeDomain}/products.json?page=${page}&limit=250`;
    try {
      const { data } = await axios.get(url, {
        httpsAgent: agent,
        headers: {
          "User-Agent":
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 Chrome/124.0.0.0 Safari/537.36",
          Accept: "application/json",
        },
        timeout: 30000,
      });
      const pageProducts = data.products || [];
      if (pageProducts.length === 0) break;
      for (const p of pageProducts) {
        const prices = (p.variants || [])
          .map((v) => parseFloat(v.price))
          .filter((n) => !Number.isNaN(n));
        products.push({
          id: p.id,
          title: p.title,
          handle: p.handle,
          vendor: p.vendor,
          productType: p.product_type,
          tags: p.tags ? p.tags.split(", ") : [],
          minPrice: prices.length ? Math.min(...prices) : 0,
          maxPrice: prices.length ? Math.max(...prices) : 0,
          variants: p.variants.map((v) => ({
            id: v.id,
            title: v.title,
            price: v.price,
            compareAtPrice: v.compare_at_price,
            sku: v.sku,
            available: v.available,
          })),
          images: p.images.map((img) => img.src),
          updatedAt: p.updated_at,
        });
      }
      console.log(`Page ${page}: ${pageProducts.length} products (total: ${products.length})`);
      page++;
      // Random delay 1-3 seconds
      await new Promise((r) => setTimeout(r, 1000 + Math.random() * 2000));
    } catch (err) {
      console.error(`Error on page ${page}: ${err.message}`);
      break;
    }
  }
  return products;
}
async function scrapeMultipleStores(stores) {
  const results = {};
  for (const store of stores) {
    console.log(`\nScraping: ${store}`);
    const products = await scrapeShopifyProducts(store);
    results[store] = products;
    console.log(`Found ${products.length} products`);
    // Delay between stores
    await new Promise((r) => setTimeout(r, 3000 + Math.random() * 4000));
  }
  return results;
}
// Usage
scrapeMultipleStores([
  "example-store-1.myshopify.com",
  "example-store-2.com",
]).then((results) => {
  fs.writeFileSync("shopify_data.json", JSON.stringify(results, null, 2));
  console.log("Data saved to shopify_data.json");
});

Shopify-Specific Scraping Strategies

Discovering Shopify Stores

Before scraping, you need to identify which competitor sites run on Shopify. Common indicators include:

  • The /products.json endpoint returns valid JSON
  • HTML source contains Shopify.theme or cdn.shopify.com
  • The x-shopify-stage header is present in responses
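These indicators can be checked programmatically. Here is a hedged sketch that classifies an already-fetched page; you supply the HTML body and response headers, and the `looks_like_shopify` function name is illustrative:

```python
def looks_like_shopify(html: str, headers: dict[str, str]) -> bool:
    """Heuristic Shopify detection based on the indicators above."""
    html_markers = ("Shopify.theme", "cdn.shopify.com")
    if any(marker in html for marker in html_markers):
        return True
    # HTTP header names are case-insensitive, so normalize before matching
    return any(k.lower() == "x-shopify-stage" for k in headers)
```

In a discovery pipeline you would fetch each candidate domain's homepage once, run this check, and only enqueue domains that pass for full catalog scraping.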

Handling Password-Protected Stores

Some Shopify stores require a password to access. These are typically pre-launch or wholesale stores. The JSON endpoints will return a redirect to the password page. Skip these stores in your scraping pipeline unless you have authorized access.

Dealing with Custom Domains

Shopify stores often use custom domains instead of .myshopify.com. The JSON API works the same way on custom domains. Just use the store's public-facing domain in your requests.

Inventory Tracking

Product variants include an available field that indicates stock status. By tracking this field over time, you can monitor competitor inventory levels and identify when products go out of stock — useful intelligence for pricing and restocking decisions.
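As a sketch, tracking that field comes down to diffing two snapshots of variant availability. The snapshot shape assumed here, a simple {variant_id: available} mapping, and the `availability_changes` helper are both illustrative:

```python
def availability_changes(previous: dict[int, bool],
                         current: dict[int, bool]) -> list[dict]:
    """Report variants whose stock status flipped between two snapshots."""
    changes = []
    for variant_id, available in current.items():
        prev = previous.get(variant_id)
        # Only report variants seen in both snapshots whose status changed
        if prev is not None and prev != available:
            changes.append({
                "variant_id": variant_id,
                "event": "restocked" if available else "out_of_stock",
            })
    return changes
```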

Avoiding Blocks and Rate Limits

While Shopify is more scraper-friendly than Amazon, it still enforces protections.

  • IP rate limiting: roughly 2-4 requests per second per IP on the JSON endpoints. Mitigation: rotate residential proxies across requests.
  • Cloudflare protection: some stores sit behind Cloudflare. Mitigation: use residential IPs with browser-like headers.
  • Bot detection: behavioral patterns are monitored. Mitigation: randomize delays and User-Agents.
  • Password pages: pre-launch and wholesale stores are locked. Mitigation: skip them, or scrape only with authorized access.

For more on handling anti-bot systems, read our guide on how to scrape websites without getting blocked.

Key takeaway: Shopify's JSON API is the most efficient scraping approach — it gives you structured data without HTML parsing. Use it before falling back to HTML scraping.

Data Use Cases

Once you have collected Shopify product data, here are the most valuable applications:

  • Competitive pricing: Track competitor prices across product categories and adjust your pricing strategy in real time.
  • Product research: Identify trending products, new launches, and market gaps by monitoring multiple stores.
  • Market analysis: Aggregate data across hundreds of Shopify stores to understand market trends, pricing distribution, and category growth.
  • Catalog enrichment: Use competitor product descriptions, images, and specifications to improve your own listings.
  • Brand monitoring: Track unauthorized sellers of your products and monitor MAP compliance across Shopify storefronts.

Key Takeaways

  • Shopify's /products.json endpoint is the most efficient scraping method — use it before HTML parsing.
  • A single scraper architecture works across all Shopify stores due to the standardized structure.
  • Residential proxies with rotation overcome Shopify's IP-based rate limiting.
  • Use sticky sessions when paginating through a single store's catalog.
  • Track variant-level pricing and availability for comprehensive competitive intelligence.
  • Start with ProxyHat's residential proxies to scale your Shopify scraping reliably.

Ready to start scraping Shopify stores? Explore our e-commerce data scraping guide for the full strategy, and check our Python proxy guide and Node.js proxy guide for implementation details. Visit our pricing page to get started.
