Building a Google Rank Tracker in Python with Residential Proxies: A Complete Code Guide

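This guide is for SEO engineers and Python developers. It shows how to build a production-grade Google rank tracker with curl_cffi and ProxyHat residential proxies, covering the data model, paginated SERP scraping, TLS fingerprint handling, retry strategy, and concurrency control.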


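If you need to track keyword rankings at scale, querying Google by hand is neither reliable nor scalable. This article walks through building a Google rank tracker in Python on top of residential proxies, from the data model to production hardening, with code for each step.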

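Why Residential Proxies Matter for a Python Google Rank Tracker

Google's search results pages (SERPs) vary with the user's IP address, device, language, and location. If you query the same keywords repeatedly from a single datacenter IP, Google's anti-bot systems typically respond with a CAPTCHA or an empty result page after a few dozen requests. Google's Search Central documentation states that automated scraping is restricted; the detection signals commonly attributed to Google include TLS fingerprinting (JA3/JA4), IP reputation scoring, and behavioral analysis.

Residential proxies send requests through IP addresses assigned by real ISPs, so the traffic looks like it comes from ordinary home users. Combined with city-level geotargeting and sticky sessions, this lets you simulate real searches from different regions and substantially lowers the chance of being blocked.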

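Data Model Design: Why Daily SERP Snapshots Beat One-Off Checks

Rankings are dynamic: the same keyword can rank differently in the morning and the afternoon of the same day. A one-off check gives you a single point in time, whereas scheduled daily scrapes reveal ranking trends, volatility patterns, and the impact of algorithm updates.

The core data model is:

Keyword (keyword) | Target domain (target_domain) | Country (country)
Device (device) | Rank position (position) | Capture time (captured_at)

Storing the history in SQLite supports trend analysis without running a separate database service. The code below creates the table and inserts rows: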

import sqlite3
from datetime import datetime

def init_db(db_path="rank_tracker.db"):
    conn = sqlite3.connect(db_path)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS rankings (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            keyword TEXT NOT NULL,
            target_domain TEXT NOT NULL,
            country TEXT NOT NULL DEFAULT 'US',
            device TEXT NOT NULL DEFAULT 'desktop',
            position INTEGER,
            result_url TEXT,
            captured_at TEXT NOT NULL,
            UNIQUE(keyword, target_domain, country, device, captured_at)
        )
    """)
    conn.execute(
        "CREATE INDEX IF NOT EXISTS idx_keyword_date "
        "ON rankings(keyword, captured_at)"
    )
    conn.commit()
    return conn

def save_ranking(conn, keyword, target_domain, country, device, position, result_url):
    conn.execute(
        """INSERT OR REPLACE INTO rankings
           (keyword, target_domain, country, device, position, result_url, captured_at)
           VALUES (?, ?, ?, ?, ?, ?, ?)""",
        (keyword, target_domain, country, device, position, result_url, datetime.utcnow().isoformat())
    )
    conn.commit()
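
To show how the stored snapshots turn into a trend, here is a minimal sketch of a per-day query. The function name daily_best_positions is hypothetical, the demo table is a simplified in-memory copy of the rankings schema above, and taking the best (lowest) position per calendar day is just one reasonable aggregation choice.

```python
import sqlite3

def daily_best_positions(conn, keyword, target_domain, country="US", device="desktop"):
    """Return (day, best_position) rows, one per calendar day, oldest first."""
    return conn.execute(
        """SELECT substr(captured_at, 1, 10) AS day, MIN(position) AS best
           FROM rankings
           WHERE keyword = ? AND target_domain = ? AND country = ? AND device = ?
           GROUP BY day
           ORDER BY day""",
        (keyword, target_domain, country, device),
    ).fetchall()

# Demo on an in-memory database with made-up sample rows
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE rankings (keyword TEXT, target_domain TEXT, country TEXT, "
    "device TEXT, position INTEGER, result_url TEXT, captured_at TEXT)"
)
sample = [
    ("best running shoes", "example.com", "US", "desktop", 7, "", "2025-01-01T08:00:00"),
    ("best running shoes", "example.com", "US", "desktop", 5, "", "2025-01-01T20:00:00"),
    ("best running shoes", "example.com", "US", "desktop", 6, "", "2025-01-02T08:00:00"),
]
conn.executemany("INSERT INTO rankings VALUES (?, ?, ?, ?, ?, ?, ?)", sample)
trend = daily_best_positions(conn, "best running shoes", "example.com")
print(trend)
```

Because captured_at is an ISO timestamp, its first 10 characters are the date, which is what the query groups on.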

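SERP Pagination Strategy After Google Removed num=100

In September 2025, Google stopped honoring the num=100 parameter, so a single request can no longer return 100 results. Each page now returns at most 10 organic results, and you reach the top 100 by paging with start=0, 10, 20, ..., 90.

When paginating, keep the following in mind:

  • Add a random delay (2-5 seconds) between pages to avoid triggering rate limits
  • Skip ads (marked "Sponsored") and SERP features (knowledge panels, People Also Ask, and so on)
  • Parse only the organic results inside div.g
  • Record how many results each page actually returns, and stop early if a page returns 0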
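
The pagination rules above can be sketched independently of any HTTP client. In this minimal sketch, collect_pages and its fetch_page callable are hypothetical names: fetch_page(start) is assumed to return the list of organic URLs for one offset, and the sleep function is injectable so the loop can be exercised without waiting.

```python
import random
import time

def collect_pages(fetch_page, max_results=100, page_size=10,
                  min_delay=2.0, max_delay=5.0, sleep=time.sleep):
    """Walk start=0, 10, 20, ... and stop early when a page comes back empty."""
    urls = []
    for start in range(0, max_results, page_size):
        page = fetch_page(start)
        if not page:
            break  # an empty page means there is nothing further to fetch
        urls.extend(page)
        if start + page_size < max_results:
            # Random pause between pages, skipped after the final page
            sleep(random.uniform(min_delay, max_delay))
    return urls[:max_results]

# Demo with a fake 25-result SERP and a sleep stub that records the delays
fake_results = [f"https://site{i}.example/" for i in range(25)]
delays = []
collected = collect_pages(
    lambda start: fake_results[start:start + 10],
    sleep=delays.append,
)
print(len(collected), len(delays))
```

Passing the fetcher in as a callable keeps the offset and early-stop logic easy to check separately from proxy and parsing concerns.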

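Why Residential Proxies Are Necessary: TLS Fingerprints and IP Reputation

Google uses several techniques to detect automated traffic:

Detection dimension | Datacenter proxy | Residential proxy
TLS/JA3-JA4 fingerprint | Needs curl_cffi impersonation | Needs curl_cffi impersonation
IP reputation score | Low: easily flagged | High: assigned by real ISPs
Geotargeting precision | Usually country level only | City level (for example, Chicago)
Session persistence | Limited support | Sticky sessions supported
Likelihood of blocking | High (roughly 50% or more of requests may be blocked) | Low (success rates above 99%)

Even with residential proxies, you still need curl_cffi's impersonate='chrome' parameter to mimic a real browser's TLS fingerprint. According to the curl_cffi project documentation, the library supports JA3/JA4 fingerprint impersonation through a patched build of libcurl, which the standard requests library cannot do.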

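Hands-On: Scraping SERPs with curl_cffi and ProxyHat Residential Proxies

Basic configuration

ProxyHat's residential proxies are reached through the gate.proxyhat.com:8080 gateway, and the username can encode the country, city, and session ID. A Python example follows: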

from curl_cffi import requests as cffi_requests
import hashlib

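def build_proxy_url(keyword, country="US", city="chicago"):
    """Derive a sticky session ID from the keyword so the same keyword reuses the same exit IP."""
    session_id = hashlib.md5(keyword.encode()).hexdigest()[:12]
    # Only add the city segment when a city is given (city=None means country-level targeting)
    city_part = f"-city-{city}" if city else ""
    username = f"user-country-{country}{city_part}-session-{session_id}"
    password = "YOUR_PASSWORD"
    return f"http://{username}:{password}@gate.proxyhat.com:8080"

# Example of a single SERP request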
def fetch_serp(keyword, country="US", city="chicago", start=0):
    proxy = build_proxy_url(keyword, country, city)
    url = "https://www.google.com/search"
    params = {
        "q": keyword,
        "num": 10,
        "start": start,
        "hl": "en",
        "gl": country.lower(),
    }
    headers = {
        "User-Agent": (
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
            "AppleWebKit/537.36 (KHTML, like Gecko) "
            "Chrome/131.0.0.0 Safari/537.36"
        ),
        "Accept-Language": "en-US,en;q=0.9",
    }
    resp = cffi_requests.get(
        url, params=params, headers=headers,
        proxies={"http": proxy, "https": proxy},
        impersonate="chrome",
        timeout=30,
    )
    resp.raise_for_status()
    return resp.text
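
Hardcoding the password works for a demo, but credentials usually live outside the source. The sketch below is one way to do that, assuming the same username layout as build_proxy_url above. The environment variable names PROXYHAT_USER and PROXYHAT_PASS and the function name are hypothetical, and the percent-encoding is there so characters such as @ or : in a password do not break the proxy URL.

```python
import hashlib
import os
from urllib.parse import quote

def build_proxy_url_from_env(keyword, country="US", city=None):
    """Build the gateway URL from environment variables (hypothetical names)."""
    user = os.environ.get("PROXYHAT_USER", "user")
    password = os.environ.get("PROXYHAT_PASS", "YOUR_PASSWORD")
    session_id = hashlib.md5(keyword.encode()).hexdigest()[:12]

    parts = [user, f"country-{country}"]
    if city:
        parts.append(f"city-{city}")
    parts.append(f"session-{session_id}")
    username = "-".join(parts)

    # Percent-encode both halves so reserved characters stay inside the userinfo part
    return (
        f"http://{quote(username, safe='')}:{quote(password, safe='')}"
        "@gate.proxyhat.com:8080"
    )
```

The returned string can be passed as the value for both the "http" and "https" keys of the proxies dictionary, as in fetch_serp.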

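Using the ProxyHat SDK to Simplify Proxy Management

If you need a more flexible IP rotation strategy, you can manage a session pool alongside the ProxyHat SDK. The class below is a thin wrapper that only builds gateway usernames; it does not import the SDK itself: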

import hashlib
import itertools
from curl_cffi import requests as cffi_requests

class ProxyHatSERPClient:
    def __init__(self, username, password, countries=None):
        self.base_user = username
        self.password = password
        self.countries = countries or ["US", "DE", "GB"]
        self._country_cycle = itertools.cycle(self.countries)

    def get_proxy(self, keyword, country=None, city=None, sticky=True):
        # Fall back to round-robin over the configured countries
        c = country or next(self._country_cycle)
        # Prefix with the account username passed to __init__
        parts = [f"{self.base_user}-country-{c}"]
        if city:
            parts.append(f"city-{city}")
        if sticky:
            # Same keyword -> same session ID -> same exit IP
            sid = hashlib.md5(keyword.encode()).hexdigest()[:12]
            parts.append(f"session-{sid}")
        username = "-".join(parts)
        return {
            "http": f"http://{username}:{self.password}@gate.proxyhat.com:8080",
            "https": f"http://{username}:{self.password}@gate.proxyhat.com:8080",
        }

    def search(self, keyword, start=0, country=None, city=None):
        proxy = self.get_proxy(keyword, country, city)
        resp = cffi_requests.get(
            "https://www.google.com/search",
            params={"q": keyword, "num": 10, "start": start, "hl": "en"},
            proxies=proxy,
            impersonate="chrome",
            timeout=30,
        )
        resp.raise_for_status()
        return resp.text

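Parsing Rank Positions: Combining CSS Selectors and Regex

Google's HTML structure changes often, so you cannot rely on a single selector. Parse with selectolax or BeautifulSoup, and keep a regex as a fallback: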

from selectolax.parser import HTMLParser
from urllib.parse import urlparse
import re

def parse_organic_results(html, target_domain):
    """Parse organic results from a SERP and return the target domain's position."""
    tree = HTMLParser(html)
    results = []
    position = 0

    # Primary selector: div.g is Google's organic result container
    for node in tree.css("div.g"):
        link = node.css_first("a[href]")
        if not link:
            continue
        href = link.attributes.get("href", "")
        if not href.startswith("http"):
            continue

        # Skip ads
        parent_text = node.text(separator=" ").lower()
        if "sponsored" in parent_text:
            continue

        position += 1
        host = urlparse(href).netloc
        results.append({"position": position, "url": href, "host": host})

    # Look for the target domain
    target_rank = None
    target_url = None
    for r in results:
        if target_domain in r["host"]:
            target_rank = r["position"]
            target_url = r["url"]
            break

    # Regex fallback (for when the CSS selector stops matching)
    if target_rank is None:
        domain_re = re.escape(target_domain)
        # Built from two raw strings so the quote characters in the class are legal
        pattern = re.compile(r'href="(https?://[^"\']*' + domain_re + r'[^"\']*)"', re.I)
        match = pattern.search(html)
        if match:
            # Without result containers the true slot is unknown, so this only
            # signals "found on this page" and reports position 1 as a placeholder
            target_rank = 1
            target_url = match.group(1)

    return target_rank, target_url, results
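
A plain substring test on the host can produce false positives: "example.com" is also a substring of "notexample.com". A stricter comparison might look like the sketch below. The helper name host_matches is hypothetical, and it is an alternative to the check used in the parser above, not part of it.

```python
from urllib.parse import urlparse

def host_matches(url, target_domain):
    """True if the URL's host is the target domain or one of its subdomains."""
    host = (urlparse(url).hostname or "").lower()
    target = target_domain.lower().lstrip(".")
    if target.startswith("www."):
        target = target[4:]
    # Exact match, or a suffix match on a dot boundary
    return host == target or host.endswith("." + target)

print(host_matches("https://www.example.com/page", "example.com"))
print(host_matches("https://notexample.com/", "example.com"))
```

Matching on a dot boundary keeps subdomains such as blog.example.com while rejecting look-alike hosts.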

Paginating Through the Top 100 Results

import time
import random

def track_keyword(conn, keyword, target_domain, country="US", city="chicago", max_pages=10):
    """Page through the top 100 results and record the target domain's rank."""
    rank_found = None
    found_url = None

    for page in range(max_pages):
        start = page * 10
        try:
            html = fetch_serp(keyword, country, city, start=start)
        except Exception as e:
            print(f"[ERROR] Page {page} failed: {e}")
            continue

        rank, url, results = parse_organic_results(html, target_domain)

        # If the target is on this page, record it and stop
        if rank is not None:
            # Add the offset of the preceding pages (assumes 10 organic results per page)
            rank_found = rank + start
            found_url = url
            break

        # Stop early if this page has no organic results
        if len(results) == 0:
            print(f"[INFO] No results on page {page}, stopping pagination")
            break

        # Random 2-5 second delay
        time.sleep(random.uniform(2, 5))

    save_ranking(
        conn, keyword, target_domain, country, "desktop",
        rank_found or 101,  # 101 means not found
        found_url or "",
    )
    return rank_found

Production Hardening: Retries, CAPTCHA Detection, and Concurrency Control

Exponential backoff retries

import logging
from tenacity import retry, stop_after_attempt, wait_exponential, retry_if_exception_type

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

@retry(
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=2, max=30),
    # curl_cffi raises its own RequestsError rather than the built-in
    # ConnectionError, so include it or network failures are never retried
    retry=retry_if_exception_type((ConnectionError, TimeoutError, cffi_requests.RequestsError)),
    before_sleep=lambda retry_state: logger.info(
        f"Retrying in {retry_state.next_action.sleep:.0f}s (attempt {retry_state.attempt_number})"
    ),
)
def fetch_serp_with_retry(keyword, country="US", city="chicago", start=0):
    html = fetch_serp(keyword, country, city, start=start)
    # CAPTCHA detection
    captcha_markers = ["unusual traffic", "captcha", "detected unusual traffic"]
    lower_html = html.lower()
    if any(marker in lower_html for marker in captcha_markers):
        # Not retried: repeating the request on the same sticky session rarely helps
        raise RuntimeError("CAPTCHA detected; switching session recommended")
    return html
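
The retry wrapper recommends switching sessions on a CAPTCHA but does not do it. One way to act on that is to salt the session hash with an attempt counter, as in the sketch below. Both function names are hypothetical, and fetch is assumed to be any callable taking (keyword, session_id) that raises RuntimeError when it detects a CAPTCHA. The resulting IDs differ from the plain md5(keyword) IDs used earlier, so treat this as an alternative scheme.

```python
import hashlib

def session_id_for(keyword, attempt=0):
    """Derive a session ID that changes whenever the attempt number changes."""
    seed = f"{keyword}|{attempt}"
    return hashlib.md5(seed.encode()).hexdigest()[:12]

def fetch_with_session_rotation(fetch, keyword, max_sessions=3):
    """Try up to max_sessions different sticky sessions for one keyword."""
    last_error = None
    for attempt in range(max_sessions):
        sid = session_id_for(keyword, attempt)
        try:
            return fetch(keyword, sid)
        except RuntimeError as exc:  # raised by the caller's CAPTCHA check
            last_error = exc
    raise RuntimeError(f"All {max_sessions} sessions were blocked") from last_error

# Demo: the first session hits a CAPTCHA, the second one succeeds
used = []

def fake_fetch(keyword, session_id):
    used.append(session_id)
    if len(used) == 1:
        raise RuntimeError("CAPTCHA detected")
    return "<html>ok</html>"

html = fetch_with_session_rotation(fake_fetch, "best running shoes")
print(len(used))
```

Keeping the ID deterministic per (keyword, attempt) pair means a rerun reuses the same sessions instead of opening new ones on every run.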

Concurrency Control and Per-Country Proxy Pools

import asyncio
import csv
from asyncio import Semaphore

def _track_in_thread(keyword, target_domain, country, city):
    """Run one blocking track_keyword call with its own SQLite connection
    (sqlite3 connections are not shareable across threads by default)."""
    conn = init_db()
    try:
        return track_keyword(conn, keyword, target_domain, country, city)
    finally:
        conn.close()

async def track_batch(keywords, target_domain, countries, max_concurrent=5):
    """Track keywords concurrently, with a separate concurrency limit per country."""
    sems = {country: Semaphore(max_concurrent) for country in countries}

    async def track_one(keyword, country, city):
        async with sems[country]:
            try:
                # track_keyword blocks, so run it in a worker thread
                rank = await asyncio.to_thread(
                    _track_in_thread, keyword, target_domain, country, city
                )
                logger.info(f"{keyword} | {country} → position {rank}")
                return rank
            except Exception as e:
                logger.error(f"Failed: {keyword} ({country}): {e}")
                return None

    tasks = []
    for kw in keywords:
        for country in countries:
            city = "chicago" if country == "US" else None
            tasks.append(track_one(kw, country, city))

    await asyncio.gather(*tasks)

# Example run
if __name__ == "__main__":
    keywords = ["best running shoes", "marathon training plan", "carbon plate shoes"]
    countries = ["US", "DE", "GB"]
    asyncio.run(track_batch(keywords, "example.com", countries, max_concurrent=5))

CSV Export and Historical Trend Analysis

def export_to_csv(conn, output_path="rankings_export.csv"):
    rows = conn.execute(
        "SELECT keyword, target_domain, country, device, position, captured_at "
        "FROM rankings ORDER BY keyword, captured_at"
    ).fetchall()
    with open(output_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["keyword", "target_domain", "country", "device", "position", "captured_at"])
        writer.writerows(rows)
    logger.info(f"Exported {len(rows)} rows to {output_path}")

def rank_volatility(conn, keyword, days=7):
    """Rank volatility (standard deviation) over the most recent N snapshots, which is N days at one snapshot per day."""
    rows = conn.execute(
        "SELECT position FROM rankings WHERE keyword=? ORDER BY captured_at DESC LIMIT ?",
        (keyword, days)
    ).fetchall()
    positions = [r[0] for r in rows if r[0] and r[0] <= 100]
    if len(positions) < 2:
        return 0.0
    avg = sum(positions) / len(positions)
    variance = sum((p - avg) ** 2 for p in positions) / len(positions)
    return variance ** 0.5
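
Volatility tells you how noisy a keyword is; smoothing makes the underlying trend easier to read. A trailing moving average is one simple option, sketched below. The function name and the default window of 3 are assumptions, and positions are expected oldest first, with None marking days when the domain was not found.

```python
from statistics import mean

def moving_average(positions, window=3):
    """Trailing moving average over rank positions (oldest first)."""
    smoothed = []
    for i in range(len(positions)):
        # Take up to `window` most recent values, ignoring missing days
        chunk = [p for p in positions[max(0, i - window + 1): i + 1] if p is not None]
        smoothed.append(round(mean(chunk), 2) if chunk else None)
    return smoothed

print(moving_average([5, 7, 6, 12, 6]))
```

A one-day spike to position 12 still moves the smoothed series, but by far less than in the raw data, which is usually what you want before alerting on rank drops.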

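Ethics and Limits: Using a Rank Tracker Responsibly

Follow these principles when building a rank tracker:

  • Prefer official APIs: if your query volume is low (fewer than 100 queries per day), use the Google Custom Search API or an official SERP API, which is compliant and stable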
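  • Respect robots.txt: Google's robots.txt disallows the /search path for generic crawlers, so weigh that before scraping and keep request rates conservative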
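  • Track only public data: do not track content that requires a login, and do not store personal data
  • Control request frequency: one or two checks per keyword per day is enough; high-frequency querying is unnecessary
  • GDPR/CCPA compliance: if your data involves users in the EU or California, make sure storage complies with the relevant regulations

Key reminder: use a rank tracker to monitor domains you own or to carry out legitimate competitive analysis. Unauthorized large-scale SERP scraping may violate Google's Terms of Service.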

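Key Takeaways

  • Daily SERP snapshots reveal ranking trends and the impact of algorithm updates far better than one-off checks
  • With num=100 gone, you must page with start=0, 10, 20, ..., 90 to collect the top 100 results
  • Residential proxies, city-level geotargeting, and sticky sessions are the core strategy for avoiding IP blocks
  • Mimicking a browser TLS fingerprint with curl_cffi's impersonate='chrome' is needed to get past JA3/JA4 checks
  • Production setups need exponential backoff retries, CAPTCHA detection, concurrency limits, and rank-volatility smoothing
  • For low-volume use, prefer official APIs; a proxy-based setup is only needed for large-scale tracking

Ready to build your own rank tracker? See ProxyHat's pricing plans and proxy locations list, or read more about SERP tracking use cases and web scraping scenarios. For the full API reference, see the official ProxyHat documentation.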
