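If you need to track keyword rankings at scale, querying Google by hand is neither reliable nor scalable. This article shows how to build a Google rank tracker in Python on top of residential proxies, from the data model through production hardening, with complete code examples.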
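Why residential proxies matter when building a Google rank tracker in Python
Google's search results page (SERP) varies with the user's IP address, device, language, and location. If you query the same keywords repeatedly from a single datacenter IP, Google's anti-bot systems typically respond with a CAPTCHA or an empty result page after a few dozen requests. Google's Search Central spam policies list automated queries, including scraping results for rank checking, as prohibited machine-generated traffic. The detection mechanisms themselves are not publicly documented, but they are widely reported to include TLS fingerprinting (JA3/JA4), IP reputation scoring, and behavioral analysis.
Residential proxies send requests through IP addresses assigned by real ISPs, so the traffic looks like it comes from ordinary home users. Combined with city-level geotargeting and sticky sessions, this lets you reproduce searches from different regions while substantially lowering the chance of being blocked.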
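Data model design: why daily SERP snapshots beat one-off checks
Rankings are dynamic: the same keyword can rank differently in the morning and in the afternoon of the same day. A one-off check gives you a single point in time, whereas scheduled daily crawls reveal ranking trends, volatility patterns, and the impact of algorithm updates.
The core data model is:
Keyword (keyword) | Target domain (target_domain) | Country (country)
Device (device) | Rank position (position) | Capture time (captured_at)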
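Storing the history in SQLite supports trend analysis without running a separate database server. The following code creates the table and inserts rows: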
import sqlite3
from datetime import datetime, timezone


def init_db(db_path="rank_tracker.db"):
    """Create the rankings table and its index if they do not exist yet."""
    conn = sqlite3.connect(db_path)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS rankings (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            keyword TEXT NOT NULL,
            target_domain TEXT NOT NULL,
            country TEXT NOT NULL DEFAULT 'US',
            device TEXT NOT NULL DEFAULT 'desktop',
            position INTEGER,
            result_url TEXT,
            captured_at TEXT NOT NULL,
            UNIQUE(keyword, target_domain, country, device, captured_at)
        )
    """)
    # Index the columns used by trend queries
    conn.execute(
        "CREATE INDEX IF NOT EXISTS idx_keyword_date "
        "ON rankings(keyword, captured_at)"
    )
    conn.commit()
    return conn


def save_ranking(conn, keyword, target_domain, country, device, position, result_url):
    """Insert one snapshot row, timestamped in UTC (ISO 8601)."""
    conn.execute(
        """INSERT OR REPLACE INTO rankings
        (keyword, target_domain, country, device, position, result_url, captured_at)
        VALUES (?, ?, ?, ?, ?, ?, ?)""",
        # datetime.utcnow() is deprecated since Python 3.12; use an aware UTC timestamp
        (keyword, target_domain, country, device, position, result_url,
         datetime.now(timezone.utc).isoformat())
    )
    conn.commit()
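Because captured_at stores a full timestamp, several rows can exist for the same day. As a minimal sketch, assuming the rankings schema above (the helper name daily_best_positions is hypothetical and not part of the original code), the best position per day could be pulled with a single query:

```python
import sqlite3


def daily_best_positions(conn, keyword, target_domain):
    """Return [(day, best_position), ...] ordered by day.

    Assumes captured_at is an ISO 8601 string, so its first 10
    characters are the calendar date (YYYY-MM-DD).
    """
    return conn.execute(
        """SELECT substr(captured_at, 1, 10) AS day, MIN(position)
           FROM rankings
           WHERE keyword = ? AND target_domain = ?
           GROUP BY day
           ORDER BY day""",
        (keyword, target_domain),
    ).fetchall()
```

Grouping by day keeps the trend line stable even when a keyword is checked more than once a day.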
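SERP pagination after Google removed num=100
In September 2025, Google stopped honoring the num=100 parameter, so a single request can no longer return 100 results. Each page now returns at most 10 organic results, and reaching the top 100 means paginating with start=0, 10, 20, ..., 90.
Keep the following in mind when paginating:
- Add a random delay (2-5 seconds) between pages to avoid tripping rate limits
- Skip ads (the sponsored label) and SERP features (knowledge panels, People Also Ask, and so on)
- Parse only the organic results inside div.g
- Record how many results each page actually returns, and stop early if a page returns 0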
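The pagination rules above can be sketched as a small generator. This is an illustration under stated assumptions rather than the article's implementation: paginate_serp and its fetch_page callback are hypothetical names, and fetch_page is assumed to return the already-filtered organic results for a given start offset.

```python
import random
import time
from typing import Callable, Iterator, List, Tuple


def paginate_serp(
    fetch_page: Callable[[int], List[str]],
    max_results: int = 100,
    page_size: int = 10,
    delay_range: Tuple[float, float] = (2.0, 5.0),
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[Tuple[int, List[str]]]:
    """Yield (start, results) for each page, stopping at the first empty page."""
    for start in range(0, max_results, page_size):
        results = fetch_page(start)
        if not results:
            # An empty page means there is nothing further to crawl
            break
        yield start, results
        if start + page_size < max_results:
            # Random pause between pages; injectable so tests can skip it
            sleep(random.uniform(*delay_range))
```

Passing sleep as a parameter keeps the delay out of unit tests while leaving the default behavior unchanged.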
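Why residential proxies are necessary: TLS fingerprints and IP reputation
Google uses several techniques to detect automated traffic:
| Detection dimension | Datacenter proxies | Residential proxies |
|---|---|---|
| TLS/JA3-JA4 fingerprint | Requires curl_cffi impersonation | Requires curl_cffi impersonation |
| IP reputation score | Low: easily flagged | High: assigned by real ISPs |
| Geotargeting precision | Usually country-level only | City-level (for example, Chicago) |
| Session persistence | Limited support | Sticky sessions supported |
| Block probability | High (roughly 50% or more of requests may be blocked) | Low (success rates of 99% or higher) |
Even with residential proxies, you still need curl_cffi's impersonate='chrome' option to mimic a real browser's TLS fingerprint. According to the curl_cffi project documentation, the library is built on curl-impersonate, a patched build of libcurl that reproduces browser TLS handshakes and therefore their JA3/JA4 fingerprints, which the standard requests library cannot do.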
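Hands-on: fetching SERPs with curl_cffi and ProxyHat residential proxies
Basic configuration
ProxyHat's residential proxies are reached through the gate.proxyhat.com:8080 gateway, and the username can carry the country, city, and session ID. Here is a Python example: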
from curl_cffi import requests as cffi_requests
import hashlib


def build_proxy_url(keyword, country="US", city="chicago"):
    """Derive a sticky session ID from the keyword so it always exits through the same IP."""
    session_id = hashlib.md5(keyword.encode()).hexdigest()[:12]
    # Omit the city segment when no city is given (country-level targeting)
    city_part = f"-city-{city}" if city else ""
    username = f"user-country-{country}{city_part}-session-{session_id}"
    password = "YOUR_PASSWORD"
    return f"http://{username}:{password}@gate.proxyhat.com:8080"


# Single SERP request
def fetch_serp(keyword, country="US", city="chicago", start=0):
    proxy = build_proxy_url(keyword, country, city)
    url = "https://www.google.com/search"
    params = {
        "q": keyword,
        "num": 10,
        "start": start,          # pagination offset: 0, 10, 20, ...
        "hl": "en",              # interface language
        "gl": country.lower(),   # country for result localization
    }
    headers = {
        "User-Agent": (
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
            "AppleWebKit/537.36 (KHTML, like Gecko) "
            "Chrome/131.0.0.0 Safari/537.36"
        ),
        "Accept-Language": "en-US,en;q=0.9",
    }
    resp = cffi_requests.get(
        url, params=params, headers=headers,
        proxies={"http": proxy, "https": proxy},
        impersonate="chrome",    # mimic Chrome's TLS fingerprint
        timeout=30,
    )
    resp.raise_for_status()
    return resp.text
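One practical detail when embedding credentials in a proxy URL: a password containing characters such as @, : or / will break the URL unless it is percent-encoded. The sketch below shows one way to guard against that. The function name proxy_url is hypothetical, and it assumes the HTTP client decodes percent-encoded proxy credentials, which libcurl-based clients generally do.

```python
from urllib.parse import quote, unquote, urlsplit


def proxy_url(username: str, password: str,
              host: str = "gate.proxyhat.com", port: int = 8080) -> str:
    """Build an HTTP proxy URL with percent-encoded credentials."""
    # safe="" forces encoding of every reserved character, including "/" and ":"
    user = quote(username, safe="")
    pwd = quote(password, safe="")
    return f"http://{user}:{pwd}@{host}:{port}"


if __name__ == "__main__":
    parts = urlsplit(proxy_url("user-country-US", "p@ss:w/rd"))
    # The host and port still parse correctly despite the special characters
    print(parts.hostname, parts.port, unquote(parts.password))
```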
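Simplifying proxy management with the ProxyHat SDK
If you need a more flexible IP rotation strategy, you can manage a pool of sessions through a small client class. Note that the class below builds the gateway usernames by hand and imports nothing beyond curl_cffi: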
import hashlib
import itertools
from curl_cffi import requests as cffi_requests


class ProxyHatSERPClient:
    def __init__(self, username, password, countries=None):
        self.base_user = username
        self.password = password
        self.countries = countries or ["US", "DE", "GB"]
        # Round-robin over countries when the caller does not pick one
        self._country_cycle = itertools.cycle(self.countries)

    def get_proxy(self, keyword, country=None, city=None, sticky=True):
        c = country or next(self._country_cycle)
        # Assumption: the account username replaces the literal "user" prefix
        parts = [f"{self.base_user}-country-{c}"]
        if city:
            parts.append(f"city-{city}")
        if sticky:
            # Same keyword -> same session ID -> same exit IP
            sid = hashlib.md5(keyword.encode()).hexdigest()[:12]
            parts.append(f"session-{sid}")
        username = "-".join(parts)
        proxy = f"http://{username}:{self.password}@gate.proxyhat.com:8080"
        return {"http": proxy, "https": proxy}

    def search(self, keyword, start=0, country=None, city=None):
        proxy = self.get_proxy(keyword, country, city)
        resp = cffi_requests.get(
            "https://www.google.com/search",
            params={"q": keyword, "num": 10, "start": start, "hl": "en"},
            proxies=proxy,
            impersonate="chrome",
            timeout=30,
        )
        resp.raise_for_status()
        return resp.text
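Parsing rank positions: combining CSS selectors with a regex
Google's HTML structure changes often, so you cannot rely on a single selector. Parse with selectolax or BeautifulSoup, and keep a regular expression as a fallback: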
from selectolax.parser import HTMLParser
from urllib.parse import urlparse
import re


def parse_organic_results(html, target_domain):
    """Parse organic results from one SERP page and return the target domain's position on it."""
    tree = HTMLParser(html)
    results = []
    position = 0
    # Primary selector: div.g, the container Google has used for organic results
    for node in tree.css("div.g"):
        link = node.css_first("a[href]")
        if not link:
            continue
        href = link.attributes.get("href") or ""
        if not href.startswith("http"):
            continue
        # Skip ads (crude check: any block whose text contains "sponsored")
        parent_text = node.text(separator=" ").lower()
        if "sponsored" in parent_text:
            continue
        position += 1
        host = urlparse(href).netloc
        results.append({"position": position, "url": href, "host": host})
    # Look for the target domain among the organic results
    target_rank = None
    target_url = None
    for r in results:
        if target_domain in r["host"]:
            target_rank = r["position"]
            target_url = r["url"]
            break
    # Regex fallback, used only when the CSS selector matched nothing
    if target_rank is None and not results:
        seen = []
        for match in re.finditer(r'href="(https?://[^"\']+)"', html, re.I):
            url = match.group(1)
            host = urlparse(url).netloc
            if "google." in host or url in seen:
                continue
            seen.append(url)
            if target_domain in host:
                # Approximate rank: order among distinct non-Google links on the page
                target_rank, target_url = len(seen), url
                break
    return target_rank, target_url, results
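A substring test such as target_domain in host also matches unrelated hosts (notexample.com contains example.com). A stricter comparison, shown here as a hedged sketch with the hypothetical helper name host_matches, accepts only the domain itself and its subdomains. It requires Python 3.9+ for str.removeprefix.

```python
from urllib.parse import urlparse


def host_matches(url: str, target_domain: str) -> bool:
    """True if url's host is target_domain or one of its subdomains."""
    host = (urlparse(url).hostname or "").lower()
    target = target_domain.lower().removeprefix("www.")
    # Exact match, or a subdomain boundary right before the target
    return host == target or host.endswith("." + target)
```

With this check, https://blog.example.com/post counts as a hit for example.com, while https://notexample.com/ and https://example.com.evil.io/ do not.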
Paginating through the top 100 results
import time
import random


def track_keyword(conn, keyword, target_domain, country="US", city="chicago", max_pages=10):
    """Crawl up to the top 100 results page by page and record the target domain's rank."""
    rank_found = None
    found_url = None
    for page in range(max_pages):
        start = page * 10
        try:
            html = fetch_serp(keyword, country, city, start=start)
        except Exception as e:
            print(f"[ERROR] Page {page} failed: {e}")
            continue
        rank, url, results = parse_organic_results(html, target_domain)
        # Target found on this page: record it and stop
        if rank is not None:
            # Add the offset of the preceding pages (assumes 10 results per page)
            rank_found = rank + start
            found_url = url
            break
        # No organic results on this page: stop paginating
        if len(results) == 0:
            print(f"[INFO] No results on page {page}, stopping pagination")
            break
        # Random delay of 2-5 seconds between pages
        time.sleep(random.uniform(2, 5))
    save_ranking(
        conn, keyword, target_domain, country, "desktop",
        rank_found or 101,  # 101 means "not found in the top 100"
        found_url or "",
    )
    return rank_found
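Adding the start offset to the on-page rank assumes every earlier page held exactly 10 organic results, which is not guaranteed once ads and SERP features are filtered out. A minimal sketch of an alternative, using the hypothetical helper absolute_rank, sums the organic counts actually observed on the preceding pages:

```python
from typing import Sequence


def absolute_rank(page_counts: Sequence[int], page_index: int, rank_on_page: int) -> int:
    """Convert an on-page rank into an overall rank.

    page_counts[i] is the number of organic results parsed on page i,
    in crawl order, including the page where the target was found.
    """
    if not 1 <= rank_on_page <= page_counts[page_index]:
        raise ValueError("rank_on_page is outside the results of that page")
    # Overall rank = organic results on earlier pages + position on this page
    return sum(page_counts[:page_index]) + rank_on_page
```

For pages holding 10, 8, and 9 organic results, a hit at position 3 on the third page is overall rank 21, whereas the start-offset formula would report 23.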
Production hardening: retries, CAPTCHA detection, and concurrency control
Retries with exponential backoff
import logging
from curl_cffi.requests.errors import RequestsError
from tenacity import retry, stop_after_attempt, wait_exponential, retry_if_exception_type

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


@retry(
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=2, max=30),
    # curl_cffi raises its own RequestsError on network failures;
    # the built-in exceptions are kept as a safety net
    retry=retry_if_exception_type((RequestsError, ConnectionError, TimeoutError)),
    before_sleep=lambda retry_state: logger.info(
        f"Retrying in {retry_state.next_action.sleep:.0f}s (attempt {retry_state.attempt_number})"
    ),
)
def fetch_serp_with_retry(keyword, country="US", city="chicago", start=0):
    html = fetch_serp(keyword, country, city, start=start)
    # CAPTCHA detection (not retried: the same session would hit the same block)
    captcha_markers = ["unusual traffic", "captcha"]
    lower_html = html.lower()
    if any(marker in lower_html for marker in captcha_markers):
        raise RuntimeError("CAPTCHA detected; switching session recommended")
    return html
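The CAPTCHA error above recommends switching sessions but leaves the mechanics open. One possible approach, sketched here with hypothetical names (CaptchaDetected, session_id_for, fetch_with_rotation) and a caller-supplied fetch function, derives a fresh session ID per attempt so that each retry exits through a different IP:

```python
import hashlib
from typing import Callable


class CaptchaDetected(RuntimeError):
    """Raised by the caller-supplied fetch function when a CAPTCHA page comes back."""


def session_id_for(keyword: str, attempt: int = 0) -> str:
    """Attempt 0 reproduces the keyword-only ID; later attempts get a new one."""
    seed = keyword if attempt == 0 else f"{keyword}#{attempt}"
    return hashlib.md5(seed.encode()).hexdigest()[:12]


def fetch_with_rotation(fetch: Callable[[str, str], str], keyword: str,
                        max_attempts: int = 3) -> str:
    """Call fetch(keyword, session_id), rotating the session after each CAPTCHA."""
    last_error = None
    for attempt in range(max_attempts):
        try:
            return fetch(keyword, session_id_for(keyword, attempt))
        except CaptchaDetected as exc:
            last_error = exc  # try again with the next session ID
    raise RuntimeError(
        f"CAPTCHA on all {max_attempts} sessions for {keyword!r}"
    ) from last_error
```

Because the IDs are deterministic, a rerun of the same keyword walks through the same sequence of sessions, which keeps results reproducible from day to day.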
Concurrency control and per-country proxy pools
import asyncio
import csv  # used by export_to_csv further down


def _track_blocking(keyword, target_domain, country, city):
    """Run the synchronous tracker with its own SQLite connection (sqlite3 connections are thread-bound)."""
    conn = init_db()
    try:
        return track_keyword(conn, keyword, target_domain, country, city)
    finally:
        conn.close()


async def track_batch(keywords, target_domain, countries, max_concurrent=5):
    """Track many keyword/country pairs concurrently, capped by one shared semaphore."""
    sem = asyncio.Semaphore(max_concurrent)

    async def track_one(keyword, country, city):
        async with sem:
            try:
                # track_keyword blocks (network I/O, time.sleep), so run it in a worker thread
                rank = await asyncio.to_thread(
                    _track_blocking, keyword, target_domain, country, city
                )
                logger.info(f"{keyword} | {country} → position {rank}")
                return rank
            except Exception as e:
                logger.error(f"Failed: {keyword} ({country}): {e}")
                return None

    tasks = []
    for kw in keywords:
        for country in countries:
            city = "chicago" if country == "US" else None
            tasks.append(track_one(kw, country, city))
    return await asyncio.gather(*tasks)


# Example run
if __name__ == "__main__":
    keywords = ["best running shoes", "marathon training plan", "carbon plate shoes"]
    countries = ["US", "DE", "GB"]
    asyncio.run(track_batch(keywords, "example.com", countries, max_concurrent=5))
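If each country should have its own concurrency budget rather than sharing one limit, a semaphore per country is one way to do it. The sketch below is illustrative only: run_per_country is a hypothetical name, and worker stands in for any async callable that tracks one keyword in one country.

```python
import asyncio
from collections import defaultdict
from typing import Awaitable, Callable, Iterable, List, Tuple


async def run_per_country(
    jobs: Iterable[Tuple[str, str]],
    worker: Callable[[str, str], Awaitable],
    limit_per_country: int = 2,
) -> List:
    """Run worker(keyword, country) for every job, limiting concurrency per country."""
    # One semaphore per country, created lazily on first use
    sems = defaultdict(lambda: asyncio.Semaphore(limit_per_country))

    async def guarded(keyword: str, country: str):
        async with sems[country]:
            return await worker(keyword, country)

    # gather preserves the order of the submitted jobs
    return await asyncio.gather(*(guarded(k, c) for k, c in jobs))
```

With this layout a slow or heavily blocked country cannot starve the others of slots.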
CSV export and historical trend analysis
def export_to_csv(conn, output_path="rankings_export.csv"):
    """Dump the full ranking history to a CSV file, ordered by keyword and time."""
    rows = conn.execute(
        "SELECT keyword, target_domain, country, device, position, captured_at "
        "FROM rankings ORDER BY keyword, captured_at"
    ).fetchall()
    with open(output_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["keyword", "target_domain", "country", "device", "position", "captured_at"])
        writer.writerows(rows)
    logger.info(f"Exported {len(rows)} rows to {output_path}")


def rank_volatility(conn, keyword, days=7):
    """Rank volatility (population standard deviation) over the last N snapshots.

    Assumes one snapshot per day, so the latest N rows cover the last N days.
    """
    rows = conn.execute(
        "SELECT position FROM rankings WHERE keyword=? ORDER BY captured_at DESC LIMIT ?",
        (keyword, days)
    ).fetchall()
    # Ignore missing values and the 101 "not found" sentinel
    positions = [r[0] for r in rows if r[0] and r[0] <= 100]
    if len(positions) < 2:
        return 0.0
    avg = sum(positions) / len(positions)
    variance = sum((p - avg) ** 2 for p in positions) / len(positions)
    return variance ** 0.5
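Standard deviation measures how noisy a series is. To smooth that noise before charting or alerting, a rolling median is one simple option because a single outlier day does not drag it. This is a sketch under the assumption that positions is a chronological list of daily ranks; rolling_median is a hypothetical helper name.

```python
from statistics import median
from typing import List, Sequence


def rolling_median(positions: Sequence[float], window: int = 3) -> List[float]:
    """Median of each position and up to window-1 preceding positions."""
    if window < 1:
        raise ValueError("window must be at least 1")
    smoothed = []
    for i in range(len(positions)):
        # Early entries use a shorter window instead of being dropped
        chunk = positions[max(0, i - window + 1): i + 1]
        smoothed.append(median(chunk))
    return smoothed
```

For the series 5, 6, 30, 6, 7 with a window of 3, the one-day spike to 30 disappears from the smoothed output.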
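Ethics and limits: using a rank tracker responsibly
Follow these principles when building a rank tracker:
- Prefer official APIs: for low query volumes (fewer than 100 per day), consider Google's Custom Search JSON API or another sanctioned SERP API, which is compliant and stable
- Respect robots.txt: Google's robots.txt disallows the /search path for generic crawlers, so treat any SERP collection conservatively and keep request rates low
- Track public data only: do not track content that requires a login, and do not store personal data
- Control request frequency: one or two checks per keyword per day is enough, and high-frequency polling is unnecessary
- GDPR/CCPA compliance: if your data involves users in the EU or California, make sure storage complies with the relevant regulations
Key reminder: a rank tracker should be used to monitor domains you own or to carry out legitimate competitive analysis. Unauthorized large-scale SERP scraping may violate Google's Terms of Service.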
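Key takeaways
- Daily SERP snapshots reveal ranking trends and the impact of algorithm updates better than one-off checks
- With num=100 gone, reaching the top 100 results requires paginating with start=0, 10, 20, ..., 90
- Residential proxies, city-level geotargeting, and sticky sessions are the core strategy for avoiding IP blocks
- curl_cffi's impersonate='chrome' TLS fingerprint impersonation is needed to get past JA3/JA4 detection
- Production use calls for exponential backoff retries, CAPTCHA detection, concurrency limits, and rank-volatility smoothing
- For low-volume use, prefer official APIs; a proxy setup is only needed for large-scale tracking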
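Ready to build your own rank tracker? See ProxyHat's pricing plans and list of proxy locations, or read more about SERP tracking use cases and web scraping scenarios. For the full API reference, see the official ProxyHat documentation.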






