Why Scrape Google Maps Data?
Google Maps contains the world's most comprehensive database of local businesses. With over 200 million listings, it includes names, addresses, phone numbers, websites, ratings, reviews, opening hours, and photos, all structured and searchable.
Extracting this data programmatically enables valuable business applications:
- Lead generation: build targeted business lists by industry and location
- Competitive analysis: map competitor locations, ratings, and review sentiment
- Market research: understand business density, pricing patterns, and service coverage by region
- Local SEO audits: verify your own business listings and compare them against competitors
- Data enrichment: supplement CRM records with fresh business information
This guide covers technical approaches to extracting Google Maps data using proxies. For broader SERP scraping strategies, see our complete guide to SERP scraping with proxies.
Google Places API vs Scraping
Before building a scraper, consider whether the official Google Places API meets your needs.
| Factor | Places API | Scraping |
|---|---|---|
| Cost | $17 per 1,000 requests (after the free tier) | Proxy bandwidth only (~$0.10-0.50 per 1,000 pages) |
| Data fields | Structured JSON, 20+ fields | All visible data, including review text |
| Rate limits | Strict per-second and daily limits | Limited only by proxy pool size |
| Review text | Up to 5 most relevant reviews | All reviews (with pagination) |
| Reliability | Official, stable endpoints | Requires parser maintenance |
| Terms of service | Fully compliant | Check the ToS and local regulations |
| Scaling | Expensive at scale | Cost-effective at scale |
The Places API is the best choice for small, production-critical applications. Scraping becomes more cost-effective when you need large datasets or full review text, or when API costs become prohibitive.
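As a rough sanity check on the crossover point, here is a quick cost comparison using the figures from the table above (the $0.30 per 1,000 pages proxy cost is an assumed midpoint of the quoted $0.10-0.50 range):

```python
# Rough break-even estimate between the Places API and proxy-based scraping.
# Prices come from the comparison table; the proxy figure is an assumed midpoint.

API_COST_PER_1K = 17.00    # USD per 1,000 Places API requests
PROXY_COST_PER_1K = 0.30   # USD per 1,000 scraped pages (assumed midpoint)

def monthly_cost(pages: int, cost_per_1k: float) -> float:
    """Cost in USD for a given number of requests or pages."""
    return pages / 1000 * cost_per_1k

for volume in (10_000, 100_000, 1_000_000):
    api = monthly_cost(volume, API_COST_PER_1K)
    proxy = monthly_cost(volume, PROXY_COST_PER_1K)
    print(f"{volume:>9,} pages: API ${api:,.0f} vs proxies ${proxy:,.0f}")
```

Even at modest volumes the gap is wide, which is why the free tier and per-field pricing matter more than the headline rate for small API workloads.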
Google Maps URL Structure
Understanding Google Maps URL patterns is essential for building a scraper. There are two main entry points:
Search results
Google Maps search results are accessible via:
```
# Browser URL format
https://www.google.com/maps/search/restaurants+near+new+york

# URL parameters for search
https://www.google.com/maps/search/{query}/@{lat},{lng},{zoom}z
```
Place details
Individual business pages follow this pattern:
```
# Place detail URL
https://www.google.com/maps/place/{business+name}/@{lat},{lng},{zoom}z/data=!{place_id}
```
Building a Google Maps Scraper
Google Maps is a JavaScript-heavy application. Unlike regular Google Search, simple HTTP requests often return incomplete data. There are two approaches: parse the JSON data embedded in the page source, or use a headless browser.
Approach 1: Parse Embedded JSON (Fast)
Google Maps pages contain structured data embedded in the HTML source. Here is how to extract it:
```python
import re

import requests

PROXY_URL = "http://USERNAME:PASSWORD@gate.proxyhat.com:8080"

def search_google_maps(query, location="us"):
    """Search Google Maps and extract business listings."""
    proxies = {"http": PROXY_URL, "https": PROXY_URL}
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
        "Accept-Language": "en-US,en;q=0.9",
        "Accept": "text/html,application/xhtml+xml",
    }

    # Use the search URL format
    search_url = f"https://www.google.com/maps/search/{query.replace(' ', '+')}"

    response = requests.get(
        search_url,
        headers=headers,
        proxies=proxies,
        timeout=20,
    )
    response.raise_for_status()

    # Google Maps embeds results as protobuf-like JSON arrays in the page
    # source. Match entries that look like a business name followed by a
    # street address. These patterns are brittle and may need updating as
    # Google changes its markup.
    businesses = []
    json_matches = re.findall(
        r'null,\["([^"]{5,80})"[^]]*?"([^"]*?(?:St|Ave|Rd|Blvd|Dr|Ln)[^"]*?)"',
        response.text,
    )
    for match in json_matches[:20]:
        businesses.append({
            "name": match[0],
            "address": match[1],
        })
    return businesses

results = search_google_maps("restaurants near Times Square New York")
for b in results:
    print(f"{b['name']} - {b['address']}")
```
Approach 2: Headless Browser (More Reliable)
For more reliable extraction, use a headless browser to render the JavaScript:
```python
import time

from playwright.sync_api import sync_playwright

def scrape_maps_with_browser(query):
    """Use Playwright to scrape Google Maps with full JS rendering."""
    with sync_playwright() as p:
        browser = p.chromium.launch(
            headless=True,
            proxy={
                "server": "http://gate.proxyhat.com:8080",
                "username": "USERNAME",
                "password": "PASSWORD",
            },
        )
        page = browser.new_page()
        page.set_extra_http_headers({
            "Accept-Language": "en-US,en;q=0.9",
        })

        # Navigate to Google Maps search
        search_url = f"https://www.google.com/maps/search/{query.replace(' ', '+')}"
        page.goto(search_url, wait_until="networkidle", timeout=30000)

        # Wait for results to load
        page.wait_for_selector('div[role="feed"]', timeout=10000)

        # Scroll to load more results
        feed = page.query_selector('div[role="feed"]')
        for _ in range(5):
            feed.evaluate("el => el.scrollBy(0, 1000)")
            time.sleep(1.5)

        # Extract business data from the results
        businesses = []
        items = page.query_selector_all('div[role="feed"] > div > div > a')
        for item in items:
            name = item.get_attribute("aria-label")
            href = item.get_attribute("href")
            if name and href:
                businesses.append({
                    "name": name,
                    "url": href,
                })

        browser.close()
        return businesses

results = scrape_maps_with_browser("coffee shops in San Francisco")
for b in results:
    print(f"{b['name']}")
    print(f"  {b['url'][:80]}...")
    print()
```
Extracting Business Details
Once you have a list of business URLs, extract detailed information from each listing:
```python
import re

import requests

PROXY_URL = "http://USERNAME:PASSWORD@gate.proxyhat.com:8080"

def extract_business_details(maps_url):
    """Extract detailed business info from a Google Maps place page."""
    proxies = {"http": PROXY_URL, "https": PROXY_URL}
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
        "Accept-Language": "en-US,en;q=0.9",
    }
    response = requests.get(maps_url, headers=headers, proxies=proxies, timeout=20)
    text = response.text
    business = {}

    # Extract the business name from the page title, which has the form
    # "<title>Business Name - Google Maps</title>"
    name_match = re.search(r"<title>(.+?)(?: - Google Maps)?</title>", text)
    if name_match:
        business["name"] = name_match.group(1)

    # Extract the average rating. The exact markup changes over time, so
    # treat this pattern as illustrative and verify it against live pages.
    rating_match = re.search(r"(\d\.\d) stars", text)
    if rating_match:
        business["rating"] = float(rating_match.group(1))

    return business
```
Proxy Strategy for Google Maps
Google Maps has its own anti-bot protections and requires a tailored proxy strategy.
Why residential proxies are required
Google Maps is particularly aggressive about blocking datacenter IPs. The application loads data through multiple API calls, and Google cross-references the IP across all of these requests. Residential proxies from ProxyHat are essential here because:
- They pass the IP reputation checks that the Maps API calls enforce
- They support city-level geo-targeting for location-specific searches
- They maintain the consistent session behavior Maps expects
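City-level geo-targeting is typically configured through the proxy username. A minimal sketch, assuming a `-country-`/`-city-` username suffix convention (this exact syntax is hypothetical, not documented ProxyHat behavior; check your provider's docs for the real parameter format):

```python
# Hypothetical geo-targeted proxy URL. The "-country-us-city-newyork"
# username suffix is an assumed convention, not confirmed ProxyHat syntax;
# consult your provider's documentation for the actual format.
GEO_PROXY = "http://USERNAME-country-us-city-newyork:PASSWORD@gate.proxyhat.com:8080"

# Route a location-specific search through the geo-targeted exit node
proxies = {"http": GEO_PROXY, "https": GEO_PROXY}
```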
Session management
Google Maps session handling differs from regular SERP scraping:
```python
# For Google Maps, use sticky sessions (same IP for a business detail page)
# ProxyHat supports session-based rotation via the proxy URL
# See docs.proxyhat.com for session configuration

# Rotating IP (for search listings)
ROTATING_PROXY = "http://USERNAME:PASSWORD@gate.proxyhat.com:8080"

# Sticky session (for individual place pages)
# Same session ID = same IP for the session duration
STICKY_PROXY = "http://USERNAME-session-maps123:PASSWORD@gate.proxyhat.com:8080"
```
Rate limiting
Google Maps is more sensitive to rapid requests than regular Google Search. Follow these guidelines:
- Wait 5-10 seconds between search result pages
- Wait 3-5 seconds between individual place page loads
- Limit concurrent requests to avoid burst patterns
- Use longer delays (8-15 seconds) when paginating through reviews
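The guidelines above can be wrapped in a small helper that picks a random delay from the appropriate range (the `DELAYS` mapping and function names are illustrative, not part of any library):

```python
import random
import time

# Delay ranges in seconds, taken from the guidelines above
DELAYS = {
    "search": (5, 10),    # between search result pages
    "place": (3, 5),      # between individual place page loads
    "reviews": (8, 15),   # between review pagination requests
}

def sample_delay(request_type: str) -> float:
    """Pick a random delay from the range for this request type."""
    low, high = DELAYS[request_type]
    return random.uniform(low, high)

def polite_sleep(request_type: str) -> float:
    """Sleep for a randomized, human-like interval between requests."""
    delay = sample_delay(request_type)
    time.sleep(delay)
    return delay
```

Call `polite_sleep("search")` between result pages and `polite_sleep("reviews")` between review pagination requests; the random jitter avoids the fixed-interval patterns that rate limiters flag.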
Node.js Implementation
```javascript
const axios = require('axios');
const { HttpsProxyAgent } = require('https-proxy-agent');

const agent = new HttpsProxyAgent('http://USERNAME:PASSWORD@gate.proxyhat.com:8080');

async function searchGoogleMaps(query) {
  const searchUrl = `https://www.google.com/maps/search/${encodeURIComponent(query)}`;
  const { data } = await axios.get(searchUrl, {
    headers: {
      'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
      'Accept-Language': 'en-US,en;q=0.9',
    },
    httpsAgent: agent,
    timeout: 20000,
  });

  // Extract business names from the response
  const businesses = [];
  const namePattern = /\["([^"]{3,80})",null,null,null,null,null,null,null/g;
  let match;
  while ((match = namePattern.exec(data)) !== null) {
    businesses.push({ name: match[1] });
  }
  return businesses;
}

async function main() {
  const results = await searchGoogleMaps('plumbers in Chicago');
  console.log(`Found ${results.length} businesses:`);
  results.forEach((b, i) => console.log(`${i + 1}. ${b.name}`));
}

main().catch(console.error);
```
Extracting Reviews at Scale
Google Maps reviews are among the most valuable data points. Each review includes the reviewer's name, a rating, text, a date, and sometimes photos.
```python
import re

import requests

PROXY_URL = "http://USERNAME:PASSWORD@gate.proxyhat.com:8080"

def extract_reviews(place_id, num_reviews=50):
    """Extract reviews for a Google Maps place from the place page source."""
    proxies = {"http": PROXY_URL, "https": PROXY_URL}
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
        "Accept-Language": "en-US,en;q=0.9",
    }
    reviews = []

    # Google Maps loads reviews via AJAX with pagination tokens;
    # the first page of reviews ships with the place page itself.
    maps_url = f"https://www.google.com/maps/place/?q=place_id:{place_id}"
    response = requests.get(maps_url, headers=headers, proxies=proxies, timeout=20)

    # Extract review data from the embedded JSON. Reviews typically appear
    # in arrays containing a rating, the review text, and the author name.
    review_pattern = re.findall(
        r'"(\d)","([^"]{10,500})"[^]]*?"([^"]{2,50})"',
        response.text,
    )
    for match in review_pattern[:num_reviews]:
        reviews.append({
            "rating": int(match[0]),
            "text": match[1],
            "author": match[2],
        })
    return reviews

# Example: extract reviews
reviews = extract_reviews("ChIJN1t_tDeuEmsRUsoyG83frY4")  # Example place ID
for r in reviews[:5]:
    print(f"{'*' * r['rating']} by {r['author']}")
    print(f"  {r['text'][:100]}...")
    print()
```
Structuring and Storing Data
Organize Google Maps data into a structured format for analysis:
```python
import csv
import json
from datetime import datetime

def save_businesses(businesses, output_format="json"):
    """Save scraped business data in a structured format."""
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    if output_format == "json":
        filename = f"maps_data_{timestamp}.json"
        with open(filename, "w", encoding="utf-8") as f:
            json.dump(businesses, f, indent=2, ensure_ascii=False)
    elif output_format == "csv":
        filename = f"maps_data_{timestamp}.csv"
        if businesses:
            keys = businesses[0].keys()
            with open(filename, "w", newline="", encoding="utf-8") as f:
                writer = csv.DictWriter(f, fieldnames=keys)
                writer.writeheader()
                writer.writerows(businesses)
    print(f"Saved {len(businesses)} businesses to {filename}")
    return filename
```
Legal and Ethical Considerations
Scraping Google Maps data raises important legal and ethical questions:
- Google's Terms of Service: Google's ToS prohibits automated scraping. Consider using the official Places API for production applications
- Data protection: business data such as phone numbers and addresses may be subject to data protection regulations in some jurisdictions
- Rate limiting: respect Google's infrastructure even when using proxies. Excessive scraping degrades service quality
- Data freshness: business information changes frequently, so always timestamp your data and refresh it regularly
For mission-critical applications, consider combining the official API for core data with targeted scraping for supplemental fields such as review text. This hybrid approach balances compliance with data completeness.
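A minimal sketch of that hybrid merge, assuming you already have a Places API "result" object and a list of scraped reviews (`merge_place_record` is a hypothetical helper; the stand-in data below avoids any network calls):

```python
def merge_place_record(api_data: dict, scraped_reviews: list) -> dict:
    """Combine core fields from the Places API with scraped review text.

    api_data is the JSON "result" object from a Places API Details response;
    scraped_reviews is the list produced by a review scraper.
    """
    return {
        # Core, ToS-compliant fields from the official API
        "name": api_data.get("name"),
        "address": api_data.get("formatted_address"),
        "rating": api_data.get("rating"),
        # Supplemental field: the API caps review text at 5 reviews
        "reviews": scraped_reviews,
        "review_count": len(scraped_reviews),
    }

# Example with stand-in data
api_data = {"name": "Joe's Coffee", "formatted_address": "1 Main St", "rating": 4.5}
scraped = [{"rating": 5, "text": "Great espresso", "author": "A."}]
record = merge_place_record(api_data, scraped)
print(record["name"], record["review_count"])  # Joe's Coffee 1
```

Keeping the compliant fields and the scraped fields in one record also makes it easy to drop the scraped portion later if your compliance posture changes.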
To learn more about web scraping best practices, see our complete guide to web scraping with proxies, learn how to avoid blocks in our anti-blocking guide, and consult the ProxyHat documentation for proxy configuration details.






