Introduction: Why Ruby Developers Need to Understand Proxies
When you build data pipelines, price-monitoring systems, or SERP scrapers, a proxy server stops being optional and becomes a requirement. High-frequency requests from a single IP trigger rate limits, CAPTCHA challenges, and eventually IP bans. The Ruby ecosystem offers several ways to integrate proxies, from the standard library's Net::HTTP to the high-performance Typhoeus, to SDKs from dedicated proxy services.
This guide starts from the ground up and builds a production-grade, proxy-aware HTTP client, covering authentication, error handling, concurrent requests, TLS configuration, and Rails integration.
Net::HTTP: Proxy Basics from the Standard Library
Net::HTTP is part of Ruby's standard library and needs no extra dependencies. It supports HTTP proxies natively, but configuring it correctly requires understanding how its proxy constructor works.
Basic Proxy Configuration
The following code shows how to send a request through a ProxyHat residential proxy:
require 'net/http'
require 'uri'
# ProxyHat connection parameters
PROXY_HOST = 'gate.proxyhat.com'
PROXY_PORT = 8080
PROXY_USER = 'your_username'
PROXY_PASS = 'your_password'
def fetch_with_proxy(url, proxy_user: PROXY_USER, proxy_pass: PROXY_PASS)
uri = URI.parse(url)
# Create the proxy-aware connection
proxy = Net::HTTP::Proxy(PROXY_HOST, PROXY_PORT, proxy_user, proxy_pass)
http = proxy.new(uri.host, uri.port)
# Configure timeouts and TLS
http.use_ssl = (uri.scheme == 'https')
http.open_timeout = 15
http.read_timeout = 30
http.verify_mode = OpenSSL::SSL::VERIFY_PEER
request = Net::HTTP::Get.new(uri.request_uri)
request['User-Agent'] = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
request['Accept'] = 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'
response = http.request(request)
{
status: response.code.to_i,
headers: response.each_header.to_h,
body: response.body
}
rescue Net::OpenTimeout => e
{ error: 'connection_timeout', message: e.message }
rescue Net::ReadTimeout => e
{ error: 'read_timeout', message: e.message }
rescue Net::HTTPBadResponse => e
{ error: 'invalid_response', message: e.message }
rescue Errno::ECONNREFUSED => e
{ error: 'connection_refused', message: e.message }
rescue SocketError => e
{ error: 'dns_resolution_failed', message: e.message }
end
# Usage example
result = fetch_with_proxy('https://httpbin.org/ip')
puts result[:body] if result[:status] == 200
Net::HTTP::Proxy returns a proxy-aware class; instances created with its new method route every request through the specified proxy. The username and password are passed automatically via the Proxy-Authorization header.
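If you don't need a separate proxy-aware class, the same proxy parameters can also be passed positionally to Net::HTTP.start. A minimal sketch, reusing the PROXY_* constants and the httpbin.org test URL from above:
require 'net/http'
require 'uri'
uri = URI.parse('https://httpbin.org/ip')
# Positional arguments: host, port, proxy host, proxy port, proxy user, proxy password
Net::HTTP.start(uri.host, uri.port,
                PROXY_HOST, PROXY_PORT, PROXY_USER, PROXY_PASS,
                use_ssl: uri.scheme == 'https') do |http|
  response = http.get(uri.request_uri)
  puts response.body
end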
A Robust Version with Retries
Production environments need retry logic to handle transient failures:
require 'net/http'
require 'uri'
class RobustProxyClient
MAX_RETRIES = 3
RETRY_DELAY = 2 # seconds
attr_reader :proxy_host, :proxy_port, :proxy_user, :proxy_pass
def initialize(proxy_host: 'gate.proxyhat.com',
proxy_port: 8080,
proxy_user: ENV['PROXYHAT_USER'],
proxy_pass: ENV['PROXYHAT_PASS'])
@proxy_host = proxy_host
@proxy_port = proxy_port
@proxy_user = proxy_user
@proxy_pass = proxy_pass
end
def get(url, headers: {}, timeout: 30)
retries = 0
loop do
result = execute_request(url, headers, timeout)
# Return on success, on a non-retryable error, or once retries are exhausted
return result if result[:status] || result[:retryable] == false || retries >= MAX_RETRIES
retries += 1
sleep(RETRY_DELAY * retries) # linear backoff: 2s, 4s, 6s
end
end
private
def execute_request(url, headers, timeout)
uri = URI.parse(url)
proxy_class = Net::HTTP::Proxy(proxy_host, proxy_port, proxy_user, proxy_pass)
http = proxy_class.new(uri.host, uri.port)
http.use_ssl = (uri.scheme == 'https')
http.open_timeout = 15
http.read_timeout = timeout
http.verify_mode = OpenSSL::SSL::VERIFY_PEER
http.min_version = OpenSSL::SSL::TLS1_2_VERSION
request = Net::HTTP::Get.new(uri.request_uri)
headers.each { |k, v| request[k] = v }
request['User-Agent'] ||= 'RubyProxyClient/1.0'
response = http.request(request)
{ status: response.code.to_i, body: response.body, headers: response.each_header.to_h }
rescue Net::OpenTimeout, Net::ReadTimeout => e
{ error: 'timeout', retryable: true }
rescue Errno::ECONNRESET, Errno::EPIPE => e
{ error: 'connection_reset', retryable: true }
rescue Net::HTTPBadResponse => e
{ error: 'invalid_response', retryable: true }
rescue => e
{ error: 'unexpected_error', message: e.message, retryable: false }
end
end
# Usage example
client = RobustProxyClient.new(
proxy_user: 'user-country-US', # US residential proxy
proxy_pass: 'your_password'
)
response = client.get('https://httpbin.org/ip', headers: { 'Accept' => 'application/json' })
puts response[:body] if response[:status] == 200
Typhoeus: High-Performance Concurrent Requests on libcurl
Typhoeus is a Ruby HTTP client built on libcurl that supports true parallel requests. Its Hydra interface can fire dozens or even hundreds of requests at the same time, which makes it a great fit for large-scale scraping.
Single-Request Proxy Configuration
require 'typhoeus'
def typhoeus_proxy_request(url, proxy_url: nil)
# Proxy URL format: http://user:pass@host:port
proxy_url ||= "http://#{ENV['PROXYHAT_USER']}:#{ENV['PROXYHAT_PASS']}@gate.proxyhat.com:8080"
request = Typhoeus::Request.new(
url,
method: :get,
proxy: proxy_url,
proxyauth: :any, # auto-detect the authentication scheme
headers: {
'User-Agent' => 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)',
'Accept' => 'text/html,application/xhtml+xml'
},
timeout: 30,
connecttimeout: 15,
followlocation: true,
ssl_verifypeer: true,
ssl_verifyhost: 2
)
response = request.run
if response.success?
{
status: response.code,
body: response.body,
headers: response.response_headers,
total_time: response.total_time
}
elsif response.timed_out?
{ error: 'timeout' }
elsif response.code == 0
{ error: 'network_failure', message: response.return_message }
else
{ status: response.code, error: 'http_error' }
end
end
# Use a geo-targeted proxy
proxy_url = 'http://user-country-DE-city-berlin:password@gate.proxyhat.com:8080'
result = typhoeus_proxy_request('https://httpbin.org/ip', proxy_url: proxy_url)
puts result[:body]
Hydra: Typhoeus's Concurrent Request Engine
Hydra is Typhoeus's core strength: it runs many requests concurrently while managing the maximum concurrency and timeouts for you.
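Before the full scraper class, here is a minimal sketch of the Hydra workflow (proxy options omitted for brevity, httpbin.org used as a stand-in target): queue the requests, run them all at once, then read each request's response.
require 'typhoeus'
hydra = Typhoeus::Hydra.new(max_concurrency: 10)
requests = 5.times.map do
  request = Typhoeus::Request.new('https://httpbin.org/uuid', timeout: 10)
  hydra.queue(request)
  request
end
hydra.run # blocks until every queued request has finished
requests.each { |request| puts request.response.code }
The ConcurrentScraper class below builds on the same pattern, adding proxy settings and per-request callbacks: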
require 'typhoeus'
class ConcurrentScraper
MAX_CONCURRENCY = 50
def initialize(proxy_user:, proxy_pass:, country: nil, city: nil)
@proxy_user = build_proxy_user(proxy_user, country, city)
@proxy_pass = proxy_pass
@hydra = Typhoeus::Hydra.new(max_concurrency: MAX_CONCURRENCY)
@results = Queue.new
end
def scrape_urls(urls)
requests = urls.map { |url| build_request(url) }
requests.each { |req| @hydra.queue(req) }
@hydra.run
results = []
until @results.empty?
results << @results.pop
end
results
end
private
def build_proxy_user(base_user, country, city)
user = base_user
user += "-country-#{country}" if country
user += "-city-#{city}" if city
user
end
def build_request(url)
proxy_url = "http://#{@proxy_user}:#{@proxy_pass}@gate.proxyhat.com:8080"
request = Typhoeus::Request.new(
url,
method: :get,
proxy: proxy_url,
proxyauth: :any,
headers: {
'User-Agent' => 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
'Accept-Language' => 'en-US,en;q=0.9'
},
timeout: 25,
connecttimeout: 10,
followlocation: true,
ssl_verifypeer: true
)
request.on_success do |response|
@results << {
url: url,
status: response.code,
body: response.body,
time: response.total_time,
success: response.success?
}
end
request.on_failure do |response|
@results << {
url: url,
status: response.code,
error: response.return_message,
success: false
}
end
request
end
end
# Usage example: scrape 100 URLs concurrently
urls = (1..100).map { |i| "https://httpbin.org/delay/#{rand(1..3)}" }
scraper = ConcurrentScraper.new(
proxy_user: 'your_username',
proxy_pass: 'your_password',
country: 'US'
)
results = scraper.scrape_urls(urls)
successful = results.count { |r| r[:success] }
puts "成功: #{successful}/#{urls.size}"
puts "平均耗时: #{results.sum { |r| r[:time] || 0 } / results.size}秒"
The ProxyHat Ruby SDK: Rotation and Geo-Targeting
ProxyHat ships a dedicated Ruby SDK that wraps proxy rotation, session management, and geo-targeting, which simplifies working with residential proxies:
require 'proxyhat_sdk' # gem install proxyhat_sdk
require 'securerandom' # SecureRandom is used for rotation suffixes below
class ProxyHatClient
attr_reader :config
def initialize(username:, password:, default_country: nil)
@config = {
gateway: 'gate.proxyhat.com',
http_port: 8080,
socks5_port: 1080,
username: username,
password: password,
default_country: default_country
}
end
# Rotating proxy: a new IP for every request
def rotating_proxy_url(country: nil, city: nil)
user = build_username(rotate: true, country: country, city: city)
"http://#{user}:#{@config[:password]}@#{@config[:gateway]}:#{@config[:http_port]}"
end
# Sticky session: keep the same IP
def sticky_session_url(session_id:, country: nil, city: nil)
user = build_username(session: session_id, country: country, city: city)
"http://#{user}:#{@config[:password]}@#{@config[:gateway]}:#{@config[:http_port]}"
end
# SOCKS5 proxy
def socks5_proxy_url(country: nil)
user = build_username(rotate: true, country: country)
"socks5://#{user}:#{@config[:password]}@#{@config[:gateway]}:#{@config[:socks5_port]}"
end
private
def build_username(rotate: false, session: nil, country: nil, city: nil)
parts = [@config[:username]]
if session
parts << "session-#{session}"
elsif rotate
parts << "rotate-#{SecureRandom.hex(8)}"
end
parts << "country-#{country || @config[:default_country]}" if country || @config[:default_country]
parts << "city-#{city}" if city
parts.join('-')
end
end
# Usage example
client = ProxyHatClient.new(
username: 'your_username',
password: 'your_password',
default_country: 'US'
)
# Rotating proxy request
rotating_url = client.rotating_proxy_url(country: 'DE', city: 'berlin')
puts "轮换代理: #{rotating_url}"
# 粘性会话(同一 IP 保持 10 分钟)
sticky_url = client.sticky_session_url(session_id: 'order_12345', country: 'GB')
puts "粘性会话: #{sticky_url}"
In Practice: Scraping 1,000 URLs Concurrently
Below is a complete, production-grade example that scrapes 1,000 URLs concurrently through rotating residential proxies:
require 'typhoeus'
require 'json'
require 'logger'
require 'securerandom'
class ProductionScraper
BATCH_SIZE = 100
MAX_CONCURRENCY = 50
RETRY_COUNT = 2
def initialize(proxy_user:, proxy_pass:, country: 'US')
@proxy_user = proxy_user
@proxy_pass = proxy_pass
@country = country
@logger = Logger.new(STDOUT)
@logger.level = Logger::INFO
end
def scrape(urls)
@logger.info "开始抓取 #{urls.size} 个 URL"
start_time = Time.now
results = { success: [], failed: [], retried: [] }
urls.each_slice(BATCH_SIZE).with_index do |batch, batch_idx|
@logger.info "处理批次 #{batch_idx + 1}/#{(urls.size.to_f / BATCH_SIZE).ceil}"
batch_results = process_batch(batch)
batch_results.each do |result|
if result[:success]
results[:success] << result
else
results[:failed] << result
end
end
# Short pause between batches to avoid triggering rate limits
sleep(0.5) unless batch_idx == (urls.size.to_f / BATCH_SIZE).ceil - 1
end
# Retry failed requests
if results[:failed].any?
@logger.info "重试 #{results[:failed].size} 个失败请求"
retry_results = retry_failed(results[:failed])
results[:success] += retry_results[:success]
results[:retried] = retry_results[:success]
results[:failed] = retry_results[:failed]
end
elapsed = Time.now - start_time
@logger.info "完成! 成功: #{results[:success].size}, 失败: #{results[:failed].size}, 耗时: #{elapsed.round(2)}秒"
results
end
private
def process_batch(urls)
hydra = Typhoeus::Hydra.new(max_concurrency: MAX_CONCURRENCY)
results = []
mutex = Mutex.new
urls.each do |url|
request = build_request(url)
request.on_complete do |response|
mutex.synchronize do
results << parse_response(url, response)
end
end
hydra.queue(request)
end
hydra.run
results
end
def build_request(url, retry_count: 0)
# Each request gets a fresh rotating IP via a new session id
session_id = SecureRandom.hex(8)
proxy_user = "#{@proxy_user}-session-#{session_id}-country-#{@country}"
proxy_url = "http://#{proxy_user}:#{@proxy_pass}@gate.proxyhat.com:8080"
Typhoeus::Request.new(
url,
method: :get,
proxy: proxy_url,
proxyauth: :any,
headers: random_headers,
timeout: 30,
connecttimeout: 15,
followlocation: true,
ssl_verifypeer: true,
maxredirs: 3
)
end
def random_headers
user_agents = [
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Safari/605.1.15',
'Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0'
]
{
'User-Agent' => user_agents.sample,
'Accept' => 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
'Accept-Language' => 'en-US,en;q=0.9',
'Accept-Encoding' => 'gzip, deflate, br',
'DNT' => '1',
'Connection' => 'keep-alive',
'Upgrade-Insecure-Requests' => '1'
}
end
def parse_response(url, response)
{
url: url,
status: response.code,
body: response.body,
time: response.total_time,
success: response.success? && response.code.between?(200, 299),
size: response.body&.bytesize || 0
}
end
def retry_failed(failed_results)
success = []
still_failed = []
failed_results.each do |result|
RETRY_COUNT.times do |attempt|
sleep(2 ** attempt) # 指数退避
request = build_request(result[:url])
response = request.run
if response.success? && response.code.between?(200, 299)
success << parse_response(result[:url], response)
break
elsif attempt == RETRY_COUNT - 1
still_failed << result
end
end
end
{ success: success, failed: still_failed }
end
end
# Run the scrape
scraper = ProductionScraper.new(
proxy_user: ENV['PROXYHAT_USER'],
proxy_pass: ENV['PROXYHAT_PASS'],
country: 'US'
)
# Generate 1,000 test URLs
urls = (1..1000).map { "https://httpbin.org/uuid" }
results = scraper.scrape(urls)
# Print statistics
puts "\n=== Scraping statistics ==="
puts "Success rate: #{(results[:success].size.to_f / urls.size * 100).round(2)}%"
puts "Total data: #{results[:success].sum { |r| r[:size] }} bytes"
puts "Average response time: #{results[:success].sum { |r| r[:time] } / results[:success].size} seconds"
TLS/SSL Configuration and Certificate Handling
TLS configuration needs extra care when a proxy is in the path. Some upstream servers use self-signed certificates or ship incomplete certificate chains, so the client has to handle them flexibly:
require 'net/http'
require 'openssl'
class TLSAwareProxyClient
def initialize(proxy_host: 'gate.proxyhat.com', proxy_port: 8080)
@proxy_host = proxy_host
@proxy_port = proxy_port
end
# Strict verification (recommended for production)
def fetch_strict_tls(url, proxy_user:, proxy_pass:)
uri = URI.parse(url)
proxy = Net::HTTP::Proxy(@proxy_host, @proxy_port, proxy_user, proxy_pass)
http = proxy.new(uri.host, uri.port)
http.use_ssl = (uri.scheme == 'https')
http.verify_mode = OpenSSL::SSL::VERIFY_PEER
http.cert_store = default_cert_store
http.min_version = OpenSSL::SSL::TLS1_2_VERSION
http.max_version = OpenSSL::SSL::TLS1_3_VERSION
request = Net::HTTP::Get.new(uri.request_uri)
http.request(request)
end
# Permissive verification (only for test environments or known upstreams)
def fetch_permissive_tls(url, proxy_user:, proxy_pass:)
uri = URI.parse(url)
proxy = Net::HTTP::Proxy(@proxy_host, @proxy_port, proxy_user, proxy_pass)
http = proxy.new(uri.host, uri.port)
http.use_ssl = (uri.scheme == 'https')
http.verify_mode = OpenSSL::SSL::VERIFY_NONE # Warning: insecure
http.min_version = OpenSSL::SSL::TLS1_2_VERSION
request = Net::HTTP::Get.new(uri.request_uri)
http.request(request)
end
# Custom certificate store (corporate environments)
def fetch_with_custom_ca(url, proxy_user:, proxy_pass:, ca_path: nil, ca_file: nil)
uri = URI.parse(url)
proxy = Net::HTTP::Proxy(@proxy_host, @proxy_port, proxy_user, proxy_pass)
http = proxy.new(uri.host, uri.port)
http.use_ssl = (uri.scheme == 'https')
http.verify_mode = OpenSSL::SSL::VERIFY_PEER
http.cert_store = custom_cert_store(ca_path: ca_path, ca_file: ca_file)
# Hostname checks and SNI are handled automatically by Net::HTTP when VERIFY_PEER is set
request = Net::HTTP::Get.new(uri.request_uri)
http.request(request)
end
private
def default_cert_store
store = OpenSSL::X509::Store.new
store.set_default_paths
store
end
def custom_cert_store(ca_path: nil, ca_file: nil)
store = OpenSSL::X509::Store.new
store.add_path(ca_path) if ca_path && Dir.exist?(ca_path)
store.add_file(ca_file) if ca_file && File.exist?(ca_file)
store.set_default_paths
store
end
end
# Typhoeus TLS configuration examples
require 'typhoeus'
def typhoeus_strict_tls_request(url, proxy_url)
Typhoeus::Request.new(
url,
method: :get,
proxy: proxy_url,
ssl_verifypeer: true, # verify the peer certificate
ssl_verifyhost: 2, # verify the hostname
sslversion: :tlsv1_2,
cainfo: '/etc/ssl/certs/ca-certificates.crt', # CA bundle path on Linux
timeout: 30
).run
end
def typhoeus_permissive_request(url, proxy_url)
Typhoeus::Request.new(
url,
method: :get,
proxy: proxy_url,
ssl_verifypeer: false, # skip certificate verification
ssl_verifyhost: 0, # do not verify the hostname
timeout: 30
).run
end
Rails Integration: Faraday Middleware and ActiveJob
In a Rails application, Faraday is the recommended HTTP client abstraction layer; it makes testing easier and lets you reuse middleware across the app:
Faraday Proxy Middleware
# config/initializers/proxy_client.rb
require 'faraday'
require 'faraday/retry'
class ProxyFaradayClient
def initialize(proxy_user:, proxy_pass:, country: nil)
@proxy_user = proxy_user
@proxy_pass = proxy_pass
@country = country
end
def connection
@connection ||= Faraday.new do |builder|
builder.request :retry, {
max: 3,
interval: 1,
backoff_factor: 2,
retry_statuses: [429, 500, 502, 503, 504],
methods: [:get, :head, :options]
}
builder.response :json, content_type: /\bjson\b/
builder.response :raise_error
builder.adapter :typhoeus
builder.options.timeout = 30
builder.options.open_timeout = 15
end
end
def get(url, headers: {}, country: nil)
proxy_url = build_proxy_url(country: country || @country)
connection.get(url) do |req|
req.options.proxy = proxy_url
headers.each { |k, v| req.headers[k] = v }
end
rescue Faraday::Error => e
{ error: e.class.name, message: e.message }
end
private
def build_proxy_url(country: nil)
user = @proxy_user
user += "-country-#{country}" if country
user += "-rotate-#{SecureRandom.hex(4)}"
"http://#{user}:#{@proxy_pass}@gate.proxyhat.com:8080"
end
end
# Global configuration
PROXY_CLIENT = ProxyFaradayClient.new(
proxy_user: ENV['PROXYHAT_USER'],
proxy_pass: ENV['PROXYHAT_PASS'],
country: 'US'
)
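Once the initializer has run, any part of the app can use the shared client. A small usage sketch (httpbin.org as a stand-in target; on failure the method returns the error hash defined above instead of a Faraday::Response):
result = PROXY_CLIENT.get('https://httpbin.org/ip', headers: { 'Accept' => 'application/json' }, country: 'DE')
if result.respond_to?(:status)
  Rails.logger.info "Fetched #{result.status}: #{result.body}"
else
  Rails.logger.warn "Proxy request failed: #{result[:message]}"
end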
ActiveJob Background Scraping Jobs
# app/jobs/scraping_job.rb
class ScrapingJob < ApplicationJob
queue_as :scraping
# Retry strategy (NetworkError and ScrapingTimeoutError are application-defined error classes)
retry_on NetworkError, wait: :polynomially_longer, attempts: 3
discard_on ScrapingTimeoutError
def perform(urls, options = {})
@country = options[:country] || 'US'
@results = []
urls.each_slice(50) do |batch|
process_batch(batch)
end
# Store the results
store_results(@results)
# Send a notification
ScrapingCompletionMailer.notify(@results.size).deliver_later
end
private
def process_batch(urls)
hydra = Typhoeus::Hydra.new(max_concurrency: 25)
mutex = Mutex.new
urls.each do |url|
request = build_request(url)
request.on_complete do |response|
mutex.synchronize do
@results << {
url: url,
status: response.code,
body: response.body,
scraped_at: Time.current
}
end
end
hydra.queue(request)
end
hydra.run
end
def build_request(url)
session_id = SecureRandom.hex(6)
proxy_user = "#{ENV['PROXYHAT_USER']}-session-#{session_id}-country-#{@country}"
proxy_url = "http://#{proxy_user}:#{ENV['PROXYHAT_PASS']}@gate.proxyhat.com:8080"
Typhoeus::Request.new(
url,
method: :get,
proxy: proxy_url,
timeout: 25,
connecttimeout: 10,
followlocation: true,
ssl_verifypeer: true,
headers: {
'User-Agent' => random_user_agent,
'Accept' => 'text/html,application/xhtml+xml'
}
)
end
def random_user_agent
[
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15',
'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36'
].sample
end
def store_results(results)
# Bulk-insert into the database
ScrapedPage.insert_all(
results.map { |r| r.merge(created_at: Time.current, updated_at: Time.current) }
)
end
end
# Invocation
ScrapingJob.perform_later(
['https://example.com/page1', 'https://example.com/page2'],
country: 'DE'
)
Proxy Type Comparison
| Feature | Residential proxies | Datacenter proxies | Mobile proxies |
|---|---|---|---|
| IP source | Real residential ISPs | Cloud provider IP ranges | Mobile carrier 4G/5G |
| Anonymity | Very high | Medium | Very high |
| Speed | Medium | Very fast | Slower |
| Success rate | 95%+ | 60-80% | 98%+ |
| Price | Mid to high | Low | High |
| Typical use cases | SERP, e-commerce, social media | Large-scale data collection | Mobile app scraping |
Key Takeaways
- Net::HTTP fits simple use cases and needs no extra dependencies, but its concurrency options are limited.
- Typhoeus + Hydra is the best choice for large-scale concurrent scraping, with true parallel requests.
- Proxy rotation is controlled through the username parameters; appending `session-{random ID}` gets a fresh IP for each request.
- Geo-targeting uses the `country-{country code}-city-{city}` format to control the exit location precisely.
- TLS configuration: always use `VERIFY_PEER` in production; only relax verification temporarily in test environments.
- Rails integration: use the Faraday abstraction layer together with ActiveJob for background jobs.
The right proxy setup depends on your specific needs. For SERP scraping or price monitoring that demands a high success rate and strong anonymity, residential proxies are the first choice. If raw speed matters most and the target site has weak anti-bot defenses, datacenter proxies are the more cost-effective option.
Visit the ProxyHat pricing page to learn more about residential proxy plans, or see the web scraping use cases for further technical details.