Anyone building web scrapers, API clients, or data pipelines in Ruby will sooner or later run into proxies, whether to work around IP rate limits, bypass geographic restrictions, or simply avoid exposing their own IP address. Yet proxy integration in Ruby is not always trivial, especially once authentication, parallel requests, or TLS configuration come into play.
This guide covers three approaches: the standard library's Net::HTTP, the libcurl-based Typhoeus library for parallel requests, and the ProxyHat Ruby SDK for automatic IP rotation. All with production-ready code.
Net::HTTP with a Proxy: The Standard Approach
Net::HTTP is part of the Ruby standard library and needs no additional gems. For simple proxy requirements it is perfectly sufficient. Proxy authentication is passed via explicit parameters.
require 'net/http'
require 'uri'
# ProxyHat connection details
PROXY_HOST = 'gate.proxyhat.com'
PROXY_PORT = 8080
PROXY_USER = 'your_username'
PROXY_PASS = 'your_password'
def fetch_with_proxy(url, proxy_user: nil, proxy_pass: nil)
uri = URI.parse(url)
# Build the proxy-aware connection class
proxy = Net::HTTP::Proxy(PROXY_HOST, PROXY_PORT, proxy_user, proxy_pass)
http = proxy.new(uri.host, uri.port)
# TLS/SSL configuration
http.use_ssl = (uri.scheme == 'https')
http.verify_mode = OpenSSL::SSL::VERIFY_PEER
http.open_timeout = 15
http.read_timeout = 30
request = Net::HTTP::Get.new(uri.request_uri)
request['User-Agent'] = 'Mozilla/5.0 (compatible; RubyScraper/1.0)'
request['Accept'] = 'text/html,application/xhtml+xml'
response = http.request(request)
case response
when Net::HTTPSuccess
{ success: true, status: response.code, body: response.body }
when Net::HTTPRedirection
{ success: false, status: response.code, redirect_to: response['location'] }
when Net::HTTPTooManyRequests
{ success: false, status: response.code, retry_after: response['retry-after'] }
else
{ success: false, status: response.code, message: response.message }
end
rescue Net::OpenTimeout => e
{ success: false, error: 'connection_timeout', message: e.message }
rescue Net::ReadTimeout => e
{ success: false, error: 'read_timeout', message: e.message }
rescue OpenSSL::SSL::SSLError => e
{ success: false, error: 'ssl_error', message: e.message }
rescue SocketError => e
{ success: false, error: 'dns_error', message: e.message }
ensure
http&.finish if http&.started?
end
# Example call with geo-targeting (the username carries the country code)
result = fetch_with_proxy(
'https://httpbin.org/ip',
proxy_user: "#{PROXY_USER}-country-US",
proxy_pass: PROXY_PASS
)
puts "Status: #{result[:status]}"
puts "Body: #{result[:body][0..200]}..." if result[:success]
The username parameter at ProxyHat supports several flags:
- user-country-US – US exit IP
- user-country-DE-city-berlin – Berlin exit IP
- user-session-abc123 – sticky session with a fixed IP
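These flags are plain segments joined into the proxy username with hyphens. A minimal helper for composing them might look like this (the helper name is illustrative; the flag grammar and order follow the examples above):

```ruby
# Compose a ProxyHat-style proxy username from optional targeting flags.
# The flag grammar (country, city, session segments joined with "-") is
# taken from the examples above; adjust if your provider differs.
def build_proxyhat_username(base, country: nil, city: nil, session: nil)
  parts = [base]
  parts << "country-#{country}" if country
  parts << "city-#{city}"       if city
  parts << "session-#{session}" if session
  parts.join('-')
end

puts build_proxyhat_username('user', country: 'DE', city: 'berlin')
# => user-country-DE-city-berlin
```

Because the flags live in the username, no extra headers or API calls are needed: the same gateway host and port serve every targeting combination.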
Retry Logic with Exponential Backoff
For production scrapers a retry mechanism is essential. Here is a robust implementation:
require 'net/http'
require 'uri'
module Scraper
class ProxyClient
MAX_RETRIES = 3
BASE_DELAY = 1.0
attr_reader :proxy_host, :proxy_port, :proxy_user, :proxy_pass
def initialize(proxy_host:, proxy_port:, proxy_user:, proxy_pass:)
@proxy_host = proxy_host
@proxy_port = proxy_port
@proxy_user = proxy_user
@proxy_pass = proxy_pass
end
def get(url, headers: {})
retries = 0
loop do
result = perform_request(url, headers)
return result if result[:success]
if should_retry?(result) && retries < MAX_RETRIES
retries += 1
delay = BASE_DELAY * (2 ** retries) + rand(0.5)
puts "Retry #{retries}/#{MAX_RETRIES} after #{delay.round(2)}s"
sleep(delay)
else
return result
end
end
end
private
def perform_request(url, headers)
uri = URI.parse(url)
proxy_class = Net::HTTP::Proxy(proxy_host, proxy_port, proxy_user, proxy_pass)
http = proxy_class.new(uri.host, uri.port)
http.use_ssl = (uri.scheme == 'https')
http.verify_mode = OpenSSL::SSL::VERIFY_PEER
http.open_timeout = 10
http.read_timeout = 25
request = Net::HTTP::Get.new(uri.request_uri)
headers.each { |k, v| request[k] = v }
response = http.request(request)
if response.is_a?(Net::HTTPSuccess)
{ success: true, status: response.code.to_i, body: response.body, headers: response.each_header.to_h }
else
{ success: false, status: response.code.to_i, error: 'http_error' }
end
rescue Net::OpenTimeout, Net::ReadTimeout, Errno::ECONNREFUSED => e
{ success: false, error: 'connection_error', message: e.message }
rescue => e
{ success: false, error: 'unknown', message: e.message }
ensure
http&.finish if http&.started?
end
def should_retry?(result)
%w[connection_error http_error].include?(result[:error]) ||
[429, 502, 503, 504].include?(result[:status])
end
end
end
# Usage
client = Scraper::ProxyClient.new(
proxy_host: 'gate.proxyhat.com',
proxy_port: 8080,
proxy_user: 'user-country-DE',
proxy_pass: 'your_password'
)
result = client.get('https://example.com/api/data', headers: {
'User-Agent' => 'MyApp/1.0',
'Accept' => 'application/json'
})
puts result.inspect
Typhoeus: Parallel Requests with Hydra
Typhoeus uses libcurl under the hood and enables parallel HTTP requests via Hydra. That makes it ideal for scraping scenarios with hundreds of concurrent requests.
require 'typhoeus'
require 'concurrent' # provides Concurrent::Hash for thread-safe result collection
# Proxy configuration
PROXY_CONFIG = {
proxy: 'http://gate.proxyhat.com:8080',
proxyuserpwd: 'your_username:your_password',
proxyauth: :basic
}
def fetch_single(url)
response = Typhoeus.get(url, {
**PROXY_CONFIG,
headers: {
'User-Agent' => 'Mozilla/5.0 (compatible; Typhoeus/1.0)',
'Accept' => 'text/html'
},
timeout: 30,
followlocation: true,
ssl_verifypeer: true,
ssl_verifyhost: 2
})
if response.success?
{ success: true, status: response.code, body: response.body }
elsif response.timed_out?
{ success: false, error: 'timeout' }
else
{ success: false, status: response.code, error: response.return_message }
end
end
# Parallel requests with Hydra
def fetch_parallel(urls, concurrency: 50)
results = Concurrent::Hash.new
hydra = Typhoeus::Hydra.new(max_concurrency: concurrency)
urls.each_with_index do |url, index|
request = Typhoeus::Request.new(url, {
**PROXY_CONFIG,
headers: { 'User-Agent' => 'Typhoeus/1.0' },
timeout: 25,
followlocation: true
})
request.on_complete do |response|
# on_complete fires for every response, so branch on success here instead
# of registering a separate on_failure callback that would overwrite this entry
results[url] = if response.success?
{ success: true, status: response.code, body: response.body, time: response.total_time }
else
{ success: false, status: response.code, error: response.return_message }
end
end
hydra.queue(request)
end
hydra.run
results
end
# Example: fetch 100 URLs in parallel
urls = (1..100).map { |i| "https://httpbin.org/delay/#{rand(1..3)}?id=#{i}" }
start_time = Time.now
results = fetch_parallel(urls, concurrency: 25)
duration = Time.now - start_time
successful = results.count { |_, r| r[:success] }
puts "Successful: #{successful}/#{urls.size} in #{duration.round(2)}s"
puts "Throughput: #{(urls.size / duration).round(2)} req/s"
Typhoeus with Rotating Sessions
For true IP rotation, every request must use a different session identifier. ProxyHat then assigns a fresh exit IP to each session:
require 'typhoeus'
require 'securerandom'
require 'concurrent' # Concurrent::Hash for thread-safe writes from callbacks
class RotatingProxyScraper
BASE_USER = 'your_username'
BASE_PASS = 'your_password'
def initialize(country: nil, city: nil)
@country = country
@city = city
end
def fetch_urls(urls, concurrency: 50)
results = Concurrent::Hash.new
hydra = Typhoeus::Hydra.new(max_concurrency: concurrency)
urls.each do |url|
session_id = SecureRandom.hex(8)
proxy_user = build_username(session_id)
request = Typhoeus::Request.new(url, {
proxy: 'http://gate.proxyhat.com:8080',
proxyuserpwd: "#{proxy_user}:#{BASE_PASS}",
proxyauth: :basic,
timeout: 30,
headers: {
'User-Agent' => random_user_agent,
'Accept' => 'text/html,application/xhtml+xml'
}
})
request.on_complete do |response|
results[url] = {
success: response.success?,
status: response.code,
ip: response.headers['X-Proxy-IP'],
session: session_id
}
end
hydra.queue(request)
end
hydra.run
results
end
private
def build_username(session_id)
parts = [BASE_USER]
parts << "country-#{@country}" if @country
parts << "city-#{@city}" if @city
parts << "session-#{session_id}"
parts.join('-')
end
def random_user_agent
agents = [
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15',
'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 Chrome/120.0.0.0'
]
agents.sample
end
end
# Scrape 1000 URLs with rotating IPs
scraper = RotatingProxyScraper.new(country: 'US')
urls = (1..1000).map { |i| "https://httpbin.org/ip?page=#{i}" }
start = Time.now
results = scraper.fetch_urls(urls, concurrency: 100)
puts "Finished in #{(Time.now - start).round(2)}s"
puts "Success rate: #{results.values.count { |r| r[:success] } * 100 / results.size}%"
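To verify that rotation actually kicks in, you can count the distinct exit IPs across the per-URL results collected above (each hash carries an :ip key). A minimal sketch, with sample hashes standing in for live responses:

```ruby
# Count distinct exit IPs across per-URL results, shaped like the hashes
# fetch_urls collects above. The data here is sample data, not live traffic.
results = {
  'https://httpbin.org/ip?page=1' => { success: true,  ip: '203.0.113.7' },
  'https://httpbin.org/ip?page=2' => { success: true,  ip: '198.51.100.4' },
  'https://httpbin.org/ip?page=3' => { success: true,  ip: '203.0.113.7' },
  'https://httpbin.org/ip?page=4' => { success: false, ip: nil }
}

# Keep only IPs from successful requests, then deduplicate
distinct_ips = results.values.filter_map { |r| r[:ip] if r[:success] }.uniq
puts "Distinct exit IPs: #{distinct_ips.size}"
# => Distinct exit IPs: 2
```

If the distinct-IP count stays near 1 across many sessions, the session flags are likely not reaching the gateway (for example because the username is malformed).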
ProxyHat Ruby SDK: Rotation and Geo-Targeting
The ProxyHat SDK simplifies configuration and ships with built-in rotation, retry logic, and metrics.
require 'net/http'
require 'json'
require 'securerandom' # used for session identifiers below
module ProxyHat
class Client
DEFAULT_OPTIONS = {
host: 'gate.proxyhat.com',
port: 8080,
timeout: 30,
max_retries: 3,
retry_delay: 1.0
}.freeze
attr_reader :username, :password, :options
def initialize(username:, password:, **options)
@username = username
@password = password
@options = DEFAULT_OPTIONS.merge(options)
end
# Single request with automatic IP rotation
def get(url, country: nil, city: nil, session: nil)
proxy_user = build_username(country: country, city: city, session: session)
perform_with_retry(url, proxy_user)
end
# Parallel requests via Hydra integration
def get_parallel(urls, country: nil, city: nil, concurrency: 50)
require 'typhoeus'
results = {}
hydra = Typhoeus::Hydra.new(max_concurrency: concurrency)
urls.each do |url|
session = SecureRandom.hex(8)
proxy_user = build_username(country: country, city: city, session: session)
request = Typhoeus::Request.new(url, {
proxy: "http://#{@options[:host]}:#{@options[:port]}",
proxyuserpwd: "#{proxy_user}:#{@password}",
timeout: @options[:timeout],
followlocation: true,
ssl_verifypeer: true
})
request.on_complete { |resp| results[url] = parse_response(resp) }
hydra.queue(request)
end
hydra.run
results
end
# Sticky session for multi-step workflows
def sticky_session(country: nil, city: nil)
session_id = SecureRandom.uuid
client = self.class.new(
username: build_username(country: country, city: city, session: session_id),
password: @password,
**@options
)
yield client, session_id
end
private
def build_username(country: nil, city: nil, session: nil)
parts = [username]
parts << "country-#{country}" if country
parts << "city-#{city}" if city
parts << "session-#{session}" if session
parts.join('-')
end
def perform_with_retry(url, proxy_user, attempt: 0)
uri = URI(url)
proxy = Net::HTTP::Proxy(@options[:host], @options[:port], proxy_user, @password)
http = proxy.new(uri.host, uri.port)
http.use_ssl = (uri.scheme == 'https')
http.open_timeout = @options[:timeout]
http.read_timeout = @options[:timeout]
response = http.request(Net::HTTP::Get.new(uri.request_uri))
if response.is_a?(Net::HTTPSuccess)
{ success: true, status: response.code.to_i, body: response.body }
else
handle_failure(url, proxy_user, attempt, response)
end
rescue Net::OpenTimeout, Net::ReadTimeout, Errno::ECONNREFUSED => e
handle_network_error(url, proxy_user, attempt, e)
ensure
http&.finish if http&.started?
end
def handle_failure(url, proxy_user, attempt, response)
if [429, 502, 503, 504].include?(response.code.to_i) && attempt < @options[:max_retries]
sleep(@options[:retry_delay] * (2 ** attempt))
perform_with_retry(url, proxy_user, attempt: attempt + 1)
else
{ success: false, status: response.code.to_i, error: 'http_error' }
end
end
def handle_network_error(url, proxy_user, attempt, error)
if attempt < @options[:max_retries]
sleep(@options[:retry_delay] * (2 ** attempt))
perform_with_retry(url, proxy_user, attempt: attempt + 1)
else
{ success: false, error: 'network_error', message: error.message }
end
end
def parse_response(response)
{
success: response.success?,
status: response.code,
body: response.body,
time: response.total_time
}
end
end
end
# Usage
client = ProxyHat::Client.new(
username: 'your_username',
password: 'your_password',
timeout: 25,
max_retries: 3
)
# Single request through a US exit IP
result = client.get('https://httpbin.org/ip', country: 'US')
puts result[:body]
# Parallel requests
urls = (1..500).map { |i| "https://api.example.com/items/#{i}" }
results = client.get_parallel(urls, country: 'DE', concurrency: 100)
# Sticky session for a login workflow
# (a `post` method analogous to `get` is assumed; the sketch above only defines `get`)
client.sticky_session(country: 'US') do |sticky, session_id|
sticky.get('https://example.com/login')
sticky.post('https://example.com/auth', body: { user: 'test', pass: 'secret' })
sticky.get('https://example.com/dashboard')
end
Production-Grade Scraping: 1000 URLs in Parallel
Here is a complete example of a production scraping scenario with fault tolerance, metrics, and circuit-breaker logic:
require 'typhoeus'
require 'json'
require 'logger'
require 'concurrent'
require 'securerandom'
class ProductionScraper
CIRCUIT_BREAKER_THRESHOLD = 5
CIRCUIT_BREAKER_TIMEOUT = 60
def initialize(username:, password:, country: 'US', concurrency: 100)
@username = username
@password = password
@country = country
@concurrency = concurrency
@logger = Logger.new(STDOUT)
@circuit_breaker = { failures: 0, open: false, opened_at: nil }
@metrics = { total: 0, success: 0, failed: 0, retries: 0 }
@mutex = Mutex.new
end
def scrape(urls)
check_circuit_breaker!
results = Concurrent::Hash.new
hydra = Typhoeus::Hydra.new(max_concurrency: @concurrency)
urls.each do |url|
break if circuit_breaker_open?
session = SecureRandom.hex(12)
proxy_user = "#{@username}-country-#{@country}-session-#{session}"
request = Typhoeus::Request.new(url, {
proxy: 'http://gate.proxyhat.com:8080',
proxyuserpwd: "#{proxy_user}:#{@password}",
timeout: 30,
followlocation: true,
ssl_verifypeer: false, # accepts self-signed certs; enable verification in production
headers: {
'User-Agent' => random_user_agent,
'Accept' => 'text/html,application/xhtml+xml,application/xml;q=0.9',
'Accept-Language' => 'en-US,en;q=0.9',
'Accept-Encoding' => 'gzip, deflate'
}
})
request.on_complete do |response|
@mutex.synchronize do
@metrics[:total] += 1
if response.success?
@metrics[:success] += 1
results[url] = {
success: true,
status: response.code,
body: response.body,
size: response.body.bytesize
}
reset_circuit_breaker_on_success
elsif response.timed_out?
handle_failure(url, results, 'timeout')
else
handle_failure(url, results, response.return_message, response.code)
end
end
end
hydra.queue(request)
end
hydra.run
{
results: results,
metrics: @metrics,
success_rate: (@metrics[:success].to_f / @metrics[:total] * 100).round(2)
}
end
private
def handle_failure(url, results, error, status = nil)
@metrics[:failed] += 1
record_circuit_breaker_failure
results[url] = {
success: false,
error: error,
status: status
}
@logger.warn("Failed: #{url} - #{error}")
end
def check_circuit_breaker!
if @circuit_breaker[:open]
elapsed = Time.now - @circuit_breaker[:opened_at]
if elapsed > CIRCUIT_BREAKER_TIMEOUT
@circuit_breaker[:open] = false
@circuit_breaker[:failures] = 0
@logger.info("Circuit breaker reset after #{elapsed.round(1)}s")
else
raise "Circuit breaker open - waiting #{(CIRCUIT_BREAKER_TIMEOUT - elapsed).round(1)}s"
end
end
end
def circuit_breaker_open?
@circuit_breaker[:open]
end
def record_circuit_breaker_failure
@circuit_breaker[:failures] += 1
if @circuit_breaker[:failures] >= CIRCUIT_BREAKER_THRESHOLD
@circuit_breaker[:open] = true
@circuit_breaker[:opened_at] = Time.now
@logger.error("Circuit breaker opened after #{@circuit_breaker[:failures]} failures")
end
end
def reset_circuit_breaker_on_success
@circuit_breaker[:failures] = 0
end
def random_user_agent
[
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Safari/605.1.15',
'Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0'
].sample
end
end
# Run
scraper = ProductionScraper.new(
username: 'your_username',
password: 'your_password',
country: 'US',
concurrency: 100
)
urls = (1..1000).map { |i| "https://httpbin.org/delay/#{rand(1..2)}?id=#{i}" }
begin
start = Time.now
report = scraper.scrape(urls)
duration = Time.now - start
puts "\n=== Scraping Report ==="
puts "Duration: #{duration.round(2)}s"
puts "Throughput: #{(report[:metrics][:total] / duration).round(2)} req/s"
puts "Success rate: #{report[:success_rate]}%"
puts "Successful: #{report[:metrics][:success]}/#{report[:metrics][:total]}"
rescue => e
puts "Scraping aborted: #{e.message}"
end
TLS/SSL Configuration: Self-Signed Certs and SNI
HTTPS through proxies has a few pitfalls, particularly around self-signed certificates and making sure Server Name Indication (SNI) is set correctly.
require 'net/http'
require 'openssl'
class TLSProxyClient
def initialize(proxy_host:, proxy_port:, proxy_user:, proxy_pass:)
@proxy_host = proxy_host
@proxy_port = proxy_port
@proxy_user = proxy_user
@proxy_pass = proxy_pass
end
# Option 1: Strict TLS verification (production)
def fetch_strict(url)
uri = URI(url)
proxy = Net::HTTP::Proxy(@proxy_host, @proxy_port, @proxy_user, @proxy_pass)
http = proxy.new(uri.host, uri.port)
http.use_ssl = true
http.verify_mode = OpenSSL::SSL::VERIFY_PEER
http.cert_store = trusted_cert_store
# Net::HTTP sets SNI from uri.host automatically; with VERIFY_PEER it
# also performs a hostname check after the handshake
request = Net::HTTP::Get.new(uri.request_uri)
http.request(request).body
ensure
http&.finish
end
# Option 2: Accept self-signed certs (dev/staging only!)
def fetch_insecure(url)
uri = URI(url)
proxy = Net::HTTP::Proxy(@proxy_host, @proxy_port, @proxy_user, @proxy_pass)
http = proxy.new(uri.host, uri.port)
http.use_ssl = true
http.verify_mode = OpenSSL::SSL::VERIFY_NONE
http.min_version = OpenSSL::SSL::TLS1_2_VERSION # ssl_version= is deprecated
request = Net::HTTP::Get.new(uri.request_uri)
http.request(request).body
ensure
http&.finish
end
# Option 3: With a custom CA bundle
def fetch_with_ca_bundle(url, ca_bundle_path)
uri = URI(url)
proxy = Net::HTTP::Proxy(@proxy_host, @proxy_port, @proxy_user, @proxy_pass)
http = proxy.new(uri.host, uri.port)
http.use_ssl = true
http.verify_mode = OpenSSL::SSL::VERIFY_PEER
http.ca_file = ca_bundle_path
http.verify_depth = 5
request = Net::HTTP::Get.new(uri.request_uri)
http.request(request).body
ensure
http&.finish
end
# Typhoeus with TLS options
def fetch_typhoeus_tls(url, verify: true)
require 'typhoeus'
Typhoeus.get(url, {
proxy: "http://#{@proxy_host}:#{@proxy_port}",
proxyuserpwd: "#{@proxy_user}:#{@proxy_pass}",
ssl_verifypeer: verify,
ssl_verifyhost: verify ? 2 : 0,
sslversion: :tlsv1_2,
capath: '/etc/ssl/certs',
timeout: 30
})
end
private
def trusted_cert_store
store = OpenSSL::X509::Store.new
store.set_default_paths
# Additional CA certificates
store.add_file('/etc/ssl/certs/ca-certificates.crt') if File.exist?('/etc/ssl/certs/ca-certificates.crt')
store.add_file('/etc/ssl/cert.pem') if File.exist?('/etc/ssl/cert.pem')
store
end
end
# Usage
client = TLSProxyClient.new(
proxy_host: 'gate.proxyhat.com',
proxy_port: 8080,
proxy_user: 'user-country-US',
proxy_pass: 'your_password'
)
# Strict verification
body = client.fetch_strict('https://example.com')
# For self-signed development servers
body = client.fetch_insecure('https://internal-dev.local/api')
Rails Integration: Faraday Middleware and ActiveJob
In Rails applications, Faraday is the de facto standard HTTP client. Here is a complete integration with proxy middleware and background jobs.
# config/initializers/proxy.rb
require 'faraday'
module ProxyHat
class FaradayMiddleware < Faraday::Middleware
def initialize(app, username:, password:, country: nil)
super(app)
@username = username
@password = password
@country = country
end
def call(env)
session = SecureRandom.hex(8)
proxy_user = build_username(session)
# set the proxy on the request options so the adapter picks it up
env.request.proxy = Faraday::ProxyOptions.from(
uri: URI('http://gate.proxyhat.com:8080'),
user: proxy_user,
password: @password
)
@app.call(env)
end
private
def build_username(session)
parts = [@username]
parts << "country-#{@country}" if @country
parts << "session-#{session}"
parts.join('-')
end
end
end
# Faraday connection with proxy middleware
class ApiClient
def initialize(country: nil)
@country = country
end
def connection
@connection ||= Faraday.new do |builder|
builder.use ProxyHat::FaradayMiddleware,
username: Rails.application.config.proxy_username,
password: Rails.application.config.proxy_password,
country: @country
builder.request :retry, { # on Faraday 2+ this needs the faraday-retry gem
max: 3,
interval: 1.0,
backoff_factor: 2,
retry_statuses: [429, 502, 503, 504]
}
builder.response :json, content_type: /json$/
builder.response :raise_error
builder.adapter :typhoeus do |adapter|
adapter.options = {
timeout: 30,
followlocation: true,
ssl_verifypeer: true
}
end
end
end
def get(path)
connection.get(path)
end
def post(path, body)
connection.post(path, body.to_json, 'Content-Type' => 'application/json')
end
end
# ActiveJob for background scraping
class ScrapeJob < ApplicationJob
queue_as :scraping
retry_on Net::OpenTimeout, wait: :exponentially_longer, attempts: 3
retry_on Net::ReadTimeout, wait: :exponentially_longer, attempts: 3
def perform(urls, country: 'US')
client = ApiClient.new(country: country)
results = urls.map do |url|
begin
response = client.get(url)
{ url: url, success: true, data: response.body }
rescue Faraday::Error => e
{ url: url, success: false, error: e.message }
end
end
# Persist the results
ScrapeResult.import(results.select { |r| r[:success] })
# Notify on failures
if results.any? { |r| !r[:success] }
ScrapeFailureMailer.with(failures: results.reject { |r| r[:success] }).alert.deliver_later
end
end
end
# Batch processing with jobs
class BatchScrapeJob < ApplicationJob
queue_as :scraping
def perform(url_batch, country: 'US')
# Split the URLs into smaller batches
url_batch.each_slice(50) do |slice|
ScrapeJob.perform_later(slice, country: country)
end
end
end
# Controller example
class ScraperController < ApplicationController
def start_scrape
urls = params[:urls]
country = params[:country] || 'US'
# Split into batches
urls.each_slice(100) do |batch|
BatchScrapeJob.perform_later(batch, country: country)
end
render json: { status: 'queued', batches: (urls.size / 100.0).ceil }
end
end
# config/application.rb
module MyApp
class Application < Rails::Application
config.proxy_username = ENV['PROXYHAT_USERNAME']
config.proxy_password = ENV['PROXYHAT_PASSWORD']
end
end
Comparing the Proxy Approaches in Ruby
| Approach | Pros | Cons | Use case |
|---|---|---|---|
| Net::HTTP | Stdlib, no dependencies, simple | No parallel requests, limited feature set | Simple API calls, prototyping |
| Typhoeus | Parallel requests via Hydra, libcurl features | Extra gem, libcurl dependency | High-volume scraping, parallel data fetching |
| ProxyHat SDK | Built-in rotation, geo-targeting, retry logic | Vendor-specific | Production scraping pipelines with IP rotation |
| Faraday | Middleware system, Rails integration | Overhead, needs configuration | Rails applications, API wrappers |
Key Takeaways
1. Net::HTTP covers simple needs – The standard library handles basic proxy requirements, including authentication. For production use, however, retry logic is essential.
2. Typhoeus for parallel requests – With hundreds of concurrent requests, Typhoeus with Hydra is indispensable. Its libcurl foundation also offers better TLS options.
3. IP rotation via username flags – ProxyHat enables rotation and geo-targeting through special username formats such as user-country-US-session-abc123.
4. Circuit breakers prevent cascading failures – Production scrapers need a circuit breaker to back off the proxies when errors pile up.
5. Rails integration via Faraday – Faraday middleware gives Rails a clean proxy integration, and ActiveJob handles background scraping.
Further Reading
- Residential vs. Datacenter Proxies: The Complete Comparison
- Web Scraping Use Cases with ProxyHat
- ProxyHat Pricing and Plans
- Available Proxy Locations
For production scraping pipelines in Ruby, rotating residential proxies from ProxyHat are the most reliable choice. Combining Typhoeus for parallel requests with the ProxyHat SDK for rotation offers the best balance of performance and stability.






