For Ruby developers building scraping systems and data pipelines, an HTTP proxy is an essential tool. Whether you're pulling SERP results, monitoring e-commerce prices, or training AI models on web data, you need to know how to configure a proxy at every level of the stack.
This guide shows production code, not theory. We start with Net::HTTP from the standard library, move on to Typhoeus with parallel requests via Hydra, and finish with the ProxyHat SDK with automatic IP rotation and geo-targeting.
Net::HTTP with a proxy: basics and authentication
Net::HTTP is Ruby's standard-library HTTP client. It's available with no extra gems, but it requires manual proxy configuration. Here is a complete example with error handling:
require 'net/http'
require 'uri'
class ProxyHTTPClient
PROXY_HOST = 'gate.proxyhat.com'
PROXY_PORT = 8080
PROXY_USER = 'your_username'
PROXY_PASS = 'your_password'
def initialize(proxy_user: nil, proxy_pass: nil)
@proxy_user = proxy_user || PROXY_USER
@proxy_pass = proxy_pass || PROXY_PASS
end
def get(url, timeout: 30)
uri = URI.parse(url)
http = Net::HTTP.new(
uri.host,
uri.port,
PROXY_HOST,
PROXY_PORT,
@proxy_user,
@proxy_pass
)
http.use_ssl = (uri.scheme == 'https')
http.open_timeout = timeout
http.read_timeout = timeout
http.verify_mode = OpenSSL::SSL::VERIFY_PEER
request = Net::HTTP::Get.new(uri.request_uri)
request['User-Agent'] = 'ProxyHat-Ruby-Client/1.0'
request['Accept'] = 'text/html,application/xhtml+xml'
response = http.request(request)
case response
when Net::HTTPSuccess
{ status: response.code.to_i, body: response.body, headers: response.each_header.to_h }
when Net::HTTPRedirection
{ status: response.code.to_i, location: response['Location'], body: nil }
else
{ status: response.code.to_i, error: response.message }
end
rescue Net::OpenTimeout => e
{ status: 0, error: "Connection timeout: #{e.message}" }
rescue Net::ReadTimeout => e
{ status: 0, error: "Read timeout: #{e.message}" }
rescue SocketError => e
{ status: 0, error: "DNS/Socket error: #{e.message}" }
rescue OpenSSL::SSL::SSLError => e
{ status: 0, error: "SSL error: #{e.message}" }
rescue StandardError => e
{ status: 0, error: "Unexpected error: #{e.message}" }
ensure
http&.finish if http&.started?
end
end
# Usage
client = ProxyHTTPClient.new(proxy_user: 'user-country-US', proxy_pass: 'pass')
result = client.get('https://httpbin.org/ip')
puts result.inspect
You pass ProxyHat's geo-targeting parameters in the proxy username:
# US proxy
client_us = ProxyHTTPClient.new(proxy_user: 'user-country-US', proxy_pass: 'pass')
# Proxy for Berlin, Germany
client_de = ProxyHTTPClient.new(
proxy_user: 'user-country-DE-city-berlin',
proxy_pass: 'pass'
)
# Sticky session (same IP for 10 minutes)
client_sticky = ProxyHTTPClient.new(
proxy_user: 'user-session-abc123-duration-10',
proxy_pass: 'pass'
)
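To confirm geo-targeting, you can ask a public IP-geolocation endpoint which country the exit IP resolves to. A minimal sketch using the ip-api.com JSON endpoint (an assumption here: the free tier is HTTP-only and rate-limited, so treat this as a spot check, not production code):
require 'json'
client_us = ProxyHTTPClient.new(proxy_user: 'user-country-US', proxy_pass: 'pass')
result = client_us.get('http://ip-api.com/json')
if result[:status] == 200
  geo = JSON.parse(result[:body])
  puts "Exit IP #{geo['query']} resolves to #{geo['countryCode']}"
end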
IP rotation on every request
By default, ProxyHat rotates the IP on every new request. If you need a sticky session instead, add a session flag to the username:
require 'securerandom'
class RotatingProxyClient
def initialize(base_user:, password:)
@base_user = base_user
@password = password
end
def fetch_with_rotation(urls)
urls.map do |url|
session_id = SecureRandom.hex(8)
proxy_user = "#{@base_user}-session-#{session_id}"
client = ProxyHTTPClient.new(proxy_user: proxy_user, proxy_pass: @password)
client.get(url)
end
end
end
# Each request = a new IP
rotator = RotatingProxyClient.new(base_user: 'user-country-US', password: 'pass')
results = rotator.fetch_with_rotation([
'https://httpbin.org/ip',
'https://httpbin.org/user-agent',
'https://httpbin.org/headers'
])
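A quick way to verify the rotation is to hit the same endpoint several times and count distinct exit IPs; httpbin.org/ip returns the origin address as JSON:
require 'json'
ip_results = rotator.fetch_with_rotation(Array.new(3) { 'https://httpbin.org/ip' })
ips = ip_results.filter_map { |r| JSON.parse(r[:body])['origin'] if r[:status] == 200 }
puts "Unique IPs: #{ips.uniq.size}/#{ips.size}"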
Typhoeus: parallel requests with libcurl
Typhoeus is a wrapper around libcurl with an idiomatic Ruby interface. Its biggest advantage is Hydra, a mechanism for executing HTTP requests in parallel.
Add it to your Gemfile:
gem 'typhoeus'
A basic request through the proxy:
require 'typhoeus'
class TyphoeusProxyClient
PROXY_URL = 'http://user-country-US:pass@gate.proxyhat.com:8080'
def initialize(proxy_url: nil)
@proxy_url = proxy_url || PROXY_URL
end
def get(url, follow_location: true, timeout: 30)
response = Typhoeus.get(
url,
proxy: @proxy_url,
followlocation: follow_location,
timeout: timeout,
connecttimeout: 10,
ssl_verifypeer: true,
ssl_verifyhost: 2,
headers: {
'User-Agent' => 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
'Accept' => 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'
}
)
if response.success?
{ status: response.code, body: response.body, headers: response.headers }
elsif response.timed_out?
{ status: 0, error: 'Request timed out' }
elsif response.code == 0
{ status: 0, error: response.return_message }
else
{ status: response.code, error: "HTTP error: #{response.code}" }
end
end
end
client = TyphoeusProxyClient.new(
proxy_url: 'http://user-country-DE:pass@gate.proxyhat.com:8080'
)
result = client.get('https://httpbin.org/ip')
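Typhoeus handles POST the same way; only the body and Content-Type change. A minimal sketch against httpbin's echo endpoint (the payload is illustrative):
require 'typhoeus'
require 'json'
response = Typhoeus.post(
  'https://httpbin.org/post',
  proxy: 'http://user-country-DE:pass@gate.proxyhat.com:8080',
  body: JSON.generate(query: 'ruby proxy'),
  headers: { 'Content-Type' => 'application/json' },
  timeout: 30
)
puts response.code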
Hydra: parallel requests
Hydra lets you run hundreds of requests at once, which is key for large-scale scraping:
require 'typhoeus'
require 'concurrent'
require 'securerandom'
class ParallelScraper
PROXY_BASE = 'http://user-country-US:pass@gate.proxyhat.com:8080'
MAX_CONCURRENT = 50
def initialize(urls, proxy_base: nil, max_concurrent: nil)
@urls = urls
@proxy_base = proxy_base || PROXY_BASE
@max_concurrent = max_concurrent || MAX_CONCURRENT
end
def scrape_all
results = Concurrent::Hash.new
hydra = Typhoeus::Hydra.new(max_concurrency: @max_concurrent)
@urls.each_with_index do |url, idx|
request = build_request(url, idx)
request.on_complete do |response|
results[url] = process_response(response)
end
hydra.queue(request)
end
hydra.run
results
end
private
def build_request(url, idx)
session_id = "sess-#{idx}-#{SecureRandom.hex(4)}"
proxy_url = @proxy_base.sub('user-', "user-session-#{session_id}-")
Typhoeus::Request.new(
url,
method: :get,
proxy: proxy_url,
timeout: 30,
connecttimeout: 10,
followlocation: true,
ssl_verifypeer: true,
headers: {
'User-Agent' => random_user_agent,
'Accept' => 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Language' => 'en-US,en;q=0.9',
'Accept-Encoding' => 'gzip, deflate, br'
}
)
end
def process_response(response)
if response.success?
{ status: response.code, body: response.body, success: true }
elsif response.timed_out?
{ status: 0, error: 'timeout', success: false }
else
{ status: response.code, error: response.return_message, success: false }
end
end
def random_user_agent
[
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Safari/605.1.15',
'Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0'
].sample
end
end
# Example: 100 URLs in parallel
urls = (1..100).map { |i| "https://httpbin.org/delay/#{rand(1..3)}?id=#{i}" }
scraper = ParallelScraper.new(urls)
results = scraper.scrape_all
successful = results.count { |_, r| r[:success] }
puts "Pobrano #{successful}/#{urls.size} URL-i"
ProxyHat Ruby SDK: rotation and geo-targeting
ProxyHat ships an SDK that simplifies configuration. Here is a production wrapper with retry logic and a circuit breaker:
require 'net/http'
require 'uri'
require 'json'
module ProxyHat
class Client
GATEWAY_HOST = 'gate.proxyhat.com'
HTTP_PORT = 8080
SOCKS5_PORT = 1080
DEFAULT_OPTIONS = {
timeout: 30,
max_retries: 3,
retry_delay: 1.0,
country: nil,
city: nil,
session: nil,
session_duration: nil
}.freeze
attr_reader :username, :password, :options
def initialize(username:, password:, **options)
@username = username
@password = password
@options = DEFAULT_OPTIONS.merge(options)
@circuit_breaker = CircuitBreaker.new(
failure_threshold: 5,
recovery_timeout: 60
)
end
def get(url, **override_options)
opts = @options.merge(override_options)
execute_with_retry(url, opts)
end
def post(url, body:, content_type: 'application/json', **override_options)
opts = @options.merge(override_options)
execute_with_retry(url, opts, method: :post, body: body, content_type: content_type)
end
def build_proxy_username(**opts)
parts = [username]
if opts[:country]
parts << "country-#{opts[:country]}"
parts << "city-#{opts[:city]}" if opts[:city]
end
if opts[:session]
parts << "session-#{opts[:session]}"
parts << "duration-#{opts[:session_duration]}" if opts[:session_duration]
end
parts.join('-')
end
private
def execute_with_retry(url, opts, method: :get, body: nil, content_type: nil)
retries = 0
last_error = nil
loop do
return @circuit_breaker.execute do
make_request(url, opts, method, body, content_type)
end
rescue CircuitBreaker::OpenCircuitError => e
raise e
rescue StandardError => e
last_error = e
retries += 1
if retries <= opts[:max_retries]
sleep(opts[:retry_delay] * retries)
else
raise ProxyError.new("Max retries exceeded: #{e.message}")
end
end
end
def make_request(url, opts, method, body, content_type)
uri = URI.parse(url)
proxy_user = build_proxy_username(**opts)
http = Net::HTTP.new(
uri.host,
uri.port,
GATEWAY_HOST,
HTTP_PORT,
proxy_user,
password
)
configure_ssl(http, uri)
http.open_timeout = opts[:timeout]
http.read_timeout = opts[:timeout]
request = build_request(uri, method, body, content_type)
response = http.request(request)
handle_response(response)
ensure
http&.finish if http&.started?
end
def configure_ssl(http, uri)
if uri.scheme == 'https'
http.use_ssl = true
http.verify_mode = OpenSSL::SSL::VERIFY_PEER
http.min_version = OpenSSL::SSL::TLS1_2_VERSION
end
end
def build_request(uri, method, body, content_type)
request_class = method == :post ? Net::HTTP::Post : Net::HTTP::Get
request = request_class.new(uri.request_uri)
request['User-Agent'] = 'ProxyHat-Ruby-SDK/2.0'
request['Accept'] = '*/*'
if body
request['Content-Type'] = content_type
request.body = content_type == 'application/json' ? JSON.generate(body) : body
end
request
end
def handle_response(response)
case response
when Net::HTTPSuccess
Response.new(
status: response.code.to_i,
body: response.body,
headers: response.each_header.to_h
)
when Net::HTTPRedirection
Response.new(
status: response.code.to_i,
body: nil,
headers: response.each_header.to_h,
redirect_to: response['Location']
)
else
raise HTTPError.new(response.code.to_i, response.message)
end
end
end
class Response
attr_reader :status, :body, :headers, :redirect_to
def initialize(status:, body:, headers:, redirect_to: nil)
@status = status
@body = body
@headers = headers
@redirect_to = redirect_to
end
def json
JSON.parse(@body)
rescue JSON::ParserError
nil
end
def success?
(200..299).cover?(@status)
end
end
class ProxyError < StandardError; end
class HTTPError < StandardError
attr_reader :status_code
def initialize(status_code, message)
@status_code = status_code
super("HTTP #{status_code}: #{message}")
end
end
class CircuitBreaker
OpenCircuitError = Class.new(StandardError)
def initialize(failure_threshold:, recovery_timeout:)
@failure_threshold = failure_threshold
@recovery_timeout = recovery_timeout
@failures = 0
@last_failure_time = nil
@mutex = Mutex.new
end
def execute
if open?
raise OpenCircuitError.new('Circuit breaker is open')
end
result = yield
reset
result
rescue StandardError => e
record_failure
raise e
end
private
def open?
@mutex.synchronize do
return false if @failures < @failure_threshold
return false if @last_failure_time.nil?
elapsed = Time.now - @last_failure_time
if elapsed > @recovery_timeout
@failures = 0
return false
end
true
end
end
def record_failure
@mutex.synchronize do
@failures += 1
@last_failure_time = Time.now
end
end
def reset
@mutex.synchronize do
@failures = 0
@last_failure_time = nil
end
end
end
end
# Usage
client = ProxyHat::Client.new(
username: 'user',
password: 'pass',
country: 'US',
max_retries: 3
)
# GET with automatic IP rotation
response = client.get('https://httpbin.org/ip')
puts response.json
# Sticky session for multiple requests
session_client = ProxyHat::Client.new(
username: 'user',
password: 'pass',
country: 'DE',
city: 'berlin',
session: 'my-session-123',
session_duration: 10
)
response1 = session_client.get('https://example.com/page1')
response2 = session_client.get('https://example.com/page2')
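The SDK also defines SOCKS5_PORT, but Net::HTTP has no native SOCKS support, so the client above only uses the HTTP gateway. If you need SOCKS5, go through libcurl via Typhoeus, which accepts a socks5h:// proxy URL (a sketch, assuming the same username format applies on port 1080):
require 'typhoeus'
response = Typhoeus.get(
  'https://httpbin.org/ip',
  proxy: 'socks5h://user-country-US:pass@gate.proxyhat.com:1080',
  timeout: 30
)
puts response.body if response.success?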
A production scraper: 1000 URLs in parallel
Here is a complete production scraper with parallel fetching, IP rotation, and error handling:
require 'typhoeus'
require 'concurrent'
require 'json'
require 'securerandom'
class ProductionScraper
attr_reader :stats
def initialize(username:, password:, concurrency: 100)
@username = username
@password = password
@concurrency = concurrency
@stats = Concurrent::Hash.new(0)
@mutex = Mutex.new
end
def scrape(urls, country: 'US', output_file: nil)
results = Concurrent::Hash.new
hydra = Typhoeus::Hydra.new(max_concurrency: @concurrency)
urls.each_with_index do |url, idx|
request = build_request(url, idx, country)
request.on_complete do |response|
handle_response(response, url, results)
end
hydra.queue(request)
end
start_time = Time.now
hydra.run
elapsed = Time.now - start_time
print_stats(elapsed, urls.size)
if output_file
save_results(results, output_file)
end
results
end
private
def build_request(url, idx, country)
session_id = "scraper-#{idx}-#{SecureRandom.uuid}"
proxy_url = "http://#{@username}-country-#{country}-session-#{session_id}:#{@password}@gate.proxyhat.com:8080"
Typhoeus::Request.new(
url,
method: :get,
proxy: proxy_url,
timeout: 45,
connecttimeout: 15,
followlocation: true,
maxredirs: 5,
ssl_verifypeer: false, # sometimes needed for broken upstreams; prefer true in production
ssl_verifyhost: 0,
headers: {
'User-Agent' => random_user_agent,
'Accept' => 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
'Accept-Language' => 'en-US,en;q=0.9',
'Accept-Encoding' => 'gzip, deflate',
'Cache-Control' => 'no-cache',
'Pragma' => 'no-cache'
}
)
end
def handle_response(response, url, results)
@mutex.synchronize do
if response.success?
results[url] = {
status: response.code,
body: response.body,
size: response.body.bytesize,
time: response.total_time
}
@stats[:success] += 1
elsif response.timed_out?
results[url] = { status: 0, error: 'timeout' }
@stats[:timeout] += 1
elsif response.code == 403 || response.code == 429
results[url] = { status: response.code, error: 'blocked/rate-limited' }
@stats[:blocked] += 1
else
results[url] = { status: response.code, error: response.return_message }
@stats[:failed] += 1
end
end
end
def print_stats(elapsed, total)
puts "\n=== Scraping Results ==="
puts "Total URLs: #{total}"
puts "Success: #{@stats[:success]}"
puts "Timeout: #{@stats[:timeout]}"
puts "Blocked: #{@stats[:blocked]}"
puts "Failed: #{@stats[:failed]}"
puts "Time: #{elapsed.round(2)}s"
puts "Rate: #{(@stats[:success] / elapsed).round(2)} req/s"
puts "========================\n"
end
def save_results(results, filename)
File.write(filename, JSON.pretty_generate(results.to_h))
puts "Results saved to #{filename}"
end
def random_user_agent
[
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Safari/605.1.15',
'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:121.0) Gecko/20100101 Firefox/121.0',
'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
'Mozilla/5.0 (iPhone; CPU iPhone OS 17_0 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Mobile/15E148 Safari/604.1'
].sample
end
end
# Run it
scraper = ProductionScraper.new(
username: 'your_username',
password: 'your_password',
concurrency: 100
)
# Generate 1000 URLs
urls = (1..1000).map do |i|
"https://httpbin.org/delay/#{rand(1..2)}?id=#{i}"
end
results = scraper.scrape(urls, country: 'US', output_file: 'results.json')
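Hydra caps concurrency, but every response body from a run stays in memory until scrape returns. For much larger lists, slice the input and write one output file per batch (a sketch with an illustrative URL pattern):
big_list = (1..10_000).map { |i| "https://example.com/item/#{i}" }
big_list.each_slice(1_000).with_index do |batch, idx|
  scraper.scrape(batch, country: 'US', output_file: "results_#{idx}.json")
end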
TLS/SSL: certificates, SNI, and self-signed upstreams
While scraping you may run into servers with self-signed certificates or other SSL errors. Here is how to handle them:
require 'net/http'
require 'openssl'
class SSLAwareProxyClient
def initialize(
proxy_host: 'gate.proxyhat.com',
proxy_port: 8080,
proxy_user: 'user-country-US',
proxy_pass: 'pass',
verify_ssl: true,
ca_file: nil
)
@proxy_host = proxy_host
@proxy_port = proxy_port
@proxy_user = proxy_user
@proxy_pass = proxy_pass
@verify_ssl = verify_ssl
@ca_file = ca_file
end
def get(url, timeout: 30)
uri = URI.parse(url)
http = Net::HTTP.new(
uri.host,
uri.port,
@proxy_host,
@proxy_port,
@proxy_user,
@proxy_pass
)
if uri.scheme == 'https'
configure_ssl(http)
end
http.open_timeout = timeout
http.read_timeout = timeout
request = Net::HTTP::Get.new(uri.request_uri)
response = http.request(request)
{ status: response.code.to_i, body: response.body }
rescue OpenSSL::SSL::SSLError => e
handle_ssl_error(e)
ensure
http&.finish if http&.started?
end
private
def configure_ssl(http)
http.use_ssl = true
if @verify_ssl
http.verify_mode = OpenSSL::SSL::VERIFY_PEER
http.min_version = OpenSSL::SSL::TLS1_2_VERSION
# Custom CA (e.g. for a corporate proxy)
http.ca_file = @ca_file if @ca_file
# SNI (Server Name Indication) is sent automatically by Net::HTTP for HTTPS
else
# ONLY for development or trusted internal servers
http.verify_mode = OpenSSL::SSL::VERIFY_NONE
puts "WARNING: SSL verification disabled!"
end
end
def handle_ssl_error(error)
case error.message
when /certificate verify failed/
{ status: 0, error: 'SSL certificate verification failed', ssl_error: true }
when /hostname does not match/
{ status: 0, error: 'SSL hostname mismatch', ssl_error: true }
when /connection reset/
{ status: 0, error: 'SSL connection reset', ssl_error: true }
else
{ status: 0, error: "SSL error: #{error.message}", ssl_error: true }
end
end
end
# Secure mode (production)
secure_client = SSLAwareProxyClient.new(verify_ssl: true)
# Permissive mode (development/internal)
permissive_client = SSLAwareProxyClient.new(
proxy_user: 'user-country-DE',
verify_ssl: false
)
# With a custom CA file
corporate_client = SSLAwareProxyClient.new(
verify_ssl: true,
ca_file: '/etc/ssl/certs/corporate-ca.pem'
)
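When verification fails, it helps to see what certificate the server actually presents. A small diagnostic sketch (it connects directly, with verification disabled for inspection only, so never reuse this mode for real traffic):
require 'net/http'
require 'openssl'
uri = URI.parse('https://example.com')
http = Net::HTTP.new(uri.host, uri.port)
http.use_ssl = true
http.verify_mode = OpenSSL::SSL::VERIFY_NONE # inspection only
http.start do |conn|
  cert = conn.peer_cert
  puts "Subject: #{cert.subject}"
  puts "Issuer:  #{cert.issuer}"
  puts "Expires: #{cert.not_after}"
end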
Typhoeus with SSL configuration
require 'typhoeus'
class TyphoeusSSLClient
def get_with_ssl(url, verify: true, ca_path: nil)
options = {
proxy: 'http://user-country-US:pass@gate.proxyhat.com:8080',
timeout: 30,
ssl_verifypeer: verify,
ssl_verifyhost: verify ? 2 : 0
}
options[:ssl_capath] = ca_path if ca_path
# SNI is enabled by default in libcurl
# You can pin the TLS version with sslversion: :tlsv1_2
Typhoeus.get(url, **options)
end
end
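Example usage of the client above:
ssl_client = TyphoeusSSLClient.new
response = ssl_client.get_with_ssl('https://httpbin.org/ip')
puts response.code
# Permissive mode for a self-signed internal host (development only)
ssl_client.get_with_ssl('https://self-signed.internal.example', verify: false)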
Ruby on Rails integration
Faraday middleware with a proxy
Faraday is a popular HTTP client in the Rails ecosystem. Here is a middleware that wires ProxyHat in:
# config/initializers/proxyhat.rb
require 'faraday'
module ProxyHat
class FaradayMiddleware < Faraday::Middleware
PROXY_HOST = 'gate.proxyhat.com'
PROXY_PORT = 8080
def initialize(app, username:, password:, country: nil, city: nil)
super(app)
@username = username
@password = password
@country = country
@city = city
end
def call(env)
proxy_user = build_proxy_user
env[:proxy] = {
uri: "http://#{PROXY_HOST}:#{PROXY_PORT}",
user: proxy_user,
password: @password
}
@app.call(env)
end
private
def build_proxy_user
parts = [@username]
parts << "country-#{@country}" if @country
parts << "city-#{@city}" if @city
parts.join('-')
end
end
end
# Faraday configuration
module ApiClients
class Base
def self.connection(country: nil)
Faraday.new do |builder|
builder.use ProxyHat::FaradayMiddleware,
username: Rails.application.credentials.proxyhat[:username],
password: Rails.application.credentials.proxyhat[:password],
country: country
builder.request :retry,
max: 3,
interval: 1.0,
backoff_factor: 2,
exceptions: [Faraday::TimeoutError, Faraday::ConnectionFailed]
builder.response :json, content_type: /json\b/
builder.response :raise_error
builder.adapter :typhoeus
end
end
end
class ScraperClient < Base
def self.fetch(url, country: 'US')
connection(country: country).get(url).body
rescue Faraday::Error => e
Rails.logger.error("Scraping failed: #{e.message}")
nil
end
end
end
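Calling the client, e.g. from a service object or the Rails console:
body = ApiClients::ScraperClient.fetch('https://httpbin.org/ip', country: 'DE')
Rails.logger.info(body.inspect)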
ActiveJob: background scraping
# app/jobs/scraping_job.rb
class ScrapingJob < ApplicationJob
queue_as :scraping
retry_on ScrapingError, wait: :exponentially_longer, attempts: 3
discard_on ActiveJob::DeserializationError
def perform(urls, options = {})
country = options.fetch(:country, 'US')
output_file = options[:output_file]
scraper = ProductionScraper.new(
username: Rails.application.credentials.proxyhat[:username],
password: Rails.application.credentials.proxyhat[:password],
concurrency: options.fetch(:concurrency, 50)
)
results = scraper.scrape(urls, country: country)
if output_file
save_results(results, output_file)
end
# Completion notification
ScrapingCompletionNotifier.call(results: results, job_id: job_id)
results
end
private
def save_results(results, filename)
path = Rails.root.join('storage', 'scraping', filename)
FileUtils.mkdir_p(File.dirname(path))
File.write(path, JSON.pretty_generate(results.to_h))
end
end
# app/jobs/batch_scraping_job.rb
class BatchScrapingJob < ApplicationJob
queue_as :scraping_batch
def perform(url_list_id, batch_size: 100)
url_list = UrlList.find(url_list_id)
urls = url_list.urls
urls.each_slice(batch_size).with_index do |batch, idx|
ScrapingJob.perform_later(
batch,
country: url_list.country,
output_file: "batch_#{idx}_#{url_list.id}.json"
)
end
end
end
# Usage in a controller
class ScrapingController < ApplicationController
def create
urls = params[:urls].split("\n").map(&:strip).compact_blank
BatchScrapingJob.perform_later(create_url_list(urls).id)
redirect_to scraping_status_path, notice: 'Scraping started'
end
private
def create_url_list(urls)
UrlList.create!(
urls: urls,
country: params[:country] || 'US',
user: current_user
)
end
end
Sidekiq integration
For Sidekiq, add server middleware for logging and metrics, then run the scraping inside a worker:
# config/initializers/sidekiq.rb
Sidekiq.configure_client do |config|
config.redis = { url: ENV['REDIS_URL'] }
end
Sidekiq.configure_server do |config|
config.redis = { url: ENV['REDIS_URL'] }
# Middleware for logging and metrics
config.server_middleware do |chain|
chain.add ScrapingMetricsMiddleware
end
end
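The config above references ScrapingMetricsMiddleware, which is app-specific and not shown. A minimal sketch that times each job and logs the duration could look like this:
# app/middleware/scraping_metrics_middleware.rb
class ScrapingMetricsMiddleware
  def call(worker, job, queue)
    started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    yield
  ensure
    elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
    Sidekiq.logger.info("job=#{job['class']} queue=#{queue} duration=#{elapsed.round(2)}s")
  end
end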
# app/workers/scraping_worker.rb
class ScrapingWorker
include Sidekiq::Worker
sidekiq_options queue: 'scraping', retry: 3, backtrace: true
def perform(urls, country = 'US')
@stats = { success: 0, failed: 0 }
urls.each_slice(50) do |batch|
process_batch(batch, country)
end
logger.info "Completed: #{@stats.inspect}"
end
private
def process_batch(urls, country)
hydra = Typhoeus::Hydra.new(max_concurrency: 25)
urls.each do |url|
request = build_request(url, country)
request.on_complete do |response|
if response.success?
process_success(url, response)
@stats[:success] += 1
else
@stats[:failed] += 1
logger.warn "Failed: #{url} - #{response.code}"
end
end
hydra.queue(request)
end
hydra.run
end
def build_request(url, country)
session = SecureRandom.hex(8)
proxy = "http://user-country-#{country}-session-#{session}:pass@gate.proxyhat.com:8080"
Typhoeus::Request.new(
url,
proxy: proxy,
timeout: 30,
followlocation: true,
headers: { 'User-Agent' => random_user_agent }
)
end
def process_success(url, response)
# Save to the database or cache
ScrapedPage.create!(
url: url,
content: response.body,
status: response.code
)
rescue ActiveRecord::RecordNotUnique
# Ignore duplicates
end
def random_user_agent
# Assumes a UserAgents array constant defined elsewhere in the app
UserAgents.sample
end
end
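Enqueue it like any other Sidekiq worker:
urls = ['https://httpbin.org/ip', 'https://httpbin.org/headers']
ScrapingWorker.perform_async(urls, 'DE')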
Comparing proxy approaches in Ruby
| Method | Pros | Cons | Use case |
|---|---|---|---|
| Net::HTTP | Stdlib, no dependencies, full control | Synchronous, no connection pooling | Simple scripts, single requests |
| Typhoeus | Parallelism (Hydra), libcurl, fast | Extra gem, native compilation | Large-scale scraping, high concurrency |
| ProxyHat SDK | IP rotation, geo-targeting, retry logic | Depends on an external service | Production scraping with anti-bot bypass |
| Faraday | Middleware, flexible, Rails-friendly | Overhead, needs an adapter | API clients, Rails applications |
Key takeaways
- Net::HTTP is enough for simple tasks, but it lacks parallelism; use Typhoeus for scraping at scale.
- Hydra in Typhoeus runs hundreds of requests in parallel, which is key when fetching thousands of URLs.
- ProxyHat's IP rotation is controlled through the username; each request can use a different country or city.
- A circuit breaker and retry logic are a must for production scraping; don't rely on single attempts.
- Disable SSL verification only in development; always verify certificates in production.
- ActiveJob/Sidekiq is the natural home for scraping jobs in Rails applications.
Looking for a ready-made residential proxy solution for Ruby? Check ProxyHat's pricing or browse the available locations.