Introduction: Why Proxies Are Essential in Ruby
If you build data pipelines, scrapers, or automation tools in Ruby, you have inevitably hit limitations: rate limits, IP blocks, geographic restrictions. Proxies in Ruby are not a luxury; they are a necessity for any serious data-collection system.
This guide covers three complementary approaches: the standard library's Net::HTTP, the high-performance Typhoeus client with its Hydra system for parallel requests, and the ProxyHat SDK for automatic rotation and geo-targeting. You will see concrete examples of scraping 1000 URLs, TLS/SSL handling, and Rails integration.
Net::HTTP: Proxy Fundamentals
Net::HTTP is part of the Ruby standard library. It is the simplest option for adding a proxy to your requests, but it requires manual configuration.
Basic Proxy Configuration
Here is how to configure a proxy with authentication for a simple HTTP request:
require 'net/http'
require 'uri'
# ProxyHat proxy configuration
PROXY_HOST = 'gate.proxyhat.com'
PROXY_PORT = 8080
PROXY_USER = 'your_username'
PROXY_PASS = 'your_password'
# Target URL
target_url = 'https://httpbin.org/ip'
uri = URI(target_url)
# Open the connection through the proxy.
# Passing the credentials directly to Net::HTTP.new lets Ruby send the
# Proxy-Authorization header itself, including on the CONNECT request
# used to tunnel HTTPS (a header set manually on the GET would not be
# sent with the CONNECT and would fail with 407).
http = Net::HTTP.new(uri.host, uri.port, PROXY_HOST, PROXY_PORT, PROXY_USER, PROXY_PASS)
http.use_ssl = (uri.scheme == 'https')
request = Net::HTTP::Get.new(uri.request_uri)
response = http.request(request)
puts "Status: #{response.code}"
puts "Body: #{response.body}"
Error Handling and Timeouts
In production, network errors are inevitable. Here is a robust pattern with retries and exception handling:
require 'net/http'
require 'uri'
class ProxyRequest
MAX_RETRIES = 3
TIMEOUT_SECONDS = 30
def initialize(proxy_host:, proxy_port:, proxy_user:, proxy_pass:)
@proxy_host = proxy_host
@proxy_port = proxy_port
@proxy_user = proxy_user
@proxy_pass = proxy_pass
end
def get(url, headers: {})
uri = URI(url)
retries = 0
loop do
begin
http = Net::HTTP.new(uri.host, uri.port, @proxy_host, @proxy_port, @proxy_user, @proxy_pass)
http.use_ssl = (uri.scheme == 'https')
http.open_timeout = TIMEOUT_SECONDS
http.read_timeout = TIMEOUT_SECONDS
http.write_timeout = TIMEOUT_SECONDS
request = Net::HTTP::Get.new(uri.request_uri)
headers.each { |k, v| request[k] = v }
response = http.request(request)
case response
when Net::HTTPSuccess
return { success: true, status: response.code.to_i, body: response.body }
when Net::HTTPRedirection
return { success: true, status: response.code.to_i, body: response.body, redirect: response['location'] }
else
return { success: false, status: response.code.to_i, error: "HTTP #{response.code}: #{response.message}" }
end
rescue Net::OpenTimeout, Net::ReadTimeout => e
retries += 1
raise e if retries >= MAX_RETRIES
sleep(2 ** retries) # Exponential backoff
retry
rescue SocketError, Errno::ECONNREFUSED, Errno::ECONNRESET => e
retries += 1
raise e if retries >= MAX_RETRIES
sleep(1)
retry
rescue Net::HTTPBadResponse => e
return { success: false, error: "Invalid response: #{e.message}" }
end
end
end
end
# Usage
client = ProxyRequest.new(
proxy_host: 'gate.proxyhat.com',
proxy_port: 8080,
proxy_user: 'user-country-US',
proxy_pass: 'your_password'
)
result = client.get('https://httpbin.org/ip')
puts result.inspect
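Because get accepts a headers: hash, per-request headers can be passed as well; a quick example:
result = client.get(
'https://httpbin.org/headers',
headers: { 'User-Agent' => 'MyScraper/1.0', 'Accept' => 'application/json' }
)
puts result[:body] if result[:success]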
Typhoeus: Parallel Requests with Hydra
Typhoeus is an HTTP client built on libcurl that excels at parallel requests through its Hydra system. It is ideal for scraping thousands of URLs concurrently.
Installation and Configuration
gem install typhoeus
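In a Bundler-managed project, declare it in the Gemfile instead:
# Gemfile
gem 'typhoeus'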
Simple Request with a Proxy
require 'typhoeus'
# Proxy configuration
PROXY_URL = 'http://user-country-FR:password@gate.proxyhat.com:8080'
response = Typhoeus.get(
'https://httpbin.org/ip',
proxy: PROXY_URL,
timeout: 30,
followlocation: true,
ssl_verifypeer: true,
ssl_verifyhost: 2
)
if response.success?
puts "Status: #{response.code}"
puts "Body: #{response.body}"
puts "Time: #{response.total_time}s"
else
puts "Error: #{response.return_message}"
end
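If the credentials contain characters that are awkward to embed in the URL, they can be supplied separately; a sketch using the proxyuserpwd option, which Typhoeus passes through to libcurl's CURLOPT_PROXYUSERPWD:
response = Typhoeus.get(
'https://httpbin.org/ip',
proxy: 'http://gate.proxyhat.com:8080',
proxyuserpwd: 'user-country-FR:p@ssw:rd', # no URL-encoding needed here
timeout: 30
)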
Parallel Requests with Hydra
The real power of Typhoeus lies in Hydra, which schedules many requests concurrently; the scraper below caps concurrency at 50 via max_concurrency:
require 'typhoeus'
class ParallelScraper
MAX_CONCURRENT = 50
def initialize(proxy_user:, proxy_pass:)
@proxy_url = "http://#{proxy_user}:#{proxy_pass}@gate.proxyhat.com:8080"
@hydra = Typhoeus::Hydra.new(max_concurrency: MAX_CONCURRENT)
end
def fetch_urls(urls, &callback)
requests = urls.map do |url|
request = Typhoeus::Request.new(
url,
method: :get,
proxy: @proxy_url,
timeout: 30,
followlocation: true,
ssl_verifypeer: true
)
request.on_complete do |response|
if response.success?
callback.call(url, response.body, nil)
else
callback.call(url, nil, response.return_message)
end
end
@hydra.queue(request)
request
end
@hydra.run
requests
end
end
# Example: scrape 100 URLs in parallel
urls = Array.new(100) { "https://httpbin.org/delay/#{rand(1..3)}" }
scraper = ParallelScraper.new(
proxy_user: 'user-country-US',
proxy_pass: 'your_password'
)
results = []
scraper.fetch_urls(urls) do |url, body, error|
if error
puts "❌ #{url}: #{error}"
else
puts "✅ #{url}: #{body.bytesize} bytes"
results << { url: url, body: body }
end
end
puts "\nTotal réussi: #{results.size}/#{urls.size}"
ProxyHat Ruby SDK: Rotation and Geo-Targeting
The ProxyHat SDK simplifies automatic IP rotation and geographic targeting. Here is how to integrate it into a Ruby workflow.
A Custom SDK for ProxyHat
require 'net/http'
require 'uri'
require 'json'
require 'securerandom'
module ProxyHat
class Client
GATEWAY = 'gate.proxyhat.com'
HTTP_PORT = 8080
SOCKS5_PORT = 1080
attr_reader :username, :password
def initialize(username:, password:, country: nil, city: nil, session: nil)
# Keep the raw pieces so new sessions can be built without
# parsing them back out of the composed username
@base_username = username
@country = country
@city = city
@username = build_username(username, country, city, session)
@password = password
end
# Arguments for Net::HTTP
def net_http_proxy_args
[GATEWAY, HTTP_PORT, @username, @password]
end
# Proxy URL for Typhoeus
def typhoeus_proxy_url
"http://#{URI.encode_www_form_component(@username)}:#{URI.encode_www_form_component(@password)}@#{GATEWAY}:#{HTTP_PORT}"
end
# SOCKS5 configuration for specific use cases
def socks5_proxy_url
"socks5://#{URI.encode_www_form_component(@username)}:#{URI.encode_www_form_component(@password)}@#{GATEWAY}:#{SOCKS5_PORT}"
end
# Create a new session with a sticky IP
def create_sticky_session(session_id = nil)
session_id ||= SecureRandom.hex(8)
self.class.new(
username: @base_username,
password: @password,
country: @country,
city: @city,
session: session_id
)
end
# Get a fresh IP (rotation) while preserving the geo-targeting
def rotate
self.class.new(
username: @base_username,
password: @password,
country: @country,
city: @city,
session: SecureRandom.hex(8)
)
end
private
def build_username(base_user, country, city, session)
parts = [base_user]
parts << "country-#{country}" if country
parts << "city-#{city}" if city
parts << "session-#{session}" if session
parts.join('-')
end
end
# Proxy pool manager with rotation
class ProxyPool
def initialize(username:, password:, country: nil, pool_size: 10)
@base_config = { username: username, password: password, country: country }
@pool_size = pool_size
@sessions = []
end
def get_session(index)
@sessions[index] ||= Client.new(**@base_config, session: SecureRandom.hex(8))
end
def rotate_all
@sessions = @pool_size.times.map do
Client.new(**@base_config, session: SecureRandom.hex(8))
end
end
end
end
# Using the SDK
proxy = ProxyHat::Client.new(
username: 'your_username',
password: 'your_password',
country: 'US',
city: 'new_york'
)
# For Net::HTTP
proxy_args = proxy.net_http_proxy_args
http = Net::HTTP.new('example.com', 443, *proxy_args)
http.use_ssl = true # port 443 requires TLS
# For Typhoeus
proxy_url = proxy.typhoeus_proxy_url
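The sticky-session and rotation helpers chain naturally; a short sketch using the classes defined above (the Typhoeus calls are illustrative):
# Keep the same exit IP across several requests
sticky = proxy.create_sticky_session
2.times { Typhoeus.get('https://httpbin.org/ip', proxy: sticky.typhoeus_proxy_url) }
# Force a brand-new IP
fresh = proxy.rotate
Typhoeus.get('https://httpbin.org/ip', proxy: fresh.typhoeus_proxy_url)
# Or spread traffic across a pool of sessions
pool = ProxyHat::ProxyPool.new(
username: 'your_username',
password: 'your_password',
country: 'US',
pool_size: 5
)
client = pool.get_session(0) # slot 0, lazily created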
Scraping 1000 URLs with Residential Proxies
Here is a complete example of large-scale scraping with IP rotation and error handling:
require 'typhoeus'
require 'json'
require 'concurrent' # gem install concurrent-ruby
class LargeScaleScraper
BATCH_SIZE = 100
MAX_RETRIES = 3
CONCURRENT_REQUESTS = 50
def initialize(username:, password:, country: 'US')
@username = username
@password = password
@country = country
@results = Concurrent::Array.new
@errors = Concurrent::Array.new
@mutex = Mutex.new
end
def scrape_urls(urls)
start_time = Time.now
batches = urls.each_slice(BATCH_SIZE).to_a
puts "🚀 Démarrage: #{urls.size} URLs en #{batches.size} batches"
batches.each_with_index do |batch, batch_idx|
puts "\n📦 Batch #{batch_idx + 1}/#{batches.size}"
process_batch(batch, batch_idx)
end
duration = Time.now - start_time
print_summary(duration)
{ results: @results.to_a, errors: @errors.to_a }
end
private
def process_batch(urls, batch_idx)
hydra = Typhoeus::Hydra.new(max_concurrency: CONCURRENT_REQUESTS)
urls.each_with_index do |url, idx|
# A distinct session (and therefore a distinct IP) per request
session_id = "batch-#{batch_idx}-req-#{idx}"
proxy_url = build_proxy_url(session_id)
request = Typhoeus::Request.new(
url,
method: :get,
proxy: proxy_url,
timeout: 45,
connecttimeout: 15,
followlocation: true,
ssl_verifypeer: true,
headers: {
'User-Agent' => random_user_agent,
'Accept' => 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Language' => 'en-US,en;q=0.9'
}
)
request.on_complete do |response|
handle_response(url, response, batch_idx)
end
hydra.queue(request)
end
hydra.run
end
def handle_response(url, response, batch_idx)
@mutex.synchronize do
if response.success?
data = {
url: url,
status: response.code,
body: response.body,
size: response.body.bytesize,
time: response.total_time
}
@results << data
puts " ✅ #{url} (#{response.code}) - #{response.body.bytesize} bytes"
elsif response.code == 429 || response.code == 403
# Rate limited or blocked - record it so it can be retried with a new IP
@errors << { url: url, status: response.code, error: 'Rate limited/blocked' }
puts " ⚠️ #{url} - Blocked (#{response.code})"
else
@errors << { url: url, status: response.code, error: response.status_message || response.return_message }
puts " ❌ #{url} - Error: #{response.status_message || response.return_message}"
end
end
end
def build_proxy_url(session_id)
user = "#{@username}-country-#{@country}-session-#{session_id}"
"http://#{URI.encode_www_form_component(user)}:#{URI.encode_www_form_component(@password)}@gate.proxyhat.com:8080"
end
def random_user_agent
agents = [
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:121.0) Gecko/20100101 Firefox/121.0'
]
agents.sample
end
def print_summary(duration)
total = @results.size + @errors.size
success_rate = total.zero? ? 0 : (@results.size.to_f / total * 100).round(2)
avg_time = @results.empty? ? 0 : @results.map { |r| r[:time] }.sum / @results.size
puts "\n" + "=" * 60
puts "📊 RÉSUMÉ"
puts "=" * 60
puts "Total URLs : #{total}"
puts "Succès : #{@results.size} (#{success_rate}%)"
puts "Échecs : #{@errors.size}"
puts "Durée totale : #{duration.round(2)}s"
puts "Temps moyen/req : #{avg_time.round(3)}s"
puts "Requêtes/sec : #{ (total / duration).round(2) }"
puts "=" * 60
end
end
# Run it
if __FILE__ == $0
urls = Array.new(1000) { "https://httpbin.org/delay/#{rand(1..2)}" }
scraper = LargeScaleScraper.new(
username: 'your_username',
password: 'your_password',
country: 'US'
)
results = scraper.scrape_urls(urls)
# Export the results
File.write('results.json', JSON.pretty_generate(results[:results]))
puts "\n💾 Résultats sauvegardés dans results.json"
end
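Note that the 429/403 responses above are only recorded, never retried. One simple follow-up pass, sketched here, feeds the blocked URLs through a fresh run so they receive new session IDs (and therefore new exit IPs):
# Hypothetical second pass over rate-limited or blocked URLs
blocked = results[:errors]
.select { |e| [429, 403].include?(e[:status]) }
.map { |e| e[:url] }
unless blocked.empty?
LargeScaleScraper.new(
username: 'your_username',
password: 'your_password',
country: 'US'
).scrape_urls(blocked)
end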
Advanced TLS/SSL Configuration
Handling SSL certificates properly is crucial when scraping sites with self-signed certificates or unusual TLS configurations.
Handling Self-Signed Certificates
require 'net/http'
require 'openssl'
class SSLAwareClient
def initialize(proxy_host:, proxy_port:, proxy_user:, proxy_pass:, verify_ssl: true)
@proxy_host = proxy_host
@proxy_port = proxy_port
@proxy_user = proxy_user
@proxy_pass = proxy_pass
@verify_ssl = verify_ssl
end
def get(url, headers: {})
uri = URI(url)
http = Net::HTTP.new(uri.host, uri.port, @proxy_host, @proxy_port, @proxy_user, @proxy_pass)
if uri.scheme == 'https'
http.use_ssl = true
if @verify_ssl
# Strict SSL verification
http.verify_mode = OpenSSL::SSL::VERIFY_PEER
http.cert_store = OpenSSL::X509::Store.new.tap(&:set_default_paths)
else
# For self-signed certificates (development only)
http.verify_mode = OpenSSL::SSL::VERIFY_NONE
warn "⚠️ SSL verification disabled - use only for development!"
end
# SNI (Server Name Indication) is handled automatically by Net::HTTP,
# which sets the hostname on the underlying SSL socket
# Modern TLS versions
http.min_version = OpenSSL::SSL::TLS1_2_VERSION
http.max_version = OpenSSL::SSL::TLS1_3_VERSION
# Restrict TLS 1.2 cipher suites; TLS 1.3 suites are negotiated
# separately by OpenSSL and are not affected by #ciphers=
http.ciphers = [
'ECDHE-RSA-AES256-GCM-SHA384',
'ECDHE-RSA-AES128-GCM-SHA256'
].join(':')
end
request = Net::HTTP::Get.new(uri.request_uri)
headers.each { |k, v| request[k] = v }
http.request(request)
rescue OpenSSL::SSL::SSLError => e
{ success: false, error: "SSL Error: #{e.message}" }
end
end
# Usage
client = SSLAwareClient.new(
proxy_host: 'gate.proxyhat.com',
proxy_port: 8080,
proxy_user: 'user-country-DE',
proxy_pass: 'your_password',
verify_ssl: true
)
response = client.get('https://example.com')
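For a self-signed certificate you actually trust, a better option than disabling verification is to add its CA to the certificate store and keep strict checking. A sketch, assuming the certificate lives at a hypothetical ./certs/internal_ca.pem and http is a Net::HTTP instance as above:
require 'openssl'
store = OpenSSL::X509::Store.new
store.set_default_paths # keep the system CAs
store.add_file('./certs/internal_ca.pem') # additionally trust the internal CA
http.verify_mode = OpenSSL::SSL::VERIFY_PEER
http.cert_store = store # strict verification still applies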
TLS with Typhoeus
require 'typhoeus'
# Full TLS configuration
response = Typhoeus.get(
'https://example.com',
proxy: 'http://user-country-US:password@gate.proxyhat.com:8080',
# SSL/TLS options
ssl_verifypeer: true,
ssl_verifyhost: 2, # strict hostname verification
sslversion: :tlsv1_2,
# Advanced options
connecttimeout: 15,
timeout: 30,
# For self-signed certificates (development only):
# ssl_verifypeer: false,
# ssl_verifyhost: 0
)
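Typhoeus can also point libcurl at a custom CA bundle, which is the cleaner alternative to ssl_verifypeer: false; the path below is a hypothetical example:
response = Typhoeus.get(
'https://internal.example.com',
proxy: 'http://user-country-US:password@gate.proxyhat.com:8080',
ssl_verifypeer: true,
cainfo: './certs/internal_ca.pem' # forwarded to CURLOPT_CAINFO
)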
Ruby on Rails Integration
Faraday Middleware with a Proxy
Faraday is the standard HTTP client in the Rails ecosystem. Here is a clean configuration with proxy support:
# Gemfile
gem 'faraday'
gem 'faraday-retry' # the retry middleware lives in its own gem as of Faraday 2.x
# config/initializers/proxy_client.rb
class ProxyFaradayClient
attr_reader :connection
def initialize(proxy_user: nil, proxy_pass: nil, country: nil)
@proxy_config = build_proxy_config(proxy_user, proxy_pass, country)
@connection = build_connection
end
def get(url, params: {}, headers: {})
response = @connection.get(url, params, headers)
{ success: true, status: response.status, body: response.body }
rescue Faraday::Error => e
{ success: false, error: e.message }
end
def post(url, body:, headers: {})
response = @connection.post(url, body, headers)
{ success: true, status: response.status, body: response.body }
rescue Faraday::Error => e
{ success: false, error: e.message }
end
private
def build_proxy_config(user, pass, country)
return nil unless user && pass
username = country ? "#{user}-country-#{country}" : user
# Faraday accepts the proxy as a plain URL string
"http://#{URI.encode_www_form_component(username)}:#{URI.encode_www_form_component(pass)}@gate.proxyhat.com:8080"
end
def build_connection
Faraday.new do |builder|
# Proxy configuration
builder.proxy = @proxy_config if @proxy_config
# Middlewares
builder.request :url_encoded
builder.request :json
builder.response :json, content_type: /\bjson$/
builder.response :raise_error
# Retry middleware with exponential backoff
builder.request :retry,
max: 3,
interval: 1,
backoff_factor: 2,
retry_statuses: [429, 500, 502, 503, 504]
# Adapter
builder.adapter :net_http do |http|
http.open_timeout = 15
http.read_timeout = 30
http.write_timeout = 30
end
end
end
end
# Scraping service with a pool of proxy clients
class ScrapingService
def initialize
@clients = {
us: ProxyFaradayClient.new(
proxy_user: ENV['PROXYHAT_USER'],
proxy_pass: ENV['PROXYHAT_PASS'],
country: 'US'
),
eu: ProxyFaradayClient.new(
proxy_user: ENV['PROXYHAT_USER'],
proxy_pass: ENV['PROXYHAT_PASS'],
country: 'DE'
),
direct: ProxyFaradayClient.new # no proxy
}
end
def fetch(url, region: :us)
client = @clients[region] || @clients[:us]
client.get(url)
end
end
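A quick usage sketch (the URL and logging are illustrative):
service = ScrapingService.new
result = service.fetch('https://example.com/products', region: :eu)
Rails.logger.info("Fetched with status #{result[:status]}") if result[:success]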
Integration with ActiveJob
# app/jobs/scraping_job.rb
class ScrapingJob < ApplicationJob
queue_as :scraping
# Automatic retries with backoff
retry_on(Net::ReadTimeout, wait: :exponentially_longer, attempts: 3)
retry_on(Errno::ECONNRESET, wait: 5.seconds, attempts: 3)
def perform(urls, options = {})
country = options[:country] || 'US'
batch_id = options[:batch_id]
proxy_client = ProxyFaradayClient.new(
proxy_user: ENV['PROXYHAT_USER'],
proxy_pass: ENV['PROXYHAT_PASS'],
country: country
)
results = urls.map do |url|
result = proxy_client.get(url)
result.merge(url: url, scraped_at: Time.current)
end
# Persist the results
save_results(results, batch_id)
# Broadcast completion
ScrapingChannel.broadcast_to(
batch_id,
event: 'completed',
total: results.size,
success: results.count { |r| r[:success] }
)
end
private
def save_results(results, batch_id)
# Bulk insert for performance
ScrapingResult.insert_all(
results.map do |r|
{
batch_id: batch_id,
url: r[:url],
status: r[:status],
body: r[:body],
created_at: Time.current,
updated_at: Time.current
}
end
)
end
end
# Enqueue the job
ScrapingJob.perform_later(
['https://example1.com', 'https://example2.com'],
country: 'FR',
batch_id: SecureRandom.uuid
)
Circuit Breaker Pattern
To avoid hammering a service that is already failing, implement a circuit breaker:
class CircuitBreaker
STATES = %i[closed open half_open].freeze
attr_reader :failure_count, :state
def initialize(failure_threshold: 5, recovery_timeout: 60)
@failure_threshold = failure_threshold
@recovery_timeout = recovery_timeout
@failure_count = 0
@state = :closed
@last_failure_time = nil
@mutex = Mutex.new
end
def call
raise 'Circuit open' if open?
begin
result = yield
on_success
result
rescue => e
on_failure
raise e
end
end
def open?
@mutex.synchronize do
if @state == :open
if Time.now - @last_failure_time > @recovery_timeout
@state = :half_open
false
else
true
end
else
false
end
end
end
private
def on_success
@mutex.synchronize do
@failure_count = 0
@state = :closed
end
end
def on_failure
@mutex.synchronize do
@failure_count += 1
@last_failure_time = Time.now
if @failure_count >= @failure_threshold
@state = :open
end
end
end
end
# Usage with a proxy
class ResilientProxyClient
def initialize
@breaker = CircuitBreaker.new(failure_threshold: 5, recovery_timeout: 30)
end
def get(url)
@breaker.call do
# Your request logic goes here
Typhoeus.get(url, proxy: proxy_url, timeout: 30)
end
end
private
def proxy_url
"http://#{ENV['PROXYHAT_USER']}:#{ENV['PROXYHAT_PASS']}@gate.proxyhat.com:8080"
end
end
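In practice you wrap calls and back off while the circuit is open; a brief sketch (the raise in CircuitBreaker#call surfaces as a RuntimeError):
client = ResilientProxyClient.new
attempts = 0
begin
response = client.get('https://example.com')
puts response.code
rescue RuntimeError => e
raise if (attempts += 1) > 3
sleep 30 # circuit is open: give the upstream time to recover
retry
end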
Comparing the Approaches
| Approach | Pros | Cons | Use case |
|---|---|---|---|
| Net::HTTP | Stdlib, zero dependencies, simple | Synchronous, no native parallelism | Simple scripts, prototypes |
| Typhoeus | Massive parallelism, fast, libcurl-based | Native compilation required, more complex | High-performance scraping |
| ProxyHat SDK | Automatic rotation, geo-targeting | Requires a ProxyHat account | Production, large-scale scraping |
| Faraday | Middleware, testable, Rails standard | Overhead, slower than Typhoeus | Rails applications, APIs |