If you build data pipelines or scraping systems in Ruby, sooner or later you will need to route your HTTP requests through proxies. Whether it's to dodge rate limits, reach geo-restricted content, or simply spread requests across multiple IPs, proxies are an essential tool in any Ruby developer's kit.

Ruby offers several options for working with HTTP proxies, from the Net::HTTP standard library to more advanced clients such as Typhoeus, which supports parallel requests. This guide covers the three main approaches: the basic stdlib, a high-performance libcurl-backed client, and the official ProxyHat SDK for automatic residential IP rotation.
## Net::HTTP: Basic Proxying with the Standard Library

Net::HTTP ships with Ruby, so you need no external dependencies to set up a basic proxy. The API can be verbose, though, and error handling takes careful attention.

### Basic Proxy Configuration with Net::HTTP

The most direct pattern is to create a Net::HTTP instance that specifies the proxy host and port:
```ruby
require 'net/http'
require 'uri'
require 'openssl'

# ProxyHat proxy configuration
PROXY_HOST = 'gate.proxyhat.com'
PROXY_PORT = 8080
PROXY_USER = 'your_username'
PROXY_PASS = 'your_password'

def fetch_with_proxy(url, timeout: 30)
  uri = URI.parse(url)

  # Open the HTTP connection through the proxy
  http = Net::HTTP.new(
    uri.host,
    uri.port,
    PROXY_HOST,
    PROXY_PORT,
    PROXY_USER,
    PROXY_PASS
  )

  # Timeout and TLS settings
  http.use_ssl = (uri.scheme == 'https')
  http.open_timeout = timeout
  http.read_timeout = timeout
  http.min_version = OpenSSL::SSL::TLS1_2_VERSION
  http.verify_mode = OpenSSL::SSL::VERIFY_PEER

  # Build and send the request
  request = Net::HTTP::Get.new(uri.request_uri)
  request['User-Agent'] = 'Mozilla/5.0 (compatible; RubyScraper/1.0)'
  request['Accept'] = 'text/html,application/xhtml+xml'

  response = http.request(request)

  case response
  when Net::HTTPSuccess
    { status: response.code.to_i, body: response.body, headers: response.each_header.to_h }
  when Net::HTTPRedirection
    { status: response.code.to_i, location: response['Location'], body: response.body }
  else
    { status: response.code.to_i, error: response.message }
  end
rescue Net::OpenTimeout => e
  { error: "Connection timeout: #{e.message}" }
rescue Net::ReadTimeout => e
  { error: "Read timeout: #{e.message}" }
rescue OpenSSL::SSL::SSLError => e
  { error: "SSL/TLS error: #{e.message}" }
rescue SocketError => e
  { error: "DNS/connection error: #{e.message}" }
rescue StandardError => e
  { error: "Unexpected error: #{e.class} - #{e.message}" }
end

# Usage example
result = fetch_with_proxy('https://httpbin.org/ip')
puts result.inspect
```
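A handy variation is keeping the whole proxy in a single URL, for example in an environment variable (`PROXY_URL` is a hypothetical name, not a standard one), and splitting it into the positional arguments Net::HTTP expects:

```ruby
require 'net/http'
require 'uri'

# Parse a proxy URL of the form http://user:pass@host:port into the
# positional proxy arguments Net::HTTP.new takes after the target host/port.
proxy = URI.parse(ENV.fetch('PROXY_URL', 'http://user:pass@gate.proxyhat.com:8080'))
http = Net::HTTP.new('example.com', 443, proxy.host, proxy.port, proxy.user, proxy.password)

puts http.proxy?          # → true
puts http.proxy_address   # → gate.proxyhat.com
```

No connection is opened until the first request, so the client can be built eagerly and inspected safely.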
### Robust Error Handling with Retries

In production you need retries with exponential backoff to ride out transient failures:
```ruby
require 'net/http'
require 'uri'
require 'openssl'

class ProxyHTTPClient
  MAX_RETRIES = 3
  BASE_DELAY = 1 # seconds

  def initialize(proxy_host:, proxy_port:, proxy_user:, proxy_pass:)
    @proxy_host = proxy_host
    @proxy_port = proxy_port
    @proxy_user = proxy_user
    @proxy_pass = proxy_pass
  end

  def get(url, headers: {}, timeout: 30)
    retries = 0
    loop do
      result = perform_request(url, headers, timeout)
      return result if result[:status] || retries >= MAX_RETRIES

      if retriable_error?(result[:error])
        retries += 1
        delay = BASE_DELAY * (2**(retries - 1)) + rand(0.0..1.0)
        puts "Retry #{retries}/#{MAX_RETRIES} in #{delay.round(2)}s..."
        sleep(delay)
      else
        return result
      end
    end
  end

  private

  def perform_request(url, headers, timeout)
    uri = URI.parse(url)
    http = Net::HTTP.new(
      uri.host, uri.port,
      @proxy_host, @proxy_port,
      @proxy_user, @proxy_pass
    )
    configure_http(http, uri, timeout)

    request = Net::HTTP::Get.new(uri.request_uri)
    headers.each { |k, v| request[k] = v }

    response = http.request(request)
    { status: response.code.to_i, body: response.body, headers: response.to_hash }
  rescue Net::OpenTimeout, Net::ReadTimeout, Errno::ECONNRESET, Errno::ETIMEDOUT => e
    { error: e.class.name, message: e.message }
  end

  def configure_http(http, uri, timeout)
    http.use_ssl = (uri.scheme == 'https')
    http.open_timeout = timeout
    http.read_timeout = timeout
    http.min_version = OpenSSL::SSL::TLS1_2_VERSION
    http.verify_mode = OpenSSL::SSL::VERIFY_PEER
  end

  def retriable_error?(error)
    %w[Net::OpenTimeout Net::ReadTimeout Errno::ECONNRESET Errno::ETIMEDOUT].include?(error)
  end
end

# Usage
client = ProxyHTTPClient.new(
  proxy_host: 'gate.proxyhat.com',
  proxy_port: 8080,
  proxy_user: 'your_username',
  proxy_pass: 'your_password'
)
result = client.get('https://httpbin.org/status/200', headers: { 'Accept' => 'application/json' })
puts "Status: #{result[:status]}"
```
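The delay schedule in `get` grows geometrically; stripped of the random jitter term (omitted here for determinism), it looks like this:

```ruby
# Exponential backoff: BASE_DELAY * 2**(attempt - 1).
# The client above also adds up to 1s of random jitter per attempt.
BASE_DELAY = 1
delays = (1..3).map { |attempt| BASE_DELAY * (2**(attempt - 1)) }
p delays  # → [1, 2, 4]
```

The jitter matters in fleets: it staggers retries from many workers so they don't hit the origin (or the proxy gateway) in lockstep.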
## Typhoeus: Parallel Requests with libcurl

Typhoeus is a wrapper around libcurl that delivers better performance and native support for parallel requests through Hydra. It shines when you need to make hundreds or thousands of concurrent requests.
### Installation and Setup

Add to your Gemfile:

```ruby
gem 'typhoeus'
```

### A Single Request Through a Proxy
```ruby
require 'typhoeus'
require 'json'

# Configure the proxy for a single request
response = Typhoeus::Request.new(
  'https://httpbin.org/ip',
  method: :get,
  proxy: 'http://your_username:your_password@gate.proxyhat.com:8080',
  headers: {
    'User-Agent' => 'Mozilla/5.0 (compatible; TyphoeusScraper/1.0)',
    'Accept' => 'application/json'
  },
  timeout: 30,
  followlocation: true,
  ssl_verifypeer: true,
  ssl_verifyhost: 2
).run

if response.success?
  puts "Status: #{response.code}"
  puts "Body: #{response.body}"
  puts "Detected IP: #{JSON.parse(response.body)['origin']}"
elsif response.timed_out?
  puts 'The request timed out'
elsif response.code == 0
  puts "Connection error: #{response.return_message}"
else
  puts "HTTP error #{response.code}"
end
```
### Parallel Requests with Hydra

The real power of Typhoeus is Hydra, which runs multiple requests in parallel:
```ruby
require 'typhoeus'
require 'json'
require 'concurrent'

class ParallelScraper
  PROXY_URL = 'http://your_username:your_password@gate.proxyhat.com:8080'
  MAX_CONCURRENCY = 50

  def initialize(urls)
    @urls = urls
    @results = Concurrent::Array.new # thread-safe
  end

  def scrape_all
    hydra = Typhoeus::Hydra.new(max_concurrency: MAX_CONCURRENCY)

    @urls.each_with_index do |url, index|
      request = build_request(url, index)
      hydra.queue(request)
    end

    # Run every queued request
    hydra.run
    @results.to_a
  end

  private

  def build_request(url, index)
    request = Typhoeus::Request.new(
      url,
      method: :get,
      proxy: PROXY_URL,
      headers: {
        'User-Agent' => random_user_agent,
        'Accept' => 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Language' => 'en-US,en;q=0.5'
      },
      timeout: 30,
      followlocation: true,
      ssl_verifypeer: true
    )

    # Use on_success/on_failure rather than on_complete + on_failure:
    # on_complete fires for every response, so pairing it with on_failure
    # would record failed responses twice.
    request.on_success do |response|
      @results << {
        url: url,
        index: index,
        status: response.code,
        body: response.body,
        success: true,
        time: response.total_time
      }
    end

    request.on_failure do |response|
      @results << {
        url: url,
        index: index,
        status: response.code,
        error: response.return_message,
        success: false
      }
    end

    request
  end

  def random_user_agent
    [
      'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
      'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15',
      'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko)'
    ].sample
  end
end

# Run the parallel scrape
urls = 100.times.map { |i| "https://httpbin.org/delay/#{rand(1..3)}?id=#{i}" }
scraper = ParallelScraper.new(urls)
results = scraper.scrape_all

# Statistics
successful = results.count { |r| r[:success] }
failed = results.count { |r| !r[:success] }
times = results.map { |r| r[:time] }.compact
avg_time = times.empty? ? 0 : times.sum / times.size
puts "Successful: #{successful}/#{urls.size}"
puts "Failed: #{failed}"
puts "Average time: #{avg_time.round(2)}s"
```
## The ProxyHat SDK: Automatic Rotation and Geo-targeting

The ProxyHat SDK simplifies residential IP rotation and geo-targeting. You can pin a country or city, and hold sticky sessions when you need a consistent IP.

### SDK Setup
```ruby
require 'net/http'
require 'uri'
require 'json'
require 'securerandom'

module ProxyHat
  class Client
    GATEWAY_HOST = 'gate.proxyhat.com'
    HTTP_PORT = 8080
    SOCKS5_PORT = 1080

    attr_reader :username, :password

    def initialize(username:, password:)
      @username = username
      @password = password
    end

    # Build the proxy URL from the given options
    def proxy_url(port: HTTP_PORT, country: nil, city: nil, session: nil, sticky: false)
      user_parts = [username]

      # Geo-targeting
      user_parts << "country-#{country.upcase}" if country
      user_parts << "city-#{city.downcase.gsub(/\s+/, '-')}" if city

      # Sticky session to keep the same IP
      user_parts << "session-#{session}" if sticky && session

      formatted_user = user_parts.join('-')
      "http://#{formatted_user}:#{password}@#{GATEWAY_HOST}:#{port}"
    end

    # Returns a Net::HTTP subclass preconfigured with the proxy
    def http_client(port: HTTP_PORT, **proxy_opts)
      proxy = proxy_url(port: port, **proxy_opts)
      uri = URI.parse(proxy)
      Net::HTTP::Proxy(uri.host, uri.port, uri.user, uri.password)
    end

    # GET request through a rotating proxy
    def get(url, country: nil, city: nil, session: nil, sticky: false, headers: {}, timeout: 30)
      proxy = proxy_url(country: country, city: city, session: session, sticky: sticky)
      uri = URI.parse(proxy)
      target = URI.parse(url)

      http = Net::HTTP.new(
        target.host, target.port,
        uri.host, uri.port,
        uri.user, uri.password
      )
      http.use_ssl = (target.scheme == 'https')
      http.open_timeout = timeout
      http.read_timeout = timeout
      http.verify_mode = OpenSSL::SSL::VERIFY_PEER

      request = Net::HTTP::Get.new(target.request_uri)
      headers.each { |k, v| request[k] = v }
      request['User-Agent'] ||= 'ProxyHat-Ruby/1.0'

      response = http.request(request)
      {
        status: response.code.to_i,
        body: response.body,
        headers: response.each_header.to_h
      }
    rescue StandardError => e
      { error: e.class.name, message: e.message }
    end

    # Check the proxy's current exit IP
    def current_ip(country: nil, city: nil)
      result = get('https://httpbin.org/ip', country: country, city: city)
      return nil unless result[:status] == 200

      JSON.parse(result[:body])['origin']
    end

    # Fetch exit IPs from several countries
    def test_geo_routing(countries)
      countries.each do |country|
        ip = current_ip(country: country)
        puts "#{country}: #{ip || 'Error'}"
        sleep(0.5)
      end
    end
  end
end

# Using the SDK
client = ProxyHat::Client.new(
  username: 'your_username',
  password: 'your_password'
)

# Simple request through a rotating residential proxy
result = client.get('https://httpbin.org/ip')
puts "IP: #{JSON.parse(result[:body])['origin']}"

# Geo-targeting: an IP from the United States
us_result = client.get('https://httpbin.org/ip', country: 'US')
puts "US IP: #{JSON.parse(us_result[:body])['origin']}"

# City-level geo-targeting: Berlin, Germany
de_result = client.get('https://httpbin.org/ip', country: 'DE', city: 'Berlin')
puts "Berlin IP: #{JSON.parse(de_result[:body])['origin']}"

# Sticky session: keep the same IP across multiple requests
session_id = SecureRandom.hex(8)
5.times do |i|
  result = client.get(
    'https://httpbin.org/ip',
    country: 'US',
    session: session_id,
    sticky: true
  )
  ip = JSON.parse(result[:body])['origin']
  puts "Request #{i + 1}: #{ip}"
end
```
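The credential-encoding scheme can be exercised on its own, with no network involved. A minimal standalone sketch of the same `country-`/`city-`/`session-` flag format (the `proxy_username` helper name is mine, not part of the SDK):

```ruby
# Build a ProxyHat-style proxy username from geo/session options.
def proxy_username(base, country: nil, city: nil, session: nil)
  parts = [base]
  parts << "country-#{country.upcase}" if country
  parts << "city-#{city.downcase.gsub(/\s+/, '-')}" if city
  parts << "session-#{session}" if session
  parts.join('-')
end

puts proxy_username('your_username', country: 'de', city: 'Frankfurt am Main', session: 'abc123')
# → your_username-country-DE-city-frankfurt-am-main-session-abc123
```

Because all routing options travel inside the username, no per-request API calls are needed: the gateway reads the flags on each authenticated connection.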
## Real-World Scraping: 1000 Concurrent URLs with Rotating Proxies

Here is a complete, production-ready scraping example that combines Typhoeus for concurrency with ProxyHat for residential IP rotation:
```ruby
require 'typhoeus'
require 'json'
require 'concurrent'
require 'logger'
require 'securerandom'

module ProxyHat
  class BatchScraper
    BATCH_SIZE = 100
    MAX_RETRIES = 3
    CONCURRENCY = 50

    attr_reader :stats

    def initialize(username:, password:, country: nil, logger: nil)
      @username = username
      @password = password
      @country = country
      @logger = logger || Logger.new($stdout).tap { |l| l.level = Logger::INFO }
      @stats = Concurrent::Hash.new(0)
      @results = Concurrent::Array.new
    end

    def scrape(urls, headers: {})
      @logger.info "Starting scrape of #{urls.size} URLs..."
      start_time = Time.now

      # Process in batches to avoid exhausting connections
      urls.each_slice(BATCH_SIZE).with_index do |batch, batch_idx|
        @logger.info "Processing batch #{batch_idx + 1}/#{(urls.size.to_f / BATCH_SIZE).ceil}"
        process_batch(batch, headers)
      end

      duration = Time.now - start_time
      print_summary(duration)
      @results.to_a
    end

    private

    def process_batch(urls, headers)
      hydra = Typhoeus::Hydra.new(max_concurrency: CONCURRENCY)

      urls.each_with_index do |url, idx|
        # Rotate the session per URL (a different IP each time)
        session = "batch-#{SecureRandom.hex(4)}-#{idx}"
        proxy = build_proxy_url(session)

        request = Typhoeus::Request.new(
          url,
          method: :get,
          proxy: proxy,
          headers: default_headers.merge(headers),
          timeout: 30,
          followlocation: true,
          ssl_verifypeer: true,
          ssl_verifyhost: 2
        )

        request.on_complete do |response|
          handle_response(url, response)
        end

        hydra.queue(request)
      end

      hydra.run
    end

    def build_proxy_url(session)
      user = @username.dup
      user << "-country-#{@country}" if @country
      user << "-session-#{session}"
      "http://#{user}:#{@password}@gate.proxyhat.com:8080"
    end

    def handle_response(url, response)
      @stats[:total] += 1

      if response.success?
        @stats[:success] += 1
        @results << {
          url: url,
          status: response.code,
          body: response.body,
          size: response.body.bytesize,
          time: response.total_time,
          success: true
        }
        @logger.debug "✓ #{url} (#{response.code}) - #{response.total_time.round(2)}s"
      elsif response.timed_out?
        @stats[:timeouts] += 1
        @results << { url: url, error: 'timeout', success: false }
        @logger.warn "⏱ Timeout: #{url}"
      else
        @stats[:errors] += 1
        @results << {
          url: url,
          status: response.code,
          error: response.return_message,
          success: false
        }
        @logger.warn "✗ #{url} (#{response.code || 'connection error'})"
      end
    end

    def default_headers
      {
        'User-Agent' => 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
        'Accept' => 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
        'Accept-Language' => 'en-US,en;q=0.9',
        'Accept-Encoding' => 'gzip, deflate',
        'Connection' => 'keep-alive'
      }
    end

    def print_summary(duration)
      success_rate = (@stats[:success].to_f / @stats[:total] * 100).round(2)
      avg_time = @results.map { |r| r[:time] }.compact.sum / [@results.size, 1].max
      total_bytes = @results.map { |r| r[:size] || 0 }.sum

      @logger.info "\n" + '=' * 50
      @logger.info 'SCRAPING SUMMARY'
      @logger.info '=' * 50
      @logger.info "Total URLs: #{@stats[:total]}"
      @logger.info "Successful: #{@stats[:success]} (#{success_rate}%)"
      @logger.info "Timeouts: #{@stats[:timeouts]}"
      @logger.info "Errors: #{@stats[:errors]}"
      @logger.info "Data downloaded: #{(total_bytes / 1024.0 / 1024).round(2)} MB"
      @logger.info "Total time: #{duration.round(2)}s"
      @logger.info "Average time per URL: #{avg_time.round(3)}s"
      @logger.info "Throughput: #{(@stats[:total] / duration).round(2)} URLs/s"
    end
  end
end

# Scrape 1000 URLs
generator = ->(n) { n.times.map { |i| "https://httpbin.org/delay/#{rand(1..2)}?id=#{i}" } }

scraper = ProxyHat::BatchScraper.new(
  username: 'your_username',
  password: 'your_password',
  country: 'US',
  logger: Logger.new($stdout).tap { |l| l.level = Logger::INFO }
)

urls = generator.call(1000)
results = scraper.scrape(urls)

# Filter successful results
successful = results.select { |r| r[:success] }
puts "\nFirst 5 successful results:"
successful.first(5).each do |r|
  puts "  #{r[:url]} - #{r[:status]} - #{r[:time].round(2)}s"
end
```
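The summary metrics reduce to a couple of Enumerable one-liners; here is the same computation over a toy result set, with `compact` dropping entries that have no timing (timeouts and connection errors):

```ruby
# Success count and average latency over a mixed result set.
results = [
  { success: true,  time: 1.5 },
  { success: true,  time: 0.5 },
  { success: false, time: nil }
]

successful = results.count { |r| r[:success] }
times = results.map { |r| r[:time] }.compact
avg = times.sum / times.size

puts successful     # → 2
puts avg.round(2)   # → 1.0
```

Averaging only over `times.size` (not `results.size`) keeps failed requests from dragging the latency figure down artificially.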
## TLS/SSL Configuration: Self-Signed Certificates and SNI

When working through proxies you can run into problematic SSL certificates or SNI-specific setups. Here is how to handle those cases:
```ruby
require 'net/http'
require 'uri'
require 'openssl'

class SSLAwareProxyClient
  def initialize(proxy_host:, proxy_port:, proxy_user:, proxy_pass:)
    @proxy_host = proxy_host
    @proxy_port = proxy_port
    @proxy_user = proxy_user
    @proxy_pass = proxy_pass
  end

  # Request with standard certificate verification
  def get_strict(url, timeout: 30)
    perform_request(url, timeout: timeout) do |http|
      http.verify_mode = OpenSSL::SSL::VERIFY_PEER
      http.cert_store = default_cert_store
    end
  end

  # Request that ignores certificate errors (testing only!)
  def get_permissive(url, timeout: 30)
    perform_request(url, timeout: timeout) do |http|
      http.verify_mode = OpenSSL::SSL::VERIFY_NONE
    end
  end

  # Request with an explicit SNI hostname. Net::HTTP derives the SNI name
  # (and the certificate check) from the host the client was built with,
  # so construct it with server_name and point ipaddr= (Ruby 2.4+) at the
  # real address taken from the URL.
  def get_with_sni(url, server_name:, timeout: 30)
    uri = URI.parse(url)
    http = Net::HTTP.new(
      server_name, uri.port,
      @proxy_host, @proxy_port,
      @proxy_user, @proxy_pass
    )
    http.ipaddr = uri.host
    http.use_ssl = true
    http.open_timeout = timeout
    http.read_timeout = timeout
    http.min_version = OpenSSL::SSL::TLS1_2_VERSION
    http.verify_mode = OpenSSL::SSL::VERIFY_PEER
    http.cert_store = default_cert_store

    response = http.request(Net::HTTP::Get.new(uri.request_uri))
    { status: response.code.to_i, body: response.body }
  rescue OpenSSL::SSL::SSLError => e
    { error: 'SSL_ERROR', message: e.message }
  end

  # Request with a client certificate (mTLS)
  def get_with_client_cert(url, cert_path:, key_path:, timeout: 30)
    perform_request(url, timeout: timeout) do |http|
      http.verify_mode = OpenSSL::SSL::VERIFY_PEER
      http.cert_store = default_cert_store
      http.cert = OpenSSL::X509::Certificate.new(File.read(cert_path))
      http.key = OpenSSL::PKey::RSA.new(File.read(key_path))
    end
  end

  private

  def perform_request(url, timeout: 30)
    uri = URI.parse(url)
    http = Net::HTTP.new(
      uri.host, uri.port,
      @proxy_host, @proxy_port,
      @proxy_user, @proxy_pass
    )
    http.use_ssl = (uri.scheme == 'https')
    http.open_timeout = timeout
    http.read_timeout = timeout
    http.min_version = OpenSSL::SSL::TLS1_2_VERSION

    # Custom SSL configuration via block
    yield(http) if block_given?

    request = Net::HTTP::Get.new(uri.request_uri)
    response = http.request(request)
    { status: response.code.to_i, body: response.body }
  rescue OpenSSL::SSL::SSLError => e
    { error: 'SSL_ERROR', message: e.message }
  end

  def default_cert_store
    store = OpenSSL::X509::Store.new
    store.set_default_paths
    # Add extra CA certificates here if needed:
    # store.add_file('/path/to/custom/ca-bundle.crt')
    store
  end
end

# Usage
client = SSLAwareProxyClient.new(
  proxy_host: 'gate.proxyhat.com',
  proxy_port: 8080,
  proxy_user: 'your_username',
  proxy_pass: 'your_password'
)

# Standard request with full certificate verification
result = client.get_strict('https://example.com')
puts "Status: #{result[:status]}"

# Request with a specific SNI name (useful for virtual hosts)
result = client.get_with_sni(
  'https://192.168.1.100',
  server_name: 'internal.example.com'
)
```
## Ruby on Rails Integration

In a Rails application you will want to wire proxies in modularly. Two common patterns follow: a Faraday middleware and ActiveJob workers.

### A Faraday Middleware for Proxies
```ruby
# config/initializers/proxy_client.rb
require 'faraday'
# Registers the Typhoeus adapter (bundled with the typhoeus gem on
# Faraday 1.x; on Faraday 2 use the faraday-typhoeus gem instead)
require 'typhoeus/adapters/faraday'

module ProxyHat
  class FaradayMiddleware < Faraday::Middleware
    def initialize(app, options = {})
      super(app)
      @username = options[:username]
      @password = options[:password]
      @country = options[:country]
      @sticky = options[:sticky]
      @session_key = options[:session_key]
    end

    def call(env)
      session = @sticky ? (env.request.context || {})[@session_key] : SecureRandom.hex(8)

      user = @username.dup
      user << "-country-#{@country}" if @country
      user << "-session-#{session}" if session

      # Inject the proxy into this request's options
      env.request.proxy = Faraday::ProxyOptions.from(
        "http://#{user}:#{@password}@gate.proxyhat.com:8080"
      )
      @app.call(env)
    end
  end
end

# Build a Faraday client that uses the proxy middleware
class ApiClient
  PROXYHAT_USER = ENV['PROXYHAT_USERNAME']
  PROXYHAT_PASS = ENV['PROXYHAT_PASSWORD']

  def initialize(country: nil, sticky: false)
    @country = country
    @sticky = sticky
  end

  def connection
    @connection ||= Faraday.new do |builder|
      builder.use ProxyHat::FaradayMiddleware,
                  username: PROXYHAT_USER,
                  password: PROXYHAT_PASS,
                  country: @country,
                  sticky: @sticky,
                  session_key: :proxy_session

      # Request middlewares (:retry needs the faraday-retry gem on Faraday 2)
      builder.request :json
      builder.request :retry, {
        max: 3,
        interval: 1,
        interval_randomness: 0.5,
        backoff_factor: 2,
        exceptions: [
          'Faraday::TimeoutError',
          'Faraday::ConnectionFailed',
          'Typhoeus::Errors::TyphoeusError'
        ]
      }

      # Response middlewares
      builder.response :json, content_type: /json/
      builder.response :logger, Rails.logger, bodies: true

      # The adapter always goes last
      builder.adapter :typhoeus
    end
  end

  def get(url, params: {})
    response = connection.get(url, params)
    { status: response.status, body: response.body }
  rescue Faraday::Error => e
    { error: e.class.name, message: e.message }
  end
end

# Usage from a controller or service object
client = ApiClient.new(country: 'US', sticky: true)
result = client.get('https://api.example.com/data')
```
### ActiveJob for Asynchronous Scraping
```ruby
# app/jobs/scraping_job.rb
require 'net/http'

class ScrapingError < StandardError; end

class ScrapingJob < ApplicationJob
  queue_as :scraping

  # Proxy configuration
  PROXYHAT_CONFIG = {
    host: 'gate.proxyhat.com',
    port: 8080,
    username: ENV['PROXYHAT_USERNAME'],
    password: ENV['PROXYHAT_PASSWORD']
  }.freeze

  retry_on Net::OpenTimeout, wait: :polynomially_longer, attempts: 5
  retry_on Net::ReadTimeout, wait: :polynomially_longer, attempts: 3
  discard_on ScrapingError

  def perform(url, options = {})
    @url = url
    @country = options[:country]
    @session = options[:session] || SecureRandom.hex(8)

    Rails.logger.info "Scraping #{url} (country: #{@country})"
    result = fetch_with_proxy

    if result[:success]
      process_result(result[:body])
    else
      handle_failure(result[:error])
    end
  end

  private

  def fetch_with_proxy
    uri = URI.parse(@url)
    proxy_user = build_proxy_username

    http = Net::HTTP.new(
      uri.host, uri.port,
      PROXYHAT_CONFIG[:host],
      PROXYHAT_CONFIG[:port],
      proxy_user,
      PROXYHAT_CONFIG[:password]
    )
    configure_ssl(http, uri)

    request = Net::HTTP::Get.new(uri.request_uri)
    request['User-Agent'] = random_user_agent

    response = http.request(request)
    { success: true, body: response.body, status: response.code.to_i }
  rescue Net::OpenTimeout, Net::ReadTimeout
    raise # propagate so the retry_on declarations above can reschedule the job
  rescue StandardError => e
    { success: false, error: e.message }
  end

  def build_proxy_username
    user = PROXYHAT_CONFIG[:username].dup
    user << "-country-#{@country}" if @country
    user << "-session-#{@session}"
    user
  end

  def configure_ssl(http, uri)
    http.use_ssl = (uri.scheme == 'https')
    http.verify_mode = OpenSSL::SSL::VERIFY_PEER
    http.open_timeout = 30
    http.read_timeout = 30
  end

  def random_user_agent
    @user_agent ||= [
      'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
      'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15',
      'Mozilla/5.0 (X11; Ubuntu; Linux x86_64) AppleWebKit/537.36'
    ].sample
  end

  def process_result(body)
    # Persist or post-process the page
    ScrapedPage.create!(
      url: @url,
      content: body,
      scraped_at: Time.current,
      country: @country
    )
  end

  def handle_failure(error)
    Rails.logger.error "Scraping failed for #{@url}: #{error}"
    ScrapingFailure.create!(url: @url, error: error)
    raise ScrapingError, error
  end
end

# app/jobs/batch_scraping_job.rb
class BatchScrapingJob < ApplicationJob
  queue_as :scraping

  def perform(urls, country: nil)
    urls.each_slice(100).with_index do |batch, index|
      batch.each do |url|
        ScrapingJob.perform_later(url, country: country, session: "batch-#{index}")
      end
      # Rate limiting between batches
      sleep(2) unless index == 0
    end
  end
end

# From the console or a controller
BatchScrapingJob.perform_later(
  ['https://example1.com', 'https://example2.com'],
  country: 'US'
)
```
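The batching in `BatchScrapingJob` is plain `each_slice`; a quick look at how 250 URLs split into batches of 100 (toy URLs, no Rails required):

```ruby
# each_slice partitions the URL list; with_index tags each batch,
# which the job above reuses as part of the per-batch session id.
urls = (1..250).map { |i| "https://example.com/page/#{i}" }
batches = urls.each_slice(100).with_index.map { |batch, idx| [idx, batch.size] }
p batches  # → [[0, 100], [1, 100], [2, 50]]
```

Note that all URLs in a batch share the same `session: "batch-#{index}"` value, so each batch goes out through one sticky IP; rotate the session per URL instead if you want a fresh IP for every request.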
## Method Comparison

| Method | Concurrency | Complexity | Ideal Use Case |
|---|---|---|---|
| Net::HTTP | Sequential | Low | Simple scripts, few URLs |
| Typhoeus + Hydra | High (50-500) | Medium | Large-scale scraping, high concurrency |
| Faraday + ProxyHat | Configurable | Medium | Rails applications, APIs |
| ActiveJob + Proxy | Background jobs | High | Async workers, ETL pipelines |
## Key Takeaways

Pick the right tool: Net::HTTP for simple scripts, Typhoeus for large-scale scraping, and Faraday/ActiveJob for production-ready Rails applications.

- Net::HTTP is enough for single requests or a handful of URLs, but has no native concurrency support.
- Typhoeus with Hydra runs hundreds of parallel requests with fine-grained control over timeouts, retries, and callbacks.
- ProxyHat simplifies residential IP rotation with country- and city-level geo-targeting.
- Sticky sessions keep the same IP across multiple requests when you need consistency (logins, shopping carts).
- TLS/SSL needs special attention when the proxy touches upstream certificates or you need mTLS.
- In Rails, use a Faraday middleware for centralized configuration and ActiveJob for robust asynchronous processing.
For more detail on proxy configuration and advanced geo-targeting options, see our pricing page or the list of available locations.