How to use HTTP proxies in Ruby: Net::HTTP, Typhoeus, and the ProxyHat SDK

A complete guide to HTTP proxies in Ruby: from Net::HTTP, through Typhoeus with parallel requests, to the ProxyHat SDK with IP rotation and geo-targeting. Production code with error handling included.

For Ruby developers building scraping systems and data pipelines, HTTP proxies are an essential tool. Whether you are pulling SERP results, monitoring e-commerce prices, or training AI models on web data, you need to know how to configure a proxy at every level of the stack.

This guide shows production code, not theory. We start with Net::HTTP from the standard library, move on to parallel requests in Typhoeus via Hydra, and finish with the ProxyHat SDK with automatic IP rotation and geo-targeting.

Net::HTTP with a proxy: basics and authentication

Net::HTTP is part of Ruby's standard library. It is available without extra gems but requires manual proxy configuration. Here is a complete example with error handling:

require 'net/http'
require 'uri'

class ProxyHTTPClient
  PROXY_HOST = 'gate.proxyhat.com'
  PROXY_PORT = 8080
  PROXY_USER = 'your_username'
  PROXY_PASS = 'your_password'

  def initialize(proxy_user: nil, proxy_pass: nil)
    @proxy_user = proxy_user || PROXY_USER
    @proxy_pass = proxy_pass || PROXY_PASS
  end

  def get(url, timeout: 30)
    uri = URI.parse(url)

    http = Net::HTTP.new(
      uri.host,
      uri.port,
      PROXY_HOST,
      PROXY_PORT,
      @proxy_user,
      @proxy_pass
    )

    http.use_ssl = (uri.scheme == 'https')
    http.open_timeout = timeout
    http.read_timeout = timeout
    http.verify_mode = OpenSSL::SSL::VERIFY_PEER

    request = Net::HTTP::Get.new(uri.request_uri)
    request['User-Agent'] = 'ProxyHat-Ruby-Client/1.0'
    request['Accept'] = 'text/html,application/xhtml+xml'

    response = http.request(request)

    case response
    when Net::HTTPSuccess
      { status: response.code.to_i, body: response.body, headers: response.each_header.to_h }
    when Net::HTTPRedirection
      { status: response.code.to_i, location: response['Location'], body: nil }
    else
      { status: response.code.to_i, error: response.message }
    end
  rescue Net::OpenTimeout => e
    { status: 0, error: "Connection timeout: #{e.message}" }
  rescue Net::ReadTimeout => e
    { status: 0, error: "Read timeout: #{e.message}" }
  rescue SocketError => e
    { status: 0, error: "DNS/Socket error: #{e.message}" }
  rescue OpenSSL::SSL::SSLError => e
    { status: 0, error: "SSL error: #{e.message}" }
  rescue StandardError => e
    { status: 0, error: "Unexpected error: #{e.message}" }
  ensure
    http&.finish if http&.started?
  end
end

# Usage
client = ProxyHTTPClient.new(proxy_user: 'user-country-US', proxy_pass: 'pass')
result = client.get('https://httpbin.org/ip')
puts result.inspect
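
In real projects you would not hardcode credentials in the class; reading them from the environment keeps them out of the repository. A minimal sketch (the PROXYHAT_USER and PROXYHAT_PASS variable names are assumptions, not a ProxyHat convention):

# Read proxy credentials from the environment instead of class constants
client = ProxyHTTPClient.new(
  proxy_user: ENV.fetch('PROXYHAT_USER'),
  proxy_pass: ENV.fetch('PROXYHAT_PASS')
)
result = client.get('https://httpbin.org/ip')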

You pass ProxyHat's geo-targeting parameters in the username:

# US proxy
client_us = ProxyHTTPClient.new(proxy_user: 'user-country-US', proxy_pass: 'pass')

# Proxy for Berlin, Germany
client_de = ProxyHTTPClient.new(
  proxy_user: 'user-country-DE-city-berlin',
  proxy_pass: 'pass'
)

# Sticky session (same IP for 10 minutes)
client_sticky = ProxyHTTPClient.new(
  proxy_user: 'user-session-abc123-duration-10',
  proxy_pass: 'pass'
)

IP rotation on every request

By default, ProxyHat rotates the IP on every new request; if you need a sticky session, add a session flag to the username. Conversely, generating a fresh session ID for every request guarantees a new IP each time:

require 'securerandom'

class RotatingProxyClient
  def initialize(base_user:, password:)
    @base_user = base_user
    @password = password
  end

  def fetch_with_rotation(urls)
    urls.map do |url|
      session_id = SecureRandom.hex(8)
      proxy_user = "#{@base_user}-session-#{session_id}"

      client = ProxyHTTPClient.new(proxy_user: proxy_user, proxy_pass: @password)
      client.get(url)
    end
  end
end

# Each request = a new IP
rotator = RotatingProxyClient.new(base_user: 'user-country-US', password: 'pass')
results = rotator.fetch_with_rotation([
  'https://httpbin.org/ip',
  'https://httpbin.org/user-agent',
  'https://httpbin.org/headers'
])
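
To confirm the rotation actually happened, you can hit the same IP-echo endpoint several times and compare the returned addresses; httpbin.org/ip reports the caller's IP under the "origin" key:

require 'json'

# Three requests to the same URL, each with a fresh session ID
results = rotator.fetch_with_rotation(Array.new(3) { 'https://httpbin.org/ip' })
ips = results.filter_map { |r| JSON.parse(r[:body])['origin'] if r[:body] }
puts "Distinct IPs: #{ips.uniq.size} of #{ips.size}"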

Typhoeus: parallel requests with libcurl

Typhoeus is a wrapper around libcurl with an idiomatic Ruby interface. Its biggest advantage is Hydra, a mechanism for running HTTP requests in parallel.

Add it to your Gemfile and run bundle install:

gem 'typhoeus'

A basic request through the proxy:

require 'typhoeus'

class TyphoeusProxyClient
  PROXY_URL = 'http://user-country-US:pass@gate.proxyhat.com:8080'

  def initialize(proxy_url: nil)
    @proxy_url = proxy_url || PROXY_URL
  end

  def get(url, follow_location: true, timeout: 30)
    response = Typhoeus.get(
      url,
      proxy: @proxy_url,
      followlocation: follow_location,
      timeout: timeout,
      connecttimeout: 10,
      ssl_verifypeer: true,
      ssl_verifyhost: 2,
      headers: {
        'User-Agent' => 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
        'Accept' => 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'
      }
    )

    if response.success?
      { status: response.code, body: response.body, headers: response.headers }
    elsif response.timed_out?
      { status: 0, error: 'Request timed out' }
    elsif response.code == 0
      { status: 0, error: response.return_message }
    else
      { status: response.code, error: "HTTP error: #{response.code}" }
    end
  end
end

client = TyphoeusProxyClient.new(
  proxy_url: 'http://user-country-DE:pass@gate.proxyhat.com:8080'
)
result = client.get('https://httpbin.org/ip')
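
POST requests go through the proxy the same way, since Typhoeus simply forwards the proxy option to libcurl. A quick sketch posting form data to httpbin:

response = Typhoeus.post(
  'https://httpbin.org/post',
  proxy: 'http://user-country-US:pass@gate.proxyhat.com:8080',
  body: { query: 'ruby proxy' }, # hash bodies are sent form-encoded
  timeout: 30
)
puts response.code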

Hydra: parallel requests

Hydra lets you run hundreds of requests at once, which is key for large-scale scraping:

require 'typhoeus'
require 'concurrent'
require 'securerandom'

class ParallelScraper
  PROXY_BASE = 'http://user-country-US:pass@gate.proxyhat.com:8080'
  MAX_CONCURRENT = 50

  def initialize(urls, proxy_base: nil, max_concurrent: nil)
    @urls = urls
    @proxy_base = proxy_base || PROXY_BASE
    @max_concurrent = max_concurrent || MAX_CONCURRENT
  end

  def scrape_all
    results = Concurrent::Hash.new
    hydra = Typhoeus::Hydra.new(max_concurrency: @max_concurrent)

    @urls.each_with_index do |url, idx|
      request = build_request(url, idx)

      request.on_complete do |response|
        results[url] = process_response(response)
      end

      hydra.queue(request)
    end

    hydra.run
    results
  end

  private

  def build_request(url, idx)
    session_id = "sess-#{idx}-#{SecureRandom.hex(4)}"
    proxy_url = @proxy_base.sub('user-', "user-session-#{session_id}-")

    Typhoeus::Request.new(
      url,
      method: :get,
      proxy: proxy_url,
      timeout: 30,
      connecttimeout: 10,
      followlocation: true,
      ssl_verifypeer: true,
      headers: {
        'User-Agent' => random_user_agent,
        'Accept' => 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Language' => 'en-US,en;q=0.9',
        'Accept-Encoding' => 'gzip, deflate, br'
      }
    )
  end

  def process_response(response)
    if response.success?
      { status: response.code, body: response.body, success: true }
    elsif response.timed_out?
      { status: 0, error: 'timeout', success: false }
    else
      { status: response.code, error: response.return_message, success: false }
    end
  end

  def random_user_agent
    [
      'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
      'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Safari/605.1.15',
      'Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0'
    ].sample
  end
end

# Example: 100 URLs in parallel
urls = (1..100).map { |i| "https://httpbin.org/delay/#{rand(1..3)}?id=#{i}" }

scraper = ParallelScraper.new(urls)
results = scraper.scrape_all

successful = results.count { |_, r| r[:success] }
puts "Pobrano #{successful}/#{urls.size} URL-i"

ProxyHat Ruby SDK: rotation and geo-targeting

ProxyHat ships an SDK that simplifies configuration. Here is a production-grade wrapper with retry logic and a circuit breaker:

require 'net/http'
require 'uri'
require 'json'

module ProxyHat
  class Client
    GATEWAY_HOST = 'gate.proxyhat.com'
    HTTP_PORT = 8080
    SOCKS5_PORT = 1080

    DEFAULT_OPTIONS = {
      timeout: 30,
      max_retries: 3,
      retry_delay: 1.0,
      country: nil,
      city: nil,
      session: nil,
      session_duration: nil
    }.freeze

    attr_reader :username, :password, :options

    def initialize(username:, password:, **options)
      @username = username
      @password = password
      @options = DEFAULT_OPTIONS.merge(options)
      @circuit_breaker = CircuitBreaker.new(
        failure_threshold: 5,
        recovery_timeout: 60
      )
    end

    def get(url, **override_options)
      opts = @options.merge(override_options)
      execute_with_retry(url, opts)
    end

    def post(url, body:, content_type: 'application/json', **override_options)
      opts = @options.merge(override_options)
      execute_with_retry(url, opts, method: :post, body: body, content_type: content_type)
    end

    def build_proxy_username(**opts)
      parts = [username]

      if opts[:country]
        parts << "country-#{opts[:country]}"
        parts << "city-#{opts[:city]}" if opts[:city]
      end

      if opts[:session]
        parts << "session-#{opts[:session]}"
        parts << "duration-#{opts[:session_duration]}" if opts[:session_duration]
      end

      parts.join('-')
    end

    private

    def execute_with_retry(url, opts, method: :get, body: nil, content_type: nil)
      retries = 0

      loop do
        return @circuit_breaker.execute do
          make_request(url, opts, method, body, content_type)
        end
      rescue CircuitBreaker::OpenCircuitError => e
        raise e
      rescue StandardError => e
        retries += 1

        if retries <= opts[:max_retries]
          sleep(opts[:retry_delay] * retries)
        else
          raise ProxyError.new("Max retries exceeded: #{e.message}")
        end
      end
    end

    def make_request(url, opts, method, body, content_type)
      uri = URI.parse(url)
      proxy_user = build_proxy_username(**opts)

      http = Net::HTTP.new(
        uri.host,
        uri.port,
        GATEWAY_HOST,
        HTTP_PORT,
        proxy_user,
        password
      )

      configure_ssl(http, uri)
      http.open_timeout = opts[:timeout]
      http.read_timeout = opts[:timeout]

      request = build_request(uri, method, body, content_type)
      response = http.request(request)

      handle_response(response)
    ensure
      http&.finish if http&.started?
    end

    def configure_ssl(http, uri)
      if uri.scheme == 'https'
        http.use_ssl = true
        http.verify_mode = OpenSSL::SSL::VERIFY_PEER
        http.min_version = OpenSSL::SSL::TLS1_2_VERSION
      end
    end

    def build_request(uri, method, body, content_type)
      request_class = method == :post ? Net::HTTP::Post : Net::HTTP::Get
      request = request_class.new(uri.request_uri)

      request['User-Agent'] = 'ProxyHat-Ruby-SDK/2.0'
      request['Accept'] = '*/*'

      if body
        request['Content-Type'] = content_type
        request.body = content_type == 'application/json' ? JSON.generate(body) : body
      end

      request
    end

    def handle_response(response)
      case response
      when Net::HTTPSuccess
        Response.new(
          status: response.code.to_i,
          body: response.body,
          headers: response.each_header.to_h
        )
      when Net::HTTPRedirection
        Response.new(
          status: response.code.to_i,
          body: nil,
          headers: response.each_header.to_h,
          redirect_to: response['Location']
        )
      else
        raise HTTPError.new(response.code.to_i, response.message)
      end
    end
  end

  class Response
    attr_reader :status, :body, :headers, :redirect_to

    def initialize(status:, body:, headers:, redirect_to: nil)
      @status = status
      @body = body
      @headers = headers
      @redirect_to = redirect_to
    end

    def json
      JSON.parse(@body)
    rescue JSON::ParserError
      nil
    end

    def success?
      (200..299).cover?(@status)
    end
  end

  class ProxyError < StandardError; end
  class HTTPError < StandardError
    attr_reader :status_code

    def initialize(status_code, message)
      @status_code = status_code
      super("HTTP #{status_code}: #{message}")
    end
  end

  class CircuitBreaker
    OpenCircuitError = Class.new(StandardError)

    def initialize(failure_threshold:, recovery_timeout:)
      @failure_threshold = failure_threshold
      @recovery_timeout = recovery_timeout
      @failures = 0
      @last_failure_time = nil
      @mutex = Mutex.new
    end

    def execute
      if open?
        raise OpenCircuitError.new('Circuit breaker is open')
      end

      result = yield
      reset
      result
    rescue StandardError => e
      record_failure
      raise e
    end

    private

    def open?
      @mutex.synchronize do
        return false if @failures < @failure_threshold
        return false if @last_failure_time.nil?

        elapsed = Time.now - @last_failure_time
        if elapsed > @recovery_timeout
          @failures = 0
          return false
        end

        true
      end
    end

    def record_failure
      @mutex.synchronize do
        @failures += 1
        @last_failure_time = Time.now
      end
    end

    def reset
      @mutex.synchronize do
        @failures = 0
        @last_failure_time = nil
      end
    end
  end
end

# Usage
client = ProxyHat::Client.new(
  username: 'user',
  password: 'pass',
  country: 'US',
  max_retries: 3
)

# GET with automatic IP rotation
response = client.get('https://httpbin.org/ip')
puts response.json

# Sticky session for repeated requests
session_client = ProxyHat::Client.new(
  username: 'user',
  password: 'pass',
  country: 'DE',
  city: 'berlin',
  session: 'my-session-123',
  session_duration: 10
)

response1 = session_client.get('https://example.com/page1')
response2 = session_client.get('https://example.com/page2')
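
The post method defined on the client works the same way. For example, sending a JSON payload through the US client created earlier:

response = client.post(
  'https://httpbin.org/post',
  body: { query: 'ruby proxy', page: 1 }
)
puts response.status
puts response.json['json'] if response.success?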

Production scraper: 1,000 URLs in parallel

Here is a complete production scraper with parallel fetching, IP rotation, and error handling:

require 'typhoeus'
require 'concurrent'
require 'json'
require 'securerandom'

class ProductionScraper
  attr_reader :stats

  def initialize(username:, password:, concurrency: 100)
    @username = username
    @password = password
    @concurrency = concurrency
    @stats = Concurrent::Hash.new(0)
    @mutex = Mutex.new
  end

  def scrape(urls, country: 'US', output_file: nil)
    results = Concurrent::Hash.new
    hydra = Typhoeus::Hydra.new(max_concurrency: @concurrency)

    urls.each_with_index do |url, idx|
      request = build_request(url, idx, country)

      request.on_complete do |response|
        handle_response(response, url, results)
      end

      hydra.queue(request)
    end

    start_time = Time.now
    hydra.run
    elapsed = Time.now - start_time

    print_stats(elapsed, urls.size)

    if output_file
      save_results(results, output_file)
    end

    results
  end

  private

  def build_request(url, idx, country)
    session_id = "scraper-#{idx}-#{SecureRandom.uuid}"
    proxy_url = "http://#{@username}-country-#{country}-session-#{session_id}:#{@password}@gate.proxyhat.com:8080"

    Typhoeus::Request.new(
      url,
      method: :get,
      proxy: proxy_url,
      timeout: 45,
      connecttimeout: 15,
      followlocation: true,
      maxredirs: 5,
      ssl_verifypeer: false, # Sometimes has to be disabled when scraping; see the SSL section below
      ssl_verifyhost: 0,
      headers: {
        'User-Agent' => random_user_agent,
        'Accept' => 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
        'Accept-Language' => 'en-US,en;q=0.9',
        'Accept-Encoding' => 'gzip, deflate',
        'Cache-Control' => 'no-cache',
        'Pragma' => 'no-cache'
      }
    )
  end

  def handle_response(response, url, results)
    @mutex.synchronize do
      if response.success?
        results[url] = {
          status: response.code,
          body: response.body,
          size: response.body.bytesize,
          time: response.total_time
        }
        @stats[:success] += 1
      elsif response.timed_out?
        results[url] = { status: 0, error: 'timeout' }
        @stats[:timeout] += 1
      elsif response.code == 403 || response.code == 429
        results[url] = { status: response.code, error: 'blocked/rate-limited' }
        @stats[:blocked] += 1
      else
        results[url] = { status: response.code, error: response.return_message }
        @stats[:failed] += 1
      end
    end
  end

  def print_stats(elapsed, total)
    puts "\n=== Scraping Results ==="
    puts "Total URLs: #{total}"
    puts "Success: #{@stats[:success]}"
    puts "Timeout: #{@stats[:timeout]}"
    puts "Blocked: #{@stats[:blocked]}"
    puts "Failed: #{@stats[:failed]}"
    puts "Time: #{elapsed.round(2)}s"
    puts "Rate: #{(@stats[:success] / elapsed).round(2)} req/s"
    puts "========================\n"
  end

  def save_results(results, filename)
    File.write(filename, JSON.pretty_generate(results.to_h))
    puts "Results saved to #{filename}"
  end

  def random_user_agent
    [
      'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
      'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Safari/605.1.15',
      'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:121.0) Gecko/20100101 Firefox/121.0',
      'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
      'Mozilla/5.0 (iPhone; CPU iPhone OS 17_0 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Mobile/15E148 Safari/604.1'
    ].sample
  end
end

# Run it
scraper = ProductionScraper.new(
  username: 'your_username',
  password: 'your_password',
  concurrency: 100
)

# Generate 1,000 URLs
urls = (1..1000).map do |i|
  "https://httpbin.org/delay/#{rand(1..2)}?id=#{i}"
end

results = scraper.scrape(urls, country: 'US', output_file: 'results.json')

TLS/SSL: certificates, SNI, and self-signed upstreams

When scraping, you may hit servers with self-signed certificates or other SSL errors. Here is how to handle them:

require 'net/http'
require 'openssl'

class SSLAwareProxyClient
  def initialize(
    proxy_host: 'gate.proxyhat.com',
    proxy_port: 8080,
    proxy_user: 'user-country-US',
    proxy_pass: 'pass',
    verify_ssl: true,
    ca_file: nil
  )
    @proxy_host = proxy_host
    @proxy_port = proxy_port
    @proxy_user = proxy_user
    @proxy_pass = proxy_pass
    @verify_ssl = verify_ssl
    @ca_file = ca_file
  end

  def get(url, timeout: 30)
    uri = URI.parse(url)

    http = Net::HTTP.new(
      uri.host,
      uri.port,
      @proxy_host,
      @proxy_port,
      @proxy_user,
      @proxy_pass
    )

    if uri.scheme == 'https'
      configure_ssl(http)
    end

    http.open_timeout = timeout
    http.read_timeout = timeout

    request = Net::HTTP::Get.new(uri.request_uri)
    response = http.request(request)

    { status: response.code.to_i, body: response.body }
  rescue OpenSSL::SSL::SSLError => e
    handle_ssl_error(e)
  ensure
    http&.finish if http&.started?
  end

  private

  def configure_ssl(http)
    http.use_ssl = true

    if @verify_ssl
      http.verify_mode = OpenSSL::SSL::VERIFY_PEER
      http.min_version = OpenSSL::SSL::TLS1_2_VERSION

      # Custom CA (e.g., for a corporate proxy)
      http.ca_file = @ca_file if @ca_file

      # Post-connection hostname check; SNI itself is sent automatically by Net::HTTP
      http.enable_post_connection_check = true
    else
      # ONLY for development or trusted internal servers
      http.verify_mode = OpenSSL::SSL::VERIFY_NONE
      puts "WARNING: SSL verification disabled!"
    end
  end

  def handle_ssl_error(error)
    case error.message
    when /certificate verify failed/
      { status: 0, error: 'SSL certificate verification failed', ssl_error: true }
    when /hostname does not match/
      { status: 0, error: 'SSL hostname mismatch', ssl_error: true }
    when /connection reset/
      { status: 0, error: 'SSL connection reset', ssl_error: true }
    else
      { status: 0, error: "SSL error: #{error.message}", ssl_error: true }
    end
  end
end

# Secure mode (production)
secure_client = SSLAwareProxyClient.new(verify_ssl: true)

# Permissive mode (development/internal)
permissive_client = SSLAwareProxyClient.new(
  proxy_user: 'user-country-DE',
  verify_ssl: false
)

# With a custom CA file
corporate_client = SSLAwareProxyClient.new(
  verify_ssl: true,
  ca_file: '/etc/ssl/certs/corporate-ca.pem'
)

Typhoeus with SSL configuration

require 'typhoeus'

class TyphoeusSSLClient
  def get_with_ssl(url, verify: true, ca_path: nil)
    options = {
      proxy: 'http://user-country-US:pass@gate.proxyhat.com:8080',
      timeout: 30,
      ssl_verifypeer: verify,
      ssl_verifyhost: verify ? 2 : 0
    }

    if ca_path
      options[:ssl_capath] = ca_path
    end

    # SNI is enabled by default in libcurl
    # You can also pin the TLS version, e.g. sslversion: :tlsv1_2

    Typhoeus.get(url, **options)
  end
end

Integrating with Ruby on Rails

Faraday middleware with a proxy

Faraday is a popular HTTP client in the Rails ecosystem. Here is a middleware that integrates ProxyHat:

# config/initializers/proxyhat.rb
require 'faraday'

module ProxyHat
  class FaradayMiddleware < Faraday::Middleware
    PROXY_HOST = 'gate.proxyhat.com'
    PROXY_PORT = 8080

    def initialize(app, username:, password:, country: nil, city: nil)
      super(app)
      @username = username
      @password = password
      @country = country
      @city = city
    end

    def call(env)
      proxy_user = build_proxy_user

      env[:proxy] = {
        uri: "http://#{PROXY_HOST}:#{PROXY_PORT}",
        user: proxy_user,
        password: @password
      }

      @app.call(env)
    end

    private

    def build_proxy_user
      parts = [@username]
      parts << "country-#{@country}" if @country
      parts << "city-#{@city}" if @city
      parts.join('-')
    end
  end
end

# Faraday configuration
module ApiClients
  class Base
    def self.connection(country: nil)
      Faraday.new do |builder|
        builder.use ProxyHat::FaradayMiddleware,
          username: Rails.application.credentials.proxyhat[:username],
          password: Rails.application.credentials.proxyhat[:password],
          country: country

        builder.request :retry,
          max: 3,
          interval: 1.0,
          backoff_factor: 2,
          exceptions: [Faraday::TimeoutError, Faraday::ConnectionFailed]

        builder.response :json, content_type: /json\b/
        builder.response :raise_error

        builder.adapter :typhoeus
      end
    end
  end

  class ScraperClient < Base
    def self.fetch(url, country: 'US')
      connection(country: country).get(url).body
    rescue Faraday::Error => e
      Rails.logger.error("Scraping failed: #{e.message}")
      nil
    end
  end
end
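
For one-off calls the middleware is optional: Faraday also accepts a proxy option directly on the connection, with the credentials embedded in the URL:

conn = Faraday.new(
  url: 'https://httpbin.org',
  proxy: 'http://user-country-US:pass@gate.proxyhat.com:8080'
)
puts conn.get('/ip').body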

ActiveJob: background scraping

# app/jobs/scraping_job.rb
class ScrapingJob < ApplicationJob
  queue_as :scraping

  retry_on ScrapingError, wait: :exponentially_longer, attempts: 3
  discard_on ActiveJob::DeserializationError

  def perform(urls, options = {})
    country = options.fetch(:country, 'US')
    output_file = options[:output_file]

    scraper = ProductionScraper.new(
      username: Rails.application.credentials.proxyhat[:username],
      password: Rails.application.credentials.proxyhat[:password],
      concurrency: options.fetch(:concurrency, 50)
    )

    results = scraper.scrape(urls, country: country)

    if output_file
      save_results(results, output_file)
    end

    # Completion notification
    ScrapingCompletionNotifier.call(results: results, job_id: job_id)

    results
  end

  private

  def save_results(results, filename)
    path = Rails.root.join('storage', 'scraping', filename)
    FileUtils.mkdir_p(File.dirname(path))
    File.write(path, JSON.pretty_generate(results.to_h))
  end
end

# app/jobs/batch_scraping_job.rb
class BatchScrapingJob < ApplicationJob
  queue_as :scraping_batch

  def perform(url_list_id, batch_size: 100)
    url_list = UrlList.find(url_list_id)
    urls = url_list.urls

    urls.each_slice(batch_size).with_index do |batch, idx|
      ScrapingJob.perform_later(
        batch,
        country: url_list.country,
        output_file: "batch_#{idx}_#{url_list.id}.json"
      )
    end
  end
end

# Usage in a controller
class ScrapingController < ApplicationController
  def create
    urls = params[:urls].split("\n").map(&:strip).compact_blank

    BatchScrapingJob.perform_later(create_url_list(urls).id)

    redirect_to scraping_status_path, notice: 'Scraping started'
  end

  private

  def create_url_list(urls)
    UrlList.create!(
      urls: urls,
      country: params[:country] || 'US',
      user: current_user
    )
  end
end

Sidekiq integration

For Sidekiq, add a worker plus server middleware for logging and metrics around your scraping jobs:

# config/initializers/sidekiq.rb
Sidekiq.configure_client do |config|
  config.redis = { url: ENV['REDIS_URL'] }
end

Sidekiq.configure_server do |config|
  config.redis = { url: ENV['REDIS_URL'] }

  # Middleware for logging and metrics
  config.server_middleware do |chain|
    chain.add ScrapingMetricsMiddleware
  end
end
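
The initializer above references ScrapingMetricsMiddleware without defining it. A minimal sketch that times each job might look like this (the log format is an assumption):

# app/middleware/scraping_metrics_middleware.rb
class ScrapingMetricsMiddleware
  # Sidekiq server middleware: wraps each job, measures duration, logs the outcome
  def call(_worker, job, queue)
    started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    yield
    elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
    Sidekiq.logger.info("#{job['class']} on #{queue} finished in #{elapsed.round(2)}s")
  rescue StandardError => e
    Sidekiq.logger.warn("#{job['class']} on #{queue} failed: #{e.message}")
    raise
  end
end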

# app/workers/scraping_worker.rb
class ScrapingWorker
  include Sidekiq::Worker
  sidekiq_options queue: 'scraping', retry: 3, backtrace: true

  def perform(urls, country = 'US')
    @stats = { success: 0, failed: 0 }

    urls.each_slice(50) do |batch|
      process_batch(batch, country)
    end

    logger.info "Completed: #{@stats.inspect}"
  end

  private

  def process_batch(urls, country)
    hydra = Typhoeus::Hydra.new(max_concurrency: 25)

    urls.each do |url|
      request = build_request(url, country)

      request.on_complete do |response|
        if response.success?
          process_success(url, response)
          @stats[:success] += 1
        else
          @stats[:failed] += 1
          logger.warn "Failed: #{url} - #{response.code}"
        end
      end

      hydra.queue(request)
    end

    hydra.run
  end

  def build_request(url, country)
    session = SecureRandom.hex(8)
    proxy = "http://user-country-#{country}-session-#{session}:pass@gate.proxyhat.com:8080"

    Typhoeus::Request.new(
      url,
      proxy: proxy,
      timeout: 30,
      followlocation: true,
      headers: { 'User-Agent' => random_user_agent }
    )
  end

  def process_success(url, response)
    # Persist to the database or a cache
    ScrapedPage.create!(
      url: url,
      content: response.body,
      status: response.code
    )
  rescue ActiveRecord::RecordNotUnique
    # Ignore duplicates
  end

  def random_user_agent
    UserAgents.sample
  end
end
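
Enqueueing works like any other Sidekiq job; here with a batch of URLs read from a file (the filename is illustrative):

urls = File.readlines('urls.txt', chomp: true)
ScrapingWorker.perform_async(urls, 'DE')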

Comparing proxy approaches in Ruby

| Method | Pros | Cons | Use case |
|---|---|---|---|
| Net::HTTP | Stdlib, no dependencies, full control | Synchronous, no connection pooling | Simple scripts, single requests |
| Typhoeus | Parallelism (Hydra), libcurl, fast | Extra gem, native compilation | Large-scale scraping, high concurrency |
| ProxyHat SDK | IP rotation, geo-targeting, retry logic | Depends on an external service | Production scraping with anti-bot bypass |
| Faraday | Middleware, flexible, Rails-friendly | Overhead, requires an adapter | API clients, Rails applications |

Key takeaways

  • Net::HTTP is enough for simple tasks but lacks parallelism; use Typhoeus for scraping at scale.
  • Hydra in Typhoeus runs hundreds of requests in parallel, which is key when fetching thousands of URLs.
  • ProxyHat's IP rotation is driven through the username, so each request can target a different country or city.
  • A circuit breaker and retry logic are a must in production scraping; do not rely on single attempts.
  • Disable SSL verification only in development; always verify certificates in production.
  • ActiveJob/Sidekiq are the natural home for scraping jobs in Rails applications.

Looking for a ready-made residential proxy solution for Ruby? Check ProxyHat's pricing or browse the available locations.
