How to Scrape Twitter/X Data with Proxies in 2025: A Developer's Guide

Learn how to scrape public Twitter/X data using residential proxies after the API restrictions. Includes Python/Playwright examples, rate limit handling, and legal considerations.


Important Disclaimer: This guide covers techniques for accessing publicly available data on X (formerly Twitter). Web scraping must comply with the platform's Terms of Service, robots.txt directives, and applicable laws including the CFAA (US) and GDPR (EU). Always consider official APIs first for production use cases. This content is for educational purposes only.

The Post-API Landscape: Why Teams Are Turning to Web Scraping

If you've been building social monitoring tools or sentiment dashboards over the past few years, you've felt the impact of X's API changes. In 2023, the platform dramatically restructured its API tiers, eliminating the free search endpoint that many developers had relied on for years.

The current API tiers look like this:

| Tier | Monthly Cost | Tweet Cap | Search Access |
|------|--------------|-----------|---------------|
| Free | $0 | 1,500 posts/month | No search endpoint |
| Basic | $100/month | 3,000 posts/month | Limited search |
| Pro | $5,000/month | 10,000 posts/month | Full search access |
| Enterprise | Custom pricing | Unlimited (negotiated) | Full access |

For growth teams and developers building monitoring tools, these changes created a stark choice: pay thousands monthly for API access, or find alternative methods to gather public data. This is where web scraping with residential proxies enters the conversation.

What Data Is Actually Accessible Without Login?

Before diving into technical implementation, understand what X exposes to anonymous visitors. The platform's single-page application (SPA) loads data via GraphQL endpoints, and the accessibility varies significantly:

Publicly Accessible (No Login Required)

  • User profiles: Display name, bio, follower/following counts, join date, profile image
  • Public tweets: Text content, media attachments, timestamp, engagement metrics (likes, retweets, replies)
  • Tweet threads: Full conversation chains for public accounts
  • Trending topics: Current trending hashtags and topics by region
  • Search results: Limited visibility—X applies aggressive rate limiting to anonymous search

Login-Walled or Restricted

  • Protected accounts: All content from private accounts requires authentication
  • Full search history: Anonymous search returns limited results before hitting walls
  • Advanced search filters: Date ranges and boolean operators require login
  • Community posts: Private community content is inaccessible
  • DMs and notifications: Obviously require account access

Key Insight: X applies stricter rate limits to non-authenticated sessions. A logged-in user can browse more content before hitting limits, but even authenticated scraping carries account-level risks. Anonymous scraping with rotating residential proxies often provides better longevity for large-scale operations.
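The publicly accessible fields above map naturally onto a small record type, which keeps downstream pipelines honest about what anonymous scraping can actually deliver. A minimal sketch (the field names are illustrative, not an official X schema):

```python
from typing import Optional, TypedDict

class PublicTweet(TypedDict):
    """Fields X exposes on public tweets without login (illustrative)."""
    id: str
    text: str
    timestamp: str          # ISO 8601, e.g. "2025-01-15T12:00:00.000Z"
    likes: int
    retweets: int
    replies: int
    media_urls: list[str]

class PublicProfile(TypedDict):
    """Anonymous-accessible profile fields (illustrative)."""
    username: str
    display_name: str
    bio: str
    followers: int
    following: int
    join_date: Optional[str]

def make_tweet(raw: dict) -> PublicTweet:
    # Normalize a raw scrape result into the record shape above
    return PublicTweet(
        id=str(raw.get("id", "")),
        text=raw.get("text", ""),
        timestamp=raw.get("timestamp", ""),
        likes=int(raw.get("likes", 0)),
        retweets=int(raw.get("retweets", 0)),
        replies=int(raw.get("replies", 0)),
        media_urls=list(raw.get("media_urls", [])),
    )
```

Normalizing early also makes it obvious when a selector change starts returning empty fields.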

Why Residential Proxies Are Essential for X Scraping

X's anti-bot systems are among the most sophisticated in social media. They employ multiple detection layers:

Datacenter IP Flagging

X maintains extensive blocklists of datacenter IP ranges. When requests originate from AWS, GCP, Azure, or known hosting providers, the platform often returns HTTP 429 (Too Many Requests) immediately or serves CAPTCHA challenges. Datacenter proxies—while fast and cheap—get flagged quickly.

Behavioral Analysis

X tracks request patterns, timing, and user-agent consistency. A single IP making hundreds of requests per minute triggers automated throttling. Residential proxies distribute requests across many IPs, each appearing as a legitimate home or mobile connection.

Rate Limit Tiers

X applies different rate limits based on detection confidence:

| Session Type | Approximate Limit | Detection Trigger |
|--------------|-------------------|-------------------|
| Logged-in user | ~900 requests/15 min | Account-level throttling |
| Anonymous residential | ~300-500 requests/session | IP rotation needed |
| Anonymous datacenter | ~50-100 requests | Often blocked immediately |
| Flagged IP/range | 0 requests | CAPTCHA or 403 |

Mobile Proxies: The Premium Option

For the highest trust scores, mobile proxies (4G/5G) provide IP addresses from real mobile carrier pools. X treats these as the most legitimate traffic source since mobile users naturally share IPs through carrier NAT. However, mobile proxies come at a premium price point.

ProxyHat offers both residential and mobile proxy options with geo-targeting capabilities, allowing you to route requests through specific countries or cities—useful when scraping regional trending topics or location-specific content.
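Before wiring a proxy pool into a scraper, it's worth verifying the gateway works and checking which exit IP you're getting. A quick sketch (the `gate.proxyhat.com:8080` host and `user-country-US` username format are taken from the examples later in this guide; substitute your provider's actual values):

```python
import requests

def build_proxy_url(username: str, password: str,
                    host: str = "gate.proxyhat.com", port: int = 8080) -> str:
    """Assemble the proxy URL that requests/Playwright expect."""
    return f"http://{username}:{password}@{host}:{port}"

def check_exit_ip(proxy_url: str, timeout: float = 10.0) -> str:
    """Return the public IP a target sees when routing through the proxy."""
    resp = requests.get(
        "https://api.ipify.org?format=json",
        proxies={"http": proxy_url, "https": proxy_url},
        timeout=timeout,
    )
    resp.raise_for_status()
    return resp.json()["ip"]

if __name__ == "__main__":
    url = build_proxy_url("user-country-US", "YOUR_PASSWORD")
    print("Exit IP:", check_exit_ip(url))
```

Running this a few times against a rotating endpoint should show different exit IPs; a sticky session should keep returning the same one.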

Technical Implementation: Python + Playwright with Rotating Proxies

X's SPA architecture means you can't simply parse HTML. The real data lives in JSON payloads returned from GraphQL endpoints, embedded in the page or fetched dynamically. Here's how to extract it properly.

Basic Setup with ProxyHat Residential Proxies

import asyncio
from playwright.async_api import async_playwright
import json
import re

# ProxyHat residential proxy configuration
PROXY_CONFIG = {
    "server": "http://gate.proxyhat.com:8080",
    "username": "user-country-US",  # Rotate through US IPs
    "password": "YOUR_PASSWORD"
}

async def create_browser_context(playwright, proxy_config):
    browser = await playwright.chromium.launch(
        proxy=proxy_config,
        headless=True
    )
    context = await browser.new_context(
        user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
        viewport={"width": 1920, "height": 1080}
    )
    return browser, context

async def scrape_profile(username):
    async with async_playwright() as p:
        browser, context = await create_browser_context(p, PROXY_CONFIG)
        page = await context.new_page()
        
        # Navigate to profile
        await page.goto(f"https://x.com/{username}", wait_until="networkidle")
        
        # Wait for tweets to load
        await page.wait_for_selector('[data-testid="tweet"]', timeout=15000)
        
        # Extract embedded JSON data
        tweets = await extract_tweets_from_page(page)
        
        await browser.close()
        return tweets

async def extract_tweets_from_page(page):
    # X embeds initial state in a script tag; the exact global name
    # changes between deployments, so treat this branch as best-effort
    content = await page.content()
    
    pattern = r'<script[^>]*>window\.__INITIAL_STATE__\s*=\s*({.*?})</script>'
    match = re.search(pattern, content, re.DOTALL)
    
    if match:
        try:
            data = json.loads(match.group(1))
            return parse_tweet_data(data)
        except json.JSONDecodeError:
            pass
    
    # Fallback: extract from the rendered DOM
    return await extract_from_dom(page)

def parse_tweet_data(data):
    # Placeholder: X's embedded-state schema is undocumented and shifts
    # between deployments; adapt this to the structure you actually observe
    return data.get("tweets", [])

async def extract_from_dom(page):
    tweets = []
    tweet_elements = await page.query_selector_all('[data-testid="tweet"]')
    
    for tweet_el in tweet_elements[:10]:  # Limit for demo
        text_el = await tweet_el.query_selector('[data-testid="tweetText"]')
        time_el = await tweet_el.query_selector('time')
        
        text = await text_el.inner_text() if text_el else ""
        time = await time_el.get_attribute('datetime') if time_el else ""
        
        tweets.append({
            "text": text,
            "timestamp": time
        })
    
    return tweets

# Run the scraper
if __name__ == "__main__":
    tweets = asyncio.run(scrape_profile("elonmusk"))
    print(json.dumps(tweets, indent=2))

Handling GraphQL Endpoints Directly

For more efficient scraping, intercept the GraphQL requests X makes internally:

import asyncio
from playwright.async_api import async_playwright
import json

class TwitterGraphQLScraper:
    def __init__(self, proxy_config):
        self.proxy_config = proxy_config
        self.graphql_responses = []
    
    async def intercept_graphql(self, response):
        if "graphql" in response.url and response.ok:
            try:
                data = await response.json()
                self.graphql_responses.append({
                    "url": response.url,
                    "data": data
                })
            except Exception:
                # Ignore non-JSON responses (preflight requests, media, etc.)
                pass
    
    async def scrape_search(self, query, max_results=50):
        async with async_playwright() as p:
            browser = await p.chromium.launch(
                proxy=self.proxy_config,
                headless=True
            )
            context = await browser.new_context()
            page = await context.new_page()
            
            # Set up response interception
            page.on("response", self.intercept_graphql)
            
            # Navigate to search (pass the query already URL-encoded)
            search_url = f"https://x.com/search?q={query}&src=typed_query"
            await page.goto(search_url, wait_until="networkidle")
            
            # Scroll to load more results
            for _ in range(3):
                await page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
                await page.wait_for_timeout(2000)
            
            await browser.close()
            
            return self.parse_graphql_responses()
    
    def parse_graphql_responses(self):
        tweets = []
        for response in self.graphql_responses:
            data = response.get("data", {})
            # Navigate the nested GraphQL structure
            # Structure varies by endpoint
            instructions = data.get("data", {}).get("search_by_raw_query", {}).get("timeline", {}).get("instructions", [])
            for instruction in instructions:
                if instruction.get("type") == "TimelineAddEntries":
                    entries = instruction.get("entries", [])
                    for entry in entries:
                        content = entry.get("content", {}).get("itemContent", {})
                        tweet_results = content.get("tweet_results", {}).get("result", {})
                        if tweet_results:
                            legacy = tweet_results.get("legacy", {})
                            tweets.append({
                                "id": tweet_results.get("rest_id"),
                                "text": legacy.get("full_text"),
                                "created_at": legacy.get("created_at"),
                                "likes": legacy.get("favorite_count", 0),
                                "retweets": legacy.get("retweet_count", 0)
                            })
        return tweets

# Usage
PROXY_CONFIG = {
    "server": "http://gate.proxyhat.com:8080",
    "username": "user-country-US-session-abc123",  # Sticky session
    "password": "YOUR_PASSWORD"
}

scraper = TwitterGraphQLScraper(PROXY_CONFIG)
results = asyncio.run(scraper.scrape_search("python%20programming"))
print(json.dumps(results[:10], indent=2))

Node.js Implementation

For teams preferring JavaScript/TypeScript:

const { chromium } = require('playwright');

const PROXY_CONFIG = {
  server: 'http://gate.proxyhat.com:8080',
  username: 'user-country-US',
  password: process.env.PROXYHAT_PASSWORD
};

async function scrapeTweets(username) {
  const browser = await chromium.launch({
    proxy: PROXY_CONFIG,
    headless: true
  });
  
  const context = await browser.newContext({
    userAgent: 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
  });
  
  const page = await context.newPage();
  const tweets = [];
  
  // Intercept GraphQL responses
  page.on('response', async (response) => {
    if (response.url().includes('graphql') && response.ok()) {
      try {
        const json = await response.json();
        // Parse based on endpoint structure
        const entries = json?.data?.user?.result?.timeline?.timeline?.instructions?.[0]?.entries || [];
        entries.forEach(entry => {
          const tweet = entry?.content?.itemContent?.tweet_results?.result;
          if (tweet?.legacy) {
            tweets.push({
              id: tweet.rest_id,
              text: tweet.legacy.full_text,
              created_at: tweet.legacy.created_at,
              likes: tweet.legacy.favorite_count
            });
          }
        });
      } catch (e) {
        // Ignore non-JSON responses
      }
    }
  });
  
  await page.goto(`https://x.com/${username}`, { waitUntil: 'networkidle' });
  await page.waitForSelector('[data-testid="tweet"]', { timeout: 15000 });
  
  // Scroll to trigger more GraphQL requests
  await page.evaluate(() => window.scrollTo(0, document.body.scrollHeight));
  await page.waitForTimeout(3000);
  
  await browser.close();
  return tweets;
}

scrapeTweets('nasa').then(console.log);

Handling Rate Limits and Detection

Even with residential proxies, X will eventually detect scraping patterns. Here's how to build resilience:

1. Implement Exponential Backoff

import asyncio
import random

class RateLimitHandler:
    def __init__(self, max_retries=3):
        self.max_retries = max_retries
    
    async def request_with_backoff(self, page, url):
        for attempt in range(self.max_retries):
            try:
                response = await page.goto(url, wait_until="networkidle")
                
                if response.status == 429:
                    wait_time = self.calculate_backoff(attempt)
                    print(f"Rate limited. Waiting {wait_time}s...")
                    await asyncio.sleep(wait_time)
                    continue
                
                if response.status == 200:
                    return response
                
                # Other errors
                if response.status >= 500:
                    await asyncio.sleep(2 ** attempt)
                    continue
                
                return response
                
            except Exception as e:
                if attempt == self.max_retries - 1:
                    raise
                await asyncio.sleep(2 ** attempt)
        
        return None
    
    def calculate_backoff(self, attempt):
        base = 2 ** attempt
        jitter = random.uniform(0, 1)
        return base + jitter
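The schedule above grows as 2^attempt plus up to one second of jitter, so three retries wait roughly 1-2 s, 2-3 s, then 4-5 s. The same calculation as a standalone, verifiable sketch:

```python
import random

def calculate_backoff(attempt: int) -> float:
    """Exponential backoff with jitter, mirroring RateLimitHandler above."""
    base = 2 ** attempt
    jitter = random.uniform(0, 1)
    return base + jitter

# The schedule stays within predictable bounds for each attempt
for attempt in range(3):
    wait = calculate_backoff(attempt)
    assert 2 ** attempt <= wait <= 2 ** attempt + 1
```

The jitter matters more than it looks: without it, a fleet of workers that got rate-limited together retries together, producing synchronized bursts.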

2. Rotate Proxies Strategically

Don't rotate on every request—that's a red flag. Instead:

  • Session-based rotation: Use sticky sessions for 10-30 minutes, then rotate
  • Request count rotation: Rotate after 50-100 requests per IP
  • Error-triggered rotation: Rotate immediately on 429 or CAPTCHA

# ProxyHat session rotation example: define the helper before using it
def generate_session_id():
    import uuid
    return str(uuid.uuid4())[:8]

PROXY_CONFIG = {
    "server": "http://gate.proxyhat.com:8080",
    "username": f"user-country-US-session-{generate_session_id()}",
    "password": "YOUR_PASSWORD"
}
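The three rotation triggers above can be combined into one small policy object. A sketch (the session-in-username convention follows the ProxyHat examples in this guide; the age and request thresholds are illustrative, and you should confirm the exact username format with your provider):

```python
import time
import uuid

class RotationPolicy:
    """Rotate sticky sessions on age, request count, or block signals."""

    def __init__(self, max_age_s: float = 20 * 60, max_requests: int = 75):
        self.max_age_s = max_age_s
        self.max_requests = max_requests
        self._new_session()

    def _new_session(self):
        self.session_id = uuid.uuid4().hex[:8]
        self.started = time.monotonic()
        self.requests = 0

    def username(self, base: str = "user-country-US") -> str:
        """Build the proxy username carrying the current sticky session."""
        return f"{base}-session-{self.session_id}"

    def record(self, status: int):
        """Call after each request; rotates the session when a trigger fires."""
        self.requests += 1
        too_old = time.monotonic() - self.started > self.max_age_s
        too_many = self.requests >= self.max_requests
        blocked = status in (403, 429)   # error-triggered rotation
        if too_old or too_many or blocked:
            self._new_session()
```

A scraper would call `policy.record(response.status)` after each navigation and rebuild its browser context whenever `policy.session_id` changes.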

3. Mimic Human Behavior

  • Add random delays between requests (2-5 seconds)
  • Scroll pages naturally before extracting data
  • Don't request the same endpoint repeatedly from one session
  • Vary your user agents and viewport sizes
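These habits are easy to centralize in a few helpers. A sketch, assuming the Playwright async API used throughout this guide (the user-agent and viewport lists are illustrative; extend them with real browser fingerprints):

```python
import asyncio
import random

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
]
VIEWPORTS = [{"width": 1920, "height": 1080}, {"width": 1366, "height": 768}]

def random_fingerprint() -> dict:
    """Pick a user agent and viewport to vary per browser context."""
    return {
        "user_agent": random.choice(USER_AGENTS),
        "viewport": random.choice(VIEWPORTS),
    }

async def human_delay(low: float = 2.0, high: float = 5.0) -> float:
    """Sleep a random interval between requests; returns the delay used."""
    delay = random.uniform(low, high)
    await asyncio.sleep(delay)
    return delay

async def natural_scroll(page, steps: int = 4):
    """Scroll in uneven increments instead of jumping straight to the bottom."""
    for _ in range(steps):
        await page.evaluate(f"window.scrollBy(0, {random.randint(400, 900)})")
        await asyncio.sleep(random.uniform(0.5, 1.5))
```

Pass `random_fingerprint()` values into `browser.new_context(...)` when creating each context, and call `human_delay()` between page loads.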

4. Monitor and Adapt

Track success rates per IP and proxy pool. When a proxy range shows elevated 429 rates, exclude it temporarily. ProxyHat's geo-targeting lets you shift traffic to different regions if one area shows signs of flagging.
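Tracking per-session success rates is a few lines of bookkeeping. A minimal sketch (the sample-size and error-rate thresholds are illustrative starting points, not tuned values):

```python
from collections import defaultdict

class ProxyHealth:
    """Track request outcomes per proxy session and flag unhealthy ones."""

    def __init__(self, min_samples: int = 20, max_error_rate: float = 0.3):
        self.min_samples = min_samples
        self.max_error_rate = max_error_rate
        self.stats = defaultdict(lambda: {"ok": 0, "errors": 0})

    def record(self, session_id: str, status: int):
        bucket = self.stats[session_id]
        if status in (403, 429):
            bucket["errors"] += 1
        else:
            bucket["ok"] += 1

    def is_healthy(self, session_id: str) -> bool:
        bucket = self.stats[session_id]
        total = bucket["ok"] + bucket["errors"]
        if total < self.min_samples:
            return True  # not enough data to judge yet
        return bucket["errors"] / total <= self.max_error_rate
```

Check `is_healthy()` before reusing a session, and exclude flagged sessions for a cooldown period rather than permanently, since residential IPs churn.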

Legal and Ethical Considerations

The legal landscape around social media scraping remains unsettled, but several principles are clear:

Terms of Service Violations

X's Terms of Service explicitly prohibit unauthorized scraping. While a ToS violation alone doesn't create criminal liability, it can result in:

  • Account termination
  • IP blocking
  • Civil lawsuits for breach of contract

CFAA Considerations

In the United States, the Computer Fraud and Abuse Act (CFAA) has been invoked in scraping cases. However, recent court decisions (including Van Buren v. United States) have narrowed CFAA's application to situations involving unauthorized access to protected systems. Publicly accessible data on social platforms occupies a gray area.

GDPR and Data Protection

In the EU, scraping personal data—even publicly posted—triggers GDPR obligations. If you're storing or processing data of EU residents, you need:

  • Legal basis for processing
  • Privacy notices to data subjects
  • Right to deletion mechanisms

Recent Legal Precedents

X has filed lawsuits against scraping operations, with mixed results. In several cases, courts have recognized that public data access differs from hacking, but contract claims (ToS violations) have been upheld. The key is scale and intent—small-scale research faces less risk than commercial operations competing with X's own data products.

When to Use Official APIs Instead

Consider the official X API when:

  • You need reliable, legal access for a commercial product
  • Your use case fits within the Basic or Pro tier limits
  • You require real-time streaming capabilities
  • You need access to historical data beyond what's on the public web
  • Your organization can't accept legal risk

Scraping may be appropriate when:

  • You're conducting academic research with IRB approval
  • You need data not available through any API tier
  • Your scale is modest and sporadic
  • You're building internal tools, not commercial products

Practical Advice: For production sentiment dashboards and monitoring tools, the Pro tier API at $5,000/month may actually be more cost-effective than maintaining a robust scraping infrastructure with proxy costs, development time, and legal risk. Run the numbers for your specific use case.
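That comparison is easy to make concrete. A back-of-the-envelope sketch (the bandwidth-per-post figure, proxy price per GB, and maintenance cost are placeholder assumptions, not quoted rates; plug in your own numbers):

```python
def api_cost_per_month(tier_price: float) -> float:
    """Official API cost is simply the tier price."""
    return tier_price

def scraping_cost_per_month(
    posts: int,
    mb_per_post: float = 0.5,         # assumed page weight per scraped post
    proxy_price_per_gb: float = 8.0,  # assumed residential proxy pricing
    eng_hours: float = 10.0,          # assumed monthly maintenance time
    hourly_rate: float = 100.0,
) -> float:
    """Estimate monthly scraping cost: proxy bandwidth plus engineering time."""
    bandwidth_gb = posts * mb_per_post / 1024
    return bandwidth_gb * proxy_price_per_gb + eng_hours * hourly_rate

if __name__ == "__main__":
    posts = 100_000
    print(f"API (Pro tier): ${api_cost_per_month(5000):,.2f}")
    print(f"Scraping est.:  ${scraping_cost_per_month(posts):,.2f}")
```

Under these assumptions scraping wins comfortably at moderate volumes, but the engineering term dominates at small scale and the crossover moves quickly if selectors break often or proxy prices rise.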

Key Takeaways

  • API restrictions drove scraping adoption: X's elimination of free search and high Pro tier pricing pushed many teams toward web scraping as the only viable option for moderate-scale data access.
  • Residential proxies are non-negotiable: Datacenter IPs get flagged almost immediately. Residential and mobile proxies provide the trust scores needed for sustainable scraping.
  • GraphQL interception is more efficient: Rather than parsing HTML, intercept X's internal GraphQL responses for structured JSON data.
  • Rate limits require adaptive strategies: Implement exponential backoff, session-based proxy rotation, and human-like behavior patterns.
  • Legal risk varies by use case: Academic research faces different exposure than commercial products. Always evaluate whether official APIs are the better choice.
  • Respect platform boundaries: Don't attempt to access private accounts, circumvent login walls, or scrape at scales that could impact platform performance.

For teams building social monitoring infrastructure, ProxyHat's residential proxy network provides the rotating IP pool needed for reliable data collection. Combined with proper rate limit handling and legal awareness, you can build sustainable data pipelines for sentiment analysis, trend monitoring, and competitive intelligence.

Ready to get started?

Access 50M+ residential IPs across 148+ countries with AI-powered filtering.
