Why YouTube Data Extraction Demands More Than the Official API
YouTube is the world's second-largest search engine and the richest public repository of video metadata on the internet. For media analytics teams tracking creator trends, brand-safety monitors auditing ad placements, and research groups studying misinformation, YouTube data is essential.
The YouTube Data API v3 exists — and for small projects it works fine. But the moment you need comment threads at scale, early trend detection before videos hit the algorithm, or systematic ad-monitoring, the API's quota walls turn your pipeline into a trickle. That's when teams turn to YouTube data extraction via the internal InnerTube API and, critically, YouTube residential proxies to keep the requests flowing.
Legal & ethical disclaimer: This guide covers accessing public YouTube data only. Scraping that bypasses authentication, circumvents technical measures, or violates YouTube's Terms of Service may breach the CFAA (US), GDPR (EU), or other laws. Always respect creators' ownership — do not redistribute transcripts, video content, or personal data. Where official APIs exist and meet your needs, prefer them.
When the YouTube Data API v3 Is Enough — And When It Isn't
The Data API v3 is well-documented and stable. If your use case is monitoring a handful of channels or fetching metadata for under 10,000 videos per day, it may be all you need.
Quota costs that add up fast
YouTube allocates each project 10,000 quota units per day. The cost per request varies dramatically:
| Endpoint | Quota Cost | Requests / Day at 10k Units |
|---|---|---|
| `videos.list` | 1 unit | 10,000 |
| `search.list` | 100 units | 100 |
| `commentThreads.list` | 1 unit | 10,000 |
| `channels.list` | 1 unit | 10,000 |
At first glance, videos.list seems generous. But a comment-scraping pipeline that fetches threads and then replies for a trending video can burn through 10,000 units in under an hour. search.list at 100 units per call is effectively unusable for any real-time monitoring — you get 100 searches a day, full stop.
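To make the budget math concrete, here is a quick back-of-the-envelope calculator using the per-endpoint costs from the table above (the workload numbers are illustrative, not from any official source):

```python
# Rough budget math for the YouTube Data API v3, using the
# per-endpoint quota costs from the table above.
QUOTA_COSTS = {"videos.list": 1, "search.list": 100, "commentThreads.list": 1}
DAILY_QUOTA = 10_000

def max_requests_per_day(endpoint: str) -> int:
    """How many calls to a single endpoint fit inside the daily quota."""
    return DAILY_QUOTA // QUOTA_COSTS[endpoint]

# A modest monitoring job: 50 searches plus 25 comment pages for each of
# 200 videos already consumes the entire daily allowance.
daily_cost = 50 * QUOTA_COSTS["search.list"] + 200 * 25 * QUOTA_COSTS["commentThreads.list"]
print(max_requests_per_day("search.list"))  # 100
print(daily_cost)  # 10000
```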
What the API can't give you at all
- Full comment thread depth — the API paginates at 100 comments per page and the deeper you go, the more requests you burn.
- Auto-generated transcripts — there is no endpoint for captions/transcripts in Data API v3.
- Real-time view-count deltas — the API caches counts; InnerTube returns fresher figures.
- Ad and sponsorship metadata — no API access whatsoever.
- Recommendation graph data — the "related videos" the API returns differ from what real users see.
This is where scraping YouTube with proxies becomes necessary — not to access private data, but to retrieve the same public data a browser sees, at the speed and depth your product requires.
Public YouTube Data You Can Access Without Logging In
YouTube serves most of its public-facing content to anonymous browser sessions. Here's what's accessible without authentication:
- Video metadata — title, description, view count, like count, upload date, duration, thumbnail URLs, category.
- Channel pages — subscriber counts, video lists, about sections, banner images.
- Comment threads — top-level comments and replies, including like counts and timestamps.
- Auto-generated transcripts — available when the creator has enabled auto-captions (most videos).
- Search results — video, channel, and playlist results for any query.
- Trending pages — per-category and per-country trending feeds.
Data that requires login and is out of scope for ethical scraping: watch history, private/unlisted videos, subscription feeds, and any personally identifying channel data the creator hasn't made public.
Understanding the YouTube InnerTube API
When you open YouTube in a browser, the page doesn't call youtube.googleapis.com. It calls YouTube's internal API — InnerTube — at www.youtube.com/youtubei/v1/. This is the same API the mobile apps use, and it returns structured JSON far richer than what the public Data API offers.
Key InnerTube endpoints
| Endpoint | Purpose | Notes |
|---|---|---|
| `/youtubei/v1/player` | Video metadata, streaming info | Returns playability status, length, views |
| `/youtubei/v1/next` | Comments, related videos | Paginated via continuation tokens |
| `/youtubei/v1/search` | Search results | Much cheaper than Data API search |
| `/youtubei/v1/browse` | Channel pages, trending | Channel video grids, tabs |
Continuation tokens and pagination
InnerTube doesn't use page numbers. Instead, each response includes a continuation token — an opaque string you pass in the next request to get the following page. For comment threads, this looks like:
```json
{
  "context": {
    "client": {
      "clientName": "WEB",
      "clientVersion": "2.20240501.00.00"
    }
  },
  "continuation": "EgSC4oCEDqD9hwAAAEJCCtQzWUZB..."
}
```
You extract the token from the previous response's continuationItems array and feed it back until the array is empty. This pattern lets you walk through arbitrarily long comment threads — something the Data API makes prohibitively expensive.
Required headers and fingerprinting
InnerTube expects a specific set of headers that mimic a real browser session. At minimum, include:
- `Content-Type: application/json`
- `User-Agent` — a current Chrome UA string
- `X-YouTube-Client-Name: 1` (WEB client)
- `X-YouTube-Client-Version` — matches the version in your context payload
Missing or stale headers are one of the fastest ways to get your requests flagged. Google's anti-bot systems cross-reference your headers, TLS fingerprint, and IP reputation — which is exactly why YouTube residential proxies matter.
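Because the client version appears in both the context payload and the `X-YouTube-Client-Version` header, it is worth deriving both from a single constant so they can never drift apart. A minimal sketch (the version string and function name are illustrative):

```python
# Build InnerTube headers and context from one version constant so the
# two values can never disagree -- a mismatch is an easy bot signal.
CLIENT_VERSION = "2.20240501.00.00"  # keep current; stale versions get flagged

def build_innertube_request_parts() -> tuple[dict, dict]:
    """Return a (context, headers) pair sharing the same client version."""
    context = {
        "client": {"clientName": "WEB", "clientVersion": CLIENT_VERSION}
    }
    headers = {
        "Content-Type": "application/json",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36",
        "X-YouTube-Client-Name": "1",  # numeric ID for the WEB client
        "X-YouTube-Client-Version": CLIENT_VERSION,
    }
    return context, headers

context, headers = build_innertube_request_parts()
assert headers["X-YouTube-Client-Version"] == context["client"]["clientVersion"]
```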
Why Residential Proxies Are Essential for YouTube Scraping
Google operates one of the most sophisticated bot-detection systems on the internet. When you send hundreds or thousands of requests from a datacenter IP range, the pattern is trivial to identify:
- ASN fingerprinting — Google maintains lists of hosting provider ASNs (AWS, DigitalOcean, Hetzner, etc.). Requests from these ranges receive CAPTCHAs or HTTP 429 responses almost immediately.
- Behavioral analysis — request cadence, header ordering, and TLS handshake patterns are compared against known bot profiles.
- IP reputation scoring — datacenter IPs have low trust scores because they've historically been used for scraping, spam, and credential stuffing.
Residential vs. datacenter for YouTube
| Factor | Datacenter Proxies | Residential Proxies |
|---|---|---|
| IP trust with Google | Low — flagged quickly | High — appears as real user |
| Success rate (first 1k requests) | ~40–60% | ~92–98% |
| CAPTCHA frequency | High | Low (with proper rate limiting) |
| Cost per GB | Lower | Higher |
| Best for | Small one-off tasks | Sustained scraping at scale |
For sustained YouTube data extraction, residential proxies are not optional — they are the difference between a pipeline that works and one that spends 80% of its time solving CAPTCHAs. Mobile proxies (which rotate from real carrier IP pools) offer even higher trust scores for the most aggressive workloads.
Python: Scraping YouTube Metadata & Comments with InnerTube
Let's build a practical pipeline that fetches video metadata and comments via InnerTube, rotating through residential proxies on each request.
Setup
```bash
pip install requests youtube-transcript-api
```
Fetching video metadata via InnerTube player endpoint
```python
import requests
import random
import string

PROXY_URL = "http://user-country-US:YOUR_PASSWORD@gate.proxyhat.com:8080"
PROXIES = {"http": PROXY_URL, "https": PROXY_URL}

INNER_TUBE_CONTEXT = {
    "client": {
        "clientName": "WEB",
        "clientVersion": "2.20240501.00.00",
        "hl": "en",
        "gl": "US",
    }
}

HEADERS = {
    "Content-Type": "application/json",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36",
    "X-YouTube-Client-Name": "1",
    "X-YouTube-Client-Version": "2.20240501.00.00",
}

def get_video_metadata(video_id: str) -> dict:
    """Fetch video metadata via the InnerTube player endpoint."""
    payload = {
        "context": INNER_TUBE_CONTEXT,
        "videoId": video_id,
    }
    resp = requests.post(
        "https://www.youtube.com/youtubei/v1/player",
        json=payload,
        headers=HEADERS,
        proxies=PROXIES,
        timeout=15,
    )
    resp.raise_for_status()
    details = resp.json().get("videoDetails", {})
    return {
        "video_id": video_id,
        "title": details.get("title"),
        "view_count": details.get("viewCount"),
        "length_seconds": details.get("lengthSeconds"),
        "channel": details.get("author"),
        "description": details.get("shortDescription"),
    }

# Example usage
meta = get_video_metadata("dQw4w9WgXcQ")
print(meta)
```
Walking comment threads with continuation tokens
```python
import random
import time

def get_initial_comments(video_id: str) -> tuple[list, str | None]:
    """Fetch the first page of comments and return (comments, continuation_token)."""
    payload = {
        "context": INNER_TUBE_CONTEXT,
        "videoId": video_id,
    }
    resp = requests.post(
        "https://www.youtube.com/youtubei/v1/next",
        json=payload,
        headers=HEADERS,
        proxies=PROXIES,
        timeout=15,
    )
    resp.raise_for_status()
    data = resp.json()

    # Navigate the nested structure to find comment data.
    # The response is deeply nested; this is a simplified extractor.
    comments = []
    continuation_token = None
    results = (
        data.get("contents", {})
        .get("twoColumnWatchNextResults", {})
        .get("results", {})
        .get("results", {})
        .get("contents", [])
    )
    for renderer in results:
        if "itemSectionRenderer" in renderer:
            section = renderer["itemSectionRenderer"]
            for item in section.get("contents", []):
                if "commentThreadRenderer" in item:
                    comment = item["commentThreadRenderer"]["comment"]["commentRenderer"]
                    comments.append({
                        "author": comment.get("authorText", {}).get("simpleText"),
                        "text": comment.get("contentText", {}).get("runs", [{}])[0].get("text"),
                        "likes": comment.get("voteCount", {}).get("simpleText", "0"),
                    })
                if "continuationItemRenderer" in item:
                    continuation_token = (
                        item["continuationItemRenderer"]
                        .get("continuationEndpoint", {})
                        .get("continuationCommand", {})
                        .get("token")
                    )
    return comments, continuation_token

def get_next_comments(continuation_token: str) -> tuple[list, str | None]:
    """Fetch the next page of comments using a continuation token."""
    payload = {
        "context": INNER_TUBE_CONTEXT,
        "continuation": continuation_token,
    }
    resp = requests.post(
        "https://www.youtube.com/youtubei/v1/next",
        json=payload,
        headers=HEADERS,
        proxies=PROXIES,
        timeout=15,
    )
    resp.raise_for_status()
    data = resp.json()

    comments = []
    next_token = None
    for item in data.get("onResponseReceivedEndpoints", []):
        for entry in item.get("continuationItems", []):
            if "commentThreadRenderer" in entry:
                comment = entry["commentThreadRenderer"]["comment"]["commentRenderer"]
                comments.append({
                    "author": comment.get("authorText", {}).get("simpleText"),
                    "text": comment.get("contentText", {}).get("runs", [{}])[0].get("text"),
                    "likes": comment.get("voteCount", {}).get("simpleText", "0"),
                })
            if "continuationItemRenderer" in entry:
                next_token = (
                    entry["continuationItemRenderer"]
                    .get("continuationEndpoint", {})
                    .get("continuationCommand", {})
                    .get("token")
                )
    return comments, next_token

def scrape_all_comments(video_id: str, max_pages: int = 50) -> list:
    """Scrape comments with rate limiting and proxy rotation."""
    all_comments = []
    comments, token = get_initial_comments(video_id)
    all_comments.extend(comments)
    page = 1
    while token and page < max_pages:
        time.sleep(random.uniform(1.5, 3.5))  # Respectful delay
        comments, token = get_next_comments(token)
        all_comments.extend(comments)
        page += 1
        print(f"Page {page}: {len(all_comments)} comments collected")
    return all_comments
```
Fetching transcripts with youtube-transcript-api through a proxy
```python
import os

from youtube_transcript_api import YouTubeTranscriptApi

# The youtube-transcript-api library uses requests under the hood,
# so you can set the proxy via environment variables or pass a proxies dict.
os.environ["HTTP_PROXY"] = PROXY_URL
os.environ["HTTPS_PROXY"] = PROXY_URL

def get_transcript(video_id: str) -> list[dict]:
    """Fetch the auto-generated transcript for a public video."""
    try:
        transcript_list = YouTubeTranscriptApi.get_transcript(video_id, languages=["en"])
        return [
            {"text": entry["text"], "start": entry["start"], "duration": entry["duration"]}
            for entry in transcript_list
        ]
    except Exception as e:
        print(f"Transcript unavailable for {video_id}: {e}")
        return []

transcript = get_transcript("dQw4w9WgXcQ")
for line in transcript[:5]:
    print(f"[{line['start']:.1f}s] {line['text']}")
```
Node.js: InnerTube Video Metadata Fetcher
For teams running JavaScript pipelines, here's a Node.js equivalent for fetching video metadata through a residential proxy:
```javascript
const https = require('https');
const { HttpsProxyAgent } = require('https-proxy-agent');

const PROXY_URL = 'http://user-country-US:YOUR_PASSWORD@gate.proxyhat.com:8080';
const agent = new HttpsProxyAgent(PROXY_URL);

const INNER_TUBE_CONTEXT = {
  client: {
    clientName: 'WEB',
    clientVersion: '2.20240501.00.00',
    hl: 'en',
    gl: 'US'
  }
};

const HEADERS = {
  'Content-Type': 'application/json',
  'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36',
  'X-YouTube-Client-Name': '1',
  'X-YouTube-Client-Version': '2.20240501.00.00'
};

async function getVideoMetadata(videoId) {
  const payload = JSON.stringify({ context: INNER_TUBE_CONTEXT, videoId });
  return new Promise((resolve, reject) => {
    const req = https.request({
      hostname: 'www.youtube.com',
      path: '/youtubei/v1/player',
      method: 'POST',
      headers: { ...HEADERS, 'Content-Length': Buffer.byteLength(payload) },
      agent
    }, (res) => {
      let data = '';
      res.on('data', (chunk) => (data += chunk));
      res.on('end', () => {
        try {
          const parsed = JSON.parse(data);
          resolve({
            video_id: videoId,
            title: parsed.videoDetails?.title,
            view_count: parsed.videoDetails?.viewCount,
            channel: parsed.videoDetails?.author
          });
        } catch (err) {
          // A non-JSON body usually means a CAPTCHA or block page.
          reject(err);
        }
      });
    });
    req.on('error', reject);
    req.write(payload);
    req.end();
  });
}

getVideoMetadata('dQw4w9WgXcQ').then(console.log);
```
Proxy Rotation Strategy for YouTube Scraping
Even with residential IPs, Google monitors request patterns. A single IP making 200 requests in 60 seconds will still get flagged. Here's how to structure your rotation:
Per-request rotation (sticky sessions not needed)
For fetching independent video metadata pages, rotate the IP on every request. With ProxyHat, you can use a random session identifier in the username to get a fresh IP each time:
```python
import random
import string

def get_rotating_proxy() -> str:
    """Generate a proxy URL with a random session ID for per-request IP rotation."""
    session = ''.join(random.choices(string.ascii_lowercase + string.digits, k=8))
    return f"http://user-session-{session}-country-US:YOUR_PASSWORD@gate.proxyhat.com:8080"

# Use in requests -- one fresh session (and therefore one fresh IP) per request:
proxy = get_rotating_proxy()
proxies = {"http": proxy, "https": proxy}
```
Sticky sessions for paginated data
When walking comment threads with continuation tokens, keep the same IP for the entire thread. A session that suddenly shifts IPs mid-conversation looks suspicious. Use a consistent session ID:
```python
def get_sticky_proxy(session_id: str, country: str = "US") -> str:
    """Return a proxy URL that maintains the same IP for the session."""
    return f"http://user-session-{session_id}-country-{country}:YOUR_PASSWORD@gate.proxyhat.com:8080"

# Use the same session_id for all continuation requests in one thread
session_id = "comment-thread-abc123"
sticky_proxy = get_sticky_proxy(session_id)
```
Rate limiting discipline
- 1–2 requests per second per IP — mimic a human browsing speed.
- Randomized delays — add jitter (1.5–3.5 seconds) between requests.
- Concurrent IP limit — don't exceed ~5 concurrent connections from the same IP.
- Daily IP budget — rotate through enough IPs that no single IP makes more than ~300 requests per day.
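The rules above can be packaged in a small helper that each worker calls before firing a request. A minimal sketch, assuming one proxy session per worker (class and parameter names are illustrative):

```python
import random
import time

class PoliteLimiter:
    """Space out requests per proxy session: a base interval plus random jitter."""

    def __init__(self, min_interval: float = 0.5,
                 jitter: tuple[float, float] = (1.5, 3.5)):
        self.min_interval = min_interval   # floor between two requests on one IP
        self.jitter = jitter               # extra randomized delay range (seconds)
        self._last: dict[str, float] = {}  # session_id -> last request timestamp

    def wait(self, session_id: str) -> None:
        """Block until this session is allowed to fire its next request."""
        elapsed = time.monotonic() - self._last.get(session_id, 0.0)
        # Base spacing keeps us at human browsing speed; jitter breaks up
        # the robotic fixed cadence that behavioral analysis looks for.
        delay = max(0.0, self.min_interval - elapsed) + random.uniform(*self.jitter)
        time.sleep(delay)
        self._last[session_id] = time.monotonic()

# One shared limiter; call wait(session_id) before every InnerTube request.
limiter = PoliteLimiter()
```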
Handling CAPTCHAs and Blocks
Even with best practices, you'll occasionally hit CAPTCHAs. Here's how to handle them gracefully:
- Detect early — YouTube returns CAPTCHA pages as HTML, not JSON. If your response isn't valid JSON, discard that IP and rotate.
- Exponential backoff — if a proxy gets blocked, wait before retrying with a new IP.
- Geo-match your targets — scraping US trending pages from a German residential IP looks odd. Use `country-US` geo-targeting on your proxy to match.
- Diversify your fingerprint — rotate User-Agent strings, vary header order slightly, and use realistic `clientVersion` values.
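Putting the detection and backoff points together: treat any non-JSON response as a block signal, discard that IP, and retry with exponentially growing waits. A sketch under the assumption that `proxy_factory` is any callable returning a fresh proxy URL (such as a rotating-session helper like the one in the rotation section):

```python
import time

import requests

def post_with_backoff(url: str, payload: dict, headers: dict,
                      proxy_factory, max_retries: int = 4) -> dict:
    """POST to an InnerTube endpoint, rotating IPs with exponential backoff.

    proxy_factory is any zero-argument callable returning a fresh proxy URL.
    """
    for attempt in range(max_retries):
        proxy = proxy_factory()
        try:
            resp = requests.post(url, json=payload, headers=headers,
                                 proxies={"http": proxy, "https": proxy},
                                 timeout=15)
            # CAPTCHA interstitials come back as HTML; .json() raises on them.
            return resp.json()
        except (requests.RequestException, ValueError):
            # Blocked or bad proxy: wait 2, 4, 8... seconds, then retry
            # through a different IP.
            time.sleep(2 ** (attempt + 1))
    raise RuntimeError(f"Still blocked after {max_retries} attempts: {url}")
```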
When to Use the Official YouTube API Instead
Residential proxy scraping is powerful, but it's not always the right tool. Prefer the Data API v3 when:
- You need fewer than ~5,000 video metadata lookups per day — the API's 1-unit cost makes this trivial.
- You're building a production service that needs SLA guarantees — official APIs are stable; scraping endpoints can change without notice.
- You need channel ownership verification or authenticated actions (uploading, deleting comments, etc.).
- Your use case involves fewer than 100 searches per day — accept the 100-unit cost and move on.
Combine both: use the Data API for metadata lookups (cheap) and InnerTube + proxies for comments, transcripts, and search at scale.
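That split can be expressed as a simple routing rule. The thresholds below mirror the guidance in this section but are illustrative, not official limits:

```python
# Route each data need to the cheapest reliable source. "data_api" is the
# official Data API v3; "innertube" is InnerTube behind residential proxies.
def choose_backend(task: str, daily_volume: int) -> str:
    if task == "metadata" and daily_volume <= 5_000:
        return "data_api"   # 1 unit/call -- well inside the 10k daily quota
    if task == "search" and daily_volume <= 100:
        return "data_api"   # 100 units/call caps you at 100 searches/day
    if task in {"metadata", "search", "comments", "transcripts"}:
        return "innertube"  # no quota, but needs proxies and rate limiting
    raise ValueError(f"Unknown task: {task}")

print(choose_backend("metadata", 1_000))   # data_api
print(choose_backend("search", 5_000))     # innertube
print(choose_backend("comments", 10))      # innertube
```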
Ethical Scraping: Respecting Creators and the Platform
Technical capability does not equal ethical permission. Follow these principles:
- Never redistribute transcripts or video content — auto-generated transcripts are derivative works of the creator's original content. Scraping them for your own analysis is one thing; publishing them is another entirely.
- Respect rate limits — even without authentication, hammering YouTube's servers degrades the experience for everyone. Throttle your requests.
- Honor `robots.txt` directives where technically feasible — Google's `robots.txt` is complex; at minimum, avoid endpoints it explicitly disallows.
- Minimize personal data collection — comment text can contain personal information. Apply GDPR/CCPA principles even if you're not legally required to.
- Use official APIs when they suffice — if the Data API meets your needs within quota, don't scrape.
- Don't build competitive products from scraped data — creating a YouTube clone or a direct competitor using scraped data violates ToS and is ethically dubious.
Key Takeaways
- The YouTube Data API v3 is sufficient for low-volume metadata lookups but becomes prohibitively expensive for comments, search, and transcripts at scale.
- The InnerTube API (`/youtubei/v1/player`, `/next`, `/browse`) returns the same rich JSON a browser sees — including comments, transcripts, and real-time view counts.
- Residential proxies are essential for YouTube scraping at scale because Google aggressively flags datacenter IP ranges.
- Use per-request rotation for independent fetches and sticky sessions for paginated comment threads.
- Maintain respectful rate limits: 1–2 req/s per IP, randomized delays, and geo-matched proxies.
- Never redistribute transcripts or scraped video content — this violates creator ownership and potentially copyright law.
- Combine the Data API (for cheap metadata) with InnerTube + proxies (for comments, search, transcripts) for the most cost-effective pipeline.
Ready to build your YouTube data pipeline? Explore ProxyHat's residential proxy plans to get started with geo-targeted, rotating IPs designed for scale — or check out our web scraping use case guide for more implementation patterns.