Proxy and Account Management Strategy

This document describes the intelligent resource management strategy (for proxies and accounts) used by the ytdlp-ops-server. The goal of this system is to maximize the success rate, minimize blocks, and ensure fault tolerance.

The server can run in different roles to support a distributed architecture, separating management tasks from token generation work.


Service Roles and Architecture

The server is designed to run in one of three roles, specified by the --service-role flag:

  • management: A single, lightweight service instance responsible for all management API calls.

    • Purpose: Provides a centralized endpoint for monitoring and managing the state of all proxies and accounts across the system.
    • Behavior: Exposes only management functions (getProxyStatus, banAccount, etc.). Calls to token generation functions will fail.
    • Deployment: Runs as a single container (ytdlp-ops-management) and exposes its port directly to the host (e.g., port 9091), bypassing Envoy.
  • worker: The primary workhorse for token and info.json generation.

    • Purpose: Handles all token generation requests.
    • Behavior: Implements the full API, but its management functions are scoped to its own server_identity.
    • Deployment: Runs as a scalable service (ytdlp-ops-worker) behind the Envoy load balancer (e.g., port 9080).
  • all-in-one (Default): A single instance that performs both management and worker roles. Ideal for local development or small-scale deployments.

This architecture allows for a robust, federated system where workers manage their own resources locally, while a central service provides a global view for management and monitoring.


1. Account Lifecycle Management (Cooldown / Resting)

Goal: To prevent excessive use and subsequent blocking of accounts by providing them with "rest" periods after intensive work.

How It Works:

The account lifecycle consists of three states:

  • ACTIVE: The account is active and used for tasks. An activity timer starts on its first successful use.
  • RESTING: If an account has been ACTIVE for longer than the configured limit, the AccountManager automatically moves it to a "resting" state. The Airflow worker will not select it for new jobs.
  • Return to ACTIVE: After the cooldown period ends, the AccountManager automatically returns the account to the ACTIVE state, making it available again.
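
A minimal sketch of these transitions, assuming the AccountManager tracks a status plus activity timestamps per account (the dict fields and function name below are illustrative, not the server's actual schema):

    import time

    # Defaults mirror --account-active-duration-min and --account-cooldown-duration-min.
    ACTIVE_DURATION_S = 30 * 60
    COOLDOWN_DURATION_S = 60 * 60

    def resolve_account_state(acct: dict) -> str:
        """Apply the ACTIVE -> RESTING -> ACTIVE transitions described above.
        `acct` is an illustrative dict with "status", "active_since", "resting_since".
        """
        now = time.time()
        if acct["status"] == "ACTIVE":
            # The activity timer starts on the account's first successful use.
            started = acct.get("active_since")
            if started is not None and now - started > ACTIVE_DURATION_S:
                acct["status"] = "RESTING"
                acct["resting_since"] = now
        elif acct["status"] == "RESTING":
            if now - acct["resting_since"] > COOLDOWN_DURATION_S:
                acct["status"] = "ACTIVE"
                acct["active_since"] = None  # the timer restarts on the next successful use
        return acct["status"]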

Configuration:

These parameters are configured when starting the ytdlp-ops-server.

  • --account-active-duration-min: The "action time" in minutes an account can be continuously active before being moved to RESTING.
    • Default: 30 (minutes).
  • --account-cooldown-duration-min: The "rest time" in minutes an account must remain in the RESTING state.
    • Default: 60 (minutes).

Where to Configure: The parameters are passed as command-line arguments to the server. When using Docker Compose, this is done in airflow/docker-compose-ytdlp-ops.yaml:

    command:
      # ... other parameters
      - "--account-active-duration-min"
      - "${ACCOUNT_ACTIVE_DURATION_MIN:-30}"
      - "--account-cooldown-duration-min"
      - "${ACCOUNT_COOLDOWN_DURATION_MIN:-60}"

You can change the default values by setting the ACCOUNT_ACTIVE_DURATION_MIN and ACCOUNT_COOLDOWN_DURATION_MIN environment variables in your .env file.

Relevant Files:

  • server_fix/account_manager.py: Contains the core logic for state transitions.
  • ytdlp_ops_server_fix.py: Parses the command-line arguments.
  • airflow/docker-compose-ytdlp-ops.yaml: Passes the arguments to the server container.

2. Smart Banning Strategy

Goal: To avoid unfairly banning good proxies. The problem is often with the account, not the proxy it's using.

How It Works:

Stage 1: Ban the Account First

  • When a serious, bannable error occurs (e.g., BOT_DETECTED or SOCKS5_CONNECTION_FAILED), the system penalizes only the account that caused the error.
  • For the proxy, this error is simply recorded as a single failure, but the proxy itself is not banned and remains in rotation.

Stage 2: Ban the Proxy via "Sliding Window"

  • A proxy is banned automatically only if it shows systematic failures with DIFFERENT accounts over a short period.
  • This is a reliable indicator that the proxy itself is the problem. The ProxyManager on the server tracks this and automatically bans such a proxy.

Configuration:

These parameters are hard-coded as constants in the source code. Changing them requires editing the file.

Where to Configure:

  • File: server_fix/proxy_manager.py
  • Constants in the ProxyManager class:
    • FAILURE_WINDOW_SECONDS: The time window in seconds for analyzing failures.
      • Default: 3600 (1 hour).
    • FAILURE_THRESHOLD_COUNT: The minimum total number of failures to trigger a check.
      • Default: 3.
    • FAILURE_THRESHOLD_UNIQUE_ACCOUNTS: The minimum number of unique accounts that must have failed with the proxy to trigger a ban.
      • Default: 3.
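
A simplified sketch of the sliding-window check, assuming each failure record carries a timestamp and the account ID that failed (the real ProxyManager reads these records from the proxy_failures:<proxy_url> sorted set in Redis; the function name is illustrative):

    import time

    FAILURE_WINDOW_SECONDS = 3600          # look at the last hour of failures
    FAILURE_THRESHOLD_COUNT = 3            # minimum total failures to consider a ban
    FAILURE_THRESHOLD_UNIQUE_ACCOUNTS = 3  # minimum distinct accounts that must have failed

    def should_ban_proxy(failures: list[tuple[float, str]]) -> bool:
        """`failures` is a list of (timestamp, account_id) records for one proxy."""
        now = time.time()
        recent = [acct for ts, acct in failures if now - ts <= FAILURE_WINDOW_SECONDS]
        if len(recent) < FAILURE_THRESHOLD_COUNT:
            return False
        # Ban only if several *different* accounts failed on this proxy recently,
        # which points at the proxy itself rather than at any single bad account.
        return len(set(recent)) >= FAILURE_THRESHOLD_UNIQUE_ACCOUNTS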

Relevant Files:

  • server_fix/proxy_manager.py: Contains the sliding window logic and constants.
  • airflow/dags/ytdlp_ops_worker_per_url.py: The handle_bannable_error_callable function implements the "account-only" ban policy.

Account Statuses Explained

You can view the status of all accounts using the ytdlp_mgmt_proxy_account DAG. The statuses have the following meanings:

  • ACTIVE: The account is healthy and available for use. An account is considered ACTIVE by default if it has no specific status set.
  • BANNED: The account has been temporarily disabled due to repeated failures (e.g., BOT_DETECTED errors) or by a manual ban. The status will show the time remaining until it automatically becomes ACTIVE again (e.g., BANNED (active in 55m)).
  • RESTING: The account has been used for an extended period and is in a mandatory "cooldown" period to prevent burnout. The status will show the time remaining until it becomes ACTIVE again (e.g., RESTING (active in 25m)).
  • (Blank Status): In older versions, an account that had only ever failed (and never succeeded) might appear with a blank status. This has been fixed; these accounts are now correctly shown as ACTIVE.

3. End-to-End Rotation Flow: How It All Works Together

This section describes the step-by-step flow of how a worker gets assigned an account and a proxy for a single job, integrating all the management strategies described above.

  1. Worker Initialization (ytdlp_ops_worker_per_url)

    • The DAG run starts, triggered either by the orchestrator or by its previous successful run.
    • The pull_url_from_redis task fetches a URL from the Redis _inbox queue.
  2. Account Selection (Airflow Worker)

    • The assign_account task is executed.
    • It generates the full list of potential account IDs based on the account_pool (e.g., my_prefix_01 to my_prefix_50).
    • It connects to Redis and iterates through this list, checking the status of each account.
    • It builds a new, temporary list containing only accounts that are not in a BANNED or RESTING state.
    • If the resulting list of active accounts is empty, the worker fails (unless auto-creation is enabled).
    • It then takes the filtered list of active accounts and uses random.choice() to select one.
    • The chosen account_id is passed to the next task.
  3. Proxy Selection (ytdlp-ops-server)

    • The get_token task runs, sending the randomly chosen account_id in a Thrift RPC call to the ytdlp-ops-server.
    • On the server, the ProxyManager is asked for a proxy. This happens on every single request.
    • The ProxyManager performs the following steps on every call to ensure it has the most up-to-date information:
      a. Query Redis: It fetches the entire current state of all proxies from Redis, so it immediately sees any status changes (e.g., a ban) made by other workers.
      b. Rebuild Active List: It rebuilds its internal in-memory list of proxies, including only those with an ACTIVE status.
      c. Apply Sliding Window Ban: It checks the recent failure history for each active proxy. If a proxy has failed too many times with different accounts, it is banned on the spot, even if its status was ACTIVE.
      d. Select Proxy: It selects the next available proxy from the final, filtered active list using a round-robin index.
      e. Return Proxy: It returns the selected proxy_url to be used for the token generation task.
    • Worker Affinity: Crucially, even though workers may share a proxy state in Redis under a common server_identity, each worker instance will only ever use the proxies it was configured with at startup. It uses Redis to check the status of its own proxies but will ignore other proxies in the shared pool.
  4. Execution and Reporting

    • The server now has both the account_id (from Airflow) and the proxy_url (from its ProxyManager).
    • It proceeds with the token generation process using these resources.
    • Upon completion (success or failure), it reports the outcome to Redis, updating the status for both the specific account and proxy that were used. This affects their failure counters, cooldown timers, etc., for the next run.

This separation of concerns is key:

  • The Airflow worker (assign_account task) is responsible for the random selection of an active account, while maintaining affinity (re-using the same account after a success).
  • The ytdlp-ops-server is responsible for the round-robin selection of an active proxy.
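
In rough pseudocode, the two selection strategies look like the sketch below (the function and class names are illustrative, not the project's actual code):

    import random

    def pick_account(account_ids: list[str], statuses: dict[str, str]) -> str:
        """Airflow side: random choice among accounts that are neither BANNED nor RESTING.
        Accounts with no stored status are treated as ACTIVE."""
        active = [a for a in account_ids
                  if statuses.get(a, "ACTIVE") not in ("BANNED", "RESTING")]
        if not active:
            raise RuntimeError("No active accounts available")  # the worker fails here
        return random.choice(active)

    class RoundRobinProxies:
        """Server side: round-robin over the proxies that are currently ACTIVE."""
        def __init__(self, proxies: list[str]):
            self.proxies = proxies  # the proxies this worker was configured with at startup
            self.index = 0

        def next_active(self, statuses: dict[str, str]) -> str:
            active = [p for p in self.proxies if statuses.get(p, "ACTIVE") == "ACTIVE"]
            if not active:
                raise RuntimeError("NO_ACTIVE_PROXIES")
            proxy = active[self.index % len(active)]
            self.index += 1
            return proxy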

4. Automatic Account Ban on Consecutive Failures

Goal: To automatically remove accounts from rotation that consistently cause non-bannable errors (e.g., incorrect password, authorization issues).

How It Works:

  • The AccountManager tracks the number of consecutive failures for each account.
  • On any successful operation, this counter is reset.
  • If the number of consecutive failures reaches a set threshold, the account is automatically banned for a specified duration.

Configuration:

These parameters are set in the AccountManager constructor.

Where to Configure:

  • File: server_fix/account_manager.py
  • Parameters in the __init__ method of AccountManager:
    • failure_threshold: The number of consecutive failures before a ban.
      • Default: 5.
    • ban_duration_s: The duration of the ban in seconds.
      • Default: 3600 (1 hour).
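
A minimal sketch of this counter, using the defaults above (the class and attribute names are illustrative, not the actual AccountManager internals):

    import time

    class ConsecutiveFailureTracker:
        def __init__(self, failure_threshold: int = 5, ban_duration_s: int = 3600):
            self.failure_threshold = failure_threshold
            self.ban_duration_s = ban_duration_s
            self.failures: dict[str, int] = {}        # account_id -> consecutive failures
            self.banned_until: dict[str, float] = {}  # account_id -> timestamp when the ban expires

        def report_success(self, account_id: str) -> None:
            # Any successful operation resets the consecutive-failure counter.
            self.failures[account_id] = 0

        def report_failure(self, account_id: str) -> None:
            count = self.failures.get(account_id, 0) + 1
            self.failures[account_id] = count
            if count >= self.failure_threshold:
                # Threshold reached: ban the account for ban_duration_s seconds.
                self.banned_until[account_id] = time.time() + self.ban_duration_s
                self.failures[account_id] = 0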

5. Monitoring and Recovery

How to Check Statuses

The ytdlp_mgmt_proxy_account DAG is the primary tool for monitoring the health of your resources. It connects directly to the management service to perform actions.

  • DAG ID: ytdlp_mgmt_proxy_account
  • How to Use: Trigger the DAG from the Airflow UI. Ensure the management_host and management_port parameters are correctly set to point to your ytdlp-ops-management service instance. To get a full overview, set the parameters:
    • entity: all
    • action: list
  • Result: The DAG log will display tables with the current status of all accounts and proxies. For BANNED or RESTING accounts, it shows the time remaining until they become active again (e.g., RESTING (active in 45m)). For proxies, it highlights which proxy is next in the round-robin rotation for a specific worker.

Worker vs. Management Service Roles in Automatic State Changes

It is important to understand the distinct roles each service plays in the automatic state management of accounts and proxies. The system uses a reactive, "on-read" update mechanism.

  • The worker service is proactive. It is responsible for putting resources into a "bad" state.

    • When a worker encounters too many failures with an account, it moves the account to BANNED.
    • When an account's activity timer expires, the worker moves it to RESTING.
    • When a proxy fails the sliding window check during a token request, the worker bans it.
  • The management service is reactive but crucial for recovery. It is responsible for taking resources out of a "bad" state.

    • The logic to check if a ban has expired or a rest period is over is located in the getAccountStatus and getProxyStatus methods.
    • This means an account or proxy is only returned to an ACTIVE state when its status is queried.
    • Since the ytdlp_mgmt_proxy_account DAG calls these methods on the management service, running this DAG is the primary mechanism for automatically clearing expired bans and rest periods.

In summary, workers put resources into timeout, and the management service (when queried) brings them back. This makes periodic checks with the management DAG important for overall system health and recovery.
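
A sketch of this "on-read" recovery, as an approximation of what getAccountStatus and getProxyStatus do (the dict fields are illustrative):

    import time

    def get_account_status(acct: dict) -> str:
        """Return the current status, clearing an expired BANNED or RESTING state
        at read time. `acct` is an illustrative dict with "status" and
        "available_at" (the timestamp when the ban or rest period ends).
        """
        if acct["status"] in ("BANNED", "RESTING") and time.time() >= acct.get("available_at", 0):
            # The resource only returns to ACTIVE when something queries its status.
            acct["status"] = "ACTIVE"
        return acct["status"]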

Important Note on Unbanning Proxies

When a proxy is unbanned (either individually via unban or collectively via unban_all), the system performs two critical actions:

  1. It sets the proxy's status back to ACTIVE.
  2. It deletes the proxy's entire failure history from Redis.

This second step is crucial. Without it, the ProxyManager's "Sliding Window" check would see the old failures, immediately re-ban the "active" proxy on its next use, and lead to a NO_ACTIVE_PROXIES error. Clearing the history ensures that an unbanned proxy gets a truly fresh start.
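
A sketch of the two-step unban, assuming a redis-py client and the key layout from section 8 below (the helper name and exact JSON fields are illustrative):

    import json
    import redis

    def unban_proxy(r: redis.Redis, server_identity: str, proxy_url: str) -> None:
        """Set the proxy back to ACTIVE and wipe its failure history."""
        key = f"proxies:{server_identity}"
        raw = r.hget(key, proxy_url)
        state = json.loads(raw) if raw else {}
        state["status"] = "ACTIVE"
        r.hset(key, proxy_url, json.dumps(state))
        # Without this delete, the sliding-window check would see the old failures
        # and immediately re-ban the proxy on its next use.
        r.delete(f"proxy_failures:{proxy_url}")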

What Happens When All Accounts Are Banned or Resting?

If the entire pool of accounts becomes unavailable (either BANNED or RESTING), the system will effectively pause by default.

  • The ytdlp_ops_worker_per_url DAG will fail at the assign_account step with an AirflowException because the active account pool will be empty.
  • This will stop the processing loops. The system will remain paused until accounts are either manually unbanned or their ban/rest timers expire, at which point you can restart the processing loops using the ytdlp_ops_orchestrator DAG.
  • The DAG graph for ytdlp_ops_worker_per_url now explicitly shows tasks for assign_account, get_token, ban_account, retry_get_token, etc., making the process flow and failure points much clearer.

The system can be configured to automatically create new accounts to prevent processing from halting completely.

Automatic Account Creation on Exhaustion

  • Goal: Ensure the processing pipeline continues to run even if all accounts in the primary pool are temporarily banned or resting.
  • How it works: If the auto_create_new_accounts_on_exhaustion parameter is set to True and the account pool is defined using a prefix (not an explicit list), the system will generate a new, unique account ID when it finds the active pool empty.
  • New Account Naming: New accounts are created with the format {prefix}-auto-{unique_id}.
  • Configuration:
    • Parameter: auto_create_new_accounts_on_exhaustion
    • Where to set: In the ytdlp_ops_orchestrator DAG configuration when triggering a run.
    • Default: True.
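
A sketch of the fallback behaviour for a prefix-based pool; the {prefix}-auto-{unique_id} format matches the description above, while the use of uuid4 for the unique part is an assumption:

    import random
    import uuid

    def assign_account(active_accounts: list[str], prefix: str,
                       auto_create_new_accounts_on_exhaustion: bool = True) -> str:
        if active_accounts:
            return random.choice(active_accounts)
        if auto_create_new_accounts_on_exhaustion:
            # Pool exhausted: mint a new account ID instead of failing the worker.
            return f"{prefix}-auto-{uuid.uuid4().hex[:8]}"
        raise RuntimeError("No active accounts and auto-creation is disabled")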

6. Failure Handling and Retry Policy

Goal: To provide flexible control over how the system behaves when a worker encounters a "bannable" error (e.g., BOT_DETECTED).

How It Works

When a worker's get_token task fails with a bannable error, the system's behavior is determined by the on_bannable_failure policy, which can be configured when starting the ytdlp_ops_orchestrator.

Configuration

  • Parameter: on_bannable_failure
  • Where to set: In the ytdlp_ops_orchestrator DAG configuration.
  • Options:
    • stop_loop (Strictest):
      • The account used is banned.
      • The URL is marked as failed in the _fail Redis hash.
      • The worker's processing loop is stopped. The lane becomes inactive.
    • retry_with_new_account (Default, Most Resilient):
      • The failing account is banned.
      • The worker immediately retries the same URL with a new, unused account from the pool.
      • If the retry succeeds, the worker continues its loop to the next URL.
      • If the retry also fails, the second account and the proxy are also banned, and the worker's loop is stopped.
    • retry_and_ban_account_only:
      • Similar to retry_with_new_account, but on the second failure, it bans only the second account, not the proxy.
      • This is useful when you trust your proxies but want to aggressively cycle through failing accounts.
    • retry_without_ban (Most Lenient):
      • The worker retries with a new account, but no accounts or proxies are ever banned.
      • This policy is useful for debugging or when you are confident that failures are transient and not the fault of the resources.

This policy allows the system to be resilient to single account failures without losing the URL, while providing granular control over when to ban accounts and/or proxies if the problem persists.
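
The differences between the policies can be summarized with an illustrative decision helper (this is not the DAG's actual code, only a restatement of the options above):

    def handle_bannable_error(policy: str, attempt: int) -> dict:
        """Illustrative decision table for the on_bannable_failure policy.
        attempt=1 is the first get_token failure, attempt=2 is the retry failure.
        """
        if policy == "stop_loop":
            return {"ban_account": True, "ban_proxy": False, "next": "stop_loop"}
        if policy == "retry_with_new_account":
            if attempt == 1:
                return {"ban_account": True, "ban_proxy": False, "next": "retry_get_token"}
            return {"ban_account": True, "ban_proxy": True, "next": "stop_loop"}
        if policy == "retry_and_ban_account_only":
            if attempt == 1:
                return {"ban_account": True, "ban_proxy": False, "next": "retry_get_token"}
            return {"ban_account": True, "ban_proxy": False, "next": "stop_loop"}
        if policy == "retry_without_ban":
            if attempt == 1:
                return {"ban_account": False, "ban_proxy": False, "next": "retry_get_token"}
            # What happens after a failed retry under this policy is not spelled out
            # above; stopping the loop here is an assumption.
            return {"ban_account": False, "ban_proxy": False, "next": "stop_loop"}
        raise ValueError(f"Unknown on_bannable_failure policy: {policy}")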


7. Worker DAG Logic (ytdlp_ops_worker_per_url)

This DAG is the "workhorse" of the system. It is designed as a self-sustaining loop to process one URL per run. The logic for handling failures and retries is now explicitly visible in the DAG's task graph.

Tasks and Their Purpose:

  • pull_url_from_redis: Fetches one URL from the Redis _inbox queue. If the queue is empty, the DAG run is skipped, stopping this worker's processing "lane".
  • assign_account: Selects an account for the job. It maintains account affinity by re-using the same account from the previous successful run in its "lane". If it's the first run or the previous run failed, it picks a random active account.
  • get_token: The primary attempt to get tokens and info.json by calling the ytdlp-ops-server.
  • handle_bannable_error_branch: A branching task that runs if get_token fails. It inspects the error and decides the next step based on the on_bannable_failure policy.
  • ban_account_and_prepare_for_retry: If a retry is permitted, this task bans the failed account and selects a new one.
  • retry_get_token: A second attempt to get the token using the new account.
  • ban_second_account_and_proxy: If the retry also fails, this task bans the second account and the proxy that was used.
  • download_and_probe: If get_token or retry_get_token succeeds, this task uses yt-dlp to download the media and ffmpeg to verify that the downloaded file is a valid media file.
  • mark_url_as_success: If download_and_probe succeeds, this task records the successful result in the Redis _result hash.
  • handle_generic_failure: If any task fails non-recoverably, this task records the detailed error information in the Redis _fail hash.
  • decide_what_to_do_next: A final branching task that decides whether to continue the loop (trigger_self_run), stop it gracefully (stop_loop), or mark it as failed (fail_loop).
  • trigger_self_run: The task that actually triggers the next DAG run, creating the continuous loop.
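
The self-sustaining loop hinges on the last task re-triggering the same DAG. A minimal illustration of that pattern with Airflow's TriggerDagRunOperator (assuming Airflow 2.4+; this is not the project's actual DAG definition, which gates the trigger behind decide_what_to_do_next):

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.trigger_dagrun import TriggerDagRunOperator

    with DAG(
        dag_id="ytdlp_ops_worker_per_url",
        start_date=datetime(2025, 1, 1),
        schedule=None,   # never scheduled; only triggered by the orchestrator or by itself
        catchup=False,
    ) as dag:
        # In the real DAG this task only runs when the branch decides to continue the loop.
        trigger_self_run = TriggerDagRunOperator(
            task_id="trigger_self_run",
            trigger_dag_id="ytdlp_ops_worker_per_url",  # the DAG triggers itself
        )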

8. Proxy State Lifecycle in Redis

This section details how a proxy's state (e.g., ACTIVE, BANNED) is managed and persisted in Redis. The system uses a "lazy initialization" pattern, meaning a proxy's state is only written to Redis when it is first needed.

Step 1: Configuration and In-Memory Initialization

The server first learns about the list of available proxies from its startup configuration, not from Redis.

  1. Source of Truth: Proxies are defined in the .env file (e.g., CAMOUFOX_PROXIES, SOCKS5_SOCK_SERVER_IP).
  2. Injection: The airflow/generate_envoy_config.py script aggregates these into a single list, which is passed to the ytdlp-ops-server via the --proxies command-line argument during Docker Compose startup.
  3. In-Memory State: The ProxyManager in server_fix/proxy_manager.py receives this list and holds it in memory. At this point, Redis is not involved.

Step 2: First Write to Redis (Lazy Initialization)

A proxy's state is only persisted to Redis the first time it is actively managed or queried.

  • Trigger: This typically happens on the first API call that requires proxy state, such as getProxyStatus.
  • Action: The ProxyManager checks Redis for a hash with the key proxies:<server_identity> (e.g., proxies:ytdlp-ops-airflow-service).
  • Initialization: If the key does not exist, the ProxyManager iterates through its in-memory list of proxies and writes each one to the Redis hash with a default state of ACTIVE.
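
A sketch of the lazy initialization with redis-py (the JSON field layout is illustrative; the real state record may contain more fields):

    import json
    import redis

    def ensure_proxy_states(r: redis.Redis, server_identity: str, proxies: list[str]) -> None:
        """Write a default ACTIVE state for each configured proxy, but only if the
        hash does not exist yet (lazy initialization)."""
        key = f"proxies:{server_identity}"
        if r.exists(key):
            return  # already initialized by an earlier call
        for proxy_url in proxies:
            r.hset(key, proxy_url, json.dumps({
                "status": "ACTIVE",
                "success_count": 0,
                "failure_count": 0,
            }))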

Step 3: Runtime Updates (Success and Failure)

The proxy's state in Redis is updated in real-time based on the outcome of token generation tasks.

  • On Success: When a task using a proxy succeeds, ProxyManager.report_success() is called. This updates the proxy's success_count and last_success_timestamp in the Redis hash.
  • On Failure: When a task fails, ProxyManager.report_failure() is called.
    1. A record of the failure (including the account ID and job ID) is added to a separate Redis sorted set with the key proxy_failures:<proxy_url>. This key has a TTL and is used for the sliding window ban strategy.
    2. The proxy's failure_count and last_failure_timestamp are updated in the main Redis hash.
  • Automatic Ban: If the conditions for the "Sliding Window" ban are met (too many failures from different accounts in a short time), ProxyManager.ban_proxy() is called, which updates the proxy's status to BANNED in the Redis hash.
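
A sketch of the failure path with redis-py; the record format, the pruning step, and the TTL value are assumptions:

    import json
    import time
    import redis

    def report_failure(r: redis.Redis, server_identity: str, proxy_url: str,
                       account_id: str, job_id: str, window_s: int = 3600) -> None:
        now = time.time()
        # 1. Log the failure in the per-proxy sorted set (score = timestamp);
        #    the sliding-window ban check reads this later.
        failures_key = f"proxy_failures:{proxy_url}"
        member = json.dumps({"account_id": account_id, "job_id": job_id, "ts": now})
        r.zadd(failures_key, {member: now})
        r.expire(failures_key, window_s)                      # keep the log short-lived
        r.zremrangebyscore(failures_key, 0, now - window_s)   # drop records older than the window
        # 2. Update the aggregate counters in the main proxy hash.
        state_key = f"proxies:{server_identity}"
        raw = r.hget(state_key, proxy_url)
        state = json.loads(raw) if raw else {"status": "ACTIVE"}
        state["failure_count"] = state.get("failure_count", 0) + 1
        state["last_failure_timestamp"] = now
        r.hset(state_key, proxy_url, json.dumps(state))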

Step 4: Observation and Manual Control

You can view and modify the proxy states stored in Redis using the provided management tools.

  • Observation:
    • Airflow DAG: The ytdlp_mgmt_proxy_account DAG (action: list_statuses, entity: proxy).
    • CLI Client: The proxy_manager_client.py script (list command).
    • These tools call the getProxyStatus API endpoint, which reads directly from the proxies:<server_identity> hash in Redis.
  • Manual Control:
    • The same tools provide ban, unban, and reset actions.
    • These actions call API endpoints that directly modify the status field for a proxy in the proxies:<server_identity> Redis hash.
    • The delete_from_redis action in the DAG provides a way to completely remove a proxy's state and failure history from Redis, forcing it to be re-initialized as ACTIVE on its next use.

Summary of Redis Keys

  • proxies:<server_identity> (Hash): The primary store for proxy state. Maps proxy_url to a JSON string containing its status (ACTIVE/BANNED), success/failure counts, and timestamps.
  • proxy_failures:<proxy_url> (Sorted Set): A temporary log of recent failures for a specific proxy, used by the sliding window ban logic. The score is the timestamp of the failure.
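
For quick ad-hoc inspection outside the DAG or CLI client, the keys can be read directly with redis-py (the Redis host and the proxy URL below are hypothetical; the server_identity matches the example used earlier):

    import redis

    r = redis.Redis(host="localhost", port=6379, decode_responses=True)

    # Proxy states: maps each proxy_url to a JSON string with status, counters, timestamps.
    for proxy_url, state_json in r.hgetall("proxies:ytdlp-ops-airflow-service").items():
        print(proxy_url, state_json)

    # Recent failures for one proxy; scores are failure timestamps.
    for member, score in r.zrange("proxy_failures:socks5://10.0.0.1:1080", 0, -1, withscores=True):
        print(score, member)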