# Proxy and Account Management Strategy

This document describes the intelligent resource management strategy (for proxies and accounts) used by the `ytdlp-ops-server`. The goal of this system is to maximize the success rate, minimize blocks, and ensure fault tolerance.

The server can run in different roles to support a distributed architecture, separating management tasks from token generation work.

---

## Service Roles and Architecture

The server is designed to run in one of three roles, specified by the `--service-role` flag:

- **`management`**: A single, lightweight service instance responsible for all management API calls.
  - **Purpose**: Provides a centralized endpoint for monitoring and managing the state of all proxies and accounts across the system.
  - **Behavior**: Exposes only management functions (`getProxyStatus`, `banAccount`, etc.). Calls to token generation functions will fail.
  - **Deployment**: Runs as a single container (`ytdlp-ops-management`) and exposes its port directly to the host (e.g., port `9091`), bypassing Envoy.
- **`worker`**: The primary workhorse for token and `info.json` generation.
  - **Purpose**: Handles all token generation requests.
  - **Behavior**: Implements the full API, but its management functions are scoped to its own `server_identity`.
  - **Deployment**: Runs as a scalable service (`ytdlp-ops-worker`) behind the Envoy load balancer (e.g., port `9080`).
- **`all-in-one`** (Default): A single instance that performs both management and worker roles. Ideal for local development or small-scale deployments.

This architecture allows for a robust, federated system where workers manage their own resources locally, while a central service provides a global view for management and monitoring.

---

## 1. Account Lifecycle Management (Cooldown / Resting)

**Goal:** To prevent excessive use and subsequent blocking of accounts by providing them with "rest" periods after intensive work.

### How It Works:

The account lifecycle consists of three stages:

- **`ACTIVE`**: The account is active and used for tasks. An activity timer starts on its first successful use.
- **`RESTING`**: If an account has been `ACTIVE` for longer than the configured limit, the `AccountManager` automatically moves it to a "resting" state. The Airflow worker will not select it for new jobs.
- **Return to `ACTIVE`**: After the cooldown period ends, the `AccountManager` automatically returns the account to the `ACTIVE` state, making it available again.

### Configuration:

These parameters are configured when starting the `ytdlp-ops-server`.

- `--account-active-duration-min`: The "action time" in **minutes** an account can be continuously active before being moved to `RESTING`.
  - **Default:** `30` (minutes).
- `--account-cooldown-duration-min`: The "rest time" in **minutes** an account must remain in the `RESTING` state.
  - **Default:** `60` (minutes).

**Where to Configure:**

The parameters are passed as command-line arguments to the server. When using Docker Compose, this is done in `airflow/docker-compose-ytdlp-ops.yaml`:

```yaml
command:
  # ... other parameters
  - "--account-active-duration-min"
  - "${ACCOUNT_ACTIVE_DURATION_MIN:-30}"
  - "--account-cooldown-duration-min"
  - "${ACCOUNT_COOLDOWN_DURATION_MIN:-60}"
```

You can change the default values by setting the `ACCOUNT_ACTIVE_DURATION_MIN` and `ACCOUNT_COOLDOWN_DURATION_MIN` environment variables in your `.env` file.

**Relevant Files:**

- `server_fix/account_manager.py`: Contains the core logic for state transitions.
- `ytdlp_ops_server_fix.py`: Parses the command-line arguments.
- `airflow/docker-compose-ytdlp-ops.yaml`: Passes the arguments to the server container.

---

## 2. Smart Banning Strategy

**Goal:** To avoid unfairly banning good proxies. The problem is often with the account, not the proxy it's using.

### How It Works:

#### Stage 1: Ban the Account First

- When a serious, bannable error occurs (e.g., `BOT_DETECTED` or `SOCKS5_CONNECTION_FAILED`), the system penalizes **only the account** that caused the error.
- For the proxy, this error is simply recorded as a single failure, but the proxy itself is **not banned** and remains in rotation.

#### Stage 2: Ban the Proxy via "Sliding Window"

- A proxy is banned automatically only if it shows **systematic failures with DIFFERENT accounts** over a short period.
- This is a reliable indicator that the proxy itself is the problem. The `ProxyManager` on the server tracks this and automatically bans such a proxy.

### Configuration:

These parameters are **hard-coded** as constants in the source code. Changing them requires editing the file.

**Where to Configure:**

- **File:** `server_fix/proxy_manager.py`
- **Constants** in the `ProxyManager` class:
  - `FAILURE_WINDOW_SECONDS`: The time window in seconds for analyzing failures.
    - **Default:** `3600` (1 hour).
  - `FAILURE_THRESHOLD_COUNT`: The minimum total number of failures to trigger a check.
    - **Default:** `3`.
  - `FAILURE_THRESHOLD_UNIQUE_ACCOUNTS`: The minimum number of **unique accounts** that must have failed with the proxy to trigger a ban.
    - **Default:** `3`.

**Relevant Files:**

- `server_fix/proxy_manager.py`: Contains the sliding window logic and constants.
- `airflow/dags/ytdlp_ops_worker_per_url.py`: The `handle_bannable_error_callable` function implements the "account-only" ban policy.

---

### Account Statuses Explained

You can view the status of all accounts using the `ytdlp_mgmt_proxy_account` DAG. The statuses have the following meanings:

- **`ACTIVE`**: The account is healthy and available for use. An account is considered `ACTIVE` by default if it has no specific status set.
- **`BANNED`**: The account has been temporarily disabled due to repeated failures (e.g., `BOT_DETECTED` errors) or by a manual ban.
  The status will show the time remaining until it automatically becomes `ACTIVE` again (e.g., `BANNED (active in 55m)`).
- **`RESTING`**: The account has been used for an extended period and is in a mandatory "cooldown" period to prevent burnout. The status will show the time remaining until it becomes `ACTIVE` again (e.g., `RESTING (active in 25m)`).
- **(Blank Status)**: In older versions, an account that had only ever failed (and never succeeded) might appear with a blank status. This has been fixed; these accounts are now correctly shown as `ACTIVE`.

---

## 3. End-to-End Rotation Flow: How It All Works Together

This section describes the step-by-step flow of how a worker gets assigned an account and a proxy for a single job, integrating all the management strategies described above.

1. **Worker Initialization (`ytdlp_ops_worker_per_url`)**
   - The DAG run starts, triggered either by the orchestrator or by its previous successful run.
   - The `pull_url_from_redis` task fetches a URL from the Redis `_inbox` queue.

2. **Account Selection (Airflow Worker)**
   - The `assign_account` task is executed.
   - It generates the full list of potential account IDs based on the `account_pool` (e.g., `my_prefix_01` to `my_prefix_50`).
   - It connects to Redis and iterates through this list, checking the status of each account.
   - It builds a new, temporary list containing only accounts that are **not** in a `BANNED` or `RESTING` state.
   - If the resulting list of active accounts is empty, the worker fails (unless auto-creation is enabled).
   - It then takes the filtered list of active accounts and uses **`random.choice()`** to select one.
   - The chosen `account_id` is passed to the next task.

3. **Proxy Selection (`ytdlp-ops-server`)**
   - The `get_token` task runs, sending the randomly chosen `account_id` in a Thrift RPC call to the `ytdlp-ops-server`.
   - On the server, the `ProxyManager` is asked for a proxy. This happens on **every single request**.
   - The `ProxyManager` performs the following steps on every call to ensure it has the most up-to-date information:
     1. **Query Redis:** It fetches the *entire* current state of all proxies from Redis. This ensures it immediately knows about any status changes (e.g., a ban) made by other workers.
     2. **Rebuild Active List:** It rebuilds its internal in-memory list of proxies, including only those with an `ACTIVE` status.
     3. **Apply Sliding Window Ban:** It checks the recent failure history for each active proxy. If a proxy has failed too many times with different accounts, it is banned on the spot, even if its status was `ACTIVE`.
     4. **Select Proxy:** It selects the next available proxy from the final, filtered active list using a **round-robin** index.
     5. **Return Proxy:** It returns the selected `proxy_url` to be used for the token generation task.
   - **Worker Affinity**: Crucially, even though workers may share a proxy state in Redis under a common `server_identity`, each worker instance will **only ever use the proxies it was configured with at startup**. It uses Redis to check the status of its own proxies but will ignore other proxies in the shared pool.

4. **Execution and Reporting**
   - The server now has both the `account_id` (from Airflow) and the `proxy_url` (from its `ProxyManager`).
   - It proceeds with the token generation process using these resources.
   - Upon completion (success or failure), it reports the outcome to Redis, updating the status for both the specific account and proxy that were used. This affects their failure counters, cooldown timers, etc., for the next run.

This separation of concerns is key:

- **The Airflow worker (`assign_account` task)** is responsible for the **random selection of an active account**, while maintaining affinity (re-using the same account after a success).
- **The `ytdlp-ops-server`** is responsible for the **round-robin selection of an active proxy**.

---

## 4. Automatic Account Ban on Consecutive Failures

**Goal:** To automatically remove accounts from rotation that consistently cause non-bannable errors (e.g., incorrect password, authorization issues).

### How It Works:

- The `AccountManager` tracks the number of **consecutive** failures for each account.
- On any successful operation, this counter is reset.
- If the number of consecutive failures reaches a set threshold, the account is automatically banned for a specified duration.

### Configuration:

These parameters are set in the `AccountManager` constructor.

**Where to Configure:**

- **File:** `server_fix/account_manager.py`
- **Parameters** in the `__init__` method of `AccountManager`:
  - `failure_threshold`: The number of consecutive failures before a ban.
    - **Default:** `5`.
  - `ban_duration_s`: The duration of the ban in seconds.
    - **Default:** `3600` (1 hour).

---

## 5. Monitoring and Recovery

### How to Check Statuses

The **`ytdlp_mgmt_proxy_account`** DAG is the primary tool for monitoring the health of your resources. It connects directly to the **management service** to perform actions.

- **DAG ID:** `ytdlp_mgmt_proxy_account`
- **How to Use:** Trigger the DAG from the Airflow UI. Ensure the `management_host` and `management_port` parameters are correctly set to point to your `ytdlp-ops-management` service instance. To get a full overview, set the parameters:
  - `entity`: `all`
  - `action`: `list`
- **Result:** The DAG log will display tables with the current status of all accounts and proxies. For `BANNED` or `RESTING` accounts, it shows the time remaining until they become active again (e.g., `RESTING (active in 45m)`). For proxies, it highlights which proxy is `(next)` in the round-robin rotation for a specific worker.

### Worker vs. Management Service Roles in Automatic State Changes

It is important to understand the distinct roles each service plays in the automatic state management of accounts and proxies.
The system uses a reactive, "on-read" update mechanism.

- **The `worker` service is proactive.** It is responsible for putting resources into a "bad" state.
  - When a worker encounters too many failures with an account, it moves the account to `BANNED`.
  - When an account's activity timer expires, the worker moves it to `RESTING`.
  - When a proxy fails the sliding window check during a token request, the worker bans it.
- **The `management` service is reactive but crucial for recovery.** It is responsible for taking resources out of a "bad" state.
  - The logic to check if a ban has expired or a rest period is over is located in the `getAccountStatus` and `getProxyStatus` methods.
  - This means an account or proxy is only returned to an `ACTIVE` state **when its status is queried**.
  - Since the `ytdlp_mgmt_proxy_account` DAG calls these methods on the `management` service, running this DAG is the primary mechanism for automatically clearing expired bans and rest periods.

In summary, workers put resources into timeout, and the management service (when queried) brings them back. This makes periodic checks with the management DAG important for overall system health and recovery.

### Important Note on Unbanning Proxies

When a proxy is unbanned (either individually via `unban` or collectively via `unban_all`), the system performs two critical actions:

1. It sets the proxy's status back to `ACTIVE`.
2. It **deletes the proxy's entire failure history** from Redis.

This second step is crucial. Without it, the `ProxyManager`'s "Sliding Window" check would see the old failures, immediately re-ban the "active" proxy on its next use, and lead to a `NO_ACTIVE_PROXIES` error. Clearing the history ensures that an unbanned proxy gets a truly fresh start.

### What Happens When All Accounts Are Banned or Resting?

If the entire pool of accounts becomes unavailable (either `BANNED` or `RESTING`), the system will effectively pause by default.

- The `ytdlp_ops_worker_per_url` DAG will fail at the `assign_account` step with an `AirflowException` because the active account pool will be empty.
- This will stop the processing loops. The system will remain paused until accounts are either manually unbanned or their ban/rest timers expire, at which point you can re-start the processing loops using the `ytdlp_ops_orchestrator` DAG.
- The DAG graph for `ytdlp_ops_worker_per_url` now explicitly shows tasks for `assign_account`, `get_token`, `ban_account`, `retry_get_token`, etc., making the process flow and failure points much clearer.

The system can be configured to automatically create new accounts to prevent processing from halting completely.

#### Automatic Account Creation on Exhaustion

- **Goal**: Ensure the processing pipeline continues to run even if all accounts in the primary pool are temporarily banned or resting.
- **How it works**: If the `auto_create_new_accounts_on_exhaustion` parameter is set to `True` and the account pool is defined using a prefix (not an explicit list), the system will generate a new, unique account ID when it finds the active pool empty.
- **New Account Naming**: New accounts are created with the format `{prefix}-auto-{unique_id}`.
- **Configuration**:
  - **Parameter**: `auto_create_new_accounts_on_exhaustion`
  - **Where to set**: In the `ytdlp_ops_orchestrator` DAG configuration when triggering a run.
  - **Default**: `True`.

---

## 6. Failure Handling and Retry Policy

**Goal:** To provide flexible control over how the system behaves when a worker encounters a "bannable" error (e.g., `BOT_DETECTED`).

### How It Works

When a worker's `get_token` task fails with a bannable error, the system's behavior is determined by the `on_bannable_failure` policy, which can be configured when starting the `ytdlp_ops_orchestrator`.

### Configuration

- **Parameter**: `on_bannable_failure`
- **Where to set**: In the `ytdlp_ops_orchestrator` DAG configuration.
- **Options**:
  - `stop_loop` (Strictest):
    - The account used is banned.
    - The URL is marked as failed in the `_fail` Redis hash.
    - The worker's processing loop is **stopped**. The lane becomes inactive.
  - `retry_with_new_account` (Default, Most Resilient):
    - The failing account is banned.
    - The worker immediately retries the **same URL** with a new, unused account from the pool.
    - If the retry succeeds, the worker continues its loop to the next URL.
    - If the retry also fails, the second account **and the proxy** are also banned, and the worker's loop is stopped.
  - `retry_and_ban_account_only`:
    - Similar to `retry_with_new_account`, but on the second failure, it bans **only the second account**, not the proxy.
    - This is useful when you trust your proxies but want to aggressively cycle through failing accounts.
  - `retry_without_ban` (Most Lenient):
    - The worker retries with a new account, but **no accounts or proxies are ever banned**.
    - This policy is useful for debugging or when you are confident that failures are transient and not the fault of the resources.

This policy allows the system to be resilient to single account failures without losing the URL, while providing granular control over when to ban accounts and/or proxies if the problem persists.

---

## 7. Worker DAG Logic (`ytdlp_ops_worker_per_url`)

This DAG is the "workhorse" of the system. It is designed as a self-sustaining loop to process one URL per run. The logic for handling failures and retries is now explicitly visible in the DAG's task graph.

### Tasks and Their Purpose:

- **`pull_url_from_redis`**: Fetches one URL from the Redis `_inbox` queue. If the queue is empty, the DAG run is skipped, stopping this worker's processing "lane".
- **`assign_account`**: Selects an account for the job. It maintains **account affinity** by re-using the same account from the previous successful run in its "lane". If it's the first run or the previous run failed, it picks a random active account.
- **`get_token`**: The primary attempt to get tokens and `info.json` by calling the `ytdlp-ops-server`.
- **`handle_bannable_error_branch`**: A branching task that runs if `get_token` fails. It inspects the error and decides the next step based on the `on_bannable_failure` policy.
- **`ban_account_and_prepare_for_retry`**: If a retry is permitted, this task bans the failed account and selects a new one.
- **`retry_get_token`**: A second attempt to get the token using the new account.
- **`ban_second_account_and_proxy`**: If the retry also fails, this task bans the second account and the proxy that was used.
- **`download_and_probe`**: If `get_token` or `retry_get_token` succeeds, this task uses `yt-dlp` to download the media and `ffmpeg` to verify that the downloaded file is a valid media file.
- **`mark_url_as_success`**: If `download_and_probe` succeeds, this task records the successful result in the Redis `_result` hash.
- **`handle_generic_failure`**: If any task fails non-recoverably, this task records the detailed error information in the Redis `_fail` hash.
- **`decide_what_to_do_next`**: A final branching task that decides whether to continue the loop (`trigger_self_run`), stop it gracefully (`stop_loop`), or mark it as failed (`fail_loop`).
- **`trigger_self_run`**: The task that actually triggers the next DAG run, creating the continuous loop.

---

## 8. Proxy State Lifecycle in Redis

This section details how a proxy's state (e.g., `ACTIVE`, `BANNED`) is managed and persisted in Redis. The system uses a "lazy initialization" pattern, meaning a proxy's state is only written to Redis when it is first needed.

### Step 1: Configuration and In-Memory Initialization

The server first learns about the list of available proxies from its startup configuration, not from Redis.

1. **Source of Truth**: Proxies are defined in the `.env` file (e.g., `CAMOUFOX_PROXIES`, `SOCKS5_SOCK_SERVER_IP`).
2. **Injection**: The `airflow/generate_envoy_config.py` script aggregates these into a single list, which is passed to the `ytdlp-ops-server` via the `--proxies` command-line argument during Docker Compose startup.
3. **In-Memory State**: The `ProxyManager` in `server_fix/proxy_manager.py` receives this list and holds it in memory. At this point, Redis is not involved.

### Step 2: First Write to Redis (Lazy Initialization)

A proxy's state is only persisted to Redis the first time it is actively managed or queried.

* **Trigger**: This typically happens on the first API call that requires proxy state, such as `getProxyStatus`.
* **Action**: The `ProxyManager` checks Redis for a hash with the key `proxies:` (e.g., `proxies:ytdlp-ops-airflow-service`).
* **Initialization**: If the key does not exist, the `ProxyManager` iterates through its in-memory list of proxies and writes each one to the Redis hash with a default state of `ACTIVE`.

### Step 3: Runtime Updates (Success and Failure)

The proxy's state in Redis is updated in real-time based on the outcome of token generation tasks.

* **On Success**: When a task using a proxy succeeds, `ProxyManager.report_success()` is called. This updates the proxy's `success_count` and `last_success_timestamp` in the Redis hash.
* **On Failure**: When a task fails, `ProxyManager.report_failure()` is called.
  1. A record of the failure (including the account ID and job ID) is added to a separate Redis sorted set with the key `proxy_failures:`. This key has a TTL and is used for the sliding window ban strategy.
  2. The proxy's `failure_count` and `last_failure_timestamp` are updated in the main Redis hash.
* **Automatic Ban**: If the conditions for the "Sliding Window" ban are met (too many failures from different accounts in a short time), `ProxyManager.ban_proxy()` is called, which updates the proxy's `status` to `BANNED` in the Redis hash.

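The automatic ban condition in Step 3 can be illustrated with a minimal sketch. It is not the actual `ProxyManager` code: the function name `should_ban_proxy` and the in-memory list of `(timestamp, account_id)` records are assumptions made for illustration, standing in for the entries of the `proxy_failures:` sorted set. Only the three constant names and their defaults come from this document.

```python
import time
from typing import Iterable, Optional, Tuple

# Defaults as documented for the ProxyManager constants.
FAILURE_WINDOW_SECONDS = 3600
FAILURE_THRESHOLD_COUNT = 3
FAILURE_THRESHOLD_UNIQUE_ACCOUNTS = 3


def should_ban_proxy(
    failures: Iterable[Tuple[float, str]],
    now: Optional[float] = None,
) -> bool:
    """Hypothetical helper: decide whether a proxy meets the sliding window ban.

    `failures` holds (timestamp, account_id) pairs; this in-memory shape is an
    assumption standing in for the Redis sorted set of recent failures.
    """
    now = time.time() if now is None else now
    cutoff = now - FAILURE_WINDOW_SECONDS

    # Keep only the failures that fall inside the sliding window.
    recent = [(ts, account) for ts, account in failures if ts >= cutoff]

    # Not enough failures in total: no check is triggered.
    if len(recent) < FAILURE_THRESHOLD_COUNT:
        return False

    # Ban only when enough DIFFERENT accounts failed with this proxy.
    unique_accounts = {account for _, account in recent}
    return len(unique_accounts) >= FAILURE_THRESHOLD_UNIQUE_ACCOUNTS
```

Under these assumptions, three recent failures from one account do not ban the proxy, while three recent failures from three different accounts do. This matches the intent that a single bad account should not take a good proxy out of rotation.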
### Step 4: Observation and Manual Control

You can view and modify the proxy states stored in Redis using the provided management tools.

* **Observation**:
  * **Airflow DAG**: The `ytdlp_mgmt_proxy_account` DAG (`action: list_statuses`, `entity: proxy`).
  * **CLI Client**: The `proxy_manager_client.py` script (`list` command).
  * These tools call the `getProxyStatus` API endpoint, which reads directly from the `proxies:` hash in Redis.
* **Manual Control**:
  * The same tools provide `ban`, `unban`, and `reset` actions.
  * These actions call API endpoints that directly modify the `status` field for a proxy in the `proxies:` Redis hash.
  * The `delete_from_redis` action in the DAG provides a way to completely remove a proxy's state and failure history from Redis, forcing it to be re-initialized as `ACTIVE` on its next use.

### Summary of Redis Keys

| Redis Key Pattern | Type | Purpose |
| :--- | :--- | :--- |
| `proxies:` | Hash | The primary store for proxy state. Maps `proxy_url` to a JSON string containing its status (`ACTIVE`/`BANNED`), success/failure counts, and timestamps. |
| `proxy_failures:` | Sorted Set | A temporary log of recent failures for a specific proxy, used by the sliding window ban logic. The score is the timestamp of the failure. |
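
To make the `proxies:` hash layout and the lazy initialization pattern more concrete, here is a minimal sketch that uses a plain Python dictionary in place of Redis. The function name `ensure_initialized`, the dictionary-based store, and the exact shape of the JSON payload are assumptions for illustration. The field names (`status`, `success_count`, `failure_count`, and the two timestamps) are the ones mentioned in this section, but the real stored format may differ.

```python
import json
from typing import Dict, List

# Hypothetical stand-in for Redis: hash key -> {proxy_url: JSON string}.
HashStore = Dict[str, Dict[str, str]]


def ensure_initialized(store: HashStore, key: str, proxies: List[str]) -> bool:
    """Write a default ACTIVE state for every proxy if the hash is missing.

    Returns True when the hash was created, False when it already existed.
    """
    if key in store:
        # State already persisted: leave existing statuses and counters alone.
        return False

    default_state = {
        "status": "ACTIVE",
        "success_count": 0,
        "failure_count": 0,
        "last_success_timestamp": None,
        "last_failure_timestamp": None,
    }
    # Each proxy URL maps to its own JSON-encoded state.
    store[key] = {url: json.dumps(default_state) for url in proxies}
    return True


if __name__ == "__main__":
    store: HashStore = {}
    key = "proxies:ytdlp-ops-airflow-service"
    created = ensure_initialized(store, key, ["socks5://10.0.0.1:1080"])
    print(created, json.loads(store[key]["socks5://10.0.0.1:1080"])["status"])
```

A second call with the same key leaves the stored state untouched, which is the behavior the lazy initialization step describes: Redis is written only when the hash does not yet exist.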