Provide updates on ytdlp dags
parent 61906a57ef
commit 274bef5370
@ -1,46 +1,78 @@
# Architecture and Description of the YTDLP Airflow DAGs

This document describes the architecture and purpose of the DAGs used to download videos from YouTube. The system follows a "Sensor/Worker" pattern to provide continuous, parallel processing.

This document describes the architecture and purpose of the DAGs used to download videos from YouTube. The system is built around a continuous, self-sustaining loop for parallel, fault-tolerant processing.

## Main Processing Loop

### `ytdlp_sensor_redis_queue` (Sensor)

Processing is performed by two main DAGs that work as a pair: an orchestrator and a worker.

- **Purpose:** Pulls URLs to download from the Redis queue and launches workers to process them.
- **Operating principle (trigger-based launch):**
  - **On trigger:** When the `ytdlp_worker_per_url` worker finishes successfully, it immediately triggers the sensor. This keeps processing continuous, with no idle gaps. Scheduled runs are disabled to avoid re-launching tasks for blocked accounts.
  - **Logic:** Pulls a batch of URLs from Redis (the `_inbox` list). If the queue is empty, the DAG finishes successfully and waits for the next triggered run.
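As a minimal illustration of the queue read described above (a sketch, not the DAG's actual code), pulling a batch of URLs from the `_inbox` list with redis-py could look like this; the `video_queue` base name and the batch size of 5 are assumptions based on defaults used elsewhere in this repository:

```python
# Sketch only: pop up to `batch_size` URLs from the head of <queue_name>_inbox.
import redis

def pull_url_batch(client: redis.Redis, queue_name: str = "video_queue", batch_size: int = 5):
    inbox_key = f"{queue_name}_inbox"
    urls = []
    for _ in range(batch_size):
        raw = client.lpop(inbox_key)  # returns None once the list is empty
        if raw is None:
            break
        urls.append(raw.decode("utf-8"))
    return urls

# Example usage (connection details are illustrative):
# client = redis.Redis(host="localhost", port=6379)
# batch = pull_url_batch(client)
```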
### `ytdlp_ops_orchestrator` (Ignition System)

### `ytdlp_worker_per_url` (Worker)

- **Purpose:** Processes a single URL, downloads the video, and keeps the loop going.
- **Purpose:** This DAG acts as the "ignition system" that starts processing. It is triggered manually to launch the specified number of parallel worker loops.
- **Operating principle:**
  - Receives a single URL from the sensor.
  - Calls the `ytdlp-ops-auth` service to obtain `info.json` and a `socks5` proxy.
  - Downloads the video using the data it received. (TODO: replace the `yt-dlp` command-line invocation with a library call.)
  - Depending on the outcome (success/failure), writes the result to the corresponding Redis hash (`_result` or `_fail`).
  - On success, re-triggers the `ytdlp_sensor_redis_queue` sensor to continue the processing loop. On failure, the loop stops for manual diagnosis.
  - It does **not** process URLs itself.
  - Its only job is to trigger the configured number of `ytdlp_ops_worker_per_url` DAGs.
  - It passes all required configuration (account pool, Redis connection, and so on) to the workers, as sketched below.
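A condensed sketch of that ignition step; it reduces the batching logic in `dags/ytdlp_ops_orchestrator.py` (added later in this diff) to its core `trigger_dag` loop:

```python
# Condensed from dags/ytdlp_ops_orchestrator.py: fire-and-forget worker launches.
import time
from airflow.api.common.trigger_dag import trigger_dag

def ignite_workers(params: dict, parent_run_id: str) -> None:
    total = int(params["total_workers"])
    conf = {k: v for k, v in params.items() if k != "url"}  # workers pull their own URL
    for i in range(total):
        trigger_dag(
            dag_id="ytdlp_ops_worker_per_url",
            run_id=f"ignited_{parent_run_id}_{i}",
            conf=conf,
            replace_microseconds=False,
        )
        time.sleep(int(params["delay_between_workers_s"]))  # pacing between launches
```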
### `ytdlp_ops_worker_per_url` (Self-Sustaining Worker)

- **Purpose:** This DAG processes a single URL and is designed to run in a continuous loop.
- **Operating principle:**
  1. **Start:** The initial run is triggered by `ytdlp_ops_orchestrator`.
  2. **Getting a task:** The worker pulls one URL from the `_inbox` queue in Redis. If the queue is empty, the run finishes and that processing "lane" stops.
  3. **Processing:** It calls the `ytdlp-ops-server` service to obtain `info.json` and a proxy, then downloads the video.
  4. **Continue or stop:**
     - **On success:** It triggers a new run of itself, creating a continuous loop that processes the next URL.
     - **On failure:** The loop is interrupted (if `stop_on_failure` is set to `True`), stopping that processing "lane". This prevents a single problematic URL or account from halting the whole system.
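For illustration only (this is not the worker's actual code): the `_result` and `_fail` queues are Redis hashes keyed by URL, so recording an outcome could look like the sketch below. The JSON-encoded value is an assumption, consistent with the JSON pretty-printing in the queue-management DAG later in this diff.

```python
# Sketch: record the outcome of one URL in the <queue_name>_result / _fail hash.
import json
import time
import redis

def record_outcome(client: redis.Redis, queue_name: str, url: str, ok: bool, details: dict) -> None:
    target = f"{queue_name}_result" if ok else f"{queue_name}_fail"
    payload = {"url": url, "timestamp": time.time(), **details}
    client.hset(target, url, json.dumps(payload))  # hash field = URL, value = JSON blob
```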
## Management DAGs

These DAGs are intended for manual queue management and are not part of the automatic loop.

### `ytdlp_mgmt_proxy_account`

- **`ytdlp_mgmt_queue_add_and_verify`**: Adds URLs to the task queue (`_inbox`) and then verifies the status of that queue.
- **`ytdlp_mgmt_queues_check_status`**: Shows the state and contents of all key queues (`_inbox`, `_progress`, `_result`, `_fail`). Useful for tracking processing progress.
- **`ytdlp_mgmt_queue_clear`**: Clears (fully deletes) the specified Redis queue. **Use with caution**, as the operation is irreversible.
- **Purpose:** The main tool for monitoring and managing the state of the resources used by `ytdlp-ops-server`.
- **Functionality:**
  - **Status view:** Shows the current status of every proxy and account (for example, `ACTIVE`, `BANNED`, `RESTING`); see the sketch after this list for how the same data can be fetched directly.
  - **Proxy management:** Manually ban, unban, or reset the status of proxies.
  - **Account management:** Manually ban or unban accounts.
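A condensed sketch of fetching those statuses directly from the Thrift service, reusing the calls that `dags/ytdlp_mgmt_proxy_account.py` (added later in this diff) makes:

```python
# Condensed from dags/ytdlp_mgmt_proxy_account.py: list proxy and account statuses.
from thrift.transport import TSocket, TTransport
from thrift.protocol import TBinaryProtocol
from pangramia.yt.tokens_ops import YTTokenOpService

def print_statuses(host, port, server_identity=None):
    transport = TTransport.TFramedTransport(TSocket.TSocket(host, port))
    client = YTTokenOpService.Client(TBinaryProtocol.TBinaryProtocol(transport))
    transport.open()
    try:
        for p in client.getProxyStatus(server_identity):
            print(p.serverIdentity, p.proxyUrl, p.status, p.successCount, p.failureCount)
        for a in client.getAccountStatus(accountId=None, accountPrefix=None):
            print(a.accountId, a.status, a.successCount, a.failureCount)
    finally:
        transport.close()
```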
## Resource Management Strategy (Proxies and Accounts)

The system uses an intelligent strategy for managing the lifecycle and state of accounts and proxies, in order to maximize the success rate and minimize bans.

- **Account lifecycle ("cooldown"):**
  - To prevent "burnout", accounts automatically move into a resting state (`RESTING`) after a period of intensive use.
  - Once the rest period expires, they automatically return to `ACTIVE` and become available to workers again.

- **Smart ban strategy:**
  - **Ban the account first:** On a serious error (for example, `BOT_DETECTED`), the system penalizes **only the account** that caused the failure. The proxy keeps working.
  - **Proxy ban via a "sliding window":** A proxy is banned automatically only if it shows **systematic failures with DIFFERENT accounts** within a short time window, which is a reliable indicator that the proxy itself is the problem. (A sketch of such a check appears at the end of this section.)

- **Monitoring:**
  - The `ytdlp_mgmt_proxy_account` DAG is the primary monitoring tool. It shows the current status of every resource, including the time remaining until banned or resting accounts become active again.
  - The execution graph of the `ytdlp_ops_worker_per_url` DAG now explicitly shows steps such as `assign_account`, `get_token`, `ban_account`, and `retry_get_token`, which makes debugging more transparent.
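The sliding-window check itself lives in `ytdlp-ops-server` and is not part of this diff; the following is only an illustrative sketch of how it could be expressed with a Redis sorted set (the key name, window, and threshold are hypothetical):

```python
# Illustrative only: the server's real sliding-window logic is not in this diff.
import time
import redis

WINDOW_S = 600          # assumed: consider failures from the last 10 minutes
DISTINCT_ACCOUNTS = 3   # assumed: ban once this many different accounts have failed

def record_failure_and_check(client: redis.Redis, proxy_url: str, account_id: str) -> bool:
    """Record a failure of `account_id` on `proxy_url`; return True if the proxy should be banned."""
    now = time.time()
    key = f"proxy_failures:{proxy_url}"              # hypothetical key layout
    client.zadd(key, {account_id: now})              # keep the latest failure time per account
    client.zremrangebyscore(key, 0, now - WINDOW_S)  # drop entries outside the window
    client.expire(key, WINDOW_S)
    # Ban only when several DIFFERENT accounts failed on this proxy recently.
    return client.zcard(key) >= DISTINCT_ACCOUNTS
```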
## External Services

### `ytdlp-ops-auth` (Thrift Service)

### `ytdlp-ops-server` (Thrift Service)

- **Purpose:** An external service that provides the authentication data (tokens, cookies, proxy) needed to download videos.
- **Interaction:** The worker DAG (`ytdlp_worker_per_url`) calls this service before starting a download to obtain the data required by `yt-dlp`.
- **Interaction:** The worker DAG (`ytdlp_ops_worker_per_url`) calls this service before starting a download to obtain the data required by `yt-dlp`.

## TODO (Planned Improvements)

## Worker DAG Logic (`ytdlp_ops_worker_per_url`)

- **Implement a "Circuit Breaker" mechanism:**
  - **Problem:** If a worker fails (for example, because an account was banned), the schedule-triggered sensor keeps creating new tasks for that same account, making the problem worse.
  - **Solution:**
    1. **Worker (`ytdlp_worker_per_url`):** On task failure, the worker should set a temporary lock flag in Redis for its `account_id` (for example, for 5-10 minutes).
    2. **Sensor (`ytdlp_sensor_redis_queue`):** Before checking the queue, the sensor should check for a lock flag for its `account_id`. If the account is locked, the sensor should skip the run, preventing new workers from being launched for the problematic account.
  - **Result:** This prevents repeated requests to a blocked account and gives the system time to recover. (A minimal sketch of the proposed lock flag follows below.)
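A minimal sketch of the proposed (not yet implemented) lock flag, assuming redis-py; the key name and TTL are illustrative:

```python
# Proposed circuit-breaker flag; key name and TTL are illustrative only.
import redis

LOCK_TTL_S = 600  # 10 minutes, per the suggestion above

def lock_account(client: redis.Redis, account_id: str) -> None:
    """Worker side: on failure, block the account for LOCK_TTL_S seconds."""
    client.set(f"account_lock:{account_id}", "1", ex=LOCK_TTL_S)

def is_account_locked(client: redis.Redis, account_id: str) -> bool:
    """Sensor side: check the flag before pulling work; skip the run if locked."""
    return client.exists(f"account_lock:{account_id}") > 0
```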
This DAG is the "workhorse" of the system. It is designed as a self-sustaining loop that processes one URL per run.

### Tasks and Their Purpose:

- **`pull_url_from_redis`**: Pulls one URL from the `_inbox` queue in Redis. If the queue is empty, the DAG finishes with a `skipped` status, stopping that processing "lane".
- **`assign_account`**: Selects the account for this run. It reuses the account that succeeded in the previous run of its "lane" (account affinity); on a first run it picks a random account.
- **`get_token`**: The core task. It calls `ytdlp-ops-server` to obtain `info.json`.
- **`handle_bannable_error_branch`**: If `get_token` fails with a bannable error, this branching task decides what to do next based on the `on_bannable_failure` policy.
- **`ban_account_and_prepare_for_retry`**: If the policy allows a retry, this task bans the failed account and selects a new one for the retry.
- **`retry_get_token`**: Makes a second attempt to obtain a token with the new account.
- **`ban_second_account_and_proxy`**: If the second attempt also fails, this task bans the second account and the proxy that was used.
- **`download_and_probe`**: If `get_token` (or `retry_get_token`) succeeded, this task uses `yt-dlp` to download the media and `ffmpeg` to verify the integrity of the downloaded file.
- **`mark_url_as_success`**: If `download_and_probe` succeeded, this task writes the result to the `_result` hash in Redis.
- **`handle_generic_failure`**: If any of the main tasks fails with an unrecoverable error, this task writes detailed error information to the `_fail` hash in Redis.
- **`decide_what_to_do_next`**: A branching task that runs after success or failure and decides whether to continue the loop.
- **`trigger_self_run`**: The task that actually triggers the next run of the DAG, creating the continuous loop (see the sketch after this list).
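The worker DAG's source is not included in this diff, so the following is only a sketch of how a `trigger_self_run` task could re-trigger the same DAG, reusing the `trigger_dag` call that the orchestrator uses:

```python
# Sketch only: the worker DAG's actual trigger_self_run implementation is not in this diff.
from airflow.api.common.trigger_dag import trigger_dag
from airflow.operators.python import PythonOperator

def trigger_self_run_callable(**context):
    """Start the next iteration of this worker's loop, carrying the current run's conf forward."""
    dag_run = context["dag_run"]
    trigger_dag(
        dag_id="ytdlp_ops_worker_per_url",
        run_id=f"self_{dag_run.run_id}_{context['ts_nodash']}",
        conf=dict(dag_run.conf or {}),
        replace_microseconds=False,
    )

# Inside the worker's `with DAG(...)` block:
trigger_self_run = PythonOperator(
    task_id="trigger_self_run",
    python_callable=trigger_self_run_callable,
)
```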
@ -1,197 +0,0 @@
|
||||
"""
|
||||
DAG to manage the state of proxies used by the ytdlp-ops-server.
|
||||
"""
|
||||
from __future__ import annotations
|
||||
|
||||
import logging
|
||||
from datetime import datetime
|
||||
|
||||
from airflow.models.dag import DAG
|
||||
from airflow.models.param import Param
|
||||
from airflow.operators.python import PythonOperator
|
||||
from airflow.utils.dates import days_ago
|
||||
|
||||
# Configure logging
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
# Import and apply Thrift exceptions patch for Airflow compatibility
|
||||
try:
|
||||
from thrift_exceptions_patch import patch_thrift_exceptions
|
||||
patch_thrift_exceptions()
|
||||
logger.info("Applied Thrift exceptions patch for Airflow compatibility.")
|
||||
except ImportError:
|
||||
logger.warning("Could not import thrift_exceptions_patch. Compatibility may be affected.")
|
||||
except Exception as e:
|
||||
logger.error(f"Error applying Thrift exceptions patch: {e}")
|
||||
|
||||
# Thrift imports
|
||||
try:
|
||||
from thrift.transport import TSocket, TTransport
|
||||
from thrift.protocol import TBinaryProtocol
|
||||
from pangramia.yt.tokens_ops import YTTokenOpService
|
||||
from pangramia.yt.exceptions.ttypes import PBServiceException, PBUserException
|
||||
except ImportError as e:
|
||||
logger.critical(f"Could not import Thrift modules: {e}. Ensure ytdlp-ops-auth package is installed.")
|
||||
# Fail DAG parsing if thrift modules are not available
|
||||
raise
|
||||
|
||||
def format_timestamp(ts_str: str) -> str:
|
||||
"""Formats a string timestamp into a human-readable date string."""
|
||||
if not ts_str:
|
||||
return ""
|
||||
try:
|
||||
ts_float = float(ts_str)
|
||||
if ts_float <= 0:
|
||||
return ""
|
||||
# Use datetime from the imported 'from datetime import datetime'
|
||||
dt_obj = datetime.fromtimestamp(ts_float)
|
||||
return dt_obj.strftime('%Y-%m-%d %H:%M:%S')
|
||||
except (ValueError, TypeError):
|
||||
return ts_str # Return original string if conversion fails
|
||||
|
||||
def get_thrift_client(host: str, port: int):
|
||||
"""Helper function to create and connect a Thrift client."""
|
||||
transport = TSocket.TSocket(host, port)
|
||||
transport = TTransport.TFramedTransport(transport)
|
||||
protocol = TBinaryProtocol.TBinaryProtocol(transport)
|
||||
client = YTTokenOpService.Client(protocol)
|
||||
transport.open()
|
||||
logger.info(f"Connected to Thrift server at {host}:{port}")
|
||||
return client, transport
|
||||
|
||||
def manage_proxies_callable(**context):
|
||||
"""Main callable to interact with the proxy management endpoints."""
|
||||
params = context["params"]
|
||||
action = params["action"]
|
||||
host = params["host"]
|
||||
port = params["port"]
|
||||
server_identity = params.get("server_identity")
|
||||
proxy_url = params.get("proxy_url")
|
||||
|
||||
if not server_identity and action in ["ban", "unban", "reset_all"]:
|
||||
raise ValueError(f"A 'server_identity' is required for the '{action}' action.")
|
||||
|
||||
client, transport = None, None
|
||||
try:
|
||||
client, transport = get_thrift_client(host, port)
|
||||
|
||||
if action == "list":
|
||||
logger.info(f"Listing proxy statuses for server: {server_identity or 'ALL'}")
|
||||
statuses = client.getProxyStatus(server_identity)
|
||||
if not statuses:
|
||||
logger.info("No proxy statuses found.")
|
||||
print("No proxy statuses found.")
|
||||
else:
|
||||
from tabulate import tabulate
|
||||
status_list = [
|
||||
{
|
||||
"Server": s.serverIdentity,
|
||||
"Proxy URL": s.proxyUrl,
|
||||
"Status": s.status,
|
||||
"Success": s.successCount,
|
||||
"Failures": s.failureCount,
|
||||
"Last Success": format_timestamp(s.lastSuccessTimestamp),
|
||||
"Last Failure": format_timestamp(s.lastFailureTimestamp),
|
||||
}
|
||||
for s in statuses
|
||||
]
|
||||
print("\n--- Proxy Statuses ---")
|
||||
print(tabulate(status_list, headers="keys", tablefmt="grid"))
|
||||
print("----------------------\n")
|
||||
|
||||
elif action == "ban":
|
||||
if not proxy_url:
|
||||
raise ValueError("A 'proxy_url' is required to ban a proxy.")
|
||||
logger.info(f"Banning proxy '{proxy_url}' for server '{server_identity}'...")
|
||||
success = client.banProxy(proxy_url, server_identity)
|
||||
if success:
|
||||
logger.info("Successfully banned proxy.")
|
||||
print(f"Successfully banned proxy '{proxy_url}' for server '{server_identity}'.")
|
||||
else:
|
||||
logger.error("Failed to ban proxy.")
|
||||
raise Exception("Server returned failure for banProxy operation.")
|
||||
|
||||
elif action == "unban":
|
||||
if not proxy_url:
|
||||
raise ValueError("A 'proxy_url' is required to unban a proxy.")
|
||||
logger.info(f"Unbanning proxy '{proxy_url}' for server '{server_identity}'...")
|
||||
success = client.unbanProxy(proxy_url, server_identity)
|
||||
if success:
|
||||
logger.info("Successfully unbanned proxy.")
|
||||
print(f"Successfully unbanned proxy '{proxy_url}' for server '{server_identity}'.")
|
||||
else:
|
||||
logger.error("Failed to unban proxy.")
|
||||
raise Exception("Server returned failure for unbanProxy operation.")
|
||||
|
||||
elif action == "reset_all":
|
||||
logger.info(f"Resetting all proxy statuses for server '{server_identity}'...")
|
||||
success = client.resetAllProxyStatuses(server_identity)
|
||||
if success:
|
||||
logger.info("Successfully reset all proxy statuses.")
|
||||
print(f"Successfully reset all proxy statuses for server '{server_identity}'.")
|
||||
else:
|
||||
logger.error("Failed to reset all proxy statuses.")
|
||||
raise Exception("Server returned failure for resetAllProxyStatuses operation.")
|
||||
|
||||
else:
|
||||
raise ValueError(f"Invalid action: {action}")
|
||||
|
||||
except (PBServiceException, PBUserException) as e:
|
||||
logger.error(f"Thrift error performing action '{action}': {e.message}", exc_info=True)
|
||||
raise
|
||||
except Exception as e:
|
||||
logger.error(f"Error performing action '{action}': {e}", exc_info=True)
|
||||
raise
|
||||
finally:
|
||||
if transport and transport.isOpen():
|
||||
transport.close()
|
||||
logger.info("Thrift connection closed.")
|
||||
|
||||
with DAG(
|
||||
dag_id="ytdlp_mgmt_proxy",
|
||||
start_date=days_ago(1),
|
||||
schedule=None,
|
||||
catchup=False,
|
||||
tags=["ytdlp", "utility", "proxy"],
|
||||
doc_md="""
|
||||
### YT-DLP Proxy Manager DAG
|
||||
|
||||
This DAG provides tools to manage the state of proxies used by the `ytdlp-ops-server`.
|
||||
You can view statuses, and manually ban, unban, or reset proxies for a specific server instance.
|
||||
|
||||
**Parameters:**
|
||||
- `host`: The hostname or IP of the `ytdlp-ops-server` Thrift service.
|
||||
- `port`: The port of the Thrift service.
|
||||
- `action`: The operation to perform.
|
||||
- `list`: List proxy statuses. Provide a `server_identity` to query a specific server, or leave it blank to query the server instance you are connected to.
|
||||
- `ban`: Ban a specific proxy. Requires `server_identity` and `proxy_url`.
|
||||
- `unban`: Un-ban a specific proxy. Requires `server_identity` and `proxy_url`.
|
||||
- `reset_all`: Reset all proxies for a server to `ACTIVE`. Requires `server_identity`.
|
||||
- `server_identity`: The unique identifier for the server instance (e.g., `ytdlp-ops-airflow-service`).
|
||||
- `proxy_url`: The full URL of the proxy to act upon (e.g., `socks5://host:port`).
|
||||
""",
|
||||
params={
|
||||
"host": Param("89.253.221.173", type="string", description="The hostname of the ytdlp-ops-server service."),
|
||||
"port": Param(9090, type="integer", description="The port of the ytdlp-ops-server service."),
|
||||
"action": Param(
|
||||
"list",
|
||||
type="string",
|
||||
enum=["list", "ban", "unban", "reset_all"],
|
||||
description="The management action to perform.",
|
||||
),
|
||||
"server_identity": Param(
|
||||
"ytdlp-ops-airflow-service",
|
||||
type=["null", "string"],
|
||||
description="The identity of the server to manage. Leave blank to query the connected server instance.",
|
||||
),
|
||||
"proxy_url": Param(
|
||||
None,
|
||||
type=["null", "string"],
|
||||
description="The proxy URL to ban/unban (e.g., 'socks5://host:port').",
|
||||
),
|
||||
},
|
||||
) as dag:
|
||||
proxy_management_task = PythonOperator(
|
||||
task_id="proxy_management_task",
|
||||
python_callable=manage_proxies_callable,
|
||||
)
|
||||
dags/ytdlp_mgmt_proxy_account.py (new file, 405 lines)
@ -0,0 +1,405 @@
|
||||
"""
|
||||
DAG to manage the state of proxies and accounts used by the ytdlp-ops-server.
|
||||
"""
|
||||
from __future__ import annotations
|
||||
|
||||
import logging
|
||||
from datetime import datetime
|
||||
import socket
|
||||
|
||||
from airflow.exceptions import AirflowException
|
||||
from airflow.models.dag import DAG
|
||||
from airflow.models.param import Param
|
||||
from airflow.operators.python import PythonOperator
|
||||
from airflow.utils.dates import days_ago
|
||||
from airflow.models.variable import Variable
|
||||
from airflow.providers.redis.hooks.redis import RedisHook
|
||||
|
||||
# Configure logging
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
# Import and apply Thrift exceptions patch for Airflow compatibility
|
||||
try:
|
||||
from thrift_exceptions_patch import patch_thrift_exceptions
|
||||
patch_thrift_exceptions()
|
||||
logger.info("Applied Thrift exceptions patch for Airflow compatibility.")
|
||||
except ImportError:
|
||||
logger.warning("Could not import thrift_exceptions_patch. Compatibility may be affected.")
|
||||
except Exception as e:
|
||||
logger.error(f"Error applying Thrift exceptions patch: {e}")
|
||||
|
||||
# Thrift imports
|
||||
try:
|
||||
from thrift.transport import TSocket, TTransport
|
||||
from thrift.protocol import TBinaryProtocol
|
||||
from pangramia.yt.tokens_ops import YTTokenOpService
|
||||
from pangramia.yt.exceptions.ttypes import PBServiceException, PBUserException
|
||||
except ImportError as e:
|
||||
logger.critical(f"Could not import Thrift modules: {e}. Ensure ytdlp-ops-auth package is installed.")
|
||||
# Fail DAG parsing if thrift modules are not available
|
||||
raise
|
||||
|
||||
DEFAULT_YT_AUTH_SERVICE_IP = Variable.get("YT_AUTH_SERVICE_IP", default_var="16.162.82.212")
|
||||
DEFAULT_YT_AUTH_SERVICE_PORT = Variable.get("YT_AUTH_SERVICE_PORT", default_var=9080)
|
||||
DEFAULT_REDIS_CONN_ID = "redis_default"
|
||||
|
||||
|
||||
# Helper function to connect to Redis, similar to other DAGs
|
||||
def _get_redis_client(redis_conn_id: str):
|
||||
"""Gets a Redis client from an Airflow connection."""
|
||||
try:
|
||||
# Use the imported RedisHook
|
||||
redis_hook = RedisHook(redis_conn_id=redis_conn_id)
|
||||
# get_conn returns a redis.Redis client
|
||||
return redis_hook.get_conn()
|
||||
except Exception as e:
|
||||
logger.error(f"Failed to connect to Redis using connection '{redis_conn_id}': {e}")
|
||||
# Use the imported AirflowException
|
||||
raise AirflowException(f"Redis connection failed: {e}")
|
||||
|
||||
|
||||
def format_timestamp(ts_str: str) -> str:
|
||||
"""Formats a string timestamp into a human-readable date string."""
|
||||
if not ts_str:
|
||||
return ""
|
||||
try:
|
||||
ts_float = float(ts_str)
|
||||
if ts_float <= 0:
|
||||
return ""
|
||||
# Use datetime from the imported 'from datetime import datetime'
|
||||
dt_obj = datetime.fromtimestamp(ts_float)
|
||||
return dt_obj.strftime('%Y-%m-%d %H:%M:%S')
|
||||
except (ValueError, TypeError):
|
||||
return ts_str # Return original string if conversion fails
|
||||
|
||||
def get_thrift_client(host: str, port: int):
|
||||
"""Helper function to create and connect a Thrift client."""
|
||||
transport = TSocket.TSocket(host, port)
|
||||
transport.setTimeout(30 * 1000) # 30s timeout
|
||||
transport = TTransport.TFramedTransport(transport)
|
||||
protocol = TBinaryProtocol.TBinaryProtocol(transport)
|
||||
client = YTTokenOpService.Client(protocol)
|
||||
transport.open()
|
||||
logger.info(f"Connected to Thrift server at {host}:{port}")
|
||||
return client, transport
|
||||
|
||||
def _list_proxy_statuses(client, server_identity):
|
||||
"""Lists the status of proxies."""
|
||||
logger.info(f"Listing proxy statuses for server: {server_identity or 'ALL'}")
|
||||
statuses = client.getProxyStatus(server_identity)
|
||||
if not statuses:
|
||||
logger.info("No proxy statuses found.")
|
||||
print("No proxy statuses found.")
|
||||
return
|
||||
|
||||
from tabulate import tabulate
|
||||
status_list = []
|
||||
# This is forward-compatible: it checks for new attributes before using them.
|
||||
has_extended_info = hasattr(statuses[0], 'recentAccounts') or hasattr(statuses[0], 'recentMachines')
|
||||
|
||||
headers = ["Server", "Proxy URL", "Status", "Success", "Failures", "Last Success", "Last Failure"]
|
||||
if has_extended_info:
|
||||
headers.extend(["Recent Accounts", "Recent Machines"])
|
||||
|
||||
for s in statuses:
|
||||
status_item = {
|
||||
"Server": s.serverIdentity,
|
||||
"Proxy URL": s.proxyUrl,
|
||||
"Status": s.status,
|
||||
"Success": s.successCount,
|
||||
"Failures": s.failureCount,
|
||||
"Last Success": format_timestamp(s.lastSuccessTimestamp),
|
||||
"Last Failure": format_timestamp(s.lastFailureTimestamp),
|
||||
}
|
||||
if has_extended_info:
|
||||
recent_accounts = getattr(s, 'recentAccounts', [])
|
||||
recent_machines = getattr(s, 'recentMachines', [])
|
||||
status_item["Recent Accounts"] = "\n".join(recent_accounts) if recent_accounts else "N/A"
|
||||
status_item["Recent Machines"] = "\n".join(recent_machines) if recent_machines else "N/A"
|
||||
status_list.append(status_item)
|
||||
|
||||
print("\n--- Proxy Statuses ---")
|
||||
# The f-string with a newline ensures the table starts on a new line in the logs.
|
||||
print(f"\n{tabulate(status_list, headers='keys', tablefmt='grid')}")
|
||||
print("----------------------\n")
|
||||
if not has_extended_info:
|
||||
logger.warning("Server does not seem to support 'recentAccounts' or 'recentMachines' fields yet.")
|
||||
print("NOTE: To see Recent Accounts/Machines, the server's `getProxyStatus` method must be updated to return these fields.")
|
||||
|
||||
|
||||
def _list_account_statuses(client, account_id):
|
||||
"""Lists the status of accounts."""
|
||||
logger.info(f"Listing account statuses for account: {account_id or 'ALL'}")
|
||||
try:
|
||||
# The thrift method takes accountId (specific) or accountPrefix.
|
||||
# If account_id is provided, we use it. If not, we get all by leaving both params as None.
|
||||
statuses = client.getAccountStatus(accountId=account_id, accountPrefix=None)
|
||||
if not statuses:
|
||||
logger.info("No account statuses found.")
|
||||
print("\n--- Account Statuses ---\nNo account statuses found.\n------------------------\n")
|
||||
return
|
||||
|
||||
from tabulate import tabulate
|
||||
status_list = []
|
||||
|
||||
for s in statuses:
|
||||
# Determine the last activity timestamp for sorting
|
||||
last_success = float(s.lastSuccessTimestamp) if s.lastSuccessTimestamp else 0
|
||||
last_failure = float(s.lastFailureTimestamp) if s.lastFailureTimestamp else 0
|
||||
last_activity = max(last_success, last_failure)
|
||||
|
||||
status_item = {
|
||||
"Account ID": s.accountId,
|
||||
"Status": s.status,
|
||||
"Success": s.successCount,
|
||||
"Failures": s.failureCount,
|
||||
"Last Success": format_timestamp(s.lastSuccessTimestamp),
|
||||
"Last Failure": format_timestamp(s.lastFailureTimestamp),
|
||||
"Last Proxy": s.lastUsedProxy or "N/A",
|
||||
"Last Machine": s.lastUsedMachine or "N/A",
|
||||
"_last_activity": last_activity, # Add a temporary key for sorting
|
||||
}
|
||||
status_list.append(status_item)
|
||||
|
||||
# Sort the list by the last activity timestamp in descending order
|
||||
status_list.sort(key=lambda item: item.get('_last_activity', 0), reverse=True)
|
||||
|
||||
# Remove the temporary sort key before printing
|
||||
for item in status_list:
|
||||
del item['_last_activity']
|
||||
|
||||
print("\n--- Account Statuses ---")
|
||||
# The f-string with a newline ensures the table starts on a new line in the logs.
|
||||
print(f"\n{tabulate(status_list, headers='keys', tablefmt='grid')}")
|
||||
print("------------------------\n")
|
||||
except (PBServiceException, PBUserException) as e:
|
||||
logger.error(f"Failed to get account statuses: {e.message}", exc_info=True)
|
||||
print(f"\nERROR: Could not retrieve account statuses. Server returned: {e.message}\n")
|
||||
except Exception as e:
|
||||
logger.error(f"An unexpected error occurred while getting account statuses: {e}", exc_info=True)
|
||||
print(f"\nERROR: An unexpected error occurred: {e}\n")
|
||||
|
||||
|
||||
def manage_system_callable(**context):
|
||||
"""Main callable to interact with the system management endpoints."""
|
||||
params = context["params"]
|
||||
entity = params["entity"]
|
||||
action = params["action"]
|
||||
host = params["host"]
|
||||
port = params["port"]
|
||||
server_identity = params.get("server_identity")
|
||||
proxy_url = params.get("proxy_url")
|
||||
account_id = params.get("account_id")
|
||||
|
||||
if action in ["ban", "unban", "reset_all"] and entity == "proxy" and not server_identity:
|
||||
raise ValueError(f"A 'server_identity' is required for proxy action '{action}'.")
|
||||
if action in ["ban", "unban"] and entity == "account" and not account_id:
|
||||
raise ValueError(f"An 'account_id' is required for account action '{action}'.")
|
||||
|
||||
# Handle direct Redis action separately to avoid creating an unnecessary Thrift connection.
|
||||
if entity == "account" and action == "remove_all":
|
||||
confirm = params.get("confirm_remove_all_accounts", False)
|
||||
if not confirm:
|
||||
message = "FATAL: 'remove_all' action requires 'confirm_remove_all_accounts' to be set to True. No accounts were removed."
|
||||
logger.error(message)
|
||||
print(f"\nERROR: {message}\n")
|
||||
raise ValueError(message)
|
||||
|
||||
redis_conn_id = params["redis_conn_id"]
|
||||
account_prefix = params.get("account_id") # Repurpose account_id param as an optional prefix
|
||||
|
||||
redis_client = _get_redis_client(redis_conn_id)
|
||||
|
||||
pattern = f"account_status:{account_prefix}*" if account_prefix else "account_status:*"
|
||||
logger.warning(f"Searching for account status keys in Redis with pattern: '{pattern}'")
|
||||
|
||||
# scan_iter returns bytes, so we don't need to decode for deletion
|
||||
keys_to_delete = [key for key in redis_client.scan_iter(pattern)]
|
||||
|
||||
if not keys_to_delete:
|
||||
logger.info(f"No account keys found matching pattern '{pattern}'. Nothing to do.")
|
||||
print(f"\nNo accounts found matching pattern '{pattern}'.\n")
|
||||
return
|
||||
|
||||
logger.warning(f"Found {len(keys_to_delete)} account keys to delete. This is a destructive operation!")
|
||||
print(f"\nWARNING: Found {len(keys_to_delete)} accounts to remove from Redis.")
|
||||
# Decode for printing
|
||||
for key in keys_to_delete[:10]:
|
||||
print(f" - {key.decode('utf-8')}")
|
||||
if len(keys_to_delete) > 10:
|
||||
print(f" ... and {len(keys_to_delete) - 10} more.")
|
||||
|
||||
deleted_count = redis_client.delete(*keys_to_delete)
|
||||
logger.info(f"Successfully deleted {deleted_count} account keys from Redis.")
|
||||
print(f"\nSuccessfully removed {deleted_count} accounts from Redis.\n")
|
||||
return # End execution for this action
|
||||
|
||||
client, transport = None, None
|
||||
try:
|
||||
client, transport = get_thrift_client(host, port)
|
||||
|
||||
if entity == "proxy":
|
||||
if action == "list":
|
||||
_list_proxy_statuses(client, server_identity)
|
||||
elif action == "ban":
|
||||
if not proxy_url: raise ValueError("A 'proxy_url' is required.")
|
||||
logger.info(f"Banning proxy '{proxy_url}' for server '{server_identity}'...")
|
||||
client.banProxy(proxy_url, server_identity)
|
||||
print(f"Successfully sent request to ban proxy '{proxy_url}'.")
|
||||
elif action == "unban":
|
||||
if not proxy_url: raise ValueError("A 'proxy_url' is required.")
|
||||
logger.info(f"Unbanning proxy '{proxy_url}' for server '{server_identity}'...")
|
||||
client.unbanProxy(proxy_url, server_identity)
|
||||
print(f"Successfully sent request to unban proxy '{proxy_url}'.")
|
||||
elif action == "reset_all":
|
||||
logger.info(f"Resetting all proxy statuses for server '{server_identity}'...")
|
||||
client.resetAllProxyStatuses(server_identity)
|
||||
print(f"Successfully sent request to reset all proxy statuses for '{server_identity}'.")
|
||||
else:
|
||||
raise ValueError(f"Invalid action '{action}' for entity 'proxy'.")
|
||||
|
||||
elif entity == "account":
|
||||
if action == "list":
|
||||
_list_account_statuses(client, account_id)
|
||||
elif action == "ban":
|
||||
if not account_id: raise ValueError("An 'account_id' is required.")
|
||||
reason = f"Manual ban from Airflow mgmt DAG by {socket.gethostname()}"
|
||||
logger.info(f"Banning account '{account_id}'...")
|
||||
client.banAccount(accountId=account_id, reason=reason)
|
||||
print(f"Successfully sent request to ban account '{account_id}'.")
|
||||
elif action == "unban":
|
||||
if not account_id: raise ValueError("An 'account_id' is required.")
|
||||
reason = f"Manual un-ban from Airflow mgmt DAG by {socket.gethostname()}"
|
||||
logger.info(f"Unbanning account '{account_id}'...")
|
||||
client.unbanAccount(accountId=account_id, reason=reason)
|
||||
print(f"Successfully sent request to unban account '{account_id}'.")
|
||||
elif action == "reset_all":
|
||||
account_prefix = account_id # Repurpose account_id param as an optional prefix
|
||||
logger.info(f"Resetting all account statuses to ACTIVE (prefix: '{account_prefix or 'ALL'}')...")
|
||||
|
||||
all_statuses = client.getAccountStatus(accountId=None, accountPrefix=account_prefix)
|
||||
if not all_statuses:
|
||||
print(f"No accounts found with prefix '{account_prefix or 'ALL'}' to reset.")
|
||||
return
|
||||
|
||||
accounts_to_reset = [s.accountId for s in all_statuses]
|
||||
logger.info(f"Found {len(accounts_to_reset)} accounts to reset.")
|
||||
print(f"Found {len(accounts_to_reset)} accounts. Sending unban request for each...")
|
||||
|
||||
reset_count = 0
|
||||
fail_count = 0
|
||||
for acc_id in accounts_to_reset:
|
||||
try:
|
||||
reason = f"Manual reset from Airflow mgmt DAG by {socket.gethostname()}"
|
||||
client.unbanAccount(accountId=acc_id, reason=reason)
|
||||
logger.info(f" - Sent reset (unban) for '{acc_id}'.")
|
||||
reset_count += 1
|
||||
except Exception as e:
|
||||
logger.error(f" - Failed to reset account '{acc_id}': {e}")
|
||||
fail_count += 1
|
||||
|
||||
print(f"\nSuccessfully sent reset requests for {reset_count} accounts.")
|
||||
if fail_count > 0:
|
||||
print(f"Failed to send reset requests for {fail_count} accounts. See logs for details.")
|
||||
|
||||
# Optionally, list statuses again to confirm
|
||||
print("\n--- Listing statuses after reset ---")
|
||||
_list_account_statuses(client, account_prefix)
|
||||
else:
|
||||
raise ValueError(f"Invalid action '{action}' for entity 'account'.")
|
||||
|
||||
elif entity == "all":
|
||||
if action == "list":
|
||||
print("\nListing all entities...")
|
||||
_list_proxy_statuses(client, server_identity)
|
||||
_list_account_statuses(client, account_id)
|
||||
else:
|
||||
raise ValueError(f"Action '{action}' is not supported for entity 'all'. Only 'list' is supported.")
|
||||
|
||||
except (PBServiceException, PBUserException) as e:
|
||||
logger.error(f"Thrift error performing action '{action}': {e.message}", exc_info=True)
|
||||
raise
|
||||
except NotImplementedError as e:
|
||||
logger.error(f"Feature not implemented: {e}", exc_info=True)
|
||||
raise
|
||||
except Exception as e:
|
||||
logger.error(f"Error performing action '{action}': {e}", exc_info=True)
|
||||
raise
|
||||
finally:
|
||||
if transport and transport.isOpen():
|
||||
transport.close()
|
||||
logger.info("Thrift connection closed.")
|
||||
|
||||
with DAG(
|
||||
dag_id="ytdlp_mgmt_proxy_account",
|
||||
start_date=days_ago(1),
|
||||
schedule=None,
|
||||
catchup=False,
|
||||
tags=["ytdlp", "utility", "proxy", "account", "management"],
|
||||
doc_md="""
|
||||
### YT-DLP Proxy and Account Manager DAG
|
||||
|
||||
This DAG provides tools to manage the state of **proxies and accounts** used by the `ytdlp-ops-server`.
|
||||
|
||||
**Parameters:**
|
||||
- `host`, `port`: Connection details for the `ytdlp-ops-server` Thrift service.
|
||||
- `entity`: The type of resource to manage (`proxy`, `account`, or `all`).
|
||||
- `action`: The operation to perform.
|
||||
- `list`: View statuses. For `entity: all`, lists both proxies and accounts.
|
||||
- `ban`: Ban a specific proxy or account.
|
||||
- `unban`: Un-ban a specific proxy or account.
|
||||
- `reset_all`: Reset all proxies for a server (or all accounts) to `ACTIVE`.
|
||||
- `remove_all`: **Deletes all account status keys** from Redis for a given prefix. This is a destructive action.
|
||||
- `server_identity`: Required for most proxy actions.
|
||||
- `proxy_url`: Required for banning/unbanning a specific proxy.
|
||||
- `account_id`: Required for managing a specific account. For `action: reset_all` or `remove_all` on `entity: account`, this can be used as an optional prefix to filter which accounts to act on.
|
||||
- `confirm_remove_all_accounts`: **Required for `remove_all` action.** Must be set to `True` to confirm deletion.
|
||||
""",
|
||||
params={
|
||||
"host": Param(DEFAULT_YT_AUTH_SERVICE_IP, type="string", description="The hostname of the ytdlp-ops-server service. Default is from Airflow variable YT_AUTH_SERVICE_IP or hardcoded."),
|
||||
"port": Param(DEFAULT_YT_AUTH_SERVICE_PORT, type="integer", description="The port of the ytdlp-ops-server service (Envoy load balancer). Default is from Airflow variable YT_AUTH_SERVICE_PORT or hardcoded."),
|
||||
"entity": Param(
|
||||
"all",
|
||||
type="string",
|
||||
enum=["proxy", "account", "all"],
|
||||
description="The type of entity to manage. Use 'all' with action 'list' to see both.",
|
||||
),
|
||||
"action": Param(
|
||||
"list",
|
||||
type="string",
|
||||
enum=["list", "ban", "unban", "reset_all", "remove_all"],
|
||||
description="The management action to perform. `reset_all` for proxies/accounts. `remove_all` for accounts only.",
|
||||
),
|
||||
"server_identity": Param(
|
||||
"ytdlp-ops-airflow-service",
|
||||
type=["null", "string"],
|
||||
description="The identity of the server instance (for proxy management).",
|
||||
),
|
||||
"proxy_url": Param(
|
||||
None,
|
||||
type=["null", "string"],
|
||||
description="The proxy URL to act upon (e.g., 'socks5://host:port').",
|
||||
),
|
||||
"account_id": Param(
|
||||
None,
|
||||
type=["null", "string"],
|
||||
description="The account ID to act upon. For `reset_all` or `remove_all` on accounts, this can be an optional prefix.",
|
||||
),
|
||||
"confirm_remove_all_accounts": Param(
|
||||
False,
|
||||
type="boolean",
|
||||
title="[remove_all] Confirm Deletion",
|
||||
description="Must be set to True to execute the 'remove_all' action for accounts. This is a destructive operation.",
|
||||
),
|
||||
"redis_conn_id": Param(
|
||||
DEFAULT_REDIS_CONN_ID,
|
||||
type="string",
|
||||
title="Redis Connection ID",
|
||||
description="The Airflow connection ID for the Redis server (used for 'remove_all').",
|
||||
),
|
||||
},
|
||||
) as dag:
|
||||
system_management_task = PythonOperator(
|
||||
task_id="system_management_task",
|
||||
python_callable=manage_system_callable,
|
||||
)
|
||||
@ -164,7 +164,7 @@ def clear_queue_callable(**context):
|
||||
redis_conn_id = params['redis_conn_id']
|
||||
queue_to_clear = params['queue_to_clear']
|
||||
dump_queues = params['dump_queues']
|
||||
# Get the rendered dump_dir from the templates_dict passed to the operator
|
||||
# The value from templates_dict is already rendered by Airflow.
|
||||
dump_dir = context['templates_dict']['dump_dir']
|
||||
dump_patterns = params['dump_patterns'].split(',') if params.get('dump_patterns') else []
|
||||
|
||||
@ -191,34 +191,43 @@ def clear_queue_callable(**context):
|
||||
|
||||
|
||||
def list_contents_callable(**context):
|
||||
"""Lists the contents of the specified Redis key (list or hash)."""
|
||||
"""Lists the contents of the specified Redis key(s) (list or hash)."""
|
||||
params = context['params']
|
||||
redis_conn_id = params['redis_conn_id']
|
||||
queue_to_list = params['queue_to_list']
|
||||
queues_to_list_str = params.get('queue_to_list')
|
||||
max_items = params.get('max_items', 10)
|
||||
|
||||
if not queue_to_list:
|
||||
if not queues_to_list_str:
|
||||
raise ValueError("Parameter 'queue_to_list' cannot be empty.")
|
||||
|
||||
logger.info(f"Attempting to list contents of Redis key '{queue_to_list}' (max: {max_items}) using connection '{redis_conn_id}'.")
|
||||
try:
|
||||
queues_to_list = [q.strip() for q in queues_to_list_str.split(',') if q.strip()]
|
||||
|
||||
if not queues_to_list:
|
||||
logger.info("No valid queue names provided in 'queue_to_list'. Nothing to do.")
|
||||
return
|
||||
|
||||
logger.info(f"Attempting to list contents for {len(queues_to_list)} Redis key(s): {queues_to_list}")
|
||||
|
||||
redis_client = _get_redis_client(redis_conn_id)
|
||||
|
||||
for queue_to_list in queues_to_list:
|
||||
# Add a newline for better separation in logs
|
||||
logger.info(f"\n--- Listing contents of Redis key '{queue_to_list}' (max: {max_items}) ---")
|
||||
try:
|
||||
key_type_bytes = redis_client.type(queue_to_list)
|
||||
key_type = key_type_bytes.decode('utf-8') # Decode type
|
||||
|
||||
if key_type == 'list':
|
||||
list_length = redis_client.llen(queue_to_list)
|
||||
# Get the last N items, which are the most recently added with rpush
|
||||
items_to_fetch = min(max_items, list_length)
|
||||
# lrange with negative indices gets items from the end of the list.
|
||||
# -N to -1 gets the last N items.
|
||||
contents_bytes = redis_client.lrange(queue_to_list, -items_to_fetch, -1)
|
||||
contents = [item.decode('utf-8') for item in contents_bytes]
|
||||
# Reverse the list so the absolute most recent item is printed first
|
||||
contents.reverse()
|
||||
logger.info(f"--- Contents of Redis List '{queue_to_list}' (showing most recent {len(contents)} of {list_length}) ---")
|
||||
logger.info(f"--- Contents of Redis List '{queue_to_list}' ---")
|
||||
logger.info(f"Total items in list: {list_length}")
|
||||
if contents:
|
||||
logger.info(f"Showing most recent {len(contents)} item(s):")
|
||||
for i, item in enumerate(contents):
|
||||
# The index here is just for display, 0 is the most recent
|
||||
logger.info(f" [recent_{i}]: {item}")
|
||||
if list_length > len(contents):
|
||||
logger.info(f" ... ({list_length - len(contents)} older items not shown)")
|
||||
@ -226,26 +235,25 @@ def list_contents_callable(**context):
|
||||
|
||||
elif key_type == 'hash':
|
||||
hash_size = redis_client.hlen(queue_to_list)
|
||||
# HGETALL can be risky for large hashes. Consider HSCAN for production.
|
||||
# For manual inspection, HGETALL is often acceptable.
|
||||
if hash_size > max_items * 2: # Heuristic: avoid huge HGETALL
|
||||
if hash_size > max_items * 2:
|
||||
logger.warning(f"Hash '{queue_to_list}' has {hash_size} fields, which is large. Listing might be slow or incomplete. Consider using redis-cli HSCAN.")
|
||||
# hgetall returns dict of bytes keys and bytes values, decode them
|
||||
contents_bytes = redis_client.hgetall(queue_to_list)
|
||||
contents = {k.decode('utf-8'): v.decode('utf-8') for k, v in contents_bytes.items()}
|
||||
logger.info(f"--- Contents of Redis Hash '{queue_to_list}' ({len(contents)} fields) ---")
|
||||
logger.info(f"--- Contents of Redis Hash '{queue_to_list}' ---")
|
||||
logger.info(f"Total fields in hash: {hash_size}")
|
||||
if contents:
|
||||
logger.info(f"Showing up to {max_items} item(s):")
|
||||
item_count = 0
|
||||
for key, value in contents.items(): # key and value are now strings
|
||||
for key, value in contents.items():
|
||||
if item_count >= max_items:
|
||||
logger.info(f" ... (stopped listing after {max_items} items of {hash_size})")
|
||||
break
|
||||
# Attempt to pretty-print if value is JSON
|
||||
try:
|
||||
parsed_value = json.loads(value)
|
||||
pretty_value = json.dumps(parsed_value, indent=2)
|
||||
logger.info(f" '{key}':\n{pretty_value}")
|
||||
except json.JSONDecodeError:
|
||||
logger.info(f" '{key}': {value}") # Print as string if not JSON
|
||||
logger.info(f" '{key}': {value}")
|
||||
item_count += 1
|
||||
logger.info(f"--- End of Hash Contents ---")
|
||||
|
||||
@ -256,7 +264,7 @@ def list_contents_callable(**context):
|
||||
|
||||
except Exception as e:
|
||||
logger.error(f"Failed to list contents of Redis key '{queue_to_list}': {e}", exc_info=True)
|
||||
raise AirflowException(f"Failed to list Redis key contents: {e}")
|
||||
# Continue to the next key in the list instead of failing the whole task
|
||||
|
||||
|
||||
def check_status_callable(**context):
|
||||
@ -292,6 +300,63 @@ def check_status_callable(**context):
|
||||
raise AirflowException(f"Failed to check queue status: {e}")
|
||||
|
||||
|
||||
def requeue_failed_callable(**context):
|
||||
"""
|
||||
Copies all URLs from the fail hash to the inbox list and optionally clears the fail hash.
|
||||
"""
|
||||
params = context['params']
|
||||
redis_conn_id = params['redis_conn_id']
|
||||
queue_name = params['queue_name_for_requeue']
|
||||
clear_fail_queue = params['clear_fail_queue_after_requeue']
|
||||
|
||||
fail_queue_name = f"{queue_name}_fail"
|
||||
inbox_queue_name = f"{queue_name}_inbox"
|
||||
|
||||
logger.info(f"Requeuing failed URLs from '{fail_queue_name}' to '{inbox_queue_name}'.")
|
||||
print(f"Requeuing failed URLs from '{fail_queue_name}' to '{inbox_queue_name}'.")
|
||||
|
||||
redis_client = _get_redis_client(redis_conn_id)
|
||||
|
||||
try:
|
||||
# The fail queue is a hash. The keys are the URLs.
|
||||
failed_urls_bytes = redis_client.hkeys(fail_queue_name)
|
||||
if not failed_urls_bytes:
|
||||
logger.info(f"Fail queue '{fail_queue_name}' is empty. Nothing to requeue.")
|
||||
print(f"Fail queue '{fail_queue_name}' is empty. Nothing to requeue.")
|
||||
return
|
||||
|
||||
failed_urls = [url.decode('utf-8') for url in failed_urls_bytes]
|
||||
logger.info(f"Found {len(failed_urls)} URLs to requeue.")
|
||||
print(f"Found {len(failed_urls)} URLs to requeue:")
|
||||
for url in failed_urls:
|
||||
print(f" - {url}")
|
||||
|
||||
# Add URLs to the inbox list
|
||||
if failed_urls:
|
||||
with redis_client.pipeline() as pipe:
|
||||
pipe.rpush(inbox_queue_name, *failed_urls)
|
||||
if clear_fail_queue:
|
||||
pipe.delete(fail_queue_name)
|
||||
pipe.execute()
|
||||
|
||||
final_list_length = redis_client.llen(inbox_queue_name)
|
||||
success_message = (
|
||||
f"Successfully requeued {len(failed_urls)} URLs to '{inbox_queue_name}'. "
|
||||
f"The list now contains {final_list_length} items."
|
||||
)
|
||||
logger.info(success_message)
|
||||
print(f"\n{success_message}")
|
||||
|
||||
if clear_fail_queue:
|
||||
logger.info(f"Successfully cleared fail queue '{fail_queue_name}'.")
|
||||
else:
|
||||
logger.info(f"Fail queue '{fail_queue_name}' was not cleared as per configuration.")
|
||||
|
||||
except Exception as e:
|
||||
logger.error(f"Failed to requeue failed URLs: {e}", exc_info=True)
|
||||
raise AirflowException(f"Failed to requeue failed URLs: {e}")
|
||||
|
||||
|
||||
def add_videos_to_queue_callable(**context):
|
||||
"""
|
||||
Parses video inputs, normalizes them to URLs, and adds them to a Redis queue.
|
||||
@ -381,13 +446,14 @@ with DAG(
|
||||
- `add_videos`: Add one or more YouTube videos to a queue.
|
||||
- `clear_queue`: Dump and/or delete a specific Redis key.
|
||||
- `list_contents`: View the contents of a Redis key (list or hash).
|
||||
- `check_status`: (Placeholder) Check the overall status of the queues.
|
||||
- `check_status`: Check the overall status of the queues.
|
||||
- `requeue_failed`: Copy all URLs from the `_fail` hash to the `_inbox` list and clear the `_fail` hash.
|
||||
""",
|
||||
params={
|
||||
"action": Param(
|
||||
"add_videos",
|
||||
type="string",
|
||||
enum=["add_videos", "clear_queue", "list_contents", "check_status"],
|
||||
enum=["add_videos", "clear_queue", "list_contents", "check_status", "requeue_failed"],
|
||||
title="Action",
|
||||
description="The management action to perform.",
|
||||
),
|
||||
@ -437,10 +503,10 @@ with DAG(
|
||||
),
|
||||
# --- Params for 'list_contents' ---
|
||||
"queue_to_list": Param(
|
||||
'video_queue_inbox',
|
||||
'video_queue_inbox,video_queue_fail',
|
||||
type="string",
|
||||
title="[list_contents] Queue to List",
|
||||
description="Exact name of the Redis key to list.",
|
||||
title="[list_contents] Queues to List",
|
||||
description="Comma-separated list of exact Redis key names to list.",
|
||||
),
|
||||
"max_items": Param(
|
||||
10,
|
||||
@ -455,6 +521,19 @@ with DAG(
|
||||
title="[check_status] Base Queue Name",
|
||||
description="Base name of the queues to check (e.g., 'video_queue').",
|
||||
),
|
||||
# --- Params for 'requeue_failed' ---
|
||||
"queue_name_for_requeue": Param(
|
||||
DEFAULT_QUEUE_NAME,
|
||||
type="string",
|
||||
title="[requeue_failed] Base Queue Name",
|
||||
description="Base name of the queues to requeue from (e.g., 'video_queue' will use 'video_queue_fail').",
|
||||
),
|
||||
"clear_fail_queue_after_requeue": Param(
|
||||
True,
|
||||
type="boolean",
|
||||
title="[requeue_failed] Clear Fail Queue",
|
||||
description="If True, deletes the `_fail` hash after requeueing items.",
|
||||
),
|
||||
# --- Common Params ---
|
||||
"redis_conn_id": Param(
|
||||
DEFAULT_REDIS_CONN_ID,
|
||||
@ -489,5 +568,16 @@ with DAG(
|
||||
python_callable=check_status_callable,
|
||||
)
|
||||
|
||||
# --- Placeholder Tasks ---
|
||||
branch_on_action >> [action_add_videos, action_clear_queue, action_list_contents, action_check_status]
|
||||
action_requeue_failed = PythonOperator(
|
||||
task_id="action_requeue_failed",
|
||||
python_callable=requeue_failed_callable,
|
||||
)
|
||||
|
||||
# --- Wire up tasks ---
|
||||
branch_on_action >> [
|
||||
action_add_videos,
|
||||
action_clear_queue,
|
||||
action_list_contents,
|
||||
action_check_status,
|
||||
action_requeue_failed,
|
||||
]
|
||||
|
||||
dags/ytdlp_ops_orchestrator.py (new file, 194 lines)
@ -0,0 +1,194 @@
|
||||
# -*- coding: utf-8 -*-
|
||||
# vim:fenc=utf-8
|
||||
#
|
||||
# Copyright © 2024 rl <rl@rlmbp>
|
||||
#
|
||||
# Distributed under terms of the MIT license.
|
||||
|
||||
"""
|
||||
DAG to orchestrate ytdlp_ops_worker_per_url DAG runs based on a defined policy.
|
||||
It fetches URLs from a Redis queue and launches workers in controlled bunches.
|
||||
"""
|
||||
|
||||
from airflow import DAG
|
||||
from airflow.exceptions import AirflowException, AirflowSkipException
|
||||
from airflow.operators.python import PythonOperator
|
||||
from airflow.models.param import Param
|
||||
from airflow.models.variable import Variable
|
||||
from airflow.utils.dates import days_ago
|
||||
from airflow.api.common.trigger_dag import trigger_dag
|
||||
from airflow.models.dagrun import DagRun
|
||||
from airflow.models.dag import DagModel
|
||||
from datetime import timedelta
|
||||
import logging
|
||||
import random
|
||||
import time
|
||||
|
||||
# Import utility functions
|
||||
from utils.redis_utils import _get_redis_client
|
||||
|
||||
# Import Thrift modules for proxy status check
|
||||
from pangramia.yt.tokens_ops import YTTokenOpService
|
||||
from thrift.protocol import TBinaryProtocol
|
||||
from thrift.transport import TSocket, TTransport
|
||||
|
||||
# Configure logging
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
# Default settings
|
||||
DEFAULT_QUEUE_NAME = 'video_queue'
|
||||
DEFAULT_REDIS_CONN_ID = 'redis_default'
|
||||
DEFAULT_TOTAL_WORKERS = 3
|
||||
DEFAULT_WORKERS_PER_BUNCH = 1
|
||||
DEFAULT_WORKER_DELAY_S = 5
|
||||
DEFAULT_BUNCH_DELAY_S = 20
|
||||
|
||||
DEFAULT_YT_AUTH_SERVICE_IP = Variable.get("YT_AUTH_SERVICE_IP", default_var="16.162.82.212")
|
||||
DEFAULT_YT_AUTH_SERVICE_PORT = Variable.get("YT_AUTH_SERVICE_PORT", default_var=9080)
|
||||
|
||||
# --- Helper Functions ---
|
||||
|
||||
|
||||
# --- Main Orchestration Callable ---
|
||||
|
||||
def orchestrate_workers_ignition_callable(**context):
|
||||
"""
|
||||
Main orchestration logic. Triggers a specified number of worker DAGs
|
||||
to initiate self-sustaining processing loops.
|
||||
"""
|
||||
params = context['params']
|
||||
logger.info("Starting worker ignition sequence.")
|
||||
|
||||
worker_dag_id = 'ytdlp_ops_worker_per_url'
|
||||
dag_model = DagModel.get_dagmodel(worker_dag_id)
|
||||
if dag_model and dag_model.is_paused:
|
||||
raise AirflowException(f"Worker DAG '{worker_dag_id}' is paused. Cannot start worker loops.")
|
||||
|
||||
total_workers = int(params['total_workers'])
|
||||
workers_per_bunch = int(params['workers_per_bunch'])
|
||||
worker_delay = int(params['delay_between_workers_s'])
|
||||
bunch_delay = int(params['delay_between_bunches_s'])
|
||||
|
||||
# Create a list of worker numbers to trigger
|
||||
worker_indices = list(range(total_workers))
|
||||
bunches = [worker_indices[i:i + workers_per_bunch] for i in range(0, len(worker_indices), workers_per_bunch)]
|
||||
|
||||
logger.info(f"Plan: Starting {total_workers} total workers in {len(bunches)} bunches.")
|
||||
|
||||
dag_run_id = context['dag_run'].run_id
|
||||
total_triggered = 0
|
||||
|
||||
# Pass all orchestrator params to the worker so it has the full context for its loop.
|
||||
conf_to_pass = {p: params[p] for p in params}
|
||||
# The worker pulls its own URL, so we don't pass one.
|
||||
if 'url' in conf_to_pass:
|
||||
del conf_to_pass['url']
|
||||
|
||||
for i, bunch in enumerate(bunches):
|
||||
logger.info(f"--- Igniting Bunch {i+1}/{len(bunches)} (contains {len(bunch)} worker(s)) ---")
|
||||
for j, _ in enumerate(bunch):
|
||||
# Create a unique run_id for each worker loop starter
|
||||
run_id = f"ignited_{dag_run_id}_{total_triggered}"
|
||||
|
||||
logger.info(f"Igniting worker {j+1}/{len(bunch)} in bunch {i+1} (loop {total_triggered + 1}/{total_workers}) (Run ID: {run_id})")
|
||||
logger.debug(f"Full conf for worker loop {run_id}: {conf_to_pass}")
|
||||
|
||||
trigger_dag(
|
||||
dag_id=worker_dag_id,
|
||||
run_id=run_id,
|
||||
conf=conf_to_pass,
|
||||
replace_microseconds=False
|
||||
)
|
||||
total_triggered += 1
|
||||
|
||||
# Delay between workers in a bunch
|
||||
if j < len(bunch) - 1:
|
||||
logger.info(f"Waiting {worker_delay}s before next worker in bunch...")
|
||||
time.sleep(worker_delay)
|
||||
|
||||
# Delay between bunches
|
||||
if i < len(bunches) - 1:
|
||||
logger.info(f"--- Bunch {i+1} ignited. Waiting {bunch_delay}s before next bunch... ---")
|
||||
time.sleep(bunch_delay)
|
||||
|
||||
logger.info(f"--- Ignition sequence complete. Total worker loops started: {total_triggered}. ---")
|
||||
|
||||
|
||||
|
||||
|
||||
# =============================================================================
|
||||
# DAG Definition
|
||||
# =============================================================================
|
||||
|
default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=1),
    'start_date': days_ago(1),
}

with DAG(
    dag_id='ytdlp_ops_orchestrator',
    default_args=default_args,
    schedule_interval=None,  # This DAG runs only when triggered.
    max_active_runs=1,  # Only one ignition process should run at a time.
    catchup=False,
    description='Ignition system for ytdlp_ops_worker_per_url DAGs. Starts self-sustaining worker loops.',
    doc_md="""
### YT-DLP Worker Ignition System

This DAG acts as an "ignition system" to start one or more self-sustaining worker loops.
It does **not** process URLs itself. Its only job is to trigger a specified number of `ytdlp_ops_worker_per_url` DAGs.

#### How it Works:

1. **Manual Trigger:** You manually trigger this DAG with parameters defining how many worker loops to start (`total_workers`) and in what configuration (`workers_per_bunch`, delays).
2. **Ignition:** The orchestrator triggers the initial set of worker DAGs in a "fire-and-forget" manner, passing all its configuration parameters to them.
3. **Completion:** Once all initial workers have been triggered, the orchestrator's job is complete.

The workers then take over, each running its own continuous processing loop.
""",
    tags=['ytdlp', 'orchestrator', 'ignition'],
    params={
        # --- Ignition Control Parameters ---
        'total_workers': Param(DEFAULT_TOTAL_WORKERS, type="integer", description="Total number of worker loops to start."),
        'workers_per_bunch': Param(DEFAULT_WORKERS_PER_BUNCH, type="integer", description="Number of workers to start in each bunch."),
        'delay_between_workers_s': Param(DEFAULT_WORKER_DELAY_S, type="integer", description="Delay in seconds between starting each worker within a bunch."),
        'delay_between_bunches_s': Param(DEFAULT_BUNCH_DELAY_S, type="integer", description="Delay in seconds between starting each bunch."),

        # --- Worker Passthrough Parameters ---
        'on_bannable_failure': Param(
            'retry_with_new_account',
            type="string",
            enum=['stop_loop', 'retry_with_new_account'],
            title="[Worker Param] On Bannable Failure Policy",
            description="Policy for a worker when a bannable error occurs. "
                        "'stop_loop': Ban the account, mark the URL as failed, and stop the worker's loop. "
                        "'retry_with_new_account': Ban the failed account and retry ONCE with a new account. If the retry fails, ban the second account and proxy, then stop."
        ),
        'queue_name': Param(DEFAULT_QUEUE_NAME, type="string", description="[Worker Param] Base name for Redis queues."),
        'redis_conn_id': Param(DEFAULT_REDIS_CONN_ID, type="string", description="[Worker Param] Airflow Redis connection ID."),
        'clients': Param('mweb,ios,android', type="string", description="[Worker Param] Comma-separated list of clients for token generation."),
        'account_pool': Param('ytdlp_account', type="string", description="[Worker Param] Account pool prefix or comma-separated list."),
        'account_pool_size': Param(10, type=["integer", "null"], description="[Worker Param] If using a prefix for 'account_pool', this specifies the number of accounts to generate (e.g., 10 for 'prefix_01' through 'prefix_10'). Required when using a prefix."),
        'service_ip': Param(DEFAULT_YT_AUTH_SERVICE_IP, type="string", description="[Worker Param] IP of the ytdlp-ops-server. Default comes from the Airflow variable YT_AUTH_SERVICE_IP or a hardcoded fallback."),
        'service_port': Param(DEFAULT_YT_AUTH_SERVICE_PORT, type="integer", description="[Worker Param] Port of the Envoy load balancer. Default comes from the Airflow variable YT_AUTH_SERVICE_PORT or a hardcoded fallback."),
        'machine_id': Param("ytdlp-ops-airflow-service", type="string", description="[Worker Param] Identifier for the client machine."),
        'auto_create_new_accounts_on_exhaustion': Param(True, type="boolean", description="[Worker Param] If True and all accounts in a prefix-based pool are exhausted, create a new one automatically."),
        'retrigger_delay_on_empty_s': Param(60, type="integer", description="[Worker Param] Delay in seconds before a worker re-triggers itself if the queue is empty. Set to -1 to stop the loop."),
    }
) as dag:

    orchestrate_task = PythonOperator(
        task_id='start_worker_loops',
        python_callable=orchestrate_workers_ignition_callable,
    )
    orchestrate_task.doc_md = """
### Start Worker Loops
This is the main task that executes the ignition policy.
- It triggers `ytdlp_ops_worker_per_url` DAGs according to the batch settings.
- It passes all its parameters down to the workers, which will use them to run their continuous loops.
"""
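The task above delegates the actual triggering to `orchestrate_workers_ignition_callable`, which is defined earlier in this file and not shown in this hunk. The sketch below only illustrates how such a callable could implement the bunch/delay policy described in the `doc_md`; the function name, the run_id scheme, and the use of `airflow.api.common.trigger_dag` are assumptions for illustration, not the project's actual implementation.

```python
# Illustrative sketch only -- NOT the project's orchestrate_workers_ignition_callable.
# It shows batched, fire-and-forget triggering of the worker DAG with the configured delays.
import time
import uuid

from airflow.api.common.trigger_dag import trigger_dag


def orchestrate_workers_ignition_sketch(**context):
    params = context['params']
    total = int(params['total_workers'])
    per_bunch = int(params['workers_per_bunch'])
    worker_delay = int(params['delay_between_workers_s'])
    bunch_delay = int(params['delay_between_bunches_s'])

    # Every worker receives the full parameter set so it can keep re-triggering itself.
    worker_conf = dict(params)

    for i in range(total):
        run_id = f"ignition__{uuid.uuid4().hex[:8]}__{i}"  # illustrative run_id scheme
        trigger_dag(dag_id='ytdlp_ops_worker_per_url', run_id=run_id, conf=worker_conf)

        if i < total - 1:
            # Short pause between workers inside a bunch, longer pause between bunches.
            if (i + 1) % per_bunch == 0:
                time.sleep(bunch_delay)
            else:
                time.sleep(worker_delay)
```

Fire-and-forget triggering keeps the orchestrator short-lived: once the loops are started, each worker re-triggers itself and the orchestrator run can complete.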
@ -1,215 +0,0 @@
# -*- coding: utf-8 -*-
# vim:fenc=utf-8
#
# Copyright © 2024 rl <rl@rlmbp>
#
# Distributed under terms of the MIT license.

"""
DAG to sense a Redis queue for new URLs and trigger the ytdlp_ops_worker_per_url DAG.
This is the "Sensor" part of a Sensor/Worker pattern.
"""

from airflow import DAG
from airflow.exceptions import AirflowException, AirflowSkipException
from airflow.operators.python import PythonOperator
from airflow.operators.trigger_dagrun import TriggerDagRunOperator
from airflow.providers.redis.hooks.redis import RedisHook
from airflow.models.param import Param
from airflow.utils.dates import days_ago
from datetime import timedelta
import logging
import redis

# Import utility functions
from utils.redis_utils import _get_redis_client

# Configure logging
logger = logging.getLogger(__name__)

# Default settings
DEFAULT_QUEUE_NAME = 'video_queue'
DEFAULT_REDIS_CONN_ID = 'redis_default'
DEFAULT_TIMEOUT = 30
DEFAULT_MAX_URLS = '1'  # Default number of URLs to process per run

# --- Task Callables ---


def select_account_callable(**context):
    """
    Placeholder task for future logic to dynamically select an account.
    For now, it just passes through the account_id from the DAG params.
    """
    params = context['params']
    account_id = params.get('account_id', 'default_account')
    logger.info(f"Selected account for this run: {account_id}")
    # This task could push the selected account_id to XComs in the future.
    # For now, the next task will just read it from params.
    return account_id


def log_trigger_info_callable(**context):
    """Logs information about how the DAG run was triggered."""
    dag_run = context['dag_run']
    trigger_type = dag_run.run_type
    logger.info(f"Sensor DAG triggered. Run ID: {dag_run.run_id}, Type: {trigger_type}")

    if trigger_type == 'manual':
        logger.info("Trigger source: Manual execution from Airflow UI or CLI.")
    elif trigger_type == 'scheduled':
        logger.info("Trigger source: Scheduled run (periodic check).")
    elif trigger_type == 'dag_run':
        # In Airflow 2.2+ we can get the triggering run object
        try:
            triggering_dag_run = dag_run.get_triggering_dagrun()
            if triggering_dag_run:
                triggering_dag_id = triggering_dag_run.dag_id
                triggering_run_id = triggering_dag_run.run_id
                logger.info(f"Trigger source: DAG Run from '{triggering_dag_id}' (Run ID: {triggering_run_id}).")
                # Check if it's a worker by looking at the conf keys
                conf = dag_run.conf or {}
                if all(k in conf for k in ['queue_name', 'redis_conn_id', 'max_urls_per_run']):
                    logger.info("This appears to be a standard trigger from a worker DAG continuing the loop.")
                else:
                    logger.warning(f"Triggered by another DAG but conf does not match worker pattern. Conf: {conf}")
            else:
                logger.warning("Trigger type is 'dag_run' but could not retrieve triggering DAG run details.")
        except Exception as e:
            logger.error(f"Could not get triggering DAG run details: {e}")
    else:
        logger.info(f"Trigger source: {trigger_type}")


def check_queue_for_urls_batch(**context):
    """
    Pops a batch of URLs from the inbox queue.
    Returns a list of configuration dictionaries for the TriggerDagRunOperator.
    If the queue is empty, it raises AirflowSkipException.
    """
    params = context['params']
    queue_name = params['queue_name']
    inbox_queue = f"{queue_name}_inbox"
    redis_conn_id = params.get('redis_conn_id', DEFAULT_REDIS_CONN_ID)
    max_urls_raw = params.get('max_urls_per_run', DEFAULT_MAX_URLS)
    try:
        max_urls = int(max_urls_raw)
    except (ValueError, TypeError):
        logger.warning(f"Invalid value for max_urls_per_run: '{max_urls_raw}'. Using default: {DEFAULT_MAX_URLS}")
        max_urls = int(DEFAULT_MAX_URLS)  # DEFAULT_MAX_URLS is a string, so cast it before use

    urls_to_process = []
    try:
        client = _get_redis_client(redis_conn_id)
        current_queue_size = client.llen(inbox_queue)
        logger.info(f"Queue '{inbox_queue}' has {current_queue_size} URLs. Attempting to pop up to {max_urls}.")

        for _ in range(max_urls):
            url_bytes = client.lpop(inbox_queue)
            if url_bytes:
                url = url_bytes.decode('utf-8') if isinstance(url_bytes, bytes) else url_bytes
                logger.info(f" - Popped URL: {url}")
                urls_to_process.append(url)
            else:
                # Queue is empty, stop trying to pop
                break

        if urls_to_process:
            logger.info(f"Found {len(urls_to_process)} URLs in queue. Generating trigger configurations.")
            # Create a list of 'conf' objects for the trigger operator to expand
            trigger_configs = []
            for url in urls_to_process:
                # The worker DAG will use its own default params for its operations.
                # We only need to provide the URL for processing, and the sensor's own
                # params so the worker can trigger the sensor again to continue the loop.
                worker_conf = {
                    'url': url,
                    'queue_name': queue_name,
                    'redis_conn_id': redis_conn_id,
                    'max_urls_per_run': int(max_urls),
                    'stop_on_failure': params.get('stop_on_failure', True),
                    'account_id': params.get('account_id', 'default_account')
                }
                trigger_configs.append(worker_conf)
            return trigger_configs
        else:
            logger.info(f"Queue '{inbox_queue}' is empty. Skipping trigger.")
            raise AirflowSkipException(f"Redis queue '{inbox_queue}' is empty.")
    except AirflowSkipException:
        raise
    except Exception as e:
        logger.error(f"Error popping URLs from Redis queue '{inbox_queue}': {e}", exc_info=True)
        raise AirflowException(f"Failed to pop URLs from Redis: {e}")


# =============================================================================
# DAG Definition
# =============================================================================

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 0,  # The sensor itself should not retry on failure; it will simply be triggered again to continue the loop
    'start_date': days_ago(1),
}

with DAG(
    dag_id='ytdlp_ops_sensor_queue',
    default_args=default_args,
    schedule_interval=None,  # Runs only on trigger, not on a schedule.
    max_active_runs=1,  # Prevent multiple sensors from running at once
    catchup=False,
    description='Polls Redis queue on trigger for URLs and starts worker DAGs.',
    tags=['ytdlp', 'sensor', 'queue', 'redis', 'batch'],
    params={
        'queue_name': Param(DEFAULT_QUEUE_NAME, type="string", description="Base name for Redis queues."),
        'redis_conn_id': Param(DEFAULT_REDIS_CONN_ID, type="string", description="Airflow Redis connection ID."),
        'max_urls_per_run': Param(DEFAULT_MAX_URLS, type="string", description="Maximum number of URLs to process in one batch."),
        'stop_on_failure': Param(True, type="boolean", description="If True, a worker failure will stop the entire processing loop."),
        'account_id': Param('default_account', type="string", description="The account ID to use for processing the batch."),
    }
) as dag:

    log_trigger_info_task = PythonOperator(
        task_id='log_trigger_info',
        python_callable=log_trigger_info_callable,
    )
    log_trigger_info_task.doc_md = """
### Log Trigger Information
Logs details about how this DAG run was initiated (e.g., manually or by a worker DAG).
This provides visibility into the processing loop.
"""

    poll_redis_task = PythonOperator(
        task_id='check_queue_for_urls_batch',
        python_callable=check_queue_for_urls_batch,
    )
    poll_redis_task.doc_md = """
### Poll Redis Queue for Batch
Checks the Redis inbox queue for a batch of new URLs (up to `max_urls_per_run`).
- **On Success (URLs found):** Returns a list of configuration objects for the trigger task.
- **On Skip (Queue empty):** Skips this task and the trigger task. The DAG run succeeds.
"""

    # This operator will be dynamically expanded based on the output of poll_redis_task
    trigger_worker_dags = TriggerDagRunOperator.partial(
        task_id='trigger_worker_dags',
        trigger_dag_id='ytdlp_ops_worker_per_url',
        wait_for_completion=False,  # Fire and forget
        doc_md="""
### Trigger Worker DAGs (Dynamically Mapped)
Triggers one `ytdlp_ops_worker_per_url` DAG run for each URL found by the polling task.
Each triggered DAG receives its own specific configuration (including the URL).
This task is skipped if the polling task finds no URLs.
"""
    ).expand(
        conf=poll_redis_task.output
    )

    select_account_task = PythonOperator(
        task_id='select_account',
        python_callable=select_account_callable,
    )
    select_account_task.doc_md = "### Select Account\n(Placeholder for future dynamic account selection logic)"

    log_trigger_info_task >> select_account_task >> poll_redis_task >> trigger_worker_dags
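Both this (now removed) sensor and the new self-sustaining workers consume URLs from the `<queue_name>_inbox` Redis list, so something upstream has to push URLs into it. Below is a minimal sketch of that producer side, assuming direct access to the same Redis instance; the host, port, and password are placeholders, not the project's actual settings.

```python
# Minimal producer-side sketch: push URLs onto the "<queue_name>_inbox" list that the
# consumer pops from with LPOP. Connection details here are illustrative placeholders.
import redis


def enqueue_urls(urls, queue_name='video_queue', host='localhost', port=6379, password=None):
    client = redis.Redis(host=host, port=port, password=password)
    inbox = f"{queue_name}_inbox"
    # RPUSH paired with the consumer's LPOP gives FIFO ordering: oldest URL is processed first.
    if urls:
        client.rpush(inbox, *urls)
    return client.llen(inbox)


# Example:
# enqueue_urls(["https://www.youtube.com/watch?v=dQw4w9WgXcQ"])
```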
File diff suppressed because it is too large
@ -1,4 +1,42 @@
services:
  config-generator:
    image: python:3.9-slim
    container_name: ytdlp-ops-config-generator
    working_dir: /app
    volumes:
      # Mount the current directory to access the template, .env, and script
      - .:/app
    env_file:
      - ./.env
    environment:
      ENVOY_CLUSTER_TYPE: STRICT_DNS
      # Pass worker count and base port to ensure Envoy config matches the workers
      YTDLP_WORKERS: ${YTDLP_WORKERS:-3}
      YTDLP_BASE_PORT: ${YTDLP_BASE_PORT:-9090}
    # This command cleans up old runs, installs jinja2, and generates the config.
    command: >
      sh -c "rm -rf ./envoy.yaml &&
             pip install --no-cache-dir -q jinja2 &&
             python3 ./generate_envoy_config.py"

  envoy:
    image: envoyproxy/envoy:v1.29-latest
    container_name: envoy-thrift-lb
    restart: unless-stopped
    volumes:
      # Mount the generated config file from the host
      - ./envoy.yaml:/etc/envoy/envoy.yaml:ro
    ports:
      # This is the single public port for all Thrift traffic
      - "${ENVOY_PORT:-9080}:${ENVOY_PORT:-9080}"
    networks:
      - airflow_prod_proxynet
    depends_on:
      config-generator:
        condition: service_completed_successfully
      ytdlp-ops:
        condition: service_started

  camoufox:
    build:
      context: ./camoufox # Path relative to the docker-compose file
@ -15,9 +53,8 @@ services:
        "--ws-host", "0.0.0.0",
        "--port", "12345",
        "--ws-path", "mypath",
        "--proxy-url", "socks5://sslocal-rust-1084:1084",
        "--proxy-url", "socks5://${SOCKS5_SOCK_SERVER_IP:-89.253.221.173}:1084",
        "--locale", "en-US",
        "--geoip",
        "--extensions", "/app/extensions/google_sign_in_popup_blocker-1.0.2.xpi,/app/extensions/spoof_timezone-0.3.4.xpi,/app/extensions/youtube_ad_auto_skipper-0.6.0.xpi"
      ]
    restart: unless-stopped
@ -25,25 +62,36 @@ services:

  ytdlp-ops:
    image: pangramia/ytdlp-ops-server:latest # Don't comment out or remove, build is performed externally
    container_name: ytdlp-ops-workers # Renamed for clarity
    depends_on:
      - camoufox # Ensure camoufox starts first
    ports:
      - "9090:9090" # Main RPC port
      - "9091:9091" # Health check port
    # Ports are no longer exposed directly. Envoy will connect to them on the internal network.
    env_file:
      - ./.env # Path is relative to the compose file
    volumes:
      - context-data:/app/context-data
      # Mount the plugin source code for live updates without rebuilding the image.
      # Assumes the plugin source is in a 'bgutil-ytdlp-pot-provider' directory
      # next to your docker-compose.yaml file.
      #- ./bgutil-ytdlp-pot-provider:/app/bgutil-ytdlp-pot-provider
    networks:
      - airflow_prod_proxynet
    command:
      - "--script-dir"
      - "/app"
      - "--context-dir"
      - "/app/context-data"
      # Use environment variables for port and worker count
      - "--port"
      - "9090"
      - "${YTDLP_BASE_PORT:-9090}"
      - "--workers"
      - "${YTDLP_WORKERS:-3}"
      - "--clients"
      # Add 'web' client since we now have camoufox; test first
      - "web,ios,android,mweb"
      - "--proxies"
      - "socks5://sslocal-rust-1081:1081,socks5://sslocal-rust-1082:1082,socks5://sslocal-rust-1083:1083,socks5://sslocal-rust-1084:1084,socks5://sslocal-rust-1085:1085"
      #- "socks5://sslocal-rust-1081:1081,socks5://sslocal-rust-1082:1082,socks5://sslocal-rust-1083:1083,socks5://sslocal-rust-1084:1084,socks5://sslocal-rust-1085:1085"
      - "socks5://${SOCKS5_SOCK_SERVER_IP:-89.253.221.173}:1084"
      #
      # Add the endpoint argument pointing to the camoufox service
      - "--endpoint"
      - "ws://camoufox:12345/mypath"
@ -61,6 +109,13 @@ services:
      - "${REDIS_PORT:-6379}"
      - "--redis-password"
      - "${REDIS_PASSWORD}"
      # Add account cooldown parameters (values are in minutes)
      - "--account-active-duration-min"
      - "${ACCOUNT_ACTIVE_DURATION_MIN:-30}"
      - "--account-cooldown-duration-min"
      - "${ACCOUNT_COOLDOWN_DURATION_MIN:-60}"
      # Add flag to clean the context directory on start
      - "--clean-context-dir"
    restart: unless-stopped
    pull_policy: always

@ -69,5 +124,4 @@ volumes:
    name: context-data

networks:
  airflow_prod_proxynet:
    external: true
  airflow_prod_proxynet: {}
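The `config-generator` service above only installs Jinja2 and runs `generate_envoy_config.py`; the script itself is not part of this diff. A minimal sketch of what such a generator could look like, assuming a template file named `envoy.yaml.j2` next to the compose file and the environment variables passed in above (the template name and the variable names used inside the template are assumptions):

```python
# Hypothetical sketch of generate_envoy_config.py: render envoy.yaml from a Jinja2
# template using the same environment variables the compose file passes in.
# The template filename (envoy.yaml.j2) and template variable names are assumptions.
import os

from jinja2 import Environment, FileSystemLoader


def main():
    env = Environment(loader=FileSystemLoader('.'))
    template = env.get_template('envoy.yaml.j2')

    rendered = template.render(
        cluster_type=os.environ.get('ENVOY_CLUSTER_TYPE', 'STRICT_DNS'),
        envoy_port=int(os.environ.get('ENVOY_PORT', '9080')),
        base_port=int(os.environ.get('YTDLP_BASE_PORT', '9090')),
        workers=int(os.environ.get('YTDLP_WORKERS', '3')),
    )

    with open('envoy.yaml', 'w') as f:
        f.write(rendered)


if __name__ == '__main__':
    main()
```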
requirements.txt (new file)
@ -0,0 +1,9 @@
thrift>=0.16.0,<=0.20.0
backoff>=2.2.1
python-dotenv==1.0.1
psutil>=5.9.0
docker>=6.0.0
apache-airflow-providers-docker
redis
ffprobe3
ffmpeg-python