Provide updates on ytdlp DAGs

This commit is contained in:
aperez 2025-08-06 18:02:44 +03:00
parent 61906a57ef
commit 274bef5370
9 changed files with 1617 additions and 862 deletions

View File

@@ -1,46 +1,78 @@
# Architecture and Description of the YTDLP Airflow DAGs

This document describes the architecture and purpose of the DAGs used to download videos from YouTube. The system is built on a continuous, self-sustaining loop model for parallel, fault-tolerant processing.

## Main processing loop

Processing is performed by two main DAGs that work as a pair: an orchestrator and a worker.

### `ytdlp_ops_orchestrator` (the "ignition system")

- **Purpose:** This DAG acts as the "ignition system" that starts processing. It is triggered manually to start a specified number of parallel worker loops.
- **How it works:**
  - It does **not** process URLs itself.
  - Its only job is to trigger the configured number of `ytdlp_ops_worker_per_url` DAG runs.
  - It passes all required configuration (account pool, Redis connection, etc.) to the workers.

### `ytdlp_ops_worker_per_url` (self-sustaining worker)

- **Purpose:** This DAG processes a single URL and is designed to run in a continuous loop.
- **How it works:**
  1. **Start:** The initial run is triggered by `ytdlp_ops_orchestrator`.
  2. **Getting a task:** The worker pops one URL from the `_inbox` queue in Redis. If the queue is empty, the worker run finishes and its processing "lane" stops.
  3. **Processing:** It calls the `ytdlp-ops-server` service to obtain `info.json` and a proxy, then downloads the video.
  4. **Continue or stop:**
     - **On success:** It triggers a new instance of itself, creating a continuous loop that processes the next URL (a minimal sketch of this self-trigger step follows this section).
     - **On failure:** The loop is interrupted (when `stop_on_failure` is `True`), stopping this processing lane. This keeps one problematic URL or account from halting the entire system.
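A minimal sketch of the self-trigger step mentioned above, assuming the worker re-triggers its own DAG with Airflow's `trigger_dag` helper the same way the orchestrator in this commit does; the function name is illustrative, not the DAG's actual task code:

```python
# Illustrative only: how a worker's final task could re-trigger its own DAG
# to continue the loop. Mirrors the trigger_dag() usage in the orchestrator
# below; names here are assumptions, not the real task implementation.
from datetime import datetime, timezone

from airflow.api.common.trigger_dag import trigger_dag


def trigger_next_worker_run(**context):
    """Start the next iteration of this worker's processing lane."""
    conf = dict(context["dag_run"].conf or {})  # carry the loop configuration forward
    run_id = f"self_triggered_{datetime.now(timezone.utc).isoformat()}"
    trigger_dag(
        dag_id="ytdlp_ops_worker_per_url",
        run_id=run_id,
        conf=conf,
        replace_microseconds=False,
    )
```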
## Management DAGs

### `ytdlp_mgmt_proxy_account`

- **Purpose:** The main tool for monitoring and managing the state of the resources used by `ytdlp-ops-server`.
- **Functionality:**
  - **View statuses:** Shows the current status of every proxy and account (e.g. `ACTIVE`, `BANNED`, `RESTING`).
  - **Proxy management:** Manually ban, unban, or reset the status of proxies.
  - **Account management:** Manually ban or unban accounts.

## Resource management strategy (proxies and accounts)

The system uses an intelligent strategy for managing the lifecycle and state of accounts and proxies in order to maximize the success rate and minimize bans.

- **Account lifecycle ("cooldown"):**
  - To prevent "burnout", accounts automatically move to a `RESTING` state after a period of intensive use.
  - Once the rest period ends, they automatically return to `ACTIVE` and become available to workers again.
- **Smart ban strategy:**
  - **Ban the account first:** On a serious error (e.g. `BOT_DETECTED`) the system penalizes **only the account** that caused the failure; the proxy keeps working.
  - **Sliding-window proxy bans:** A proxy is banned automatically only when it shows **systematic failures across DIFFERENT accounts** within a short time window, which is a reliable indicator that the proxy itself is the problem (an illustrative sketch follows this section).
- **Monitoring:**
  - The `ytdlp_mgmt_proxy_account` DAG is the primary monitoring tool. It shows the current status of all resources, including the time remaining until banned or resting accounts become active again.
  - The execution graph of `ytdlp_ops_worker_per_url` now explicitly shows steps such as `assign_account`, `get_token`, `ban_account`, and `retry_get_token`, which makes debugging more transparent.
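The sliding-window check itself lives inside `ytdlp-ops-server`; the sketch below only illustrates the idea with redis-py, and the key name, window length, and threshold are invented for the example:

```python
# Illustrative sketch only -- the real sliding-window logic is server-side;
# key names and thresholds here are hypothetical.
import time

import redis


def record_failure_and_check_proxy_ban(
    r: redis.Redis,
    proxy_url: str,
    account_id: str,
    window_s: int = 600,         # sliding window length (assumption)
    distinct_accounts: int = 3,  # ban threshold (assumption)
) -> bool:
    """Record a failure and return True if the proxy should be banned."""
    key = f"proxy_failures:{proxy_url}"          # hypothetical key
    now = time.time()
    r.zadd(key, {account_id: now})               # one entry per failing account
    r.zremrangebyscore(key, 0, now - window_s)   # drop entries outside the window
    r.expire(key, window_s)
    # Ban only when several DIFFERENT accounts failed on this proxy recently.
    return r.zcard(key) >= distinct_accounts
```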
## External services

### `ytdlp-ops-server` (Thrift service)

- **Purpose:** An external service that provides the authentication data (tokens, cookies, proxy) required to download videos.
- **Interaction:** The worker DAG (`ytdlp_ops_worker_per_url`) calls this service before starting a download to obtain the data `yt-dlp` needs.

## Worker DAG logic (`ytdlp_ops_worker_per_url`)

This DAG is the workhorse of the system. It is designed as a self-sustaining loop that processes one URL per run.

### Tasks and their purpose:

- **`pull_url_from_redis`**: Pops one URL from the `_inbox` queue in Redis. If the queue is empty, the DAG finishes with a `skipped` status, stopping this processing lane.
- **`assign_account`**: Selects the account for the run. It reuses the account that succeeded in the previous run of its lane (account affinity); on the first run it picks a random account.
- **`get_token`**: The core task. It calls `ytdlp-ops-server` to obtain `info.json`.
- **`handle_bannable_error_branch`**: If `get_token` fails with a bannable error, this branching task decides what to do next based on the `on_bannable_failure` policy.
- **`ban_account_and_prepare_for_retry`**: If the policy allows a retry, this task bans the failed account and selects a new one for the retry.
- **`retry_get_token`**: Makes a second attempt to obtain a token with the new account.
- **`ban_second_account_and_proxy`**: If the second attempt also fails, this task bans the second account and the proxy that was used.
- **`download_and_probe`**: If `get_token` (or `retry_get_token`) succeeded, this task uses `yt-dlp` to download the media and `ffmpeg` to verify the integrity of the downloaded file (a sketch follows this list).
- **`mark_url_as_success`**: If `download_and_probe` succeeded, this task writes the result to the `_result` hash in Redis.
- **`handle_generic_failure`**: If any of the main tasks fails with an unrecoverable error, this task writes detailed error information to the `_fail` hash in Redis.
- **`decide_what_to_do_next`**: A branching task that runs after either success or failure and decides whether to continue the loop.
- **`trigger_self_run`**: The task that actually triggers the next DAG run, creating the continuous loop.
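As referenced in the `download_and_probe` item, a minimal sketch of that step, assuming `yt-dlp` and `ffprobe` are available on the worker's PATH; the paths, flags, and single-file assumption are illustrative:

```python
# Minimal sketch of the download-and-probe step; output layout and flags are
# examples, not the DAG's actual implementation.
import subprocess
from pathlib import Path


def download_and_probe(info_json_path: str, proxy: str, out_dir: str) -> Path:
    out_tmpl = str(Path(out_dir) / "%(id)s.%(ext)s")
    # Download using the pre-fetched info.json and the assigned SOCKS5 proxy.
    subprocess.run(
        ["yt-dlp", "--load-info-json", info_json_path, "--proxy", proxy, "-o", out_tmpl],
        check=True,
    )
    downloaded = next(Path(out_dir).glob("*"))  # simplification: one file expected
    # Probe the file with ffprobe; a non-zero exit code means a corrupt download.
    subprocess.run(["ffprobe", "-v", "error", str(downloaded)], check=True)
    return downloaded
```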

View File

@@ -1,197 +0,0 @@
"""
DAG to manage the state of proxies used by the ytdlp-ops-server.
"""
from __future__ import annotations
import logging
from datetime import datetime
from airflow.models.dag import DAG
from airflow.models.param import Param
from airflow.operators.python import PythonOperator
from airflow.utils.dates import days_ago
# Configure logging
logger = logging.getLogger(__name__)
# Import and apply Thrift exceptions patch for Airflow compatibility
try:
from thrift_exceptions_patch import patch_thrift_exceptions
patch_thrift_exceptions()
logger.info("Applied Thrift exceptions patch for Airflow compatibility.")
except ImportError:
logger.warning("Could not import thrift_exceptions_patch. Compatibility may be affected.")
except Exception as e:
logger.error(f"Error applying Thrift exceptions patch: {e}")
# Thrift imports
try:
from thrift.transport import TSocket, TTransport
from thrift.protocol import TBinaryProtocol
from pangramia.yt.tokens_ops import YTTokenOpService
from pangramia.yt.exceptions.ttypes import PBServiceException, PBUserException
except ImportError as e:
logger.critical(f"Could not import Thrift modules: {e}. Ensure ytdlp-ops-auth package is installed.")
# Fail DAG parsing if thrift modules are not available
raise
def format_timestamp(ts_str: str) -> str:
"""Formats a string timestamp into a human-readable date string."""
if not ts_str:
return ""
try:
ts_float = float(ts_str)
if ts_float <= 0:
return ""
# Use datetime from the imported 'from datetime import datetime'
dt_obj = datetime.fromtimestamp(ts_float)
return dt_obj.strftime('%Y-%m-%d %H:%M:%S')
except (ValueError, TypeError):
return ts_str # Return original string if conversion fails
def get_thrift_client(host: str, port: int):
"""Helper function to create and connect a Thrift client."""
transport = TSocket.TSocket(host, port)
transport = TTransport.TFramedTransport(transport)
protocol = TBinaryProtocol.TBinaryProtocol(transport)
client = YTTokenOpService.Client(protocol)
transport.open()
logger.info(f"Connected to Thrift server at {host}:{port}")
return client, transport
def manage_proxies_callable(**context):
"""Main callable to interact with the proxy management endpoints."""
params = context["params"]
action = params["action"]
host = params["host"]
port = params["port"]
server_identity = params.get("server_identity")
proxy_url = params.get("proxy_url")
if not server_identity and action in ["ban", "unban", "reset_all"]:
raise ValueError(f"A 'server_identity' is required for the '{action}' action.")
client, transport = None, None
try:
client, transport = get_thrift_client(host, port)
if action == "list":
logger.info(f"Listing proxy statuses for server: {server_identity or 'ALL'}")
statuses = client.getProxyStatus(server_identity)
if not statuses:
logger.info("No proxy statuses found.")
print("No proxy statuses found.")
else:
from tabulate import tabulate
status_list = [
{
"Server": s.serverIdentity,
"Proxy URL": s.proxyUrl,
"Status": s.status,
"Success": s.successCount,
"Failures": s.failureCount,
"Last Success": format_timestamp(s.lastSuccessTimestamp),
"Last Failure": format_timestamp(s.lastFailureTimestamp),
}
for s in statuses
]
print("\n--- Proxy Statuses ---")
print(tabulate(status_list, headers="keys", tablefmt="grid"))
print("----------------------\n")
elif action == "ban":
if not proxy_url:
raise ValueError("A 'proxy_url' is required to ban a proxy.")
logger.info(f"Banning proxy '{proxy_url}' for server '{server_identity}'...")
success = client.banProxy(proxy_url, server_identity)
if success:
logger.info("Successfully banned proxy.")
print(f"Successfully banned proxy '{proxy_url}' for server '{server_identity}'.")
else:
logger.error("Failed to ban proxy.")
raise Exception("Server returned failure for banProxy operation.")
elif action == "unban":
if not proxy_url:
raise ValueError("A 'proxy_url' is required to unban a proxy.")
logger.info(f"Unbanning proxy '{proxy_url}' for server '{server_identity}'...")
success = client.unbanProxy(proxy_url, server_identity)
if success:
logger.info("Successfully unbanned proxy.")
print(f"Successfully unbanned proxy '{proxy_url}' for server '{server_identity}'.")
else:
logger.error("Failed to unban proxy.")
raise Exception("Server returned failure for unbanProxy operation.")
elif action == "reset_all":
logger.info(f"Resetting all proxy statuses for server '{server_identity}'...")
success = client.resetAllProxyStatuses(server_identity)
if success:
logger.info("Successfully reset all proxy statuses.")
print(f"Successfully reset all proxy statuses for server '{server_identity}'.")
else:
logger.error("Failed to reset all proxy statuses.")
raise Exception("Server returned failure for resetAllProxyStatuses operation.")
else:
raise ValueError(f"Invalid action: {action}")
except (PBServiceException, PBUserException) as e:
logger.error(f"Thrift error performing action '{action}': {e.message}", exc_info=True)
raise
except Exception as e:
logger.error(f"Error performing action '{action}': {e}", exc_info=True)
raise
finally:
if transport and transport.isOpen():
transport.close()
logger.info("Thrift connection closed.")
with DAG(
dag_id="ytdlp_mgmt_proxy",
start_date=days_ago(1),
schedule=None,
catchup=False,
tags=["ytdlp", "utility", "proxy"],
doc_md="""
### YT-DLP Proxy Manager DAG
This DAG provides tools to manage the state of proxies used by the `ytdlp-ops-server`.
You can view statuses, and manually ban, unban, or reset proxies for a specific server instance.
**Parameters:**
- `host`: The hostname or IP of the `ytdlp-ops-server` Thrift service.
- `port`: The port of the Thrift service.
- `action`: The operation to perform.
- `list`: List proxy statuses. Provide a `server_identity` to query a specific server, or leave it blank to query the server instance you are connected to.
- `ban`: Ban a specific proxy. Requires `server_identity` and `proxy_url`.
- `unban`: Un-ban a specific proxy. Requires `server_identity` and `proxy_url`.
- `reset_all`: Reset all proxies for a server to `ACTIVE`. Requires `server_identity`.
- `server_identity`: The unique identifier for the server instance (e.g., `ytdlp-ops-airflow-service`).
- `proxy_url`: The full URL of the proxy to act upon (e.g., `socks5://host:port`).
""",
params={
"host": Param("89.253.221.173", type="string", description="The hostname of the ytdlp-ops-server service."),
"port": Param(9090, type="integer", description="The port of the ytdlp-ops-server service."),
"action": Param(
"list",
type="string",
enum=["list", "ban", "unban", "reset_all"],
description="The management action to perform.",
),
"server_identity": Param(
"ytdlp-ops-airflow-service",
type=["null", "string"],
description="The identity of the server to manage. Leave blank to query the connected server instance.",
),
"proxy_url": Param(
None,
type=["null", "string"],
description="The proxy URL to ban/unban (e.g., 'socks5://host:port').",
),
},
) as dag:
proxy_management_task = PythonOperator(
task_id="proxy_management_task",
python_callable=manage_proxies_callable,
)

View File

@@ -0,0 +1,405 @@
"""
DAG to manage the state of proxies and accounts used by the ytdlp-ops-server.
"""
from __future__ import annotations
import logging
from datetime import datetime
import socket
from airflow.exceptions import AirflowException
from airflow.models.dag import DAG
from airflow.models.param import Param
from airflow.operators.python import PythonOperator
from airflow.utils.dates import days_ago
from airflow.models.variable import Variable
from airflow.providers.redis.hooks.redis import RedisHook
# Configure logging
logger = logging.getLogger(__name__)
# Import and apply Thrift exceptions patch for Airflow compatibility
try:
from thrift_exceptions_patch import patch_thrift_exceptions
patch_thrift_exceptions()
logger.info("Applied Thrift exceptions patch for Airflow compatibility.")
except ImportError:
logger.warning("Could not import thrift_exceptions_patch. Compatibility may be affected.")
except Exception as e:
logger.error(f"Error applying Thrift exceptions patch: {e}")
# Thrift imports
try:
from thrift.transport import TSocket, TTransport
from thrift.protocol import TBinaryProtocol
from pangramia.yt.tokens_ops import YTTokenOpService
from pangramia.yt.exceptions.ttypes import PBServiceException, PBUserException
except ImportError as e:
logger.critical(f"Could not import Thrift modules: {e}. Ensure ytdlp-ops-auth package is installed.")
# Fail DAG parsing if thrift modules are not available
raise
DEFAULT_YT_AUTH_SERVICE_IP = Variable.get("YT_AUTH_SERVICE_IP", default_var="16.162.82.212")
DEFAULT_YT_AUTH_SERVICE_PORT = Variable.get("YT_AUTH_SERVICE_PORT", default_var=9080)
DEFAULT_REDIS_CONN_ID = "redis_default"
# Helper function to connect to Redis, similar to other DAGs
def _get_redis_client(redis_conn_id: str):
"""Gets a Redis client from an Airflow connection."""
try:
# Use the imported RedisHook
redis_hook = RedisHook(redis_conn_id=redis_conn_id)
# get_conn returns a redis.Redis client
return redis_hook.get_conn()
except Exception as e:
logger.error(f"Failed to connect to Redis using connection '{redis_conn_id}': {e}")
# Use the imported AirflowException
raise AirflowException(f"Redis connection failed: {e}")
def format_timestamp(ts_str: str) -> str:
"""Formats a string timestamp into a human-readable date string."""
if not ts_str:
return ""
try:
ts_float = float(ts_str)
if ts_float <= 0:
return ""
# Use datetime from the imported 'from datetime import datetime'
dt_obj = datetime.fromtimestamp(ts_float)
return dt_obj.strftime('%Y-%m-%d %H:%M:%S')
except (ValueError, TypeError):
return ts_str # Return original string if conversion fails
def get_thrift_client(host: str, port: int):
"""Helper function to create and connect a Thrift client."""
transport = TSocket.TSocket(host, port)
transport.setTimeout(30 * 1000) # 30s timeout
transport = TTransport.TFramedTransport(transport)
protocol = TBinaryProtocol.TBinaryProtocol(transport)
client = YTTokenOpService.Client(protocol)
transport.open()
logger.info(f"Connected to Thrift server at {host}:{port}")
return client, transport
def _list_proxy_statuses(client, server_identity):
"""Lists the status of proxies."""
logger.info(f"Listing proxy statuses for server: {server_identity or 'ALL'}")
statuses = client.getProxyStatus(server_identity)
if not statuses:
logger.info("No proxy statuses found.")
print("No proxy statuses found.")
return
from tabulate import tabulate
status_list = []
# This is forward-compatible: it checks for new attributes before using them.
has_extended_info = hasattr(statuses[0], 'recentAccounts') or hasattr(statuses[0], 'recentMachines')
headers = ["Server", "Proxy URL", "Status", "Success", "Failures", "Last Success", "Last Failure"]
if has_extended_info:
headers.extend(["Recent Accounts", "Recent Machines"])
for s in statuses:
status_item = {
"Server": s.serverIdentity,
"Proxy URL": s.proxyUrl,
"Status": s.status,
"Success": s.successCount,
"Failures": s.failureCount,
"Last Success": format_timestamp(s.lastSuccessTimestamp),
"Last Failure": format_timestamp(s.lastFailureTimestamp),
}
if has_extended_info:
recent_accounts = getattr(s, 'recentAccounts', [])
recent_machines = getattr(s, 'recentMachines', [])
status_item["Recent Accounts"] = "\n".join(recent_accounts) if recent_accounts else "N/A"
status_item["Recent Machines"] = "\n".join(recent_machines) if recent_machines else "N/A"
status_list.append(status_item)
print("\n--- Proxy Statuses ---")
# The f-string with a newline ensures the table starts on a new line in the logs.
print(f"\n{tabulate(status_list, headers='keys', tablefmt='grid')}")
print("----------------------\n")
if not has_extended_info:
logger.warning("Server does not seem to support 'recentAccounts' or 'recentMachines' fields yet.")
print("NOTE: To see Recent Accounts/Machines, the server's `getProxyStatus` method must be updated to return these fields.")
def _list_account_statuses(client, account_id):
"""Lists the status of accounts."""
logger.info(f"Listing account statuses for account: {account_id or 'ALL'}")
try:
# The thrift method takes accountId (specific) or accountPrefix.
# If account_id is provided, we use it. If not, we get all by leaving both params as None.
statuses = client.getAccountStatus(accountId=account_id, accountPrefix=None)
if not statuses:
logger.info("No account statuses found.")
print("\n--- Account Statuses ---\nNo account statuses found.\n------------------------\n")
return
from tabulate import tabulate
status_list = []
for s in statuses:
# Determine the last activity timestamp for sorting
last_success = float(s.lastSuccessTimestamp) if s.lastSuccessTimestamp else 0
last_failure = float(s.lastFailureTimestamp) if s.lastFailureTimestamp else 0
last_activity = max(last_success, last_failure)
status_item = {
"Account ID": s.accountId,
"Status": s.status,
"Success": s.successCount,
"Failures": s.failureCount,
"Last Success": format_timestamp(s.lastSuccessTimestamp),
"Last Failure": format_timestamp(s.lastFailureTimestamp),
"Last Proxy": s.lastUsedProxy or "N/A",
"Last Machine": s.lastUsedMachine or "N/A",
"_last_activity": last_activity, # Add a temporary key for sorting
}
status_list.append(status_item)
# Sort the list by the last activity timestamp in descending order
status_list.sort(key=lambda item: item.get('_last_activity', 0), reverse=True)
# Remove the temporary sort key before printing
for item in status_list:
del item['_last_activity']
print("\n--- Account Statuses ---")
# The f-string with a newline ensures the table starts on a new line in the logs.
print(f"\n{tabulate(status_list, headers='keys', tablefmt='grid')}")
print("------------------------\n")
except (PBServiceException, PBUserException) as e:
logger.error(f"Failed to get account statuses: {e.message}", exc_info=True)
print(f"\nERROR: Could not retrieve account statuses. Server returned: {e.message}\n")
except Exception as e:
logger.error(f"An unexpected error occurred while getting account statuses: {e}", exc_info=True)
print(f"\nERROR: An unexpected error occurred: {e}\n")
def manage_system_callable(**context):
"""Main callable to interact with the system management endpoints."""
params = context["params"]
entity = params["entity"]
action = params["action"]
host = params["host"]
port = params["port"]
server_identity = params.get("server_identity")
proxy_url = params.get("proxy_url")
account_id = params.get("account_id")
if action in ["ban", "unban", "reset_all"] and entity == "proxy" and not server_identity:
raise ValueError(f"A 'server_identity' is required for proxy action '{action}'.")
if action in ["ban", "unban"] and entity == "account" and not account_id:
raise ValueError(f"An 'account_id' is required for account action '{action}'.")
# Handle direct Redis action separately to avoid creating an unnecessary Thrift connection.
if entity == "account" and action == "remove_all":
confirm = params.get("confirm_remove_all_accounts", False)
if not confirm:
message = "FATAL: 'remove_all' action requires 'confirm_remove_all_accounts' to be set to True. No accounts were removed."
logger.error(message)
print(f"\nERROR: {message}\n")
raise ValueError(message)
redis_conn_id = params["redis_conn_id"]
account_prefix = params.get("account_id") # Repurpose account_id param as an optional prefix
redis_client = _get_redis_client(redis_conn_id)
pattern = f"account_status:{account_prefix}*" if account_prefix else "account_status:*"
logger.warning(f"Searching for account status keys in Redis with pattern: '{pattern}'")
# scan_iter returns bytes, so we don't need to decode for deletion
keys_to_delete = [key for key in redis_client.scan_iter(pattern)]
if not keys_to_delete:
logger.info(f"No account keys found matching pattern '{pattern}'. Nothing to do.")
print(f"\nNo accounts found matching pattern '{pattern}'.\n")
return
logger.warning(f"Found {len(keys_to_delete)} account keys to delete. This is a destructive operation!")
print(f"\nWARNING: Found {len(keys_to_delete)} accounts to remove from Redis.")
# Decode for printing
for key in keys_to_delete[:10]:
print(f" - {key.decode('utf-8')}")
if len(keys_to_delete) > 10:
print(f" ... and {len(keys_to_delete) - 10} more.")
deleted_count = redis_client.delete(*keys_to_delete)
logger.info(f"Successfully deleted {deleted_count} account keys from Redis.")
print(f"\nSuccessfully removed {deleted_count} accounts from Redis.\n")
return # End execution for this action
client, transport = None, None
try:
client, transport = get_thrift_client(host, port)
if entity == "proxy":
if action == "list":
_list_proxy_statuses(client, server_identity)
elif action == "ban":
if not proxy_url: raise ValueError("A 'proxy_url' is required.")
logger.info(f"Banning proxy '{proxy_url}' for server '{server_identity}'...")
client.banProxy(proxy_url, server_identity)
print(f"Successfully sent request to ban proxy '{proxy_url}'.")
elif action == "unban":
if not proxy_url: raise ValueError("A 'proxy_url' is required.")
logger.info(f"Unbanning proxy '{proxy_url}' for server '{server_identity}'...")
client.unbanProxy(proxy_url, server_identity)
print(f"Successfully sent request to unban proxy '{proxy_url}'.")
elif action == "reset_all":
logger.info(f"Resetting all proxy statuses for server '{server_identity}'...")
client.resetAllProxyStatuses(server_identity)
print(f"Successfully sent request to reset all proxy statuses for '{server_identity}'.")
else:
raise ValueError(f"Invalid action '{action}' for entity 'proxy'.")
elif entity == "account":
if action == "list":
_list_account_statuses(client, account_id)
elif action == "ban":
if not account_id: raise ValueError("An 'account_id' is required.")
reason = f"Manual ban from Airflow mgmt DAG by {socket.gethostname()}"
logger.info(f"Banning account '{account_id}'...")
client.banAccount(accountId=account_id, reason=reason)
print(f"Successfully sent request to ban account '{account_id}'.")
elif action == "unban":
if not account_id: raise ValueError("An 'account_id' is required.")
reason = f"Manual un-ban from Airflow mgmt DAG by {socket.gethostname()}"
logger.info(f"Unbanning account '{account_id}'...")
client.unbanAccount(accountId=account_id, reason=reason)
print(f"Successfully sent request to unban account '{account_id}'.")
elif action == "reset_all":
account_prefix = account_id # Repurpose account_id param as an optional prefix
logger.info(f"Resetting all account statuses to ACTIVE (prefix: '{account_prefix or 'ALL'}')...")
all_statuses = client.getAccountStatus(accountId=None, accountPrefix=account_prefix)
if not all_statuses:
print(f"No accounts found with prefix '{account_prefix or 'ALL'}' to reset.")
return
accounts_to_reset = [s.accountId for s in all_statuses]
logger.info(f"Found {len(accounts_to_reset)} accounts to reset.")
print(f"Found {len(accounts_to_reset)} accounts. Sending unban request for each...")
reset_count = 0
fail_count = 0
for acc_id in accounts_to_reset:
try:
reason = f"Manual reset from Airflow mgmt DAG by {socket.gethostname()}"
client.unbanAccount(accountId=acc_id, reason=reason)
logger.info(f" - Sent reset (unban) for '{acc_id}'.")
reset_count += 1
except Exception as e:
logger.error(f" - Failed to reset account '{acc_id}': {e}")
fail_count += 1
print(f"\nSuccessfully sent reset requests for {reset_count} accounts.")
if fail_count > 0:
print(f"Failed to send reset requests for {fail_count} accounts. See logs for details.")
# Optionally, list statuses again to confirm
print("\n--- Listing statuses after reset ---")
_list_account_statuses(client, account_prefix)
else:
raise ValueError(f"Invalid action '{action}' for entity 'account'.")
elif entity == "all":
if action == "list":
print("\nListing all entities...")
_list_proxy_statuses(client, server_identity)
_list_account_statuses(client, account_id)
else:
raise ValueError(f"Action '{action}' is not supported for entity 'all'. Only 'list' is supported.")
except (PBServiceException, PBUserException) as e:
logger.error(f"Thrift error performing action '{action}': {e.message}", exc_info=True)
raise
except NotImplementedError as e:
logger.error(f"Feature not implemented: {e}", exc_info=True)
raise
except Exception as e:
logger.error(f"Error performing action '{action}': {e}", exc_info=True)
raise
finally:
if transport and transport.isOpen():
transport.close()
logger.info("Thrift connection closed.")
with DAG(
dag_id="ytdlp_mgmt_proxy_account",
start_date=days_ago(1),
schedule=None,
catchup=False,
tags=["ytdlp", "utility", "proxy", "account", "management"],
doc_md="""
### YT-DLP Proxy and Account Manager DAG
This DAG provides tools to manage the state of **proxies and accounts** used by the `ytdlp-ops-server`.
**Parameters:**
- `host`, `port`: Connection details for the `ytdlp-ops-server` Thrift service.
- `entity`: The type of resource to manage (`proxy`, `account`, or `all`).
- `action`: The operation to perform.
- `list`: View statuses. For `entity: all`, lists both proxies and accounts.
- `ban`: Ban a specific proxy or account.
- `unban`: Un-ban a specific proxy or account.
- `reset_all`: Reset all proxies for a server (or all accounts) to `ACTIVE`.
- `remove_all`: **Deletes all account status keys** from Redis for a given prefix. This is a destructive action.
- `server_identity`: Required for most proxy actions.
- `proxy_url`: Required for banning/unbanning a specific proxy.
- `account_id`: Required for managing a specific account. For `action: reset_all` or `remove_all` on `entity: account`, this can be used as an optional prefix to filter which accounts to act on.
- `confirm_remove_all_accounts`: **Required for `remove_all` action.** Must be set to `True` to confirm deletion.
""",
params={
"host": Param(DEFAULT_YT_AUTH_SERVICE_IP, type="string", description="The hostname of the ytdlp-ops-server service. Default is from Airflow variable YT_AUTH_SERVICE_IP or hardcoded."),
"port": Param(DEFAULT_YT_AUTH_SERVICE_PORT, type="integer", description="The port of the ytdlp-ops-server service (Envoy load balancer). Default is from Airflow variable YT_AUTH_SERVICE_PORT or hardcoded."),
"entity": Param(
"all",
type="string",
enum=["proxy", "account", "all"],
description="The type of entity to manage. Use 'all' with action 'list' to see both.",
),
"action": Param(
"list",
type="string",
enum=["list", "ban", "unban", "reset_all", "remove_all"],
description="The management action to perform. `reset_all` for proxies/accounts. `remove_all` for accounts only.",
),
"server_identity": Param(
"ytdlp-ops-airflow-service",
type=["null", "string"],
description="The identity of the server instance (for proxy management).",
),
"proxy_url": Param(
None,
type=["null", "string"],
description="The proxy URL to act upon (e.g., 'socks5://host:port').",
),
"account_id": Param(
None,
type=["null", "string"],
description="The account ID to act upon. For `reset_all` or `remove_all` on accounts, this can be an optional prefix.",
),
"confirm_remove_all_accounts": Param(
False,
type="boolean",
title="[remove_all] Confirm Deletion",
description="Must be set to True to execute the 'remove_all' action for accounts. This is a destructive operation.",
),
"redis_conn_id": Param(
DEFAULT_REDIS_CONN_ID,
type="string",
title="Redis Connection ID",
description="The Airflow connection ID for the Redis server (used for 'remove_all').",
),
},
) as dag:
system_management_task = PythonOperator(
task_id="system_management_task",
python_callable=manage_system_callable,
)
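For quick checks outside Airflow, the same Thrift calls used above can be scripted directly; a sketch assuming the `pangramia` Thrift bindings are installed and using an example endpoint:

```python
# Standalone sketch mirroring the helpers above: connect to the Thrift
# service and print account statuses outside Airflow. Host/port are examples.
from thrift.protocol import TBinaryProtocol
from thrift.transport import TSocket, TTransport

from pangramia.yt.tokens_ops import YTTokenOpService

host, port = "127.0.0.1", 9080  # example endpoint

transport = TTransport.TFramedTransport(TSocket.TSocket(host, port))
protocol = TBinaryProtocol.TBinaryProtocol(transport)
client = YTTokenOpService.Client(protocol)
transport.open()
try:
    # Same call the mgmt DAG uses: no accountId/accountPrefix returns all accounts.
    for status in client.getAccountStatus(accountId=None, accountPrefix=None):
        print(status.accountId, status.status, status.successCount, status.failureCount)
finally:
    transport.close()
```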

View File

@@ -164,7 +164,7 @@ def clear_queue_callable(**context):
    redis_conn_id = params['redis_conn_id']
    queue_to_clear = params['queue_to_clear']
    dump_queues = params['dump_queues']
    # The value from templates_dict is already rendered by Airflow.
    dump_dir = context['templates_dict']['dump_dir']
    dump_patterns = params['dump_patterns'].split(',') if params.get('dump_patterns') else []
@@ -191,34 +191,43 @@ def clear_queue_callable(**context):
def list_contents_callable(**context):
    """Lists the contents of the specified Redis key(s) (list or hash)."""
    params = context['params']
    redis_conn_id = params['redis_conn_id']
    queues_to_list_str = params.get('queue_to_list')
    max_items = params.get('max_items', 10)
    if not queues_to_list_str:
        raise ValueError("Parameter 'queue_to_list' cannot be empty.")
    queues_to_list = [q.strip() for q in queues_to_list_str.split(',') if q.strip()]
    if not queues_to_list:
        logger.info("No valid queue names provided in 'queue_to_list'. Nothing to do.")
        return
    logger.info(f"Attempting to list contents for {len(queues_to_list)} Redis key(s): {queues_to_list}")
    redis_client = _get_redis_client(redis_conn_id)
    for queue_to_list in queues_to_list:
        # Add a newline for better separation in logs
        logger.info(f"\n--- Listing contents of Redis key '{queue_to_list}' (max: {max_items}) ---")
        try:
            key_type_bytes = redis_client.type(queue_to_list)
            key_type = key_type_bytes.decode('utf-8')  # Decode type
            if key_type == 'list':
                list_length = redis_client.llen(queue_to_list)
                items_to_fetch = min(max_items, list_length)
                contents_bytes = redis_client.lrange(queue_to_list, -items_to_fetch, -1)
                contents = [item.decode('utf-8') for item in contents_bytes]
                contents.reverse()
                logger.info(f"--- Contents of Redis List '{queue_to_list}' ---")
                logger.info(f"Total items in list: {list_length}")
                if contents:
                    logger.info(f"Showing most recent {len(contents)} item(s):")
                    for i, item in enumerate(contents):
                        logger.info(f"  [recent_{i}]: {item}")
                    if list_length > len(contents):
                        logger.info(f"  ... ({list_length - len(contents)} older items not shown)")
@@ -226,26 +235,25 @@ def list_contents_callable(**context):
            elif key_type == 'hash':
                hash_size = redis_client.hlen(queue_to_list)
                if hash_size > max_items * 2:
                    logger.warning(f"Hash '{queue_to_list}' has {hash_size} fields, which is large. Listing might be slow or incomplete. Consider using redis-cli HSCAN.")
                contents_bytes = redis_client.hgetall(queue_to_list)
                contents = {k.decode('utf-8'): v.decode('utf-8') for k, v in contents_bytes.items()}
                logger.info(f"--- Contents of Redis Hash '{queue_to_list}' ---")
                logger.info(f"Total fields in hash: {hash_size}")
                if contents:
                    logger.info(f"Showing up to {max_items} item(s):")
                    item_count = 0
                    for key, value in contents.items():
                        if item_count >= max_items:
                            logger.info(f"  ... (stopped listing after {max_items} items of {hash_size})")
                            break
                        try:
                            parsed_value = json.loads(value)
                            pretty_value = json.dumps(parsed_value, indent=2)
                            logger.info(f"  '{key}':\n{pretty_value}")
                        except json.JSONDecodeError:
                            logger.info(f"  '{key}': {value}")
                        item_count += 1
                logger.info(f"--- End of Hash Contents ---")
@@ -256,7 +264,7 @@ def list_contents_callable(**context):
        except Exception as e:
            logger.error(f"Failed to list contents of Redis key '{queue_to_list}': {e}", exc_info=True)
            # Continue to the next key in the list instead of failing the whole task


def check_status_callable(**context):
@@ -292,6 +300,63 @@ def check_status_callable(**context):
        raise AirflowException(f"Failed to check queue status: {e}")


def requeue_failed_callable(**context):
    """
    Copies all URLs from the fail hash to the inbox list and optionally clears the fail hash.
    """
    params = context['params']
    redis_conn_id = params['redis_conn_id']
    queue_name = params['queue_name_for_requeue']
    clear_fail_queue = params['clear_fail_queue_after_requeue']
    fail_queue_name = f"{queue_name}_fail"
    inbox_queue_name = f"{queue_name}_inbox"
    logger.info(f"Requeuing failed URLs from '{fail_queue_name}' to '{inbox_queue_name}'.")
    print(f"Requeuing failed URLs from '{fail_queue_name}' to '{inbox_queue_name}'.")
    redis_client = _get_redis_client(redis_conn_id)
    try:
        # The fail queue is a hash. The keys are the URLs.
        failed_urls_bytes = redis_client.hkeys(fail_queue_name)
        if not failed_urls_bytes:
            logger.info(f"Fail queue '{fail_queue_name}' is empty. Nothing to requeue.")
            print(f"Fail queue '{fail_queue_name}' is empty. Nothing to requeue.")
            return
        failed_urls = [url.decode('utf-8') for url in failed_urls_bytes]
        logger.info(f"Found {len(failed_urls)} URLs to requeue.")
        print(f"Found {len(failed_urls)} URLs to requeue:")
        for url in failed_urls:
            print(f"  - {url}")
        # Add URLs to the inbox list
        if failed_urls:
            with redis_client.pipeline() as pipe:
                pipe.rpush(inbox_queue_name, *failed_urls)
                if clear_fail_queue:
                    pipe.delete(fail_queue_name)
                pipe.execute()
        final_list_length = redis_client.llen(inbox_queue_name)
        success_message = (
            f"Successfully requeued {len(failed_urls)} URLs to '{inbox_queue_name}'. "
            f"The list now contains {final_list_length} items."
        )
        logger.info(success_message)
        print(f"\n{success_message}")
        if clear_fail_queue:
            logger.info(f"Successfully cleared fail queue '{fail_queue_name}'.")
        else:
            logger.info(f"Fail queue '{fail_queue_name}' was not cleared as per configuration.")
    except Exception as e:
        logger.error(f"Failed to requeue failed URLs: {e}", exc_info=True)
        raise AirflowException(f"Failed to requeue failed URLs: {e}")


def add_videos_to_queue_callable(**context):
    """
    Parses video inputs, normalizes them to URLs, and adds them to a Redis queue.
@@ -381,13 +446,14 @@ with DAG(
    - `add_videos`: Add one or more YouTube videos to a queue.
    - `clear_queue`: Dump and/or delete a specific Redis key.
    - `list_contents`: View the contents of a Redis key (list or hash).
    - `check_status`: Check the overall status of the queues.
    - `requeue_failed`: Copy all URLs from the `_fail` hash to the `_inbox` list and clear the `_fail` hash.
    """,
    params={
        "action": Param(
            "add_videos",
            type="string",
            enum=["add_videos", "clear_queue", "list_contents", "check_status", "requeue_failed"],
            title="Action",
            description="The management action to perform.",
        ),
@@ -437,10 +503,10 @@ with DAG(
        ),
        # --- Params for 'list_contents' ---
        "queue_to_list": Param(
            'video_queue_inbox,video_queue_fail',
            type="string",
            title="[list_contents] Queues to List",
            description="Comma-separated list of exact Redis key names to list.",
        ),
        "max_items": Param(
            10,
@@ -455,6 +521,19 @@ with DAG(
            title="[check_status] Base Queue Name",
            description="Base name of the queues to check (e.g., 'video_queue').",
        ),
        # --- Params for 'requeue_failed' ---
        "queue_name_for_requeue": Param(
            DEFAULT_QUEUE_NAME,
            type="string",
            title="[requeue_failed] Base Queue Name",
            description="Base name of the queues to requeue from (e.g., 'video_queue' will use 'video_queue_fail').",
        ),
        "clear_fail_queue_after_requeue": Param(
            True,
            type="boolean",
            title="[requeue_failed] Clear Fail Queue",
            description="If True, deletes the `_fail` hash after requeueing items.",
        ),
        # --- Common Params ---
        "redis_conn_id": Param(
            DEFAULT_REDIS_CONN_ID,
@@ -489,5 +568,16 @@ with DAG(
        python_callable=check_status_callable,
    )

    action_requeue_failed = PythonOperator(
        task_id="action_requeue_failed",
        python_callable=requeue_failed_callable,
    )

    # --- Wire up tasks ---
    branch_on_action >> [
        action_add_videos,
        action_clear_queue,
        action_list_contents,
        action_check_status,
        action_requeue_failed,
    ]
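To seed the pipeline, URLs just need to land in the `<queue_name>_inbox` list that the workers pop from; a minimal sketch with redis-py using an example connection (the DAG's `add_videos` action performs the same `rpush` after normalizing video IDs to URLs):

```python
# Minimal sketch: push URLs straight into the inbox list the workers consume.
# Connection details are examples only.
import redis

r = redis.Redis(host="localhost", port=6379, db=0)  # example connection
urls = [
    "https://www.youtube.com/watch?v=dQw4w9WgXcQ",
]
r.rpush("video_queue_inbox", *urls)
print("queue length:", r.llen("video_queue_inbox"))
```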

View File

@@ -0,0 +1,194 @@
# -*- coding: utf-8 -*-
# vim:fenc=utf-8
#
# Copyright © 2024 rl <rl@rlmbp>
#
# Distributed under terms of the MIT license.
"""
DAG to orchestrate ytdlp_ops_worker_per_url DAG runs based on a defined policy.
It fetches URLs from a Redis queue and launches workers in controlled bunches.
"""
from airflow import DAG
from airflow.exceptions import AirflowException, AirflowSkipException
from airflow.operators.python import PythonOperator
from airflow.models.param import Param
from airflow.models.variable import Variable
from airflow.utils.dates import days_ago
from airflow.api.common.trigger_dag import trigger_dag
from airflow.models.dagrun import DagRun
from airflow.models.dag import DagModel
from datetime import timedelta
import logging
import random
import time
# Import utility functions
from utils.redis_utils import _get_redis_client
# Import Thrift modules for proxy status check
from pangramia.yt.tokens_ops import YTTokenOpService
from thrift.protocol import TBinaryProtocol
from thrift.transport import TSocket, TTransport
# Configure logging
logger = logging.getLogger(__name__)
# Default settings
DEFAULT_QUEUE_NAME = 'video_queue'
DEFAULT_REDIS_CONN_ID = 'redis_default'
DEFAULT_TOTAL_WORKERS = 3
DEFAULT_WORKERS_PER_BUNCH = 1
DEFAULT_WORKER_DELAY_S = 5
DEFAULT_BUNCH_DELAY_S = 20
DEFAULT_YT_AUTH_SERVICE_IP = Variable.get("YT_AUTH_SERVICE_IP", default_var="16.162.82.212")
DEFAULT_YT_AUTH_SERVICE_PORT = Variable.get("YT_AUTH_SERVICE_PORT", default_var=9080)
# --- Helper Functions ---
# --- Main Orchestration Callable ---
def orchestrate_workers_ignition_callable(**context):
"""
Main orchestration logic. Triggers a specified number of worker DAGs
to initiate self-sustaining processing loops.
"""
params = context['params']
logger.info("Starting worker ignition sequence.")
worker_dag_id = 'ytdlp_ops_worker_per_url'
dag_model = DagModel.get_dagmodel(worker_dag_id)
if dag_model and dag_model.is_paused:
raise AirflowException(f"Worker DAG '{worker_dag_id}' is paused. Cannot start worker loops.")
total_workers = int(params['total_workers'])
workers_per_bunch = int(params['workers_per_bunch'])
worker_delay = int(params['delay_between_workers_s'])
bunch_delay = int(params['delay_between_bunches_s'])
# Create a list of worker numbers to trigger
worker_indices = list(range(total_workers))
bunches = [worker_indices[i:i + workers_per_bunch] for i in range(0, len(worker_indices), workers_per_bunch)]
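# e.g. total_workers=5, workers_per_bunch=2 -> bunches = [[0, 1], [2, 3], [4]]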
logger.info(f"Plan: Starting {total_workers} total workers in {len(bunches)} bunches.")
dag_run_id = context['dag_run'].run_id
total_triggered = 0
# Pass all orchestrator params to the worker so it has the full context for its loop.
conf_to_pass = {p: params[p] for p in params}
# The worker pulls its own URL, so we don't pass one.
if 'url' in conf_to_pass:
del conf_to_pass['url']
for i, bunch in enumerate(bunches):
logger.info(f"--- Igniting Bunch {i+1}/{len(bunches)} (contains {len(bunch)} worker(s)) ---")
for j, _ in enumerate(bunch):
# Create a unique run_id for each worker loop starter
run_id = f"ignited_{dag_run_id}_{total_triggered}"
logger.info(f"Igniting worker {j+1}/{len(bunch)} in bunch {i+1} (loop {total_triggered + 1}/{total_workers}) (Run ID: {run_id})")
logger.debug(f"Full conf for worker loop {run_id}: {conf_to_pass}")
trigger_dag(
dag_id=worker_dag_id,
run_id=run_id,
conf=conf_to_pass,
replace_microseconds=False
)
total_triggered += 1
# Delay between workers in a bunch
if j < len(bunch) - 1:
logger.info(f"Waiting {worker_delay}s before next worker in bunch...")
time.sleep(worker_delay)
# Delay between bunches
if i < len(bunches) - 1:
logger.info(f"--- Bunch {i+1} ignited. Waiting {bunch_delay}s before next bunch... ---")
time.sleep(bunch_delay)
logger.info(f"--- Ignition sequence complete. Total worker loops started: {total_triggered}. ---")
# =============================================================================
# DAG Definition
# =============================================================================
default_args = {
'owner': 'airflow',
'depends_on_past': False,
'email_on_failure': False,
'email_on_retry': False,
'retries': 1,
'retry_delay': timedelta(minutes=1),
'start_date': days_ago(1),
}
with DAG(
dag_id='ytdlp_ops_orchestrator',
default_args=default_args,
schedule_interval=None, # This DAG runs only when triggered.
max_active_runs=1, # Only one ignition process should run at a time.
catchup=False,
description='Ignition system for ytdlp_ops_worker_per_url DAGs. Starts self-sustaining worker loops.',
doc_md="""
### YT-DLP Worker Ignition System
This DAG acts as an "ignition system" to start one or more self-sustaining worker loops.
It does **not** process URLs itself. Its only job is to trigger a specified number of `ytdlp_ops_worker_per_url` DAGs.
#### How it Works:
1. **Manual Trigger:** You manually trigger this DAG with parameters defining how many worker loops to start (`total_workers`), in what configuration (`workers_per_bunch`, delays).
2. **Ignition:** The orchestrator triggers the initial set of worker DAGs in a "fire-and-forget" manner, passing all its configuration parameters to them.
3. **Completion:** Once all initial workers have been triggered, the orchestrator's job is complete.
The workers then take over, each running its own continuous processing loop.
""",
tags=['ytdlp', 'orchestrator', 'ignition'],
params={
# --- Ignition Control Parameters ---
'total_workers': Param(DEFAULT_TOTAL_WORKERS, type="integer", description="Total number of worker loops to start."),
'workers_per_bunch': Param(DEFAULT_WORKERS_PER_BUNCH, type="integer", description="Number of workers to start in each bunch."),
'delay_between_workers_s': Param(DEFAULT_WORKER_DELAY_S, type="integer", description="Delay in seconds between starting each worker within a bunch."),
'delay_between_bunches_s': Param(DEFAULT_BUNCH_DELAY_S, type="integer", description="Delay in seconds between starting each bunch."),
# --- Worker Passthrough Parameters ---
'on_bannable_failure': Param(
'retry_with_new_account',
type="string",
enum=['stop_loop', 'retry_with_new_account'],
title="[Worker Param] On Bannable Failure Policy",
description="Policy for a worker when a bannable error occurs. "
"'stop_loop': Ban the account, mark URL as failed, and stop the worker's loop. "
"'retry_with_new_account': Ban the failed account, retry ONCE with a new account. If retry fails, ban the second account and proxy, then stop."
),
'queue_name': Param(DEFAULT_QUEUE_NAME, type="string", description="[Worker Param] Base name for Redis queues."),
'redis_conn_id': Param(DEFAULT_REDIS_CONN_ID, type="string", description="[Worker Param] Airflow Redis connection ID."),
'clients': Param('mweb,ios,android', type="string", description="[Worker Param] Comma-separated list of clients for token generation."),
'account_pool': Param('ytdlp_account', type="string", description="[Worker Param] Account pool prefix or comma-separated list."),
'account_pool_size': Param(10, type=["integer", "null"], description="[Worker Param] If using a prefix for 'account_pool', this specifies the number of accounts to generate (e.g., 10 for 'prefix_01' through 'prefix_10'). Required when using a prefix."),
'service_ip': Param(DEFAULT_YT_AUTH_SERVICE_IP, type="string", description="[Worker Param] IP of the ytdlp-ops-server. Default is from Airflow variable YT_AUTH_SERVICE_IP or hardcoded."),
'service_port': Param(DEFAULT_YT_AUTH_SERVICE_PORT, type="integer", description="[Worker Param] Port of the Envoy load balancer. Default is from Airflow variable YT_AUTH_SERVICE_PORT or hardcoded."),
'machine_id': Param("ytdlp-ops-airflow-service", type="string", description="[Worker Param] Identifier for the client machine."),
'auto_create_new_accounts_on_exhaustion': Param(True, type="boolean", description="[Worker Param] If True and all accounts in a prefix-based pool are exhausted, create a new one automatically."),
'retrigger_delay_on_empty_s': Param(60, type="integer", description="[Worker Param] Delay in seconds before a worker re-triggers itself if the queue is empty. Set to -1 to stop the loop."),
}
) as dag:
orchestrate_task = PythonOperator(
task_id='start_worker_loops',
python_callable=orchestrate_workers_ignition_callable,
)
orchestrate_task.doc_md = """
### Start Worker Loops
This is the main task that executes the ignition policy.
- It triggers `ytdlp_ops_worker_per_url` DAGs according to the batch settings.
- It passes all its parameters down to the workers, which will use them to run their continuous loops.
"""

View File

@@ -1,215 +0,0 @@
# -*- coding: utf-8 -*-
# vim:fenc=utf-8
#
# Copyright © 2024 rl <rl@rlmbp>
#
# Distributed under terms of the MIT license.
"""
DAG to sense a Redis queue for new URLs and trigger the ytdlp_worker_per_url DAG.
This is the "Sensor" part of a Sensor/Worker pattern.
"""
from airflow import DAG
from airflow.exceptions import AirflowException, AirflowSkipException
from airflow.operators.python import PythonOperator
from airflow.operators.trigger_dagrun import TriggerDagRunOperator
from airflow.providers.redis.hooks.redis import RedisHook
from airflow.models.param import Param
from airflow.utils.dates import days_ago
from datetime import timedelta
import logging
import redis
# Import utility functions
from utils.redis_utils import _get_redis_client
# Configure logging
logger = logging.getLogger(__name__)
# Default settings
DEFAULT_QUEUE_NAME = 'video_queue'
DEFAULT_REDIS_CONN_ID = 'redis_default'
DEFAULT_TIMEOUT = 30
DEFAULT_MAX_URLS = '1' # Default number of URLs to process per run
# --- Task Callables ---
def select_account_callable(**context):
"""
Placeholder task for future logic to dynamically select an account.
For now, it just passes through the account_id from the DAG params.
"""
params = context['params']
account_id = params.get('account_id', 'default_account')
logger.info(f"Selected account for this run: {account_id}")
# This task could push the selected account_id to XComs in the future.
# For now, the next task will just read it from params.
return account_id
def log_trigger_info_callable(**context):
"""Logs information about how the DAG run was triggered."""
dag_run = context['dag_run']
trigger_type = dag_run.run_type
logger.info(f"Sensor DAG triggered. Run ID: {dag_run.run_id}, Type: {trigger_type}")
if trigger_type == 'manual':
logger.info("Trigger source: Manual execution from Airflow UI or CLI.")
elif trigger_type == 'scheduled':
logger.info("Trigger source: Scheduled run (periodic check).")
elif trigger_type == 'dag_run':
# In Airflow 2.2+ we can get the triggering run object
try:
triggering_dag_run = dag_run.get_triggering_dagrun()
if triggering_dag_run:
triggering_dag_id = triggering_dag_run.dag_id
triggering_run_id = triggering_dag_run.run_id
logger.info(f"Trigger source: DAG Run from '{triggering_dag_id}' (Run ID: {triggering_run_id}).")
# Check if it's a worker by looking at the conf keys
conf = dag_run.conf or {}
if all(k in conf for k in ['queue_name', 'redis_conn_id', 'max_urls_per_run']):
logger.info("This appears to be a standard trigger from a worker DAG continuing the loop.")
else:
logger.warning(f"Triggered by another DAG but conf does not match worker pattern. Conf: {conf}")
else:
logger.warning("Trigger type is 'dag_run' but could not retrieve triggering DAG run details.")
except Exception as e:
logger.error(f"Could not get triggering DAG run details: {e}")
else:
logger.info(f"Trigger source: {trigger_type}")
def check_queue_for_urls_batch(**context):
    """
    Pops a batch of URLs from the inbox queue.
    Returns a list of configuration dictionaries for the TriggerDagRunOperator.
    If the queue is empty, it raises AirflowSkipException.
    """
    params = context['params']
    queue_name = params['queue_name']
    inbox_queue = f"{queue_name}_inbox"
    redis_conn_id = params.get('redis_conn_id', DEFAULT_REDIS_CONN_ID)
    max_urls_raw = params.get('max_urls_per_run', DEFAULT_MAX_URLS)
    try:
        max_urls = int(max_urls_raw)
    except (ValueError, TypeError):
        logger.warning(f"Invalid value for max_urls_per_run: '{max_urls_raw}'. Using default: {DEFAULT_MAX_URLS}")
        max_urls = int(DEFAULT_MAX_URLS)

    urls_to_process = []
    try:
        client = _get_redis_client(redis_conn_id)
        current_queue_size = client.llen(inbox_queue)
        logger.info(f"Queue '{inbox_queue}' has {current_queue_size} URLs. Attempting to pop up to {max_urls}.")
        for _ in range(max_urls):
            url_bytes = client.lpop(inbox_queue)
            if url_bytes:
                url = url_bytes.decode('utf-8') if isinstance(url_bytes, bytes) else url_bytes
                logger.info(f" - Popped URL: {url}")
                urls_to_process.append(url)
            else:
                # Queue is empty, stop trying to pop
                break
        if urls_to_process:
            logger.info(f"Found {len(urls_to_process)} URLs in queue. Generating trigger configurations.")
            # Create a list of 'conf' objects for the trigger operator to expand
            trigger_configs = []
            for url in urls_to_process:
                # The worker DAG will use its own default params for its operations.
                # We only need to provide the URL for processing, and the sensor's own
                # params so the worker can trigger the sensor again to continue the loop.
                worker_conf = {
                    'url': url,
                    'queue_name': queue_name,
                    'redis_conn_id': redis_conn_id,
                    'max_urls_per_run': int(max_urls),
                    'stop_on_failure': params.get('stop_on_failure', True),
                    'account_id': params.get('account_id', 'default_account')
                }
                trigger_configs.append(worker_conf)
            return trigger_configs
        else:
            logger.info(f"Queue '{inbox_queue}' is empty. Skipping trigger.")
            raise AirflowSkipException(f"Redis queue '{inbox_queue}' is empty.")
    except AirflowSkipException:
        raise
    except Exception as e:
        logger.error(f"Error popping URLs from Redis queue '{inbox_queue}': {e}", exc_info=True)
        raise AirflowException(f"Failed to pop URLs from Redis: {e}")


# =============================================================================
# DAG Definition
# =============================================================================
default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 0,  # The sensor itself should not retry on failure; the loop triggers it again.
    'start_date': days_ago(1),
}

with DAG(
    dag_id='ytdlp_ops_sensor_queue',
    default_args=default_args,
    schedule_interval=None,  # Runs only on trigger, not on a schedule.
    max_active_runs=1,  # Prevent multiple sensors from running at once
    catchup=False,
    description='Polls Redis queue on trigger for URLs and starts worker DAGs.',
    tags=['ytdlp', 'sensor', 'queue', 'redis', 'batch'],
    params={
        'queue_name': Param(DEFAULT_QUEUE_NAME, type="string", description="Base name for Redis queues."),
        'redis_conn_id': Param(DEFAULT_REDIS_CONN_ID, type="string", description="Airflow Redis connection ID."),
        'max_urls_per_run': Param(DEFAULT_MAX_URLS, type="string", description="Maximum number of URLs to process in one batch."),
        'stop_on_failure': Param(True, type="boolean", description="If True, a worker failure will stop the entire processing loop."),
        'account_id': Param('default_account', type="string", description="The account ID to use for processing the batch."),
    }
) as dag:
    log_trigger_info_task = PythonOperator(
        task_id='log_trigger_info',
        python_callable=log_trigger_info_callable,
    )
    log_trigger_info_task.doc_md = """
### Log Trigger Information
Logs details about how this DAG run was initiated (e.g., manually or by a worker DAG).
This provides visibility into the processing loop.
"""

    poll_redis_task = PythonOperator(
        task_id='check_queue_for_urls_batch',
        python_callable=check_queue_for_urls_batch,
    )
    poll_redis_task.doc_md = """
### Poll Redis Queue for Batch
Checks the Redis inbox queue for a batch of new URLs (up to `max_urls_per_run`).
- **On Success (URLs found):** Returns a list of configuration objects for the trigger task.
- **On Skip (Queue empty):** Skips this task and the trigger task. The DAG run succeeds.
"""

    # This operator will be dynamically expanded based on the output of poll_redis_task
    trigger_worker_dags = TriggerDagRunOperator.partial(
        task_id='trigger_worker_dags',
        trigger_dag_id='ytdlp_ops_worker_per_url',
        wait_for_completion=False,  # Fire and forget
        doc_md="""
### Trigger Worker DAGs (Dynamically Mapped)
Triggers one `ytdlp_ops_worker_per_url` DAG run for each URL found by the polling task.
Each triggered DAG receives its own specific configuration (including the URL).
This task is skipped if the polling task finds no URLs.
""",
    ).expand(
        conf=poll_redis_task.output
    )

    select_account_task = PythonOperator(
        task_id='select_account',
        python_callable=select_account_callable,
    )
    select_account_task.doc_md = "### Select Account\n(Placeholder for future dynamic account selection logic)"

    log_trigger_info_task >> select_account_task >> poll_redis_task >> trigger_worker_dags
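
The sensor above drains the `<queue_name>_inbox` Redis list with `LPOP`. A minimal producer sketch, assuming direct access to the same Redis instance and the default `video_queue` name (host, port, and password are placeholders for your own connection settings):

```python
# Minimal producer sketch: push URLs into the inbox list that the sensor
# drains with LPOP. Host/port/password are placeholders for your Redis setup.
import redis

client = redis.Redis(host='localhost', port=6379, password=None)

urls = [
    'https://www.youtube.com/watch?v=dQw4w9WgXcQ',
]
# RPUSH at the tail + LPOP at the head gives FIFO ordering.
client.rpush('video_queue_inbox', *urls)
print(f"video_queue_inbox length: {client.llen('video_queue_inbox')}")
```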

File diff suppressed because it is too large.


@@ -1,4 +1,42 @@
 services:
+  config-generator:
+    image: python:3.9-slim
+    container_name: ytdlp-ops-config-generator
+    working_dir: /app
+    volumes:
+      # Mount the current directory to access the template, .env, and script
+      - .:/app
+    env_file:
+      - ./.env
+    environment:
+      ENVOY_CLUSTER_TYPE: STRICT_DNS
+      # Pass worker count and base port to ensure Envoy config matches the workers
+      YTDLP_WORKERS: ${YTDLP_WORKERS:-3}
+      YTDLP_BASE_PORT: ${YTDLP_BASE_PORT:-9090}
+    # This command cleans up old runs, installs jinja2, and generates the config.
+    command: >
+      sh -c "rm -rf ./envoy.yaml &&
+             pip install --no-cache-dir -q jinja2 &&
+             python3 ./generate_envoy_config.py"
+
+  envoy:
+    image: envoyproxy/envoy:v1.29-latest
+    container_name: envoy-thrift-lb
+    restart: unless-stopped
+    volumes:
+      # Mount the generated config file from the host
+      - ./envoy.yaml:/etc/envoy/envoy.yaml:ro
+    ports:
+      # This is the single public port for all Thrift traffic
+      - "${ENVOY_PORT:-9080}:${ENVOY_PORT:-9080}"
+    networks:
+      - airflow_prod_proxynet
+    depends_on:
+      config-generator:
+        condition: service_completed_successfully
+      ytdlp-ops:
+        condition: service_started
+
   camoufox:
     build:
       context: ./camoufox # Path relative to the docker-compose file
@@ -15,9 +53,8 @@ services:
         "--ws-host", "0.0.0.0",
         "--port", "12345",
         "--ws-path", "mypath",
-        "--proxy-url", "socks5://sslocal-rust-1084:1084",
+        "--proxy-url", "socks5://${SOCKS5_SOCK_SERVER_IP:-89.253.221.173}:1084",
         "--locale", "en-US",
-        "--geoip",
         "--extensions", "/app/extensions/google_sign_in_popup_blocker-1.0.2.xpi,/app/extensions/spoof_timezone-0.3.4.xpi,/app/extensions/youtube_ad_auto_skipper-0.6.0.xpi"
       ]
     restart: unless-stopped
@@ -25,25 +62,36 @@ services:
   ytdlp-ops:
     image: pangramia/ytdlp-ops-server:latest # Don't comment out or remove, build is performed externally
+    container_name: ytdlp-ops-workers # Renamed for clarity
     depends_on:
       - camoufox # Ensure camoufox starts first
-    ports:
-      - "9090:9090" # Main RPC port
-      - "9091:9091" # Health check port
+    # Ports are no longer exposed directly. Envoy will connect to them on the internal network.
+    env_file:
+      - ./.env # Path is relative to the compose file
     volumes:
      - context-data:/app/context-data
+      # Mount the plugin source code for live updates without rebuilding the image.
+      # Assumes the plugin source is in a 'bgutil-ytdlp-pot-provider' directory
+      # next to your docker-compose.yaml file.
+      #- ./bgutil-ytdlp-pot-provider:/app/bgutil-ytdlp-pot-provider
     networks:
       - airflow_prod_proxynet
     command:
+      - "--script-dir"
+      - "/app"
       - "--context-dir"
       - "/app/context-data"
+      # Use environment variables for port and worker count
       - "--port"
-      - "9090"
+      - "${YTDLP_BASE_PORT:-9090}"
+      - "--workers"
+      - "${YTDLP_WORKERS:-3}"
       - "--clients"
+      # Add 'web' client since we now have camoufox, test firstly
      - "web,ios,android,mweb"
       - "--proxies"
-      - "socks5://sslocal-rust-1081:1081,socks5://sslocal-rust-1082:1082,socks5://sslocal-rust-1083:1083,socks5://sslocal-rust-1084:1084,socks5://sslocal-rust-1085:1085"
+      #- "socks5://sslocal-rust-1081:1081,socks5://sslocal-rust-1082:1082,socks5://sslocal-rust-1083:1083,socks5://sslocal-rust-1084:1084,socks5://sslocal-rust-1085:1085"
+      - "socks5://${SOCKS5_SOCK_SERVER_IP:-89.253.221.173}:1084"
+      #
       # Add the endpoint argument pointing to the camoufox service
       - "--endpoint"
       - "ws://camoufox:12345/mypath"
@@ -61,6 +109,13 @@ services:
       - "${REDIS_PORT:-6379}"
       - "--redis-password"
       - "${REDIS_PASSWORD}"
+      # Add account cooldown parameters (values are in minutes)
+      - "--account-active-duration-min"
+      - "${ACCOUNT_ACTIVE_DURATION_MIN:-30}"
+      - "--account-cooldown-duration-min"
+      - "${ACCOUNT_COOLDOWN_DURATION_MIN:-60}"
+      # Add flag to clean context directory on start
+      - "--clean-context-dir"
     restart: unless-stopped
     pull_policy: always
 
@@ -69,5 +124,4 @@
     name: context-data
 
 networks:
-  airflow_prod_proxynet:
-    external: true
+  airflow_prod_proxynet: {}
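
The compose changes above move all tuning into `./.env` (`YTDLP_WORKERS`, `YTDLP_BASE_PORT`, `ENVOY_PORT`, `SOCKS5_SOCK_SERVER_IP`, the Redis settings, and the account cooldown durations). The sketch below only illustrates the relationship implied by the config-generator comment: one Thrift worker per consecutive port starting at the base port, all behind the single Envoy listener. The authoritative layout comes from `generate_envoy_config.py`, so treat the port arithmetic as an assumption.

```python
# Assumption for illustration: workers listen on consecutive ports starting at
# YTDLP_BASE_PORT, matching the "Envoy config matches the workers" comment.
# generate_envoy_config.py is the source of truth for the real layout.
import os

workers = int(os.getenv('YTDLP_WORKERS', '3'))
base_port = int(os.getenv('YTDLP_BASE_PORT', '9090'))
envoy_port = int(os.getenv('ENVOY_PORT', '9080'))

upstreams = [('ytdlp-ops', base_port + i) for i in range(workers)]
print(f"Envoy listener :{envoy_port} -> upstream Thrift workers: {upstreams}")
```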

requirements.txt (new file)

@@ -0,0 +1,9 @@
thrift>=0.16.0,<=0.20.0
backoff>=2.2.1
python-dotenv==1.0.1
psutil>=5.9.0
docker>=6.0.0
apache-airflow-providers-docker
redis
ffprobe3
ffmpeg-python