Update Readme and ytdlp dags

This commit is contained in:
aperez 2025-07-18 17:17:19 +03:00
parent affc59ee57
commit 1f186fd217
9 changed files with 186 additions and 219 deletions

README.en.old.md Normal file

@ -0,0 +1,100 @@
# YTDLP Airflow DAGs
This document describes the Airflow DAGs used for interacting with the YTDLP Ops service and managing processing queues.
## DAG Descriptions
### `ytdlp_client_dag_v2.1`
* **File:** `airflow/dags/ytdlp_client_dag_v2.1.py`
* **Purpose:** Provides a way to test the YTDLP Ops Thrift service interaction for a *single* video URL. Useful for debugging connection issues, testing specific account IDs, or verifying the service response for a particular URL independently of the queueing system.
* **Parameters (Defaults):**
* `url` (`'https://www.youtube.com/watch?v=sOlTX9uxUtM'`): The video URL to process.
* `redis_enabled` (`False`): Use Redis for service discovery?
* `service_ip` (`'85.192.30.55'`): Service IP if `redis_enabled=False`.
* `service_port` (`9090`): Service port if `redis_enabled=False`.
* `account_id` (`'account_fr_2025-04-03T1220_anonomyous_2ssdfsf2342afga09'`): Account ID for lookup or call.
* `timeout` (`30`): Timeout in seconds for Thrift connection.
* `info_json_dir` (`"{{ var.value.get('DOWNLOADS_TEMP', '/opt/airflow/downloadfiles') }}"`): Directory to save `info.json`.
* **Results:**
* Connects to the YTDLP Ops service using the specified method (Redis or direct IP).
* Retrieves token data for the given URL and account ID.
* Saves the video's `info.json` metadata to the specified directory.
* Extracts the SOCKS proxy (if available).
* Pushes `info_json_path`, `socks_proxy`, and the original `ytdlp_command` (with tokens) to XCom.
* Optionally stores detailed results in a Redis hash (`token_info:<video_id>`).
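
A minimal sketch of the result-publishing step described above; the helper name and argument list are hypothetical, while the XCom keys and the `token_info:<video_id>` hash name follow this description.

```python
# Illustrative helper (not the DAG's actual code) for publishing the single-URL test results.
def publish_token_results(ti, redis_client, video_id, info_json_path, socks_proxy, ytdlp_command):
    # Push the main outputs to XCom for inspection or downstream tasks.
    ti.xcom_push(key='info_json_path', value=info_json_path)
    ti.xcom_push(key='socks_proxy', value=socks_proxy)
    ti.xcom_push(key='ytdlp_command', value=ytdlp_command)
    # Optionally mirror the details into a Redis hash keyed by video ID.
    if redis_client is not None:
        redis_client.hset(f'token_info:{video_id}', mapping={
            'info_json_path': info_json_path,
            'socks_proxy': socks_proxy or '',
            'ytdlp_command': ytdlp_command,
        })
```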
### `ytdlp_mgmt_queue_add_urls`
* **File:** `airflow/dags/ytdlp_mgmt_queue_add_urls.py`
* **Purpose:** Manually add video URLs to a specific YTDLP inbox queue (Redis List).
* **Parameters (Defaults):**
* `redis_conn_id` (`'redis_default'`): Airflow Redis connection ID.
* `queue_name` (`'video_queue_inbox_account_fr_2025-04-03T1220_anonomyous_2ssdfsf2342afga09'`): Target Redis list (inbox queue).
* `urls` (`""`): Multiline string of video URLs to add.
* **Results:**
* Parses the multiline `urls` parameter.
* Adds each valid URL to the end of the Redis list specified by `queue_name`.
* Logs the number of URLs added.
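
A minimal sketch of the core add-URLs logic, assuming a redis-py client obtained from the Airflow `RedisHook`; the helper name is illustrative.

```python
# Sketch of the add-URLs step: parse the multiline parameter and append to the inbox list.
def add_urls(client, queue_name: str, urls: str) -> int:
    # Split the multiline parameter, dropping blanks and surrounding whitespace.
    parsed = [u.strip() for u in urls.splitlines() if u.strip()]
    if parsed:
        # RPUSH appends to the tail of the inbox list, preserving submission order.
        client.rpush(queue_name, *parsed)
    return len(parsed)
```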
### `ytdlp_mgmt_queue_clear`
* **File:** `airflow/dags/ytdlp_mgmt_queue_clear.py`
* **Purpose:** Manually delete a specific Redis key used by the YTDLP queues.
* **Parameters (Defaults):**
* `redis_conn_id` (`'redis_default'`): Airflow Redis connection ID.
* `queue_to_clear` (`'PLEASE_SPECIFY_QUEUE_TO_CLEAR'`): Exact name of the Redis key to clear. **Must be changed by user.**
* **Results:**
* Deletes the Redis key specified by the `queue_to_clear` parameter.
* **Warning:** This operation is destructive and irreversible. Use with extreme caution. Ensure you specify the correct key name (e.g., `video_queue_inbox_account_xyz`, `video_queue_progress`, `video_queue_result`, `video_queue_fail`).
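
The clear operation amounts to a single Redis `DELETE`; a hedged sketch (with a guard against the placeholder default) might look like this:

```python
# Sketch of the destructive clear step; DELETE removes the key regardless of its type.
def clear_queue(client, queue_to_clear: str) -> int:
    if queue_to_clear == 'PLEASE_SPECIFY_QUEUE_TO_CLEAR':
        raise ValueError("Refusing to clear the placeholder queue name; set 'queue_to_clear'.")
    # Returns 1 if the key existed and was deleted, 0 otherwise. This cannot be undone.
    return client.delete(queue_to_clear)
```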
### `ytdlp_mgmt_queue_check_status`
* **File:** `airflow/dags/ytdlp_mgmt_queue_check_status.py`
* **Purpose:** Manually check the type and size of a specific YTDLP Redis queue/key.
* **Parameters (Defaults):**
* `redis_conn_id` (`'redis_default'`): Airflow Redis connection ID.
* `queue_to_check` (`'video_queue_inbox_account_fr_2025-04-03T1220_anonomyous_2ssdfsf2342afga09'`): Exact name of the Redis key to check.
* **Results:**
* Connects to Redis and determines the type of the key specified by `queue_to_check`.
* Determines the size (length for lists, number of fields for hashes).
* Logs the key type and size.
* Pushes `queue_key_type` and `queue_size` to XCom.
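
A sketch of the type-and-size check, assuming a redis-py client; names are illustrative. The DAG then pushes the two values to XCom as `queue_key_type` and `queue_size`.

```python
# Sketch of the status check: determine the key's type, then size it accordingly.
def check_queue_status(client, queue_to_check: str):
    key_type = client.type(queue_to_check)
    key_type = key_type.decode() if isinstance(key_type, bytes) else key_type
    if key_type == 'list':
        size = client.llen(queue_to_check)
    elif key_type == 'hash':
        size = client.hlen(queue_to_check)
    elif key_type == 'none':
        size = 0  # key does not exist
    else:
        size = None  # unexpected type; left for the caller to handle
    return key_type, size
```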
### `ytdlp_mgmt_queue_list_contents`
* **File:** `airflow/dags/ytdlp_mgmt_queue_list_contents.py`
* **Purpose:** Manually list the contents of a specific YTDLP Redis queue/key (list or hash). Useful for inspecting queue state or results.
* **Parameters (Defaults):**
* `redis_conn_id` (`'redis_default'`): Airflow Redis connection ID.
* `queue_to_list` (`'video_queue_inbox_account_fr_2025-04-03T1220_anonomyous_2ssdfsf2342afga09'`): Exact name of the Redis key to list.
* `max_items` (`100`): Maximum number of items/fields to list.
* **Results:**
* Connects to Redis and identifies the type of the key specified by `queue_to_list`.
* If it's a List, logs the first `max_items` elements.
* If it's a Hash, logs up to `max_items` key-value pairs, attempting to pretty-print JSON values.
* Logs warnings for very large hashes.
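
A hedged sketch of the listing logic, assuming a redis-py client without `decode_responses`; names are illustrative.

```python
import json

# Sketch of listing queue contents with a max_items cap and JSON pretty-printing for hash values.
def list_queue_contents(client, queue_to_list: str, max_items: int = 100):
    key_type = client.type(queue_to_list)
    key_type = key_type.decode() if isinstance(key_type, bytes) else key_type
    if key_type == 'list':
        # First max_items elements, head of the queue first.
        return [v.decode() if isinstance(v, bytes) else v
                for v in client.lrange(queue_to_list, 0, max_items - 1)]
    if key_type == 'hash':
        items = {}
        # HSCAN iterates incrementally, avoiding a blocking HGETALL on very large hashes.
        for field, value in client.hscan_iter(queue_to_list, count=max_items):
            if len(items) >= max_items:
                break
            field = field.decode() if isinstance(field, bytes) else field
            value = value.decode() if isinstance(value, bytes) else value
            try:
                value = json.dumps(json.loads(value), indent=2)  # pretty-print JSON values
            except (ValueError, TypeError):
                pass  # keep the raw string if it is not JSON
            items[field] = value
        return items
    return None
```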
### `ytdlp_proc_sequential_processor`
* **File:** `airflow/dags/ytdlp_proc_sequential_processor.py`
* **Purpose:** Processes YouTube URLs sequentially from a Redis queue. Designed for batch processing. Pops a URL, gets token/metadata via YTDLP Ops service, downloads the media using `yt-dlp`, and records the result.
* **Parameters (Defaults):**
* `queue_name` (`'video_queue'`): Base name for Redis queues (e.g., `video_queue_inbox`, `video_queue_progress`).
* `redis_conn_id` (`'redis_default'`): Airflow Redis connection ID.
* `redis_enabled` (`False`): Use Redis for service discovery? If False, uses `service_ip`/`port`.
* `service_ip` (`None`): Required Service IP if `redis_enabled=False`.
* `service_port` (`None`): Required Service port if `redis_enabled=False`.
* `account_id` (`'default_account'`): Account ID for the API call (used for Redis lookup if `redis_enabled=True`).
* `timeout` (`30`): Timeout in seconds for the Thrift connection.
* `download_format` (`'ba[ext=m4a]/bestaudio/best'`): yt-dlp format selection string.
* `output_path_template` (`"{{ var.value.get('DOWNLOADS_TEMP', '/opt/airflow/downloads') }}/%(title)s [%(id)s].%(ext)s"`): yt-dlp output template. Uses Airflow Variable `DOWNLOADS_TEMP`.
* `info_json_dir` (`"{{ var.value.get('DOWNLOADS_TEMP', '/opt/airflow/downloadfiles') }}"`): Directory to save `info.json`. Uses Airflow Variable `DOWNLOADS_TEMP`.
* **Results:**
* Pops one URL from the `{{ params.queue_name }}_inbox` Redis list.
* If a URL is popped, it's added to the `{{ params.queue_name }}_progress` Redis hash.
* The `YtdlpOpsOperator` (`get_token` task) attempts to get token data (including `info.json`, proxy, command) for the URL using the specified connection method and account ID.
* If token retrieval succeeds, the `download_video` task executes `yt-dlp` using the retrieved `info.json`, proxy, the `download_format` parameter, and the `output_path_template` parameter to download the actual media.
* **On Successful Download:** The URL is removed from the progress hash and added to the `{{ params.queue_name }}_result` hash along with results (`info_json_path`, `socks_proxy`, `ytdlp_command`).
* **On Failure (Token Retrieval or Download):** The URL is removed from the progress hash and added to the `{{ params.queue_name }}_fail` hash along with error details (message, traceback).
* If the inbox queue is empty, the DAG run skips processing via `AirflowSkipException`.
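
A condensed sketch of this queue lifecycle (not the DAG's actual code); `get_token` and `download` stand in for the `get_token` and `download_video` tasks, and the keys of `token_data` are assumed to match the result fields listed above.

```python
import json
import time
import traceback

from airflow.exceptions import AirflowSkipException

# Sketch of one processing pass: inbox -> progress -> result/fail, assuming a redis-py client.
def process_one(client, base: str, get_token, download):
    inbox, progress = f"{base}_inbox", f"{base}_progress"
    result, fail = f"{base}_result", f"{base}_fail"
    url = client.lpop(inbox)                      # pop one URL from the inbox list
    if url is None:
        raise AirflowSkipException("Inbox queue is empty.")  # skip this run
    url = url.decode() if isinstance(url, bytes) else url
    client.hset(progress, url, json.dumps({'start_time': time.time()}))
    try:
        token_data = get_token(url)               # info.json path, proxy, yt-dlp command
        download(url, token_data)                 # yt-dlp download using the token data
        client.hdel(progress, url)
        client.hset(result, url, json.dumps({k: token_data[k] for k in
                    ('info_json_path', 'socks_proxy', 'ytdlp_command')}))
    except Exception as exc:
        client.hdel(progress, url)
        client.hset(fail, url, json.dumps({'error': str(exc),
                                           'traceback': traceback.format_exc()}))
        raise
```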

README.md

@ -1,100 +1,38 @@
# YTDLP Airflow DAGs: Architecture and Description
This document describes the architecture and purpose of the DAGs used for downloading videos from YouTube. The system follows a "Sensor/Worker" pattern to provide continuous, parallel processing.
## Main processing loop
### `ytdlp_sensor_redis_queue` (Sensor)
- **Purpose:** Pulls URLs to download from the Redis queue and launches workers to process them.
- **How it works (hybrid triggering):**
  - **On a schedule:** Every minute the DAG automatically checks the Redis queue. This guarantees that new tasks are picked up even if the processing loop has temporarily stopped (because the queue was empty).
  - **On trigger:** When a `ytdlp_worker_per_url` worker finishes successfully, it immediately triggers the sensor without waiting for the next minute, keeping processing continuous with no idle gaps.
- **Logic:** Pops a batch of URLs from the Redis `_inbox` list (see the sketch below). If the queue is empty, the DAG run completes successfully and waits for the next run (triggered or scheduled).
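
The batch-pop step can be pictured with the following minimal sketch. It assumes a redis-py client obtained from the Airflow `RedisHook`; the function name and parameters are illustrative, not the DAG's exact code. Each popped URL is then handed to a `ytdlp_worker_per_url` run (for example via `TriggerDagRunOperator`).

```python
# Illustrative sketch of the sensor's polling step (assumed names, not the DAG's exact code).
def pop_url_batch(client, queue_name: str, max_urls: int):
    """Pop up to max_urls URLs from the <queue_name>_inbox Redis list."""
    inbox = f"{queue_name}_inbox"
    urls = []
    for _ in range(max_urls):
        raw = client.lpop(inbox)
        if raw is None:
            break  # queue drained; the run simply finishes until the next schedule/trigger
        urls.append(raw.decode() if isinstance(raw, bytes) else raw)
    return urls
```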
### `ytdlp_worker_per_url` (Worker)
- **Purpose:** Processes a single URL, downloads the video, and keeps the loop going.
- **How it works:**
  - Receives one URL from the sensor.
  - Calls the `ytdlp-ops-auth` service to obtain the `info.json` and a `socks5` proxy.
  - Downloads the video using the retrieved data, as sketched below. (TODO: replace the `yt-dlp` command-line invocation with a library call.)
  - Depending on the outcome (success/failure), writes the result to the corresponding Redis hash (`_result` or `_fail`).
  - On success, re-triggers the `ytdlp_sensor_redis_queue` sensor to continue the processing loop. On failure, the loop stops so the problem can be diagnosed manually.
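
The download step currently shells out to `yt-dlp`, roughly as in the hedged sketch below; the helper name, paths, and format are illustrative. `--load-info-json` reuses the metadata returned by `ytdlp-ops-auth`, and `--proxy` routes traffic through the returned SOCKS5 proxy. The TODO above would replace this subprocess call with the `yt_dlp.YoutubeDL` Python API.

```python
import subprocess

# Hedged sketch of the worker's download step; not the DAG's exact code.
def download_with_yt_dlp(info_json_path, socks_proxy, download_format, output_template):
    cmd = [
        'yt-dlp',
        '--load-info-json', info_json_path,  # metadata obtained from ytdlp-ops-auth
        '-f', download_format,
        '-o', output_template,
    ]
    if socks_proxy:
        cmd += ['--proxy', socks_proxy]  # e.g. 'socks5://host:port'
    subprocess.run(cmd, check=True)  # non-zero exit raises, so the failure path runs
```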
## Management DAGs
These DAGs are intended for manual queue management and do not take part in the automatic loop.
- **`ytdlp_mgmt_queue_add_and_verify`**: Adds URLs to the task queue (`_inbox`) and then checks the status of that queue.
- **`ytdlp_mgmt_queues_check_status`**: Shows the state and contents of all key queues (`_inbox`, `_progress`, `_result`, `_fail`). Useful for tracking processing progress.
- **`ytdlp_mgmt_queue_clear`**: Clears (completely deletes) the specified Redis queue. **Use with caution**; the operation is irreversible.
## External services
### `ytdlp-ops-auth` (Thrift Service)
- **Purpose:** An external service that provides the authentication data (tokens, cookies, proxy) required for downloading videos.
- **Interaction:** The worker DAG (`ytdlp_worker_per_url`) calls this service before starting a download to obtain the data `yt-dlp` needs; a connection sketch follows.
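
For illustration only, a call to such a Thrift service typically looks like the sketch below. The socket/transport/protocol setup is standard Apache Thrift for Python; the generated client module and the RPC name are placeholders, since the service IDL is not part of this document.

```python
from thrift.transport import TSocket, TTransport
from thrift.protocol import TBinaryProtocol
# from ytdlp_ops import YtdlpOps  # hypothetical module generated from the service IDL

# Hedged sketch of a Thrift client call; YtdlpOps/getToken are hypothetical placeholders.
def fetch_token(service_ip, service_port, account_id, url, timeout=30):
    sock = TSocket.TSocket(service_ip, service_port)
    sock.setTimeout(timeout * 1000)  # Thrift timeouts are given in milliseconds
    transport = TTransport.TBufferedTransport(sock)
    protocol = TBinaryProtocol.TBinaryProtocol(transport)
    # client = YtdlpOps.Client(protocol)  # hypothetical generated client
    transport.open()
    try:
        # return client.getToken(account_id, url)  # hypothetical RPC: tokens, cookies, proxy
        pass
    finally:
        transport.close()
```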


@ -8,6 +8,9 @@ from datetime import timedelta
import logging
import redis  # Import redis exceptions if needed
+ # Import utility functions
+ from utils.redis_utils import _get_redis_client
# Configure logging
logger = logging.getLogger(__name__)
@ -15,23 +18,6 @@ logger = logging.getLogger(__name__)
DEFAULT_QUEUE_NAME = 'video_queue'  # Default base name for the queue
DEFAULT_REDIS_CONN_ID = 'redis_default'
- # --- Helper Functions ---
- def _get_redis_client(redis_conn_id):
-     """Gets a Redis client connection using RedisHook."""
-     try:
-         hook = RedisHook(redis_conn_id=redis_conn_id)
-         client = hook.get_conn()
-         client.ping()
-         logger.info(f"Successfully connected to Redis using connection '{redis_conn_id}'.")
-         return client
-     except redis.exceptions.AuthenticationError:
-         logger.error(f"Redis authentication failed for connection '{redis_conn_id}'. Check password.")
-         raise AirflowException(f"Redis authentication failed for '{redis_conn_id}'.")
-     except Exception as e:
-         logger.error(f"Failed to get Redis client for connection '{redis_conn_id}': {e}")
-         raise AirflowException(f"Redis connection failed for '{redis_conn_id}': {e}")
# --- Python Callables for Tasks ---
def add_urls_callable(**context):


@ -28,22 +28,8 @@ DEFAULT_REDIS_CONN_ID = 'redis_default'
DEFAULT_QUEUE_BASE_NAME = 'video_queue'
DEFAULT_MAX_ITEMS_TO_LIST = 25
- # --- Helper Function ---
- def _get_redis_client(redis_conn_id):
-     """Gets a Redis client connection using RedisHook."""
-     try:
-         hook = RedisHook(redis_conn_id=redis_conn_id)
-         client = hook.get_conn()
-         client.ping()
-         logger.info(f"Successfully connected to Redis using connection '{redis_conn_id}'.")
-         return client
-     except redis.exceptions.AuthenticationError:
-         logger.error(f"Redis authentication failed for connection '{redis_conn_id}'. Check password.")
-         raise AirflowException(f"Redis authentication failed for '{redis_conn_id}'.")
-     except Exception as e:
-         logger.error(f"Failed to get Redis client for connection '{redis_conn_id}': {e}")
-         raise AirflowException(f"Redis connection failed for '{redis_conn_id}': {e}")
+ # Import utility functions
+ from utils.redis_utils import _get_redis_client
# --- Python Callable for Check and List Task ---


@ -1,5 +1,10 @@
# -*- coding: utf-8 -*-
# vim:fenc=utf-8
+ #
+ # Copyright © 2024 rl <rl@rlmbp>
+ #
+ # Distributed under terms of the MIT license.
"""
Airflow DAG for manually clearing (deleting) a specific Redis key used by YTDLP queues.
"""
@ -22,22 +27,8 @@ DEFAULT_REDIS_CONN_ID = 'redis_default'
# Provide a placeholder default, user MUST specify the queue to clear
DEFAULT_QUEUE_TO_CLEAR = 'PLEASE_SPECIFY_QUEUE_TO_CLEAR'
- # --- Helper Function ---
- def _get_redis_client(redis_conn_id):
-     """Gets a Redis client connection using RedisHook."""
-     try:
-         hook = RedisHook(redis_conn_id=redis_conn_id)
-         client = hook.get_conn()
-         client.ping()
-         logger.info(f"Successfully connected to Redis using connection '{redis_conn_id}'.")
-         return client
-     except redis.exceptions.AuthenticationError:
-         logger.error(f"Redis authentication failed for connection '{redis_conn_id}'. Check password.")
-         raise AirflowException(f"Redis authentication failed for '{redis_conn_id}'.")
-     except Exception as e:
-         logger.error(f"Failed to get Redis client for connection '{redis_conn_id}': {e}")
-         raise AirflowException(f"Redis connection failed for '{redis_conn_id}': {e}")
+ # Import utility functions
+ from utils.redis_utils import _get_redis_client
# --- Python Callable for Clear Task ---


@ -29,24 +29,8 @@ DEFAULT_REDIS_CONN_ID = 'redis_default'
DEFAULT_QUEUE_TO_LIST = 'video_queue_inbox'
DEFAULT_MAX_ITEMS = 10  # Limit number of items listed by default
- # --- Helper Function ---
- def _get_redis_client(redis_conn_id):
-     """Gets a Redis client connection using RedisHook."""
-     try:
-         hook = RedisHook(redis_conn_id=redis_conn_id)
-         # decode_responses=True removed as it's not supported by get_conn in some environments
-         # We will decode manually where needed.
-         client = hook.get_conn()
-         client.ping()
-         logger.info(f"Successfully connected to Redis using connection '{redis_conn_id}'.")
-         return client
-     except redis.exceptions.AuthenticationError:
-         logger.error(f"Redis authentication failed for connection '{redis_conn_id}'. Check password.")
-         raise AirflowException(f"Redis authentication failed for '{redis_conn_id}'.")
-     except Exception as e:
-         logger.error(f"Failed to get Redis client for connection '{redis_conn_id}': {e}")
-         raise AirflowException(f"Redis connection failed for '{redis_conn_id}': {e}")
+ # Import utility functions
+ from utils.redis_utils import _get_redis_client
# --- Python Callable for List Contents Task ---


@ -47,20 +47,7 @@ RETRY_DELAY_REDIS_LOOKUP = 10 # Delay (seconds) for Redis lookup retries
# --- Helper Functions ---
- def _get_redis_client(redis_conn_id):
-     """Gets a Redis client connection using RedisHook."""
-     try:
-         hook = RedisHook(redis_conn_id=redis_conn_id)
-         client = hook.get_conn()
-         client.ping()
-         logger.info(f"Successfully connected to Redis using connection '{redis_conn_id}'.")
-         return client
-     except redis.exceptions.AuthenticationError:
-         logger.error(f"Redis authentication failed for connection '{redis_conn_id}'. Check password.")
-         raise AirflowException(f"Redis authentication failed for '{redis_conn_id}'.")
-     except Exception as e:
-         logger.error(f"Failed to get Redis client for connection '{redis_conn_id}': {e}")
-         raise AirflowException(f"Redis connection failed for '{redis_conn_id}': {e}")
+ from utils.redis_utils import _get_redis_client
def _extract_video_id(url):
    """Extracts YouTube video ID from URL."""


@ -21,6 +21,9 @@ from datetime import timedelta
import logging
import redis
+ # Import utility functions
+ from utils.redis_utils import _get_redis_client
# Configure logging
logger = logging.getLogger(__name__)
@ -30,23 +33,6 @@ DEFAULT_REDIS_CONN_ID = 'redis_default'
DEFAULT_TIMEOUT = 30
DEFAULT_MAX_URLS = '1'  # Default number of URLs to process per run
- # --- Helper Functions ---
- def _get_redis_client(redis_conn_id):
-     """Gets a Redis client connection using RedisHook."""
-     try:
-         hook = RedisHook(redis_conn_id=redis_conn_id)
-         client = hook.get_conn()
-         client.ping()
-         logger.info(f"Successfully connected to Redis using connection '{redis_conn_id}'.")
-         return client
-     except redis.exceptions.AuthenticationError:
-         logger.error(f"Redis authentication failed for connection '{redis_conn_id}'. Check password.")
-         raise AirflowException(f"Redis authentication failed for '{redis_conn_id}'.")
-     except Exception as e:
-         logger.error(f"Failed to get Redis client for connection '{redis_conn_id}': {e}")
-         raise AirflowException(f"Redis connection failed for '{redis_conn_id}': {e}")
# --- Task Callables ---
def log_trigger_info_callable(**context):
@ -57,6 +43,8 @@ def log_trigger_info_callable(**context):
    if trigger_type == 'manual':
        logger.info("Trigger source: Manual execution from Airflow UI or CLI.")
+     elif trigger_type == 'scheduled':
+         logger.info("Trigger source: Scheduled run (periodic check).")
    elif trigger_type == 'dag_run':
        # In Airflow 2.2+ we can get the triggering run object
        try:
@ -154,10 +142,10 @@ default_args = {
with DAG(
    dag_id='ytdlp_sensor_redis_queue',
    default_args=default_args,
-     schedule_interval=None,  # This DAG is now only triggered (manually or by a worker)
+     schedule_interval='*/1 * * * *',  # Runs every minute and can also be triggered.
    max_active_runs=1,  # Prevent multiple sensors from running at once
    catchup=False,
-     description='Polls Redis queue for a batch of URLs and triggers parallel worker DAGs.',
+     description='Polls Redis queue every minute (and on trigger) for URLs and starts worker DAGs.',
    tags=['ytdlp', 'sensor', 'queue', 'redis', 'batch'],
    params={
        'queue_name': Param(DEFAULT_QUEUE_NAME, type="string", description="Base name for Redis queues."),


@ -33,6 +33,10 @@ import os
import redis
import socket
import time
+ import traceback
+ # Import utility functions
+ from utils.redis_utils import _get_redis_client
# Configure logging
logger = logging.getLogger(__name__)
@ -106,7 +110,7 @@ def handle_success(**context):
    try:
        # In the worker pattern, there's no "progress" hash to remove from.
        # We just add the result to the success hash.
-         client = YtdlpOpsOperator._get_redis_client(redis_conn_id)
+         client = _get_redis_client(redis_conn_id)
        client.hset(result_queue, url, json.dumps(result_data))
        logger.info(f"Stored success result for URL '{url}' in result hash '{result_queue}'.")
    except Exception as e:
@ -115,8 +119,8 @@ def handle_success(**context):
def handle_failure(**context):
    """
-     Handles failed processing. Moves the URL to the fail hash and, if stop_on_failure
-     is True, fails the task to make the DAG run failure visible.
+     Handles failed processing. Records detailed error information to the fail hash
+     and, if stop_on_failure is True, fails the task to make the DAG run failure visible.
    """
    ti = context['task_instance']
    params = context['params']
@ -132,14 +136,31 @@ def handle_failure(**context):
    requeue_on_failure = params.get('requeue_on_failure', False)
    stop_on_failure = params.get('stop_on_failure', True)
+     # --- Extract Detailed Error Information ---
    exception = context.get('exception')
    error_message = str(exception) if exception else "Unknown error"
+     error_type = type(exception).__name__ if exception else "Unknown"
+     tb_str = "".join(traceback.format_exception(etype=type(exception), value=exception, tb=exception.__traceback__)) if exception else "No traceback available."
+     # Find the specific task that failed
+     dag_run = context['dag_run']
+     failed_task_id = "unknown"
+     # Look at direct upstream tasks of the current task ('handle_failure')
+     upstream_tasks = ti.get_direct_relatives(upstream=True)
+     for task in upstream_tasks:
+         upstream_ti = dag_run.get_task_instance(task_id=task.task_id)
+         if upstream_ti and upstream_ti.state == 'failed':
+             failed_task_id = task.task_id
+             break
    logger.info(f"Handling failure for URL: {url}")
+     logger.error(f" Failed Task: {failed_task_id}")
+     logger.error(f" Failure Type: {error_type}")
    logger.error(f" Failure Reason: {error_message}")
+     logger.debug(f" Traceback:\n{tb_str}")
    try:
-         client = YtdlpOpsOperator._get_redis_client(redis_conn_id)
+         client = _get_redis_client(redis_conn_id)
        if requeue_on_failure:
            client.rpush(inbox_queue, url)
            logger.info(f"Re-queued failed URL '{url}' to inbox '{inbox_queue}' for retry.")
@ -147,12 +168,15 @@ def handle_failure(**context):
        fail_data = {
            'status': 'failed',
            'end_time': time.time(),
-             'error': error_message,
+             'failed_task': failed_task_id,
+             'error_type': error_type,
+             'error_message': error_message,
+             'traceback': tb_str,
            'url': url,
            'dag_run_id': context['dag_run'].run_id,
        }
-         client.hset(fail_queue, url, json.dumps(fail_data))
-         logger.info(f"Stored failure details for URL '{url}' in fail hash '{fail_queue}'.")
+         client.hset(fail_queue, url, json.dumps(fail_data, indent=2))
+         logger.info(f"Stored detailed failure info for URL '{url}' in fail hash '{fail_queue}'.")
    except Exception as e:
        logger.error(f"Critical error during failure handling in Redis for URL '{url}': {e}", exc_info=True)
        # This is a critical error in the failure handling logic itself.
@ -178,22 +202,6 @@ class YtdlpOpsOperator(BaseOperator):
""" """
template_fields = ('service_ip', 'service_port', 'account_id', 'timeout', 'info_json_dir') template_fields = ('service_ip', 'service_port', 'account_id', 'timeout', 'info_json_dir')
@staticmethod
def _get_redis_client(redis_conn_id):
"""Gets a Redis client connection using RedisHook."""
try:
hook = RedisHook(redis_conn_id=redis_conn_id)
client = hook.get_conn()
client.ping()
logger.info(f"Successfully connected to Redis using connection '{redis_conn_id}'.")
return client
except redis.exceptions.AuthenticationError:
logger.error(f"Redis authentication failed for connection '{redis_conn_id}'. Check password.")
raise AirflowException(f"Redis authentication failed for '{redis_conn_id}'.")
except Exception as e:
logger.error(f"Failed to get Redis client for connection '{redis_conn_id}': {e}")
raise AirflowException(f"Redis connection failed for '{redis_conn_id}': {e}")
@apply_defaults @apply_defaults
def __init__(self, def __init__(self,
service_ip=None, service_ip=None,
@ -448,8 +456,8 @@ with DAG(
    trigger_sensor_for_next_batch.doc_md = """
    ### Trigger Sensor for Next Batch
    Triggers a new run of the `ytdlp_sensor_redis_queue` DAG to create a continuous processing loop.
-     This task runs after the main processing tasks are complete (either success or failure),
-     ensuring that the system immediately checks for more URLs to process.
+     This task **only runs on the success path** after a URL has been fully processed.
+     This ensures that the system immediately checks for more URLs to process, but stops the loop on failure.
    """
    # Define success and failure handling tasks
@ -470,10 +478,9 @@ with DAG(
    # The main processing flow
    get_token >> download_video
-     # Branch after download: one path for success, one for failure
-     download_video >> success_task
-     download_video >> failure_task
-     # The trigger to continue the loop ONLY runs on the success path.
-     # A failure will be recorded in Redis by `handle_failure` and then the loop will stop.
-     success_task >> trigger_sensor_for_next_batch
+     # The success path: if download_video succeeds, run success_task, then trigger the next sensor run.
+     download_video >> success_task >> trigger_sensor_for_next_batch
+     # The failure path: if get_token OR download_video fails, run the failure_task.
+     # This is a "fan-in" dependency.
+     [get_token, download_video] >> failure_task