Update Readme and ytdlp dags
parent affc59ee57
commit 1f186fd217

README.en.old.md — new file (100)
@@ -0,0 +1,100 @@

# YTDLP Airflow DAGs

This document describes the Airflow DAGs used for interacting with the YTDLP Ops service and managing processing queues.

## DAG Descriptions

### `ytdlp_client_dag_v2.1`

* **File:** `airflow/dags/ytdlp_client_dag_v2.1.py`
* **Purpose:** Provides a way to test the YTDLP Ops Thrift service interaction for a *single* video URL. Useful for debugging connection issues, testing specific account IDs, or verifying the service response for a particular URL independently of the queueing system.
* **Parameters (Defaults):**
    * `url` (`'https://www.youtube.com/watch?v=sOlTX9uxUtM'`): The video URL to process.
    * `redis_enabled` (`False`): Whether to use Redis for service discovery.
    * `service_ip` (`'85.192.30.55'`): Service IP if `redis_enabled=False`.
    * `service_port` (`9090`): Service port if `redis_enabled=False`.
    * `account_id` (`'account_fr_2025-04-03T1220_anonomyous_2ssdfsf2342afga09'`): Account ID for lookup or call.
    * `timeout` (`30`): Timeout in seconds for the Thrift connection.
    * `info_json_dir` (`"{{ var.value.get('DOWNLOADS_TEMP', '/opt/airflow/downloadfiles') }}"`): Directory to save `info.json`.
* **Results:**
    * Connects to the YTDLP Ops service using the specified method (Redis or direct IP).
    * Retrieves token data for the given URL and account ID.
    * Saves the video's `info.json` metadata to the specified directory.
    * Extracts the SOCKS proxy (if available).
    * Pushes `info_json_path`, `socks_proxy`, and the original `ytdlp_command` (with tokens) to XCom.
    * Optionally stores detailed results in a Redis hash (`token_info:<video_id>`).
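Outside Airflow, the optional result hash can be inspected directly. A minimal sketch, assuming redis-py, a reachable Redis instance, and the `token_info:<video_id>` key layout described above (connection details and the video ID are illustrative):

```python
import json

import redis

# Assumed connection details; match them to your deployment.
client = redis.Redis(host="localhost", port=6379, db=0, decode_responses=True)

video_id = "sOlTX9uxUtM"  # hypothetical: the ID from the default test URL
token_info = client.hgetall(f"token_info:{video_id}")

for field, value in token_info.items():
    # Some fields may hold JSON-encoded strings; pretty-print those, pass the rest through.
    try:
        print(field, json.dumps(json.loads(value), indent=2))
    except (json.JSONDecodeError, TypeError):
        print(field, value)
```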
### `ytdlp_mgmt_queue_add_urls`

* **File:** `airflow/dags/ytdlp_mgmt_queue_add_urls.py`
* **Purpose:** Manually add video URLs to a specific YTDLP inbox queue (Redis list).
* **Parameters (Defaults):**
    * `redis_conn_id` (`'redis_default'`): Airflow Redis connection ID.
    * `queue_name` (`'video_queue_inbox_account_fr_2025-04-03T1220_anonomyous_2ssdfsf2342afga09'`): Target Redis list (inbox queue).
    * `urls` (`""`): Multiline string of video URLs to add.
* **Results:**
    * Parses the multiline `urls` parameter.
    * Adds each valid URL to the end of the Redis list specified by `queue_name`.
    * Logs the number of URLs added.
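Under the hood this amounts to an RPUSH per parsed URL. A minimal sketch of the same behavior outside Airflow, assuming redis-py (queue name and connection details are illustrative):

```python
import redis

client = redis.Redis(host="localhost", port=6379, db=0)  # assumed connection details

queue_name = "video_queue_inbox"  # substitute the full inbox key for your account
urls_param = """
https://www.youtube.com/watch?v=sOlTX9uxUtM
https://www.youtube.com/watch?v=dQw4w9WgXcQ
"""

# Mirror the DAG's behavior: split the multiline parameter, skip blank lines,
# and append each URL to the tail of the inbox list.
urls = [line.strip() for line in urls_param.splitlines() if line.strip()]
if urls:
    client.rpush(queue_name, *urls)
print(f"Added {len(urls)} URLs to '{queue_name}'.")
```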
### `ytdlp_mgmt_queue_clear`

* **File:** `airflow/dags/ytdlp_mgmt_queue_clear.py`
* **Purpose:** Manually delete a specific Redis key used by the YTDLP queues.
* **Parameters (Defaults):**
    * `redis_conn_id` (`'redis_default'`): Airflow Redis connection ID.
    * `queue_to_clear` (`'PLEASE_SPECIFY_QUEUE_TO_CLEAR'`): Exact name of the Redis key to clear. **Must be changed by the user.**
* **Results:**
    * Deletes the Redis key specified by the `queue_to_clear` parameter.
    * **Warning:** This operation is destructive and irreversible. Use with extreme caution, and make sure you specify the correct key name (e.g., `video_queue_inbox_account_xyz`, `video_queue_progress`, `video_queue_result`, `video_queue_fail`).
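Because a DEL cannot be undone, a cautious manual equivalent would confirm what it is about to remove first. A sketch assuming redis-py (key name and connection details are illustrative):

```python
import redis

client = redis.Redis(host="localhost", port=6379, db=0)  # assumed connection details

queue_to_clear = "video_queue_inbox_account_xyz"  # hypothetical key; double-check before running

key_type = client.type(queue_to_clear).decode()
if key_type == "none":
    print(f"Key '{queue_to_clear}' does not exist; nothing to clear.")
else:
    client.delete(queue_to_clear)  # DEL removes the key whatever its type (list, hash, ...)
    print(f"Deleted {key_type} key '{queue_to_clear}'.")
```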
### `ytdlp_mgmt_queue_check_status`

* **File:** `airflow/dags/ytdlp_mgmt_queue_check_status.py`
* **Purpose:** Manually check the type and size of a specific YTDLP Redis queue/key.
* **Parameters (Defaults):**
    * `redis_conn_id` (`'redis_default'`): Airflow Redis connection ID.
    * `queue_to_check` (`'video_queue_inbox_account_fr_2025-04-03T1220_anonomyous_2ssdfsf2342afga09'`): Exact name of the Redis key to check.
* **Results:**
    * Connects to Redis and determines the type of the key specified by `queue_to_check`.
    * Determines the size (length for lists, number of fields for hashes).
    * Logs the key type and size.
    * Pushes `queue_key_type` and `queue_size` to XCom.
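The check itself maps directly onto Redis commands: TYPE decides whether LLEN (lists) or HLEN (hashes) applies. A sketch assuming redis-py (key name and connection details are illustrative):

```python
import redis

client = redis.Redis(host="localhost", port=6379, db=0, decode_responses=True)  # assumed

queue_to_check = "video_queue_inbox"  # substitute the key you want to inspect

key_type = client.type(queue_to_check)
if key_type == "list":
    size = client.llen(queue_to_check)
elif key_type == "hash":
    size = client.hlen(queue_to_check)
else:
    size = None  # missing key ("none") or an unexpected type

print(f"type={key_type} size={size}")
```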
### `ytdlp_mgmt_queue_list_contents`

* **File:** `airflow/dags/ytdlp_mgmt_queue_list_contents.py`
* **Purpose:** Manually list the contents of a specific YTDLP Redis queue/key (list or hash). Useful for inspecting queue state or results.
* **Parameters (Defaults):**
    * `redis_conn_id` (`'redis_default'`): Airflow Redis connection ID.
    * `queue_to_list` (`'video_queue_inbox_account_fr_2025-04-03T1220_anonomyous_2ssdfsf2342afga09'`): Exact name of the Redis key to list.
    * `max_items` (`100`): Maximum number of items/fields to list.
* **Results:**
    * Connects to Redis and identifies the type of the key specified by `queue_to_list`.
    * If the key is a list, logs the first `max_items` elements.
    * If the key is a hash, logs up to `max_items` key-value pairs, attempting to pretty-print JSON values.
    * Logs warnings for very large hashes.
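A sketch of the equivalent inspection with redis-py, using LRANGE for lists and HSCAN for hashes so a very large hash is streamed rather than loaded whole (key name and connection details are illustrative):

```python
import json

import redis

client = redis.Redis(host="localhost", port=6379, db=0, decode_responses=True)  # assumed

queue_to_list = "video_queue_result"  # substitute the key to inspect
max_items = 100

key_type = client.type(queue_to_list)
if key_type == "list":
    for item in client.lrange(queue_to_list, 0, max_items - 1):
        print(item)
elif key_type == "hash":
    for i, (field, value) in enumerate(client.hscan_iter(queue_to_list)):
        if i >= max_items:
            break
        try:
            value = json.dumps(json.loads(value), indent=2)  # pretty-print JSON values
        except (json.JSONDecodeError, TypeError):
            pass
        print(field, value)
```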
### `ytdlp_proc_sequential_processor`

* **File:** `airflow/dags/ytdlp_proc_sequential_processor.py`
* **Purpose:** Processes YouTube URLs sequentially from a Redis queue. Designed for batch processing. Pops a URL, gets token/metadata via the YTDLP Ops service, downloads the media using `yt-dlp`, and records the result.
* **Parameters (Defaults):**
    * `queue_name` (`'video_queue'`): Base name for Redis queues (e.g., `video_queue_inbox`, `video_queue_progress`).
    * `redis_conn_id` (`'redis_default'`): Airflow Redis connection ID.
    * `redis_enabled` (`False`): Whether to use Redis for service discovery; if `False`, uses `service_ip`/`service_port`.
    * `service_ip` (`None`): Service IP; required if `redis_enabled=False`.
    * `service_port` (`None`): Service port; required if `redis_enabled=False`.
    * `account_id` (`'default_account'`): Account ID for the API call (used for Redis lookup if `redis_enabled=True`).
    * `timeout` (`30`): Timeout in seconds for the Thrift connection.
    * `download_format` (`'ba[ext=m4a]/bestaudio/best'`): yt-dlp format selection string.
    * `output_path_template` (`"{{ var.value.get('DOWNLOADS_TEMP', '/opt/airflow/downloads') }}/%(title)s [%(id)s].%(ext)s"`): yt-dlp output template. Uses the Airflow Variable `DOWNLOADS_TEMP`.
    * `info_json_dir` (`"{{ var.value.get('DOWNLOADS_TEMP', '/opt/airflow/downloadfiles') }}"`): Directory to save `info.json`. Uses the Airflow Variable `DOWNLOADS_TEMP`.
* **Results:**
    * Pops one URL from the `{{ params.queue_name }}_inbox` Redis list.
    * If a URL is popped, it is added to the `{{ params.queue_name }}_progress` Redis hash.
    * The `YtdlpOpsOperator` (`get_token` task) attempts to get token data (including `info.json`, proxy, command) for the URL using the specified connection method and account ID.
    * If token retrieval succeeds, the `download_video` task executes `yt-dlp` with the retrieved `info.json`, the proxy, the `download_format` parameter, and the `output_path_template` parameter to download the actual media.
    * **On successful download:** The URL is removed from the progress hash and added to the `{{ params.queue_name }}_result` hash along with results (`info_json_path`, `socks_proxy`, `ytdlp_command`).
    * **On failure (token retrieval or download):** The URL is removed from the progress hash and added to the `{{ params.queue_name }}_fail` hash along with error details (message, traceback).
    * If the inbox queue is empty, the DAG run skips processing via `AirflowSkipException`.
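The queue bookkeeping follows a common Redis pattern: pop from the inbox, park the URL in a progress hash, then move it to the result or fail hash. A minimal sketch of that flow with redis-py (processing itself is stubbed out; names follow the defaults above):

```python
import json
import time

import redis

client = redis.Redis(host="localhost", port=6379, db=0, decode_responses=True)  # assumed
base = "video_queue"

url = client.lpop(f"{base}_inbox")
if url is None:
    print("Inbox empty; nothing to do.")  # the DAG raises AirflowSkipException here
else:
    client.hset(f"{base}_progress", url, json.dumps({"start_time": time.time()}))
    try:
        # Stand-in for token retrieval and the yt-dlp download.
        result = {"info_json_path": "/tmp/info.json", "socks_proxy": None}
        client.hdel(f"{base}_progress", url)
        client.hset(f"{base}_result", url, json.dumps(result))
    except Exception as exc:
        client.hdel(f"{base}_progress", url)
        client.hset(f"{base}_fail", url, json.dumps({"error": str(exc)}))
```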
README.md (114)
@@ -1,100 +1,38 @@
(The previous English content — preserved above as `README.en.old.md` — is replaced by the following text, translated here from Russian:)

# Architecture and Description of the YTDLP Airflow DAGs

This document describes the architecture and purpose of the DAGs used to download videos from YouTube. The system follows a sensor/worker pattern to provide continuous, parallel processing.

## Main processing loop

### `ytdlp_sensor_redis_queue` (sensor)

- **Purpose:** Pulls URLs to download from the Redis queue and starts workers to process them.
- **How it works (hybrid start):**
  - **On a schedule:** Every minute the DAG automatically checks the Redis queue. This guarantees that new tasks are picked up even if the processing loop was temporarily stopped (because the queue ran empty).
  - **On trigger:** When the `ytdlp_worker_per_url` worker finishes successfully, it triggers the sensor immediately, without waiting for the next minute. This keeps processing continuous, with no idle gaps.
  - **Logic:** Fetches a batch of URLs from Redis (the `_inbox` list). If the queue is empty, the DAG completes successfully until the next run (by trigger or by schedule).

### `ytdlp_worker_per_url` (worker)

- **Purpose:** Processes a single URL, downloads the video, and keeps the loop going.
- **How it works:**
  - Receives one URL from the sensor.
  - Calls the `ytdlp-ops-auth` service to obtain `info.json` and a `socks5` proxy.
  - Downloads the video using the retrieved data. (TODO: replace invoking `yt-dlp` as a command with a library call.)
  - Depending on the outcome (success/failure), writes the result to the corresponding Redis hash (`_result` or `_fail`).
  - On success, re-triggers the `ytdlp_sensor_redis_queue` sensor to continue the processing loop. On failure, the loop stops for manual diagnosis.

## Management DAGs

These DAGs are intended for manual queue management and take no part in the automatic loop.

- **`ytdlp_mgmt_queue_add_and_verify`**: Adds URLs to the task queue (`_inbox`) and then checks the status of that queue.
- **`ytdlp_mgmt_queues_check_status`**: Shows the state and contents of all key queues (`_inbox`, `_progress`, `_result`, `_fail`). Helps track the processing pipeline.
- **`ytdlp_mgmt_queue_clear`**: Clears (fully deletes) the given Redis queue. **Use with caution**, since the operation is irreversible.

## External services

### `ytdlp-ops-auth` (Thrift service)

- **Purpose:** An external service that provides the authentication data (tokens, cookies, proxy) needed to download videos.
- **Interaction:** The worker DAG (`ytdlp_worker_per_url`) calls this service before starting a download to obtain the data `yt-dlp` needs.
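The worker-to-sensor hand-off described above is the piece that closes the loop. A minimal sketch of that wiring, assuming Airflow 2.x (the DAG and task IDs follow the text; everything else is illustrative):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.trigger_dagrun import TriggerDagRunOperator

# Worker-side tail: after a successful run, immediately re-trigger the sensor
# so the next URL is picked up without waiting for the minute schedule.
with DAG(
    dag_id="ytdlp_worker_per_url",
    start_date=datetime(2024, 1, 1),
    schedule_interval=None,  # workers are started by the sensor, not by a schedule
    catchup=False,
) as dag:
    trigger_sensor = TriggerDagRunOperator(
        task_id="trigger_sensor_for_next_batch",
        trigger_dag_id="ytdlp_sensor_redis_queue",
        conf={"queue_name": "video_queue"},  # illustrative: keep polling the same queue
    )
```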
```diff
@@ -8,6 +8,9 @@ from datetime import timedelta
 import logging
 import redis  # Import redis exceptions if needed
+
+# Import utility functions
+from utils.redis_utils import _get_redis_client
 
 # Configure logging
 logger = logging.getLogger(__name__)
@@ -15,23 +18,6 @@ logger = logging.getLogger(__name__)
 DEFAULT_QUEUE_NAME = 'video_queue'  # Default base name for the queue
 DEFAULT_REDIS_CONN_ID = 'redis_default'
 
-# --- Helper Functions ---
-
-def _get_redis_client(redis_conn_id):
-    """Gets a Redis client connection using RedisHook."""
-    try:
-        hook = RedisHook(redis_conn_id=redis_conn_id)
-        client = hook.get_conn()
-        client.ping()
-        logger.info(f"Successfully connected to Redis using connection '{redis_conn_id}'.")
-        return client
-    except redis.exceptions.AuthenticationError:
-        logger.error(f"Redis authentication failed for connection '{redis_conn_id}'. Check password.")
-        raise AirflowException(f"Redis authentication failed for '{redis_conn_id}'.")
-    except Exception as e:
-        logger.error(f"Failed to get Redis client for connection '{redis_conn_id}': {e}")
-        raise AirflowException(f"Redis connection failed for '{redis_conn_id}': {e}")
-
 # --- Python Callables for Tasks ---
 
 def add_urls_callable(**context):
```
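This commit's recurring change is extracting the copy-pasted helper into a shared module. Based on the removed code, `utils/redis_utils.py` presumably looks like the following sketch (imports reconstructed; the helper body is taken verbatim from the lines removed above):

```python
# utils/redis_utils.py — sketch reconstructed from the removed helper
import logging

import redis
from airflow.exceptions import AirflowException
from airflow.providers.redis.hooks.redis import RedisHook

logger = logging.getLogger(__name__)


def _get_redis_client(redis_conn_id):
    """Gets a Redis client connection using RedisHook."""
    try:
        hook = RedisHook(redis_conn_id=redis_conn_id)
        client = hook.get_conn()
        client.ping()  # fail fast if the connection is unusable
        logger.info(f"Successfully connected to Redis using connection '{redis_conn_id}'.")
        return client
    except redis.exceptions.AuthenticationError:
        logger.error(f"Redis authentication failed for connection '{redis_conn_id}'. Check password.")
        raise AirflowException(f"Redis authentication failed for '{redis_conn_id}'.")
    except Exception as e:
        logger.error(f"Failed to get Redis client for connection '{redis_conn_id}': {e}")
        raise AirflowException(f"Redis connection failed for '{redis_conn_id}': {e}")
```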
```diff
@@ -28,22 +28,8 @@ DEFAULT_REDIS_CONN_ID = 'redis_default'
 DEFAULT_QUEUE_BASE_NAME = 'video_queue'
 DEFAULT_MAX_ITEMS_TO_LIST = 25
 
-# --- Helper Function ---
-
-def _get_redis_client(redis_conn_id):
-    """Gets a Redis client connection using RedisHook."""
-    try:
-        hook = RedisHook(redis_conn_id=redis_conn_id)
-        client = hook.get_conn()
-        client.ping()
-        logger.info(f"Successfully connected to Redis using connection '{redis_conn_id}'.")
-        return client
-    except redis.exceptions.AuthenticationError:
-        logger.error(f"Redis authentication failed for connection '{redis_conn_id}'. Check password.")
-        raise AirflowException(f"Redis authentication failed for '{redis_conn_id}'.")
-    except Exception as e:
-        logger.error(f"Failed to get Redis client for connection '{redis_conn_id}': {e}")
-        raise AirflowException(f"Redis connection failed for '{redis_conn_id}': {e}")
+# Import utility functions
+from utils.redis_utils import _get_redis_client
 
 # --- Python Callable for Check and List Task ---
```
```diff
@@ -1,5 +1,10 @@
 # -*- coding: utf-8 -*-
 # vim:fenc=utf-8
+#
+# Copyright © 2024 rl <rl@rlmbp>
+#
+# Distributed under terms of the MIT license.
+
 """
 Airflow DAG for manually clearing (deleting) a specific Redis key used by YTDLP queues.
 """
@@ -22,22 +27,8 @@ DEFAULT_REDIS_CONN_ID = 'redis_default'
 # Provide a placeholder default, user MUST specify the queue to clear
 DEFAULT_QUEUE_TO_CLEAR = 'PLEASE_SPECIFY_QUEUE_TO_CLEAR'
 
-# --- Helper Function ---
-
-def _get_redis_client(redis_conn_id):
-    """Gets a Redis client connection using RedisHook."""
-    try:
-        hook = RedisHook(redis_conn_id=redis_conn_id)
-        client = hook.get_conn()
-        client.ping()
-        logger.info(f"Successfully connected to Redis using connection '{redis_conn_id}'.")
-        return client
-    except redis.exceptions.AuthenticationError:
-        logger.error(f"Redis authentication failed for connection '{redis_conn_id}'. Check password.")
-        raise AirflowException(f"Redis authentication failed for '{redis_conn_id}'.")
-    except Exception as e:
-        logger.error(f"Failed to get Redis client for connection '{redis_conn_id}': {e}")
-        raise AirflowException(f"Redis connection failed for '{redis_conn_id}': {e}")
+# Import utility functions
+from utils.redis_utils import _get_redis_client
 
 # --- Python Callable for Clear Task ---
```
```diff
@@ -29,24 +29,8 @@ DEFAULT_REDIS_CONN_ID = 'redis_default'
 DEFAULT_QUEUE_TO_LIST = 'video_queue_inbox'
 DEFAULT_MAX_ITEMS = 10  # Limit number of items listed by default
 
-# --- Helper Function ---
-
-def _get_redis_client(redis_conn_id):
-    """Gets a Redis client connection using RedisHook."""
-    try:
-        hook = RedisHook(redis_conn_id=redis_conn_id)
-        # decode_responses=True removed as it's not supported by get_conn in some environments
-        # We will decode manually where needed.
-        client = hook.get_conn()
-        client.ping()
-        logger.info(f"Successfully connected to Redis using connection '{redis_conn_id}'.")
-        return client
-    except redis.exceptions.AuthenticationError:
-        logger.error(f"Redis authentication failed for connection '{redis_conn_id}'. Check password.")
-        raise AirflowException(f"Redis authentication failed for '{redis_conn_id}'.")
-    except Exception as e:
-        logger.error(f"Failed to get Redis client for connection '{redis_conn_id}': {e}")
-        raise AirflowException(f"Redis connection failed for '{redis_conn_id}': {e}")
+# Import utility functions
+from utils.redis_utils import _get_redis_client
 
 # --- Python Callable for List Contents Task ---
```
```diff
@@ -47,20 +47,7 @@ RETRY_DELAY_REDIS_LOOKUP = 10  # Delay (seconds) for Redis lookup retries
 
 # --- Helper Functions ---
 
-def _get_redis_client(redis_conn_id):
-    """Gets a Redis client connection using RedisHook."""
-    try:
-        hook = RedisHook(redis_conn_id=redis_conn_id)
-        client = hook.get_conn()
-        client.ping()
-        logger.info(f"Successfully connected to Redis using connection '{redis_conn_id}'.")
-        return client
-    except redis.exceptions.AuthenticationError:
-        logger.error(f"Redis authentication failed for connection '{redis_conn_id}'. Check password.")
-        raise AirflowException(f"Redis authentication failed for '{redis_conn_id}'.")
-    except Exception as e:
-        logger.error(f"Failed to get Redis client for connection '{redis_conn_id}': {e}")
-        raise AirflowException(f"Redis connection failed for '{redis_conn_id}': {e}")
+from utils.redis_utils import _get_redis_client
 
 def _extract_video_id(url):
     """Extracts YouTube video ID from URL."""
```
|
|||||||
import logging
|
import logging
|
||||||
import redis
|
import redis
|
||||||
|
|
||||||
|
# Import utility functions
|
||||||
|
from utils.redis_utils import _get_redis_client
|
||||||
|
|
||||||
# Configure logging
|
# Configure logging
|
||||||
logger = logging.getLogger(__name__)
|
logger = logging.getLogger(__name__)
|
||||||
|
|
||||||
@ -30,23 +33,6 @@ DEFAULT_REDIS_CONN_ID = 'redis_default'
|
|||||||
DEFAULT_TIMEOUT = 30
|
DEFAULT_TIMEOUT = 30
|
||||||
DEFAULT_MAX_URLS = '1' # Default number of URLs to process per run
|
DEFAULT_MAX_URLS = '1' # Default number of URLs to process per run
|
||||||
|
|
||||||
# --- Helper Functions ---
|
|
||||||
|
|
||||||
def _get_redis_client(redis_conn_id):
|
|
||||||
"""Gets a Redis client connection using RedisHook."""
|
|
||||||
try:
|
|
||||||
hook = RedisHook(redis_conn_id=redis_conn_id)
|
|
||||||
client = hook.get_conn()
|
|
||||||
client.ping()
|
|
||||||
logger.info(f"Successfully connected to Redis using connection '{redis_conn_id}'.")
|
|
||||||
return client
|
|
||||||
except redis.exceptions.AuthenticationError:
|
|
||||||
logger.error(f"Redis authentication failed for connection '{redis_conn_id}'. Check password.")
|
|
||||||
raise AirflowException(f"Redis authentication failed for '{redis_conn_id}'.")
|
|
||||||
except Exception as e:
|
|
||||||
logger.error(f"Failed to get Redis client for connection '{redis_conn_id}': {e}")
|
|
||||||
raise AirflowException(f"Redis connection failed for '{redis_conn_id}': {e}")
|
|
||||||
|
|
||||||
# --- Task Callables ---
|
# --- Task Callables ---
|
||||||
|
|
||||||
def log_trigger_info_callable(**context):
|
def log_trigger_info_callable(**context):
|
||||||
@ -57,6 +43,8 @@ def log_trigger_info_callable(**context):
|
|||||||
|
|
||||||
if trigger_type == 'manual':
|
if trigger_type == 'manual':
|
||||||
logger.info("Trigger source: Manual execution from Airflow UI or CLI.")
|
logger.info("Trigger source: Manual execution from Airflow UI or CLI.")
|
||||||
|
elif trigger_type == 'scheduled':
|
||||||
|
logger.info("Trigger source: Scheduled run (periodic check).")
|
||||||
elif trigger_type == 'dag_run':
|
elif trigger_type == 'dag_run':
|
||||||
# In Airflow 2.2+ we can get the triggering run object
|
# In Airflow 2.2+ we can get the triggering run object
|
||||||
try:
|
try:
|
||||||
@ -154,10 +142,10 @@ default_args = {
|
|||||||
with DAG(
|
with DAG(
|
||||||
dag_id='ytdlp_sensor_redis_queue',
|
dag_id='ytdlp_sensor_redis_queue',
|
||||||
default_args=default_args,
|
default_args=default_args,
|
||||||
schedule_interval=None, # This DAG is now only triggered (manually or by a worker)
|
schedule_interval='*/1 * * * *', # Runs every minute and can also be triggered.
|
||||||
max_active_runs=1, # Prevent multiple sensors from running at once
|
max_active_runs=1, # Prevent multiple sensors from running at once
|
||||||
catchup=False,
|
catchup=False,
|
||||||
description='Polls Redis queue for a batch of URLs and triggers parallel worker DAGs.',
|
description='Polls Redis queue every minute (and on trigger) for URLs and starts worker DAGs.',
|
||||||
tags=['ytdlp', 'sensor', 'queue', 'redis', 'batch'],
|
tags=['ytdlp', 'sensor', 'queue', 'redis', 'batch'],
|
||||||
params={
|
params={
|
||||||
'queue_name': Param(DEFAULT_QUEUE_NAME, type="string", description="Base name for Redis queues."),
|
'queue_name': Param(DEFAULT_QUEUE_NAME, type="string", description="Base name for Redis queues."),
|
||||||
|
|||||||
```diff
@@ -33,6 +33,10 @@ import os
 import redis
 import socket
 import time
+import traceback
+
+# Import utility functions
+from utils.redis_utils import _get_redis_client
 
 # Configure logging
 logger = logging.getLogger(__name__)
@@ -106,7 +110,7 @@ def handle_success(**context):
     try:
         # In the worker pattern, there's no "progress" hash to remove from.
         # We just add the result to the success hash.
-        client = YtdlpOpsOperator._get_redis_client(redis_conn_id)
+        client = _get_redis_client(redis_conn_id)
         client.hset(result_queue, url, json.dumps(result_data))
         logger.info(f"Stored success result for URL '{url}' in result hash '{result_queue}'.")
     except Exception as e:
@@ -115,8 +119,8 @@ def handle_success(**context):
 
 def handle_failure(**context):
     """
-    Handles failed processing. Moves the URL to the fail hash and, if stop_on_failure
-    is True, fails the task to make the DAG run failure visible.
+    Handles failed processing. Records detailed error information to the fail hash
+    and, if stop_on_failure is True, fails the task to make the DAG run failure visible.
     """
     ti = context['task_instance']
     params = context['params']
@@ -132,14 +136,31 @@ def handle_failure(**context):
     requeue_on_failure = params.get('requeue_on_failure', False)
     stop_on_failure = params.get('stop_on_failure', True)
 
+    # --- Extract Detailed Error Information ---
     exception = context.get('exception')
     error_message = str(exception) if exception else "Unknown error"
+    error_type = type(exception).__name__ if exception else "Unknown"
+    tb_str = "".join(traceback.format_exception(etype=type(exception), value=exception, tb=exception.__traceback__)) if exception else "No traceback available."
+
+    # Find the specific task that failed
+    dag_run = context['dag_run']
+    failed_task_id = "unknown"
+    # Look at direct upstream tasks of the current task ('handle_failure')
+    upstream_tasks = ti.get_direct_relatives(upstream=True)
+    for task in upstream_tasks:
+        upstream_ti = dag_run.get_task_instance(task_id=task.task_id)
+        if upstream_ti and upstream_ti.state == 'failed':
+            failed_task_id = task.task_id
+            break
+
     logger.info(f"Handling failure for URL: {url}")
+    logger.error(f"  Failed Task: {failed_task_id}")
+    logger.error(f"  Failure Type: {error_type}")
     logger.error(f"  Failure Reason: {error_message}")
+    logger.debug(f"  Traceback:\n{tb_str}")
 
     try:
-        client = YtdlpOpsOperator._get_redis_client(redis_conn_id)
+        client = _get_redis_client(redis_conn_id)
         if requeue_on_failure:
             client.rpush(inbox_queue, url)
             logger.info(f"Re-queued failed URL '{url}' to inbox '{inbox_queue}' for retry.")
@@ -147,12 +168,15 @@ def handle_failure(**context):
         fail_data = {
             'status': 'failed',
             'end_time': time.time(),
-            'error': error_message,
+            'failed_task': failed_task_id,
+            'error_type': error_type,
+            'error_message': error_message,
+            'traceback': tb_str,
             'url': url,
             'dag_run_id': context['dag_run'].run_id,
         }
-        client.hset(fail_queue, url, json.dumps(fail_data))
-        logger.info(f"Stored failure details for URL '{url}' in fail hash '{fail_queue}'.")
+        client.hset(fail_queue, url, json.dumps(fail_data, indent=2))
+        logger.info(f"Stored detailed failure info for URL '{url}' in fail hash '{fail_queue}'.")
     except Exception as e:
         logger.error(f"Critical error during failure handling in Redis for URL '{url}': {e}", exc_info=True)
         # This is a critical error in the failure handling logic itself.
```
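Given the richer `fail_data` layout introduced here, a quick way to review recorded failures from a shell or notebook — a sketch assuming redis-py and the `_fail` hash naming used elsewhere in this commit:

```python
import json

import redis

client = redis.Redis(host="localhost", port=6379, db=0, decode_responses=True)  # assumed

fail_queue = "video_queue_fail"  # hypothetical: <queue_name>_fail
for url, raw in client.hgetall(fail_queue).items():
    entry = json.loads(raw)
    print(url, "->", entry["failed_task"], entry["error_type"], entry["error_message"])
```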
```diff
@@ -178,22 +202,6 @@ class YtdlpOpsOperator(BaseOperator):
     """
     template_fields = ('service_ip', 'service_port', 'account_id', 'timeout', 'info_json_dir')
 
-    @staticmethod
-    def _get_redis_client(redis_conn_id):
-        """Gets a Redis client connection using RedisHook."""
-        try:
-            hook = RedisHook(redis_conn_id=redis_conn_id)
-            client = hook.get_conn()
-            client.ping()
-            logger.info(f"Successfully connected to Redis using connection '{redis_conn_id}'.")
-            return client
-        except redis.exceptions.AuthenticationError:
-            logger.error(f"Redis authentication failed for connection '{redis_conn_id}'. Check password.")
-            raise AirflowException(f"Redis authentication failed for '{redis_conn_id}'.")
-        except Exception as e:
-            logger.error(f"Failed to get Redis client for connection '{redis_conn_id}': {e}")
-            raise AirflowException(f"Redis connection failed for '{redis_conn_id}': {e}")
-
     @apply_defaults
     def __init__(self,
                  service_ip=None,
@@ -448,8 +456,8 @@ with DAG(
     trigger_sensor_for_next_batch.doc_md = """
     ### Trigger Sensor for Next Batch
     Triggers a new run of the `ytdlp_sensor_redis_queue` DAG to create a continuous processing loop.
-    This task runs after the main processing tasks are complete (either success or failure),
-    ensuring that the system immediately checks for more URLs to process.
+    This task **only runs on the success path** after a URL has been fully processed.
+    This ensures that the system immediately checks for more URLs to process, but stops the loop on failure.
     """
 
     # Define success and failure handling tasks
@@ -470,10 +478,9 @@ with DAG(
     # The main processing flow
     get_token >> download_video
 
-    # Branch after download: one path for success, one for failure
-    download_video >> success_task
-    download_video >> failure_task
+    # The success path: if download_video succeeds, run success_task, then trigger the next sensor run.
+    download_video >> success_task >> trigger_sensor_for_next_batch
 
-    # The trigger to continue the loop ONLY runs on the success path.
-    # A failure will be recorded in Redis by `handle_failure` and then the loop will stop.
-    success_task >> trigger_sensor_for_next_batch
+    # The failure path: if get_token OR download_video fails, run the failure_task.
+    # This is a "fan-in" dependency.
+    [get_token, download_video] >> failure_task
```
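For the `[get_token, download_video] >> failure_task` fan-in to fire only when something actually breaks, `failure_task` presumably carries a non-default trigger rule. A sketch of how such a task is typically declared (the `one_failed` rule is an assumption; this hunk does not show the task definition):

```python
from airflow.operators.python import PythonOperator
from airflow.utils.trigger_rule import TriggerRule

failure_task = PythonOperator(
    task_id="handle_failure",
    python_callable=handle_failure,  # the callable shown earlier in this commit
    trigger_rule=TriggerRule.ONE_FAILED,  # assumed: run as soon as any upstream task fails
)
```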