# YTDLP Airflow DAGs

This document describes the Airflow DAGs used for interacting with the YTDLP Ops service and managing processing queues.

## DAG Descriptions

### `ytdlp_client_dag_v2.1`

* **File:** `airflow/dags/ytdlp_client_dag_v2.1.py`
* **Purpose:** Tests interaction with the YTDLP Ops Thrift service for a *single* video URL. Useful for debugging connection issues, testing specific account IDs, or verifying the service response for a particular URL independently of the queueing system.
* **Parameters (Defaults):**
    * `url` (`'https://www.youtube.com/watch?v=sOlTX9uxUtM'`): The video URL to process.
    * `redis_enabled` (`False`): Whether to use Redis for service discovery.
    * `service_ip` (`'85.192.30.55'`): Service IP, used if `redis_enabled=False`.
    * `service_port` (`9090`): Service port, used if `redis_enabled=False`.
    * `account_id` (`'account_fr_2025-04-03T1220_anonomyous_2ssdfsf2342afga09'`): Account ID for lookup or call.
    * `timeout` (`30`): Timeout in seconds for the Thrift connection.
    * `info_json_dir` (`"{{ var.value.get('DOWNLOADS_TEMP', '/opt/airflow/downloadfiles') }}"`): Directory to save `info.json`.
* **Results:**
    * Connects to the YTDLP Ops service using the specified method (Redis or direct IP).
    * Retrieves token data for the given URL and account ID.
    * Saves the video's `info.json` metadata to the specified directory.
    * Extracts the SOCKS proxy (if available).
    * Pushes `info_json_path`, `socks_proxy`, and the original `ytdlp_command` (with tokens) to XCom.
    * Optionally stores detailed results in a Redis hash (`token_info:<video_id>`).

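
For a quick manual test, the parameters listed for `ytdlp_client_dag_v2.1` can be passed as run configuration. The following is a minimal sketch that builds an `airflow dags trigger` command in Python. It assumes that the DAG ID matches the section heading and that the deployment lets run configuration override DAG params; the IP address is a documentation placeholder and the account ID is hypothetical.

```python
import json
import shlex

# Assumption: the DAG ID is the same as the section heading.
DAG_ID = "ytdlp_client_dag_v2.1"


def build_trigger_command(url, service_ip, service_port, account_id, timeout=30):
    """Return an `airflow dags trigger` command line for a direct-IP test run."""
    conf = {
        "url": url,
        "redis_enabled": False,  # connect via service_ip/service_port
        "service_ip": service_ip,
        "service_port": service_port,
        "account_id": account_id,
        "timeout": timeout,
    }
    # shlex.join quotes the JSON payload so it survives the shell intact.
    return shlex.join(
        ["airflow", "dags", "trigger", DAG_ID, "--conf", json.dumps(conf)]
    )


command = build_trigger_command(
    url="https://www.youtube.com/watch?v=sOlTX9uxUtM",
    service_ip="192.0.2.10",  # placeholder address, not a real service
    service_port=9090,
    account_id="example_account",  # hypothetical account ID
)
print(command)
```

Parameters left out of the payload, such as `info_json_dir` here, would presumably fall back to the DAG defaults.
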
### `ytdlp_mgmt_queue_add_urls`

* **File:** `airflow/dags/ytdlp_mgmt_queue_add_urls.py`
* **Purpose:** Manually add video URLs to a specific YTDLP inbox queue (Redis List).
* **Parameters (Defaults):**
    * `redis_conn_id` (`'redis_default'`): Airflow Redis connection ID.
    * `queue_name` (`'video_queue_inbox_account_fr_2025-04-03T1220_anonomyous_2ssdfsf2342afga09'`): Target Redis list (inbox queue).
    * `urls` (`""`): Multiline string of video URLs to add.
* **Results:**
    * Parses the multiline `urls` parameter.
    * Adds each valid URL to the end of the Redis list specified by `queue_name`.
    * Logs the number of URLs added.

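
The parse-and-append behaviour of `ytdlp_mgmt_queue_add_urls` can be sketched as follows. This is an illustration rather than the DAG's actual code: the validity rule (an `http`/`https` scheme plus a host) is an assumption, the function names are hypothetical, and `FakeRedis` is an in-memory stand-in so the example runs without a Redis server. A real redis-py client exposes the same `rpush` call.

```python
from urllib.parse import urlparse


def parse_urls(raw):
    """Split a multiline string into URLs, dropping blanks and invalid lines."""
    urls = []
    for line in raw.splitlines():
        candidate = line.strip()
        if not candidate:
            continue
        parsed = urlparse(candidate)
        # Assumed validity rule: an http(s) scheme and a host part.
        if parsed.scheme in ("http", "https") and parsed.netloc:
            urls.append(candidate)
    return urls


def add_urls(client, queue_name, raw):
    """Append parsed URLs to the tail of the Redis list; return the count added."""
    urls = parse_urls(raw)
    if urls:
        client.rpush(queue_name, *urls)  # RPUSH appends to the end of the list
    return len(urls)


class FakeRedis:
    """In-memory stand-in for a redis-py client, used only for this demo."""

    def __init__(self):
        self.lists = {}

    def rpush(self, name, *values):
        self.lists.setdefault(name, []).extend(values)
        return len(self.lists[name])


client = FakeRedis()
raw_urls = """
https://www.youtube.com/watch?v=sOlTX9uxUtM

not a url
"""
added = add_urls(client, "video_queue_inbox_demo", raw_urls)
print(f"Added {added} URL(s)")
```
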
### `ytdlp_mgmt_queue_clear`

* **File:** `airflow/dags/ytdlp_mgmt_queue_clear.py`
* **Purpose:** Manually delete a specific Redis key used by the YTDLP queues.
* **Parameters (Defaults):**
    * `redis_conn_id` (`'redis_default'`): Airflow Redis connection ID.
    * `queue_to_clear` (`'PLEASE_SPECIFY_QUEUE_TO_CLEAR'`): Exact name of the Redis key to clear. **Must be changed by user.**
* **Results:**
    * Deletes the Redis key specified by the `queue_to_clear` parameter.
* **Warning:** This operation is destructive and irreversible. Use with extreme caution. Ensure you specify the correct key name (e.g., `video_queue_inbox_account_xyz`, `video_queue_progress`, `video_queue_result`, `video_queue_fail`).

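
Because the delete is irreversible, a guard against the untouched placeholder default is a natural safety check. The sketch below shows one way to write it; whether the DAG performs this exact check is an assumption, `clear_queue` is a hypothetical name, and `FakeRedis` is an in-memory stand-in for a redis-py client (whose `delete` call likewise returns the number of keys removed).

```python
PLACEHOLDER = "PLEASE_SPECIFY_QUEUE_TO_CLEAR"


def clear_queue(client, queue_to_clear):
    """Delete one Redis key, refusing the untouched placeholder default."""
    if not queue_to_clear or queue_to_clear == PLACEHOLDER:
        raise ValueError("queue_to_clear must be set to a real Redis key name")
    # DEL returns the number of keys removed (0 if the key did not exist).
    return client.delete(queue_to_clear)


class FakeRedis:
    """In-memory stand-in for a redis-py client, used only for this demo."""

    def __init__(self, data):
        self.data = dict(data)

    def delete(self, *names):
        removed = 0
        for name in names:
            if name in self.data:
                del self.data[name]
                removed += 1
        return removed


client = FakeRedis({"video_queue_fail": {"https://example.com/a": "error"}})
removed = clear_queue(client, "video_queue_fail")
print(f"Removed {removed} key(s)")

try:
    clear_queue(client, PLACEHOLDER)
except ValueError as exc:
    print(f"Refused: {exc}")
```
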
### `ytdlp_mgmt_queue_check_status`

* **File:** `airflow/dags/ytdlp_mgmt_queue_check_status.py`
* **Purpose:** Manually check the type and size of a specific YTDLP Redis queue/key.
* **Parameters (Defaults):**
    * `redis_conn_id` (`'redis_default'`): Airflow Redis connection ID.
    * `queue_to_check` (`'video_queue_inbox_account_fr_2025-04-03T1220_anonomyous_2ssdfsf2342afga09'`): Exact name of the Redis key to check.
* **Results:**
    * Connects to Redis and determines the type of the key specified by `queue_to_check`.
    * Determines the size (length for lists, number of fields for hashes).
    * Logs the key type and size.
    * Pushes `queue_key_type` and `queue_size` to XCom.

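
The type-then-size lookup can be sketched with the redis-py calls `type`, `llen`, and `hlen`. The function name `describe_key` and the handling of missing or other key types are assumptions made for illustration; `FakeRedis` is an in-memory stand-in that returns bytes, as a redis-py client does when `decode_responses` is not enabled.

```python
def describe_key(client, key):
    """Return (key_type, size) for a Redis key holding a list or a hash."""
    key_type = client.type(key)
    if isinstance(key_type, bytes):
        key_type = key_type.decode()
    if key_type == "list":
        size = client.llen(key)  # number of elements
    elif key_type == "hash":
        size = client.hlen(key)  # number of fields
    elif key_type == "none":
        size = 0  # TYPE reports "none" for a missing key
    else:
        size = None  # other types are outside the scope of this sketch
    return key_type, size


class FakeRedis:
    """In-memory stand-in for a redis-py client, used only for this demo."""

    def __init__(self, data):
        self.data = data

    def type(self, name):
        value = self.data.get(name)
        if isinstance(value, list):
            return b"list"
        if isinstance(value, dict):
            return b"hash"
        return b"none"

    def llen(self, name):
        return len(self.data[name])

    def hlen(self, name):
        return len(self.data[name])


client = FakeRedis({
    "video_queue_inbox": ["https://example.com/a", "https://example.com/b"],
    "video_queue_result": {"https://example.com/c": "{}"},
})
for key in ("video_queue_inbox", "video_queue_result", "missing_key"):
    queue_key_type, queue_size = describe_key(client, key)
    print(f"{key}: type={queue_key_type}, size={queue_size}")
```
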
### `ytdlp_mgmt_queue_list_contents`

* **File:** `airflow/dags/ytdlp_mgmt_queue_list_contents.py`
* **Purpose:** Manually list the contents of a specific YTDLP Redis queue/key (list or hash). Useful for inspecting queue state or results.
* **Parameters (Defaults):**
    * `redis_conn_id` (`'redis_default'`): Airflow Redis connection ID.
    * `queue_to_list` (`'video_queue_inbox_account_fr_2025-04-03T1220_anonomyous_2ssdfsf2342afga09'`): Exact name of the Redis key to list.
    * `max_items` (`100`): Maximum number of items/fields to list.
* **Results:**
    * Connects to Redis and identifies the type of the key specified by `queue_to_list`.
    * If it's a List, logs the first `max_items` elements.
    * If it's a Hash, logs up to `max_items` key-value pairs, attempting to pretty-print JSON values.
    * Logs warnings for very large hashes.

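
A bounded listing of a list or hash might look like the sketch below. It uses the redis-py calls `lrange` and `hscan_iter`; the function names, the choice of `hscan_iter` over `hgetall`, and the returned data shapes are assumptions for illustration, and `FakeRedis` is an in-memory stand-in so the example runs without a server.

```python
import json
from itertools import islice


def format_value(value):
    """Pretty-print a value if it parses as JSON; otherwise return it unchanged."""
    try:
        return json.dumps(json.loads(value), indent=2)
    except (TypeError, ValueError):
        return value


def list_contents(client, key, max_items=100):
    """Return up to max_items entries of a Redis list or hash, or None otherwise."""
    key_type = client.type(key)
    if isinstance(key_type, bytes):
        key_type = key_type.decode()
    if key_type == "list":
        if max_items <= 0:
            return []
        # LRANGE's end index is inclusive, hence max_items - 1.
        return client.lrange(key, 0, max_items - 1)
    if key_type == "hash":
        # Iterating lazily avoids loading a very large hash in one call.
        pairs = islice(client.hscan_iter(key), max(max_items, 0))
        return {field: format_value(value) for field, value in pairs}
    return None


class FakeRedis:
    """In-memory stand-in for a redis-py client, used only for this demo."""

    def __init__(self, data):
        self.data = data

    def type(self, name):
        value = self.data.get(name)
        if isinstance(value, list):
            return b"list"
        if isinstance(value, dict):
            return b"hash"
        return b"none"

    def lrange(self, name, start, end):
        return self.data[name][start:end + 1]

    def hscan_iter(self, name):
        return iter(self.data[name].items())


client = FakeRedis({
    "video_queue_inbox": ["https://example.com/a", "https://example.com/b", "https://example.com/c"],
    "video_queue_result": {"https://example.com/d": '{"socks_proxy": null}'},
})
print(list_contents(client, "video_queue_inbox", max_items=2))
print(list_contents(client, "video_queue_result")["https://example.com/d"])
```
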
### `ytdlp_proc_sequential_processor`

* **File:** `airflow/dags/ytdlp_proc_sequential_processor.py`
* **Purpose:** Processes YouTube URLs sequentially from a Redis queue; designed for batch processing. Each run pops a URL, gets token data and metadata via the YTDLP Ops service, downloads the media using `yt-dlp`, and records the result.
* **Parameters (Defaults):**
    * `queue_name` (`'video_queue'`): Base name for Redis queues (e.g., `video_queue_inbox`, `video_queue_progress`).
    * `redis_conn_id` (`'redis_default'`): Airflow Redis connection ID.
    * `redis_enabled` (`False`): Whether to use Redis for service discovery. If `False`, uses `service_ip`/`service_port`.
    * `service_ip` (`None`): Service IP; required if `redis_enabled=False`.
    * `service_port` (`None`): Service port; required if `redis_enabled=False`.
    * `account_id` (`'default_account'`): Account ID for the API call (used for Redis lookup if `redis_enabled=True`).
    * `timeout` (`30`): Timeout in seconds for the Thrift connection.
    * `download_format` (`'ba[ext=m4a]/bestaudio/best'`): yt-dlp format selection string.
    * `output_path_template` (`"{{ var.value.get('DOWNLOADS_TEMP', '/opt/airflow/downloads') }}/%(title)s [%(id)s].%(ext)s"`): yt-dlp output template. Uses Airflow Variable `DOWNLOADS_TEMP`.
    * `info_json_dir` (`"{{ var.value.get('DOWNLOADS_TEMP', '/opt/airflow/downloadfiles') }}"`): Directory to save `info.json`. Uses Airflow Variable `DOWNLOADS_TEMP`.
* **Results:**
    * Pops one URL from the `{{ params.queue_name }}_inbox` Redis list.
    * If a URL is popped, it's added to the `{{ params.queue_name }}_progress` Redis hash.
    * The `YtdlpOpsOperator` (`get_token` task) attempts to get token data (including `info.json`, proxy, command) for the URL using the specified connection method and account ID.
    * If token retrieval succeeds, the `download_video` task executes `yt-dlp` using the retrieved `info.json`, proxy, the `download_format` parameter, and the `output_path_template` parameter to download the actual media.
    * **On Successful Download:** The URL is removed from the progress hash and added to the `{{ params.queue_name }}_result` hash along with results (`info_json_path`, `socks_proxy`, `ytdlp_command`).
    * **On Failure (Token Retrieval or Download):** The URL is removed from the progress hash and added to the `{{ params.queue_name }}_fail` hash along with error details (message, traceback).
    * If the inbox queue is empty, the DAG run skips processing via `AirflowSkipException`.

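
The queue transitions of `ytdlp_proc_sequential_processor` (inbox to progress, then to result or fail) can be sketched as a single function. This is an illustration under stated assumptions, not the DAG's code: the real DAG splits the work across tasks, `QueueEmpty` stands in for `AirflowSkipException` so the sketch runs without Airflow, popping from the head with `lpop` and the JSON layout of the stored values are guesses, and `process_next`, `handler`, and `FakeRedis` are hypothetical names.

```python
import json
import traceback


class QueueEmpty(Exception):
    """Stand-in for AirflowSkipException so the sketch runs without Airflow."""


def process_next(client, queue_name, handler):
    """Move one URL through inbox -> progress -> result/fail; return the final hash."""
    inbox, progress = f"{queue_name}_inbox", f"{queue_name}_progress"
    result, fail = f"{queue_name}_result", f"{queue_name}_fail"

    url = client.lpop(inbox)  # assumption: URLs are consumed from the head
    if url is None:
        raise QueueEmpty(f"{inbox} is empty")
    client.hset(progress, url, json.dumps({"status": "processing"}))
    try:
        outcome = handler(url)  # token retrieval and download would happen here
    except Exception as exc:
        details = {"message": str(exc), "traceback": traceback.format_exc()}
        client.hdel(progress, url)
        client.hset(fail, url, json.dumps(details))
        return fail
    client.hdel(progress, url)
    client.hset(result, url, json.dumps(outcome))
    return result


class FakeRedis:
    """In-memory stand-in for a redis-py client, used only for this demo."""

    def __init__(self, inbox_name, urls):
        self.lists = {inbox_name: list(urls)}
        self.hashes = {}

    def lpop(self, name):
        items = self.lists.get(name, [])
        return items.pop(0) if items else None

    def hset(self, name, key, value):
        self.hashes.setdefault(name, {})[key] = value
        return 1

    def hdel(self, name, *keys):
        bucket = self.hashes.get(name, {})
        return sum(1 for key in keys if bucket.pop(key, None) is not None)


def fake_success(url):
    # Hypothetical result payload mirroring the fields named in the Results list.
    return {"info_json_path": "/tmp/info.json", "socks_proxy": None, "ytdlp_command": "yt-dlp ..."}


def fake_failure(url):
    raise RuntimeError("token retrieval failed")


client = FakeRedis("video_queue_inbox", ["https://example.com/a", "https://example.com/b"])
first = process_next(client, "video_queue", fake_success)
second = process_next(client, "video_queue", fake_failure)
print(first, second)
```

A third call on the now-empty inbox raises `QueueEmpty`, which corresponds to the skipped DAG run described in the last bullet.
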