YTDLP Airflow DAGs
This document describes the Airflow DAGs used for interacting with the YTDLP Ops service and managing processing queues.
DAG Descriptions
ytdlp_client_dag_v2.1
- File: `airflow/dags/ytdlp_client_dag_v2.1.py`
- Purpose: Provides a way to test the YTDLP Ops Thrift service interaction for a single video URL. Useful for debugging connection issues, testing specific account IDs, or verifying the service response for a particular URL independently of the queueing system.
- Parameters (Defaults):
    * `url` ('https://www.youtube.com/watch?v=sOlTX9uxUtM'): The video URL to process.
    * `redis_enabled` (False): Use Redis for service discovery?
    * `service_ip` ('85.192.30.55'): Service IP if `redis_enabled=False`.
    * `service_port` (9090): Service port if `redis_enabled=False`.
    * `account_id` ('account_fr_2025-04-03T1220_anonomyous_2ssdfsf2342afga09'): Account ID for lookup or call.
    * `timeout` (30): Timeout in seconds for the Thrift connection.
    * `info_json_dir` ("{{ var.value.get('DOWNLOADS_TEMP', '/opt/airflow/downloadfiles') }}"): Directory to save `info.json`.
- Results:
    - Connects to the YTDLP Ops service using the specified method (Redis or direct IP).
    - Retrieves token data for the given URL and account ID.
    - Saves the video's `info.json` metadata to the specified directory.
    - Extracts the SOCKS proxy (if available).
    - Pushes `info_json_path`, `socks_proxy`, and the original `ytdlp_command` (with tokens) to XCom.
    - Optionally stores detailed results in a Redis hash (`token_info:<video_id>`).
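For quick inspection of that optional result hash outside Airflow, a minimal redis-py sketch can be used. This assumes a local Redis instance and that field values are plain strings or JSON blobs; the exact field names are whatever the DAG wrote, not a documented contract:

```python
# Hypothetical sketch: read the per-video result hash written by ytdlp_client_dag_v2.1.
import json
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)  # assumed local Redis

video_id = "sOlTX9uxUtM"  # ID portion of the default test URL
key = f"token_info:{video_id}"

fields = r.hgetall(key)  # returns {} if the DAG did not store results
for name, value in fields.items():
    # Values may be plain strings or JSON; try to pretty-print the latter.
    try:
        value = json.dumps(json.loads(value), indent=2)
    except (ValueError, TypeError):
        pass
    print(f"{name}:\n{value}\n")
```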
ytdlp_mgmt_queue_add_urls
- File: `airflow/dags/ytdlp_mgmt_queue_add_urls.py`
- Purpose: Manually add video URLs to a specific YTDLP inbox queue (Redis List).
- Parameters (Defaults):
    * `redis_conn_id` ('redis_default'): Airflow Redis connection ID.
    * `queue_name` ('video_queue_inbox_account_fr_2025-04-03T1220_anonomyous_2ssdfsf2342afga09'): Target Redis list (inbox queue).
    * `urls` (""): Multiline string of video URLs to add.
- Results:
    - Parses the multiline `urls` parameter.
    - Adds each valid URL to the end of the Redis list specified by `queue_name`.
    - Logs the number of URLs added.
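The enqueue step amounts to splitting the multiline parameter and appending to the Redis list. A minimal sketch, assuming plain `RPUSH` semantics and a local Redis (the DAG itself resolves its connection via `redis_conn_id`, and its URL validation may be stricter):

```python
# Sketch of the enqueue step: split lines, drop blanks, append to the inbox list.
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)  # assumed local Redis

queue_name = "video_queue_inbox_account_fr_2025-04-03T1220_anonomyous_2ssdfsf2342afga09"
urls_param = """
https://www.youtube.com/watch?v=sOlTX9uxUtM
https://www.youtube.com/watch?v=EXAMPLE_ID2
"""  # second URL is an illustrative placeholder

urls = [line.strip() for line in urls_param.splitlines() if line.strip()]
if urls:
    new_length = r.rpush(queue_name, *urls)  # returns the new length of the list
    print(f"Added {len(urls)} URLs; queue length is now {new_length}")
```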
ytdlp_mgmt_queue_clear
- File: `airflow/dags/ytdlp_mgmt_queue_clear.py`
- Purpose: Manually delete a specific Redis key used by the YTDLP queues.
- Parameters (Defaults):
    * `redis_conn_id` ('redis_default'): Airflow Redis connection ID.
    * `queue_to_clear` ('PLEASE_SPECIFY_QUEUE_TO_CLEAR'): Exact name of the Redis key to clear. Must be changed by the user.
- Results:
    - Deletes the Redis key specified by the `queue_to_clear` parameter.
    - Warning: This operation is destructive and irreversible. Use with extreme caution. Ensure you specify the correct key name (e.g., `video_queue_inbox_account_xyz`, `video_queue_progress`, `video_queue_result`, `video_queue_fail`).
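The clear operation itself is a single `DEL`. A hedged sketch, assuming a local Redis and keeping the same "must be changed" guard the default parameter implies (the guard wording is an assumption, not the DAG's exact check):

```python
# Sketch of the destructive clear step with a placeholder guard.
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)  # assumed local Redis

queue_to_clear = "PLEASE_SPECIFY_QUEUE_TO_CLEAR"

if queue_to_clear == "PLEASE_SPECIFY_QUEUE_TO_CLEAR":
    raise ValueError("Refusing to delete: set queue_to_clear to a real key name")

deleted = r.delete(queue_to_clear)  # returns the number of keys removed (0 or 1)
print(f"Deleted {deleted} key(s) named '{queue_to_clear}'")
```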
ytdlp_mgmt_queue_check_status
- File: `airflow/dags/ytdlp_mgmt_queue_check_status.py`
- Purpose: Manually check the type and size of a specific YTDLP Redis queue/key.
- Parameters (Defaults):
    * `redis_conn_id` ('redis_default'): Airflow Redis connection ID.
    * `queue_to_check` ('video_queue_inbox_account_fr_2025-04-03T1220_anonomyous_2ssdfsf2342afga09'): Exact name of the Redis key to check.
- Results:
    - Connects to Redis and determines the type of the key specified by `queue_to_check`.
    - Determines the size (length for lists, number of fields for hashes).
    - Logs the key type and size.
    - Pushes `queue_key_type` and `queue_size` to XCom.
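A minimal sketch of the same type/size check with redis-py, assuming a local Redis and that only list and hash keys are of interest:

```python
# Sketch: report the type and size of a queue key.
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)  # assumed local Redis

queue_to_check = "video_queue_inbox_account_fr_2025-04-03T1220_anonomyous_2ssdfsf2342afga09"

key_type = r.type(queue_to_check)  # 'list', 'hash', or 'none' if the key is missing
if key_type == "list":
    size = r.llen(queue_to_check)
elif key_type == "hash":
    size = r.hlen(queue_to_check)
else:
    size = 0

print(f"{queue_to_check}: type={key_type}, size={size}")
```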
ytdlp_mgmt_queue_list_contents
- File: `airflow/dags/ytdlp_mgmt_queue_list_contents.py`
- Purpose: Manually list the contents of a specific YTDLP Redis queue/key (list or hash). Useful for inspecting queue state or results.
- Parameters (Defaults):
    * `redis_conn_id` ('redis_default'): Airflow Redis connection ID.
    * `queue_to_list` ('video_queue_inbox_account_fr_2025-04-03T1220_anonomyous_2ssdfsf2342afga09'): Exact name of the Redis key to list.
    * `max_items` (100): Maximum number of items/fields to list.
- Results:
    - Connects to Redis and identifies the type of the key specified by `queue_to_list`.
    - If it's a List, logs the first `max_items` elements.
    - If it's a Hash, logs up to `max_items` key-value pairs, attempting to pretty-print JSON values.
    - Logs warnings for very large hashes.
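A sketch of that listing logic, assuming a local Redis. `HSCAN` is used here so a large hash is read incrementally rather than in one call; the DAG's exact approach may differ:

```python
# Sketch: list up to max_items entries of a list or hash key, pretty-printing JSON values.
import json
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)  # assumed local Redis

queue_to_list = "video_queue_inbox_account_fr_2025-04-03T1220_anonomyous_2ssdfsf2342afga09"
max_items = 100

key_type = r.type(queue_to_list)
if key_type == "list":
    for item in r.lrange(queue_to_list, 0, max_items - 1):
        print(item)
elif key_type == "hash":
    shown = 0
    for field, value in r.hscan_iter(queue_to_list, count=50):
        try:
            value = json.dumps(json.loads(value), indent=2)
        except (ValueError, TypeError):
            pass  # not JSON; print as-is
        print(f"{field}: {value}")
        shown += 1
        if shown >= max_items:
            break
else:
    print(f"Key '{queue_to_list}' is missing or has unsupported type '{key_type}'")
```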
ytdlp_proc_sequential_processor
- File: `airflow/dags/ytdlp_proc_sequential_processor.py`
- Purpose: Processes YouTube URLs sequentially from a Redis queue. Designed for batch processing. Pops a URL, gets token/metadata via the YTDLP Ops service, downloads the media using `yt-dlp`, and records the result.
- Parameters (Defaults):
    * `queue_name` ('video_queue'): Base name for Redis queues (e.g., `video_queue_inbox`, `video_queue_progress`).
    * `redis_conn_id` ('redis_default'): Airflow Redis connection ID.
    * `redis_enabled` (False): Use Redis for service discovery? If False, uses `service_ip`/`service_port`.
    * `service_ip` (None): Required service IP if `redis_enabled=False`.
    * `service_port` (None): Required service port if `redis_enabled=False`.
    * `account_id` ('default_account'): Account ID for the API call (used for Redis lookup if `redis_enabled=True`).
    * `timeout` (30): Timeout in seconds for the Thrift connection.
    * `download_format` ('ba[ext=m4a]/bestaudio/best'): yt-dlp format selection string.
    * `output_path_template` ("{{ var.value.get('DOWNLOADS_TEMP', '/opt/airflow/downloads') }}/%(title)s [%(id)s].%(ext)s"): yt-dlp output template. Uses the Airflow Variable `DOWNLOADS_TEMP`.
    * `info_json_dir` ("{{ var.value.get('DOWNLOADS_TEMP', '/opt/airflow/downloadfiles') }}"): Directory to save `info.json`. Uses the Airflow Variable `DOWNLOADS_TEMP`.
- Results (a sketch of the queue transitions follows this list):
    - Pops one URL from the `{{ params.queue_name }}_inbox` Redis list.
    - If a URL is popped, it is added to the `{{ params.queue_name }}_progress` Redis hash.
    - The `YtdlpOpsOperator` (`get_token` task) attempts to get token data (including `info.json`, proxy, command) for the URL using the specified connection method and account ID.
    - If token retrieval succeeds, the `download_video` task executes `yt-dlp` using the retrieved `info.json`, proxy, the `download_format` parameter, and the `output_path_template` parameter to download the actual media.
    - On Successful Download: The URL is removed from the progress hash and added to the `{{ params.queue_name }}_result` hash along with results (`info_json_path`, `socks_proxy`, `ytdlp_command`).
    - On Failure (Token Retrieval or Download): The URL is removed from the progress hash and added to the `{{ params.queue_name }}_fail` hash along with error details (message, traceback).
    - If the inbox queue is empty, the DAG run skips processing via `AirflowSkipException`.
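To make the queue life cycle concrete, here is a minimal sketch of the inbox, progress, and result/fail transitions using plain redis-py calls. The token retrieval and `yt-dlp` download are stubbed out, and the key/field layout is an assumption rather than the DAG's exact schema:

```python
# Sketch of the inbox -> progress -> result/fail queue transitions.
import json
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)  # assumed local Redis
queue_name = "video_queue"

url = r.lpop(f"{queue_name}_inbox")
if url is None:
    print("Inbox empty; the DAG would skip via AirflowSkipException")
else:
    # Track the URL as in-progress while it is being worked on.
    r.hset(f"{queue_name}_progress", url, json.dumps({"status": "processing"}))
    try:
        # Placeholder for the get_token and download_video tasks.
        result = {
            "info_json_path": "/opt/airflow/downloadfiles/example.info.json",  # illustrative
            "socks_proxy": None,
            "ytdlp_command": "yt-dlp ...",
        }
        r.hset(f"{queue_name}_result", url, json.dumps(result))
    except Exception as exc:  # token retrieval or download failed
        r.hset(f"{queue_name}_fail", url, json.dumps({"error": str(exc)}))
    finally:
        r.hdel(f"{queue_name}_progress", url)
```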