yt-dlp-dags/README.en.old.md
2025-07-18 17:17:19 +03:00

YTDLP Airflow DAGs

This document describes the Airflow DAGs used for interacting with the YTDLP Ops service and managing processing queues.

DAG Descriptions

ytdlp_client_dag_v2.1

  • File: airflow/dags/ytdlp_client_dag_v2.1.py
  • Purpose: Tests the interaction with the YTDLP Ops Thrift service for a single video URL. Useful for debugging connection issues, testing specific account IDs, or verifying the service response for a particular URL independently of the queueing system.
  • Parameters (Defaults):
    • url ('https://www.youtube.com/watch?v=sOlTX9uxUtM'): The video URL to process.
    • redis_enabled (False): Use Redis for service discovery?
    • service_ip ('85.192.30.55'): Service IP if redis_enabled=False.
    • service_port (9090): Service port if redis_enabled=False.
    • account_id ('account_fr_2025-04-03T1220_anonomyous_2ssdfsf2342afga09'): Account ID for lookup or call.
    • timeout (30): Timeout in seconds for Thrift connection.
    • info_json_dir ("{{ var.value.get('DOWNLOADS_TEMP', '/opt/airflow/downloadfiles') }}"): Directory to save info.json.
  • Results:
    • Connects to the YTDLP Ops service using the specified method (Redis or direct IP).
    • Retrieves token data for the given URL and account ID.
    • Saves the video's info.json metadata to the specified directory.
    • Extracts the SOCKS proxy (if available).
    • Pushes info_json_path, socks_proxy, and the original ytdlp_command (with tokens) to XCom.
    • Optionally stores detailed results in a Redis hash (token_info:<video_id>).
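
As a minimal sketch of how a test run with these parameters might be started, the helper below builds an `airflow dags trigger` command line with a JSON `--conf` payload. The helper name `build_trigger_command` is hypothetical, and it assumes the DAG ID matches the heading above and that the DAG reads these parameters from the run configuration.

```python
import json
import shlex


def build_trigger_command(dag_id, url, account_id, redis_enabled=False,
                          service_ip=None, service_port=None, timeout=30):
    """Return a shell command that triggers the DAG with the given parameters.

    Hypothetical helper: the parameter names come from the list above,
    everything else is an assumption.
    """
    # A direct connection needs an explicit address when Redis discovery is off.
    if not redis_enabled and (not service_ip or not service_port):
        raise ValueError(
            "service_ip and service_port are required when redis_enabled is False"
        )

    conf = {
        "url": url,
        "account_id": account_id,
        "redis_enabled": redis_enabled,
        "timeout": timeout,
    }
    if not redis_enabled:
        conf["service_ip"] = service_ip
        conf["service_port"] = service_port

    # shlex.quote keeps the JSON payload intact when pasted into a shell.
    return " ".join([
        "airflow", "dags", "trigger",
        "--conf", shlex.quote(json.dumps(conf)),
        shlex.quote(dag_id),
    ])


if __name__ == "__main__":
    print(build_trigger_command(
        "ytdlp_client_dag_v2.1",
        "https://www.youtube.com/watch?v=sOlTX9uxUtM",
        "default_account",
        service_ip="127.0.0.1",
        service_port=9090,
    ))
```

The same parameters can also be entered in the Airflow UI when triggering the DAG manually.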

ytdlp_mgmt_queue_add_urls

  • File: airflow/dags/ytdlp_mgmt_queue_add_urls.py
  • Purpose: Manually add video URLs to a specific YTDLP inbox queue (Redis List).
  • Parameters (Defaults):
    • redis_conn_id ('redis_default'): Airflow Redis connection ID.
    • queue_name ('video_queue_inbox_account_fr_2025-04-03T1220_anonomyous_2ssdfsf2342afga09'): Target Redis list (inbox queue).
    • urls (""): Multiline string of video URLs to add.
  • Results:
    • Parses the multiline urls parameter.
    • Adds each valid URL to the end of the Redis list specified by queue_name.
    • Logs the number of URLs added.
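
The steps above can be sketched roughly as follows. This is an illustration, not the DAG's actual code: `parse_urls` and `add_urls` are hypothetical names, the "valid URL" check (an http or https prefix) is an assumption, and `client` stands for any redis-py style client that exposes `rpush`.

```python
def parse_urls(raw):
    """Split a multiline string into a list of URLs, one per non-empty line."""
    urls = []
    for line in raw.splitlines():
        candidate = line.strip()
        # Assumed validity rule: keep only lines that look like web URLs.
        if candidate.startswith(("http://", "https://")):
            urls.append(candidate)
    return urls


def add_urls(client, queue_name, raw):
    """Append every parsed URL to the end of the Redis list and return the count."""
    urls = parse_urls(raw)
    if urls:
        # RPUSH appends to the tail, so URLs keep the order they were entered in.
        client.rpush(queue_name, *urls)
    return len(urls)
```

With a real connection, `client` could be a `redis.Redis` instance or the client returned by the Airflow Redis hook for `redis_conn_id`.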

ytdlp_mgmt_queue_clear

  • File: airflow/dags/ytdlp_mgmt_queue_clear.py
  • Purpose: Manually delete a specific Redis key used by the YTDLP queues.
  • Parameters (Defaults):
    • redis_conn_id ('redis_default'): Airflow Redis connection ID.
    • queue_to_clear ('PLEASE_SPECIFY_QUEUE_TO_CLEAR'): Exact name of the Redis key to clear. Must be changed by user.
  • Results:
    • Deletes the Redis key specified by the queue_to_clear parameter.
    • Warning: This operation is destructive and irreversible. Use with extreme caution. Ensure you specify the correct key name (e.g., video_queue_inbox_account_xyz, video_queue_progress, video_queue_result, video_queue_fail).
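
Because the deletion cannot be undone, a guard against the placeholder default is worth having. The sketch below shows one way to do it; whether the DAG itself rejects the placeholder is an assumption, `clear_queue` is a hypothetical name, and `client` is any redis-py style client that exposes `delete`.

```python
PLACEHOLDER = "PLEASE_SPECIFY_QUEUE_TO_CLEAR"


def clear_queue(client, queue_to_clear):
    """Delete one Redis key, refusing to run with an empty or placeholder name.

    Returns the number of keys removed (0 if the key did not exist).
    """
    if not queue_to_clear or queue_to_clear == PLACEHOLDER:
        raise ValueError("queue_to_clear must be set to an actual Redis key name")
    return client.delete(queue_to_clear)
```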

ytdlp_mgmt_queue_check_status

  • File: airflow/dags/ytdlp_mgmt_queue_check_status.py
  • Purpose: Manually check the type and size of a specific YTDLP Redis queue/key.
  • Parameters (Defaults):
    • redis_conn_id ('redis_default'): Airflow Redis connection ID.
    • queue_to_check ('video_queue_inbox_account_fr_2025-04-03T1220_anonomyous_2ssdfsf2342afga09'): Exact name of the Redis key to check.
  • Results:
    • Connects to Redis and determines the type of the key specified by queue_to_check.
    • Determines the size (length for lists, number of fields for hashes).
    • Logs the key type and size.
    • Pushes queue_key_type and queue_size to XCom.
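
A minimal sketch of the type and size check, assuming a redis-py style `client` with `type`, `llen`, and `hlen`. The function name `check_key` is hypothetical, and reporting a size of 0 for other key types is an assumption.

```python
def check_key(client, key):
    """Return the key type and size in the shape the DAG pushes to XCom."""
    key_type = client.type(key)
    # redis-py returns bytes unless the client was created with decode_responses=True.
    if isinstance(key_type, bytes):
        key_type = key_type.decode()

    if key_type == "list":
        size = client.llen(key)   # number of elements
    elif key_type == "hash":
        size = client.hlen(key)   # number of fields
    else:
        size = 0                  # "none" (missing key) or an unsupported type

    return {"queue_key_type": key_type, "queue_size": size}
```

Redis reports the type `none` for a key that does not exist, so a missing queue shows up as `{"queue_key_type": "none", "queue_size": 0}` in this sketch.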

ytdlp_mgmt_queue_list_contents

  • File: airflow/dags/ytdlp_mgmt_queue_list_contents.py
  • Purpose: Manually list the contents of a specific YTDLP Redis queue/key (list or hash). Useful for inspecting queue state or results.
  • Parameters (Defaults):
    • redis_conn_id ('redis_default'): Airflow Redis connection ID.
    • queue_to_list ('video_queue_inbox_account_fr_2025-04-03T1220_anonomyous_2ssdfsf2342afga09'): Exact name of the Redis key to list.
    • max_items (100): Maximum number of items/fields to list.
  • Results:
    • Connects to Redis and identifies the type of the key specified by queue_to_list.
    • If it's a List, logs the first max_items elements.
    • If it's a Hash, logs up to max_items key-value pairs, attempting to pretty-print JSON values.
    • Logs warnings for very large hashes.
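
The listing logic could look roughly like the sketch below. It assumes a redis-py style `client` with `type`, `lrange`, and `hscan_iter`; the names `format_value` and `list_contents` are hypothetical, and iterating the hash incrementally rather than loading it whole is a design choice of this sketch, not something the DAG is known to do.

```python
import itertools
import json


def _text(value):
    """Decode bytes replies so the output is readable in logs."""
    if isinstance(value, bytes):
        return value.decode("utf-8", errors="replace")
    return value


def format_value(value):
    """Pretty-print JSON values and return anything else unchanged."""
    value = _text(value)
    try:
        return json.dumps(json.loads(value), indent=2, sort_keys=True)
    except (TypeError, ValueError):
        return value


def list_contents(client, key, max_items=100):
    """Return up to max_items list elements or (field, value) pairs of a hash."""
    if max_items <= 0:
        return []
    key_type = _text(client.type(key))
    if key_type == "list":
        # LRANGE bounds are inclusive, hence max_items - 1.
        return [_text(item) for item in client.lrange(key, 0, max_items - 1)]
    if key_type == "hash":
        # hscan_iter walks the hash in batches, which is gentler on large hashes.
        pairs = itertools.islice(client.hscan_iter(key), max_items)
        return [(_text(field), format_value(value)) for field, value in pairs]
    return []
```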

ytdlp_proc_sequential_processor

  • File: airflow/dags/ytdlp_proc_sequential_processor.py
  • Purpose: Processes YouTube URLs sequentially from a Redis queue. Designed for batch processing. Each run pops a URL, gets token data and metadata from the YTDLP Ops service, downloads the media using yt-dlp, and records the result.
  • Parameters (Defaults):
    • queue_name ('video_queue'): Base name for Redis queues (e.g., video_queue_inbox, video_queue_progress).
    • redis_conn_id ('redis_default'): Airflow Redis connection ID.
    • redis_enabled (False): Use Redis for service discovery? If False, uses service_ip/port.
    • service_ip (None): Required Service IP if redis_enabled=False.
    • service_port (None): Required Service port if redis_enabled=False.
    • account_id ('default_account'): Account ID for the API call (used for Redis lookup if redis_enabled=True).
    • timeout (30): Timeout in seconds for the Thrift connection.
    • download_format ('ba[ext=m4a]/bestaudio/best'): yt-dlp format selection string.
    • output_path_template ("{{ var.value.get('DOWNLOADS_TEMP', '/opt/airflow/downloads') }}/%(title)s [%(id)s].%(ext)s"): yt-dlp output template. Uses Airflow Variable DOWNLOADS_TEMP.
    • info_json_dir ("{{ var.value.get('DOWNLOADS_TEMP', '/opt/airflow/downloadfiles') }}"): Directory to save info.json. Uses Airflow Variable DOWNLOADS_TEMP.
  • Results:
    • Pops one URL from the {{ params.queue_name }}_inbox Redis list.
    • If a URL is popped, it's added to the {{ params.queue_name }}_progress Redis hash.
    • The YtdlpOpsOperator (get_token task) attempts to get token data (including info.json, proxy, command) for the URL using the specified connection method and account ID.
    • If token retrieval succeeds, the download_video task executes yt-dlp using the retrieved info.json, proxy, the download_format parameter, and the output_path_template parameter to download the actual media.
    • On Successful Download: The URL is removed from the progress hash and added to the {{ params.queue_name }}_result hash along with results (info_json_path, socks_proxy, ytdlp_command).
    • On Failure (Token Retrieval or Download): The URL is removed from the progress hash and added to the {{ params.queue_name }}_fail hash along with error details (message, traceback).
    • If the inbox queue is empty, the DAG run skips processing via AirflowSkipException.
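
The queue transitions above can be sketched as a single function. This is an illustration under stated assumptions, not the DAG's code: `process_one` and `build_ytdlp_command` are hypothetical names, `get_token` and `download` are stand-in callables for the get_token and download_video tasks, popping from the head of the list (LPOP) is assumed, and the JSON payloads stored in the hashes are made up for the example. The yt-dlp options used (`--load-info-json`, `-f`, `-o`, `--proxy`) are standard yt-dlp flags.

```python
import json
import traceback


def build_ytdlp_command(info_json_path, socks_proxy, download_format, output_template):
    """Build the yt-dlp argument list from the token data and DAG parameters."""
    cmd = ["yt-dlp", "--load-info-json", info_json_path,
           "-f", download_format, "-o", output_template]
    if socks_proxy:
        cmd += ["--proxy", socks_proxy]
    return cmd


def process_one(client, queue_name, get_token, download,
                download_format="ba[ext=m4a]/bestaudio/best",
                output_template="%(title)s [%(id)s].%(ext)s"):
    """Move one URL through inbox -> progress -> result or fail.

    Returns "result", "fail", or None when the inbox is empty
    (the point where the DAG raises AirflowSkipException).
    """
    url = client.lpop(f"{queue_name}_inbox")
    if url is None:
        return None
    if isinstance(url, bytes):
        url = url.decode()

    progress = f"{queue_name}_progress"
    client.hset(progress, url, json.dumps({"status": "processing"}))
    try:
        # Expected keys: info_json_path, socks_proxy, ytdlp_command.
        token = get_token(url)
        download(build_ytdlp_command(token["info_json_path"], token.get("socks_proxy"),
                                     download_format, output_template))
        client.hset(f"{queue_name}_result", url, json.dumps(token))
        outcome = "result"
    except Exception as exc:
        details = {"message": str(exc), "traceback": traceback.format_exc()}
        client.hset(f"{queue_name}_fail", url, json.dumps(details))
        outcome = "fail"
    finally:
        # The URL leaves the progress hash whether it succeeded or failed.
        client.hdel(progress, url)
    return outcome
```

In a real run, `download` could be something like `lambda cmd: subprocess.run(cmd, check=True)`, so that a non-zero yt-dlp exit code is routed to the fail hash.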