
Airflow DAGs Explanation

ytdlp_ops_worker_per_url.py

This DAG processes a single YouTube URL passed via DAG run configuration. It is the "Worker" half of a Sensor/Worker pattern and uses the TaskFlow API together with worker affinity to ensure that all tasks for a single URL run on the same machine.
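
In code, the pattern looks roughly like this. A minimal sketch, assuming Airflow 2.x with the CeleryExecutor; the queue name `worker-1`, the task bodies, and the `info.json` path are illustrative stand-ins, not the DAG's actual implementation:

```python
import pendulum
from airflow.decorators import dag, task

WORKER_QUEUE = "worker-1"  # hypothetical: one Celery queue per machine

@dag(schedule=None, start_date=pendulum.datetime(2024, 1, 1), catchup=False)
def worker_per_url_sketch():
    @task(queue=WORKER_QUEUE)
    def get_url_and_assign_account(dag_run=None) -> dict:
        # The dispatcher passes the target URL in the DAG run conf.
        return {"url": dag_run.conf["url"], "account": "account-0"}

    @task(queue=WORKER_QUEUE)
    def get_token(job: dict) -> dict:
        # Stand-in for the Thrift call; the real task writes an
        # info.json that lives only on this machine's filesystem.
        return {**job, "info_json": "/tmp/info.json"}

    get_token(get_url_and_assign_account())

worker_per_url_sketch()
```

Pinning every task to the same Celery queue (one queue per machine, each worker started with `airflow celery worker --queues worker-1`) is what provides the affinity described above.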

DAG Structure and Flow

Legend:

  • TaskName: An Airflow task.
  • -->: Successful execution flow.
  • --(fail)-->: Execution flow triggered by the failure of the preceding task.
  • --(success)-->: Execution flow triggered only if the preceding task succeeds.
  • [Group: GroupName]: A TaskGroup containing sub-tasks.

Execution Flow:

  1. Start: The DAG run is triggered (e.g., by the dispatcher).

  2. get_url_and_assign_account

    • Purpose: Reads the target URL from the DAG run configuration and assigns the first account to use.
    • Flow:
      • --> get_token (Success path)
      • --(fail)--> handle_bannable_error_branch (Failure path)
  3. get_token (Initial attempt)

    • Purpose: Calls the Thrift service to get a token using the assigned account.
    • Flow:
      • --(success)--> download_and_probe (Success path, passed via coalesce_token_data)
      • --(fail)--> handle_bannable_error_branch (Failure path)
  4. handle_bannable_error_branch

    • Purpose: Checks the error from get_token and decides the next step based on error type and policy (a sketch of this branching, together with the retry and coalesce steps, follows this list).
    • Flow (Branches):
      • If bannable error & retry policy:
        • --> [Group: ban_account_and_prepare_for_retry]
          • --> check_sliding_window_for_ban
            • --> ban_account_task (if ban criteria met)
            • --> skip_ban_task (if ban criteria not met)
        • --> assign_new_account_for_retry (after group)
        • --> retry_get_token (using new account)
      • If bannable error & stop policy:
        • --> ban_and_fail (Bans account and fails DAG)
      • If connection error & retry policy:
        • --> assign_new_account_for_retry
        • --> retry_get_token
      • If non-bannable/connection error:
        • (No specific path is defined; the DAG run likely fails)
  5. retry_get_token

    • Purpose: Calls the Thrift service again using the new account.
    • Flow:
      • --(success)--> download_and_probe (Success path, passed via coalesce_token_data)
      • --(fail)--> handle_generic_failure (Failure path)
  6. coalesce_token_data

    • Purpose: Selects the successful token data from either the initial or retry attempt.
    • Flow:
      • --> download_and_probe (Success path)
  7. download_and_probe

    • Purpose: Uses the token data to download the media file and probe it with ffmpeg.
    • Flow:
      • --(success)--> mark_url_as_success (Success path)
      • --(fail)--> handle_generic_failure (Failure path)
  8. mark_url_as_success

    • Purpose: Records the successful processing result.
    • Flow:
      • --(success)--> continue_processing_loop (Success path)
      • --(fail)--> handle_generic_failure (Failure path)
  9. continue_processing_loop

    • Purpose: Triggers a new run of the dispatcher DAG.
    • Flow:
      • (End of this DAG run)
  10. handle_generic_failure

    • Purpose: Catches any unhandled failures and marks the DAG run as failed.
    • Flow:
      • (End of this DAG run, marked as failed)
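
The branch-and-retry shape of steps 4-6 can be hard to picture from the list alone. A minimal sketch, assuming Airflow 2.x; the task names mirror the ones above, but the bodies, the error/policy inspection, and the exact trigger rules are illustrative guesses rather than the DAG's real code:

```python
import pendulum
from airflow.decorators import dag, task
from airflow.exceptions import AirflowFailException
from airflow.utils.trigger_rule import TriggerRule

@dag(schedule=None, start_date=pendulum.datetime(2024, 1, 1), catchup=False)
def token_retry_sketch():
    @task
    def get_token() -> dict:
        # First attempt; in the real DAG this may raise on a bannable error.
        return {"token": "initial"}

    @task.branch(trigger_rule=TriggerRule.ONE_FAILED)
    def handle_bannable_error_branch() -> str:
        # Runs only if get_token failed; inspect error type and policy.
        bannable, retry_policy = True, True  # illustrative values
        if bannable and retry_policy:
            return "retry_get_token"
        return "ban_and_fail"

    @task
    def retry_get_token() -> dict:
        return {"token": "retry"}

    @task
    def ban_and_fail():
        raise AirflowFailException("account banned; stop policy in effect")

    @task(trigger_rule=TriggerRule.ALL_DONE)
    def coalesce_token_data(ti=None) -> dict:
        # Keep whichever attempt actually produced a token.
        data = ti.xcom_pull(task_ids="retry_get_token") or ti.xcom_pull(
            task_ids="get_token"
        )
        if data is None:
            raise AirflowFailException("no attempt produced a token")
        return data

    first = get_token()
    branch = handle_bannable_error_branch()
    retried = retry_get_token()
    coalesced = coalesce_token_data()
    first >> branch >> [retried, ban_and_fail()]
    [first, retried] >> coalesced

token_retry_sketch()
```

The key detail is the trigger rules: the branch fires only when the first attempt fails (`ONE_FAILED`), while `coalesce_token_data` runs under `ALL_DONE` so a failed first attempt does not block it, and it simply keeps whichever attempt left a token in XCom.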

Purpose of Orchestrator and Dispatcher

The system uses separate orchestrator and dispatcher components for several key reasons:

  1. Worker Affinity/Pinning: One of the main reasons is to ensure that all tasks related to processing a single URL run on the same worker machine. This is crucial because the get_token task generates an info.json file that contains session-specific data (like cookies and tokens). The subsequent download_and_probe task needs to use this exact info.json file. By using a dedicated worker DAG (ytdlp_ops_worker_per_url.py) with worker affinity, we guarantee that the file system where info.json is stored is accessible to both tasks.

  2. Scalability and Load Distribution: The dispatcher can monitor queues or sources of URLs and trigger individual worker DAG runs (see the dispatcher sketch after this list). This decouples the discovery of work from the execution of work, allowing for better scaling and management of processing load across multiple workers.

  3. Fault Isolation: If processing a single URL fails, it only affects that specific worker DAG run, not the entire pipeline. The dispatcher can continue to trigger other worker runs for other URLs.

  4. Flexibility: The orchestrator/dispatcher pattern allows for more complex scheduling, prioritization, and routing logic to be implemented in the dispatcher, while keeping the worker DAG focused on the core processing steps for a single unit of work.
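
To make points 1 and 2 concrete, a dispatcher can hand exactly one URL to one worker run using the stock `TriggerDagRunOperator`. A minimal sketch; `next_url` is a hypothetical stand-in for the real queue or database lookup:

```python
import pendulum
from airflow.decorators import dag, task
from airflow.operators.trigger_dagrun import TriggerDagRunOperator

@dag(schedule=None, start_date=pendulum.datetime(2024, 1, 1), catchup=False)
def dispatcher_sketch():
    @task
    def next_url() -> str:
        # Hypothetical: pull the next pending URL from a queue or database.
        return "https://www.youtube.com/watch?v=example"

    trigger = TriggerDagRunOperator(
        task_id="trigger_worker",
        trigger_dag_id="ytdlp_ops_worker_per_url",
        # conf is a templated field; the worker reads conf["url"].
        conf={"url": "{{ ti.xcom_pull(task_ids='next_url') }}"},
    )
    next_url() >> trigger

dispatcher_sketch()
```

The same hand-off also works ad hoc from the CLI: `airflow dags trigger ytdlp_ops_worker_per_url --conf '{"url": "..."}'`.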