# Airflow DAGs Explanation

## ytdlp_ops_worker_per_url.py
This DAG processes a single YouTube URL passed via DAG run configuration. It's the "Worker" part of a Sensor/Worker pattern and uses the TaskFlow API to implement worker affinity, ensuring all tasks for a single URL run on the same machine.
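As a rough illustration of that shape, here is a minimal TaskFlow sketch: a DAG that is only ever triggered externally, reads the URL from the run configuration, and pins its tasks to a single Celery queue for worker affinity. The queue name, account selection, and stub token are assumptions for illustration, not the actual implementation.

```python
from airflow.decorators import dag, task
from airflow.operators.python import get_current_context


@dag(
    dag_id="ytdlp_ops_worker_per_url_sketch",  # hypothetical id, to avoid clashing
    schedule=None,                             # triggered externally, never scheduled
    catchup=False,
)
def worker_per_url_sketch():

    # Pinning every task to one Celery queue that a single worker consumes is a
    # common way to get worker affinity; the queue name here is an assumption.
    @task(queue="ytdlp_pinned")
    def get_url_and_assign_account() -> dict:
        context = get_current_context()
        url = context["dag_run"].conf["url"]      # set by the dispatcher's trigger
        return {"url": url, "account": "acct-0"}  # placeholder account selection

    @task(queue="ytdlp_pinned")
    def get_token(payload: dict) -> dict:
        # The real task calls the Thrift service; a stub token stands in here.
        return {**payload, "token": "stub-token"}

    get_token(get_url_and_assign_account())


worker_per_url_sketch()
```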
### DAG Structure and Flow
Legend:

- `TaskName`: an Airflow task.
- `-->`: successful execution flow.
- `--(fail)-->`: execution flow triggered by the failure of the preceding task.
- `--(success)-->`: execution flow triggered only if the preceding task succeeds.
- `[Group: GroupName]`: a TaskGroup containing sub-tasks.
Execution Flow:
1. Start: The DAG run is triggered (e.g., by the dispatcher).
2. `get_url_and_assign_account`
   - Purpose: Gets the URL and assigns the first account.
   - Flow:
     - `--> get_token` (success path)
     - `--(fail)--> handle_bannable_error_branch` (failure path)
3. `get_token` (initial attempt)
   - Purpose: Calls the Thrift service to get a token using the assigned account.
   - Flow:
     - `--(success)--> download_and_probe` (success path, passed via `coalesce_token_data`)
     - `--(fail)--> handle_bannable_error_branch` (failure path)
4. `handle_bannable_error_branch`
   - Purpose: Checks the error from `get_token` and decides the next step based on error type and policy (see the sketch after this list).
   - Flow (branches):
     - If bannable error & retry policy:
       - `--> [Group: ban_account_and_prepare_for_retry]`
         - `--> check_sliding_window_for_ban`
         - `--> ban_account_task` (if ban criteria met)
         - `--> skip_ban_task` (if ban criteria not met)
       - `--> assign_new_account_for_retry` (after the group)
       - `--> retry_get_token` (using the new account)
     - If bannable error & stop policy:
       - `--> ban_and_fail` (bans the account and fails the DAG)
     - If connection error & retry policy:
       - `--> assign_new_account_for_retry` `--> retry_get_token`
     - If the error is neither bannable nor a connection error: no specific path is defined, so the DAG run likely fails.
5. `retry_get_token`
   - Purpose: Calls the Thrift service again using the new account.
   - Flow:
     - `--(success)--> download_and_probe` (success path, passed via `coalesce_token_data`)
     - `--(fail)--> handle_generic_failure` (failure path)
6. `coalesce_token_data`
   - Purpose: Selects the successful token data from either the initial or the retry attempt.
   - Flow:
     - `--> download_and_probe` (success path)
7. `download_and_probe`
   - Purpose: Uses the token data to download the media file and probes it with ffmpeg.
   - Flow:
     - `--(success)--> mark_url_as_success` (success path)
     - `--(fail)--> handle_generic_failure` (failure path)
8. `mark_url_as_success`
   - Purpose: Records the successful processing result.
   - Flow:
     - `--(success)--> continue_processing_loop` (success path)
     - `--(fail)--> handle_generic_failure` (failure path)
9. `continue_processing_loop`
   - Purpose: Triggers a new run of the dispatcher DAG.
   - Flow: end of this DAG run.
10. `handle_generic_failure`
    - Purpose: Catches any unhandled failures and marks the DAG run as failed.
    - Flow: end of this DAG run, marked as failed.
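To make the branching and coalescing concrete, below is a hedged sketch of how steps 3–6 could be wired with the TaskFlow API. The error classification, policy flag, and return values are assumptions based on the flow above; the trigger rules are standard Airflow features (`one_failed` to run a task after an upstream failure, `none_failed_min_one_success` to join alternative branches).

```python
from airflow.decorators import task
from airflow.utils.trigger_rule import TriggerRule


# Runs only when the upstream get_token attempt fails (trigger rule one_failed),
# then returns the task id of the branch to follow.
@task.branch(trigger_rule=TriggerRule.ONE_FAILED)
def handle_bannable_error_branch() -> str:
    error_kind = "bannable"  # placeholder: would be derived from get_token's error
    retry_policy = True      # placeholder: would come from DAG params or run conf
    if error_kind == "bannable" and retry_policy:
        # A task inside a TaskGroup is addressed as "<group_id>.<task_id>".
        return "ban_account_and_prepare_for_retry.check_sliding_window_for_ban"
    if error_kind == "bannable":
        return "ban_and_fail"
    if error_kind == "connection" and retry_policy:
        return "assign_new_account_for_retry"
    return "handle_generic_failure"


# Joins the initial and retry attempts: it runs as long as no upstream task
# failed outright and at least one attempt succeeded, then picks the winner.
@task(trigger_rule=TriggerRule.NONE_FAILED_MIN_ONE_SUCCESS)
def coalesce_token_data(initial=None, retried=None) -> dict:
    token = initial or retried
    if token is None:
        raise ValueError("no successful token attempt to coalesce")
    return token
```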
### Purpose of Orchestrator and Dispatcher
The system uses separate orchestrator and dispatcher components for several key reasons:
1. Worker affinity/pinning: ensures that all tasks related to processing a single URL run on the same worker machine. This is crucial because the `get_token` task generates an `info.json` file that contains session-specific data (such as cookies and tokens), and the subsequent `download_and_probe` task needs to use this exact `info.json` file. By using a dedicated worker DAG (`ytdlp_ops_worker_per_url.py`) with worker affinity, we guarantee that the file system where `info.json` is stored is accessible to both tasks.

2. Scalability and load distribution: the dispatcher can monitor queues or other sources of URLs and trigger individual worker DAG runs (see the sketch below). This decouples the discovery of work from the execution of work, allowing processing load to be scaled and managed across multiple workers.

3. Fault isolation: if processing a single URL fails, it affects only that specific worker DAG run, not the entire pipeline. The dispatcher can continue to trigger worker runs for other URLs.

4. Flexibility: the orchestrator/dispatcher pattern allows more complex scheduling, prioritization, and routing logic to live in the dispatcher, while the worker DAG stays focused on the core processing steps for a single unit of work.
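For illustration, the dispatcher-to-worker handoff can be as simple as a `TriggerDagRunOperator` that passes the URL through `conf`; the same operator shape would serve `continue_processing_loop` when it re-triggers the dispatcher. The URL value here is a placeholder.

```python
from airflow.operators.trigger_dagrun import TriggerDagRunOperator

# One worker DAG run per URL; the worker reads conf["url"] in its first task.
trigger_worker = TriggerDagRunOperator(
    task_id="trigger_worker_for_url",
    trigger_dag_id="ytdlp_ops_worker_per_url",
    conf={"url": "https://www.youtube.com/watch?v=example"},  # placeholder URL
    wait_for_completion=False,  # fire-and-forget so the dispatcher keeps going
)
```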