yt-dlp-dags/airflow/README.md
2025-09-17 13:06:37 +03:00

Airflow Cluster for YT-DLP Operations

This directory contains the configuration and deployment files for an Apache Airflow cluster designed to manage distributed YouTube video downloading tasks using the ytdlp-ops service.

Overview

The cluster consists of:

  • Master Node: Runs the Airflow webserver, scheduler, and Flower (Celery monitoring). It also hosts shared services like Redis (broker/backend) and MinIO (artifact storage).
  • Worker Nodes: Run Celery workers that execute download tasks. Each worker node also runs the ytdlp-ops-service (Thrift API server), Envoy proxy (load balancer for Thrift traffic), and Camoufox (remote browser instances for token generation).

Key Components

Airflow DAGs

  • ytdlp_ops_dispatcher.py: The "Sensor" part of a Sensor/Worker pattern. It monitors a Redis queue for URLs to process and triggers a ytdlp_ops_worker_per_url DAG run for each URL.
  • ytdlp_ops_worker_per_url.py: The "Worker" DAG. It processes a single URL passed via DAG run configuration. It implements worker affinity (all tasks for a URL run on the same machine) and handles account management (retrying with different accounts, banning failed accounts based on sliding window checks).
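
The sliding-window ban check mentioned above could look roughly like the sketch below. It is not the code from ytdlp_ops_worker_per_url.py: the class name, the threshold, and the window length are assumptions chosen for illustration.

```python
from collections import deque
from typing import Deque, Dict


class SlidingWindowBanPolicy:
    """Flag an account for banning once it fails too often within a recent window.

    Hypothetical helper: max_failures and window_seconds are illustrative
    defaults, not values taken from the real DAG.
    """

    def __init__(self, max_failures: int = 3, window_seconds: float = 600.0) -> None:
        self.max_failures = max_failures
        self.window_seconds = window_seconds
        self._failures: Dict[str, Deque[float]] = {}

    def record_failure(self, account_id: str, now: float) -> bool:
        """Record a failure at time `now` (seconds); return True if the account should be banned."""
        window = self._failures.setdefault(account_id, deque())
        window.append(now)
        # Drop failures that have slid out of the window.
        while window and now - window[0] > self.window_seconds:
            window.popleft()
        return len(window) >= self.max_failures
```

A task would call record_failure(account_id, time.time()) after each failed attempt and switch to another account when it returns True. Old failures age out, so an account with occasional errors spread over a long period is not banned.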

Configuration Files

  • airflow.cfg: Main Airflow configuration file.
  • config/airflow_local_settings.py: Contains the task_instance_mutation_hook which implements worker affinity by dynamically assigning tasks to queues based on the worker node's hostname.
  • config/custom_task_hooks.py: Contains a duplicate of the task_instance_mutation_hook. The copy in airflow_local_settings.py is the active one.
  • config/redis_default_conn.json.j2: Jinja2 template for the Airflow Redis connection configuration.
  • config/minio_default_conn.json.j2: Jinja2 template for the Airflow MinIO connection configuration.
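
As a rough illustration of worker affinity through task_instance_mutation_hook, the sketch below pins a task instance to a per-host Celery queue. Airflow does pick up a function with this name from airflow_local_settings.py, but everything else here is an assumption: the queue-dl- prefix, the worker_queue key in the DAG run configuration, and the idea that an earlier step stores the chosen queue there. The stand-in objects only mimic the two attributes the hook touches.

```python
from types import SimpleNamespace

QUEUE_PREFIX = "queue-dl-"  # hypothetical queue naming scheme


def queue_for_host(hostname: str) -> str:
    """Map a worker hostname to its dedicated Celery queue name."""
    return f"{QUEUE_PREFIX}{hostname}"


def task_instance_mutation_hook(task_instance) -> None:
    """Route the task instance to the queue recorded in its DAG run conf, if any."""
    dag_run = getattr(task_instance, "dag_run", None)
    conf = getattr(dag_run, "conf", None) or {}
    worker_queue = conf.get("worker_queue")  # hypothetical conf key
    if worker_queue:
        task_instance.queue = worker_queue


# Stand-in for a real TaskInstance, for demonstration only.
ti = SimpleNamespace(
    queue="default",
    dag_run=SimpleNamespace(conf={"worker_queue": queue_for_host("worker-01")}),
)
task_instance_mutation_hook(ti)
print(ti.queue)  # prints "queue-dl-worker-01"
```

For this to have any effect, each Celery worker would also need to listen on its own queue in addition to any shared one.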

Docker & Compose

  • Dockerfile: Defines the Airflow image, including necessary dependencies like yt-dlp, ffmpeg, and Python packages.
  • Dockerfile.caddy: Defines a Caddy image used as a reverse proxy for serving Airflow static assets.
  • configs/docker-compose-master.yaml.j2: Jinja2 template for the Docker Compose configuration on the Airflow master node.
  • configs/docker-compose-dl.yaml.j2: Jinja2 template for the Docker Compose configuration on the Airflow worker nodes.
  • configs/docker-compose-ytdlp-ops.yaml.j2: Jinja2 template for the Docker Compose configuration for the ytdlp-ops services (Thrift API, Envoy, Camoufox) on both master (management role) and worker nodes.
  • configs/docker-compose.camoufox.yaml.j2: Jinja2 template (auto-generated by generate_envoy_config.py) for the Camoufox browser service definitions.
  • configs/docker-compose.config-generate.yaml: Docker Compose file used to run the generate_envoy_config.py script in a container to create the final service configuration files.
  • generate_envoy_config.py: Script that generates envoy.yaml, docker-compose.camoufox.yaml, and camoufox_endpoints.json based on environment variables.
  • configs/envoy.yaml.j2: Jinja2 template (used by generate_envoy_config.py) for the Envoy proxy configuration.
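
To make the role of generate_envoy_config.py more concrete, here is a minimal sketch of how the per-worker endpoint data could be derived before it is fed into the templates. It is not the actual script. The base Thrift port 9090 matches the interaction flow described in this README; the Camoufox base port, the camoufox-N names, the JSON layout, and the round-robin proxy assignment are all assumptions.

```python
import json
from typing import Dict, List, Optional

BASE_THRIFT_PORT = 9090      # first internal Thrift port, as described in this README
BASE_CAMOUFOX_PORT = 12345   # hypothetical; the real port range is not documented here


def thrift_upstreams(workers: int, base_port: int = BASE_THRIFT_PORT) -> List[Dict[str, object]]:
    """One upstream entry per internal Thrift worker, for the Envoy cluster definition."""
    return [{"address": "127.0.0.1", "port": base_port + i} for i in range(workers)]


def camoufox_endpoints(
    workers: int,
    proxies: List[str],
    base_port: int = BASE_CAMOUFOX_PORT,
) -> Dict[str, Dict[str, Optional[str]]]:
    """One browser endpoint per worker; proxies are assigned round-robin (assumed layout)."""
    endpoints: Dict[str, Dict[str, Optional[str]]] = {}
    for i in range(workers):
        proxy = proxies[i % len(proxies)] if proxies else None
        endpoints[f"camoufox-{i}"] = {
            "ws_endpoint": f"ws://127.0.0.1:{base_port + i}",
            "proxy": proxy,
        }
    return endpoints


if __name__ == "__main__":
    print(json.dumps(thrift_upstreams(3), indent=2))
    print(json.dumps(camoufox_endpoints(3, ["socks5://proxy-a:1080"]), indent=2))
```

The point of the sketch is that a single value (YTDLP_WORKERS) drives both the Envoy upstream list and the number of browser instances, which keeps the three generated files consistent with each other.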

Camoufox (Remote Browsers)

  • camoufox/: Directory containing the Camoufox browser setup.
    • Dockerfile: Defines the Camoufox image.
    • requirements.txt: Python dependencies for the Camoufox server.
    • camoufox_server.py: The core server logic for managing remote browser instances.
    • start_camoufox.sh: Wrapper script to start the Camoufox server with Xvfb and VNC.
    • *.xpi: Browser extensions used by Camoufox.

Deployment Process

Deployment is managed by Ansible playbooks located in the ansible/ directory.

  1. Inventory Generation: The tools/generate-inventory.py script reads cluster.yml and generates ansible/inventory.ini, ansible/host_vars/, and ansible/group_vars/all/generated_vars.yml.
  2. Full Deployment: ansible-playbook playbook-full.yml is the main command. It:
    • Installs prerequisites (Docker, pipx, Glances).
    • Ensures the airflow_proxynet Docker network exists.
    • Imports and runs playbook-master.yml for the master node.
    • Imports and runs playbook-worker.yml for worker nodes.
  3. Master Deployment (playbook-master.yml):
    • Sets system configurations (timezone, NTP, swap, sysctl).
    • Calls the airflow-master role:
      • Syncs files to /srv/airflow_master/.
      • Templates configs/docker-compose-master.yaml.
      • Builds the Airflow image.
      • Extracts static assets and builds the Caddy image.
      • Starts services using docker compose.
    • Calls the ytdlp-master role:
      • Syncs generate_envoy_config.py and templates.
      • Creates .env file.
      • Runs generate_envoy_config.py to create service configs.
      • Creates a dummy docker-compose.camoufox.yaml.
      • Starts ytdlp-ops management services using docker compose.
  4. Worker Deployment (playbook-worker.yml):
    • Sets system configurations (timezone, NTP, swap, sysctl, system limits).
    • Calls the ytdlp-worker role:
      • Syncs files (including camoufox/ directory) to /srv/airflow_dl_worker/.
      • Creates .env file.
      • Runs generate_envoy_config.py to create service configs (including docker-compose.camoufox.yaml).
      • Builds the Camoufox image.
      • Starts ytdlp-ops worker services using docker compose.
    • Calls the airflow-worker role:
      • Syncs files to /srv/airflow_dl_worker/.
      • Templates configs/docker-compose-dl.yaml.
      • Builds the Airflow image.
      • Starts services using docker compose.
    • Verifies Camoufox services are running.
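
The inventory generation step (step 1) can be pictured with the sketch below. It is not tools/generate-inventory.py: the layout of cluster.yml is not documented here, so the input dictionary shape and the group names airflow_master and airflow_workers are assumptions. Only ansible_host is a standard Ansible inventory variable.

```python
from typing import Dict, List

# Hypothetical mapping from inventory group name to the key in the cluster description.
GROUPS = (("airflow_master", "master"), ("airflow_workers", "workers"))


def render_inventory(cluster: Dict[str, Dict[str, str]]) -> str:
    """Render an INI-style Ansible inventory from {"master": {name: ip}, "workers": {name: ip}}."""
    lines: List[str] = []
    for group, key in GROUPS:
        lines.append(f"[{group}]")
        for name, ip in sorted(cluster.get(key, {}).items()):
            lines.append(f"{name} ansible_host={ip}")
        lines.append("")  # blank line between groups
    return "\n".join(lines)


if __name__ == "__main__":
    example = {
        "master": {"master-01": "192.0.2.10"},
        "workers": {"worker-02": "192.0.2.12", "worker-01": "192.0.2.11"},
    }
    print(render_inventory(example))
```

Generating the inventory from one cluster description keeps the host list, host_vars, and group_vars in sync, which is presumably why the real script writes all three.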

Service Interaction Flow (Worker Node)

  1. Airflow Worker: Pulls tasks from the Redis queue.
  2. ytdlp_ops_worker_per_url DAG: Executes tasks on the local worker node.
  3. Thrift Client (in DAG task): Connects to localhost:9080 (Envoy's public port).
  4. Envoy Proxy: Listens on :9080 and load-balances Thrift requests across the internal ports of the local ytdlp-ops-service (for example 9090, 9091, 9092; the number of ports is set by YTDLP_WORKERS).
  5. ytdlp-ops-service: Receives the Thrift request.
  6. Token Generation: If needed, ytdlp-ops-service connects to a local Camoufox instance via WebSocket (using camoufox_endpoints.json for the address) to generate tokens.
  7. Camoufox: Runs a headless Firefox browser, potentially using a SOCKS5 proxy, to interact with YouTube and generate the required tokens.
  8. Download: The DAG task uses the token (via info.json) and potentially the SOCKS5 proxy to run yt-dlp for the actual download.
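
Step 8 can be sketched as a small helper that assembles the yt-dlp command line. The function name and paths are hypothetical and the real DAG task may pass more options; --load-info-json, --paths, and --proxy are existing yt-dlp flags.

```python
from typing import List, Optional


def build_ytdlp_command(
    info_json_path: str,
    output_dir: str,
    proxy_url: Optional[str] = None,
) -> List[str]:
    """Build a yt-dlp invocation that downloads from a previously saved info.json."""
    cmd = ["yt-dlp", "--load-info-json", info_json_path, "--paths", output_dir]
    if proxy_url:
        # Route the download through the same SOCKS5 proxy used for token generation.
        cmd += ["--proxy", proxy_url]
    return cmd
```

The resulting list can be passed to subprocess.run(cmd, check=True). Building the command as a list rather than a shell string avoids quoting problems with paths and proxy URLs.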

Environment Variables

Key environment variables used in .env files (generated by Ansible templates) control service behavior:

  • HOSTNAME: The Ansible inventory hostname.
  • SERVICE_ROLE: management (master) or worker.
  • SERVER_IDENTITY: Unique identifier for the ytdlp-ops-service instance.
  • YTDLP_WORKERS: Number of internal Thrift worker endpoints and Camoufox browser instances.
  • CAMOUFOX_PROXIES: Comma-separated list of SOCKS5 proxy URLs for Camoufox.
  • MASTER_HOST_IP: IP address of the Airflow master node (for connecting back to Redis).
  • Various passwords and ports.
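
A service reading these variables might validate them along the lines of the sketch below. The dataclass and function names are hypothetical, and the default of one worker when YTDLP_WORKERS is unset is an assumption; the variable names and the two valid roles come from the list above.

```python
from dataclasses import dataclass
from typing import List, Mapping

VALID_ROLES = {"management", "worker"}


@dataclass(frozen=True)
class ServiceSettings:
    hostname: str
    service_role: str
    server_identity: str
    ytdlp_workers: int
    camoufox_proxies: List[str]
    master_host_ip: str


def load_settings(env: Mapping[str, str]) -> ServiceSettings:
    """Parse and validate the key variables from an environment-like mapping."""
    role = env["SERVICE_ROLE"]
    if role not in VALID_ROLES:
        raise ValueError(f"SERVICE_ROLE must be one of {sorted(VALID_ROLES)}, got {role!r}")
    workers = int(env.get("YTDLP_WORKERS", "1"))  # default of 1 is an assumption
    if workers < 1:
        raise ValueError("YTDLP_WORKERS must be at least 1")
    # Comma-separated list; empty entries and surrounding whitespace are ignored.
    proxies = [p.strip() for p in env.get("CAMOUFOX_PROXIES", "").split(",") if p.strip()]
    return ServiceSettings(
        hostname=env["HOSTNAME"],
        service_role=role,
        server_identity=env["SERVER_IDENTITY"],
        ytdlp_workers=workers,
        camoufox_proxies=proxies,
        master_host_ip=env["MASTER_HOST_IP"],
    )
```

In a running container this would typically be called as load_settings(os.environ), so a misconfigured .env file fails at startup rather than at the first request.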

This setup is designed to scale by adding worker nodes, with account rotation handled in the worker DAG and proxy usage configured per node.