Airflow Cluster for YT-DLP Operations
This directory contains the configuration and deployment files for an Apache Airflow cluster designed to manage distributed YouTube video downloading tasks using the ytdlp-ops service.
Overview
The cluster consists of:
- Master Node: Runs the Airflow webserver, scheduler, and Flower (Celery monitoring). It also hosts shared services like Redis (broker/backend) and MinIO (artifact storage).
- Worker Nodes: Run Celery workers that execute download tasks. Each worker node also runs the ytdlp-ops-service (Thrift API server), Envoy proxy (load balancer for Thrift traffic), and Camoufox (remote browser instances for token generation).
Key Components
Airflow DAGs
- ytdlp_ops_dispatcher.py: The "Sensor" part of a Sensor/Worker pattern. It monitors a Redis queue for URLs to process and triggers a ytdlp_ops_worker_per_url DAG run for each URL.
- ytdlp_ops_worker_per_url.py: The "Worker" DAG. It processes a single URL passed via DAG run configuration. It implements worker affinity (all tasks for a URL run on the same machine) and handles account management (retrying with different accounts, banning failed accounts based on sliding window checks).
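The dispatcher side of this pattern can be illustrated with a minimal sketch. It is not the code from ytdlp_ops_dispatcher.py: it assumes the queue is a Redis list drained with LPOP, and the queue key `ytdlp:url_queue` and the `url` conf key are hypothetical names. An in-memory stand-in replaces Redis so the example runs without a server.

```python
from typing import Any, Dict, List, Optional, Protocol


class QueueClient(Protocol):
    """Minimal interface used here; redis.Redis offers a compatible lpop()."""

    def lpop(self, name: str) -> Optional[bytes]: ...


def drain_urls(client: QueueClient, queue_key: str, max_items: int = 10) -> List[Dict[str, Any]]:
    """Pop up to max_items URLs and return one DAG-run conf per URL."""
    confs: List[Dict[str, Any]] = []
    for _ in range(max_items):
        raw = client.lpop(queue_key)
        if raw is None:  # queue is empty, nothing left to dispatch
            break
        url = raw.decode("utf-8") if isinstance(raw, bytes) else str(raw)
        confs.append({"url": url})  # "url" is a hypothetical conf key
    return confs


class InMemoryQueue:
    """Stand-in for a Redis list so the sketch runs without a server."""

    def __init__(self, items: List[str]) -> None:
        self._items = [item.encode("utf-8") for item in items]

    def lpop(self, name: str) -> Optional[bytes]:
        return self._items.pop(0) if self._items else None


if __name__ == "__main__":
    queue = InMemoryQueue(["https://example.com/watch?v=a", "https://example.com/watch?v=b"])
    # "ytdlp:url_queue" is a hypothetical key name.
    for conf in drain_urls(queue, "ytdlp:url_queue"):
        print(conf)
```

In a real dispatcher, each returned conf would be handed to whatever mechanism triggers a ytdlp_ops_worker_per_url run, so that every URL gets its own DAG run.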
Configuration Files
- airflow.cfg: Main Airflow configuration file.
- config/airflow_local_settings.py: Contains the task_instance_mutation_hook, which implements worker affinity by dynamically assigning tasks to queues based on the worker node's hostname.
- config/custom_task_hooks.py: Contains the task_instance_mutation_hook (duplicated here, but airflow_local_settings.py is the active one).
- config/redis_default_conn.json.j2: Jinja2 template for the Airflow Redis connection configuration.
- config/minio_default_conn.json.j2: Jinja2 template for the Airflow MinIO connection configuration.
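For readers unfamiliar with Airflow cluster policies, the following is a minimal sketch of how a task_instance_mutation_hook could pin tasks to a per-host queue. It is not the hook shipped in this repository. The `queue-dl-` prefix and the `worker_hostname` conf key are assumptions made for illustration, and the demo uses plain objects in place of real Airflow task instances.

```python
from types import SimpleNamespace

QUEUE_PREFIX = "queue-dl-"  # hypothetical queue naming scheme


def queue_for_host(hostname: str) -> str:
    """Build the name of the Celery queue dedicated to one worker host."""
    return f"{QUEUE_PREFIX}{hostname}"


def task_instance_mutation_hook(task_instance) -> None:
    """Route the task to the queue of the host named in the DAG run conf.

    Airflow calls a function with this name from airflow_local_settings.py
    for each task instance before it is queued. The "worker_hostname" conf
    key is a hypothetical name, not taken from the repository.
    """
    dag_run = getattr(task_instance, "dag_run", None)
    conf = getattr(dag_run, "conf", None) or {}
    hostname = conf.get("worker_hostname")
    if hostname:
        task_instance.queue = queue_for_host(hostname)
    # Without a hostname the task keeps whatever queue it already had.


if __name__ == "__main__":
    # Plain objects stand in for a TaskInstance and its DagRun.
    ti = SimpleNamespace(queue="default", dag_run=SimpleNamespace(conf={"worker_hostname": "dl001"}))
    task_instance_mutation_hook(ti)
    print(ti.queue)
```

For this kind of routing to work, each Celery worker would also have to subscribe to the queue that carries its own hostname.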
Docker & Compose
- Dockerfile: Defines the Airflow image, including necessary dependencies like yt-dlp, ffmpeg, and Python packages.
- Dockerfile.caddy: Defines a Caddy image used as a reverse proxy for serving Airflow static assets.
- configs/docker-compose-master.yaml.j2: Jinja2 template for the Docker Compose configuration on the Airflow master node.
- configs/docker-compose-dl.yaml.j2: Jinja2 template for the Docker Compose configuration on the Airflow worker nodes.
- configs/docker-compose-ytdlp-ops.yaml.j2: Jinja2 template for the Docker Compose configuration for the ytdlp-ops services (Thrift API, Envoy, Camoufox) on both master (management role) and worker nodes.
- configs/docker-compose.camoufox.yaml.j2: Jinja2 template (auto-generated by generate_envoy_config.py) for the Camoufox browser service definitions.
- configs/docker-compose.config-generate.yaml: Docker Compose file used to run the generate_envoy_config.py script in a container to create the final service configuration files.
- generate_envoy_config.py: Script that generates envoy.yaml, docker-compose.camoufox.yaml, and camoufox_endpoints.json based on environment variables.
- configs/envoy.yaml.j2: Jinja2 template (used by generate_envoy_config.py) for the Envoy proxy configuration.
Camoufox (Remote Browsers)
- camoufox/: Directory containing the Camoufox browser setup.
  - Dockerfile: Defines the Camoufox image.
  - requirements.txt: Python dependencies for the Camoufox server.
  - camoufox_server.py: The core server logic for managing remote browser instances.
  - start_camoufox.sh: Wrapper script to start the Camoufox server with Xvfb and VNC.
  - *.xpi: Browser extensions used by Camoufox.
Deployment Process
Deployment is managed by Ansible playbooks located in the ansible/ directory.
- Inventory Generation: The tools/generate-inventory.py script reads cluster.yml and generates ansible/inventory.ini, ansible/host_vars/, and ansible/group_vars/all/generated_vars.yml.
- Full Deployment: ansible-playbook playbook-full.yml is the main command.
  - Installs prerequisites (Docker, pipx, Glances).
  - Ensures the airflow_proxynet Docker network exists.
  - Imports and runs playbook-master.yml for the master node.
  - Imports and runs playbook-worker.yml for worker nodes.
- Master Deployment (playbook-master.yml):
  - Sets system configurations (timezone, NTP, swap, sysctl).
  - Calls airflow-master role:
    - Syncs files to /srv/airflow_master/.
    - Templates configs/docker-compose-master.yaml.
    - Builds the Airflow image.
    - Extracts static assets and builds the Caddy image.
    - Starts services using docker compose.
  - Calls ytdlp-master role:
    - Syncs generate_envoy_config.py and templates.
    - Creates .env file.
    - Runs generate_envoy_config.py to create service configs.
    - Creates a dummy docker-compose.camoufox.yaml.
    - Starts ytdlp-ops management services using docker compose.
- Worker Deployment (playbook-worker.yml):
  - Sets system configurations (timezone, NTP, swap, sysctl, system limits).
  - Calls ytdlp-worker role:
    - Syncs files (including camoufox/ directory) to /srv/airflow_dl_worker/.
    - Creates .env file.
    - Runs generate_envoy_config.py to create service configs (including docker-compose.camoufox.yaml).
    - Builds the Camoufox image.
    - Starts ytdlp-ops worker services using docker compose.
  - Calls airflow-worker role:
    - Syncs files to /srv/airflow_dl_worker/.
    - Templates configs/docker-compose-dl.yaml.
    - Builds the Airflow image.
    - Starts services using docker compose.
  - Verifies Camoufox services are running.
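The first two steps above can be summarized as a short command sequence. The sketch below only builds and prints the commands, it does not run them. It assumes they are launched from the repository root, that the inventory generator accepts the cluster file as a positional argument, and that the playbook lives at ansible/playbook-full.yml. None of these details is confirmed by this README, so check the script and the Ansible setup before relying on them.

```python
import shlex
from typing import List


def deployment_commands(cluster_file: str = "cluster.yml") -> List[List[str]]:
    """Return the argv lists for inventory generation and full deployment."""
    return [
        # Assumption: the generator takes the cluster file as a positional argument.
        ["python3", "tools/generate-inventory.py", cluster_file],
        # -i points ansible-playbook at the inventory written by the previous step.
        # Assumption: the playbook sits directly inside the ansible/ directory.
        ["ansible-playbook", "-i", "ansible/inventory.ini", "ansible/playbook-full.yml"],
    ]


if __name__ == "__main__":
    for argv in deployment_commands():
        print(shlex.join(argv))
```

Generating the inventory first matters because the playbook reads the host and group variables that the generator writes.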
Service Interaction Flow (Worker Node)
- Airflow Worker: Pulls tasks from the Redis queue.
- ytdlp_ops_worker_per_url DAG: Executes tasks on the local worker node.
- Thrift Client (in DAG task): Connects to localhost:9080 (Envoy's public port).
- Envoy Proxy: Listens on :9080, load balances Thrift requests across internal ports (9090, 9091, 9092, based on YTDLP_WORKERS) of the local ytdlp-ops-service.
- ytdlp-ops-service: Receives the Thrift request.
- Token Generation: If needed, ytdlp-ops-service connects to a local Camoufox instance via WebSocket (using camoufox_endpoints.json for the address) to generate tokens.
- Camoufox: Runs a headless Firefox browser, potentially using a SOCKS5 proxy, to interact with YouTube and generate the required tokens.
- Download: The DAG task uses the token (via info.json) and potentially the SOCKS5 proxy to run yt-dlp for the actual download.
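The relationship between YTDLP_WORKERS and the internal Thrift ports can be made concrete with a small sketch. It assumes the ports are consecutive and start at 9090, which matches the example ports listed in the flow but is not confirmed as the rule used by generate_envoy_config.py.

```python
from typing import List

ENVOY_PUBLIC_PORT = 9080   # port the Thrift client connects to
BASE_INTERNAL_PORT = 9090  # first internal port listed in the flow


def internal_ports(workers: int, base_port: int = BASE_INTERNAL_PORT) -> List[int]:
    """Return the internal ports Envoy would balance across.

    Assumes one consecutive port per worker, starting at base_port.
    """
    if workers < 1:
        raise ValueError("YTDLP_WORKERS must be at least 1")
    return [base_port + offset for offset in range(workers)]


if __name__ == "__main__":
    # With YTDLP_WORKERS=3 this yields the three ports named in the flow.
    print(ENVOY_PUBLIC_PORT, "->", internal_ports(3))
```

Under this assumption, changing YTDLP_WORKERS changes the number of upstream endpoints while clients keep connecting to the same public port.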
Environment Variables
Key environment variables used in .env files (generated by Ansible templates) control service behavior:
- HOSTNAME: The Ansible inventory hostname.
- SERVICE_ROLE: management (master) or worker.
- SERVER_IDENTITY: Unique identifier for the ytdlp-ops-service instance.
- YTDLP_WORKERS: Number of internal Thrift worker endpoints and Camoufox browser instances.
- CAMOUFOX_PROXIES: Comma-separated list of SOCKS5 proxy URLs for Camoufox.
- MASTER_HOST_IP: IP address of the Airflow master node (for connecting back to Redis).
- Various passwords and ports.
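As an illustration of how a service might consume these variables, here is a minimal parsing sketch. The `ServiceSettings` class and `parse_settings` function are hypothetical and do not come from this repository. The default of one worker is likewise an assumption, and the proxy URLs in the demo are placeholders.

```python
from dataclasses import dataclass
from typing import List, Mapping

VALID_ROLES = ("management", "worker")


@dataclass(frozen=True)
class ServiceSettings:
    hostname: str
    service_role: str
    ytdlp_workers: int
    camoufox_proxies: List[str]


def parse_settings(env: Mapping[str, str]) -> ServiceSettings:
    """Validate and convert the raw string values from a .env mapping."""
    role = env["SERVICE_ROLE"]
    if role not in VALID_ROLES:
        raise ValueError(f"SERVICE_ROLE must be one of {VALID_ROLES}, got {role!r}")
    # CAMOUFOX_PROXIES is comma-separated; drop blanks and surrounding spaces.
    raw_proxies = env.get("CAMOUFOX_PROXIES", "")
    proxies = [item.strip() for item in raw_proxies.split(",") if item.strip()]
    return ServiceSettings(
        hostname=env["HOSTNAME"],
        service_role=role,
        ytdlp_workers=int(env.get("YTDLP_WORKERS", "1")),  # default of 1 is an assumption
        camoufox_proxies=proxies,
    )


if __name__ == "__main__":
    print(parse_settings({
        "HOSTNAME": "dl001",
        "SERVICE_ROLE": "worker",
        "YTDLP_WORKERS": "3",
        "CAMOUFOX_PROXIES": "socks5://10.0.0.1:1080, socks5://10.0.0.2:1080",
    }))
```

Validating SERVICE_ROLE at startup surfaces a mistyped role immediately instead of as a misconfigured service later.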
This setup allows for a scalable and robust system for managing YouTube downloads with account rotation and proxy usage.