# Airflow Cluster for YT-DLP Operations

This directory contains the configuration and deployment files for an Apache Airflow cluster designed to manage distributed YouTube video downloading tasks using the `ytdlp-ops` service.

## Overview

The cluster consists of:

- **Master Node:** Runs the Airflow webserver, scheduler, and Flower (Celery monitoring). It also hosts shared services such as Redis (Celery broker/result backend) and MinIO (artifact storage).
- **Worker Nodes:** Run Celery workers that execute download tasks. Each worker node also runs the `ytdlp-ops-service` (Thrift API server), an Envoy proxy (load balancer for Thrift traffic), and Camoufox (remote browser instances for token generation).

## Key Components

### Airflow DAGs

- `ytdlp_ops_dispatcher.py`: The "Sensor" half of a Sensor/Worker pattern. It monitors a Redis queue for URLs to process and triggers one `ytdlp_ops_worker_per_url` DAG run per URL.
- `ytdlp_ops_worker_per_url.py`: The "Worker" DAG. It processes a single URL passed in via the DAG run configuration, implements worker affinity (all tasks for a URL run on the same machine), and handles account management (retrying with different accounts and banning failing accounts based on sliding-window checks).

### Configuration Files

- `airflow.cfg`: Main Airflow configuration file.
- `config/airflow_local_settings.py`: Contains the `task_instance_mutation_hook` that implements worker affinity by dynamically assigning tasks to queues based on the worker node's hostname.
- `config/custom_task_hooks.py`: A duplicate of the `task_instance_mutation_hook`; `airflow_local_settings.py` is the active copy.
- `config/redis_default_conn.json.j2`: Jinja2 template for the Airflow Redis connection configuration.
- `config/minio_default_conn.json.j2`: Jinja2 template for the Airflow MinIO connection configuration.

### Docker & Compose

- `Dockerfile`: Defines the Airflow image, including necessary dependencies such as `yt-dlp`, `ffmpeg`, and Python packages.
- `Dockerfile.caddy`: Defines a Caddy image used as a reverse proxy for serving Airflow static assets.
- `configs/docker-compose-master.yaml.j2`: Jinja2 template for the Docker Compose configuration on the Airflow master node.
- `configs/docker-compose-dl.yaml.j2`: Jinja2 template for the Docker Compose configuration on the Airflow worker nodes.
- `configs/docker-compose-ytdlp-ops.yaml.j2`: Jinja2 template for the Docker Compose configuration of the `ytdlp-ops` services (Thrift API, Envoy, Camoufox) on both the master (management role) and worker nodes.
- `configs/docker-compose.camoufox.yaml.j2`: Jinja2 template (auto-generated by `generate_envoy_config.py`) for the Camoufox browser service definitions.
- `configs/docker-compose.config-generate.yaml`: Docker Compose file used to run the `generate_envoy_config.py` script in a container to create the final service configuration files.
- `generate_envoy_config.py`: Script that generates `envoy.yaml`, `docker-compose.camoufox.yaml`, and `camoufox_endpoints.json` from environment variables.
- `configs/envoy.yaml.j2`: Jinja2 template (used by `generate_envoy_config.py`) for the Envoy proxy configuration.

### Camoufox (Remote Browsers)

- `camoufox/`: Directory containing the Camoufox browser setup.
  - `Dockerfile`: Defines the Camoufox image.
  - `requirements.txt`: Python dependencies for the Camoufox server.
  - `camoufox_server.py`: The core server logic for managing remote browser instances.
  - `start_camoufox.sh`: Wrapper script that starts the Camoufox server with Xvfb and VNC.
  - `*.xpi`: Browser extensions used by Camoufox.

## Deployment Process

Deployment is managed by Ansible playbooks located in the `ansible/` directory.

1. **Inventory Generation:** The `tools/generate-inventory.py` script reads `cluster.yml` and generates `ansible/inventory.ini`, `ansible/host_vars/`, and `ansible/group_vars/all/generated_vars.yml`.
2. **Full Deployment:** `ansible-playbook playbook-full.yml` is the main command.
   - Installs prerequisites (Docker, pipx, Glances).
   - Ensures the `airflow_proxynet` Docker network exists.
   - Imports and runs `playbook-master.yml` for the master node.
   - Imports and runs `playbook-worker.yml` for the worker nodes.
3. **Master Deployment (`playbook-master.yml`):**
   - Sets system configurations (timezone, NTP, swap, sysctl).
   - Calls the `airflow-master` role:
     - Syncs files to `/srv/airflow_master/`.
     - Templates `configs/docker-compose-master.yaml`.
     - Builds the Airflow image.
     - Extracts static assets and builds the Caddy image.
     - Starts services using `docker compose`.
   - Calls the `ytdlp-master` role:
     - Syncs `generate_envoy_config.py` and templates.
     - Creates the `.env` file.
     - Runs `generate_envoy_config.py` to create service configs.
     - Creates a dummy `docker-compose.camoufox.yaml`.
     - Starts the `ytdlp-ops` management services using `docker compose`.
4. **Worker Deployment (`playbook-worker.yml`):**
   - Sets system configurations (timezone, NTP, swap, sysctl, system limits).
   - Calls the `ytdlp-worker` role:
     - Syncs files (including the `camoufox/` directory) to `/srv/airflow_dl_worker/`.
     - Creates the `.env` file.
     - Runs `generate_envoy_config.py` to create service configs (including `docker-compose.camoufox.yaml`).
     - Builds the Camoufox image.
     - Starts the `ytdlp-ops` worker services using `docker compose`.
   - Calls the `airflow-worker` role:
     - Syncs files to `/srv/airflow_dl_worker/`.
     - Templates `configs/docker-compose-dl.yaml`.
     - Builds the Airflow image.
     - Starts services using `docker compose`.
     - Verifies that the Camoufox services are running.

## Service Interaction Flow (Worker Node)

1. **Airflow Worker:** Pulls tasks from the Redis queue.
2. **`ytdlp_ops_worker_per_url` DAG:** Executes tasks on the local worker node.
3. **Thrift Client (in a DAG task):** Connects to `localhost:9080` (Envoy's public port).
4. **Envoy Proxy:** Listens on `:9080` and load-balances Thrift requests across the internal ports (`9090`, `9091`, `9092`, depending on `YTDLP_WORKERS`) of the local `ytdlp-ops-service`.
5. **`ytdlp-ops-service`:** Receives the Thrift request.
6. **Token Generation:** If needed, `ytdlp-ops-service` connects to a local Camoufox instance via WebSocket (using `camoufox_endpoints.json` to find the address) to generate tokens.
7. **Camoufox:** Runs a headless Firefox browser, optionally through a SOCKS5 proxy, to interact with YouTube and generate the required tokens.
8. **Download:** The DAG task uses the token (via `info.json`), and optionally the same SOCKS5 proxy, to run `yt-dlp` for the actual download.

## Environment Variables

Key environment variables used in the `.env` files (generated by Ansible templates) control service behavior:

- `HOSTNAME`: The Ansible inventory hostname.
- `SERVICE_ROLE`: `management` (master) or `worker`.
- `SERVER_IDENTITY`: Unique identifier for the `ytdlp-ops-service` instance.
- `YTDLP_WORKERS`: Number of internal Thrift worker endpoints and Camoufox browser instances.
- `CAMOUFOX_PROXIES`: Comma-separated list of SOCKS5 proxy URLs for Camoufox.
- `MASTER_HOST_IP`: IP address of the Airflow master node (used to connect back to Redis).
- Various passwords and ports.

This setup provides a scalable and robust system for managing YouTube downloads with account rotation and proxy usage.
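As an illustration, a worker-node `.env` might look like the following. All values here are invented for this example; the real files are rendered from the Ansible templates and also contain the passwords and ports mentioned above.

```ini
# Example worker-node .env (illustrative values only)
HOSTNAME=dl-worker-01
SERVICE_ROLE=worker
SERVER_IDENTITY=dl-worker-01
YTDLP_WORKERS=3
CAMOUFOX_PROXIES=socks5://proxy1.example:1080,socks5://proxy2.example:1080
MASTER_HOST_IP=10.0.0.10
```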
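The worker-affinity mechanism can be sketched as below. This is a hypothetical illustration of the hook in `config/airflow_local_settings.py`, not the actual code: the `queue-<hostname>` naming scheme and the use of `task_instance.hostname` are assumptions made for this sketch.

```python
# Hypothetical sketch of the worker-affinity hook described for
# config/airflow_local_settings.py. The "queue-<hostname>" naming scheme and
# the reliance on task_instance.hostname are assumptions for illustration.
def task_instance_mutation_hook(task_instance):
    """Route a task to the queue of the worker node handling its URL."""
    hostname = getattr(task_instance, "hostname", "")
    if hostname:
        # Each worker node listens on a queue named after its own hostname,
        # so every task for a given URL ends up on the same machine.
        task_instance.queue = f"queue-{hostname}"
```

Airflow calls this hook before each task instance is queued, which is what lets retries and downstream tasks stay pinned to one node.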
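The Sensor/Worker dispatch step performed by `ytdlp_ops_dispatcher.py` reduces to a simple loop. The sketch below is a simplification, not the real DAG code: the queue name, the `conf` key, and the injected `pop_url`/`trigger_dag_run` callables are assumptions standing in for the Redis connection and Airflow's DAG-trigger API.

```python
# Simplified sketch of the dispatch loop in ytdlp_ops_dispatcher.py.
# pop_url and trigger_dag_run stand in for the Redis client and Airflow's
# trigger mechanism; queue name and conf keys are assumptions.
def dispatch_pending_urls(pop_url, trigger_dag_run, queue="ytdlp:urls"):
    """Drain the URL queue, starting one worker DAG run per URL."""
    triggered = []
    while (url := pop_url(queue)) is not None:
        # Each URL becomes an independent DAG run; the URL travels in conf,
        # which ytdlp_ops_worker_per_url reads from its run configuration.
        trigger_dag_run("ytdlp_ops_worker_per_url", conf={"url": url})
        triggered.append(url)
    return triggered
```

One DAG run per URL keeps failure isolation per video and lets the affinity hook pin all of a URL's tasks to one node.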
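The per-worker endpoint derivation done by `generate_envoy_config.py` can be illustrated as follows. This is not the actual script: it only shows the mapping from `YTDLP_WORKERS` to one internal Thrift port per worker (the interaction flow above lists `9090`-`9092` for three workers) plus one Camoufox WebSocket endpoint each. The Camoufox base port (`12345` here) and the exact output shape are assumptions.

```python
# Illustrative sketch (not the real generate_envoy_config.py) of deriving
# per-worker endpoints from YTDLP_WORKERS. Thrift ports start at 9090 as in
# the interaction flow above; the Camoufox base port 12345 is an assumption.
def build_endpoints(workers: int, thrift_base: int = 9090, ws_base: int = 12345):
    return {
        # Internal Thrift ports that Envoy load-balances :9080 across.
        "thrift_ports": [thrift_base + i for i in range(workers)],
        # One local Camoufox WebSocket endpoint per worker, as recorded in
        # camoufox_endpoints.json.
        "camoufox_endpoints": [
            f"ws://127.0.0.1:{ws_base + i}" for i in range(workers)
        ],
    }
```

The real script renders this information into `envoy.yaml`, `docker-compose.camoufox.yaml`, and `camoufox_endpoints.json`.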