# Airflow Cluster for YT-DLP Operations
This directory contains the configuration and deployment files for an Apache Airflow cluster designed to manage distributed YouTube video downloading tasks using the `ytdlp-ops` service.
## Overview
The cluster consists of:
- **Master Node:** Runs the Airflow webserver, scheduler, and Flower (Celery monitoring). It also hosts shared services like Redis (broker/backend) and MinIO (artifact storage).
- **Worker Nodes:** Run Celery workers that execute download tasks. Each worker node also runs the `ytdlp-ops-service` (Thrift API server), Envoy proxy (load balancer for Thrift traffic), and Camoufox (remote browser instances for token generation).
## Key Components
### Airflow DAGs
- `ytdlp_ops_dispatcher.py`: The "Sensor" part of a Sensor/Worker pattern. It monitors a Redis queue for URLs to process and triggers a `ytdlp_ops_worker_per_url` DAG run for each URL.
- `ytdlp_ops_worker_per_url.py`: The "Worker" DAG. It processes a single URL passed via the DAG run configuration, implements worker affinity (all tasks for a URL run on the same machine), and handles account management (retrying with different accounts and banning failing accounts based on sliding-window checks).
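The Sensor/Worker hand-off can be sketched as follows. This is a minimal, framework-free sketch: the real dispatcher triggers runs through Airflow (e.g. `TriggerDagRunOperator`) and reads from Redis, and the exact `conf` payload key (`url` here) is an assumption.

```python
# Hypothetical helper illustrating the dispatcher's fan-out: one
# ytdlp_ops_worker_per_url DAG run per URL popped from the queue.

def build_dag_run_confs(queued_urls):
    """Build the per-URL DAG run requests the dispatcher would trigger."""
    return [
        {"dag_id": "ytdlp_ops_worker_per_url", "conf": {"url": url}}
        for url in queued_urls
    ]

# Each queued URL becomes an independent, single-URL DAG run.
runs = build_dag_run_confs(["https://youtu.be/abc", "https://youtu.be/xyz"])
print(len(runs))
```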
### Configuration Files
- `airflow.cfg`: Main Airflow configuration file.
- `config/airflow_local_settings.py`: Contains the `task_instance_mutation_hook` which implements worker affinity by dynamically assigning tasks to queues based on the worker node's hostname.
- `config/custom_task_hooks.py`: A duplicate copy of the `task_instance_mutation_hook`; `airflow_local_settings.py` is the copy Airflow actually loads.
- `config/redis_default_conn.json.j2`: Jinja2 template for the Airflow Redis connection configuration.
- `config/minio_default_conn.json.j2`: Jinja2 template for the Airflow MinIO connection configuration.
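As a rough illustration of the affinity mechanism, a `task_instance_mutation_hook` along these lines pins every task to a host-specific Celery queue (the queue naming scheme here is an assumption; the real hook lives in `config/airflow_local_settings.py`):

```python
# Sketch of a worker-affinity mutation hook. Airflow calls this hook
# before a task instance is queued, allowing the queue to be rewritten.
import socket

def task_instance_mutation_hook(task_instance):
    """Route the task to this node's own queue so all tasks for one
    URL execute on the same machine (queue name is an assumption)."""
    task_instance.queue = f"queue-{socket.gethostname()}"
```

Each worker's Celery process would then subscribe only to its own `queue-<hostname>` queue, so a DAG run stays on the node that picked it up.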
### Docker & Compose
- `Dockerfile`: Defines the Airflow image, including necessary dependencies like `yt-dlp`, `ffmpeg`, and Python packages.
- `Dockerfile.caddy`: Defines a Caddy image used as a reverse proxy for serving Airflow static assets.
- `configs/docker-compose-master.yaml.j2`: Jinja2 template for the Docker Compose configuration on the Airflow master node.
- `configs/docker-compose-dl.yaml.j2`: Jinja2 template for the Docker Compose configuration on the Airflow worker nodes.
- `configs/docker-compose-ytdlp-ops.yaml.j2`: Jinja2 template for the Docker Compose configuration for the `ytdlp-ops` services (Thrift API, Envoy, Camoufox) on both master (management role) and worker nodes.
- `configs/docker-compose.camoufox.yaml.j2`: Jinja2 template used by `generate_envoy_config.py` to produce the Camoufox browser service definitions.
- `configs/docker-compose.config-generate.yaml`: Docker Compose file used to run the `generate_envoy_config.py` script in a container to create the final service configuration files.
- `generate_envoy_config.py`: Script that generates `envoy.yaml`, `docker-compose.camoufox.yaml`, and `camoufox_endpoints.json` based on environment variables.
- `configs/envoy.yaml.j2`: Jinja2 template (used by `generate_envoy_config.py`) for the Envoy proxy configuration.
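To illustrate the generator's job: it presumably derives one internal Thrift port and one Camoufox endpoint per worker from `YTDLP_WORKERS`. The sketch below assumes the port bases (`9090` for Thrift, matching the service flow described later; `12345` for Camoufox WebSockets) and the JSON layout, neither of which is confirmed by this document.

```python
# Hypothetical sketch of the endpoint derivation inside
# generate_envoy_config.py: YTDLP_WORKERS -> per-worker endpoints.
import json
import os

def build_endpoints(n_workers, thrift_base=9090, camoufox_base=12345):
    """Return one Thrift port and one Camoufox WebSocket URL per worker."""
    return {
        "thrift_ports": [thrift_base + i for i in range(n_workers)],
        "camoufox": [
            f"ws://127.0.0.1:{camoufox_base + i}" for i in range(n_workers)
        ],
    }

n = int(os.environ.get("YTDLP_WORKERS", "3"))
print(json.dumps(build_endpoints(n), indent=2))
```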
### Camoufox (Remote Browsers)
- `camoufox/`: Directory containing the Camoufox browser setup.
  - `Dockerfile`: Defines the Camoufox image.
  - `requirements.txt`: Python dependencies for the Camoufox server.
  - `camoufox_server.py`: The core server logic for managing remote browser instances.
  - `start_camoufox.sh`: Wrapper script to start the Camoufox server with Xvfb and VNC.
  - `*.xpi`: Browser extensions used by Camoufox.
## Deployment Process
Deployment is managed by Ansible playbooks located in the `ansible/` directory.
1. **Inventory Generation:** The `tools/generate-inventory.py` script reads `cluster.yml` and generates `ansible/inventory.ini`, `ansible/host_vars/`, and `ansible/group_vars/all/generated_vars.yml`.
2. **Full Deployment:** `ansible-playbook playbook-full.yml` is the main command.
- Installs prerequisites (Docker, pipx, Glances).
- Ensures the `airflow_proxynet` Docker network exists.
- Imports and runs `playbook-master.yml` for the master node.
- Imports and runs `playbook-worker.yml` for worker nodes.
3. **Master Deployment (`playbook-master.yml`):**
- Sets system configurations (timezone, NTP, swap, sysctl).
- Calls `airflow-master` role:
- Syncs files to `/srv/airflow_master/`.
- Templates `configs/docker-compose-master.yaml`.
- Builds the Airflow image.
- Extracts static assets and builds the Caddy image.
- Starts services using `docker compose`.
- Calls `ytdlp-master` role:
- Syncs `generate_envoy_config.py` and templates.
- Creates `.env` file.
- Runs `generate_envoy_config.py` to create service configs.
- Creates a dummy `docker-compose.camoufox.yaml`.
- Starts `ytdlp-ops` management services using `docker compose`.
4. **Worker Deployment (`playbook-worker.yml`):**
- Sets system configurations (timezone, NTP, swap, sysctl, system limits).
- Calls `ytdlp-worker` role:
- Syncs files (including `camoufox/` directory) to `/srv/airflow_dl_worker/`.
- Creates `.env` file.
- Runs `generate_envoy_config.py` to create service configs (including `docker-compose.camoufox.yaml`).
- Builds the Camoufox image.
- Starts `ytdlp-ops` worker services using `docker compose`.
- Calls `airflow-worker` role:
- Syncs files to `/srv/airflow_dl_worker/`.
- Templates `configs/docker-compose-dl.yaml`.
- Builds the Airflow image.
- Starts services using `docker compose`.
- Verifies Camoufox services are running.
## Service Interaction Flow (Worker Node)
1. **Airflow Worker:** Pulls tasks from the Redis queue.
2. **`ytdlp_ops_worker_per_url` DAG:** Executes tasks on the local worker node.
3. **Thrift Client (in DAG task):** Connects to `localhost:9080` (Envoy's public port).
4. **Envoy Proxy:** Listens on `:9080` and load balances Thrift requests across the internal ports of the local `ytdlp-ops-service` (`9090`, `9091`, `9092`, with one port per worker as set by `YTDLP_WORKERS`).
5. **`ytdlp-ops-service`:** Receives the Thrift request.
6. **Token Generation:** If needed, `ytdlp-ops-service` connects to a local Camoufox instance via WebSocket (using `camoufox_endpoints.json` for the address) to generate tokens.
7. **Camoufox:** Runs a headless Firefox browser, potentially using a SOCKS5 proxy, to interact with YouTube and generate the required tokens.
8. **Download:** The DAG task uses the token (via `info.json`) and potentially the SOCKS5 proxy to run `yt-dlp` for the actual download.
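Step 8 can be illustrated with a small helper that builds the `yt-dlp` invocation. `--load-info-json` and `--proxy` are standard yt-dlp flags; the helper name and paths are placeholders.

```python
# Sketch of the final download step: yt-dlp is fed the pre-fetched
# info.json (which carries the generated token) and, optionally, the
# same SOCKS5 proxy used by Camoufox.

def build_ytdlp_cmd(info_json_path, proxy=None):
    """Assemble the yt-dlp command line for the actual download."""
    cmd = ["yt-dlp", "--load-info-json", info_json_path]
    if proxy:
        cmd += ["--proxy", proxy]  # e.g. socks5://127.0.0.1:1080
    return cmd

# In the DAG task this list would be passed to subprocess.run(..., check=True).
print(build_ytdlp_cmd("/tmp/info.json", proxy="socks5://127.0.0.1:1080"))
```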
## Environment Variables
Key environment variables used in `.env` files (generated by Ansible templates) control service behavior:
- `HOSTNAME`: The Ansible inventory hostname.
- `SERVICE_ROLE`: `management` (master) or `worker`.
- `SERVER_IDENTITY`: Unique identifier for the `ytdlp-ops-service` instance.
- `YTDLP_WORKERS`: Number of internal Thrift worker endpoints and Camoufox browser instances.
- `CAMOUFOX_PROXIES`: Comma-separated list of SOCKS5 proxy URLs for Camoufox.
- `MASTER_HOST_IP`: IP address of the Airflow master node (for connecting back to Redis).
- Various passwords and ports.
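For example, a service consuming `CAMOUFOX_PROXIES` might split it like this (the helper is hypothetical; only the comma-separated format comes from this document):

```python
# Hypothetical parser for the comma-separated CAMOUFOX_PROXIES variable.
import os

def parse_proxies(raw):
    """Split a comma-separated proxy list, dropping blanks and whitespace."""
    return [p.strip() for p in raw.split(",") if p.strip()]

os.environ["CAMOUFOX_PROXIES"] = "socks5://a:1080, socks5://b:1080"
print(parse_proxies(os.environ["CAMOUFOX_PROXIES"]))
```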
Together, these components provide a scalable, robust system for managing YouTube downloads with account rotation and proxy support.