Added DAGs for queue management and Camoufox support

parent: 1f092d6f80
commit: 6989d49da3

97  README-ytdlp-ops-auth.md  (new file)
@@ -0,0 +1,97 @@
|
||||
# YTDLP Client Side Integration
|
||||
|
||||
This document describes how to integrate and use the YTDLP client with the token service.
|
||||
|
||||
## Build
|
||||
|
||||
1. **Pull, configure, and start the server if needed:**
|
||||
```bash
|
||||
cd /srv/airflow_worker/
|
||||
docker login pangramia # Usually done beforehand; otherwise ask for the pull password
|
||||
docker compose -f docker-compose-ytdlp-ops.yaml up -d
|
||||
docker compose -f docker-compose-ytdlp-ops.yaml logs -f
|
||||
```
|
||||
The server is bound to a specific proxy, e.g. "socks5://sslocal-rust-1084:1084".

Also check that Redis is bound to 0.0.0.0 in its config.

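If in doubt, a quick reachability check with `redis-py` from another host confirms the bind. This is only a sketch; the host, port, and password are placeholders for this deployment (they match the values used in the Redis check at the end of this document):

```python
# Minimal Redis reachability check; host, port, and password are placeholders.
import redis

client = redis.Redis(host="89.253.221.173", port=52909, password="XXXXXX", socket_timeout=5)
try:
    client.ping()
    print("Redis is reachable")
except redis.exceptions.RedisError as exc:
    print(f"Redis is not reachable: {exc}")
```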
2. **Build airflow-worker with custom dependencies:**
|
||||
```bash
|
||||
cd /srv/airflow_worker/
|
||||
docker compose build airflow-worker
|
||||
docker compose down airflow-worker
|
||||
docker compose up -d --no-deps airflow-worker
|
||||
```
|
||||
|
||||
3. **Test the built-in client:**
|
||||
```bash
|
||||
# Show client help
|
||||
docker compose exec airflow-worker python /app/ytdlp_ops_client.py --help
|
||||
|
||||
# Get token and info.json
|
||||
docker compose exec airflow-worker python /app/ytdlp_ops_client.py --host 85.192.30.55 --port 9090 getToken --url 'https://www.youtube.com/watch?v=vKTVLpmvznI'
|
||||
|
||||
# List formats using saved info.json
|
||||
docker compose exec airflow-worker yt-dlp --load-info-json "latest.json" -F
|
||||
|
||||
# Simulate download using saved info.json
|
||||
docker compose exec airflow-worker yt-dlp --load-info-json "latest.json" --proxy "socks5://sslocal-rust-1084:1084" --simulate --verbose
|
||||
|
||||
# Extract metadata and download URLs using jq
|
||||
docker compose exec airflow-worker jq -r '"Title: \(.title)", "Date: \(.upload_date | strptime("%Y%m%d") | strftime("%Y-%m-%d"))", "Author: \(.uploader)", "Length: \(.duration_string)", "", "Download URLs:", (.formats[] | select(.vcodec != "none" or .acodec != "none") | .url)' latest.json
|
||||
```
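For reference, the same metadata extraction can be done with a few lines of Python instead of `jq`; this sketch only assumes the saved `latest.json` is in the current directory:

```python
# Print basic metadata and direct media URLs from a saved yt-dlp info.json.
import json
from datetime import datetime

with open("latest.json", encoding="utf-8") as f:
    info = json.load(f)

upload_date = datetime.strptime(info["upload_date"], "%Y%m%d").strftime("%Y-%m-%d")
print(f"Title:  {info.get('title')}")
print(f"Date:   {upload_date}")
print(f"Author: {info.get('uploader')}")
print(f"Length: {info.get('duration_string')}")
print("Download URLs:")
for fmt in info.get("formats", []):
    # Keep only formats that carry audio and/or video (same filter as the jq command).
    if fmt.get("vcodec") != "none" or fmt.get("acodec") != "none":
        print(fmt.get("url"))
```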
|
||||
|
||||
4. **Test Airflow task:**
|
||||
|
||||
To run the `ytdlp_client_dag_v2.1` DAG:
|
||||
|
||||
Set up the required Airflow variables and connections:
|
||||
```bash
|
||||
docker compose exec airflow-worker airflow variables set DOWNLOAD_OPTIONS '{"formats": ["bestvideo[height<=1080]+bestaudio/best[height<=1080]"]}'
|
||||
docker compose exec airflow-worker airflow variables set DOWNLOADS_TEMP '/opt/airflow/downloadfiles'
|
||||
docker compose exec airflow-worker airflow variables set DOWNLOADS_PATH '/opt/airflow/downloadfiles'
|
||||
|
||||
docker compose exec airflow-worker airflow variables list
|
||||
docker compose exec airflow-worker airflow variables set TOKEN_TIMEOUT '300'
|
||||
|
||||
docker compose exec airflow-worker airflow connections import /opt/airflow/config/docker_hub_repo.json
|
||||
docker compose exec airflow-worker airflow connections delete redis_default
|
||||
docker compose exec airflow-worker airflow connections import /opt/airflow/config/redis_default_conn.json
|
||||
```
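Inside the DAGs these values are read back via `Variable.get` or the `var.value` Jinja accessor; a minimal sketch using the variable names set above:

```python
# How the variables set above are typically consumed in DAG code.
import json
from airflow.models import Variable

downloads_temp = Variable.get("DOWNLOADS_TEMP", default_var="/opt/airflow/downloadfiles")
download_options = json.loads(Variable.get("DOWNLOAD_OPTIONS", default_var='{"formats": []}'))
token_timeout = int(Variable.get("TOKEN_TIMEOUT", default_var="300"))

# Equivalent Jinja-templated access inside an operator argument:
info_json_dir_template = "{{ var.value.get('DOWNLOADS_TEMP', '/opt/airflow/downloadfiles') }}"
```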
|
||||
|
||||
|
||||
**Using a direct connection with `airflow tasks test`:**
|
||||
```bash
|
||||
docker compose exec airflow-worker airflow db reset
|
||||
docker compose exec airflow-worker airflow dags reserialize
|
||||
|
||||
docker compose exec airflow-worker airflow dags list
|
||||
docker compose exec airflow-worker airflow dags list-import-errors
|
||||
docker compose exec airflow-worker airflow tasks test ytdlp_client_dag_v2.1 get_token $(date -u +"%Y-%m-%dT%H:%M:%S+00:00") --task-params '{"url": "https://www.youtube.com/watch?v=sOlTX9uxUtM", "redis_enabled": false, "service_ip": "85.192.30.55", "service_port": 9090}'
|
||||
docker compose exec airflow-worker yt-dlp --load-info-json /opt/airflow/downloadfiles/latest.json --proxy "socks5://sslocal-rust-1084:1084" --verbose --simulate
|
||||
|
||||
docker compose exec airflow-worker airflow dags list-runs -d ytdlp_client_dag
|
||||
```
|
||||
|
||||
|
||||
Or deploy by triggering the DAG:
|
||||
```bash
|
||||
docker compose exec airflow-worker airflow dags list
|
||||
docker compose exec airflow-worker airflow dags unpause ytdlp_client_dag_v2.1
|
||||
|
||||
# Alternatively, trigger from the UI, or recheck that it works from the server deployment
|
||||
docker compose exec airflow-worker airflow dags trigger ytdlp_client_dag_v2.1 -c '{"url": "https://www.youtube.com/watch?v=sOlTX9uxUtM", "redis_enabled": false, "service_ip": "85.192.30.55", "service_port": 9090}'
|
||||
|
||||
```
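The same trigger can also be issued through the Airflow stable REST API; this sketch assumes the API is reachable on port 8080 with basic auth enabled (base URL and credentials are placeholders):

```python
# Trigger ytdlp_client_dag_v2.1 via the Airflow REST API (v1); URL and credentials are placeholders.
import requests

payload = {
    "conf": {
        "url": "https://www.youtube.com/watch?v=sOlTX9uxUtM",
        "redis_enabled": False,
        "service_ip": "85.192.30.55",
        "service_port": 9090,
    }
}
resp = requests.post(
    "http://localhost:8080/api/v1/dags/ytdlp_client_dag_v2.1/dagRuns",
    json=payload,
    auth=("airflow", "airflow"),  # placeholder credentials
    timeout=30,
)
resp.raise_for_status()
print(resp.json()["dag_run_id"])
```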
|
||||
|
||||
|
||||
Check Redis for data stored by video ID:
|
||||
```bash
|
||||
docker compose exec redis redis-cli -a XXXXXX -h 89.253.221.173 -p 52909 HGETALL "token_info:sOlTX9uxUtM" | jq -R -s 'split("\n") | del(.[] | select(. == "")) | [.[range(0;length;2)]]'
|
||||
```
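The same hash can be read from Python; a short `redis-py` sketch using the connection details from the command above:

```python
# Inspect the token_info hash stored for a processed video ID.
import json
import redis

client = redis.Redis(host="89.253.221.173", port=52909, password="XXXXXX", decode_responses=True)
for field, value in client.hgetall("token_info:sOlTX9uxUtM").items():
    # Pretty-print values that were stored as JSON; show anything else raw.
    try:
        print(field, json.dumps(json.loads(value), indent=2))
    except (json.JSONDecodeError, TypeError):
        print(field, value)
```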
|
||||
|
||||
169  README.md
@@ -1,97 +1,100 @@
|
||||
# YTDLP Client Side Integration
|
||||
# YTDLP Airflow DAGs
|
||||
|
||||
This document describes how to integrate and use the YTDLP client with the token service.
|
||||
This document describes the Airflow DAGs used for interacting with the YTDLP Ops service and managing processing queues.
|
||||
|
||||
## Build
|
||||
## DAG Descriptions
|
||||
|
||||
1. **Pull, configure and start server if needed:**
|
||||
```bash
|
||||
cd /srv/airflow_worker/
|
||||
docker login pangramia # It used to be performed beforehand otherwise ask pull password
|
||||
docker compose -f docker-compose-ytdlp-ops.yaml up -d
|
||||
docker compose -f docker-compose-ytdlp-ops.yaml logs -f
|
||||
```
|
||||
The server is bound to a certain proxy, like "socks5://sslocal-rust-1084:1084".
|
||||
### `ytdlp_client_dag_v2.1`
|
||||
|
||||
Also check that redis in bind to 0.0.0.0 in config
|
||||
* **File:** `airflow/dags/ytdlp_client_dag_v2.1.py`
|
||||
* **Purpose:** Provides a way to test the YTDLP Ops Thrift service interaction for a *single* video URL. Useful for debugging connection issues, testing specific account IDs, or verifying the service response for a particular URL independently of the queueing system.
|
||||
* **Parameters (Defaults):**
|
||||
* `url` (`'https://www.youtube.com/watch?v=sOlTX9uxUtM'`): The video URL to process.
|
||||
* `redis_enabled` (`False`): Use Redis for service discovery?
|
||||
* `service_ip` (`'85.192.30.55'`): Service IP if `redis_enabled=False`.
|
||||
* `service_port` (`9090`): Service port if `redis_enabled=False`.
|
||||
* `account_id` (`'account_fr_2025-04-03T1220_anonomyous_2ssdfsf2342afga09'`): Account ID for lookup or call.
|
||||
* `timeout` (`30`): Timeout in seconds for Thrift connection.
|
||||
* `info_json_dir` (`"{{ var.value.get('DOWNLOADS_TEMP', '/opt/airflow/downloadfiles') }}"`): Directory to save `info.json`.
|
||||
* **Results:**
|
||||
* Connects to the YTDLP Ops service using the specified method (Redis or direct IP).
|
||||
* Retrieves token data for the given URL and account ID.
|
||||
* Saves the video's `info.json` metadata to the specified directory.
|
||||
* Extracts the SOCKS proxy (if available).
|
||||
* Pushes `info_json_path`, `socks_proxy`, and the original `ytdlp_command` (with tokens) to XCom.
|
||||
* Optionally stores detailed results in a Redis hash (`token_info:<video_id>`).
|
||||
|
||||
2. **Build airflow-worker with custom dependencies:**
|
||||
```bash
|
||||
cd /srv/airflow_worker/
|
||||
docker compose build airflow-worker
|
||||
docker compose down airflow-worker
|
||||
docker compose up -d --no-deps airflow-worker
|
||||
```
|
||||
### `ytdlp_mgmt_queue_add_urls`
|
||||
|
||||
3. **Test the built-in client:**
|
||||
```bash
|
||||
# Show client help
|
||||
docker compose exec airflow-worker python /app/ytdlp_ops_client.py --help
|
||||
* **File:** `airflow/dags/ytdlp_mgmt_queue_add_urls.py`
|
||||
* **Purpose:** Manually add video URLs to a specific YTDLP inbox queue (Redis List).
|
||||
* **Parameters (Defaults):**
|
||||
* `redis_conn_id` (`'redis_default'`): Airflow Redis connection ID.
|
||||
* `queue_name` (`'video_queue_inbox_account_fr_2025-04-03T1220_anonomyous_2ssdfsf2342afga09'`): Target Redis list (inbox queue).
|
||||
* `urls` (`""`): Multiline string of video URLs to add.
|
||||
* **Results:**
|
||||
* Parses the multiline `urls` parameter.
|
||||
* Adds each valid URL to the end of the Redis list specified by `queue_name`.
|
||||
* Logs the number of URLs added.
|
||||
|
||||
# Get token and info.json
|
||||
docker compose exec airflow-worker python /app/ytdlp_ops_client.py --host 85.192.30.55 --port 9090 getToken --url 'https://www.youtube.com/watch?v=vKTVLpmvznI'
|
||||
### `ytdlp_mgmt_queue_clear`
|
||||
|
||||
# List formats using saved info.json
|
||||
docker compose exec airflow-worker yt-dlp --load-info-json "latest.json" -F
|
||||
* **File:** `airflow/dags/ytdlp_mgmt_queue_clear.py`
|
||||
* **Purpose:** Manually delete a specific Redis key used by the YTDLP queues.
|
||||
* **Parameters (Defaults):**
|
||||
* `redis_conn_id` (`'redis_default'`): Airflow Redis connection ID.
|
||||
* `queue_to_clear` (`'PLEASE_SPECIFY_QUEUE_TO_CLEAR'`): Exact name of the Redis key to clear. **Must be changed by user.**
|
||||
* **Results:**
|
||||
* Deletes the Redis key specified by the `queue_to_clear` parameter.
|
||||
* **Warning:** This operation is destructive and irreversible. Use with extreme caution. Ensure you specify the correct key name (e.g., `video_queue_inbox_account_xyz`, `video_queue_progress`, `video_queue_result`, `video_queue_fail`).
|
||||
|
||||
# Simulate download using saved info.json
|
||||
docker compose exec airflow-worker yt-dlp --load-info-json "latest.json" --proxy "socks5://sslocal-rust-1084:1084" --simulate --verbose
|
||||
### `ytdlp_mgmt_queue_check_status`
|
||||
|
||||
# Extract metadata and download URLs using jq
|
||||
docker compose exec airflow-worker jq -r '"Title: \(.title)", "Date: \(.upload_date | strptime("%Y%m%d") | strftime("%Y-%m-%d"))", "Author: \(.uploader)", "Length: \(.duration_string)", "", "Download URLs:", (.formats[] | select(.vcodec != "none" or .acodec != "none") | .url)' latest.json
|
||||
```
|
||||
* **File:** `airflow/dags/ytdlp_mgmt_queue_check_status.py`
|
||||
* **Purpose:** Manually check the type and size of a specific YTDLP Redis queue/key.
|
||||
* **Parameters (Defaults):**
|
||||
* `redis_conn_id` (`'redis_default'`): Airflow Redis connection ID.
|
||||
* `queue_to_check` (`'video_queue_inbox_account_fr_2025-04-03T1220_anonomyous_2ssdfsf2342afga09'`): Exact name of the Redis key to check.
|
||||
* **Results:**
|
||||
* Connects to Redis and determines the type of the key specified by `queue_to_check`.
|
||||
* Determines the size (length for lists, number of fields for hashes).
|
||||
* Logs the key type and size.
|
||||
* Pushes `queue_key_type` and `queue_size` to XCom.
|
||||
|
||||
4. **Test Airflow task:**
|
||||
### `ytdlp_mgmt_queue_list_contents`
|
||||
|
||||
To run the `ytdlp_client_dag_v2.1` DAG:
|
||||
* **File:** `airflow/dags/ytdlp_mgmt_queue_list_contents.py`
|
||||
* **Purpose:** Manually list the contents of a specific YTDLP Redis queue/key (list or hash). Useful for inspecting queue state or results.
|
||||
* **Parameters (Defaults):**
|
||||
* `redis_conn_id` (`'redis_default'`): Airflow Redis connection ID.
|
||||
* `queue_to_list` (`'video_queue_inbox_account_fr_2025-04-03T1220_anonomyous_2ssdfsf2342afga09'`): Exact name of the Redis key to list.
|
||||
* `max_items` (`100`): Maximum number of items/fields to list.
|
||||
* **Results:**
|
||||
* Connects to Redis and identifies the type of the key specified by `queue_to_list`.
|
||||
* If it's a List, logs the first `max_items` elements.
|
||||
* If it's a Hash, logs up to `max_items` key-value pairs, attempting to pretty-print JSON values.
|
||||
* Logs warnings for very large hashes.
|
||||
|
||||
Set up required Airflow variables
|
||||
```bash
|
||||
docker compose exec airflow-worker airflow variables set DOWNLOAD_OPTIONS '{"formats": ["bestvideo[height<=1080]+bestaudio/best[height<=1080]"]}'
|
||||
docker compose exec airflow-worker airflow variables set DOWNLOADS_TEMP '/opt/airflow/downloadfiles'
|
||||
docker compose exec airflow-worker airflow variables set DOWNLOADS_PATH '/opt/airflow/downloadfiles'
|
||||
|
||||
docker compose exec airflow-worker airflow variables list
|
||||
docker compose exec airflow-worker airflow variables set TOKEN_TIMEOUT '300'
|
||||
### `ytdlp_proc_sequential_processor`
|
||||
|
||||
docker compose exec airflow-worker airflow connections import /opt/airflow/config/docker_hub_repo.json
|
||||
docker compose exec airflow-worker airflow connections delete redis_default
|
||||
docker compose exec airflow-worker airflow connections import /opt/airflow/config/redis_default_conn.json
|
||||
```
|
||||
|
||||
|
||||
**Using direct connection with task test:**
|
||||
```bash
|
||||
docker compose exec airflow-worker airflow db reset
|
||||
docker compose exec airflow-worker airflow dags reserialize
|
||||
|
||||
docker compose exec airflow-worker airflow dags list
|
||||
docker compose exec airflow-worker airflow dags list-import-errors
|
||||
docker compose exec airflow-worker airflow tasks test ytdlp_client_dag_v2.1 get_token $(date -u +"%Y-%m-%dT%H:%M:%S+00:00") --task-params '{"url": "https://www.youtube.com/watch?v=sOlTX9uxUtM", "redis_enabled": false, "service_ip": "85.192.30.55", "service_port": 9090}'
|
||||
docker compose exec airflow-worker yt-dlp --load-info-json /opt/airflow/downloadfiles/latest.json --proxy "socks5://sslocal-rust-1084:1084" --verbose --simulate
|
||||
|
||||
docker compose exec airflow-worker airflow dags list-runs -d ytdlp_client_dag
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
```
|
||||
|
||||
|
||||
or deploy using trigger
|
||||
```bash
|
||||
docker compose exec airflow-worker airflow dags list
|
||||
docker compose exec airflow-worker airflow dags unpause ytdlp_client_dag_v2.1
|
||||
|
||||
// Try UI or recheck if works from server deploy
|
||||
docker compose exec airflow-worker airflow dags trigger ytdlp_client_dag_v2.1 -c '{"url": "https://www.youtube.com/watch?v=sOlTX9uxUtM", "redis_enabled": false, "service_ip": "85.192.30.55", "service_port": 9090}'
|
||||
|
||||
```
|
||||
|
||||
|
||||
Check Redis for stored data by videoID
|
||||
```bash
|
||||
docker compose exec redis redis-cli -a XXXXXX -h 89.253.221.173 -p 52909 HGETALL "token_info:sOlTX9uxUtM" | jq -R -s 'split("\n") | del(.[] | select(. == "")) | [.[range(0;length;2)]]'
|
||||
```
|
||||
|
||||
* **File:** `airflow/dags/ytdlp_proc_sequential_processor.py`
|
||||
* **Purpose:** Processes YouTube URLs sequentially from a Redis queue. Designed for batch processing. Pops a URL, gets token/metadata via YTDLP Ops service, downloads the media using `yt-dlp`, and records the result.
|
||||
* **Parameters (Defaults):**
|
||||
* `queue_name` (`'video_queue'`): Base name for Redis queues (e.g., `video_queue_inbox`, `video_queue_progress`).
|
||||
* `redis_conn_id` (`'redis_default'`): Airflow Redis connection ID.
|
||||
* `redis_enabled` (`False`): Use Redis for service discovery? If False, uses `service_ip`/`port`.
|
||||
* `service_ip` (`None`): Required Service IP if `redis_enabled=False`.
|
||||
* `service_port` (`None`): Required Service port if `redis_enabled=False`.
|
||||
* `account_id` (`'default_account'`): Account ID for the API call (used for Redis lookup if `redis_enabled=True`).
|
||||
* `timeout` (`30`): Timeout in seconds for the Thrift connection.
|
||||
* `download_format` (`'ba[ext=m4a]/bestaudio/best'`): yt-dlp format selection string.
|
||||
* `output_path_template` (`"{{ var.value.get('DOWNLOADS_TEMP', '/opt/airflow/downloads') }}/%(title)s [%(id)s].%(ext)s"`): yt-dlp output template. Uses Airflow Variable `DOWNLOADS_TEMP`.
|
||||
* `info_json_dir` (`"{{ var.value.get('DOWNLOADS_TEMP', '/opt/airflow/downloadfiles') }}"`): Directory to save `info.json`. Uses Airflow Variable `DOWNLOADS_TEMP`.
|
||||
* **Results:**
|
||||
* Pops one URL from the `{{ params.queue_name }}_inbox` Redis list.
|
||||
* If a URL is popped, it's added to the `{{ params.queue_name }}_progress` Redis hash.
|
||||
* The `YtdlpOpsOperator` (`get_token` task) attempts to get token data (including `info.json`, proxy, command) for the URL using the specified connection method and account ID.
|
||||
* If token retrieval succeeds, the `download_video` task executes `yt-dlp` using the retrieved `info.json`, proxy, the `download_format` parameter, and the `output_path_template` parameter to download the actual media.
|
||||
* **On Successful Download:** The URL is removed from the progress hash and added to the `{{ params.queue_name }}_result` hash along with results (`info_json_path`, `socks_proxy`, `ytdlp_command`).
|
||||
* **On Failure (Token Retrieval or Download):** The URL is removed from the progress hash and added to the `{{ params.queue_name }}_fail` hash along with error details (message, traceback).
|
||||
* If the inbox queue is empty, the DAG run skips processing via `AirflowSkipException`.
|
||||
|
||||
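The inbox/progress/result/fail flow described above can be illustrated with plain `redis-py` calls. This is only a sketch of the data flow under the default `video_queue` key names, not the DAG's actual implementation; the connection details and result payload are placeholders:

```python
# Illustrative sketch of the inbox -> progress -> result/fail queue lifecycle.
import json
import time
import redis

r = redis.Redis(host="redis", port=6379, decode_responses=True)  # placeholder connection
base = "video_queue"

url = r.lpop(f"{base}_inbox")  # pop the next URL to process
if url is None:
    print("Inbox empty; the DAG would skip via AirflowSkipException here")
else:
    r.hset(f"{base}_progress", url, json.dumps({"started_at": time.time()}))
    try:
        # Token retrieval and the yt-dlp download would happen here; this is a stand-in result.
        result = {"info_json_path": "/opt/airflow/downloadfiles/latest.json"}
        r.hset(f"{base}_result", url, json.dumps(result))
    except Exception as exc:
        r.hset(f"{base}_fail", url, json.dumps({"error": str(exc)}))
    finally:
        r.hdel(f"{base}_progress", url)
```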
42  camoufox/Dockerfile  (new file)
@@ -0,0 +1,42 @@
|
||||
# Use a base Python image
|
||||
FROM python:3.11-slim
|
||||
|
||||
# Set working directory
|
||||
WORKDIR /app
|
||||
|
||||
# Install necessary system packages for Playwright, GeoIP, and Xvfb
|
||||
RUN apt-get update && apt-get install -y --no-install-recommends \
|
||||
libgeoip1 \
|
||||
# Xvfb for headless browser display
|
||||
xvfb \
|
||||
# Playwright browser dependencies
|
||||
libnss3 libnspr4 libdbus-1-3 libatk1.0-0 libatk-bridge2.0-0 libcups2 libdrm2 libxkbcommon0 libxcomposite1 libxdamage1 libxfixes3 libxrandr2 libgbm1 libpango-1.0-0 libcairo2 libasound2 \
|
||||
&& rm -rf /var/lib/apt/lists/*
|
||||
|
||||
# Install Python dependencies: camoufox with geoip support and playwright==1.49
|
||||
# Using --no-cache-dir to reduce image size
|
||||
RUN pip install --no-cache-dir "camoufox[geoip]" playwright==1.49
|
||||
|
||||
# Install Playwright browsers for version 1.49
|
||||
RUN playwright install --with-deps
|
||||
|
||||
# Copy the server script into the image
|
||||
COPY camoufox_server.py .
|
||||
|
||||
# Create directory for extensions and copy them
|
||||
RUN mkdir /app/extensions
|
||||
COPY google_sign_in_popup_blocker-1.0.2.xpi /app/extensions/
|
||||
COPY spoof_timezone-0.3.4.xpi /app/extensions/
|
||||
COPY youtube_ad_auto_skipper-0.6.0.xpi /app/extensions/
|
||||
|
||||
# Expose the default port Camoufox might use (adjust if needed)
|
||||
# This is informational; the actual port mapping is in docker-compose.
|
||||
EXPOSE 12345
|
||||
|
||||
# Copy the wrapper script and make it executable
|
||||
COPY start_camoufox.sh /app/
|
||||
RUN chmod +x /app/start_camoufox.sh
|
||||
|
||||
# Default command executes the wrapper script.
|
||||
# Arguments for camoufox_server.py will be passed via docker-compose command section.
|
||||
ENTRYPOINT ["/app/start_camoufox.sh"]
|
||||
190  camoufox/camoufox_server.py  (new file)
@@ -0,0 +1,190 @@
|
||||
#!/usr/bin/env python3
|
||||
import re
|
||||
import argparse
|
||||
import atexit # Import atexit
|
||||
import shutil # Import shutil for directory removal
|
||||
import logging # Import the logging module
|
||||
import sys # Import sys for stdout
|
||||
import os # Import os module
|
||||
from camoufox.server import launch_server
|
||||
|
||||
def parse_proxy_url(url):
|
||||
"""Parse proxy URL in format proto://user:pass@host:port"""
|
||||
pattern = r'([^:]+)://(?:([^:]+):([^@]+)@)?([^:]+):(\d+)'
|
||||
match = re.match(pattern, url)
|
||||
if not match:
|
||||
raise ValueError('Invalid proxy URL format. Expected proto://[user:pass@]host:port')
|
||||
|
||||
proto, username, password, host, port = match.groups()
|
||||
|
||||
# Ensure username and password are strings, not None
|
||||
proxy_config = {
|
||||
'server': f'{proto}://{host}:{port}',
|
||||
'username': username or '',
|
||||
'password': password or ''
|
||||
}
|
||||
|
||||
# Remove empty credentials
|
||||
if not proxy_config['username']:
|
||||
del proxy_config['username']
|
||||
if not proxy_config['password']:
|
||||
del proxy_config['password']
|
||||
|
||||
return proxy_config
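# Example of the mapping this helper returns (illustrative values only):
#   parse_proxy_url("socks5://user:pass@127.0.0.1:1084")
#     -> {'server': 'socks5://127.0.0.1:1084', 'username': 'user', 'password': 'pass'}
#   parse_proxy_url("http://proxy.local:3128")
#     -> {'server': 'http://proxy.local:3128'}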
|
||||
|
||||
def main():
|
||||
parser = argparse.ArgumentParser(description='Launch Camoufox server with optional proxy support')
|
||||
parser.add_argument('--proxy-url', help='Optional proxy URL in format proto://user:pass@host:port (supports http, https, socks5)')
|
||||
parser.add_argument('--ws-host', default='localhost', help='WebSocket server host address (e.g., localhost, 0.0.0.0)')
|
||||
parser.add_argument('--port', type=int, default=0, help='WebSocket server port (0 for random)')
|
||||
parser.add_argument('--ws-path', default='camoufox', help='WebSocket server path')
|
||||
parser.add_argument('--headless', action='store_true', help='Run browser in headless mode')
|
||||
parser.add_argument('--geoip', nargs='?', const=True, default=False,
|
||||
help='Enable geo IP protection. Can specify IP address or use True for automatic detection')
|
||||
parser.add_argument('--locale', help='Locale(s) to use (e.g. "en-US" or "en-US,fr-FR")')
|
||||
parser.add_argument('--block-images', action='store_true', help='Block image requests to save bandwidth')
|
||||
parser.add_argument('--block-webrtc', action='store_true', help='Block WebRTC entirely')
|
||||
parser.add_argument('--humanize', nargs='?', const=True, type=float,
|
||||
help='Humanize cursor movements. Can specify max duration in seconds')
|
||||
parser.add_argument('--extensions', type=str,
|
||||
help='Comma-separated list of extension paths to enable (XPI files or extracted directories). Use quotes if paths contain spaces.')
|
||||
|
||||
args = parser.parse_args()
|
||||
|
||||
proxy_config = None
|
||||
if args.proxy_url:
|
||||
try:
|
||||
proxy_config = parse_proxy_url(args.proxy_url)
|
||||
print(f"Using proxy configuration: {args.proxy_url}")
|
||||
except ValueError as e:
|
||||
print(f'Error parsing proxy URL: {e}')
|
||||
return
|
||||
else:
|
||||
print("No proxy URL provided. Running without proxy.")
|
||||
|
||||
# --- Basic Logging Configuration ---
|
||||
# Configure the root logger to show INFO level messages
|
||||
# This might capture logs from camoufox or its dependencies (like websockets)
|
||||
log_formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')
|
||||
log_handler = logging.StreamHandler(sys.stdout) # Log to standard output
|
||||
log_handler.setFormatter(log_formatter)
|
||||
|
||||
root_logger = logging.getLogger()
|
||||
# Remove existing handlers to avoid duplicates if script is re-run in same process
|
||||
for handler in root_logger.handlers[:]:
|
||||
root_logger.removeHandler(handler)
|
||||
root_logger.addHandler(log_handler)
|
||||
# Set level to DEBUG for more detailed output from Camoufox/Playwright
|
||||
root_logger.setLevel(logging.DEBUG)
|
||||
|
||||
logging.debug("DEBUG logging enabled. Starting Camoufox server setup...")
|
||||
# --- End Logging Configuration ---
|
||||
|
||||
try:
|
||||
# --- Check DISPLAY environment variable ---
|
||||
display_var = os.environ.get('DISPLAY')
|
||||
logging.info(f"Value of DISPLAY environment variable: {display_var}")
|
||||
# --- End Check ---
|
||||
|
||||
# Build config dictionary
|
||||
config = {
|
||||
'headless': args.headless,
|
||||
'geoip': args.geoip,
|
||||
# 'proxy': proxy_config, # Add proxy config only if it exists
|
||||
'host': args.ws_host, # Add the host argument
|
||||
'port': args.port,
|
||||
'ws_path': args.ws_path,
|
||||
# Explicitly pass DISPLAY environment variable to Playwright
|
||||
'env': {'DISPLAY': os.environ.get('DISPLAY')}
|
||||
}
|
||||
# Add proxy to config only if it was successfully parsed
|
||||
if proxy_config:
|
||||
config['proxy'] = proxy_config
|
||||
|
||||
# Add optional parameters
|
||||
if args.locale:
|
||||
config['locale'] = args.locale
|
||||
if args.block_images:
|
||||
config['block_images'] = True
|
||||
if args.block_webrtc:
|
||||
config['block_webrtc'] = True
|
||||
if args.humanize:
|
||||
config['humanize'] = args.humanize if isinstance(args.humanize, float) else True
|
||||
|
||||
# Exclude default addons including uBlock Origin
|
||||
config['exclude_addons'] = ['ublock_origin', 'default_addons']
|
||||
print('Excluded default addons including uBlock Origin')
|
||||
|
||||
# Add custom extensions if specified
|
||||
if args.extensions:
|
||||
from pathlib import Path
|
||||
valid_extensions = []
|
||||
|
||||
# Split comma-separated extensions
|
||||
extensions_list = [ext.strip() for ext in args.extensions.split(',')]
|
||||
temp_dirs_to_cleanup = [] # List to store temp dirs
|
||||
|
||||
# Register cleanup function
|
||||
def cleanup_temp_dirs():
|
||||
for temp_dir in temp_dirs_to_cleanup:
|
||||
try:
|
||||
shutil.rmtree(temp_dir)
|
||||
print(f"Cleaned up temporary extension directory: {temp_dir}")
|
||||
except Exception as e:
|
||||
print(f"Warning: Failed to clean up temp dir {temp_dir}: {e}")
|
||||
atexit.register(cleanup_temp_dirs)
|
||||
|
||||
for ext_path in extensions_list:
|
||||
# Convert to absolute path
|
||||
ext_path = Path(ext_path).absolute()
|
||||
|
||||
if not ext_path.exists():
|
||||
print(f"Warning: Extension path does not exist: {ext_path}")
|
||||
continue
|
||||
|
||||
if ext_path.is_file() and ext_path.suffix == '.xpi':
|
||||
# Extract XPI to temporary directory
|
||||
import tempfile
|
||||
import zipfile
|
||||
|
||||
try:
|
||||
temp_dir = tempfile.mkdtemp(prefix=f"camoufox_ext_{ext_path.stem}_")
|
||||
temp_dirs_to_cleanup.append(temp_dir) # Add to cleanup list
|
||||
with zipfile.ZipFile(ext_path, 'r') as zip_ref:
|
||||
zip_ref.extractall(temp_dir)
|
||||
valid_extensions.append(temp_dir)
|
||||
print(f"Successfully loaded extension: {ext_path.name} (extracted to {temp_dir})")
|
||||
except Exception as e:
|
||||
print(f"Error loading extension {ext_path}: {str(e)}")
|
||||
# Remove from cleanup list if extraction failed before adding to valid_extensions
|
||||
if temp_dir in temp_dirs_to_cleanup:
|
||||
temp_dirs_to_cleanup.remove(temp_dir)
|
||||
continue
|
||||
elif ext_path.is_dir():
|
||||
# Check if it's a valid Firefox extension
|
||||
if (ext_path / 'manifest.json').exists():
|
||||
valid_extensions.append(str(ext_path))
|
||||
print(f"Successfully loaded extension: {ext_path.name}")
|
||||
else:
|
||||
print(f"Warning: Directory is not a valid Firefox extension: {ext_path}")
|
||||
else:
|
||||
print(f"Warning: Invalid extension path: {ext_path}")
|
||||
|
||||
if valid_extensions:
|
||||
config['addons'] = valid_extensions
|
||||
print(f"Loaded {len(valid_extensions)} extensions")
|
||||
else:
|
||||
print("Warning: No valid extensions were loaded")
|
||||
|
||||
server = launch_server(**config)
|
||||
except Exception as e:
|
||||
print(f'Error launching server: {str(e)}')
|
||||
if 'Browser.setBrowserProxy' in str(e):
|
||||
print('Note: The browser may not support SOCKS5 proxy authentication')
|
||||
return
|
||||
|
||||
print(f'\nCamoufox server started successfully!')
|
||||
print(f'WebSocket endpoint: {server.ws_endpoint}\n')
|
||||
|
||||
if __name__ == '__main__':
|
||||
main()
|
||||
BIN  camoufox/google_sign_in_popup_blocker-1.0.2.xpi  (new binary file, not shown)
BIN  camoufox/spoof_timezone-0.3.4.xpi  (new binary file, not shown)
58  camoufox/start_camoufox.sh  (new executable file)
@@ -0,0 +1,58 @@
|
||||
#!/bin/bash
|
||||
|
||||
# Set error handling
|
||||
set -e
|
||||
|
||||
# Function to cleanup resources on exit
|
||||
cleanup() {
|
||||
echo "Cleaning up resources..."
|
||||
|
||||
# Kill Xvfb if it's running
|
||||
if [ -n "$XVFB_PID" ] && ps -p $XVFB_PID > /dev/null; then
|
||||
echo "Stopping Xvfb (PID: $XVFB_PID)"
|
||||
kill $XVFB_PID || true
|
||||
fi
|
||||
|
||||
# Remove X lock files if they exist
|
||||
if [ -e "/tmp/.X99-lock" ]; then
|
||||
echo "Removing X lock file"
|
||||
rm -f /tmp/.X99-lock
|
||||
fi
|
||||
|
||||
echo "Cleanup complete"
|
||||
}
|
||||
|
||||
# Register the cleanup function to run on script exit
|
||||
trap cleanup EXIT
|
||||
|
||||
# Check if X lock file exists and remove it (in case of previous unclean shutdown)
|
||||
if [ -e "/tmp/.X99-lock" ]; then
|
||||
echo "Removing existing X lock file"
|
||||
rm -f /tmp/.X99-lock
|
||||
fi
|
||||
|
||||
# Start Xvfb with display :99
|
||||
echo "Starting Xvfb on display :99"
|
||||
Xvfb :99 -screen 0 1280x1024x24 -ac &
|
||||
XVFB_PID=$!
|
||||
|
||||
# Wait a moment for Xvfb to initialize
|
||||
sleep 2
|
||||
|
||||
# Check if Xvfb started successfully
|
||||
if ! ps -p $XVFB_PID > /dev/null; then
|
||||
echo "Failed to start Xvfb"
|
||||
exit 1
|
||||
fi
|
||||
|
||||
# Export the DISPLAY variable for the browser
|
||||
export DISPLAY=:99
|
||||
|
||||
echo "Xvfb started successfully with PID: $XVFB_PID"
|
||||
echo "DISPLAY set to: $DISPLAY"
|
||||
|
||||
# Start the Camoufox server with all arguments passed to this script
|
||||
echo "Starting Camoufox server with arguments:"
|
||||
printf " Arg: '%s'\n" "$@" # Print each argument quoted on a new line
|
||||
echo "Executing: python3 camoufox_server.py $@"
|
||||
python3 camoufox_server.py "$@"
|
||||
BIN  camoufox/youtube_ad_auto_skipper-0.6.0.xpi  (new binary file, not shown)
@@ -468,9 +468,10 @@ class YtdlpOpsOperator(BaseOperator):
|
||||
|
||||
# Write to timestamped file
|
||||
try:
|
||||
logger.info(f"Writing info.json content (received from service) to {info_json_path}...")
|
||||
with open(info_json_path, 'w', encoding='utf-8') as f:
|
||||
f.write(info_json)
|
||||
logger.info(f"Saved info.json to timestamped file: {info_json_path}")
|
||||
logger.info(f"Successfully saved info.json to timestamped file: {info_json_path}")
|
||||
except IOError as e:
|
||||
logger.error(f"Failed to write info.json to {info_json_path}: {e}")
|
||||
return None # Indicate failure
|
||||
|
||||
189  dags/ytdlp_mgmt_queue_add_urls.py  (new file)
@@ -0,0 +1,189 @@
|
||||
from airflow import DAG
|
||||
from airflow.models.param import Param
|
||||
from airflow.operators.python import PythonOperator
|
||||
from airflow.providers.redis.hooks.redis import RedisHook
|
||||
from airflow.utils.dates import days_ago
|
||||
from airflow.exceptions import AirflowException
|
||||
from datetime import timedelta
|
||||
import logging
|
||||
import redis # Import redis exceptions if needed
|
||||
|
||||
# Configure logging
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
# Default settings
|
||||
DEFAULT_QUEUE_NAME = 'video_queue_inbox' # Default to the inbox queue
|
||||
DEFAULT_REDIS_CONN_ID = 'redis_default'
|
||||
|
||||
# --- Helper Functions ---
|
||||
|
||||
def _get_redis_client(redis_conn_id):
|
||||
"""Gets a Redis client connection using RedisHook."""
|
||||
try:
|
||||
hook = RedisHook(redis_conn_id=redis_conn_id)
|
||||
client = hook.get_conn()
|
||||
client.ping()
|
||||
logger.info(f"Successfully connected to Redis using connection '{redis_conn_id}'.")
|
||||
return client
|
||||
except redis.exceptions.AuthenticationError:
|
||||
logger.error(f"Redis authentication failed for connection '{redis_conn_id}'. Check password.")
|
||||
raise AirflowException(f"Redis authentication failed for '{redis_conn_id}'.")
|
||||
except Exception as e:
|
||||
logger.error(f"Failed to get Redis client for connection '{redis_conn_id}': {e}")
|
||||
raise AirflowException(f"Redis connection failed for '{redis_conn_id}': {e}")
|
||||
|
||||
# --- Python Callables for Tasks ---
|
||||
|
||||
def add_urls_callable(**context):
|
||||
"""Adds URLs from comma/newline separated input to the specified Redis list."""
|
||||
params = context['params']
|
||||
redis_conn_id = params['redis_conn_id']
|
||||
queue_name = params['queue_name'] # Should be the inbox queue, e.g., video_queue_inbox
|
||||
urls_input = params['urls']
|
||||
|
||||
if not queue_name.endswith('_inbox'):
|
||||
logger.warning(f"Target queue name '{queue_name}' does not end with '_inbox'. Ensure this is the intended inbox queue.")
|
||||
|
||||
if not urls_input or not isinstance(urls_input, str):
|
||||
logger.warning("No URLs provided or 'urls' parameter is not a string. Nothing to add.")
|
||||
return
|
||||
|
||||
# Process input: split by newline, then by comma, flatten, strip, and filter empty
|
||||
urls_to_add = []
|
||||
for line in urls_input.splitlines():
|
||||
urls_to_add.extend(url.strip() for url in line.split(',') if url.strip())
|
||||
|
||||
# Remove duplicates while preserving order (optional, but good practice)
|
||||
seen = set()
|
||||
urls_to_add = [x for x in urls_to_add if not (x in seen or seen.add(x))]
|
||||
|
||||
if not urls_to_add:
|
||||
logger.info("No valid URLs found after processing input. Nothing added.")
|
||||
return
|
||||
|
||||
logger.info(f"Attempting to add {len(urls_to_add)} unique URLs to Redis list '{queue_name}' using connection '{redis_conn_id}'.")
|
||||
try:
|
||||
redis_client = _get_redis_client(redis_conn_id)
|
||||
# Use rpush to add to the end of the list (FIFO behavior with lpop)
|
||||
added_count = redis_client.rpush(queue_name, *urls_to_add)
|
||||
logger.info(f"Successfully added {len(urls_to_add)} URLs to list '{queue_name}'. New list length: {added_count}.")
|
||||
except Exception as e:
|
||||
logger.error(f"Failed to add URLs to Redis list '{queue_name}': {e}", exc_info=True)
|
||||
raise AirflowException(f"Failed to add URLs to Redis: {e}")
|
||||
|
||||
|
||||
# Removed clear_queue_callable as this DAG focuses on adding and verifying
|
||||
|
||||
|
||||
def check_status_callable(**context):
|
||||
"""Checks the type and length/size of the specified Redis key."""
|
||||
# Access DAG run parameters directly from context['params']
|
||||
dag_params = context['params']
|
||||
redis_conn_id = dag_params['redis_conn_id']
|
||||
# Check the status of the queue specified in the main DAG parameters
|
||||
queue_to_check = dag_params['queue_name']
|
||||
|
||||
if not queue_to_check:
|
||||
raise ValueError("DAG parameter 'queue_name' cannot be empty.")
|
||||
|
||||
logger.info(f"Attempting to check status of Redis key '{queue_to_check}' using connection '{redis_conn_id}'.") # Uses DAG param value
|
||||
try:
|
||||
# Use the resolved redis_conn_id to get the client
|
||||
redis_client = _get_redis_client(redis_conn_id)
|
||||
# redis_client.type returns bytes (e.g., b'list', b'hash', b'none')
|
||||
key_type_bytes = redis_client.type(queue_to_check)
|
||||
key_type_str = key_type_bytes.decode('utf-8') # Decode to string
|
||||
|
||||
length = 0
|
||||
if key_type_str == 'list':
|
||||
length = redis_client.llen(queue_to_check)
|
||||
logger.info(f"Redis list '{queue_to_check}' has {length} items.")
|
||||
elif key_type_str == 'hash':
|
||||
length = redis_client.hlen(queue_to_check)
|
||||
logger.info(f"Redis hash '{queue_to_check}' has {length} fields.")
|
||||
elif key_type_str == 'none': # Check against the decoded string 'none'
|
||||
logger.info(f"Redis key '{queue_to_check}' does not exist.")
|
||||
else:
|
||||
# Attempt to get size for other types if possible, e.g., set size
|
||||
try:
|
||||
if key_type_str == 'set':
|
||||
length = redis_client.scard(queue_to_check)
|
||||
logger.info(f"Redis set '{queue_to_check}' has {length} members.")
|
||||
# Add checks for other types like zset if needed
|
||||
else:
|
||||
logger.info(f"Redis key '{queue_to_check}' exists but is of unhandled type '{key_type_str}'. Cannot determine size.")
|
||||
except Exception as size_error:
|
||||
logger.warning(f"Could not determine size for Redis key '{queue_to_check}' (type: {key_type_str}): {size_error}")
|
||||
logger.info(f"Redis key '{queue_to_check}' exists but is of unhandled/unsizeable type '{key_type_str}'.")
|
||||
|
||||
# Push results to XCom
|
||||
context['task_instance'].xcom_push(key='queue_key_type', value=key_type_str)
|
||||
context['task_instance'].xcom_push(key='queue_size', value=length)
|
||||
# Return status info using the resolved queue_to_check
|
||||
return {'key': queue_to_check, 'type': key_type_str, 'size': length}
|
||||
|
||||
except Exception as e:
|
||||
# Log error using the resolved queue_to_check
|
||||
logger.error(f"Failed to check status of Redis key '{queue_to_check}': {e}", exc_info=True)
|
||||
raise AirflowException(f"Failed to check Redis key status: {e}")
|
||||
|
||||
|
||||
# --- DAG Definition ---
|
||||
default_args = {
|
||||
'owner': 'airflow',
|
||||
'depends_on_past': False,
|
||||
'email_on_failure': False,
|
||||
'email_on_retry': False,
|
||||
'retries': 1,
|
||||
'retry_delay': timedelta(minutes=1), # Slightly longer retry delay for management tasks
|
||||
'start_date': days_ago(1)
|
||||
}
|
||||
|
||||
# This single DAG contains operators for different management actions,
|
||||
# This DAG allows adding URLs and then checking the status of the target queue.
|
||||
with DAG(
|
||||
dag_id='ytdlp_mgmt_queue_add_and_verify', # Updated DAG ID
|
||||
default_args=default_args,
|
||||
schedule_interval=None, # Manually triggered
|
||||
catchup=False,
|
||||
description='Manually add URLs to a YTDLP inbox queue and verify the queue status.', # Updated description
|
||||
tags=['ytdlp', 'queue', 'management', 'redis', 'manual', 'add', 'verify'], # Updated tags
|
||||
params={
|
||||
# Common params
|
||||
'redis_conn_id': Param(DEFAULT_REDIS_CONN_ID, type="string", description="Airflow Redis connection ID."),
|
||||
# Params for adding URLs (and checking the same queue)
|
||||
'queue_name': Param(DEFAULT_QUEUE_NAME, type="string", title="Target Queue Name", description="Redis list (inbox queue) to add URLs to and check status of."),
|
||||
'urls': Param("", type="string", title="URLs to Add", description="Comma and/or newline separated list of video URLs.", multiline=True), # Updated description, keep multiline for UI
|
||||
# Removed clear_queue_name param
|
||||
# Removed check_queue_name param (will use queue_name)
|
||||
}
|
||||
) as dag:
|
||||
|
||||
add_urls_task = PythonOperator(
|
||||
task_id='add_urls_to_queue',
|
||||
python_callable=add_urls_callable,
|
||||
# Pass only relevant params to the callable via context['params']
|
||||
# Note: context['params'] automatically contains all DAG params
|
||||
)
|
||||
add_urls_task.doc_md = """
|
||||
### Add URLs to Queue
|
||||
Adds URLs from the `urls` parameter (comma/newline separated) to the Redis list specified by `queue_name`.
|
||||
*Trigger this task manually via the UI and provide the URLs.*
|
||||
"""
|
||||
|
||||
# Removed clear_queue_task
|
||||
|
||||
check_status_task = PythonOperator(
|
||||
task_id='check_queue_status_after_add',
|
||||
python_callable=check_status_callable,
|
||||
# No task-specific params needed; callable uses context['params'] directly.
|
||||
)
|
||||
check_status_task.doc_md = """
|
||||
### Check Queue Status After Add
|
||||
Checks the type and length/size of the Redis key specified by `queue_name` (the same queue URLs were added to).
|
||||
Logs the result and pushes `queue_key_type` and `queue_size` to XCom.
|
||||
*This task runs automatically after `add_urls_to_queue`.*
|
||||
"""
|
||||
|
||||
# Define dependency: Add URLs first, then check status
|
||||
add_urls_task >> check_status_task
|
||||
133  dags/ytdlp_mgmt_queue_check_status.py  (new file)
@@ -0,0 +1,133 @@
|
||||
# -*- coding: utf-8 -*-
|
||||
# vim:fenc=utf-8
|
||||
#
|
||||
# Copyright © 2024 rl <rl@rlmbp>
|
||||
#
|
||||
# Distributed under terms of the MIT license.
|
||||
|
||||
"""
|
||||
Airflow DAG for manually checking the status (type and size) of a specific Redis key used by YTDLP queues.
|
||||
"""
|
||||
|
||||
from airflow import DAG
|
||||
from airflow.exceptions import AirflowException
|
||||
from airflow.models.param import Param
|
||||
from airflow.operators.python import PythonOperator
|
||||
from airflow.providers.redis.hooks.redis import RedisHook
|
||||
from airflow.utils.dates import days_ago
|
||||
from datetime import timedelta
|
||||
import logging
|
||||
import redis # Import redis exceptions if needed
|
||||
|
||||
# Configure logging
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
# Default settings
|
||||
DEFAULT_REDIS_CONN_ID = 'redis_default'
|
||||
# Default to a common inbox pattern, user should override with the specific key
|
||||
DEFAULT_QUEUE_TO_CHECK = 'video_queue_inbox'
|
||||
|
||||
# --- Helper Function ---
|
||||
|
||||
def _get_redis_client(redis_conn_id):
|
||||
"""Gets a Redis client connection using RedisHook."""
|
||||
try:
|
||||
hook = RedisHook(redis_conn_id=redis_conn_id)
|
||||
client = hook.get_conn()
|
||||
client.ping()
|
||||
logger.info(f"Successfully connected to Redis using connection '{redis_conn_id}'.")
|
||||
return client
|
||||
except redis.exceptions.AuthenticationError:
|
||||
logger.error(f"Redis authentication failed for connection '{redis_conn_id}'. Check password.")
|
||||
raise AirflowException(f"Redis authentication failed for '{redis_conn_id}'.")
|
||||
except Exception as e:
|
||||
logger.error(f"Failed to get Redis client for connection '{redis_conn_id}': {e}")
|
||||
raise AirflowException(f"Redis connection failed for '{redis_conn_id}': {e}")
|
||||
|
||||
# --- Python Callable for Check Status Task ---
|
||||
|
||||
def check_status_callable(**context):
|
||||
"""Checks the length/size of the specified Redis key (queue/hash)."""
|
||||
params = context['params']
|
||||
redis_conn_id = params['redis_conn_id']
|
||||
queue_to_check = params['queue_to_check'] # Specific queue/hash name
|
||||
|
||||
if not queue_to_check:
|
||||
raise ValueError("Parameter 'queue_to_check' cannot be empty.")
|
||||
|
||||
logger.info(f"Attempting to check status of Redis key '{queue_to_check}' using connection '{redis_conn_id}'.")
|
||||
try:
|
||||
redis_client = _get_redis_client(redis_conn_id)
|
||||
key_type = redis_client.type(queue_to_check)
|
||||
key_type_str = key_type.decode('utf-8') if isinstance(key_type, bytes) else key_type # Decode if needed
|
||||
|
||||
length = 0
|
||||
if key_type_str == 'list':
|
||||
length = redis_client.llen(queue_to_check)
|
||||
logger.info(f"Redis list '{queue_to_check}' has {length} items.")
|
||||
elif key_type_str == 'hash':
|
||||
length = redis_client.hlen(queue_to_check)
|
||||
logger.info(f"Redis hash '{queue_to_check}' has {length} fields.")
|
||||
elif key_type_str == 'none':
|
||||
logger.info(f"Redis key '{queue_to_check}' does not exist.")
|
||||
else:
|
||||
# Attempt to get size for other types if possible, e.g., set size
|
||||
try:
|
||||
length = redis_client.scard(queue_to_check) # Example for set
|
||||
logger.info(f"Redis key '{queue_to_check}' (type: {key_type_str}) has size {length}.")
|
||||
except Exception:
|
||||
logger.info(f"Redis key '{queue_to_check}' exists but is of unhandled/unsizeable type '{key_type_str}'.")
|
||||
|
||||
# Optionally push length to XCom if needed downstream
|
||||
context['task_instance'].xcom_push(key='queue_key_type', value=key_type_str)
|
||||
context['task_instance'].xcom_push(key='queue_size', value=length)
|
||||
return {'key': queue_to_check, 'type': key_type_str, 'size': length} # Return status info
|
||||
except Exception as e:
|
||||
logger.error(f"Failed to check status of Redis key '{queue_to_check}': {e}", exc_info=True)
|
||||
raise AirflowException(f"Failed to check Redis key status: {e}")
|
||||
|
||||
# --- DAG Definition ---
|
||||
default_args = {
|
||||
'owner': 'airflow',
|
||||
'depends_on_past': False,
|
||||
'email_on_failure': False,
|
||||
'email_on_retry': False,
|
||||
'retries': 1,
|
||||
'retry_delay': timedelta(seconds=30),
|
||||
'start_date': days_ago(1)
|
||||
}
|
||||
|
||||
with DAG(
|
||||
dag_id='ytdlp_mgmt_queue_check_status',
|
||||
default_args=default_args,
|
||||
schedule_interval=None, # Manually triggered
|
||||
catchup=False,
|
||||
description='Manually check the type and size of a specific YTDLP Redis queue/key.',
|
||||
tags=['ytdlp', 'queue', 'management', 'redis', 'manual', 'status'],
|
||||
params={
|
||||
'redis_conn_id': Param(DEFAULT_REDIS_CONN_ID, type="string", description="Airflow Redis connection ID."),
|
||||
'queue_to_check': Param(
|
||||
DEFAULT_QUEUE_TO_CHECK,
|
||||
type="string",
|
||||
description="Exact name of the Redis key to check (e.g., 'video_queue_inbox_account_xyz', 'video_queue_progress', 'video_queue_result', 'video_queue_fail')."
|
||||
),
|
||||
}
|
||||
) as dag:
|
||||
|
||||
check_status_task = PythonOperator(
|
||||
task_id='check_specified_queue_status',
|
||||
python_callable=check_status_callable,
|
||||
# Params are implicitly passed via context['params']
|
||||
)
|
||||
check_status_task.doc_md = """
|
||||
### Check Specified Queue/Key Status Task
|
||||
Checks the type and size (length for lists, number of fields for hashes) of the Redis key specified by `queue_to_check`.
|
||||
Logs the result and pushes `queue_key_type` and `queue_size` to XCom.
|
||||
Can check keys like:
|
||||
- `_inbox` (Redis List)
|
||||
- `_progress` (Redis Hash)
|
||||
- `_result` (Redis Hash)
|
||||
- `_fail` (Redis Hash)
|
||||
|
||||
*Trigger this task manually via the UI.*
|
||||
"""
|
||||
113  dags/ytdlp_mgmt_queue_clear.py  (new file)
@@ -0,0 +1,113 @@
|
||||
# -*- coding: utf-8 -*-
|
||||
# vim:fenc=utf-8
|
||||
#
|
||||
# Copyright © 2024 rl <rl@rlmbp>
|
||||
#
|
||||
# Distributed under terms of the MIT license.
|
||||
|
||||
"""
|
||||
Airflow DAG for manually clearing (deleting) a specific Redis key used by YTDLP queues.
|
||||
"""
|
||||
|
||||
from airflow import DAG
|
||||
from airflow.exceptions import AirflowException
|
||||
from airflow.models.param import Param
|
||||
from airflow.operators.python import PythonOperator
|
||||
from airflow.providers.redis.hooks.redis import RedisHook
|
||||
from airflow.utils.dates import days_ago
|
||||
from datetime import timedelta
|
||||
import logging
|
||||
import redis # Import redis exceptions if needed
|
||||
|
||||
# Configure logging
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
# Default settings
|
||||
DEFAULT_REDIS_CONN_ID = 'redis_default'
|
||||
# Provide a placeholder default, user MUST specify the queue to clear
|
||||
DEFAULT_QUEUE_TO_CLEAR = 'PLEASE_SPECIFY_QUEUE_TO_CLEAR'
|
||||
|
||||
# --- Helper Function ---
|
||||
|
||||
def _get_redis_client(redis_conn_id):
|
||||
"""Gets a Redis client connection using RedisHook."""
|
||||
try:
|
||||
hook = RedisHook(redis_conn_id=redis_conn_id)
|
||||
client = hook.get_conn()
|
||||
client.ping()
|
||||
logger.info(f"Successfully connected to Redis using connection '{redis_conn_id}'.")
|
||||
return client
|
||||
except redis.exceptions.AuthenticationError:
|
||||
logger.error(f"Redis authentication failed for connection '{redis_conn_id}'. Check password.")
|
||||
raise AirflowException(f"Redis authentication failed for '{redis_conn_id}'.")
|
||||
except Exception as e:
|
||||
logger.error(f"Failed to get Redis client for connection '{redis_conn_id}': {e}")
|
||||
raise AirflowException(f"Redis connection failed for '{redis_conn_id}': {e}")
|
||||
|
||||
# --- Python Callable for Clear Task ---
|
||||
|
||||
def clear_queue_callable(**context):
|
||||
"""Clears (deletes) the specified Redis key (queue/hash)."""
|
||||
params = context['params']
|
||||
redis_conn_id = params['redis_conn_id']
|
||||
queue_to_clear = params['queue_to_clear'] # Specific queue/hash name
|
||||
|
||||
if not queue_to_clear or queue_to_clear == DEFAULT_QUEUE_TO_CLEAR:
|
||||
raise ValueError("Parameter 'queue_to_clear' must be specified and cannot be the default placeholder.")
|
||||
|
||||
logger.info(f"Attempting to clear Redis key '{queue_to_clear}' using connection '{redis_conn_id}'.")
|
||||
try:
|
||||
redis_client = _get_redis_client(redis_conn_id)
|
||||
deleted_count = redis_client.delete(queue_to_clear)
|
||||
if deleted_count > 0:
|
||||
logger.info(f"Successfully cleared Redis key '{queue_to_clear}'.")
|
||||
else:
|
||||
logger.info(f"Redis key '{queue_to_clear}' did not exist or was already empty.")
|
||||
except Exception as e:
|
||||
logger.error(f"Failed to clear Redis key '{queue_to_clear}': {e}", exc_info=True)
|
||||
raise AirflowException(f"Failed to clear Redis key: {e}")
|
||||
|
||||
# --- DAG Definition ---
|
||||
default_args = {
|
||||
'owner': 'airflow',
|
||||
'depends_on_past': False,
|
||||
'email_on_failure': False,
|
||||
'email_on_retry': False,
|
||||
'retries': 0, # No retries for manual clear operation
|
||||
'start_date': days_ago(1)
|
||||
}
|
||||
|
||||
with DAG(
|
||||
dag_id='ytdlp_mgmt_queue_clear',
|
||||
default_args=default_args,
|
||||
schedule_interval=None, # Manually triggered
|
||||
catchup=False,
|
||||
description='Manually clear/delete a specific YTDLP Redis queue/key (inbox, progress, result, fail). Use with caution!',
|
||||
tags=['ytdlp', 'queue', 'management', 'redis', 'manual', 'clear'],
|
||||
params={
|
||||
'redis_conn_id': Param(DEFAULT_REDIS_CONN_ID, type="string", description="Airflow Redis connection ID."),
|
||||
'queue_to_clear': Param(
|
||||
DEFAULT_QUEUE_TO_CLEAR,
|
||||
type="string",
|
||||
description="Exact name of the Redis key to clear (e.g., 'video_queue_inbox_account_xyz', 'video_queue_progress', 'video_queue_result', 'video_queue_fail')."
|
||||
),
|
||||
}
|
||||
) as dag:
|
||||
|
||||
clear_queue_task = PythonOperator(
|
||||
task_id='clear_specified_queue',
|
||||
python_callable=clear_queue_callable,
|
||||
# Params are implicitly passed via context['params']
|
||||
)
|
||||
clear_queue_task.doc_md = """
|
||||
### Clear Specified Queue/Key Task
|
||||
Deletes the Redis key specified by the `queue_to_clear` parameter.
|
||||
This can target any key, including:
|
||||
- `_inbox` (Redis List): Contains URLs waiting to be processed.
|
||||
- `_progress` (Redis Hash): Contains URLs currently being processed.
|
||||
- `_result` (Redis Hash): Contains details of successfully processed URLs.
|
||||
- `_fail` (Redis Hash): Contains details of failed URLs.
|
||||
|
||||
**Warning:** This operation is destructive and cannot be undone. Ensure you specify the correct key name.
|
||||
*Trigger this task manually via the UI.*
|
||||
"""
|
||||
163  dags/ytdlp_mgmt_queue_list_contents.py  (new file)
@@ -0,0 +1,163 @@
|
||||
# -*- coding: utf-8 -*-
|
||||
# vim:fenc=utf-8
|
||||
#
|
||||
# Copyright © 2024 rl <rl@rlmbp>
|
||||
#
|
||||
# Distributed under terms of the MIT license.
|
||||
|
||||
"""
|
||||
Airflow DAG for manually listing the contents of a specific Redis key used by YTDLP queues.
|
||||
"""
|
||||
|
||||
from airflow import DAG
|
||||
from airflow.exceptions import AirflowException
|
||||
from airflow.models.param import Param
|
||||
from airflow.operators.python import PythonOperator
|
||||
from airflow.providers.redis.hooks.redis import RedisHook
|
||||
from airflow.utils.dates import days_ago
|
||||
from datetime import timedelta
|
||||
import logging
|
||||
import json
|
||||
import redis # Import redis exceptions if needed
|
||||
|
||||
# Configure logging
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
# Default settings
|
||||
DEFAULT_REDIS_CONN_ID = 'redis_default'
|
||||
# Default to a common inbox pattern, user should override with the specific key
|
||||
DEFAULT_QUEUE_TO_LIST = 'video_queue_inbox'
|
||||
DEFAULT_MAX_ITEMS = 100 # Limit number of items listed by default
|
||||
|
||||
# --- Helper Function ---
|
||||
|
||||
def _get_redis_client(redis_conn_id):
|
||||
"""Gets a Redis client connection using RedisHook."""
|
||||
try:
|
||||
hook = RedisHook(redis_conn_id=redis_conn_id)
|
||||
# decode_responses=True removed as it's not supported by get_conn in some environments
|
||||
# We will decode manually where needed.
|
||||
client = hook.get_conn()
|
||||
client.ping()
|
||||
logger.info(f"Successfully connected to Redis using connection '{redis_conn_id}'.")
|
||||
return client
|
||||
except redis.exceptions.AuthenticationError:
|
||||
logger.error(f"Redis authentication failed for connection '{redis_conn_id}'. Check password.")
|
||||
raise AirflowException(f"Redis authentication failed for '{redis_conn_id}'.")
|
||||
except Exception as e:
|
||||
logger.error(f"Failed to get Redis client for connection '{redis_conn_id}': {e}")
|
||||
raise AirflowException(f"Redis connection failed for '{redis_conn_id}': {e}")
|
||||
|
||||
# --- Python Callable for List Contents Task ---
|
||||
|
||||
def list_contents_callable(**context):
|
||||
"""Lists the contents of the specified Redis key (list or hash)."""
|
||||
params = context['params']
|
||||
redis_conn_id = params['redis_conn_id']
|
||||
queue_to_list = params['queue_to_list']
|
||||
max_items = params.get('max_items', DEFAULT_MAX_ITEMS)
|
||||
|
||||
if not queue_to_list:
|
||||
raise ValueError("Parameter 'queue_to_list' cannot be empty.")
|
||||
|
||||
logger.info(f"Attempting to list contents of Redis key '{queue_to_list}' (max: {max_items}) using connection '{redis_conn_id}'.")
|
||||
try:
|
||||
redis_client = _get_redis_client(redis_conn_id)
|
||||
key_type_bytes = redis_client.type(queue_to_list)
|
||||
key_type = key_type_bytes.decode('utf-8') # Decode type
|
||||
|
||||
if key_type == 'list':
|
||||
list_length = redis_client.llen(queue_to_list)
|
||||
# Get range, respecting max_items (0 to max_items-1)
|
||||
items_to_fetch = min(max_items, list_length)
|
||||
# lrange returns list of bytes, decode each item
|
||||
contents_bytes = redis_client.lrange(queue_to_list, 0, items_to_fetch - 1)
|
||||
contents = [item.decode('utf-8') for item in contents_bytes]
|
||||
logger.info(f"--- Contents of Redis List '{queue_to_list}' (showing first {len(contents)} of {list_length}) ---")
|
||||
for i, item in enumerate(contents):
|
||||
logger.info(f" [{i}]: {item}") # item is now a string
|
||||
if list_length > len(contents):
|
||||
logger.info(f" ... ({list_length - len(contents)} more items not shown)")
|
||||
logger.info(f"--- End of List Contents ---")
|
||||
# Optionally push contents to XCom if small enough
|
||||
# context['task_instance'].xcom_push(key='list_contents', value=contents)
|
||||
|
||||
elif key_type == 'hash':
|
||||
hash_size = redis_client.hlen(queue_to_list)
|
||||
# HGETALL can be risky for large hashes. Consider HSCAN for production.
|
||||
# For manual inspection, HGETALL is often acceptable.
|
||||
if hash_size > max_items * 2: # Heuristic: avoid huge HGETALL
|
||||
logger.warning(f"Hash '{queue_to_list}' has {hash_size} fields, which is large. Listing might be slow or incomplete. Consider using redis-cli HSCAN.")
|
||||
# Optionally implement HSCAN here for large hashes
|
||||
# hgetall returns dict of bytes keys and bytes values, decode them
|
||||
contents_bytes = redis_client.hgetall(queue_to_list)
|
||||
contents = {k.decode('utf-8'): v.decode('utf-8') for k, v in contents_bytes.items()}
|
||||
logger.info(f"--- Contents of Redis Hash '{queue_to_list}' ({len(contents)} fields) ---")
|
||||
item_count = 0
|
||||
for key, value in contents.items(): # key and value are now strings
|
||||
if item_count >= max_items:
|
||||
logger.info(f" ... (stopped listing after {max_items} items of {hash_size})")
|
||||
break
|
||||
# Attempt to pretty-print if value is JSON
|
||||
try:
|
||||
parsed_value = json.loads(value)
|
||||
pretty_value = json.dumps(parsed_value, indent=2)
|
||||
logger.info(f" '{key}':\n{pretty_value}")
|
||||
except json.JSONDecodeError:
|
||||
logger.info(f" '{key}': {value}") # Print as string if not JSON
|
||||
item_count += 1
|
||||
logger.info(f"--- End of Hash Contents ---")
|
||||
# Optionally push contents to XCom if small enough
|
||||
# context['task_instance'].xcom_push(key='hash_contents', value=contents)
|
||||
|
||||
elif key_type == 'none':
|
||||
logger.info(f"Redis key '{queue_to_list}' does not exist.")
|
||||
else:
|
||||
logger.info(f"Redis key '{queue_to_list}' is of type '{key_type}'. Listing contents for this type is not implemented.")
|
||||
|
||||
except Exception as e:
|
||||
logger.error(f"Failed to list contents of Redis key '{queue_to_list}': {e}", exc_info=True)
|
||||
raise AirflowException(f"Failed to list Redis key contents: {e}")
|
||||
|
||||
# --- DAG Definition ---
|
||||
default_args = {
|
||||
'owner': 'airflow',
|
||||
'depends_on_past': False,
|
||||
'email_on_failure': False,
|
||||
'email_on_retry': False,
|
||||
'retries': 0, # No retries for manual list operation
|
||||
'start_date': days_ago(1)
|
||||
}
|
||||
|
||||
with DAG(
|
||||
dag_id='ytdlp_mgmt_queue_list_contents',
|
||||
default_args=default_args,
|
||||
schedule_interval=None, # Manually triggered
|
||||
catchup=False,
|
||||
description='Manually list the contents of a specific YTDLP Redis queue/key (list or hash).',
|
||||
tags=['ytdlp', 'queue', 'management', 'redis', 'manual', 'list'],
|
||||
params={
|
||||
'redis_conn_id': Param(DEFAULT_REDIS_CONN_ID, type="string", description="Airflow Redis connection ID."),
|
||||
'queue_to_list': Param(
|
||||
DEFAULT_QUEUE_TO_LIST,
|
||||
type="string",
|
||||
description="Exact name of the Redis key (list/hash) to list contents for (e.g., 'video_queue_inbox_account_xyz', 'video_queue_progress', etc.)."
|
||||
),
|
||||
'max_items': Param(DEFAULT_MAX_ITEMS, type="integer", description="Maximum number of items/fields to list from the key."),
|
||||
}
|
||||
) as dag:
|
||||
|
||||
list_contents_task = PythonOperator(
|
||||
task_id='list_specified_queue_contents',
|
||||
python_callable=list_contents_callable,
|
||||
# Params are implicitly passed via context['params']
|
||||
)
|
||||
list_contents_task.doc_md = """
|
||||
### List Specified Queue/Key Contents Task
|
||||
Lists the contents of the Redis key specified by `queue_to_list`.
|
||||
- For **Lists** (e.g., `_inbox`), shows the first `max_items`.
|
||||
- For **Hashes** (e.g., `_progress`, `_result`, `_fail`), shows up to `max_items` key-value pairs. Attempts to pretty-print JSON values.
|
||||
- Logs a warning for very large hashes.
|
||||
|
||||
*Trigger this task manually via the UI.*
|
||||
"""
|
||||
910
dags/ytdlp_proc_sequential_processor.py
Normal file
910
dags/ytdlp_proc_sequential_processor.py
Normal file
@ -0,0 +1,910 @@
|
||||
# -*- coding: utf-8 -*-
|
||||
# vim:fenc=utf-8
|
||||
#
|
||||
# Copyright © 2024 rl <rl@rlmbp>
|
||||
#
|
||||
# Distributed under terms of the MIT license.
|
||||
|
||||
"""
|
||||
DAG for processing YouTube URLs sequentially from a Redis queue using YTDLP Ops Thrift service.
|
||||
"""
|
||||
|
||||
from airflow import DAG
|
||||
from airflow.exceptions import AirflowException, AirflowSkipException, AirflowFailException
|
||||
from airflow.hooks.base import BaseHook
|
||||
from airflow.models import BaseOperator, Variable
|
||||
from airflow.models.param import Param
|
||||
from airflow.operators.bash import BashOperator # Import BashOperator
|
||||
from airflow.operators.python import PythonOperator
|
||||
from airflow.providers.redis.hooks.redis import RedisHook
|
||||
from airflow.utils.dates import days_ago
|
||||
from airflow.utils.decorators import apply_defaults
|
||||
from datetime import datetime, timedelta
|
||||
from pangramia.yt.common.ttypes import TokenUpdateMode
|
||||
from pangramia.yt.exceptions.ttypes import PBServiceException
|
||||
from pangramia.yt.tokens_ops import YTTokenOpService
|
||||
from thrift.protocol import TBinaryProtocol
|
||||
from thrift.transport import TSocket, TTransport
|
||||
from thrift.transport.TTransport import TTransportException
|
||||
import json
|
||||
import logging
|
||||
import os
|
||||
import redis # Import redis exceptions if needed
|
||||
import socket
|
||||
import time
|
||||
import traceback # For logging stack traces in failure handler
|
||||
|
||||
# Configure logging
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
# Default settings
|
||||
DEFAULT_QUEUE_NAME = 'video_queue' # Base name for queues
|
||||
DEFAULT_REDIS_CONN_ID = 'redis_default'
|
||||
DEFAULT_TIMEOUT = 30 # Default Thrift timeout in seconds
|
||||
MAX_RETRIES_REDIS_LOOKUP = 3 # Retries for fetching service details from Redis
|
||||
RETRY_DELAY_REDIS_LOOKUP = 10 # Delay (seconds) for Redis lookup retries
|
||||
|
||||
# --- Helper Functions ---
|
||||
|
||||
def _get_redis_client(redis_conn_id):
|
||||
"""Gets a Redis client connection using RedisHook."""
|
||||
try:
|
||||
hook = RedisHook(redis_conn_id=redis_conn_id)
|
||||
client = hook.get_conn()
|
||||
client.ping()
|
||||
logger.info(f"Successfully connected to Redis using connection '{redis_conn_id}'.")
|
||||
return client
|
||||
except redis.exceptions.AuthenticationError:
|
||||
logger.error(f"Redis authentication failed for connection '{redis_conn_id}'. Check password.")
|
||||
raise AirflowException(f"Redis authentication failed for '{redis_conn_id}'.")
|
||||
except Exception as e:
|
||||
logger.error(f"Failed to get Redis client for connection '{redis_conn_id}': {e}")
|
||||
raise AirflowException(f"Redis connection failed for '{redis_conn_id}': {e}")
|
||||
|
||||
def _extract_video_id(url):
|
||||
"""Extracts YouTube video ID from URL."""
|
||||
if not url or not isinstance(url, str):
|
||||
logger.debug("URL is empty or not a string, cannot extract video ID.")
|
||||
return None
|
||||
try:
|
||||
video_id = None
|
||||
if 'youtube.com/watch?v=' in url:
|
||||
video_id = url.split('v=')[1].split('&')[0]
|
||||
elif 'youtu.be/' in url:
|
||||
video_id = url.split('youtu.be/')[1].split('?')[0]
|
||||
|
||||
if video_id and len(video_id) >= 11:
|
||||
video_id = video_id[:11] # Standard ID length
|
||||
logger.debug(f"Extracted video ID '{video_id}' from URL: {url}")
|
||||
return video_id
|
||||
else:
|
||||
logger.debug(f"Could not extract a standard video ID pattern from URL: {url}")
|
||||
return None
|
||||
except Exception as e:
|
||||
logger.error(f"Failed to extract video ID from URL '{url}'. Error: {e}")
|
||||
return None
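# Example: _extract_video_id("https://www.youtube.com/watch?v=vKTVLpmvznI") returns "vKTVLpmvznI";
# unrecognized URLs return None.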
|
||||
|
||||
# --- Queue Management Callables ---
|
||||
|
||||
def pop_url_from_queue(**context):
|
||||
"""Pops a URL from the inbox queue and pushes to XCom."""
|
||||
params = context['params']
|
||||
queue_name = params['queue_name']
|
||||
inbox_queue = f"{queue_name}_inbox"
|
||||
redis_conn_id = params.get('redis_conn_id', DEFAULT_REDIS_CONN_ID)
|
||||
logger.info(f"Attempting to pop URL from inbox queue: {inbox_queue}")
|
||||
|
||||
try:
|
||||
client = _get_redis_client(redis_conn_id)
|
||||
# LPOP is non-blocking, returns None if empty
|
||||
url_bytes = client.lpop(inbox_queue) # Returns bytes if decode_responses=False on hook/client
|
||||
|
||||
if url_bytes:
|
||||
url = url_bytes.decode('utf-8') if isinstance(url_bytes, bytes) else url_bytes
|
||||
logger.info(f"Popped URL: {url}")
|
||||
context['task_instance'].xcom_push(key='current_url', value=url)
|
||||
return url # Return URL for logging/potential use
|
||||
else:
|
||||
logger.info(f"Inbox queue '{inbox_queue}' is empty. Skipping downstream tasks.")
|
||||
context['task_instance'].xcom_push(key='current_url', value=None)
|
||||
# Raise AirflowSkipException to signal downstream tasks to skip
|
||||
raise AirflowSkipException(f"Inbox queue '{inbox_queue}' is empty.")
|
||||
except AirflowSkipException:
|
||||
raise # Re-raise skip exception
|
||||
except Exception as e:
|
||||
logger.error(f"Error popping URL from Redis queue '{inbox_queue}': {e}", exc_info=True)
|
||||
raise AirflowException(f"Failed to pop URL from Redis: {e}")
|
||||
|
||||
|
||||
def move_url_to_progress(**context):
|
||||
"""Moves the current URL from XCom to the progress hash."""
|
||||
ti = context['task_instance']
|
||||
url = ti.xcom_pull(task_ids='pop_url_from_queue', key='current_url')
|
||||
|
||||
# This task should be skipped if pop_url_from_queue raised AirflowSkipException
|
||||
# Adding check for robustness
|
||||
if not url:
|
||||
logger.info("No URL found in XCom (or upstream skipped). Skipping move to progress.")
|
||||
raise AirflowSkipException("No URL to process.")
|
||||
|
||||
params = context['params']
|
||||
queue_name = params['queue_name']
|
||||
progress_queue = f"{queue_name}_progress"
|
||||
redis_conn_id = params.get('redis_conn_id', DEFAULT_REDIS_CONN_ID)
|
||||
logger.info(f"Moving URL '{url}' to progress hash: {progress_queue}")
|
||||
|
||||
progress_data = {
|
||||
'status': 'processing',
|
||||
'start_time': time.time(),
|
||||
'dag_run_id': context['dag_run'].run_id,
|
||||
'task_instance_key_str': context['task_instance_key_str']
|
||||
}
|
||||
|
||||
try:
|
||||
client = _get_redis_client(redis_conn_id)
|
||||
client.hset(progress_queue, url, json.dumps(progress_data))
|
||||
logger.info(f"Moved URL '{url}' to progress hash '{progress_queue}'.")
|
||||
except Exception as e:
|
||||
logger.error(f"Error moving URL to Redis progress hash '{progress_queue}': {e}", exc_info=True)
|
||||
# If this fails, the URL is popped but not tracked as processing. Fail the task.
|
||||
raise AirflowException(f"Failed to move URL to progress hash: {e}")
|
||||
|
||||
|
||||
def handle_success(**context):
|
||||
"""Moves URL from progress to result hash on success."""
|
||||
ti = context['task_instance']
|
||||
url = ti.xcom_pull(task_ids='pop_url_from_queue', key='current_url')
|
||||
if not url:
|
||||
logger.warning("handle_success called but no URL found from pop_url_from_queue XCom. This shouldn't happen on success path.")
|
||||
return # Or raise error
|
||||
|
||||
params = context['params']
|
||||
queue_name = params['queue_name']
|
||||
progress_queue = f"{queue_name}_progress"
|
||||
result_queue = f"{queue_name}_result"
|
||||
redis_conn_id = params.get('redis_conn_id', DEFAULT_REDIS_CONN_ID)
|
||||
|
||||
# Pull results from get_token task
|
||||
info_json_path = ti.xcom_pull(task_ids='get_token', key='info_json_path')
|
||||
socks_proxy = ti.xcom_pull(task_ids='get_token', key='socks_proxy')
|
||||
ytdlp_command = ti.xcom_pull(task_ids='get_token', key='ytdlp_command') # Original command
|
||||
|
||||
logger.info(f"Handling success for URL: {url}")
|
||||
logger.info(f" Info JSON Path: {info_json_path}")
|
||||
logger.info(f" SOCKS Proxy: {socks_proxy}")
|
||||
logger.info(f" YTDLP Command: {ytdlp_command[:100] if ytdlp_command else 'None'}...") # Log truncated command
|
||||
|
||||
result_data = {
|
||||
'status': 'success',
|
||||
'end_time': time.time(),
|
||||
'info_json_path': info_json_path,
|
||||
'socks_proxy': socks_proxy,
|
||||
'ytdlp_command': ytdlp_command,
|
||||
'url': url,
|
||||
'dag_run_id': context['dag_run'].run_id,
|
||||
'task_instance_key_str': context['task_instance_key_str'] # Record which task instance succeeded
|
||||
}
|
||||
|
||||
try:
|
||||
client = _get_redis_client(redis_conn_id)
|
||||
# Remove from progress hash
|
||||
removed_count = client.hdel(progress_queue, url)
|
||||
if removed_count > 0:
|
||||
logger.info(f"Removed URL '{url}' from progress hash '{progress_queue}'.")
|
||||
else:
|
||||
logger.warning(f"URL '{url}' not found in progress hash '{progress_queue}' during success handling.")
|
||||
|
||||
# Add to result hash
|
||||
client.hset(result_queue, url, json.dumps(result_data))
|
||||
logger.info(f"Stored success result for URL '{url}' in result hash '{result_queue}'.")
|
||||
|
||||
except Exception as e:
|
||||
logger.error(f"Error handling success in Redis for URL '{url}': {e}", exc_info=True)
|
||||
# Even if Redis fails, the task succeeded. Log error but don't fail the task.
|
||||
# Consider adding retry logic for Redis operations here or marking state differently.
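# To inspect a stored result later (illustrative):
#   redis-cli HGET video_queue_result "https://www.youtube.com/watch?v=vKTVLpmvznI"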
|
||||
|
||||
|
||||
def handle_failure(**context):
|
||||
"""Moves URL from progress to fail hash on failure."""
|
||||
ti = context['task_instance']
|
||||
url = ti.xcom_pull(task_ids='pop_url_from_queue', key='current_url')
|
||||
if not url:
|
||||
logger.error("handle_failure called but no URL found from pop_url_from_queue XCom.")
|
||||
# Cannot move to fail queue if URL is unknown
|
||||
return
|
||||
|
||||
params = context['params']
|
||||
queue_name = params['queue_name']
|
||||
progress_queue = f"{queue_name}_progress"
|
||||
fail_queue = f"{queue_name}_fail"
|
||||
redis_conn_id = params.get('redis_conn_id', DEFAULT_REDIS_CONN_ID)
|
||||
|
||||
# Get the failure reason from the context. Note: 'exception' is only populated for
# on_failure_callback contexts; when this runs as a separate task with
# trigger_rule='one_failed', it is usually absent, so check the failed upstream task's logs.
exception = context.get('exception')
error_message = str(exception) if exception else "Unknown error (see failed upstream task logs)"
# Get traceback if available
tb_str = traceback.format_exc() if exception else "No traceback available."
|
||||
|
||||
logger.info(f"Handling failure for URL: {url}")
|
||||
logger.error(f" Failure Reason: {error_message}") # Log the error that triggered failure
|
||||
logger.debug(f" Traceback:\n{tb_str}") # Log traceback at debug level
|
||||
|
||||
fail_data = {
|
||||
'status': 'failed',
|
||||
'end_time': time.time(),
|
||||
'error': error_message,
|
||||
'traceback': tb_str, # Store traceback
|
||||
'url': url,
|
||||
'dag_run_id': context['dag_run'].run_id,
|
||||
'task_instance_key_str': context['task_instance_key_str'] # Record which task instance failed
|
||||
}
|
||||
|
||||
try:
|
||||
client = _get_redis_client(redis_conn_id)
|
||||
# Remove from progress hash
|
||||
removed_count = client.hdel(progress_queue, url)
|
||||
if removed_count > 0:
|
||||
logger.info(f"Removed URL '{url}' from progress hash '{progress_queue}'.")
|
||||
else:
|
||||
logger.warning(f"URL '{url}' not found in progress hash '{progress_queue}' during failure handling.")
|
||||
|
||||
# Add to fail hash
|
||||
client.hset(fail_queue, url, json.dumps(fail_data))
|
||||
logger.info(f"Stored failure details for URL '{url}' in fail hash '{fail_queue}'.")
|
||||
|
||||
except Exception as e:
|
||||
logger.error(f"Error handling failure in Redis for URL '{url}': {e}", exc_info=True)
|
||||
# Log error, but the task already failed.
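# To retry a failed URL manually (illustrative), move it back to the inbox:
#   redis-cli HDEL video_queue_fail "<url>"
#   redis-cli RPUSH video_queue_inbox "<url>"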
|
||||
|
||||
|
||||
# --- YtdlpOpsOperator ---
|
||||
|
||||
class YtdlpOpsOperator(BaseOperator):
|
||||
"""
|
||||
Custom Airflow operator to interact with YTDLP Thrift service. Handles direct connections
|
||||
and Redis-based discovery, retrieves tokens, saves info.json, and manages errors.
|
||||
Modified to pull URL from XCom for sequential processing.
|
||||
"""
|
||||
# Removed 'url' from template_fields as it's pulled from XCom
|
||||
template_fields = ('service_ip', 'service_port', 'account_id', 'timeout', 'info_json_dir', 'redis_conn_id')
|
||||
|
||||
@apply_defaults
|
||||
def __init__(self,
|
||||
# url parameter removed - will be pulled from XCom
|
||||
redis_conn_id=DEFAULT_REDIS_CONN_ID,
|
||||
max_retries_lookup=MAX_RETRIES_REDIS_LOOKUP,
|
||||
retry_delay_lookup=RETRY_DELAY_REDIS_LOOKUP,
|
||||
service_ip=None,
|
||||
service_port=None,
|
||||
redis_enabled=False, # Default to direct connection now
|
||||
account_id=None,
|
||||
# save_info_json removed, always True
|
||||
info_json_dir=None,
|
||||
# get_socks_proxy removed, always True
|
||||
# store_socks_proxy removed, always True
|
||||
# get_socks_proxy=True, # Removed
|
||||
# store_socks_proxy=True, # Store proxy in XCom by default # Removed
|
||||
timeout=DEFAULT_TIMEOUT,
|
||||
*args, **kwargs):
|
||||
super().__init__(*args, **kwargs)
|
||||
|
||||
logger.info(f"Initializing YtdlpOpsOperator (Processor Version) with parameters: "
|
||||
f"redis_conn_id={redis_conn_id}, max_retries_lookup={max_retries_lookup}, retry_delay_lookup={retry_delay_lookup}, "
|
||||
f"service_ip={service_ip}, service_port={service_port}, redis_enabled={redis_enabled}, "
|
||||
f"account_id={account_id}, info_json_dir={info_json_dir}, timeout={timeout}")
|
||||
# save_info_json, get_socks_proxy, store_socks_proxy removed from log
|
||||
|
||||
# Validate parameters based on connection mode
|
||||
if redis_enabled:
|
||||
# If using Redis, account_id is essential for lookup
|
||||
if not account_id:
|
||||
raise ValueError("account_id is required when redis_enabled=True for service lookup.")
|
||||
else:
|
||||
# If direct connection, IP and Port are essential
|
||||
if not service_ip or not service_port:
|
||||
raise ValueError("Both service_ip and service_port must be specified when redis_enabled=False.")
|
||||
# Account ID is still needed for the API call itself, but rely on DAG param or operator config
|
||||
if not account_id:
|
||||
logger.warning("No account_id provided for direct connection mode. Ensure it's set in DAG params or operator config.")
|
||||
# We won't assign 'default' here, let the value passed during instantiation be used.
|
||||
|
||||
# self.url is no longer needed here
|
||||
self.redis_conn_id = redis_conn_id
|
||||
self.max_retries_lookup = max_retries_lookup
|
||||
self.retry_delay_lookup = int(retry_delay_lookup.total_seconds() if isinstance(retry_delay_lookup, timedelta) else retry_delay_lookup)
|
||||
self.service_ip = service_ip
|
||||
self.service_port = service_port
|
||||
self.redis_enabled = redis_enabled
|
||||
self.account_id = account_id
|
||||
# self.save_info_json removed
|
||||
self.info_json_dir = info_json_dir # Still needed
|
||||
# self.get_socks_proxy removed
|
||||
# self.store_socks_proxy removed
|
||||
self.timeout = timeout
|
||||
|
||||
def execute(self, context):
|
||||
logger.info("Executing YtdlpOpsOperator (Processor Version)")
|
||||
transport = None
|
||||
ti = context['task_instance'] # Get task instance for XCom access
|
||||
|
||||
try:
|
||||
# --- Get URL from XCom ---
|
||||
url = ti.xcom_pull(task_ids='pop_url_from_queue', key='current_url')
|
||||
if not url:
|
||||
# This should ideally be caught by upstream skip, but handle defensively
|
||||
logger.info("No URL found in XCom from pop_url_from_queue. Skipping execution.")
|
||||
raise AirflowSkipException("Upstream task did not provide a URL.")
|
||||
logger.info(f"Processing URL from XCom: {url}")
|
||||
# --- End Get URL ---
|
||||
|
||||
logger.info("Getting task parameters and rendering templates")
|
||||
params = context['params'] # DAG run params
|
||||
|
||||
# Render template fields using context
|
||||
# Use render_template_as_native for better type handling if needed, else render_template
|
||||
redis_conn_id = self.render_template(self.redis_conn_id, context)
|
||||
service_ip = self.render_template(self.service_ip, context)
|
||||
service_port_rendered = self.render_template(self.service_port, context)
|
||||
account_id = self.render_template(self.account_id, context)
|
||||
timeout_rendered = self.render_template(self.timeout, context)
|
||||
info_json_dir = self.render_template(self.info_json_dir, context) # Rendered here for _save_info_json
|
||||
|
||||
# Determine effective settings (DAG params override operator defaults)
|
||||
redis_enabled = params.get('redis_enabled', self.redis_enabled)
|
||||
account_id = params.get('account_id', account_id) # Use DAG param if provided
|
||||
redis_conn_id = params.get('redis_conn_id', redis_conn_id) # Use DAG param if provided
|
||||
|
||||
logger.info(f"Effective settings: redis_enabled={redis_enabled}, account_id='{account_id}', redis_conn_id='{redis_conn_id}'")
|
||||
|
||||
host = None
|
||||
port = None
|
||||
|
||||
if redis_enabled:
|
||||
# Get Redis connection using the helper for consistency
|
||||
redis_client = _get_redis_client(redis_conn_id)
|
||||
logger.info(f"Successfully connected to Redis using connection '{redis_conn_id}' for service discovery.")
|
||||
|
||||
# Get service details from Redis with retries
|
||||
service_key = f"ytdlp:{account_id}"
|
||||
legacy_key = account_id # For backward compatibility
|
||||
|
||||
for attempt in range(self.max_retries_lookup):
|
||||
try:
|
||||
logger.info(f"Attempt {attempt + 1}/{self.max_retries_lookup}: Fetching service details from Redis for keys: '{service_key}', '{legacy_key}'")
|
||||
service_details = redis_client.hgetall(service_key)
|
||||
if not service_details:
|
||||
logger.warning(f"Key '{service_key}' not found, trying legacy key '{legacy_key}'")
|
||||
service_details = redis_client.hgetall(legacy_key)
|
||||
|
||||
if not service_details:
|
||||
raise ValueError(f"No service details found in Redis for keys: {service_key} or {legacy_key}")
|
||||
|
||||
# Find IP and port (case-insensitive keys)
|
||||
ip_key = next((k for k in service_details if k.lower() == 'ip'), None)
|
||||
port_key = next((k for k in service_details if k.lower() == 'port'), None)
|
||||
|
||||
if not ip_key: raise ValueError(f"'ip' key not found in Redis hash for {service_key}/{legacy_key}")
|
||||
if not port_key: raise ValueError(f"'port' key not found in Redis hash for {service_key}/{legacy_key}")
|
||||
|
||||
host = service_details[ip_key] # Assumes decode_responses=True in hook
|
||||
port_str = service_details[port_key]
|
||||
|
||||
try:
|
||||
port = int(port_str)
|
||||
except (ValueError, TypeError):
|
||||
raise ValueError(f"Invalid port value '{port_str}' found in Redis for {service_key}/{legacy_key}")
|
||||
|
||||
logger.info(f"Extracted from Redis - Service IP: {host}, Service Port: {port}")
|
||||
break # Success
|
||||
|
||||
except Exception as e:
|
||||
logger.warning(f"Attempt {attempt + 1} failed to get Redis details: {str(e)}")
|
||||
if attempt == self.max_retries_lookup - 1:
|
||||
logger.error("Max retries reached for fetching Redis details.")
|
||||
raise AirflowException(f"Failed to get service details from Redis after {self.max_retries_lookup} attempts: {e}")
|
||||
logger.info(f"Retrying in {self.retry_delay_lookup} seconds...")
|
||||
time.sleep(self.retry_delay_lookup)
|
||||
else:
|
||||
# Direct connection: Use rendered/param values
|
||||
host = params.get('service_ip', service_ip) # Use DAG param if provided
|
||||
port_str = params.get('service_port', service_port_rendered) # Use DAG param if provided
|
||||
|
||||
logger.info(f"Using direct connection settings: service_ip={host}, service_port={port_str}")
|
||||
|
||||
if not host or not port_str:
|
||||
raise ValueError("Direct connection requires service_ip and service_port (check Operator config and DAG params)")
|
||||
try:
|
||||
port = int(port_str)
|
||||
except (ValueError, TypeError):
|
||||
raise ValueError(f"Invalid service_port value: {port_str}")
|
||||
|
||||
logger.info(f"Connecting directly to Thrift service at {host}:{port} (Redis bypassed)")
|
||||
|
||||
# Validate and use timeout
|
||||
try:
|
||||
timeout = int(timeout_rendered)
|
||||
if timeout <= 0: raise ValueError("Timeout must be positive")
|
||||
logger.info(f"Using timeout: {timeout} seconds")
|
||||
except (ValueError, TypeError):
|
||||
logger.warning(f"Invalid timeout value: '{timeout_rendered}'. Using default: {DEFAULT_TIMEOUT}")
|
||||
timeout = DEFAULT_TIMEOUT
|
||||
|
||||
# Create Thrift connection objects
|
||||
# socket_conn = TSocket.TSocket(host, port) # Original
|
||||
socket_conn = TSocket.TSocket(host, port, socket_family=socket.AF_INET) # Explicitly use AF_INET (IPv4)
|
||||
socket_conn.setTimeout(timeout * 1000) # Thrift timeout is in milliseconds
|
||||
transport = TTransport.TFramedTransport(socket_conn) # Use TFramedTransport if server expects it
|
||||
# transport = TTransport.TBufferedTransport(socket_conn) # Use TBufferedTransport if server expects it
|
||||
protocol = TBinaryProtocol.TBinaryProtocol(transport)
|
||||
client = YTTokenOpService.Client(protocol)
|
||||
|
||||
logger.info(f"Attempting to connect to Thrift server at {host}:{port}...")
|
||||
try:
|
||||
transport.open()
|
||||
logger.info("Successfully connected to Thrift server.")
|
||||
|
||||
# Test connection with ping
|
||||
try:
|
||||
client.ping()
|
||||
logger.info("Server ping successful.")
|
||||
except Exception as e:
|
||||
logger.error(f"Server ping failed: {e}")
|
||||
raise AirflowException(f"Server connection test (ping) failed: {e}")
|
||||
|
||||
# Get token from service using the URL from XCom
|
||||
try:
|
||||
logger.info(f"Requesting token for accountId='{account_id}', url='{url}'")
|
||||
token_data = client.getOrRefreshToken(
|
||||
accountId=account_id,
|
||||
updateType=TokenUpdateMode.AUTO,
|
||||
url=url # Use the url variable from XCom
|
||||
)
|
||||
logger.info("Successfully retrieved token data from service.")
|
||||
except PBServiceException as e:
|
||||
# Handle specific service exceptions
|
||||
error_code = getattr(e, 'errorCode', 'N/A')
|
||||
error_message = getattr(e, 'message', 'N/A')
|
||||
error_context = getattr(e, 'context', {})
|
||||
logger.error(f"PBServiceException occurred: Code={error_code}, Message={error_message}")
|
||||
if error_context:
|
||||
logger.error(f" Context: {error_context}") # Log context separately
|
||||
# Construct a concise error message for AirflowException
|
||||
error_msg = f"YTDLP service error (Code: {error_code}): {error_message}"
|
||||
# Add specific error code handling if needed...
|
||||
logger.error(f"Failing task instance due to PBServiceException: {error_msg}") # Add explicit log before raising
|
||||
raise AirflowException(error_msg) # Fail task on service error
|
||||
except TTransportException as e:
|
||||
logger.error(f"Thrift transport error during getOrRefreshToken: {e}")
|
||||
logger.error(f"Failing task instance due to TTransportException: {e}") # Add explicit log before raising
|
||||
raise AirflowException(f"Transport error during API call: {e}")
|
||||
except Exception as e:
|
||||
logger.error(f"Unexpected error during getOrRefreshToken: {e}")
|
||||
logger.error(f"Failing task instance due to unexpected error during API call: {e}") # Add explicit log before raising
|
||||
raise AirflowException(f"Unexpected error during API call: {e}")
|
||||
|
||||
except TTransportException as e:
|
||||
# Handle connection errors
|
||||
logger.error(f"Thrift transport error during connection: {str(e)}")
|
||||
logger.error(f"Failing task instance due to TTransportException during connection: {e}") # Add explicit log before raising
|
||||
raise AirflowException(f"Transport error connecting to YTDLP service: {str(e)}")
|
||||
# Removed the overly broad except Exception block here, as inner blocks raise AirflowException
|
||||
|
||||
# --- Process Token Data ---
|
||||
logger.debug(f"Token data received. Attributes: {dir(token_data)}")
|
||||
|
||||
info_json_path = None # Initialize
|
||||
|
||||
# save_info_json is now always True
|
||||
logger.info("Proceeding to save info.json (save_info_json=True).")
|
||||
info_json = self._get_info_json(token_data)
|
||||
if info_json and self._is_valid_json(info_json):
|
||||
try:
|
||||
# Pass rendered info_json_dir to helper
|
||||
info_json_path = self._save_info_json(context, info_json, url, account_id, info_json_dir)
|
||||
if info_json_path:
|
||||
ti.xcom_push(key='info_json_path', value=info_json_path)
|
||||
logger.info(f"Successfully saved info.json and pushed path to XCom: {info_json_path}")
|
||||
else:
|
||||
ti.xcom_push(key='info_json_path', value=None)
|
||||
logger.warning("info.json saving failed (check logs from _save_info_json).")
|
||||
except Exception as e:
|
||||
logger.error(f"Unexpected error during info.json saving process: {e}", exc_info=True)
|
||||
ti.xcom_push(key='info_json_path', value=None)
|
||||
elif info_json:
|
||||
logger.warning("Retrieved infoJson is not valid JSON. Skipping save.")
|
||||
ti.xcom_push(key='info_json_path', value=None)
|
||||
else:
|
||||
logger.info("No infoJson found in token data. Skipping save.")
|
||||
ti.xcom_push(key='info_json_path', value=None)
|
||||
|
||||
|
||||
# Extract and potentially store SOCKS proxy
|
||||
# get_socks_proxy and store_socks_proxy are now always True
|
||||
socks_proxy = None
|
||||
logger.info("Attempting to extract SOCKS proxy (get_socks_proxy=True).")
|
||||
proxy_attr = next((attr for attr in ['socks5Proxy', 'socksProxy', 'socks'] if hasattr(token_data, attr)), None)
|
||||
if proxy_attr:
|
||||
socks_proxy = getattr(token_data, proxy_attr)
|
||||
if socks_proxy:
|
||||
logger.info(f"Extracted SOCKS proxy ({proxy_attr}): {socks_proxy}")
|
||||
# Always store if found (store_socks_proxy=True)
|
||||
ti.xcom_push(key='socks_proxy', value=socks_proxy)
|
||||
logger.info("Pushed 'socks_proxy' to XCom.")
|
||||
else:
|
||||
logger.info(f"Found proxy attribute '{proxy_attr}' but value is empty.")
|
||||
# Store None if attribute found but empty
|
||||
ti.xcom_push(key='socks_proxy', value=None)
|
||||
logger.info("Pushed None to XCom for 'socks_proxy' as extracted value was empty.")
|
||||
else:
|
||||
logger.info("No SOCKS proxy attribute found in token data.")
|
||||
# Store None if attribute not found
|
||||
ti.xcom_push(key='socks_proxy', value=None)
|
||||
logger.info("Pushed None to XCom for 'socks_proxy' as attribute was not found.")
|
||||
|
||||
|
||||
# --- Removed old logic block ---
|
||||
# # Extract and potentially store SOCKS proxy
|
||||
# socks_proxy = None
|
||||
# get_socks_proxy = params.get('get_socks_proxy', self.get_socks_proxy)
|
||||
# store_socks_proxy = params.get('store_socks_proxy', self.store_socks_proxy)
|
||||
#
|
||||
# if get_socks_proxy:
|
||||
# proxy_attr = next((attr for attr in ['socks5Proxy', 'socksProxy', 'socks'] if hasattr(token_data, attr)), None)
|
||||
# if proxy_attr:
|
||||
# socks_proxy = getattr(token_data, proxy_attr)
|
||||
# if socks_proxy:
|
||||
# logger.info(f"Extracted SOCKS proxy ({proxy_attr}): {socks_proxy}")
|
||||
# if store_socks_proxy:
|
||||
# ti.xcom_push(key='socks_proxy', value=socks_proxy)
|
||||
# logger.info("Pushed 'socks_proxy' to XCom.")
|
||||
# else:
|
||||
# logger.info(f"Found proxy attribute '{proxy_attr}' but value is empty.")
|
||||
# if store_socks_proxy: ti.xcom_push(key='socks_proxy', value=None)
|
||||
# else:
|
||||
# logger.info("get_socks_proxy is True, but no SOCKS proxy attribute found.")
|
||||
# if store_socks_proxy: ti.xcom_push(key='socks_proxy', value=None)
|
||||
# else:
|
||||
# logger.info("get_socks_proxy is False. Skipping proxy extraction.")
|
||||
# if store_socks_proxy: ti.xcom_push(key='socks_proxy', value=None)
|
||||
# --- End Removed old logic block ---
|
||||
|
||||
|
||||
# Get the original command from the server
|
||||
ytdlp_cmd = getattr(token_data, 'ytdlpCommand', None)
|
||||
if not ytdlp_cmd:
|
||||
logger.error("No 'ytdlpCommand' attribute found in token data.")
|
||||
raise AirflowException("Required 'ytdlpCommand' not received from service.")
|
||||
|
||||
logger.info(f"Original command received from server: {ytdlp_cmd[:100]}...") # Log truncated
|
||||
|
||||
# Push the *original* command to XCom
|
||||
ti.xcom_push(key='ytdlp_command', value=ytdlp_cmd)
|
||||
logger.info("Pushed original command to XCom key 'ytdlp_command'.")
|
||||
|
||||
# No explicit return needed, success is implicit if no exception raised
|
||||
|
||||
except (AirflowSkipException, AirflowFailException) as e:
|
||||
logger.info(f"Task skipped or failed explicitly: {e}")
|
||||
raise # Re-raise to let Airflow handle state
|
||||
except AirflowException as e: # Catch AirflowExceptions raised explicitly
|
||||
logger.error(f"Operation failed due to AirflowException: {e}", exc_info=True)
|
||||
raise # Re-raise AirflowExceptions to ensure task failure
|
||||
except (TTransportException, PBServiceException) as e: # Catch specific Thrift/Service errors not already handled inside inner try
|
||||
logger.error(f"Unhandled YTDLP Service/Transport error in outer block: {e}", exc_info=True)
|
||||
logger.error(f"Failing task instance due to unhandled outer Service/Transport error: {e}") # Add explicit log before raising
|
||||
raise AirflowException(f"Unhandled YTDLP service error: {e}") # Wrap in AirflowException to fail task
|
||||
except Exception as e: # General catch-all for truly unexpected errors
|
||||
logger.error(f"Caught unexpected error in YtdlpOpsOperator outer block: {e}", exc_info=True)
|
||||
logger.error(f"Failing task instance due to unexpected outer error: {e}") # Add explicit log before raising
|
||||
raise AirflowException(f"Unexpected error caused task failure: {e}") # Wrap to fail task
|
||||
finally:
|
||||
if transport and transport.isOpen():
|
||||
logger.info("Closing Thrift transport.")
|
||||
transport.close()
|
||||
|
||||
# --- Helper Methods ---
|
||||
|
||||
def _get_info_json(self, token_data):
|
||||
"""Safely extracts infoJson from token data."""
|
||||
return getattr(token_data, 'infoJson', None)
|
||||
|
||||
def _is_valid_json(self, json_str):
|
||||
"""Checks if a string is valid JSON."""
|
||||
if not json_str or not isinstance(json_str, str): return False
|
||||
try:
|
||||
json.loads(json_str)
|
||||
return True
|
||||
except json.JSONDecodeError:
|
||||
return False
|
||||
|
||||
def _save_info_json(self, context, info_json, url, account_id, rendered_info_json_dir):
|
||||
"""Saves info_json to a file. Uses pre-rendered directory path."""
|
||||
try:
|
||||
video_id = _extract_video_id(url) # Use standalone helper
|
||||
|
||||
save_dir = rendered_info_json_dir or "." # Use rendered path
|
||||
logger.info(f"Target directory for info.json: {save_dir}")
|
||||
|
||||
# Ensure directory exists
|
||||
try:
|
||||
os.makedirs(save_dir, exist_ok=True)
|
||||
logger.info(f"Ensured directory exists: {save_dir}")
|
||||
except OSError as e:
|
||||
logger.error(f"Could not create directory {save_dir}: {e}. Cannot save info.json.")
|
||||
return None
|
||||
|
||||
# Construct filename
|
||||
timestamp = int(time.time())
|
||||
base_filename = f"info_{video_id or 'unknown'}_{account_id}_{timestamp}.json"
|
||||
info_json_path = os.path.join(save_dir, base_filename)
|
||||
latest_json_path = os.path.join(save_dir, "latest.json") # Path for the latest symlink/copy
|
||||
|
||||
# Write to timestamped file
|
||||
try:
|
||||
logger.info(f"Writing info.json content (received from service) to {info_json_path}...")
|
||||
with open(info_json_path, 'w', encoding='utf-8') as f:
|
||||
f.write(info_json)
|
||||
logger.info(f"Successfully saved info.json to timestamped file: {info_json_path}")
|
||||
except IOError as e:
|
||||
logger.error(f"Failed to write info.json to {info_json_path}: {e}")
|
||||
return None
|
||||
|
||||
# Write to latest.json (overwrite) - best effort
|
||||
try:
|
||||
with open(latest_json_path, 'w', encoding='utf-8') as f:
|
||||
f.write(info_json)
|
||||
logger.info(f"Updated latest.json file: {latest_json_path}")
|
||||
except IOError as e:
|
||||
logger.warning(f"Failed to update latest.json at {latest_json_path}: {e}")
|
||||
|
||||
return info_json_path
|
||||
|
||||
except Exception as e:
|
||||
logger.error(f"Unexpected error in _save_info_json: {e}", exc_info=True)
|
||||
return None
|
||||
|
||||
|
||||
# =============================================================================
|
||||
# DAG Definition
|
||||
# =============================================================================
|
||||
|
||||
default_args = {
|
||||
'owner': 'airflow',
|
||||
'depends_on_past': False,
|
||||
'email_on_failure': False,
|
||||
'email_on_retry': False,
|
||||
'retries': 1, # Default retries for tasks like queue management
|
||||
'retry_delay': timedelta(minutes=1),
|
||||
'start_date': days_ago(1),
|
||||
# For strictly sequential processing, set concurrency controls on the DAG object itself
# (not in default_args), e.g. max_active_runs=1 and concurrency=1 (max_active_tasks in newer Airflow).
|
||||
}
|
||||
|
||||
# Define DAG
|
||||
with DAG(
|
||||
dag_id='ytdlp_proc_sequential_processor', # New DAG ID
|
||||
default_args=default_args,
|
||||
schedule_interval=None, # Manually triggered or triggered by external sensor/event
|
||||
catchup=False,
|
||||
description='Processes YouTube URLs sequentially from a Redis queue using YTDLP Ops.',
|
||||
tags=['ytdlp', 'thrift', 'client', 'sequential', 'queue', 'processor'], # Updated tags
|
||||
params={
|
||||
# Define DAG parameters
|
||||
'queue_name': Param(DEFAULT_QUEUE_NAME, type="string", description="Base name for Redis queues (e.g., 'video_queue' -> video_queue_inbox, video_queue_progress, etc.)."),
|
||||
'redis_conn_id': Param(DEFAULT_REDIS_CONN_ID, type="string", description="Airflow Redis connection ID."),
|
||||
# YtdlpOpsOperator specific params (can be overridden at task level if needed)
|
||||
'redis_enabled': Param(False, type="boolean", description="Use Redis for service discovery? If False, uses service_ip/port."), # Default changed to False
|
||||
'service_ip': Param(None, type=["null", "string"], description="Required Service IP if redis_enabled=False."), # Clarified requirement
|
||||
'service_port': Param(None, type=["null", "integer"], description="Required Service port if redis_enabled=False."), # Clarified requirement
|
||||
'account_id': Param('default_account', type="string", description="Account ID for the API call (used for Redis lookup if redis_enabled=True)."), # Clarified usage
|
||||
'timeout': Param(DEFAULT_TIMEOUT, type="integer", description="Timeout in seconds for the Thrift connection."),
|
||||
# save_info_json removed, always True
|
||||
# get_socks_proxy removed, always True
|
||||
# store_socks_proxy removed, always True
|
||||
# Download specific parameters
|
||||
'download_format': Param(
|
||||
# Default to best audio-only format (e.g., m4a)
|
||||
'ba[ext=m4a]/bestaudio/best',
|
||||
type="string",
|
||||
description="yt-dlp format selection string (e.g., 'ba' for best audio, 'wv*+wa/w' for worst video+audio)."
|
||||
),
|
||||
'output_path_template': Param(
|
||||
# Simplified template, removed queue_name subdir
|
||||
"{{ var.value.get('DOWNLOADS_TEMP', '/opt/airflow/downloads') }}/%(title)s [%(id)s].%(ext)s",
|
||||
type="string",
|
||||
description="yt-dlp output template (e.g., '/path/to/downloads/%(title)s.%(ext)s'). Uses Airflow Variable 'DOWNLOADS_TEMP'."
|
||||
),
|
||||
# Simplified info_json_dir, just uses DOWNLOADS_TEMP variable
|
||||
'info_json_dir': Param(
|
||||
"{{ var.value.get('DOWNLOADS_TEMP', '/opt/airflow/downloadfiles') }}",
|
||||
type="string",
|
||||
description="Directory to save info.json. Uses Airflow Variable 'DOWNLOADS_TEMP'."
|
||||
)
|
||||
}
|
||||
) as dag:
|
||||
|
||||
# --- Task Definitions ---
|
||||
|
||||
pop_url = PythonOperator(
|
||||
task_id='pop_url_from_queue',
|
||||
python_callable=pop_url_from_queue,
|
||||
# Params are implicitly passed via context
|
||||
)
|
||||
pop_url.doc_md = """
|
||||
### Pop URL from Inbox Queue
|
||||
Pops the next available URL from the `{{ params.queue_name }}_inbox` Redis list.
|
||||
Pushes the URL to XCom key `current_url`.
|
||||
If the queue is empty, raises `AirflowSkipException` to skip downstream tasks.
|
||||
"""
|
||||
|
||||
move_to_progress = PythonOperator(
|
||||
task_id='move_url_to_progress',
|
||||
python_callable=move_url_to_progress,
|
||||
trigger_rule='all_success', # Only run if pop_url succeeded (didn't skip)
|
||||
)
|
||||
move_to_progress.doc_md = """
|
||||
### Move URL to Progress Hash
|
||||
Retrieves the `current_url` from XCom (pushed by `pop_url_from_queue`).
|
||||
Adds the URL as a key to the `{{ params.queue_name }}_progress` Redis hash with status 'processing'.
|
||||
This task is skipped if `pop_url_from_queue` was skipped.
|
||||
"""
|
||||
|
||||
# YtdlpOpsOperator task to get the token
|
||||
get_token = YtdlpOpsOperator(
|
||||
task_id='get_token',
|
||||
# Operator params are inherited from DAG params by default,
|
||||
# but can be overridden here if needed.
|
||||
# We rely on the operator pulling the URL from XCom internally.
|
||||
# Pass DAG params explicitly to ensure they are used if overridden
|
||||
redis_conn_id="{{ params.redis_conn_id }}",
|
||||
redis_enabled="{{ params.redis_enabled }}",  # Not a template field; the effective value is re-read from context['params'] at execute time.
|
||||
service_ip="{{ params.service_ip }}",
|
||||
service_port="{{ params.service_port }}",
|
||||
account_id="{{ params.account_id }}",
|
||||
timeout="{{ params.timeout }}",
|
||||
# save_info_json removed
|
||||
info_json_dir="{{ params.info_json_dir }}", # Pass the simplified path template
|
||||
# get_socks_proxy removed
|
||||
# store_socks_proxy removed
|
||||
retries=0, # Set operator retries to 0; failure handled by branching/failure handler
|
||||
trigger_rule='all_success', # Only run if move_to_progress succeeded
|
||||
)
|
||||
get_token.doc_md = """
|
||||
### Get Token and Info Task
|
||||
Connects to the YTDLP Thrift service for the URL pulled from XCom (`current_url`).
|
||||
Retrieves token, metadata, command, and potentially proxy. Saves `info.json`.
|
||||
Failure of this task triggers the `handle_failure` path.
|
||||
Success triggers the `handle_success` path.
|
||||
|
||||
**Pulls from XCom:**
|
||||
- `current_url` (from `pop_url_from_queue`) - *Used internally*
|
||||
|
||||
**Pushes to XCom:**
|
||||
- `info_json_path`
|
||||
- `socks_proxy`
|
||||
- `ytdlp_command`
|
||||
"""
|
||||
|
||||
# Task to perform the actual download using yt-dlp
|
||||
# Ensure info_json_path and socks_proxy are correctly quoted within the bash command
|
||||
# Use {% raw %} {% endraw %} around Jinja if needed, but direct templating should work here.
|
||||
# Added --no-simulate, --no-write-info-json, --ignore-errors, --no-progress
|
||||
download_video = BashOperator(
|
||||
task_id='download_video',
|
||||
bash_command="""
|
||||
INFO_JSON_PATH="{{ ti.xcom_pull(task_ids='get_token', key='info_json_path') }}"
|
||||
PROXY="{{ ti.xcom_pull(task_ids='get_token', key='socks_proxy') }}"
|
||||
FORMAT="{{ params.download_format }}"
|
||||
OUTPUT_TEMPLATE="{{ params.output_path_template }}"
|
||||
|
||||
echo "Starting download..."
|
||||
echo "Info JSON Path: $INFO_JSON_PATH"
|
||||
echo "Proxy: $PROXY"
|
||||
echo "Format: $FORMAT"
|
||||
echo "Output Template: $OUTPUT_TEMPLATE"
|
||||
|
||||
# Check if info.json path exists
|
||||
if [ -z "$INFO_JSON_PATH" ] || [ ! -f "$INFO_JSON_PATH" ]; then
|
||||
echo "Error: info.json path is missing or file does not exist ($INFO_JSON_PATH)."
|
||||
exit 1
|
||||
fi
|
||||
|
||||
# Build the command as a bash array for safe quoting of the path, proxy, format and output template
CMD_ARRAY=(yt-dlp --load-info-json "$INFO_JSON_PATH")
|
||||
|
||||
# Add proxy if it exists
|
||||
if [ -n "$PROXY" ]; then
|
||||
CMD_ARRAY+=(--proxy "$PROXY")
|
||||
fi
|
||||
|
||||
# Add format and output template
|
||||
CMD_ARRAY+=(-f "$FORMAT" -o "$OUTPUT_TEMPLATE")
|
||||
|
||||
# Add other useful flags
|
||||
CMD_ARRAY+=(--no-progress --no-simulate --no-write-info-json --ignore-errors --verbose)
|
||||
|
||||
echo "Executing command array:"
|
||||
# Use printf to safely quote and display the command array
|
||||
printf "%q " "${CMD_ARRAY[@]}"
|
||||
echo "" # Newline after command
|
||||
|
||||
# Execute the command directly using the array
|
||||
"${CMD_ARRAY[@]}"
|
||||
|
||||
# Check exit code
|
||||
EXIT_CODE=$?
|
||||
if [ $EXIT_CODE -ne 0 ]; then
|
||||
echo "Error: yt-dlp command failed with exit code $EXIT_CODE"
|
||||
exit $EXIT_CODE
|
||||
fi
|
||||
echo "Download command completed successfully."
|
||||
""",
|
||||
trigger_rule='all_success', # Run only if get_token succeeded
|
||||
)
|
||||
download_video.doc_md = """
|
||||
### Download Video/Audio Task
|
||||
Executes `yt-dlp` using the `info.json` and proxy obtained from the `get_token` task.
|
||||
Uses the `download_format` and `output_path_template` parameters from the DAG run configuration.
|
||||
Failure of this task triggers the `handle_failure` path.
|
||||
|
||||
**Pulls from XCom (task_id='get_token'):**
|
||||
- `info_json_path`
|
||||
- `socks_proxy`
|
||||
"""
|
||||
|
||||
|
||||
# Task to handle successful token retrieval AND download
|
||||
success_handler = PythonOperator(
|
||||
task_id='handle_success',
|
||||
python_callable=handle_success,
|
||||
trigger_rule='all_success', # Run only if download_video (and therefore get_token) succeeded
|
||||
)
|
||||
success_handler.doc_md = """
|
||||
### Handle Success Task
|
||||
Runs after `download_video` succeeds (which implies `get_token` also succeeded).
|
||||
Retrieves `current_url` and results from `get_token` via XCom.
|
||||
Removes the URL from the `{{ params.queue_name }}_progress` hash.
|
||||
Adds the URL and results to the `{{ params.queue_name }}_result` hash.
|
||||
"""
|
||||
|
||||
# Task to handle failed token retrieval or download
|
||||
failure_handler = PythonOperator(
|
||||
task_id='handle_failure',
|
||||
python_callable=handle_failure,
|
||||
trigger_rule='one_failed', # Run only if get_token or download_video fails
|
||||
)
|
||||
failure_handler.doc_md = """
|
||||
### Handle Failure Task
|
||||
Runs when `get_token` or `download_video` fails (trigger_rule `one_failed`).
Retrieves `current_url` from XCom.
Retrieves the error message and traceback from the context, when available.
Removes the URL from the `{{ params.queue_name }}_progress` hash.
Adds the URL and error details to the `{{ params.queue_name }}_fail` hash.

**Important:** This task succeeding only means the failure was *handled*; the DAG run itself will still be marked as failed because an upstream task failed.
"""
|
||||
|
||||
|
||||
# --- Task Dependencies ---
|
||||
# Core processing flow
|
||||
pop_url >> move_to_progress >> get_token >> download_video
|
||||
|
||||
# Handlers depend on the outcome of both token retrieval and download
|
||||
# Success handler runs only if download_video succeeds
|
||||
download_video >> success_handler # Default trigger_rule='all_success' is suitable
|
||||
|
||||
# Failure handler runs if either get_token or download_video fails
|
||||
[get_token, download_video] >> failure_handler # Uses trigger_rule='one_failed' defined in the task
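# Note: with trigger_rule='one_failed', failure_handler also fires when download_video fails
# after a successful get_token, so such URLs end up in the _fail hash rather than _result.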
|
||||
|
||||
# Removed Jinja filters as they are no longer needed for the simplified info_json_dir
|
||||
@ -1,15 +1,40 @@
|
||||
version: '3.8'
|
||||
|
||||
services:
|
||||
ytdlp-ops:
|
||||
image: pangramia/ytdlp-ops-server:latest
|
||||
camoufox:
|
||||
build:
|
||||
context: ./camoufox # Path relative to the docker-compose file
|
||||
dockerfile: Dockerfile
|
||||
ports:
|
||||
- "9090:9090"
|
||||
- "9091:9091"
|
||||
# Optionally expose the camoufox port to the host for debugging
|
||||
# - "12345:12345"
|
||||
- "12345" # Expose port within the docker network, pass in Dockerfile
|
||||
- "5900:5900" # Expose VNC port to the host
|
||||
networks:
|
||||
- airflow_prod_proxynet
|
||||
command: [
|
||||
"--ws-host", "0.0.0.0",
|
||||
"--port", "12345",
|
||||
"--ws-path", "mypath",
|
||||
"--proxy-url", "socks5://sslocal-rust-1082:1082",
|
||||
"--locale", "en-US",
|
||||
"--geoip",
|
||||
"--extensions", "/app/extensions/google_sign_in_popup_blocker-1.0.2.xpi,/app/extensions/spoof_timezone-0.3.4.xpi,/app/extensions/youtube_ad_auto_skipper-0.6.0.xpi"
|
||||
]
|
||||
restart: unless-stopped
|
||||
# Add healthcheck if desired
|
||||
|
||||
ytdlp-ops:
|
||||
image: pangramia/ytdlp-ops-server:latest # Don't comment
|
||||
depends_on:
|
||||
- camoufox # Ensure camoufox starts first
|
||||
ports:
|
||||
- "9090:9090" # Main RPC port
|
||||
- "9091:9091" # Health check port
|
||||
volumes:
|
||||
- context-data:/app/context-data
|
||||
networks:
|
||||
- airflow_workers_prod_proxynet
|
||||
- airflow_prod_proxynet
|
||||
command:
|
||||
- "--script-dir"
|
||||
- "/app/scripts"
|
||||
@ -18,10 +43,18 @@ services:
|
||||
- "--port"
|
||||
- "9090"
|
||||
- "--clients"
|
||||
- "ios,android,mweb"
|
||||
# Add 'web' client since we now have camoufox
|
||||
- "web,ios,android,mweb"
|
||||
- "--proxy"
|
||||
- "socks5://sslocal-rust-1084:1084"
|
||||
- "socks5://sslocal-rust-1082:1082"
|
||||
# Add the endpoint argument pointing to the camoufox service
|
||||
- "--endpoint"
|
||||
- "ws://camoufox:12345/mypath"
|
||||
- "--probe"
|
||||
# Add --camouflage-only if you don't want ytdlp-ops to manage the browser directly
|
||||
- "--camouflage-only"
|
||||
# Add flag to print full tokens in logs by default
|
||||
- "--print-tokens"
|
||||
restart: unless-stopped
|
||||
pull_policy: always
|
||||
|
||||
@ -30,5 +63,5 @@ volumes:
|
||||
name: context-data
|
||||
|
||||
networks:
|
||||
airflow_workers_prod_proxynet:
|
||||
airflow_prod_proxynet:
|
||||
external: true
|
||||
|
||||
@ -1 +0,0 @@
|
||||
info_json_vKTVLpmvznI_1743507631.json
|
||||
File diff suppressed because one or more lines are too long