Merge pull request #222 from bellingcat/feat/yt-dlp-pots

yt-dlp proposed extractor_args and PO Token client.
pull/287/head
Erin Clark 2025-03-28 13:54:27 +00:00 committed by GitHub
commit 1d18399d70
No known key found for this signature in database
GPG key ID: B5690EEEBB952194
7 changed files with 304 additions and 92 deletions


@ -106,5 +106,117 @@ Finally,Some important things to remember:
## Authenticating on XXXX site with username/password
```{note} This section is still under construction 🚧
```{note}
This section is still under construction 🚧
```
# Proof of Origin Tokens
YouTube uses **Proof of Origin Tokens (POT)** as part of its bot detection system to verify that requests originate from valid clients. If a token is missing or invalid, some videos may return errors like "Sign in to confirm you're not a bot."
yt-dlp provides [a detailed guide to POTs](https://github.com/yt-dlp/yt-dlp/wiki/PO-Token-Guide).
### How Auto Archiver Uses POT
This feature is enabled for the Generic Archiver via two yt-dlp plugins:
- **Client-side plugin**: [yt-dlp-get-pot](https://github.com/coletdjnz/yt-dlp-get-pot)
Detects when a token is required and requests one from a provider.
- **Provider plugin**: [bgutil-ytdlp-pot-provider](https://github.com/Brainicism/bgutil-ytdlp-pot-provider)
Includes both a Python plugin and a **Node.js server or script** to generate the token.
These are installed in our Poetry environment.
### Integration Methods
**Docker (Recommended)**:
When running the Auto Archiver using the Docker image, we use the [Node.js token generation script](https://github.com/Brainicism/bgutil-ytdlp-pot-provider/tree/master/server).
This is to avoid managing a separate server process, and is handled automatically inside the Docker container when needed.
The script is already included in the Docker image. To disable it, set the config option `bguils_po_token_method` under the `generic_extractor` section of your `orchestration.yaml` config file to `"disabled"`:
```yaml
generic_extractor:
bguils_po_token_method: "disabled"
```
**PyPI / Local**:
When using the Auto Archiver PyPI package, or running locally, the token generation script needs additional system dependencies: either Docker, or Node.js and Yarn.
See the [bgutil-ytdlp-pot-provider](https://github.com/Brainicism/bgutil-ytdlp-pot-provider?tab=readme-ov-file#a-http-server-option) documentation for more details.
⚠️ WARNING: The script method downloads the server scripts into the home directory of the user running Auto Archiver (`~/bgutil-ytdlp-pot-provider`).
- You can set the config option `bguils_po_token_method` under the `generic_extractor` section of your `orchestration.yaml` config file to "script" to enable the token generation script process locally.
- Alternatively you can run the bgutil-ytdlp-pot-provider server separately using their Docker image or Node.js server.
### Notes
- The token generation script is only triggered when needed by yt-dlp, so it should have no effect unless YouTube requests a POT.
- If you're running the Auto Archiver in Docker, this is set up automatically.
- If you're running locally, you'll need to run the setup script manually or enable the feature in your config.
- You can set up both the server and the script; the plugin will fall back from one to the other if needed. This is recommended for robustness.
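
If you run the provider server separately, it can help to confirm it is reachable before archiving. The following is a minimal sketch, assuming the address `http://127.0.0.1:4416` and the `/ping` path that appear in the plugin's log output; `pot_server_reachable` is a hypothetical helper and not part of Auto Archiver or the plugins.

```python
import http.client
import urllib.error
import urllib.request


def pot_server_reachable(base_url: str = "http://127.0.0.1:4416", timeout: float = 2.0) -> bool:
    """Return True if GET <base_url>/ping answers with HTTP 200.

    Assumption: the provider server exposes /ping at this address, as
    suggested by the plugin's log messages. Adjust base_url if you changed it.
    """
    url = f"{base_url.rstrip('/')}/ping"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, http.client.HTTPException, OSError, ValueError):
        # Connection refused, timeout, non-2xx status, or malformed URL.
        return False


if __name__ == "__main__":
    print("PO Token server reachable:", pot_server_reachable())
```

A `False` result only means the HTTP server is not answering; the script method may still be available.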
### Configuration Summary
| Option | Behavior | Docker Default? |
|------------| ------------------------------------------------------------------------------------------------------------------------------------------ | --------------- |
| `auto` | Docker: Automatically downloads and uses the token generation script. Local: Does nothing; assumes a separate server is running externally. | ✅ Yes |
| `script` | Explicitly downloads and uses the token generation script, even locally. | ❌ No |
| `disabled` | Disables token generation completely. | ❌ No |
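
The table can be read as a small decision function. The sketch below mirrors the behavior described above; `should_run_script_setup` is a hypothetical name used only for illustration, and `in_docker` stands in for however the runtime detects that it is inside the Docker image.

```python
VALID_METHODS = ("auto", "script", "disabled")


def should_run_script_setup(method: str, in_docker: bool) -> bool:
    """Return True when the token generation script should be set up."""
    if method not in VALID_METHODS:
        raise ValueError(f"unknown bguils_po_token_method: {method!r}")
    if method == "disabled":
        # Token generation is switched off everywhere, including Docker.
        return False
    if method == "auto":
        # Docker sets the script up; locally an external server is assumed.
        return in_docker
    # "script": always set the script up, Docker or not.
    return True


if __name__ == "__main__":
    for method in VALID_METHODS:
        for in_docker in (True, False):
            print(f"{method:9} docker={in_docker!s:5} -> {should_run_script_setup(method, in_docker)}")
```
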
Example configuration:
```yaml
generic_extractor:
# ...
bguils_po_token_method: "script"
# For debugging add the verbose flag here:
ytdlp_args: "--no-abort-on-error --abort-on-error --verbose"
```
**Advanced Configuration:**
If you change the default port of the bgutil-ytdlp-pot-provider server, you can pass the updated values using the `extractor_args` option of the generic extractor.
```yaml
generic_extractor:
ytdlp_args: "--no-abort-on-error --abort-on-error --verbose"
ytdlp_update_interval: 5
bguils_po_token_method: "script"
extractor_args:
youtube:
getpot_bgutil_baseurl: "http://127.0.0.1:8080"
player_client: web,tv
```
For more details, see the [bgutil-ytdlp-pot-provider usage documentation](https://github.com/Brainicism/bgutil-ytdlp-pot-provider?tab=readme-ov-file#usage).
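
For reference, the nested `extractor_args` mapping corresponds to yt-dlp's `--extractor-args` command-line syntax (`KEY:name=value;name=v1,v2`). The following is a rough sketch of that correspondence only; `to_extractor_args_flags` is a hypothetical helper, and this section does not show how Auto Archiver itself hands the values to yt-dlp.

```python
def to_extractor_args_flags(extractor_args: dict) -> list[str]:
    """Render a nested mapping as yt-dlp --extractor-args values.

    One string is produced per extractor key. List values are joined
    with commas; plain strings such as "web,tv" are passed through.
    """
    flags = []
    for ie_key, args in extractor_args.items():
        parts = []
        for name, value in args.items():
            if isinstance(value, (list, tuple)):
                value = ",".join(str(v) for v in value)
            parts.append(f"{name}={value}")
        flags.append(f"{ie_key}:{';'.join(parts)}")
    return flags


if __name__ == "__main__":
    config = {
        "youtube": {
            "getpot_bgutil_baseurl": "http://127.0.0.1:8080",
            "player_client": "web,tv",
        }
    }
    for flag in to_extractor_args_flags(config):
        # Equivalent CLI form: yt-dlp --extractor-args "<flag>" <url>
        print(flag)
```

This can be handy when reproducing an Auto Archiver setup with the plain `yt-dlp` command line while debugging.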
### Checking the logs
To verify that the POT process is working, look for the following lines in your log after adding the config option:
```shell
[GetPOT] BgUtilScript: Generating POT via script: /Users/you/bgutil-ytdlp-pot-provider/server/build/generate_once.js
[debug] [GetPOT] BgUtilScript: Executing command to get POT via script: /Users/you/.nvm/versions/node/v20.18.0/bin/node /Users/you/bgutil-ytdlp-pot-provider/server/build/generate_once.js -v ymCMy8OflKM
[debug] [GetPOT] BgUtilScript: stdout:
{"poToken":"MlMxojNFhEJvUzGeHEkVRSK_luXtwcDnwSNIOgaUutqB7t99nmlNvtWgYayboopG6ZopZgmQ-6PJCWEMHv89MIiFGGlJRY25Fkwzxmia_8uYgf5AWf==","generatedAt":"2025-03-26T10:45:26.156Z","visitIdentifier":"ymCMy8OflKM"}
[debug] [GetPOT] Fetching gvs PO Token for tv client
```
If the script cannot be found or the server cannot be reached, you'll see something like this:
```shell
[debug] [GetPOT] Fetching player PO Token for tv client
WARNING: [GetPOT] BgUtilScript: Script path doesn't exist: /Users/you/bgutil-ytdlp-pot-provider/server/build/generate_once.js. Please make sure the script has been transpiled correctly.
WARNING: [GetPOT] BgUtilHTTP: Error reaching GET http://127.0.0.1:4416/ping (caused by TransportError). Please make sure that the server is reachable at http://127.0.0.1:4416.
[debug] [GetPOT] No player PO Token provider available for tv client
```
In this case check that the script has been transpiled correctly and is available at the path specified in the log,
or that the server is running and reachable.
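
When logs are long, a short script can pick out the relevant `[GetPOT]` lines. This is a minimal sketch that matches only the message shapes shown in the two log excerpts above; `summarize_pot_log` is a hypothetical helper, and other yt-dlp or plugin versions may word these messages differently.

```python
import re

# Matches e.g. "No player PO Token provider available for tv client".
NO_PROVIDER = re.compile(r"No \w+ PO Token provider available")


def summarize_pot_log(log_text: str) -> dict:
    """Count POT generations and collect GetPOT warnings from a log."""
    summary = {"generated": 0, "warnings": [], "no_provider": False}
    for line in log_text.splitlines():
        if "[GetPOT]" not in line:
            continue  # unrelated log line
        if "Generating POT" in line:
            summary["generated"] += 1
        elif line.startswith("WARNING:"):
            summary["warnings"].append(line)
        elif NO_PROVIDER.search(line):
            summary["no_provider"] = True
    return summary


if __name__ == "__main__":
    import sys

    # Usage: python summarize_pot_log.py < auto-archiver.log
    print(summarize_pot_log(sys.stdin.read()))
```

A non-zero `generated` count with no warnings suggests the provider is working; `no_provider` set to `True` means neither the script nor the server could supply a token.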

poetry.lock (generated)

@ -158,6 +158,21 @@ charset-normalizer = ["charset-normalizer"]
html5lib = ["html5lib"]
lxml = ["lxml"]
[[package]]
name = "bgutil-ytdlp-pot-provider"
version = "0.7.4"
description = ""
optional = false
python-versions = ">=3.8"
groups = ["main"]
files = [
{file = "bgutil_ytdlp_pot_provider-0.7.4-py3-none-any.whl", hash = "sha256:5f0b1d884fec66dff703c421ea06f5fc9b11022d9c0babdaa0cab13ed99b9d77"},
{file = "bgutil_ytdlp_pot_provider-0.7.4.tar.gz", hash = "sha256:b6c1462b8f979540078085cd82462ef967b8b70cd0810d469243a31f5081e5c6"},
]
[package.dependencies]
yt-dlp-get-pot = ">=0.1.1"
[[package]]
name = "boto3"
version = "1.37.18"
@ -2265,23 +2280,6 @@ files = [
[package.dependencies]
rich = ">=11.0.0"
[[package]]
name = "roman-numerals-py"
version = "3.1.0"
description = "Manipulate well-formed Roman numerals"
optional = false
python-versions = ">=3.9"
groups = ["docs"]
markers = "python_version >= \"3.12\""
files = [
{file = "roman_numerals_py-3.1.0-py3-none-any.whl", hash = "sha256:9da2ad2fb670bcf24e81070ceb3be72f6c11c440d73bd579fbeca1e9f330954c"},
{file = "roman_numerals_py-3.1.0.tar.gz", hash = "sha256:be4bf804f083a4ce001b5eb7e3c0862479d10f94c936f6c4e5f250aa5ff5bd2d"},
]
[package.extras]
lint = ["mypy (==1.15.0)", "pyright (==1.1.394)", "ruff (==0.9.7)"]
test = ["pytest (>=8)"]
[[package]]
name = "rsa"
version = "4.9"
@ -2506,7 +2504,6 @@ description = "Python documentation generator"
optional = false
python-versions = ">=3.10"
groups = ["docs"]
markers = "python_version < \"3.12\""
files = [
{file = "sphinx-8.1.3-py3-none-any.whl", hash = "sha256:09719015511837b76bf6e03e42eb7595ac8c2e41eeb9c29c5b755c6b677992a2"},
{file = "sphinx-8.1.3.tar.gz", hash = "sha256:43c1911eecb0d3e161ad78611bc905d1ad0e523e4ddc202a58a821773dc4c927"},
@ -2536,43 +2533,6 @@ docs = ["sphinxcontrib-websupport"]
lint = ["flake8 (>=6.0)", "mypy (==1.11.1)", "pyright (==1.1.384)", "pytest (>=6.0)", "ruff (==0.6.9)", "sphinx-lint (>=0.9)", "tomli (>=2)", "types-Pillow (==10.2.0.20240822)", "types-Pygments (==2.18.0.20240506)", "types-colorama (==0.4.15.20240311)", "types-defusedxml (==0.7.0.20240218)", "types-docutils (==0.21.0.20241005)", "types-requests (==2.32.0.20240914)", "types-urllib3 (==1.26.25.14)"]
test = ["cython (>=3.0)", "defusedxml (>=0.7.1)", "pytest (>=8.0)", "setuptools (>=70.0)", "typing_extensions (>=4.9)"]
[[package]]
name = "sphinx"
version = "8.2.3"
description = "Python documentation generator"
optional = false
python-versions = ">=3.11"
groups = ["docs"]
markers = "python_version >= \"3.12\""
files = [
{file = "sphinx-8.2.3-py3-none-any.whl", hash = "sha256:4405915165f13521d875a8c29c8970800a0141c14cc5416a38feca4ea5d9b9c3"},
{file = "sphinx-8.2.3.tar.gz", hash = "sha256:398ad29dee7f63a75888314e9424d40f52ce5a6a87ae88e7071e80af296ec348"},
]
[package.dependencies]
alabaster = ">=0.7.14"
babel = ">=2.13"
colorama = {version = ">=0.4.6", markers = "sys_platform == \"win32\""}
docutils = ">=0.20,<0.22"
imagesize = ">=1.3"
Jinja2 = ">=3.1"
packaging = ">=23.0"
Pygments = ">=2.17"
requests = ">=2.30.0"
roman-numerals-py = ">=1.0.0"
snowballstemmer = ">=2.2"
sphinxcontrib-applehelp = ">=1.0.7"
sphinxcontrib-devhelp = ">=1.0.6"
sphinxcontrib-htmlhelp = ">=2.0.6"
sphinxcontrib-jsmath = ">=1.0.1"
sphinxcontrib-qthelp = ">=1.0.6"
sphinxcontrib-serializinghtml = ">=1.1.9"
[package.extras]
docs = ["sphinxcontrib-websupport"]
lint = ["betterproto (==2.0.0b6)", "mypy (==1.15.0)", "pypi-attestations (==0.0.21)", "pyright (==1.1.395)", "pytest (>=8.0)", "ruff (==0.9.9)", "sphinx-lint (>=0.9)", "types-Pillow (==10.2.0.20240822)", "types-Pygments (==2.19.0.20250219)", "types-colorama (==0.4.15.20240311)", "types-defusedxml (==0.7.0.20240218)", "types-docutils (==0.21.0.20241128)", "types-requests (==2.32.0.20241016)", "types-urllib3 (==1.26.25.14)"]
test = ["cython (>=3.0)", "defusedxml (>=0.7.1)", "pytest (>=8.0)", "pytest-xdist[psutil] (>=3.4)", "setuptools (>=70.0)", "typing_extensions (>=4.9)"]
[[package]]
name = "sphinx-autoapi"
version = "3.6.0"
@ -3376,7 +3336,19 @@ secretstorage = ["cffi", "secretstorage"]
static-analysis = ["autopep8 (>=2.0,<3.0)", "ruff (>=0.11.0,<0.12.0)"]
test = ["pytest (>=8.1,<9.0)", "pytest-rerunfailures (>=14.0,<15.0)"]
[[package]]
name = "yt-dlp-get-pot"
version = "0.3.0"
description = ""
optional = false
python-versions = ">=3.9"
groups = ["main"]
files = [
{file = "yt_dlp_get_pot-0.3.0-py3-none-any.whl", hash = "sha256:a49a596a3e3c02cd9ce051192ea3fe8168cf24ece8954bed6aa331a87d86954f"},
{file = "yt_dlp_get_pot-0.3.0.tar.gz", hash = "sha256:ac9530b9e7b3d667235b9119da475f595d2dc7e6f6bbf98b965011be454e8833"},
]
[metadata]
lock-version = "2.1"
python-versions = ">=3.10,<3.13"
content-hash = "ac5d473189adbadb3ee5d8a36e1898a39725755704e0677768303ae46bc246c8"
content-hash = "c612e9f98ca5199092141bb04a0de4cd5314a8fdc8cb12c1d63eafe26bbf16aa"


@ -56,6 +56,7 @@ dependencies = [
"rfc3161-client (>=1.0.1,<2.0.0)",
"cryptography (>44.0.1,<45.0.0)",
"opentimestamps (>=0.4.5,<0.5.0)",
"bgutil-ytdlp-pot-provider (>=0.7.3,<0.8.0)",
]
[tool.poetry.group.dev.dependencies]


@ -74,6 +74,11 @@ If you are having issues with the extractor, you can review the version of `yt-d
"default": "inf",
"help": "Use to limit the number of videos to download when a channel or long page is being extracted. 'inf' means no limit.",
},
"bguils_po_token_method": {
"default": "auto",
"help": "Set up a Proof of origin token provider. This process has additional requirements. See [authentication](https://auto-archiver.readthedocs.io/en/latest/how_to/authentication_how_to.html) for more information.",
"choices": ["auto", "script", "disabled"],
},
"extractor_args": {
"default": {},
"help": "Additional arguments to pass to the yt-dlp extractor. See https://github.com/yt-dlp/yt-dlp/blob/master/README.md#extractor-arguments.",


@ -1,10 +1,13 @@
import shutil
import sys
import datetime
import os
import importlib
import subprocess
import zipfile
from typing import Generator, Type
from urllib.request import urlretrieve
import yt_dlp
from yt_dlp.extractor.common import InfoExtractor
@ -26,59 +29,138 @@ class GenericExtractor(Extractor):
_dropins = {}
def setup(self):
# check for file .ytdlp-update in the secrets folder
self.check_for_extractor_updates()
self.setup_po_tokens()
def check_for_extractor_updates(self):
"""Checks whether yt-dlp or its plugins need updating and triggers a restart if so."""
if self.ytdlp_update_interval < 0:
return
use_secrets = os.path.exists("secrets")
path = os.path.join("secrets" if use_secrets else "", ".ytdlp-update")
next_update_check = None
if os.path.exists(path):
with open(path, "r") as f:
next_update_check = datetime.datetime.fromisoformat(f.read())
update_file = os.path.join("secrets" if os.path.exists("secrets") else "", ".ytdlp-update")
next_check = None
if os.path.exists(update_file):
with open(update_file, "r") as f:
next_check = datetime.datetime.fromisoformat(f.read())
if not next_update_check or next_update_check < datetime.datetime.now():
updated = self.update_ytdlp()
if next_check and next_check > datetime.datetime.now():
return
next_update_check = datetime.datetime.now() + datetime.timedelta(days=self.ytdlp_update_interval)
with open(path, "w") as f:
f.write(next_update_check.isoformat())
yt_dlp_updated = self.update_package("yt-dlp")
bgutil_updated = self.update_package("bgutil-ytdlp-pot-provider")
if not updated:
return
# Write the new timestamp
with open(update_file, "w") as f:
next_check = datetime.datetime.now() + datetime.timedelta(days=self.ytdlp_update_interval)
f.write(next_check.isoformat())
if yt_dlp_updated or bgutil_updated:
if os.environ.get("AUTO_ARCHIVER_ALLOW_RESTART", "1") != "1":
logger.warning(
"yt-dlp has been updated. Auto archiver should be restarted for these changes to take effect"
)
logger.warning("yt-dlp or plugin was updated — please restart auto-archiver manually")
else:
logger.warning("Restarting auto-archiver to apply yt-dlp update")
logger.warning("yt-dlp or plugin was updated — restarting auto-archiver")
logger.warning(" ======= RESTARTING ======= ")
os.execv(sys.executable, [sys.executable] + sys.argv)
def update_ytdlp(self):
logger.info("Checking and updating yt-dlp...")
logger.info(
f"Tip: change the 'ytdlp_update_interval' setting to control how often yt-dlp is updated. Set to -1 to disable or 0 to enable on every run. Current setting: {self.ytdlp_update_interval}"
)
def update_package(self, package_name: str) -> bool:
logger.info(f"Checking and updating {package_name}...")
from importlib.metadata import version as get_version
old_version = get_version("yt-dlp")
old_version = get_version(package_name)
try:
# try and update with pip (this works inside poetry environment and in a normal virtualenv)
result = subprocess.run(["pip", "install", "--upgrade", "yt-dlp"], check=True, capture_output=True)
if "Successfully installed yt-dlp" in result.stdout.decode():
new_version = importlib.metadata.version("yt-dlp")
logger.info(f"yt-dlp successfully (from {old_version} to {new_version})")
result = subprocess.run(["pip", "install", "--upgrade", package_name], check=True, capture_output=True)
if f"Successfully installed {package_name}" in result.stdout.decode():
new_version = importlib.metadata.version(package_name)
logger.info(f"{package_name} updated from {old_version} to {new_version}")
return True
logger.info(f"{package_name} already up to date")
except Exception as e:
logger.error(f"Error updating {package_name}: {e}")
return False
def setup_po_tokens(self) -> None:
"""Setup Proof of Origin Token method conditionally.
Uses provider: https://github.com/Brainicism/bgutil-ytdlp-pot-provider.
"""
in_docker = os.environ.get("RUNNING_IN_DOCKER")
if self.bguils_po_token_method == "disabled":
# This allows disabling of the PO Token generation script in the Docker implementation.
logger.warning("Proof of Origin Token generation is disabled.")
return
if self.bguils_po_token_method == "auto" and not in_docker:
logger.info(
"Proof of Origin Token method not explicitly set. "
"If you're running an external HTTP server separately, you can safely ignore this message. "
"To reduce the likelihood of bot detection, enable one of the methods described in the documentation: "
"https://auto-archiver.readthedocs.io/en/settings_page/installation/authentication.html#proof-of-origin-tokens"
)
return
# Either running in Docker, or "script" method is set beyond this point
self.setup_token_generation_script()
def setup_token_generation_script(self) -> None:
"""This function sets up the Proof of Origin Token generation script method for
bgutil-ytdlp-pot-provider if enabled or in Docker."""
missing_tools = [tool for tool in ("node", "yarn", "npx") if shutil.which(tool) is None]
if missing_tools:
logger.error(
f"Cannot set up PO Token script; missing required tools: {', '.join(missing_tools)}. "
"Install these tools or run bgutils via Docker. "
"See: https://github.com/Brainicism/bgutil-ytdlp-pot-provider"
)
return
try:
from importlib.metadata import version as get_version
plugin_version = get_version("bgutil-ytdlp-pot-provider")
base_dir = os.path.expanduser("~/bgutil-ytdlp-pot-provider")
server_dir = os.path.join(base_dir, "server")
version_file = os.path.join(server_dir, ".VERSION")
transpiled_script = os.path.join(server_dir, "build", "generate_once.js")
# Skip setup if version is correct and transpiled script exists
if os.path.isfile(transpiled_script) and os.path.isfile(version_file):
with open(version_file) as vf:
if vf.read().strip() == plugin_version:
logger.info("PO Token script already set up and up to date.")
else:
logger.info("yt-dlp already up to date")
return False
# Remove an outdated directory and pull a new version
if os.path.exists(base_dir):
shutil.rmtree(base_dir)
os.makedirs(base_dir, exist_ok=True)
zip_url = (
f"https://github.com/Brainicism/bgutil-ytdlp-pot-provider/archive/refs/tags/{plugin_version}.zip"
)
zip_path = os.path.join(base_dir, f"{plugin_version}.zip")
logger.info(f"Downloading bgutils release zip for version {plugin_version}...")
urlretrieve(zip_url, zip_path)
with zipfile.ZipFile(zip_path, "r") as z:
z.extractall(base_dir)
os.remove(zip_path)
extracted_root = os.path.join(base_dir, f"bgutil-ytdlp-pot-provider-{plugin_version}")
shutil.move(os.path.join(extracted_root, "server"), server_dir)
shutil.rmtree(extracted_root)
logger.info("Installing dependencies and transpiling PoT Generator script...")
subprocess.run(["yarn", "install", "--frozen-lockfile"], cwd=server_dir, check=True)
subprocess.run(["npx", "tsc"], cwd=server_dir, check=True)
with open(version_file, "w") as vf:
vf.write(plugin_version)
script_path = os.path.join(server_dir, "build", "generate_once.js")
if not os.path.exists(script_path):
logger.error("generate_once.js not found after transpilation.")
return
self.extractor_args.setdefault("youtube", {})["getpot_bgutil_script"] = script_path
logger.info(f"PO Token script configured at: {script_path}")
except Exception as e:
logger.error(f"Error updating yt-dlp: {e}")
return False
logger.error(f"Failed to set up PO Token script: {e}")
def suitable_extractors(self, url: str) -> Generator[str, None, None]:
"""
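
The skip condition in `setup_token_generation_script` can be isolated for testing. A minimal sketch, assuming the same layout as the code above (a `.VERSION` file next to `build/generate_once.js`); `script_is_current` is a hypothetical helper that is not part of this change.

```python
import os


def script_is_current(server_dir: str, plugin_version: str) -> bool:
    """Return True if the transpiled script exists and matches plugin_version."""
    version_file = os.path.join(server_dir, ".VERSION")
    transpiled_script = os.path.join(server_dir, "build", "generate_once.js")
    if not (os.path.isfile(transpiled_script) and os.path.isfile(version_file)):
        return False
    with open(version_file) as vf:
        # The stamp is compared after stripping surrounding whitespace.
        return vf.read().strip() == plugin_version


if __name__ == "__main__":
    server_dir = os.path.join(os.path.expanduser("~/bgutil-ytdlp-pot-provider"), "server")
    # "0.7.4" is the version pinned in the lockfile of this change.
    print(script_is_current(server_dir, "0.7.4"))
```
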


@ -24,7 +24,7 @@ TESTS_TO_RUN_LAST = ["test_generic_archiver", "test_twitter_api_archiver"]
@pytest.fixture(autouse=True)
def skip_check_for_update(mocker):
update_ytdlp = mocker.patch(
"auto_archiver.modules.generic_extractor.generic_extractor.GenericExtractor.update_ytdlp"
"auto_archiver.modules.generic_extractor.generic_extractor.GenericExtractor.update_package"
)
update_ytdlp.return_value = False


@ -29,6 +29,7 @@ class TestGenericExtractor(TestExtractorBase):
"proxy": None,
"cookies_from_browser": False,
"cookie_file": None,
"pot_provider": False,
}
def test_load_dropin(self):
@ -291,3 +292,42 @@ class TestGenericExtractor(TestExtractorBase):
post = self.extractor.download(make_item(url))
assert "Bellingcat researcher Kolina Koltai delves deeper into Clothoff" in post.get("content")
assert post.get_title() == "Bellingcat"
class TestGenericExtractorPoToken:
@pytest.fixture
def extractor(self, mocker):
extractor = GenericExtractor()
extractor.extractor_args = {}
extractor.setup_token_generation_script = mocker.Mock()
return extractor
def test_po_token_disabled_does_not_call_setup(self, extractor):
extractor.bguils_po_token_method = "disabled"
extractor.in_docker = True
extractor.setup_po_tokens()
extractor.setup_token_generation_script.assert_not_called()
def test_po_token_default_in_docker_calls_setup(self, extractor, mocker):
extractor.bguils_po_token_method = "auto"
mocker.patch.dict(os.environ, {"RUNNING_IN_DOCKER": "1"})
extractor.setup_po_tokens()
extractor.setup_token_generation_script.assert_called_once()
def test_po_token_default_local_does_not_call_setup(self, extractor, caplog, mocker):
extractor.bguils_po_token_method = "auto"
# clears env vars for this test
mocker.patch.dict(os.environ, {}, clear=True)
extractor.setup_po_tokens()
extractor.setup_token_generation_script.assert_not_called()
assert "Proof of Origin Token method not explicitly set" in caplog.text
def test_po_token_script_always_calls_setup(self, extractor):
extractor.bguils_po_token_method = "script"
extractor.in_docker = False
extractor.setup_po_tokens()
extractor.setup_token_generation_script.assert_called_once()
extractor.setup_token_generation_script.reset_mock()
extractor.in_docker = True
extractor.setup_po_tokens()
extractor.setup_token_generation_script.assert_called_once()