WIP: Docs tidyups+add howto on logging and authentication

(Authentication is WIP)
pull/211/head
Patrick Robertson 2025-02-19 10:29:05 +00:00
parent 3c543a3a6a
commit 222a94563f
15 changed files with 173 additions and 82 deletions

View file

@ -6,7 +6,7 @@
1. Update the version number in [version.py](src/auto_archiver/version.py)
2. Go to github releases > new release > use `vx.y.z` for matching version notation
1. package is automatically updated in pypi
2. docker image is automatically pushed to dockerhup
2. docker image is automatically pushed to dockerhub

View file

@ -1,49 +1,6 @@
# How-To Guides
## How to use Google Sheets to load and store archive information
The `--gsheet_feeder.sheet` property is the name of the Google Sheet to check for URLs.
This sheet must have been shared with the Google Service account used by `gspread`.
This sheet must also have specific columns (case-insensitive) in the `header` - see the [Gsheet Feeder Docs](modules/autogen/feeder/gsheet_feeder.md) for more info. The default names of these columns and their purposes are:
Inputs:
* **Link** *(required)*: the URL of the post to archive
* **Destination folder**: custom folder for archived file (regardless of storage)
Outputs:
* **Archive status** *(required)*: Status of archive operation
* **Archive location**: URL of archived post
* **Archive date**: Date archived
* **Thumbnail**: Embeds a thumbnail for the post in the spreadsheet
* **Timestamp**: Timestamp of original post
* **Title**: Post title
* **Text**: Post text
* **Screenshot**: Link to screenshot of post
* **Hash**: Hash of archived HTML file (which contains hashes of post media) - for checksums/verification
* **Perceptual Hash**: Perceptual hashes of found images - these can be used for de-duplication of content
* **WACZ**: Link to a WACZ web archive of post
* **ReplayWebpage**: Link to a ReplayWebpage viewer of the WACZ archive
For example, this is a spreadsheet configured with all of the columns for the auto archiver and a few URLs to archive. (Note that the column names are not case sensitive.)
![A screenshot of a Google Spreadsheet with column headers defined as above, and several Youtube and Twitter URLs in the "Link" column](../demo-before.png)
Now the auto archiver can be invoked, with this command in this example: `docker run --rm -v $PWD/secrets:/app/secrets -v $PWD/local_archive:/app/local_archive bellingcat/auto-archiver:dockerize --config secrets/orchestration-global.yaml --gsheet_feeder.sheet "Auto archive test 2023-2"`. Note that the sheet name has been overridden/specified in the command line invocation.
When the auto archiver starts running, it updates the "Archive status" column.
![A screenshot of a Google Spreadsheet with column headers defined as above, and several Youtube and Twitter URLs in the "Link" column. The auto archiver has added "archive in progress" to one of the status columns.](../demo-progress.png)
The links are downloaded and archived, and the spreadsheet is updated to the following:
![A screenshot of a Google Spreadsheet with videos archived and metadata added per the description of the columns above.](../demo-after.png)
Note that the first row is skipped, as it is assumed to be a header row (`--gsheet_feeder.header=1` by default; change this if you have more header rows). Rows with an empty URL column, or a non-empty archive column, are also skipped. All sheets in the document will be checked.
The "archive location" link contains the path of the archived file, in local storage, S3, or in Google Drive.
![The archive result for a link in the demo sheet.](../demo-archive.png)
The following pages contain helpful how-to guides for common use cases of the Auto Archiver.
---
```{toctree}
@ -51,4 +8,5 @@ The "archive location" link contains the path of the archived file, in local sto
:glob:
how_to/*
```

View file

@ -0,0 +1,6 @@
# How to login (authenticate) to websites
This how-to guide shows you how to add authentication to Auto Archiver for a site you are trying to archive. In this example, we will authenticate on Twitter/X.com using cookies, and on XXXX using username/password.
```{note} This page is still under construction 🚧
```
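Until this page is complete, here is a minimal sketch of what a cookie-based login for Twitter/X.com could look like in your orchestration file. It follows the authentication config format documented in the installation section; the cookies file path is illustrative:

```{code} yaml
:caption: orchestration.yaml
authentication:
  # cookies exported from a logged-in browser session (cookies.txt format) - the path is an example
  twitter.com,x.com:
    cookies_file: secrets/twitter_cookies.txt
```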

View file

@ -0,0 +1,44 @@
# Using Google Sheets
The `--gsheet_feeder.sheet` property is the name of the Google Sheet to check for URLs.
This sheet must have been shared with the Google Service account used by `gspread`.
This sheet must also have specific columns (case-insensitive) in the `header` - see the [Gsheet Feeder Docs](modules/autogen/feeder/gsheet_feeder.md) for more info. The default names of these columns and their purposes are:
Inputs:
* **Link** *(required)*: the URL of the post to archive
* **Destination folder**: custom folder for archived file (regardless of storage)
Outputs:
* **Archive status** *(required)*: Status of archive operation
* **Archive location**: URL of archived post
* **Archive date**: Date archived
* **Thumbnail**: Embeds a thumbnail for the post in the spreadsheet
* **Timestamp**: Timestamp of original post
* **Title**: Post title
* **Text**: Post text
* **Screenshot**: Link to screenshot of post
* **Hash**: Hash of archived HTML file (which contains hashes of post media) - for checksums/verification
* **Perceptual Hash**: Perceptual hashes of found images - these can be used for de-duplication of content
* **WACZ**: Link to a WACZ web archive of post
* **ReplayWebpage**: Link to a ReplayWebpage viewer of the WACZ archive
For example, this is a spreadsheet configured with all of the columns for the auto archiver and a few URLs to archive. (Note that the column names are not case sensitive.)
![A screenshot of a Google Spreadsheet with column headers defined as above, and several Youtube and Twitter URLs in the "Link" column](../demo-before.png)
Now the auto archiver can be invoked, with this command in this example: `docker run --rm -v $PWD/secrets:/app/secrets -v $PWD/local_archive:/app/local_archive bellingcat/auto-archiver:dockerize --config secrets/orchestration-global.yaml --gsheet_feeder.sheet "Auto archive test 2023-2"`. Note that the sheet name has been overridden/specified in the command line invocation.
When the auto archiver starts running, it updates the "Archive status" column.
![A screenshot of a Google Spreadsheet with column headers defined as above, and several Youtube and Twitter URLs in the "Link" column. The auto archiver has added "archive in progress" to one of the status columns.](../demo-progress.png)
The links are downloaded and archived, and the spreadsheet is updated to the following:
![A screenshot of a Google Spreadsheet with videos archived and metadata added per the description of the columns above.](../demo-after.png)
Note that the first row is skipped, as it is assumed to be a header row (`--gsheet_feeder.header=1` by default; change this if you have more header rows). Rows with an empty URL column, or a non-empty archive column, are also skipped. All sheets in the document will be checked.
The "archive location" link contains the path of the archived file, in local storage, S3, or in Google Drive.
![The archive result for a link in the demo sheet.](../demo-archive.png)
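The same settings can be kept in your orchestration config instead of being passed on the command line. Below is a minimal sketch of the feeder's section: `sheet` and `header` correspond to the flags shown above, while the `service_account` option and its path are assumptions to be checked against your version:

```{code} yaml
:caption: orchestration.yaml (excerpt)
gsheet_feeder:
  sheet: "Auto archive test 2023-2"  # name of the Google Sheet shared with the service account
  header: 1                          # row number of the header row
  service_account: secrets/service_account.json  # assumed option: path to the gspread service account credentials
```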

View file

@ -0,0 +1,55 @@
# Logging
Auto Archiver's logs can be helpful for debugging problematic archiving processes. This guide shows you how to set the logging level and how to save logs to a file.
## Setting up logging
Logging settings can be set on the command line or using the orchestration config file ([learn more](../installation/configuration)). A special `logging` section defines the logging options.
#### Logging Level
There are seven logging levels in total, of which four are commonly used: `DEBUG`, `INFO`, `WARNING` and `ERROR`.
Change the logging level by setting the value in your orchestration config file:
```{code} yaml
:caption: orchestration.yaml
...
logging:
  level: DEBUG # or INFO / WARNING / ERROR
...
```
For normal usage, it is recommended to use the `INFO` level, or if you prefer quieter logs with less information, you can use the `WARNING` level. If you encounter issues with the archiving, then it's recommended to enable the `DEBUG` level.
```{note} To learn about all logging levels, see the [loguru documentation](https://loguru.readthedocs.io/en/stable/api/logger.html)
```
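For a one-off run, you can likely set the level directly on the command line using the same dotted `section.option` flag convention described in the configuration docs, for example `--logging.level=DEBUG` (this exact flag is an assumption; check `auto-archiver --help` for your version).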
### Logging to a file
By default, auto-archiver logs to the console. But if you wish to store your logs for future reference, or you are running auto-archiver from within another application, then you may wish to enable file logging. This can be done by setting the `file:` config value in the logging settings.
**Rotation:** For file logging, you can choose to 'rotate' your log files (creating new log files) so they do not get too large. Change this by setting the 'rotation' option in your logging settings. For a full list of rotation options, see the [loguru docs](https://loguru.readthedocs.io/en/stable/overview.html#easier-file-logging-with-rotation-retention-compression).
```{code} yaml
:caption: orchestration.yaml
logging:
  ...
  file: /my/log/file.log
  rotation: 1 day
```
### Full logging example
The example below logs only `WARNING` (and higher) messages to the console and to the file `/my/file.log`, rotating that file once per week:
```{code} yaml
:caption: orchestration.yaml
logging:
  level: WARNING
  file: /my/file.log
  rotation: 1 week
```

View file

@ -9,7 +9,7 @@
Overview <self>
contributing
installation/installation.rst
installation/installation
core_modules.md
how_to
development/developer_guidelines

View file

@ -4,22 +4,42 @@ The Authentication framework for auto-archiver allows you to add login details f
There are two main use cases for authentication:
* Some websites require some kind of authentication in order to view the content. Examples include Facebook, Telegram etc.
* Some websites use anti-bot systems to block bot-like tools from accessig the website. Adding real login information to auto-archiver can sometimes bypass this.
* Some websites use anti-bot systems to block bot-like tools from accessing the website. Adding real login information to auto-archiver can sometimes bypass this.
## The Authentication Config
You can save your authentication information directly inside your orchestration config file, or as a separate file (for security/multi-deploy purposes). Whether storing your settings inside the orchestration file, or as a separate file, the configuration format is the same.
You can save your authentication information directly inside your orchestration config file, or as a separate file (for security/multi-deploy purposes). Whether storing your settings inside the orchestration file, or as a separate file, the configuration format is the same. Currently, auto-archiver supports the following authentication types:
**Username & Password:**
- `username`: str - the username to use for login
- `password`: str - the password to use for login
**API**
- `api_key`: str - the API key to use for login
- `api_secret`: str - the API secret to use for login
**Cookies**
- `cookie`: str - a cookie string to use for login (specific to this site)
- `cookies_from_browser`: str - load cookies from this browser, for this site only.
- `cookies_file`: str - load cookies from this file, for this site only.
```{note}
The Username & Password, and API settings only work with the Generic Extractor. Other modules (like the screenshot enricher) can only use the `cookies` options. Furthermore, many sites can still detect bots and block username/password logins. Twitter/X and YouTube are two prominent ones that block username/password logging.
One of the 'Cookies' options is recommended for the most robust archiving.
```
```{code} yaml
authentication:
  # optional file to load authentication information from, for security or multi-system deploy purposes
  load_from_file: path/to/authentication/file.txt
  # optional setting to load cookies from the named browser on the system.
  # optional setting to load cookies from the named browser on the system, for **ALL** websites
  cookies_from_browser: firefox
  # optional setting to load cookies from a cookies.txt/cookies.jar file. See note below on extracting these
  # optional setting to load cookies from a cookies.txt/cookies.jar file, for **ALL** websites. See note below on extracting these
  cookies_file: path/to/cookies.jar
  twitter.com,x.com:
  mysite.com:
    username: myusername
    password: 123
@ -29,15 +49,10 @@ authentication:
  othersite.com:
    api_key: 123
    api_secret: 1234
  # All available options:
  # - username: str - the username to use for login
  # - password: str - the password to use for login
  # - api_key: str - the API key to use for login
  # - api_secret: str - the API secret to use for login
  # - cookie: str - a cookie string to use for login (specific to this site)
```
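If you use `load_from_file`, the standalone file is assumed to mirror the `authentication` section shown above (the docs note that the configuration format is the same); the site names and values below are illustrative:

```{code} yaml
:caption: path/to/authentication/file.txt
# same structure as the authentication section of the orchestration file (verify whether the
# top-level authentication key is required in your version)
authentication:
  mysite.com:
    username: myusername
    password: 123
  othersite.com:
    api_key: 123
    api_secret: 1234
```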
### Recommendations for authentication
1. **Store authentication information separately:**

View file

@ -23,7 +23,7 @@ A default `orchestration.yaml` will be created for you the first time you run au
## Configuring from the Command Line
You can run auto-archiver directy from the command line, without the need for a configuration file, command line arguments are parsed using the format `module_name.config_value`. For example, a config value of `api_key` in the `instagram_extractor` module would be passed on the command line with the flag `--instagram_extractor.api_key=API_KEY`.
You can run auto-archiver directly from the command line, without the need for a configuration file. Command line arguments are parsed using the format `module_name.config_value`. For example, a config value of `api_key` in the `instagram_extractor` module would be passed on the command line with the flag `--instagram_extractor.api_key=API_KEY`.
The command line arguments are useful for testing or editing config values and enabling/disabling modules on the fly. When you are happy with your settings, you can store them back in your configuration file by passing the `-s/--store` flag on the command line.
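For example, a test run that overrides the Google Sheet name and then saves that choice back to the config file might look like `auto-archiver --config secrets/orchestration.yaml --gsheet_feeder.sheet "My Test Sheet" -s` (the `auto-archiver` entry point and sheet name are illustrative; the flags follow the convention above).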

View file

@ -5,6 +5,7 @@
:hidden:
configurations.md
authentication.md
config_cheatsheet.md
```

View file

@ -4,7 +4,7 @@ Extractor modules are used to extract the content of a given URL. Typically, one
Extractors that are able to extract content from a wide range of websites include:
1. Generic Extractor: parses videos and images on sites using the powerful yt-dlp library.
2. Wayback Machine Extractor: sends pages to the Waygback machine for archiving, and stores the link.
2. Wayback Machine Extractor: sends pages to the Wayback machine for archiving, and stores the link.
3. WACZ Extractor: runs a web browser to 'browse' the URL and save a copy of the page in WACZ format.
```{include} autogen/extractor.md

View file

@ -63,12 +63,18 @@ class BaseModule(ABC):
    def config_setup(self, config: dict):
        authentication = config.get('authentication', {})
        # this is important. Each instance is given its own deepcopied config, so modules cannot
        # change values to affect other modules
        config = deepcopy(config)
        authentication = deepcopy(config.pop('authentication', {}))
        # extract out concatenated sites
        for key, val in copy(authentication).items():
            if "," in key:
                for site in key.split(","):
                    authentication[site] = val
                del authentication[key]
        self.authentication = authentication
        self.config = config
        for key, val in config.get(self.name, {}).items():
@ -102,7 +108,7 @@ class BaseModule(ABC):
        # TODO: think about if/how we can deal with sites that have multiple domains (main one is x.com/twitter.com)
        # for now the user must enter them both, like "x.com,twitter.com" in their config. Maybe we just hard-code?
        site = UrlUtil.domain_for_url(site)
        site = UrlUtil.domain_for_url(site).lstrip("www.")
        # add the 'www' version of the site to the list of sites to check
        authdict = {}
@ -128,17 +134,30 @@ class BaseModule(ABC):
            # collections.namedtuple('ParsedOptions', ('parser', 'options', 'urls', 'ydl_opts'))
            ytdlp_opts = getattr(parse_options(args), 'ydl_opts')
            return yt_dlp.YoutubeDL(ytdlp_opts).cookiejar
        get_cookiejar_options = None
        # get the cookies jar, prefer the browser cookies than the file
        if 'cookies_from_browser' in self.authentication:
        # order of priority:
        # 1. cookies_from_browser setting in site config
        # 2. cookies_file setting in site config
        # 3. cookies_from_browser setting in global config
        # 4. cookies_file setting in global config
        if 'cookies_from_browser' in authdict:
            get_cookiejar_options = ['--cookies-from-browser', authdict['cookies_from_browser']]
        elif 'cookies_file' in authdict:
            get_cookiejar_options = ['--cookies', authdict['cookies_file']]
        elif 'cookies_from_browser' in self.authentication:
            authdict['cookies_from_browser'] = self.authentication['cookies_from_browser']
            if extract_cookies:
                authdict['cookies_jar'] = get_ytdlp_cookiejar(['--cookies-from-browser', self.authentication['cookies_from_browser']])
                get_cookiejar_options = ['--cookies-from-browser', self.authentication['cookies_from_browser']]
        elif 'cookies_file' in self.authentication:
            authdict['cookies_file'] = self.authentication['cookies_file']
            if extract_cookies:
                authdict['cookies_jar'] = get_ytdlp_cookiejar(['--cookies', self.authentication['cookies_file']])
                get_cookiejar_options = ['--cookies', self.authentication['cookies_file']]
        if get_cookiejar_options:
            authdict['cookies_jar'] = get_ytdlp_cookiejar(get_cookiejar_options)
        return authdict
    def repr(self):

View file

@ -10,7 +10,7 @@ from ruamel.yaml import YAML, CommentedMap, add_representer
from loguru import logger
from copy import deepcopy
from copy import deepcopy, copy
from .module import BaseModule
from typing import Any, List, Type, Tuple
@ -154,7 +154,7 @@ def read_yaml(yaml_filename: str) -> CommentedMap:
    if not config:
        config = EMPTY_CONFIG
    return config
# TODO: make this tidier/find a way to notify of which keys should not be stored

View file

@ -8,7 +8,6 @@ from __future__ import annotations
from typing import Generator, Union, List, Type
from urllib.parse import urlparse
from ipaddress import ip_address
from copy import copy
import argparse
import os
import sys
@ -75,13 +74,6 @@ class AuthenticationJsonParseAction(JsonParseAction):
                continue
            if not isinstance(key, str) or not isinstance(auth, dict):
                raise argparse.ArgumentTypeError(f"Authentication must be a dictionary of site names and their authentication methods. Valid global configs are {global_options}")
        # extract out concatenated sites
        for key, val in copy(auth_dict).items():
            if "," in key:
                for site in key.split(","):
                    auth_dict[site] = val
                del auth_dict[key]
        setattr(namespace, self.dest, auth_dict)

View file

@ -280,6 +280,7 @@ class GenericExtractor(Extractor):
        # set up auth
        auth = self.auth_for_site(url, extract_cookies=False)
        # order of importance: username/password -> api_key -> cookie -> cookie_from_browser -> cookies_file
        if auth:
            if 'username' in auth and 'password' in auth:
@ -290,11 +291,11 @@ class GenericExtractor(Extractor):
                logger.debug(f'Using provided auth cookie for {url}')
                yt_dlp.utils.std_headers['cookie'] = auth['cookie']
            elif 'cookie_from_browser' in auth:
                logger.debug(f'Using extracted cookies from browser {self.cookies_from_browser} for {url}')
                logger.debug(f'Using extracted cookies from browser {auth["cookies_from_browser"]} for {url}')
                ydl_options['cookiesfrombrowser'] = auth['cookies_from_browser']
            elif 'cookies_file' in auth:
                logger.debug(f'Using cookies from file {self.cookie_file} for {url}')
                ydl_options['cookiesfile'] = auth['cookies_file']
                logger.debug(f'Using cookies from file {auth["cookies_file"]} for {url}')
                ydl_options['cookiefile'] = auth['cookies_file']
        ydl = yt_dlp.YoutubeDL(ydl_options) # allsubtitles and subtitleslangs not working as expected, so default lang is always "en"

View file

@ -68,7 +68,7 @@ class TestGenericExtractor(TestExtractorBase):
"twitter.com/bellingcat/status/123",
"https://www.youtube.com/watch?v=1"
])
def test_download_nonexistend_media(self, make_item, url):
def test_download_nonexistent_media(self, make_item, url):
"""
Test to make sure that the extractor doesn't break on non-existend posts/media