Further tweaks and fixes

pull/190/head
Patrick Robertson 2025-02-11 14:37:29 +00:00
parent 29901da601
commit 62154ddfef
10 changed files with 30 additions and 747 deletions

@@ -32,3 +32,15 @@ Util Functions
{% endfor %}
Core Modules
------------
.. toctree::
:titlesonly:
{% for page in pages|selectattr("is_top_level_object") %}
{% if page.name != 'core' and page.name != 'utils' %}
{{ page.include_path }}
{% endif %}
{% endfor %}

@@ -58,8 +58,8 @@ def generate_module_docs():
configs_cheatsheet += f"| `{module.name}.{key}` | {help} | {value.get('default', '')} | {type} |\n"
# make type folder if it doesn't exist
# add a link to the autodoc refs
readme_str += f"\n[API Reference](../../../autoapi/{module.name}/index)\n"
# create the module.type folder, use the first type just for where to store the file
type_folder = SAVE_FOLDER / module.type[0]
type_folder.mkdir(exist_ok=True)
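
The folder-creation step in this hunk can be sketched in isolation as follows. This is a minimal illustration, not the project's script: `SAVE_FOLDER` points at a temporary directory here, and `FakeModule` and `module_doc_paths` are hypothetical names standing in for the module objects and loop body of `generate_module_docs()`.

```python
import tempfile
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class FakeModule:
    # Hypothetical stand-in for a loaded module; only the attributes used below.
    name: str
    type: list = field(default_factory=list)


def module_doc_paths(save_folder: Path, module: FakeModule):
    """Return (type_folder, api_link) for one module, creating the folder if needed."""
    # Link to the autodoc reference page, using the same relative path as the hunk.
    api_link = f"\n[API Reference](../../../autoapi/{module.name}/index)\n"
    # The first declared type decides where the generated file is stored.
    type_folder = save_folder / module.type[0]
    # exist_ok=True keeps repeated calls safe when several modules share a type.
    type_folder.mkdir(exist_ok=True)
    return type_folder, api_link


SAVE_FOLDER = Path(tempfile.mkdtemp())  # assumption: any existing directory works
folder, link = module_doc_paths(SAVE_FOLDER, FakeModule("hash_enricher", ["enricher"]))
```

Because `mkdir` is called without `parents=True`, the sketch assumes `SAVE_FOLDER` itself already exists.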

@@ -1,742 +0,0 @@
Configs
-------
This section documents all configuration options available for various components.
InstagramAPIArchiver
--------------------
The following table lists all configuration options for this component:
.. list-table:: Configuration Options
:header-rows: 1
:widths: 25 20 55
* - **Key**
- **Default**
- **Description**
* - access_token
- None
- a valid instagrapi-api token
* - api_endpoint
- None
- API endpoint to use
* - full_profile
- False
- if true, will download all posts, tagged posts, stories, and highlights for a profile, if false, will only download the profile pic and information.
* - full_profile_max_posts
- 0
- Use to limit the number of posts to download when full_profile is true. 0 means no limit. limit is applied softly since posts are fetched in batch, once to: posts, tagged posts, and highlights
* - minimize_json_output
- True
- if true, will remove empty values from the json output
InstagramArchiver
-----------------
The following table lists all configuration options for this component:
.. list-table:: Configuration Options
:header-rows: 1
:widths: 25 20 55
* - **Key**
- **Default**
- **Description**
* - username
- None
- a valid Instagram username
* - password
- None
- the corresponding Instagram account password
* - download_folder
- instaloader
- name of a folder to temporarily download content to
* - session_file
- secrets/instaloader.session
- path to the instagram session which saves session credentials
InstagramTbotArchiver
---------------------
The following table lists all configuration options for this component:
.. list-table:: Configuration Options
:header-rows: 1
:widths: 25 20 55
* - **Key**
- **Default**
- **Description**
* - api_id
- None
- telegram API_ID value, go to https://my.telegram.org/apps
* - api_hash
- None
- telegram API_HASH value, go to https://my.telegram.org/apps
* - session_file
- secrets/anon-insta
- optional, records the telegram login session for future usage, '.session' will be appended to the provided value.
* - timeout
- 45
- timeout to fetch the instagram content in seconds.
TelethonArchiver
----------------
The following table lists all configuration options for this component:
.. list-table:: Configuration Options
:header-rows: 1
:widths: 25 20 55
* - **Key**
- **Default**
- **Description**
* - api_id
- None
- telegram API_ID value, go to https://my.telegram.org/apps
* - api_hash
- None
- telegram API_HASH value, go to https://my.telegram.org/apps
* - bot_token
- None
- optional, but allows access to more content such as large videos, talk to @botfather
* - session_file
- secrets/anon
- optional, records the telegram login session for future usage, '.session' will be appended to the provided value.
* - join_channels
- True
- disables the initial setup with channel_invites config, useful if you have a lot and get stuck
* - channel_invites
- {}
- (JSON string) private channel invite links (format: t.me/joinchat/HASH OR t.me/+HASH) and (optional but important to avoid hanging for minutes on startup) channel id (format: CHANNEL_ID taken from a post url like https://t.me/c/CHANNEL_ID/1), the telegram account will join any new channels on setup
TwitterApiArchiver
------------------
The following table lists all configuration options for this component:
.. list-table:: Configuration Options
:header-rows: 1
:widths: 25 20 55
* - **Key**
- **Default**
- **Description**
* - bearer_token
- None
- [deprecated: see bearer_tokens] twitter API bearer_token which is enough for archiving, if not provided you will need consumer_key, consumer_secret, access_token, access_secret
* - bearer_tokens
- []
- a list of twitter API bearer_token which is enough for archiving, if not provided you will need consumer_key, consumer_secret, access_token, access_secret, if provided you can still add those for better rate limits. CSV of bearer tokens if provided via the command line
* - consumer_key
- None
- twitter API consumer_key
* - consumer_secret
- None
- twitter API consumer_secret
* - access_token
- None
- twitter API access_token
* - access_secret
- None
- twitter API access_secret
VkArchiver
----------
The following table lists all configuration options for this component:
.. list-table:: Configuration Options
:header-rows: 1
:widths: 25 20 55
* - **Key**
- **Default**
- **Description**
* - username
- None
- valid VKontakte username
* - password
- None
- valid VKontakte password
* - session_file
- secrets/vk_config.v2.json
- valid VKontakte password
YoutubeDLArchiver
-----------------
The following table lists all configuration options for this component:
.. list-table:: Configuration Options
:header-rows: 1
:widths: 25 20 55
* - **Key**
- **Default**
- **Description**
* - facebook_cookie
- None
- optional facebook cookie to have more access to content, from browser, looks like 'cookie: datr= xxxx'
* - subtitles
- True
- download subtitles if available
* - comments
- False
- download all comments if available, may lead to large metadata
* - livestreams
- False
- if set, will download live streams, otherwise will skip them; see --max-filesize for more control
* - live_from_start
- False
- if set, will download live streams from their earliest available moment, otherwise starts now.
* - proxy
-
- http/socks (https seems to not work atm) proxy to use for the webdriver, eg https://proxy-user:password@proxy-ip:port
* - end_means_success
- True
- if True, any archived content will mean a 'success', if False this archiver will not return a 'success' stage; this is useful for cases when the yt-dlp will archive a video but ignore other types of content like images or text only pages that the subsequent archivers can retrieve.
* - allow_playlist
- False
- If True will also download playlists, set to False if the expectation is to download a single video.
* - max_downloads
- inf
- Use to limit the number of videos to download when a channel or long page is being extracted. 'inf' means no limit.
* - cookies_from_browser
- None
- optional browser for ytdl to extract cookies from, can be one of: brave, chrome, chromium, edge, firefox, opera, safari, vivaldi, whale
* - cookie_file
- None
- optional cookie file to use for Youtube, see instructions here on how to export from your browser: https://github.com/yt-dlp/yt-dlp/wiki/FAQ#how-do-i-pass-cookies-to-yt-dlp
AAApiDb
-------
The following table lists all configuration options for this component:
.. list-table:: Configuration Options
:header-rows: 1
:widths: 25 20 55
* - **Key**
- **Default**
- **Description**
* - api_endpoint
- None
- API endpoint where calls are made to
* - api_token
- None
- API Bearer token.
* - public
- False
- whether the URL should be publicly available via the API
* - author_id
- None
- which email to assign as author
* - group_id
- None
- which group of users have access to the archive in case public=false as author
* - allow_rearchive
- True
- if False then the API database will be queried prior to any archiving operations and stop if the link has already been archived
* - store_results
- True
- when set, will send the results to the API database.
* - tags
- []
- what tags to add to the archived URL
AtlosDb
-------
The following table lists all configuration options for this component:
.. list-table:: Configuration Options
:header-rows: 1
:widths: 25 20 55
* - **Key**
- **Default**
- **Description**
* - api_token
- None
- An Atlos API token. For more information, see https://docs.atlos.org/technical/api/
* - atlos_url
- https://platform.atlos.org
- The URL of your Atlos instance (e.g., https://platform.atlos.org), without a trailing slash.
CSVDb
-----
The following table lists all configuration options for this component:
.. list-table:: Configuration Options
:header-rows: 1
:widths: 25 20 55
* - **Key**
- **Default**
- **Description**
* - csv_file
- db.csv
- CSV file name
HashEnricher
------------
The following table lists all configuration options for this component:
.. list-table:: Configuration Options
:header-rows: 1
:widths: 25 20 55
* - **Key**
- **Default**
- **Description**
* - algorithm
- SHA-256
- hash algorithm to use
* - chunksize
- 16000000
- number of bytes to use when reading files in chunks (if this value is too large you will run out of RAM), default is 16MB
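
The trade-off behind `chunksize` can be illustrated with a small sketch of chunked hashing. It uses only the standard-library `hashlib`. The function name `hash_stream` is hypothetical, and the mapping from the configured `SHA-256` to hashlib's `sha256` name is an assumption, not something this table states.

```python
import hashlib
import io


def hash_stream(stream, algorithm: str = "sha256", chunksize: int = 16_000_000) -> str:
    """Hash a binary stream while holding at most `chunksize` bytes in memory."""
    hasher = hashlib.new(algorithm)
    while True:
        chunk = stream.read(chunksize)
        if not chunk:  # an empty read means end of stream
            break
        hasher.update(chunk)
    return hasher.hexdigest()


# A tiny chunk size on an in-memory stream gives the same digest as a single pass.
digest = hash_stream(io.BytesIO(b"hello world"), chunksize=4)
```

Memory use is bounded by `chunksize` rather than by file size, which is why an oversized value can exhaust RAM.
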
ScreenshotEnricher
------------------
The following table lists all configuration options for this component:
.. list-table:: Configuration Options
:header-rows: 1
:widths: 25 20 55
* - **Key**
- **Default**
- **Description**
* - width
- 1280
- width of the screenshots
* - height
- 720
- height of the screenshots
* - timeout
- 60
- timeout for taking the screenshot
* - sleep_before_screenshot
- 4
- seconds to wait for the pages to load before taking screenshot
* - http_proxy
-
- http proxy to use for the webdriver, eg http://proxy-user:password@proxy-ip:port
* - save_to_pdf
- False
- save the page as pdf along with the screenshot. PDF saving options can be adjusted with the 'print_options' parameter
* - print_options
- {}
- options to pass to the pdf printer
SSLEnricher
-----------
The following table lists all configuration options for this component:
.. list-table:: Configuration Options
:header-rows: 1
:widths: 25 20 55
* - **Key**
- **Default**
- **Description**
* - skip_when_nothing_archived
- True
- if true, will skip enriching when no media is archived
ThumbnailEnricher
-----------------
The following table lists all configuration options for this component:
.. list-table:: Configuration Options
:header-rows: 1
:widths: 25 20 55
* - **Key**
- **Default**
- **Description**
* - thumbnails_per_minute
- 60
- how many thumbnails to generate per minute of video, can be limited by max_thumbnails
* - max_thumbnails
- 16
- limit the number of thumbnails to generate per video, 0 means no limit
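
How the two options interact can be sketched as simple arithmetic. The function name `thumbnail_count` and the rounding up are assumptions made for illustration. The table only says that the per-minute rate can be limited by `max_thumbnails` and that 0 disables the limit.

```python
import math


def thumbnail_count(duration_seconds: float,
                    thumbnails_per_minute: int = 60,
                    max_thumbnails: int = 16) -> int:
    """Number of thumbnails for a video of the given length, under the stated assumptions."""
    # Rate-based target; multiply before dividing to avoid float round-off.
    wanted = math.ceil(duration_seconds * thumbnails_per_minute / 60)
    # A positive max_thumbnails caps the result; 0 means no limit.
    if max_thumbnails > 0:
        wanted = min(wanted, max_thumbnails)
    return wanted


capped = thumbnail_count(120)                      # two minutes, default cap applies
uncapped = thumbnail_count(120, max_thumbnails=0)  # same video with the cap disabled
```

With the defaults, any video longer than 16 seconds reaches the cap in this sketch, so `max_thumbnails` is usually the binding setting.
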
TimestampingEnricher
--------------------
The following table lists all configuration options for this component:
.. list-table:: Configuration Options
:header-rows: 1
:widths: 25 20 55
* - **Key**
- **Default**
- **Description**
* - tsa_urls
- ['http://timestamp.digicert.com', 'http://timestamp.identrust.com', 'http://timestamp.globalsign.com/tsa/r6advanced1', 'http://tss.accv.es:8318/tsa']
- List of RFC3161 Time Stamp Authorities to use, separate with commas if passed via the command line.
WaczArchiverEnricher
--------------------
The following table lists all configuration options for this component:
.. list-table:: Configuration Options
:header-rows: 1
:widths: 25 20 55
* - **Key**
- **Default**
- **Description**
* - profile
- None
- browsertrix-profile (for profile generation see https://github.com/webrecorder/browsertrix-crawler#creating-and-using-browser-profiles).
* - docker_commands
- None
- if a custom docker invocation is needed
* - timeout
- 120
- timeout for WACZ generation in seconds
* - extract_media
- False
- If enabled all the images/videos/audio present in the WACZ archive will be extracted into separate Media and appear in the html report. The .wacz file will be kept untouched.
* - extract_screenshot
- True
- If enabled the screenshot captured by browsertrix will be extracted into separate Media and appear in the html report. The .wacz file will be kept untouched.
* - socks_proxy_host
- None
- SOCKS proxy host for browsertrix-crawler, use in combination with socks_proxy_port. eg: user:password@host
* - socks_proxy_port
- None
- SOCKS proxy port for browsertrix-crawler, use in combination with socks_proxy_host. eg 1234
* - proxy_server
- None
- SOCKS server proxy URL, in development
WaybackArchiverEnricher
-----------------------
The following table lists all configuration options for this component:
.. list-table:: Configuration Options
:header-rows: 1
:widths: 25 20 55
* - **Key**
- **Default**
- **Description**
* - timeout
- 15
- seconds to wait for successful archive confirmation from wayback, if more than this passes the result contains the job_id so the status can later be checked manually.
* - if_not_archived_within
- None
- only tell wayback to archive if no archive is available before the number of seconds specified, use None to ignore this option. For more information: https://docs.google.com/document/d/1Nsv52MvSjbLb2PCpHlat0gkzw0EvtSgpKHu4mk0MnrA
* - key
- None
- wayback API key. to get credentials visit https://archive.org/account/s3.php
* - secret
- None
- wayback API secret. to get credentials visit https://archive.org/account/s3.php
* - proxy_http
- None
- http proxy to use for wayback requests, eg http://proxy-user:password@proxy-ip:port
* - proxy_https
- None
- https proxy to use for wayback requests, eg https://proxy-user:password@proxy-ip:port
WhisperEnricher
---------------
The following table lists all configuration options for this component:
.. list-table:: Configuration Options
:header-rows: 1
:widths: 25 20 55
* - **Key**
- **Default**
- **Description**
* - api_endpoint
- None
- WhisperApi api endpoint, eg: https://whisperbox-api.com/api/v1, a deployment of https://github.com/bellingcat/whisperbox-transcribe.
* - api_key
- None
- WhisperApi api key for authentication
* - include_srt
- False
- Whether to include a subtitle SRT (SubRip Subtitle file) for the video (can be used in video players).
* - timeout
- 90
- How many seconds to wait at most for a successful job completion.
* - action
- translate
- which Whisper operation to execute
AtlosFeeder
-----------
The following table lists all configuration options for this component:
.. list-table:: Configuration Options
:header-rows: 1
:widths: 25 20 55
* - **Key**
- **Default**
- **Description**
* - api_token
- None
- An Atlos API token. For more information, see https://docs.atlos.org/technical/api/
* - atlos_url
- https://platform.atlos.org
- The URL of your Atlos instance (e.g., https://platform.atlos.org), without a trailing slash.
CLIFeeder
---------
The following table lists all configuration options for this component:
.. list-table:: Configuration Options
:header-rows: 1
:widths: 25 20 55
* - **Key**
- **Default**
- **Description**
* - urls
- None
- URL(s) to archive, either a single URL or a list of urls, should not come from config.yaml
GsheetsFeeder
-------------
The following table lists all configuration options for this component:
.. list-table:: Configuration Options
:header-rows: 1
:widths: 25 20 55
* - **Key**
- **Default**
- **Description**
* - sheet
- None
- name of the sheet to archive
* - sheet_id
- None
- (alternative to sheet name) the id of the sheet to archive
* - header
- 1
- index of the header row (starts at 1)
* - service_account
- secrets/service_account.json
- service account JSON file path
* - columns
- {'url': 'link', 'status': 'archive status', 'folder': 'destination folder', 'archive': 'archive location', 'date': 'archive date', 'thumbnail': 'thumbnail', 'timestamp': 'upload timestamp', 'title': 'upload title', 'text': 'text content', 'screenshot': 'screenshot', 'hash': 'hash', 'pdq_hash': 'perceptual hashes', 'wacz': 'wacz', 'replaywebpage': 'replaywebpage'}
- names of columns in the google sheet (stringified JSON object)
* - allow_worksheets
- set()
- (CSV) only worksheets whose name is included in allow are included (overrides worksheet_block), leave empty so all are allowed
* - block_worksheets
- set()
- (CSV) explicitly block some worksheets from being processed
* - use_sheet_names_in_stored_paths
- True
- if True the stored files path will include 'workbook_name/worksheet_name/...'
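
The precedence between `allow_worksheets` and `block_worksheets` can be sketched as below. The function name `should_process` and the exact, case-sensitive name comparison are assumptions. The table only states that a non-empty allow list overrides the block list and that an empty allow list permits everything.

```python
def should_process(worksheet_name: str,
                   allow_worksheets=frozenset(),
                   block_worksheets=frozenset()) -> bool:
    """Decide whether a worksheet is processed under the allow/block settings."""
    # A non-empty allow list wins: only the listed names are processed.
    if allow_worksheets:
        return worksheet_name in allow_worksheets
    # Otherwise everything is processed except explicitly blocked names.
    return worksheet_name not in block_worksheets


allowed = should_process("Sheet1", allow_worksheets={"Sheet1"}, block_worksheets={"Sheet1"})
blocked = should_process("Drafts", block_worksheets={"Drafts"})
```
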
HtmlFormatter
-------------
The following table lists all configuration options for this component:
.. list-table:: Configuration Options
:header-rows: 1
:widths: 25 20 55
* - **Key**
- **Default**
- **Description**
* - detect_thumbnails
- True
- if true will group by thumbnails generated by thumbnail enricher by id 'thumbnail_00'
AtlosStorage
------------
The following table lists all configuration options for this component:
.. list-table:: Configuration Options
:header-rows: 1
:widths: 25 20 55
* - **Key**
- **Default**
- **Description**
* - path_generator
- url
- how to store the file in terms of directory structure: 'flat' sets to root; 'url' creates a directory based on the provided URL; 'random' creates a random directory.
* - filename_generator
- random
- how to name stored files: 'random' creates a random string; 'static' uses a replicable strategy such as a hash.
* - api_token
- None
- An Atlos API token. For more information, see https://docs.atlos.org/technical/api/
* - atlos_url
- https://platform.atlos.org
- The URL of your Atlos instance (e.g., https://platform.atlos.org), without a trailing slash.
GDriveStorage
-------------
The following table lists all configuration options for this component:
.. list-table:: Configuration Options
:header-rows: 1
:widths: 25 20 55
* - **Key**
- **Default**
- **Description**
* - path_generator
- url
- how to store the file in terms of directory structure: 'flat' sets to root; 'url' creates a directory based on the provided URL; 'random' creates a random directory.
* - filename_generator
- random
- how to name stored files: 'random' creates a random string; 'static' uses a replicable strategy such as a hash.
* - root_folder_id
- None
- root google drive folder ID to use as storage, found in URL: 'https://drive.google.com/drive/folders/FOLDER_ID'
* - oauth_token
- None
- JSON filename with Google Drive OAuth token: check auto-archiver repository scripts folder for create_update_gdrive_oauth_token.py. NOTE: storage used will count towards owner of GDrive folder, therefore it is best to use oauth_token_filename over service_account.
* - service_account
- secrets/service_account.json
- service account JSON file path, same as used for Google Sheets. NOTE: storage used will count towards the developer account.
LocalStorage
------------
The following table lists all configuration options for this component:
.. list-table:: Configuration Options
:header-rows: 1
:widths: 25 20 55
* - **Key**
- **Default**
- **Description**
* - path_generator
- url
- how to store the file in terms of directory structure: 'flat' sets to root; 'url' creates a directory based on the provided URL; 'random' creates a random directory.
* - filename_generator
- random
- how to name stored files: 'random' creates a random string; 'static' uses a replicable strategy such as a hash.
* - save_to
- ./archived
- folder where to save archived content
* - save_absolute
- False
- whether the path to the stored file is absolute or relative in the output result inc. formatters (WARN: leaks the file structure)
S3Storage
---------
The following table lists all configuration options for this component:
.. list-table:: Configuration Options
:header-rows: 1
:widths: 25 20 55
* - **Key**
- **Default**
- **Description**
* - path_generator
- url
- how to store the file in terms of directory structure: 'flat' sets to root; 'url' creates a directory based on the provided URL; 'random' creates a random directory.
* - filename_generator
- random
- how to name stored files: 'random' creates a random string; 'static' uses a replicable strategy such as a hash.
* - bucket
- None
- S3 bucket name
* - region
- None
- S3 region name
* - key
- None
- S3 API key
* - secret
- None
- S3 API secret
* - random_no_duplicate
- False
- if set, it will override `path_generator`, `filename_generator` and `folder`. It will check if the file already exists and if so it will not upload it again. Creates a new root folder path `no-dups/`
* - endpoint_url
- https://{region}.digitaloceanspaces.com
- S3 bucket endpoint, {region} are inserted at runtime
* - cdn_url
- https://{bucket}.{region}.cdn.digitaloceanspaces.com/{key}
- S3 CDN url, {bucket}, {region} and {key} are inserted at runtime
* - private
- False
- if true S3 files will not be readable online
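
The runtime substitution described for `endpoint_url` and `cdn_url` can be sketched with plain `str.format`. This is an illustration under assumptions, not the storage module's code: `build_s3_urls`, the bucket name and the region are hypothetical, and `{key}` is taken to mean the stored object's key rather than the `key` API credential.

```python
def build_s3_urls(bucket: str, region: str, key: str,
                  endpoint_url: str = "https://{region}.digitaloceanspaces.com",
                  cdn_url: str = "https://{bucket}.{region}.cdn.digitaloceanspaces.com/{key}"):
    """Fill the URL templates from the table with concrete values."""
    # Only {region} appears in the endpoint template.
    endpoint = endpoint_url.format(region=region)
    # The CDN template uses all three placeholders.
    public = cdn_url.format(bucket=bucket, region=region, key=key)
    return endpoint, public


endpoint, public = build_s3_urls("my-bucket", "fra1", "no-dups/abc.jpg")
```
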
Storage
-------
The following table lists all configuration options for this component:
.. list-table:: Configuration Options
:header-rows: 1
:widths: 25 20 55
* - **Key**
- **Default**
- **Description**
* - path_generator
- url
- how to store the file in terms of directory structure: 'flat' sets to root; 'url' creates a directory based on the provided URL; 'random' creates a random directory.
* - filename_generator
- random
- how to name stored files: 'random' creates a random string; 'static' uses a replicable strategy such as a hash.
Gsheets
-------
The following table lists all configuration options for this component:
.. list-table:: Configuration Options
:header-rows: 1
:widths: 25 20 55
* - **Key**
- **Default**
- **Description**
* - sheet
- None
- name of the sheet to archive
* - sheet_id
- None
- (alternative to sheet name) the id of the sheet to archive
* - header
- 1
- index of the header row (starts at 1)
* - service_account
- secrets/service_account.json
- service account JSON file path
* - columns
- {'url': 'link', 'status': 'archive status', 'folder': 'destination folder', 'archive': 'archive location', 'date': 'archive date', 'thumbnail': 'thumbnail', 'timestamp': 'upload timestamp', 'title': 'upload title', 'text': 'text content', 'screenshot': 'screenshot', 'hash': 'hash', 'pdq_hash': 'perceptual hashes', 'wacz': 'wacz', 'replaywebpage': 'replaywebpage'}
- names of columns in the google sheet (stringified JSON object)

@@ -36,10 +36,12 @@ exclude_patterns = []
# -- AutoAPI Configuration ---------------------------------------------------
autoapi_type = 'python'
autoapi_dirs = ["../../src/auto_archiver/core/", "../../src/auto_archiver/utils/", "../../src/auto_archiver/modules/"]
autoapi_dirs = ["../../src/auto_archiver/core/", "../../src/auto_archiver/utils/"]
# get all the modules and add them to the autoapi_dirs
autoapi_dirs.extend([f"../../src/auto_archiver/modules/{m}" for m in os.listdir("../../src/auto_archiver/modules")])
autodoc_typehints = "signature" # Include type hints in the signature
autoapi_ignore = ["*/version.py", ] # Ignore specific modules
autoapi_keep_files = False # Option to retain intermediate JSON files for debugging
autoapi_keep_files = True # Option to retain intermediate JSON files for debugging
autoapi_add_toctree_entry = True # Include API docs in the TOC
autoapi_python_use_implicit_namespaces = True
autoapi_template_dir = "../_templates/autoapi"
@@ -47,7 +49,6 @@ autoapi_options = [
"members",
"undoc-members",
"show-inheritance",
"show-module-summary",
"imported-members",
]
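
The `autoapi_dirs` change in this file lists every entry under `modules/` and adds it as its own AutoAPI source directory. A minimal sketch of that idea follows. The directory filter is an addition that is not part of the commit, since `os.listdir` also returns plain files and cache folders. The function name `collect_autoapi_dirs` and the temporary layout are hypothetical.

```python
import os
import tempfile


def collect_autoapi_dirs(src_root: str) -> list:
    """Build an autoapi_dirs-style list: core, utils, then one entry per module folder."""
    dirs = [f"{src_root}/core/", f"{src_root}/utils/"]
    modules_root = os.path.join(src_root, "modules")
    for entry in sorted(os.listdir(modules_root)):
        # Assumption (not in the commit): skip files and private/cache folders.
        if entry.startswith("_") or not os.path.isdir(os.path.join(modules_root, entry)):
            continue
        dirs.append(f"{src_root}/modules/{entry}")
    return dirs


# Demonstrate on a throwaway layout instead of the real source tree.
root = tempfile.mkdtemp()
for sub in ("core", "utils", "modules/hash_enricher", "modules/__pycache__"):
    os.makedirs(os.path.join(root, sub))
dirs = collect_autoapi_dirs(root)
```

The line added in the commit relies on `os` already being imported in `conf.py` and on the relative path resolving from the directory where Sphinx runs.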

@@ -9,6 +9,7 @@ The default (enabled) databases are the CSV Database and the Console Database.
```{toctree}
:depth: 1
:hidden:
:glob:
autogen/database/*
```

@@ -8,6 +8,7 @@ Enricher modules are used to add additional information to the items that have
```{toctree}
:depth: 1
:hidden:
:glob:
autogen/enricher/*
```

@@ -12,6 +12,7 @@ Extractors that are able to extract content from a wide range of websites includ
```{toctree}
:depth: 1
:hidden:
:glob:
autogen/extractor/*
```

@@ -10,5 +10,6 @@ The default feeder is the command line feeder, which allows you to input URLs di
```{toctree}
:depth: 1
:glob:
:hidden:
autogen/feeder/*
```

@@ -7,6 +7,7 @@ Formatter modules are used to format the data extracted from a URL into a specif
```{toctree}
:depth: 1
:hidden:
:glob:
autogen/formatter/*
```

@@ -5,4 +5,11 @@ Storage modules are used to store the data extracted from a URL in a persistent
The default is to store the files downloaded (e.g. images, videos) in a local directory.
```{include} autogen/storage.md
```
```{toctree}
:depth: 1
:hidden:
:glob:
autogen/storage/*
```