Configs
-------

This section documents all configuration options available for various components.

InstagramAPIArchiver
--------------------

The following table lists all configuration options for this component:

.. list-table:: Configuration Options
   :header-rows: 1
   :widths: 25 20 55

   * - **Key**
     - **Default**
     - **Description**
   * - access_token
     - None
     - a valid instagrapi-api token
   * - api_endpoint
     - None
     - API endpoint to use
   * - full_profile
     - False
     - if true, downloads all posts, tagged posts, stories, and highlights for a profile; if false, downloads only the profile picture and information.
   * - full_profile_max_posts
     - 0
     - limits the number of posts to download when full_profile is true; 0 means no limit. The limit is applied softly because posts are fetched in batches, and it is applied once each to posts, tagged posts, and highlights.
   * - minimize_json_output
     - True
     - if true, removes empty values from the JSON output

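As an illustration of what ``minimize_json_output`` describes, the following is a minimal sketch of removing empty values from a nested JSON-like structure. The function name ``remove_empty_values`` and the exact definition of "empty" (``None``, empty strings, empty lists, and empty dicts) are assumptions made for this example, not the archiver's actual implementation.

```python
def remove_empty_values(value):
    """Recursively drop None, empty strings, and empty containers.

    Hypothetical helper: which values count as "empty" is an assumption.
    """
    empty = (None, "", [], {})
    if isinstance(value, dict):
        cleaned = {k: remove_empty_values(v) for k, v in value.items()}
        return {k: v for k, v in cleaned.items() if v not in empty}
    if isinstance(value, list):
        cleaned = [remove_empty_values(v) for v in value]
        return [v for v in cleaned if v not in empty]
    # Scalars such as 0 or False are kept: they carry information.
    return value
```

Note that under these assumptions ``0`` and ``False`` are preserved, since only missing or empty values are removed.
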
InstagramArchiver
-----------------

The following table lists all configuration options for this component:

.. list-table:: Configuration Options
   :header-rows: 1
   :widths: 25 20 55

   * - **Key**
     - **Default**
     - **Description**
   * - username
     - None
     - a valid Instagram username
   * - password
     - None
     - the corresponding Instagram account password
   * - download_folder
     - instaloader
     - name of a folder to temporarily download content to
   * - session_file
     - secrets/instaloader.session
     - path to the instagram session which saves session credentials

InstagramTbotArchiver
---------------------

The following table lists all configuration options for this component:

.. list-table:: Configuration Options
   :header-rows: 1
   :widths: 25 20 55

   * - **Key**
     - **Default**
     - **Description**
   * - api_id
     - None
     - telegram API_ID value, go to https://my.telegram.org/apps
   * - api_hash
     - None
     - telegram API_HASH value, go to https://my.telegram.org/apps
   * - session_file
     - secrets/anon-insta
     - optional, records the telegram login session for future usage, '.session' will be appended to the provided value.
   * - timeout
     - 45
     - timeout to fetch the instagram content in seconds.

TelethonArchiver
----------------

The following table lists all configuration options for this component:

.. list-table:: Configuration Options
   :header-rows: 1
   :widths: 25 20 55

   * - **Key**
     - **Default**
     - **Description**
   * - api_id
     - None
     - telegram API_ID value, go to https://my.telegram.org/apps
   * - api_hash
     - None
     - telegram API_HASH value, go to https://my.telegram.org/apps
   * - bot_token
     - None
     - optional, but allows access to more content such as large videos; talk to @botfather to obtain one
   * - session_file
     - secrets/anon
     - optional, records the telegram login session for future usage, '.session' will be appended to the provided value.
   * - join_channels
     - True
     - set to False to disable the initial setup with the channel_invites config; useful if you have many invites and startup gets stuck
   * - channel_invites
     - {}
     - (JSON string) private channel invite links (format: t.me/joinchat/HASH or t.me/+HASH) and, optionally, the channel id (format: CHANNEL_ID taken from a post URL like https://t.me/c/CHANNEL_ID/1). Providing the channel id is optional but important to avoid hanging for minutes on startup. The telegram account will join any new channels on setup.

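The link formats described for ``channel_invites`` can be checked before they are placed in the configuration. The sketch below extracts the ``HASH`` from an invite link and the ``CHANNEL_ID`` from a post URL, using only the three formats listed above. The helper names and regular expressions are assumptions for illustration and are not part of the archiver.

```python
import re

# Matches t.me/joinchat/HASH and t.me/+HASH (hypothetical patterns).
INVITE_RE = re.compile(r"t\.me/(?:joinchat/|\+)([\w-]+)")
# Matches post URLs such as https://t.me/c/CHANNEL_ID/1.
POST_RE = re.compile(r"t\.me/c/(\d+)/\d+")


def invite_hash(link):
    """Return the HASH part of a private invite link, or None."""
    match = INVITE_RE.search(link)
    return match.group(1) if match else None


def channel_id_from_post(url):
    """Return the numeric CHANNEL_ID from a post URL, or None."""
    match = POST_RE.search(url)
    return int(match.group(1)) if match else None
```

A link for which ``invite_hash`` returns ``None`` does not follow either of the two documented invite formats.
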
TwitterApiArchiver
------------------

The following table lists all configuration options for this component:

.. list-table:: Configuration Options
   :header-rows: 1
   :widths: 25 20 55

   * - **Key**
     - **Default**
     - **Description**
   * - bearer_token
     - None
     - [deprecated: see bearer_tokens] twitter API bearer_token which is enough for archiving, if not provided you will need consumer_key, consumer_secret, access_token, access_secret
   * - bearer_tokens
     - []
     - a list of twitter API bearer_token which is enough for archiving, if not provided you will need consumer_key, consumer_secret, access_token, access_secret, if provided you can still add those for better rate limits. CSV of bearer tokens if provided via the command line
   * - consumer_key
     - None
     - twitter API consumer_key
   * - consumer_secret
     - None
     - twitter API consumer_secret
   * - access_token
     - None
     - twitter API access_token
   * - access_secret
     - None
     - twitter API access_secret

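The credential rules in the table above (bearer tokens alone are enough; otherwise all four OAuth values are needed) can be expressed as a small check. This is a sketch under stated assumptions: the configuration is modelled as a plain ``dict``, and both function names are hypothetical rather than part of the archiver.

```python
OAUTH_KEYS = ("consumer_key", "consumer_secret", "access_token", "access_secret")


def parse_bearer_tokens(value):
    """Accept a list, or a CSV string as passed on the command line."""
    if isinstance(value, str):
        return [token.strip() for token in value.split(",") if token.strip()]
    return list(value or [])


def has_enough_credentials(config):
    """True if bearer tokens or the full set of OAuth values is present."""
    tokens = parse_bearer_tokens(config.get("bearer_tokens"))
    if config.get("bearer_token"):  # deprecated single-token option
        tokens.append(config["bearer_token"])
    return bool(tokens) or all(config.get(key) for key in OAUTH_KEYS)
```

For example, ``has_enough_credentials({"bearer_tokens": "tokenA,tokenB"})`` is true, while a configuration holding only ``consumer_key`` is not sufficient.
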
VkArchiver
----------

The following table lists all configuration options for this component:

.. list-table:: Configuration Options
   :header-rows: 1
   :widths: 25 20 55

   * - **Key**
     - **Default**
     - **Description**
   * - username
     - None
     - valid VKontakte username
   * - password
     - None
     - valid VKontakte password
   * - session_file
     - secrets/vk_config.v2.json
     - path to the file where the VKontakte session is stored

YoutubeDLArchiver
-----------------

The following table lists all configuration options for this component:

.. list-table:: Configuration Options
   :header-rows: 1
   :widths: 25 20 55

   * - **Key**
     - **Default**
     - **Description**
   * - facebook_cookie
     - None
     - optional facebook cookie to have more access to content, from browser, looks like 'cookie: datr= xxxx'
   * - subtitles
     - True
     - download subtitles if available
   * - comments
     - False
     - download all comments if available, may lead to large metadata
   * - livestreams
     - False
     - if set, will download live streams, otherwise will skip them; see ``--max-filesize`` for more control
   * - live_from_start
     - False
     - if set, will download live streams from their earliest available moment, otherwise starts now.
   * - proxy
     -
     - http/socks (https seems to not work at the moment) proxy to use for the webdriver, eg https://proxy-user:password@proxy-ip:port
   * - end_means_success
     - True
     - if True, any archived content will mean a 'success'; if False, this archiver will not return a 'success' stage. This is useful for cases when yt-dlp will archive a video but ignore other types of content, like images or text-only pages, that the subsequent archivers can retrieve.
   * - allow_playlist
     - False
     - If True will also download playlists, set to False if the expectation is to download a single video.
   * - max_downloads
     - inf
     - Use to limit the number of videos to download when a channel or long page is being extracted. 'inf' means no limit.
   * - cookies_from_browser
     - None
     - optional browser for ytdl to extract cookies from, can be one of: brave, chrome, chromium, edge, firefox, opera, safari, vivaldi, whale
   * - cookie_file
     - None
     - optional cookie file to use for Youtube, see instructions here on how to export from your browser: https://github.com/yt-dlp/yt-dlp/wiki/FAQ#how-do-i-pass-cookies-to-yt-dlp

AAApiDb
-------

The following table lists all configuration options for this component:

.. list-table:: Configuration Options
   :header-rows: 1
   :widths: 25 20 55

   * - **Key**
     - **Default**
     - **Description**
   * - api_endpoint
     - None
     - API endpoint where calls are made to
   * - api_token
     - None
     - API Bearer token.
   * - public
     - False
     - whether the URL should be publicly available via the API
   * - author_id
     - None
     - which email to assign as author
   * - group_id
     - None
     - which group of users has access to the archive in case public=false
   * - allow_rearchive
     - True
     - if False, the API database will be queried prior to any archiving operations, and archiving will stop if the link has already been archived
   * - store_results
     - True
     - when set, will send the results to the API database.
   * - tags
     - []
     - what tags to add to the archived URL

AtlosDb
-------

The following table lists all configuration options for this component:

.. list-table:: Configuration Options
   :header-rows: 1
   :widths: 25 20 55

   * - **Key**
     - **Default**
     - **Description**
   * - api_token
     - None
     - An Atlos API token. For more information, see https://docs.atlos.org/technical/api/
   * - atlos_url
     - https://platform.atlos.org
     - The URL of your Atlos instance (e.g., https://platform.atlos.org), without a trailing slash.

CSVDb
-----

The following table lists all configuration options for this component:

.. list-table:: Configuration Options
   :header-rows: 1
   :widths: 25 20 55

   * - **Key**
     - **Default**
     - **Description**
   * - csv_file
     - db.csv
     - CSV file name

HashEnricher
------------

The following table lists all configuration options for this component:

.. list-table:: Configuration Options
   :header-rows: 1
   :widths: 25 20 55

   * - **Key**
     - **Default**
     - **Description**
   * - algorithm
     - SHA-256
     - hash algorithm to use
   * - chunksize
     - 16000000
     - number of bytes to use when reading files in chunks (if this value is too large you will run out of RAM), default is 16MB

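To show why ``chunksize`` bounds memory use, here is a minimal sketch of chunked hashing with Python's standard ``hashlib``. The function ``hash_stream`` and the name mapping ``HASHLIB_NAMES`` are assumptions for this example; only the ``SHA-256`` default and the 16000000-byte chunk size come from the table above.

```python
import hashlib
import io

# Assumed mapping from the documented option value to a hashlib name.
HASHLIB_NAMES = {"SHA-256": "sha256"}


def hash_stream(stream, algorithm="SHA-256", chunksize=16_000_000):
    """Hash a binary stream while holding at most `chunksize` bytes in memory."""
    hasher = hashlib.new(HASHLIB_NAMES[algorithm])
    while True:
        chunk = stream.read(chunksize)
        if not chunk:  # end of stream
            break
        hasher.update(chunk)
    return hasher.hexdigest()


# Usage: any binary file-like object works, e.g. open(path, "rb").
digest = hash_stream(io.BytesIO(b"example content"), chunksize=4)
```

The digest does not depend on the chunk size; a smaller value only trades more read calls for a lower peak memory footprint.
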
ScreenshotEnricher
------------------

The following table lists all configuration options for this component:

.. list-table:: Configuration Options
   :header-rows: 1
   :widths: 25 20 55

   * - **Key**
     - **Default**
     - **Description**
   * - width
     - 1280
     - width of the screenshots
   * - height
     - 720
     - height of the screenshots
   * - timeout
     - 60
     - timeout for taking the screenshot
   * - sleep_before_screenshot
     - 4
     - seconds to wait for the pages to load before taking screenshot
   * - http_proxy
     -
     - http proxy to use for the webdriver, eg http://proxy-user:password@proxy-ip:port
   * - save_to_pdf
     - False
     - save the page as pdf along with the screenshot. PDF saving options can be adjusted with the 'print_options' parameter
   * - print_options
     - {}
     - options to pass to the pdf printer

SSLEnricher
-----------

The following table lists all configuration options for this component:

.. list-table:: Configuration Options
   :header-rows: 1
   :widths: 25 20 55

   * - **Key**
     - **Default**
     - **Description**
   * - skip_when_nothing_archived
     - True
     - if true, will skip enriching when no media is archived

ThumbnailEnricher
-----------------

The following table lists all configuration options for this component:

.. list-table:: Configuration Options
   :header-rows: 1
   :widths: 25 20 55

   * - **Key**
     - **Default**
     - **Description**
   * - thumbnails_per_minute
     - 60
     - how many thumbnails to generate per minute of video, can be limited by max_thumbnails
   * - max_thumbnails
     - 16
     - limit the number of thumbnails to generate per video, 0 means no limit

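The interaction between ``thumbnails_per_minute`` and ``max_thumbnails`` can be illustrated with a short calculation. This is a sketch only: the function name, the rounding up, and the minimum of one thumbnail are assumptions, not the enricher's confirmed behaviour.

```python
import math


def thumbnail_count(duration_seconds, thumbnails_per_minute=60, max_thumbnails=16):
    """Estimate how many thumbnails would be generated for one video."""
    # Rate-based target; rounding up and the floor of 1 are assumptions.
    wanted = math.ceil(duration_seconds * thumbnails_per_minute / 60)
    wanted = max(wanted, 1)
    # 0 means "no limit", so the cap applies only to positive values.
    if max_thumbnails > 0:
        wanted = min(wanted, max_thumbnails)
    return wanted
```

With the defaults, any video longer than 16 seconds reaches the cap of 16 thumbnails; setting ``max_thumbnails`` to 0 lets the per-minute rate decide alone.
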
TimestampingEnricher
--------------------

The following table lists all configuration options for this component:

.. list-table:: Configuration Options
   :header-rows: 1
   :widths: 25 20 55

   * - **Key**
     - **Default**
     - **Description**
   * - tsa_urls
     - ['http://timestamp.digicert.com', 'http://timestamp.identrust.com', 'http://timestamp.globalsign.com/tsa/r6advanced1', 'http://tss.accv.es:8318/tsa']
     - List of RFC3161 Time Stamp Authorities to use, separate with commas if passed via the command line.

WaczArchiverEnricher
--------------------

The following table lists all configuration options for this component:

.. list-table:: Configuration Options
   :header-rows: 1
   :widths: 25 20 55

   * - **Key**
     - **Default**
     - **Description**
   * - profile
     - None
     - browsertrix-profile (for profile generation see https://github.com/webrecorder/browsertrix-crawler#creating-and-using-browser-profiles).
   * - docker_commands
     - None
     - if a custom docker invocation is needed
   * - timeout
     - 120
     - timeout for WACZ generation in seconds
   * - extract_media
     - False
     - If enabled all the images/videos/audio present in the WACZ archive will be extracted into separate Media and appear in the html report. The .wacz file will be kept untouched.
   * - extract_screenshot
     - True
     - If enabled the screenshot captured by browsertrix will be extracted into separate Media and appear in the html report. The .wacz file will be kept untouched.
   * - socks_proxy_host
     - None
     - SOCKS proxy host for browsertrix-crawler, use in combination with socks_proxy_port. eg: user:password@host
   * - socks_proxy_port
     - None
     - SOCKS proxy port for browsertrix-crawler, use in combination with socks_proxy_host. eg 1234
   * - proxy_server
     - None
     - SOCKS server proxy URL, in development

WaybackArchiverEnricher
-----------------------

The following table lists all configuration options for this component:

.. list-table:: Configuration Options
   :header-rows: 1
   :widths: 25 20 55

   * - **Key**
     - **Default**
     - **Description**
   * - timeout
     - 15
     - seconds to wait for successful archive confirmation from wayback; if more than this passes, the result contains the job_id so the status can later be checked manually.
   * - if_not_archived_within
     - None
     - only tell wayback to archive if no archive has been made within the specified number of seconds; use None to ignore this option. For more information: https://docs.google.com/document/d/1Nsv52MvSjbLb2PCpHlat0gkzw0EvtSgpKHu4mk0MnrA
   * - key
     - None
     - wayback API key. To get credentials visit https://archive.org/account/s3.php
   * - secret
     - None
     - wayback API secret. To get credentials visit https://archive.org/account/s3.php
   * - proxy_http
     - None
     - http proxy to use for wayback requests, eg http://proxy-user:password@proxy-ip:port
   * - proxy_https
     - None
     - https proxy to use for wayback requests, eg https://proxy-user:password@proxy-ip:port

WhisperEnricher
---------------

The following table lists all configuration options for this component:

.. list-table:: Configuration Options
   :header-rows: 1
   :widths: 25 20 55

   * - **Key**
     - **Default**
     - **Description**
   * - api_endpoint
     - None
     - WhisperApi api endpoint, eg: https://whisperbox-api.com/api/v1, a deployment of https://github.com/bellingcat/whisperbox-transcribe.
   * - api_key
     - None
     - WhisperApi api key for authentication
   * - include_srt
     - False
     - Whether to include a subtitle SRT (SubRip Subtitle file) for the video (can be used in video players).
   * - timeout
     - 90
     - How many seconds to wait at most for a successful job completion.
   * - action
     - translate
     - which Whisper operation to execute

AtlosFeeder
-----------

The following table lists all configuration options for this component:

.. list-table:: Configuration Options
   :header-rows: 1
   :widths: 25 20 55

   * - **Key**
     - **Default**
     - **Description**
   * - api_token
     - None
     - An Atlos API token. For more information, see https://docs.atlos.org/technical/api/
   * - atlos_url
     - https://platform.atlos.org
     - The URL of your Atlos instance (e.g., https://platform.atlos.org), without a trailing slash.

CLIFeeder
---------

The following table lists all configuration options for this component:

.. list-table:: Configuration Options
   :header-rows: 1
   :widths: 25 20 55

   * - **Key**
     - **Default**
     - **Description**
   * - urls
     - None
     - URL(s) to archive, either a single URL or a list of urls, should not come from config.yaml

GsheetsFeeder
-------------

The following table lists all configuration options for this component:

.. list-table:: Configuration Options
   :header-rows: 1
   :widths: 25 20 55

   * - **Key**
     - **Default**
     - **Description**
   * - sheet
     - None
     - name of the sheet to archive
   * - sheet_id
     - None
     - (alternative to sheet name) the id of the sheet to archive
   * - header
     - 1
     - index of the header row (starts at 1)
   * - service_account
     - secrets/service_account.json
     - service account JSON file path
   * - columns
     - {'url': 'link', 'status': 'archive status', 'folder': 'destination folder', 'archive': 'archive location', 'date': 'archive date', 'thumbnail': 'thumbnail', 'timestamp': 'upload timestamp', 'title': 'upload title', 'text': 'text content', 'screenshot': 'screenshot', 'hash': 'hash', 'pdq_hash': 'perceptual hashes', 'wacz': 'wacz', 'replaywebpage': 'replaywebpage'}
     - names of columns in the google sheet (stringified JSON object)
   * - allow_worksheets
     - set()
     - (CSV) only worksheets whose name is included in allow_worksheets are processed (overrides block_worksheets); leave empty so all are allowed
   * - block_worksheets
     - set()
     - (CSV) explicitly block some worksheets from being processed
   * - use_sheet_names_in_stored_paths
     - True
     - if True the stored files path will include 'workbook_name/worksheet_name/...'

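The precedence between ``allow_worksheets`` and ``block_worksheets`` described above can be sketched as a small filter. Both function names are hypothetical, and the logic follows only what the table states: a non-empty allow list overrides the block list, and an empty allow list permits every worksheet that is not blocked.

```python
def parse_csv_set(value):
    """Turn a CSV string such as 'Sheet1, Sheet2' into a set of names."""
    return {item.strip() for item in value.split(",") if item.strip()}


def should_process_worksheet(name, allow_worksheets=frozenset(), block_worksheets=frozenset()):
    """Decide whether a worksheet is processed under the documented rules."""
    if allow_worksheets:
        # A non-empty allow list takes precedence over the block list.
        return name in allow_worksheets
    return name not in block_worksheets
```

For example, with ``allow_worksheets`` set to ``{"Sheet1"}``, a worksheet named ``Sheet1`` is processed even if it also appears in ``block_worksheets``.
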
HtmlFormatter
-------------

The following table lists all configuration options for this component:

.. list-table:: Configuration Options
   :header-rows: 1
   :widths: 25 20 55

   * - **Key**
     - **Default**
     - **Description**
   * - detect_thumbnails
     - True
     - if true will group by thumbnails generated by thumbnail enricher by id 'thumbnail_00'

AtlosStorage
------------

The following table lists all configuration options for this component:

.. list-table:: Configuration Options
   :header-rows: 1
   :widths: 25 20 55

   * - **Key**
     - **Default**
     - **Description**
   * - path_generator
     - url
     - how to store the file in terms of directory structure: 'flat' sets to root; 'url' creates a directory based on the provided URL; 'random' creates a random directory.
   * - filename_generator
     - random
     - how to name stored files: 'random' creates a random string; 'static' uses a replicable strategy such as a hash.
   * - api_token
     - None
     - An Atlos API token. For more information, see https://docs.atlos.org/technical/api/
   * - atlos_url
     - https://platform.atlos.org
     - The URL of your Atlos instance (e.g., https://platform.atlos.org), without a trailing slash.

GDriveStorage
-------------

The following table lists all configuration options for this component:

.. list-table:: Configuration Options
   :header-rows: 1
   :widths: 25 20 55

   * - **Key**
     - **Default**
     - **Description**
   * - path_generator
     - url
     - how to store the file in terms of directory structure: 'flat' sets to root; 'url' creates a directory based on the provided URL; 'random' creates a random directory.
   * - filename_generator
     - random
     - how to name stored files: 'random' creates a random string; 'static' uses a replicable strategy such as a hash.
   * - root_folder_id
     - None
     - root google drive folder ID to use as storage, found in URL: 'https://drive.google.com/drive/folders/FOLDER_ID'
   * - oauth_token
     - None
     - JSON filename with Google Drive OAuth token: check auto-archiver repository scripts folder for create_update_gdrive_oauth_token.py. NOTE: storage used will count towards owner of GDrive folder, therefore it is best to use oauth_token over service_account.
   * - service_account
     - secrets/service_account.json
     - service account JSON file path, same as used for Google Sheets. NOTE: storage used will count towards the developer account.

LocalStorage
------------

The following table lists all configuration options for this component:

.. list-table:: Configuration Options
   :header-rows: 1
   :widths: 25 20 55

   * - **Key**
     - **Default**
     - **Description**
   * - path_generator
     - url
     - how to store the file in terms of directory structure: 'flat' sets to root; 'url' creates a directory based on the provided URL; 'random' creates a random directory.
   * - filename_generator
     - random
     - how to name stored files: 'random' creates a random string; 'static' uses a replicable strategy such as a hash.
   * - save_to
     - ./archived
     - folder where to save archived content
   * - save_absolute
     - False
     - whether the path to the stored file is absolute or relative in the output result inc. formatters (WARN: leaks the file structure)

S3Storage
---------

The following table lists all configuration options for this component:

.. list-table:: Configuration Options
   :header-rows: 1
   :widths: 25 20 55

   * - **Key**
     - **Default**
     - **Description**
   * - path_generator
     - url
     - how to store the file in terms of directory structure: 'flat' sets to root; 'url' creates a directory based on the provided URL; 'random' creates a random directory.
   * - filename_generator
     - random
     - how to name stored files: 'random' creates a random string; 'static' uses a replicable strategy such as a hash.
   * - bucket
     - None
     - S3 bucket name
   * - region
     - None
     - S3 region name
   * - key
     - None
     - S3 API key
   * - secret
     - None
     - S3 API secret
   * - random_no_duplicate
     - False
     - if set, it will override ``path_generator``, ``filename_generator`` and ``folder``. It will check if the file already exists and if so it will not upload it again. Creates a new root folder path ``no-dups/``
   * - endpoint_url
     - https://{region}.digitaloceanspaces.com
     - S3 bucket endpoint, {region} is inserted at runtime
   * - cdn_url
     - https://{bucket}.{region}.cdn.digitaloceanspaces.com/{key}
     - S3 CDN url, {bucket}, {region} and {key} are inserted at runtime
   * - private
     - False
     - if true S3 files will not be readable online

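The placeholders in ``endpoint_url`` and ``cdn_url`` can be illustrated with Python's ``str.format``. This sketch assumes that ``{key}`` in ``cdn_url`` refers to the stored object's key (its path inside the bucket) rather than the S3 API key; the function name and the sample values are hypothetical.

```python
DEFAULT_ENDPOINT_URL = "https://{region}.digitaloceanspaces.com"
DEFAULT_CDN_URL = "https://{bucket}.{region}.cdn.digitaloceanspaces.com/{key}"


def s3_urls(bucket, region, object_key,
            endpoint_url=DEFAULT_ENDPOINT_URL, cdn_url=DEFAULT_CDN_URL):
    """Fill the documented templates the way they might be filled at runtime."""
    endpoint = endpoint_url.format(region=region)
    # Assumption: {key} is the object's key, not the API key option.
    public_url = cdn_url.format(bucket=bucket, region=region, key=object_key)
    return endpoint, public_url
```

When targeting a different S3-compatible provider, both templates would need to be overridden so that the placeholders expand to that provider's URL scheme.
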
Storage
-------

The following table lists all configuration options for this component:

.. list-table:: Configuration Options
   :header-rows: 1
   :widths: 25 20 55

   * - **Key**
     - **Default**
     - **Description**
   * - path_generator
     - url
     - how to store the file in terms of directory structure: 'flat' sets to root; 'url' creates a directory based on the provided URL; 'random' creates a random directory.
   * - filename_generator
     - random
     - how to name stored files: 'random' creates a random string; 'static' uses a replicable strategy such as a hash.

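The ``path_generator`` and ``filename_generator`` options shared by all storages can be pictured with the sketch below. It is only one possible reading of the descriptions above: the way a directory is derived from the URL, the use of SHA-256 for the 'static' strategy, and both function names are assumptions, not the storages' actual implementation.

```python
import hashlib
import re
import uuid
from urllib.parse import urlparse


def folder_for(url, path_generator="url"):
    """Pick a directory for the files archived from `url`."""
    if path_generator == "flat":
        return ""  # store at the root
    if path_generator == "url":
        parsed = urlparse(url)
        # Assumed scheme: host plus path, with unsafe characters replaced.
        return re.sub(r"[^\w.-]+", "_", parsed.netloc + parsed.path).strip("_")
    if path_generator == "random":
        return uuid.uuid4().hex[:16]
    raise ValueError(f"unknown path_generator: {path_generator}")


def filename_for(content, filename_generator="random"):
    """Pick a file name (without extension) for the given bytes."""
    if filename_generator == "random":
        return uuid.uuid4().hex
    if filename_generator == "static":
        # Replicable: the same content always yields the same name.
        return hashlib.sha256(content).hexdigest()[:24]
    raise ValueError(f"unknown filename_generator: {filename_generator}")
```

The practical difference is that 'static' names make repeated runs over the same content overwrite or reuse one file, while 'random' names always produce a new one.
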
Gsheets
-------

The following table lists all configuration options for this component:

.. list-table:: Configuration Options
   :header-rows: 1
   :widths: 25 20 55

   * - **Key**
     - **Default**
     - **Description**
   * - sheet
     - None
     - name of the sheet to archive
   * - sheet_id
     - None
     - (alternative to sheet name) the id of the sheet to archive
   * - header
     - 1
     - index of the header row (starts at 1)
   * - service_account
     - secrets/service_account.json
     - service account JSON file path
   * - columns
     - {'url': 'link', 'status': 'archive status', 'folder': 'destination folder', 'archive': 'archive location', 'date': 'archive date', 'thumbnail': 'thumbnail', 'timestamp': 'upload timestamp', 'title': 'upload title', 'text': 'text content', 'screenshot': 'screenshot', 'hash': 'hash', 'pdq_hash': 'perceptual hashes', 'wacz': 'wacz', 'replaywebpage': 'replaywebpage'}
     - names of columns in the google sheet (stringified JSON object)