Configuration

For first-time users, the recommended way to configure Auto Archiver is to run it once and let it auto-generate a default configuration for you. Then, if needed, you can edit the configuration using one of the following methods.

1. Configuration file

The configuration file is typically called orchestration.yaml and stored in the secrets folder on your desktop. The configuration file contains all the settings for your entire Auto Archiver workflow in one easy-to-find place.

If you want to have Auto Archiver run with the recommended 'basic' setup,

Advanced Configuration

The orchestration file is split into two parts: steps (which modules to run at each stage) and configurations (settings for individual modules).

A default orchestration.yaml will be created for you the first time you run auto-archiver (without any arguments). Here's what it looks like:

View example orchestration.yaml
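The two-part layout described above can be pictured with a minimal sketch. This is not the generated default file. It assumes that module settings sit at the top level of the file, keyed by module name (mirroring the module_name.config_value flag format), and every value shown is a placeholder. The module names and option names are taken from elsewhere on this page.

```yaml
# Illustrative sketch only; the file generated on first run will differ.

# Part 1: steps - which modules run at each stage of the workflow.
steps:
  extractors:
    - instagram_extractor
  enrichers:
    - wayback_extractor_enricher
  # feeders, databases, storages and formatters are assumed to be
  # listed in the same way.

# Part 2: configurations - settings for individual modules.
# Assumption: each module's settings live under a key named after the module.
wayback_extractor_enricher:
  timeout: 15                 # placeholder value, in seconds
  key: "YOUR_WAYBACK_KEY"     # placeholder
  secret: "YOUR_WAYBACK_SECRET"  # placeholder
```

Keeping the list of steps separate from the per-module settings means a module can be switched on or off without deleting its settings.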

2. Command Line configuration

You can run auto-archiver directly from the command line, without a configuration file. Command line arguments are parsed using the format module_name.config_value. For example, a config value of api_key in the instagram_extractor module would be passed on the command line with the flag --instagram_extractor.api_key=API_KEY.

The command line arguments are useful for testing or editing config values and enabling/disabling modules on the fly. When you are happy with your settings, you can store them back in your configuration file by passing the -s/--store flag on the command line.

auto-archiver --instagram_extractor.api_key=123 --other_module.setting --store
# will store the new settings into the configuration file (default: orchestration.yaml)
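As a rough illustration of how dotted flags relate to the stored file, the sketch below shows what the configuration might contain after a run with --store. It assumes each module_name.config_value flag maps onto a nested YAML key of the same name. The exact quoting and value types written by auto-archiver may differ, and the valueless --other_module.setting flag is left out because its stored form is not described here.

```yaml
# Hypothetical result of:
#   auto-archiver --instagram_extractor.api_key=123 --logging.level=DEBUG --store
# Assumption: the part of the flag before the dot becomes the top-level key.
instagram_extractor:
  api_key: 123      # from --instagram_extractor.api_key=123
logging:
  level: DEBUG      # from --logging.level=DEBUG
```

Reading the flag name from left to right gives the path to the same setting in the file, so a value tested on the command line can be found and edited by hand later.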

Seeing all Configuration Options

View the configurable settings for the core modules on the individual doc pages for each module. You can also view all settings available for the modules installed on your system using the --help flag in auto-archiver.

Example output when using the --help flag with auto-archiver:
$ auto-archiver --help
...
Positional Arguments:
  urls                  URL(s) to archive, either a single URL or a list of urls, should not come from config.yaml

Options:
  --help, -h            show a full help message and exit
  --version             show program's version number and exit
  --config CONFIG_FILE  the filename of the YAML configuration file (defaults to 'config.yaml')
  --mode {simple,full}  the mode to run the archiver in
  -s, --store, --no-store
                        Store the created config in the config file
  --module_paths MODULE_PATHS [MODULE_PATHS ...]
                        additional paths to search for modules
  --feeders STEPS.FEEDERS [STEPS.FEEDERS ...]
                        the feeders to use
  --enrichers STEPS.ENRICHERS [STEPS.ENRICHERS ...]
                        the enrichers to use
  --extractors STEPS.EXTRACTORS [STEPS.EXTRACTORS ...]
                        the extractors to use
  --databases STEPS.DATABASES [STEPS.DATABASES ...]
                        the databases to use
  --storages STEPS.STORAGES [STEPS.STORAGES ...]
                        the storages to use
  --formatters STEPS.FORMATTERS [STEPS.FORMATTERS ...]
                        the formatter to use
  --authentication AUTHENTICATION
                        A dictionary of sites and their authentication methods (token, username etc.) that extractors can use to log into a website. If passing this on the command line, use a JSON string. You may
                        also pass a path to a valid JSON/YAML file which will be parsed.
  --logging.level {INFO,DEBUG,ERROR,WARNING}
                        the logging level to use
  --logging.file LOGGING.FILE
                        the logging file to write to
  --logging.rotation LOGGING.ROTATION
                        the logging rotation to use

Wayback Machine Enricher:
  Submits the current URL to the Wayback Machine for archiving and returns either a job ID or the...

  --wayback_extractor_enricher.timeout TIMEOUT
                        seconds to wait for successful archive confirmation from wayback, if more than this passes the result contains the job_id so the status can later be checked manually.
  --wayback_extractor_enricher.if_not_archived_within IF_NOT_ARCHIVED_WITHIN
                        only tell wayback to archive if no archive is available before the number of seconds specified, use None to ignore this option. For more information:
                        https://docs.google.com/document/d/1Nsv52MvSjbLb2PCpHlat0gkzw0EvtSgpKHu4mk0MnrA
  --wayback_extractor_enricher.key KEY
                        wayback API key. to get credentials visit https://archive.org/account/s3.php
  --wayback_extractor_enricher.secret SECRET
                        wayback API secret. to get credentials visit https://archive.org/account/s3.php
  --wayback_extractor_enricher.proxy_http PROXY_HTTP
                        http proxy to use for wayback requests, eg http://proxy-user:password@proxy-ip:port
  --wayback_extractor_enricher.proxy_https PROXY_HTTPS
                        https proxy to use for wayback requests, eg https://proxy-user:password@proxy-ip:port