cleanups and docs

2023-02-08 22:13:19 +00:00 · 2023-02-08 22:13:19 +00:00 · 2a7ece5dcc
commit 2a7ece5dcc
--- a/.github/workflows/python-publish.yaml
+++ b/.github/workflows/python-publish.yaml
@ -6,7 +6,7 @@
 # separate terms of service, privacy policy, and support
 # documentation.

-name: Upload Python Package
+name: Pypi

 on:
  release:
@ -20,7 +20,7 @@ permissions:

 jobs:
  deploy:
-
+    name: Publish python package
    runs-on: ubuntu-latest

    steps:
--- a/README.md
+++ b/README.md
@ -4,8 +4,191 @@ Read the [article about Auto Archiver on bellingcat.com](https://www.bellingcat.

 Python tool to automatically archive social media posts, videos, and images from a Google Sheets, the console, and more. Uses different archivers depending on the platform, and can save content to local storage, S3 bucket (Digital Ocean Spaces, AWS, ...), and Google Drive. If using Google Sheets as the source for links, it will be updated with information about the archived content. It can be run manually or on an automated basis.

-There are 3 ways to use the auto-archiver
-1. (simplest) via docker `docker ... TODO`
-2. (pypi) `pip install auto-archiver`
-3. (legacy) clone and manually install from repo (see legacy [tutorial video](https://youtu.be/VfAhcuV2tLQ))
+There are 3 ways to use the auto-archiver:
+1. (easiest installation) via docker
+2. (local python install) `pip install auto-archiver`
+3. (legacy/development) clone and manually install from repo (see legacy [tutorial video](https://youtu.be/VfAhcuV2tLQ))

+But **you always need a configuration/orchestration file**, which is where you'll configure where/what/how to archive. Make sure you read [orchestration](#orchestration).
+
+
+## How to run the auto-archiver
+
+### Option 1 - docker
+Docker works like a virtual machine running inside your computer: it isolates everything and makes installation simple. Because it is an isolated environment, whenever you need to pass it your orchestration file or get downloaded media out of docker, you must connect folders on your machine to folders inside docker with the `-v` volume flag.
+
+
+1. install [docker](https://docs.docker.com/get-docker/)
+2. pull the auto-archiver docker [image](https://hub.docker.com/r/bellingcat/auto-archiver) with `docker pull bellingcat/auto-archiver`
+3. run the docker image locally in a container: `docker run --rm -v $PWD/secrets:/app/secrets -v $PWD/local_archive:/app/local_archive bellingcat/auto-archiver -m auto_archiver --config secrets/orchestration.yaml`. Breaking this command down:
+   1. `docker run` tells docker to start a new container (an instance of the image)
+   2. `--rm` makes sure this container is removed after execution (less garbage locally)
+   3. `-v $PWD/secrets:/app/secrets` - your secrets folder
+      1. `-v` is a volume flag which means a folder that you have on your computer will be connected to a folder inside the docker container
+      2. `$PWD/secrets` points to a `secrets/` folder in your current working directory (where your console points to), we use this folder as a best practice to hold all the secrets/tokens/passwords/... you use
+      3. `/app/secrets` is the path inside the docker container where that folder will be available
+   4.  `-v $PWD/local_archive:/app/local_archive` - (optional) if you use local_storage
+       1.  `-v` same as above, this is a volume instruction
+       2.  `$PWD/local_archive` is a folder `local_archive/` in case you want to archive locally and have the files accessible outside docker
+       3.  `/app/local_archive` is a folder inside docker that you can reference in your `orchestration.yaml` file
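The breakdown above can also be sketched as a small Python helper that assembles the same `docker run` arguments. This is only an illustration: `docker_run_command` and the `/home/me/archiving` folder are hypothetical names, and the image name, flags, and container paths are copied from the command above.

```python
import shlex
from pathlib import PurePath, PurePosixPath
from typing import List


def docker_run_command(workdir: PurePath, use_local_archive: bool = True) -> List[str]:
    """Build the argument list for the `docker run` command described above."""
    command = ["docker", "run", "--rm"]
    # secrets folder on the host -> /app/secrets inside the container
    command += ["-v", f"{workdir / 'secrets'}:/app/secrets"]
    if use_local_archive:
        # optional: only needed when archiving to local storage
        command += ["-v", f"{workdir / 'local_archive'}:/app/local_archive"]
    command += ["bellingcat/auto-archiver", "-m", "auto_archiver",
                "--config", "secrets/orchestration.yaml"]
    return command


if __name__ == "__main__":
    # hypothetical working directory, replace with your own
    print(shlex.join(docker_run_command(PurePosixPath("/home/me/archiving"))))
```

Passing `use_local_archive=False` drops the second `-v` mapping, which matches the "(optional)" note on the `local_archive` volume.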
+
+
+### Option 2 - python package
+1. make sure you have python 3.8 or higher installed
+2. install the package `pip/pipenv/conda install auto-archiver`
+3. test it's installed with `auto-archiver --help`
+4. run it with your orchestration file, passing any flags you want on the command line: `auto-archiver --config secrets/orchestration.yaml`
+   1. this example assumes your orchestration file is inside a `secrets/` folder, which we advise
+
+
+### Option 3 - local installation
+This can also be used for development.
+
+<details><summary><code>Legacy instructions, only use if docker/package is not an option</code></summary>
+
+
+Install the following locally:
+1. [ffmpeg](https://www.ffmpeg.org/) must also be installed locally for this tool to work. 
+2. [firefox](https://www.mozilla.org/en-US/firefox/new/) and [geckodriver](https://github.com/mozilla/geckodriver/releases) on a path folder like `/usr/local/bin`. 
+3. [fonts-noto](https://fonts.google.com/noto) to deal with multiple unicode characters during selenium/geckodriver's screenshots: `sudo apt install fonts-noto -y`. 
+
+Clone and run:
+1. `git clone https://github.com/bellingcat/auto-archiver`
+2. `pipenv install`
+3. `pipenv run python -m src.auto_archiver --config secrets/orchestration.yaml`
+
+
+</details><br/>
+
+
+
+
+
+### Examples
+
+
+# Orchestration
+The archiver's work is orchestrated by the following workflow (we call each part a **step**):
+1. **Feeder** gets the links (from a spreadsheet, from the console, ...)
+2. **Archiver** tries to archive the link (twitter, youtube, ...)
+3. **Enricher** adds more info to the content (hashes, thumbnails, ...)
+4. **Formatter** creates a report from all the archived content (HTML, PDF, ...)
+5. **Database** knows what's been archived and also stores the archive result (spreadsheet, CSV, or just the console)
+
+To see all the available steps (archivers, storages, databases, ...), check the [example.orchestration.yaml](example.orchestration.yaml).
+
+You configure the whole workflow in your `orchestration.yaml` file. We advise you to put it in a `secrets/` folder and not share it with others, because it will contain passwords and other secrets.
+
+The orchestration file is split into 2 parts: `steps` (which **steps** to use) and `configurations` (how those steps should behave). Here's a simplification:
+```yaml
+# orchestration.yaml content
+steps:
+  feeder: gsheet_feeder
+  archivers: # order matters
+    - youtubedl_archiver
+  enrichers:
+    - thumbnail_enricher
+  formatter: html_formatter
+  storages:
+    - local_storage
+  databases:
+    - gsheet_db
+
+configurations:
+  gsheet_feeder:
+    sheet: "your google sheet name"
+    header: 2 # row with header for your sheet
+  # ... configurations for the other steps here ...
+```
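As a rough sketch of how the two parts relate, the snippet below walks a trimmed copy of the structure above (written as a plain Python dict rather than loaded from YAML) and lists which enabled steps have no entry under `configurations`. The helpers `enabled_steps` and `unconfigured_steps` are hypothetical and not part of auto-archiver.

```python
from typing import Any, Dict, List

# Trimmed copy of the simplified orchestration file, as a plain dict.
orchestration: Dict[str, Any] = {
    "steps": {
        "feeder": "gsheet_feeder",
        "enrichers": ["thumbnail_enricher"],
        "formatter": "html_formatter",
        "storages": ["local_storage"],
        "databases": ["gsheet_db"],
    },
    "configurations": {
        "gsheet_feeder": {"sheet": "your google sheet name", "header": 2},
    },
}


def enabled_steps(config: Dict[str, Any]) -> List[str]:
    """Flatten the `steps` section into a list of step names, in file order."""
    names: List[str] = []
    for value in config.get("steps", {}).values():
        # single-valued steps (feeder, formatter) are strings, the rest are lists
        names.extend([value] if isinstance(value, str) else value)
    return names


def unconfigured_steps(config: Dict[str, Any]) -> List[str]:
    """Enabled steps that have no entry under `configurations`."""
    configured = config.get("configurations", {})
    return [name for name in enabled_steps(config) if name not in configured]
```

A step without its own configuration entry is not necessarily an error, since steps may have defaults; the check is only a quick way to see what you have not configured yet.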
+
+All the `configurations` in the `orchestration.yaml` file (you can name it differently, but then you need to pass it with the `--config FILENAME` argument) can be seen in the console by using the `--help` flag. They can also be overridden on the command line. For example, if you are using the `cli_feeder` to archive from the command line and want to provide the URLs, run:
+
+```bash
+auto-archiver --config orchestration.yaml --cli_feeder.urls="url1,url2,url3"
+```
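To illustrate the idea of dotted overrides, here is a minimal sketch of how `--step.option=value` flags could be merged on top of the `configurations` section. It is not the tool's actual parser: `apply_overrides` is a hypothetical helper, and parsing values as JSON with a plain-string fallback is an assumption made for the example.

```python
import json
from typing import Any, Dict, List


def apply_overrides(configurations: Dict[str, Dict[str, Any]],
                    argv: List[str]) -> Dict[str, Dict[str, Any]]:
    """Return a copy of `configurations` with `--step.option=value` flags applied."""
    merged = {step: dict(options) for step, options in configurations.items()}
    for arg in argv:
        if not arg.startswith("--") or "=" not in arg or "." not in arg.split("=", 1)[0]:
            continue  # not a dotted override (e.g. --config), leave it alone
        key, raw = arg[2:].split("=", 1)
        step, option = key.split(".", 1)
        try:
            value = json.loads(raw)  # numbers, booleans, JSON dictionaries
        except json.JSONDecodeError:
            value = raw  # plain strings such as "url1,url2,url3"
        current = merged.setdefault(step, {}).get(option)
        if isinstance(current, dict) and isinstance(value, dict):
            value = {**current, **value}  # override only the provided keys
        merged[step][option] = value
    return merged
```

For example, `apply_overrides({}, ["--cli_feeder.urls=url1,url2,url3"])` yields a `cli_feeder` entry whose `urls` option holds the raw string; splitting it into individual URLs is left to the step itself in this sketch.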
+
+Here's the complete workflow that the auto-archiver goes through:
+```mermaid
+graph TD
+    s((start)) --> F(fa:fa-table Feeder)
+    F -->|get and clean URL| D1{fa:fa-database Database}
+    D1 -->|is already archived| e((end))
+    D1 -->|not yet archived| a(fa:fa-download Archivers)
+    a -->|got media| E(fa:fa-chart-line Enrichers)
+    E --> S[fa:fa-box-archive Storages]
+    E --> Fo(fa:fa-code Formatter)
+    Fo --> S
+    Fo -->|update database| D2(fa:fa-database Database)
+    D2 --> e
+```
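The diagram above can be read as the following loop. This is a simplified sketch, not the orchestrator's code: every name in it is a hypothetical stand-in, a plain dict plays the role of both storage and database, and "the first archiver that returns a result wins" is an assumption about what "order matters" means.

```python
from typing import Callable, Dict, Iterable, List, Optional

# Hypothetical stand-ins for the real step classes.
Archiver = Callable[[str], Optional[dict]]  # result dict, or None if it got no media
Enricher = Callable[[dict], None]           # adds info to the result in place


def run_workflow(urls: Iterable[str], database: Dict[str, dict],
                 archivers: List[Archiver], enrichers: List[Enricher]) -> List[str]:
    """Process each URL once and return the URLs archived during this run."""
    newly_archived = []
    for url in urls:                  # Feeder: get and clean the URL
        url = url.strip()
        if url in database:           # Database: already archived -> end
            continue
        result = None
        for archiver in archivers:    # Archivers: tried in the configured order
            result = archiver(url)
            if result is not None:
                break
        if result is None:
            continue                  # no media, nothing to enrich or store
        for enricher in enrichers:    # Enrichers: hashes, thumbnails, ...
            enricher(result)
        result["report"] = f"report for {url}"  # Formatter stand-in
        database[url] = result        # Storages + Database update stand-in
        newly_archived.append(url)
    return newly_archived
```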
+
+## Orchestration checklist
+Use this checklist to make sure you did all the required steps:
+* [ ] you have a `secrets/` folder with all your configuration files including
+  * [ ] an orchestration file, eg: `orchestration.yaml`, pointing to the correct location of other files
+  * [ ] (optional if you use GoogleSheets) you have a `service_account.json` (see [how-to](https://gspread.readthedocs.io/en/latest/oauth2.html#for-bots-using-service-account))
+  * [ ] (optional for telegram) an `anon.session` file, which appears after the 1st run where you log in to telegram
+    * if you use private channels you need to add `channel_invites` and set `join_channels=true` at least once
+  * [ ] (optional for VK) a `vk_config.v2.json`
+  * [ ] (optional for using GoogleDrive storage) `gd-token.json` (see [help script](scripts/create_update_gdrive_oauth_token.py))
+  * [ ] (optional for instagram) an `instaloader.session` file, which appears after the 1st run where you log in to instagram
+  * [ ] (optional for browsertrix) `profile.tar.gz` file
+
+#### Example invocations
+These assume you've installed the python package (Option 2); see the docker section above for how to run through docker.
+
+```bash
+# all the configurations come from ./orchestration.yaml
+auto-archiver
+# all the configurations come from ./secrets/orchestration.yaml
+auto-archiver --config secrets/orchestration.yaml
+# uses the configurations but for another google sheets document
+# with a header on row 2 and with some different column names
+# notice that columns is a dictionary so you need to pass it as JSON and it will override only the values provided
+auto-archiver --config orchestration.yaml --gsheet_feeder.sheet="use it on another sheets doc" --gsheet_feeder.header=2 --gsheet_feeder.columns='{"url": "link"}'
+# all the configurations come from ./orchestration.yaml, except that s3 files are set to private
+auto-archiver --s3_storage.private=1
+```
+
+### Extra notes on configuration
+#### Google Drive
+To use Google Drive storage you need to put the id of the shared folder in the `config.yaml` file. The folder must be shared with the service account, eg `autoarchiverservice@auto-archiver-111111.iam.gserviceaccount.com`. You can then use `--storage=gd`.
+
+#### Telethon (Telegram's API Library)
+The first time you run, you will be prompted to authenticate with the associated phone number. Alternatively, you can put your `anon.session` file in the root folder.
+
+
+## Running on Google Sheets Feeder (gsheet_feeder)
+The `--gsheet_feeder.sheet` property is the name of the Google Sheet to check for URLs.
+This sheet must have been shared with the Google Service account used by `gspread`. 
+This sheet must also have specific columns (case-insensitive) in the `header` row - see [Gsheet.configs](src/auto_archiver/utils/gsheet.py) for all their names.
+
+For example, for use with this spreadsheet:
+
+![A screenshot of a Google Spreadsheet with column headers defined as above, and several Youtube and Twitter URLs in the "Media URL" column](docs/demo-before.png)
+
+When the auto archiver starts running, it updates the "Archive status" column.
+![A screenshot of a Google Spreadsheet with column headers defined as above, and several Youtube and Twitter URLs in the "Media URL" column. The auto archiver has added "archive in progress" to one of the status columns.](docs/demo-progress.png)
+The links are downloaded and archived, and the spreadsheet is updated to the following:
+![A screenshot of a Google Spreadsheet with videos archived and metadata added per the description of the columns above.](docs/demo-after.png)
+Note that the first row is skipped, as it is assumed to be a header row (`--gsheet_feeder.header=1`; you can change it if you use more rows above the header). Rows with an empty URL column, or a non-empty archive column, are also skipped. All sheets in the document will be checked.
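The skipping rules can be sketched as a small filter over the rows of one worksheet. This is an illustration only: `rows_to_archive` is a hypothetical helper, and the column names `link` and `archive location` are assumed for the example (the real names are configurable).

```python
from typing import Dict, List


def rows_to_archive(rows: List[List[str]], header: int = 1,
                    url_column: str = "link",
                    archive_column: str = "archive location") -> List[Dict[str, str]]:
    """Return the data rows that still need archiving, keyed by lower-cased header."""
    # `header` is 1-indexed, and header names are matched case-insensitively
    names = [cell.strip().lower() for cell in rows[header - 1]]
    pending = []
    for cells in rows[header:]:  # rows up to and including the header are skipped
        row = dict(zip(names, cells))
        if not row.get(url_column, "").strip():
            continue  # empty URL column
        if row.get(archive_column, "").strip():
            continue  # already has an archive entry
        pending.append(row)
    return pending
```

With `header=2`, the first two rows are treated as non-data rows and only the rows below them are considered.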
+
+
+---
+## Development
+Use `python -m src.auto_archiver --config secrets/orchestration.yaml` to run from the local development environment.
+
+# Docker development
+* working with docker locally:
+  * `docker build . -t auto-archiver` to build a local image
+  * `docker run --rm -v $PWD/secrets:/app/secrets auto-archiver --config secrets/orchestration.yaml` to run the image you just built
+    * to use local archive, also create a volume `-v` for it by adding `-v $PWD/local_archive:/app/local_archive`
+* release to docker hub
+  * `docker image tag auto-archiver bellingcat/auto-archiver:latest`
+  * `docker push bellingcat/auto-archiver`
+
+# RELEASE
+* update version in [version.py](src/auto_archiver/version.py)
+* run `bash ./scripts/release.sh` and confirm
+* package is automatically updated in pypi
+* docker image is automatically pushed to dockerhub
--- a/example.config.yaml
+++ b/example.config.yaml
@ -1,143 +0,0 @@
---
-secrets:
-  # needed if you use storage=s3
-  s3:
-    # contains S3 info on region, bucket, key and secret
-    region: reg1
-    bucket: my-bucket
-    key: "s3 API key"
-    secret: "s3 API secret"
-    # use region format like such
-    endpoint_url: "https://{region}.digitaloceanspaces.com"
-    # endpoint_url: "https://s3.{region}.amazonaws.com"
-    #use bucket, region, and key (key is the archived file path generated when executing) format like such as:
-    cdn_url: "https://{bucket}.{region}.cdn.digitaloceanspaces.com/{key}"
-    # if private:true S3 urls will not be readable online
-    private: false
-    # with 'random' you can generate a random UUID for the URL instead of a predictable path, useful to still have public but unlisted files, alternative is 'default' or not omitted from config
-    key_path: random
-
-  # needed if you use storage=gd
-  google_drive:
-    # To authenticate with google you have two options (1. service account OR 2. OAuth token)
-
-    # 1. service account - storage space will count towards the developer account
-    # filename can be the same or different file from google_sheets.service_account, defaults to "service_account.json"
-    # service_account: "service_account.json"
-
-    # 2. OAuth token  - storage space will count towards the owner of the GDrive folder
-    # (only 1. or 2. - if both specified then this 2. takes precedence)
-    # needs write access on the server so refresh flow works
-    # To get the token, run the file `create_update_test_oauth_token.py`
-    # you can edit that file if you want a different token filename, default is "gd-token.json"
-    oauth_token_filename: "gd-token.json"
-
-    root_folder_id: copy XXXX from https://drive.google.com/drive/folders/XXXX
-
-  # needed if you use storage=local
-  local:
-    # local path to save files in
-    save_to: "./local_archive"
-
-  wayback:
-    # to get credentials visit https://archive.org/account/s3.php
-    key: your API key
-    secret: your API secret
-
-  telegram:
-    # to get credentials see: https://telegra.ph/How-to-get-Telegram-APP-ID--API-HASH-05-27
-    api_id: your API key, see
-    api_hash: your API hash
-    # optional, but allows access to more content such as large videos, talk to @botfather
-    bot_token: your bot-token
-    # optional, defaults to ./anon, records the telegram login session for future usage
-    session_file: "secrets/anon"
-
-  # twitter configuration - API V2 only
-  # if you don't provide credentials the less-effective unofficial TwitterArchiver will be used instead
-  twitter:
-    # either bearer_token only
-    bearer_token: ""
-    # OR all of the below
-    consumer_key: ""
-    consumer_secret: ""
-    access_token: ""
-    access_secret: ""
-
-  # vkontakte (vk.com) credentials
-  vk:
-    username: "phone number or email"
-    password: "password"
-    # optional, defaults to ./vk_config.v2.json, records VK login session for future usage
-    session_file: "secrets/vk_config.v2.json"
-
-  # instagram  credentials
-  instagram:
-    username: "username"
-    password: "password"
-    session_file: "instaloader.session" # <- default value
-
-  google_sheets:
-    # local filename: defaults to service_account.json, see https://gspread.readthedocs.io/en/latest/oauth2.html#for-bots-using-service-account
-    service_account: "service_account.json"
-
-  facebook:
-    # optional facebook cookie to have more access to content, from browser, looks like 'cookie: datr= xxxx'
-    cookie: ""
-execution:
-  # can be overwritten with CMD --sheet=
-  sheet: your-sheet-name
-
-  # block or allow worksheets by name, instead of defaulting to checking all worksheets in a Spreadsheet
-  # worksheet_allow and worksheet_block can be single values or lists
-  # if worksheet_allow is specified, worksheet_block is ignored
-  # worksheet_allow:
-  #   - Sheet1
-  #   - "Sheet 2"
-  # worksheet_block: BlockedSheet
-
-  # which row of your tabs contains the header, can be overwritten with CMD --header=
-  header: 1
-  # which storage to use, can be overwritten with CMD --storage=
-  storage: s3
-  # defaults to false, when true will try to avoid duplicate URL archives
-  check_if_exists: true
-
-  # choose a hash algorithm (either SHA-256 or SHA3-512, defaults to SHA-256)
-  # hash_algorithm: SHA-256
-
-  # optional configurations for the selenium browser that takes screenshots, these are the defaults
-  selenium:
-    # values under 10s might mean screenshots fail to grab screenshot
-    timeout_seconds: 120
-    window_width: 1400
-    window_height: 2000
-
-  # optional browsertrix configuration (for profile generation see https://github.com/webrecorder/browsertrix-crawler#creating-and-using-browser-profiles)
-  # browsertrix will capture a WACZ archive of the page which can then be seen as the original on replaywebpage
-  browsertrix:
-    enabled: true # defaults to false
-    profile: "./browsertrix/crawls/profile.tar.gz"
-    timeout_seconds: 120 # defaults to 90s
-  # puts execution logs into /logs folder, defaults to false
-  save_logs: true
-  # custom column names, only needed if different from default, can be overwritten with CMD --col-NAME="VALUE"
-  # url and status are the only columns required to be present in the google sheet
-  column_names:
-    url: link
-    status: archive status
-    archive: archive location
-    # use this column to override default location data
-    folder: folder
-    date: archive date
-    thumbnail: thumbnail
-    thumbnail_index: thumbnail index
-    timestamp: upload timestamp
-    title: upload title
-    duration: duration
-    screenshot: screenshot
-    hash: hash
-    wacz: wacz
-    # if you want the replaypage to work, make sure to allow CORS on your bucket, see https://replayweb.page/docs/embedding#cors-restrictions
-    replaywebpage: replaywebpage
-
--- a/example.orchestration.yaml
+++ b/example.orchestration.yaml
@ -26,8 +26,6 @@ steps:


 configurations:
-  global:
-    - save_logs: False
  gsheet_feeder:
    sheet: my-auto-archiver
    header: 2 # defaults to 1 in GSheetsFeeder
--- a/src/auto_archiver/core/config.py
+++ b/src/auto_archiver/core/config.py
@ -51,7 +51,7 @@ class Config:
                epilog="Check the code at https://github.com/bellingcat/auto-archiver"
            )

-            parser.add_argument('--config', action='store', dest='config', help='the filename of the YAML configuration file (defaults to \'config.yaml\')', default='config.yaml')
+            parser.add_argument('--config', action='store', dest='config', help='the filename of the YAML configuration file (defaults to \'orchestration.yaml\')', default='orchestration.yaml')

        for configurable in self.configurable_parents:
            child: Step
--- a/src/auto_archiver/core/orchestrator.py
+++ b/src/auto_archiver/core/orchestrator.py
@ -31,7 +31,6 @@ class ArchivingOrchestrator:
            self.feed_item(item)

    def feed_item(self, item: Metadata) -> Metadata:
-        print("ARCHIVING", item)
        try:
            with tempfile.TemporaryDirectory(dir="./") as tmp_dir:
                item.set_tmp_dir(tmp_dir)