Mirror of https://github.com/bellingcat/auto-archiver
Docs tidy ups and re-organising
parent 5b481f72ab
commit d28d99daa6

@ -0,0 +1,49 @@
# Contributing to Auto Archiver

Thank you for your interest in contributing to Auto Archiver! Your contributions help improve the project and make it more useful for everyone. Please follow the guidelines below to ensure a smooth collaboration.

### 1. Reporting a Bug

If you encounter a bug, please create an issue on GitHub with the following details:

* Describe the bug: Provide a clear and concise description of the issue.
* Steps to reproduce: Include the steps needed to reproduce the bug.
* Expected behavior: Describe what you expected to happen.
* Actual behavior: Explain what actually happened.
* Screenshots/logs: If applicable, attach screenshots or logs to help diagnose the problem.
* Environment: Mention the OS, Python version, and any other relevant details.

### 2. Writing a Patch/Fix and Submitting Pull Requests

If you’d like to fix a bug or improve existing code:

1. Open a pull request on GitHub and link it to the relevant issue.
2. Make sure to document your pull request with a clear description of what changes were made and why.
3. Wait for review and make any requested changes.

### 3. Creating New Modules

If you want to add a new module to Auto Archiver:

1. Ensure your module follows the existing [coding style and project structure](https://auto-archiver.readthedocs.io/en/development/creating_modules.html).
2. Write clear documentation explaining what your module does and how to use it.
3. Ideally, include unit tests for your module!
4. Follow the steps in Section 2 to submit a pull request.

### 4. Do You Have Questions About the Source Code?

If you have any questions about how the source code works or need help using Auto Archiver:

📝 Check the [Auto Archiver](https://auto-archiver.readthedocs.io/en/latest/) documentation.

👉 Ask your questions in the [Bellingcat Discord](https://www.bellingcat.com/follow-bellingcat-on-social-media/).

### 5. Do You Want to Contribute to the Documentation?

We welcome contributions to the documentation!

📖 Please read [Contributing to the Auto Archiver Documentation](https://auto-archiver.readthedocs.io/en/development/docs.html) to learn how you can help improve the project's documentation.

------------------

Thank you for contributing to Auto Archiver! 🚀

README.md
@ -9,113 +9,29 @@

<!-- [](https://vk-url-scraper.readthedocs.io/en/latest/?badge=latest) -->

Auto Archiver is a Python tool to automatically archive content on the web. It takes URLs from different sources (e.g. a CSV file, Google Sheets, the command line) and archives the content of each one. It can archive social media posts, videos, images and webpages. Content can be enriched, then saved either locally or remotely (S3 bucket, Google Drive). The status of the archiving process can be appended to a CSV report, or if using Google Sheets – back to the original sheet. It can be run manually or on an automated basis.

<div class="hidden_rtd">

**[See the Auto Archiver documentation for more information.](https://auto-archiver.readthedocs.io/en/latest/)**

</div>

Read the [article about Auto Archiver on bellingcat.com](https://www.bellingcat.com/resources/2022/09/22/preserve-vital-online-content-with-bellingcats-auto-archiver-tool/).

## Installation

View the [Installation Guide](installation/installation.md) for full instructions.

To get started quickly using Docker:

`docker pull bellingcat/auto-archiver && docker run bellingcat/auto-archiver`

Or with pip:

`pip install auto-archiver && auto-archiver --help`

## Contributing

We welcome contributions to the Auto Archiver project! See the [Contributing Guide](https://auto-archiver.readthedocs.io/en/contributing.html) for how to get involved!

## Orchestration

The archiving work is orchestrated by the following workflow (we call each a **step**):

1. **Feeder** gets the links (from a spreadsheet, from the console, ...)
2. **Archiver** tries to archive the link (twitter, youtube, ...)
3. **Enricher** adds more info to the content (hashes, thumbnails, ...)
4. **Formatter** creates a report from all the archived content (HTML, PDF, ...)
5. **Database** knows what's been archived and also stores the archive result (spreadsheet, CSV, or just the console)

To set up an auto-archiver instance, create an `orchestration.yaml` containing the workflow you would like. We advise you to put this file into a `secrets/` folder and not share it with others, because it will contain passwords and other secrets.

The structure of the orchestration file is split into 2 parts: `steps` (what **steps** to use) and `configurations` (how those steps should behave). Here's a simplification:

```yaml
# orchestration.yaml content
steps:
  feeder: gsheet_feeder
  archivers: # order matters
    - youtubedl_archiver
  enrichers:
    - thumbnail_enricher
  formatter: html_formatter
  storages:
    - local_storage
  databases:
    - gsheet_db

configurations:
  gsheet_feeder:
    sheet: "your google sheet name"
    header: 2 # row with header for your sheet
  # ... configurations for the other steps here ...
```
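As a quick sanity check, a config in the shape above can be validated before a run. The sketch below is our own illustration (the helper names are not part of auto-archiver's API), assuming the step names shown in the example:

```python
# Illustrative sketch: check that an orchestration config defines at least one
# module for every step. Helper and variable names are our own, not
# auto-archiver's real API.
REQUIRED_STEPS = ["feeder", "archivers", "enrichers", "formatter", "storages", "databases"]

def missing_steps(config: dict) -> list:
    """Return the names of steps that have no module configured."""
    steps = config.get("steps", {})
    return [s for s in REQUIRED_STEPS if not steps.get(s)]

example = {
    "steps": {
        "feeder": "gsheet_feeder",
        "archivers": ["youtubedl_archiver"],
        "enrichers": ["thumbnail_enricher"],
        "formatter": "html_formatter",
        "storages": ["local_storage"],
        "databases": ["gsheet_db"],
    }
}
print(missing_steps(example))  # -> []
```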

To see all available `steps` (archivers, storages, databases, ...), check the [example.orchestration.yaml](example.orchestration.yaml).

All the `configurations` in the `orchestration.yaml` file (you can name it differently, but then you need to pass it with the `--config FILENAME` argument) can be seen in the console by using the `--help` flag. They can also be overwritten; for example, if you are using the `cli_feeder` to archive from the command line and want to provide the URLs, you should do:

```bash
auto-archiver --config secrets/orchestration.yaml --cli_feeder.urls="url1,url2,url3"
```

Here's the complete workflow that the auto-archiver goes through:
```mermaid
graph TD
    s((start)) --> F(fa:fa-table Feeder)
    F -->|get and clean URL| D1{fa:fa-database Database}
    D1 -->|is already archived| e((end))
    D1 -->|not yet archived| a(fa:fa-download Archivers)
    a -->|got media| E(fa:fa-chart-line Enrichers)
    E --> S[fa:fa-box-archive Storages]
    E --> Fo(fa:fa-code Formatter)
    Fo --> S
    Fo -->|update database| D2(fa:fa-database Database)
    D2 --> e
```
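The diagram above can also be read as a simple loop. The following toy sketch shows the control flow; all function names are illustrative stand-ins, not auto-archiver's real classes:

```python
# Toy sketch of the workflow in the diagram above. All function names are
# illustrative stand-ins, not auto-archiver's real API.
def run_workflow(urls, database, archive_fn, enrich_fn, format_fn, store_fn):
    archived = []
    for url in urls:
        url = url.strip()                      # Feeder: get and clean URL
        if url in database:                    # Database: already archived -> end
            continue
        media = archive_fn(url)                # Archivers: try to fetch media
        if media is None:                      # nothing archivable -> end
            continue
        enriched = enrich_fn(media)            # Enrichers: hashes, thumbnails, ...
        report = format_fn(enriched)           # Formatter: build the report
        store_fn(enriched)                     # Storages: save media
        store_fn(report)                       # Storages: save report
        database[url] = report                 # Database: update with result
        archived.append(url)
    return archived
```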

## Orchestration checklist

Use this to make sure you have done all the required steps:

* [ ] you have a `secrets/` folder with all your configuration files, including
  * [ ] an orchestration file, e.g. `orchestration.yaml`, pointing to the correct location of other files
  * [ ] (optional, if you use Google Sheets) a `service_account.json` (see [how-to](https://gspread.readthedocs.io/en/latest/oauth2.html#for-bots-using-service-account))
  * [ ] (optional, for telegram) an `anon.session` which appears after the 1st run where you login to telegram
    * if you use private channels you need to add `channel_invites` and set `join_channels=true` at least once
  * [ ] (optional, for VK) a `vk_config.v2.json`
  * [ ] (optional, for Google Drive storage) a `gd-token.json` (see [help script](scripts/create_update_gdrive_oauth_token.py))
  * [ ] (optional, for instagram) an `instaloader.session` file which appears after the 1st run and login to instagram
  * [ ] (optional, for browsertrix) a `profile.tar.gz` file
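A quick way to review the checklist above is to test which of those files exist. A minimal sketch (the file names come from the checklist; the helper itself is our own illustration):

```python
# Minimal sketch: report which of the checklist's secrets files exist.
# File names come from the checklist above; the helper is our own illustration.
from pathlib import Path

CHECKLIST_FILES = [
    "orchestration.yaml",
    "service_account.json",   # optional: Google Sheets
    "anon.session",           # optional: telegram
    "vk_config.v2.json",      # optional: VK
    "gd-token.json",          # optional: Google Drive storage
    "instaloader.session",    # optional: instagram
    "profile.tar.gz",         # optional: browsertrix
]

def check_secrets(folder: str = "secrets") -> dict:
    """Map each expected file name to whether it exists in the given folder."""
    return {name: (Path(folder) / name).exists() for name in CHECKLIST_FILES}
```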

#### Example invocations

The recommended way to run the auto-archiver is through Docker. The invocations below will run the auto-archiver Docker image using a configuration file that you have specified:

```bash
# all the configurations come from ./secrets/orchestration.yaml
docker run --rm -v $PWD/secrets:/app/secrets -v $PWD/local_archive:/app/local_archive bellingcat/auto-archiver --config secrets/orchestration.yaml

# uses the same configurations but for another google docs sheet
# with a header on row 2 and with some different column names
# notice that columns is a dictionary so you need to pass it as JSON and it will override only the values provided
docker run --rm -v $PWD/secrets:/app/secrets -v $PWD/local_archive:/app/local_archive bellingcat/auto-archiver --config secrets/orchestration.yaml --gsheet_feeder.sheet="use it on another sheets doc" --gsheet_feeder.header=2 --gsheet_feeder.columns='{"url": "link"}'

# all the configurations come from orchestration.yaml and specify that s3 files should be private
docker run --rm -v $PWD/secrets:/app/secrets -v $PWD/local_archive:/app/local_archive bellingcat/auto-archiver --config secrets/orchestration.yaml --s3_storage.private=1
```

The auto-archiver can also be run locally, if prerequisites are correctly configured. Equivalent invocations are below.

```bash
# all the configurations come from ./secrets/orchestration.yaml
auto-archiver --config secrets/orchestration.yaml

# uses the same configurations but for another google docs sheet
# with a header on row 2 and with some different column names
# notice that columns is a dictionary so you need to pass it as JSON and it will override only the values provided
auto-archiver --config secrets/orchestration.yaml --gsheet_feeder.sheet="use it on another sheets doc" --gsheet_feeder.header=2 --gsheet_feeder.columns='{"url": "link"}'

# all the configurations come from orchestration.yaml and specify that s3 files should be private
auto-archiver --config secrets/orchestration.yaml --s3_storage.private=1
```

@ -0,0 +1,4 @@
.hidden_rtd {
    display: none;
}

@ -15,7 +15,7 @@ generate_module_docs()
# -- Project information -----------------------------------------------------
package_metadata = metadata("auto-archiver")
project = package_metadata["name"]
authors = "Bellingcat"
release = package_metadata["version"]
language = 'en'
@ -74,5 +74,6 @@ source_suffix = {

# -- Options for HTML output -------------------------------------------------
html_theme = 'sphinx_book_theme'
html_static_path = ["../_static"]
html_css_files = ["custom.css"]
@ -0,0 +1,2 @@
```{include} ../../CONTRIBUTING.md
```
@ -0,0 +1,30 @@
# Archiving Overview

The archiver archives web pages using the following workflow:

1. **Feeder** gets the links (from a spreadsheet, from the console, ...)
2. **Extractor** tries to extract content from the given link (e.g. videos from youtube, images from Twitter, ...)
3. **Enricher** adds more info to the content (hashes, thumbnails, ...)
4. **Formatter** creates a report from all the archived content (HTML, PDF, ...)
5. **Database** knows what's been archived and also stores the archive result (spreadsheet, CSV, or just the console)

Each step in the workflow is handled by 'modules' that interact with the data in different ways. For example, the Twitter Extractor Module extracts information from the Twitter website, and the Screenshot Enricher Module takes screenshots of the given page. See the [core modules page](core_modules.md) for an overview of all the modules that are available.

Auto-archiver must have at least one module defined for each step of the workflow. This is done by setting the [configuration](installation/configurations.md) for your auto-archiver instance.

Here's the complete workflow that the auto-archiver goes through:

```mermaid
graph TD
    s((start)) --> F(fa:fa-table Feeder)
    F -->|get and clean URL| D1{fa:fa-database Database}
    D1 -->|is already archived| e((end))
    D1 -->|not yet archived| a(fa:fa-download Archivers)
    a -->|got media| E(fa:fa-chart-line Enrichers)
    E --> S[fa:fa-box-archive Storages]
    E --> Fo(fa:fa-code Formatter)
    Fo --> S
    Fo -->|update database| D2(fa:fa-database Database)
    D2 --> e
```
@ -8,6 +8,7 @@
:caption: Contents:

Overview <self>
contributing
installation/installation.rst
core_modules.md
how_to
@ -3,6 +3,24 @@

This section of the documentation provides guidelines for configuring the tool.

## Configuring using a file

The recommended way to configure auto-archiver for long-term and deployed projects is a configuration file, typically called `orchestration.yaml`. This is a YAML file containing all the settings for your entire workflow.

A default `orchestration.yaml` will be created for you the first time you run auto-archiver (without any arguments). Here's what it looks like:

<details>
<summary>View example orchestration.yaml</summary>

```{literalinclude} ../example.orchestration.yaml
:language: yaml
:caption: orchestration.yaml
```

</details>

## Configuring from the Command Line

You can also run auto-archiver directly from the command line, without the need for a configuration file. Command line arguments are parsed using the format `module_name.config_value`. For example, a config value of `api_key` in the `instagram_extractor` module would be passed on the command line with the flag `--instagram_extractor.api_key=API_KEY`.
@ -14,23 +32,10 @@ auto-archiver --instagram_extractor.api_key=123 --other_module.setting --store
# will store the new settings into the configuration file (default: orchestration.yaml)
```

```{note} Arguments passed on the command line override those saved in your settings file. Save them to your config file using the -s or --store flag.
```

## Seeing all Configuration Options

View the configurable settings for the core modules on the individual doc pages for each [](../core_modules.md). You can also view all settings available for the modules you have on your system using the `--help` flag in auto-archiver.
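To illustrate the `module_name.config_value` flag format and the override precedence described above, here is a rough sketch of how such flags could be parsed and merged over file settings. This is our own simplification, not auto-archiver's actual parser:

```python
# Rough sketch of the dotted-flag format and override precedence described
# above. This is our own simplification, not auto-archiver's actual parser.
def parse_flags(argv):
    """Turn --module.setting=value flags into {module: {setting: value}}."""
    config = {}
    for arg in argv:
        if arg.startswith("--") and "." in arg and "=" in arg:
            key, value = arg[2:].split("=", 1)
            module, setting = key.split(".", 1)
            config.setdefault(module, {})[setting] = value
    return config

def merge_config(file_cfg, cli_cfg):
    """CLI values take precedence over values from the configuration file."""
    merged = {mod: dict(settings) for mod, settings in file_cfg.items()}
    for module, settings in cli_cfg.items():
        merged.setdefault(module, {}).update(settings)
    return merged

cli = parse_flags(["--instagram_extractor.api_key=123"])
print(merge_config({"instagram_extractor": {"api_key": "from-file"}}, cli))
# -> {'instagram_extractor': {'api_key': '123'}}
```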
@ -38,21 +38,52 @@ Docker works like a virtual machine running inside your computer, it isolates ev
2. `$PWD/local_archive` is a folder `local_archive/` in case you want to archive locally and have the files accessible outside docker
3. `/app/local_archive` is a folder inside docker that you can reference in your orchestration.yml file

### Example invocations

The invocations below will run the auto-archiver Docker image using a configuration file that you have specified:

```bash
# all the configurations come from ./secrets/orchestration.yaml
docker run --rm -v $PWD/secrets:/app/secrets -v $PWD/local_archive:/app/local_archive bellingcat/auto-archiver --config secrets/orchestration.yaml

# uses the same configurations but for another google docs sheet
# with a header on row 2 and with some different column names
# notice that columns is a dictionary so you need to pass it as JSON and it will override only the values provided
docker run --rm -v $PWD/secrets:/app/secrets -v $PWD/local_archive:/app/local_archive bellingcat/auto-archiver --config secrets/orchestration.yaml --gsheet_feeder.sheet="use it on another sheets doc" --gsheet_feeder.header=2 --gsheet_feeder.columns='{"url": "link"}'

# all the configurations come from orchestration.yaml and specify that s3 files should be private
docker run --rm -v $PWD/secrets:/app/secrets -v $PWD/local_archive:/app/local_archive bellingcat/auto-archiver --config secrets/orchestration.yaml --s3_storage.private=1
```

## Installing Locally with Pip

1. Make sure you have python 3.10 or higher installed
2. Install the package with your preferred package manager: `pip/pipenv/conda install auto-archiver` or `poetry add auto-archiver`
3. Test it's installed with `auto-archiver --help`
4. Install any other local dependency requirements (see [Installing Local Requirements](#installing-local-requirements) below)
5. Run it with your orchestration file and pass any flags you want in the command line: `auto-archiver --config secrets/orchestration.yaml` if your orchestration file is inside a `secrets/` folder, which we advise

### Example invocations

Once all your [local requirements](#installing-local-requirements) are correctly installed, the invocations below are equivalent to the Docker ones above:

```bash
# all the configurations come from ./secrets/orchestration.yaml
auto-archiver --config secrets/orchestration.yaml

# uses the same configurations but for another google docs sheet
# with a header on row 2 and with some different column names
# notice that columns is a dictionary so you need to pass it as JSON and it will override only the values provided
auto-archiver --config secrets/orchestration.yaml --gsheet_feeder.sheet="use it on another sheets doc" --gsheet_feeder.header=2 --gsheet_feeder.columns='{"url": "link"}'

# all the configurations come from orchestration.yaml and specify that s3 files should be private
auto-archiver --config secrets/orchestration.yaml --s3_storage.private=1
```

### Installing Local Requirements

If using the local installation method, you will also need to install the following dependencies locally:

1. [ffmpeg](https://www.ffmpeg.org/) - for handling of downloaded videos
2. [firefox](https://www.mozilla.org/en-US/firefox/new/) and [geckodriver](https://github.com/mozilla/geckodriver/releases) on a path folder like `/usr/local/bin` - for taking webpage screenshots with the screenshot enricher
3. (optional) [fonts-noto](https://fonts.google.com/noto) to deal with multiple unicode characters during selenium/geckodriver's screenshots: `sudo apt install fonts-noto -y`
4. [Browsertrix Crawler docker image](https://hub.docker.com/r/webrecorder/browsertrix-crawler) for the WACZ enricher/archiver
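One way to confirm the local dependencies above are available is to check for their binaries on PATH. A small sketch (it assumes the standard binary names `ffmpeg`, `firefox` and `geckodriver`; the helper itself is our own illustration):

```python
# Small sketch: check whether the local dependencies listed above are on PATH.
# Assumes the standard binary names; the helper is our own illustration.
import shutil

def check_local_deps():
    tools = ("ffmpeg", "firefox", "geckodriver")
    return {tool: shutil.which(tool) is not None for tool in tools}

print(check_local_deps())
```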
@ -0,0 +1,16 @@
```{include} ../../README.md
```

```{toctree}
:maxdepth: 2
:hidden:
:caption: Contents:

Overview <self>
installation/installation.rst
core_modules.md
how_to
development/developer_guidelines
autoapi/index.rst
```