kopia lustrzana https://github.com/bellingcat/auto-archiver
Add readme
rodzic
d6cb20dace
commit
0d1dc42654
|
@ -0,0 +1,50 @@
|
|||
# auto-archiver
|
||||
|
||||
## Setup
|
||||
|
||||
If you are using `pipenv` (recommended), `pipenv install` is sufficient to install Python prerequisites.
|
||||
|
||||
A Google Service account is necessary for use with `gspread`. These credentials should be stored in `~/.config/gspread/service_account.json`. =
|
||||
|
||||
A `.env` file is required for saving content to a Digital Ocean space. This file should contain the following variables:
|
||||
|
||||
```
|
||||
DO_SPACES_REGION=
|
||||
DO_BUCKET=
|
||||
DO_SPACES_KEY=
|
||||
DO_SPACES_SECRET=
|
||||
```
|
||||
|
||||
## Running
|
||||
|
||||
There are several necessary command line flags:
|
||||
* `--sheet name` sets the name of the Google Sheet to check for URLs. This sheet must have been shared with the Google Service account used by `gspread`.
|
||||
* `--url-col`, `--status-col`, `--archive-col`, and `--date-col` give the letter name of the column for the original source URL, the archiver status, the archive location, and the archive date respectively.
|
||||
|
||||
For example, for use with this spreadsheet:
|
||||
|
||||
![A screenshot of a Google Spreadsheet with a single Youtube URL in column A](docs/before.png)
|
||||
|
||||
```pipenv run python auto-archive.py --sheet archiver-test --url-col A --status-col B --archive-col C --date-col D ```
|
||||
|
||||
The Youtube link is downloaded and archived, and the spreadsheet is updated to the following:
|
||||
|
||||
![A screenshot of a Google Spreadsheet with the same Youtube URL in column A, and additional archival information in B, C, and D](docs/before.png)
|
||||
|
||||
By default, the archiver will skip over live streaming content. However, with the `--streaming` flag, it will skip over non-real time content and archive livestreams. This is blocking, and each execution of the archiver will start downloading only a single livestreamed video. This is so that it can be used in combination with a non-streaming archiver, as detailed below.
|
||||
|
||||
Finally, by default only the first worksheet in a Google Sheet is checked. To check all use the `--all-worksheets` flag. These worksheets must use the same column locations.
|
||||
|
||||
## Automating
|
||||
|
||||
The auto-archiver can be run automatically via cron. To prevent overlapping execution (which should not affect archive integrity, but will use unecessary compute and network resources) `flock` can be used to create a lockfile. An example crontab entry that runs the archiver every minute is as follows.
|
||||
|
||||
```* * * * * flock -w 0 archive.lock python auto-archive.py --sheet archiver-test --url-col A --status-col B --archive-col C --date-col D```
|
||||
|
||||
Of course, additional logging information, etc. might be required.
|
||||
|
||||
With streaming mode enabled, the archiver can run safely at any frequency (since each iteration affects only a single row in the spreadsheet and it marks when streaming has started.) An example crontab line to run it every minute is as follows:
|
||||
|
||||
```* * * * * python auto-archive.py --sheet archiver-test --url-col A --status-col B --archive-col C --date-col D --streaming```
|
||||
|
||||
When these two cronjobs are used together, the archiver should archive and store all media added to the Google Sheet every 60 seconds.
|
Plik binarny nie jest wyświetlany.
Po Szerokość: | Wysokość: | Rozmiar: 91 KiB |
Plik binarny nie jest wyświetlany.
Po Szerokość: | Wysokość: | Rozmiar: 74 KiB |
Ładowanie…
Reference in New Issue