auto-archiver/README.md

# auto-archiver
Python script to automatically archive social media posts, videos, and images from a Google Sheets document. Uses different archivers depending on the platform, and can save content to local storage, S3 bucket (Digital Ocean Spaces, AWS, ...), and Google Drive. The Google Sheets where the links come from is updated with information about the archived content. It can be run manually or on an automated basis.

## Setup

If you are using `pipenv` (recommended), `pipenv install` is sufficient to install Python prerequisites.

You also need:
1. [A Google Service account is necessary for use with `gspread`.](https://gspread.readthedocs.io/en/latest/oauth2.html#for-bots-using-service-account) Credentials for this account should be stored in `service_account.json`, in the same directory as the script.
2. [ffmpeg](https://www.ffmpeg.org/) must also be installed locally for this tool to work. 
3. [firefox](https://www.mozilla.org/en-US/firefox/new/) and [geckodriver](https://github.com/mozilla/geckodriver/releases) on a path folder like `/usr/local/bin`. 
4. [fonts-noto](https://fonts.google.com/noto) to deal with multiple unicode characters during selenium/geckodriver's screenshots: `sudo apt install fonts-noto -y`. 
5. Internet Archive credentials can be retrieved from https://archive.org/account/s3.php.

### Configuration file
Configuration is done via a config.yaml file (see [example.config.yaml](example.config.yaml)) and some properties of that file can be overwritten via command line arguments. Here is the current result from running the `python auto_archive.py --help`:

<details><summary><code>python auto_archive.py --help</code></summary>


```js
usage: auto_archive.py [-h] [--config CONFIG] [--storage {s3,local,gd}] [--sheet SHEET] [--header HEADER] [--s3-private] [--col-url URL] [--col-folder FOLDER] [--col-archive ARCHIVE] [--col-date DATE] [--col-status STATUS] [--col-thumbnail THUMBNAIL] [--col-thumbnail_index THUMBNAIL_INDEX] [--col-timestamp TIMESTAMP] [--col-title TITLE] [--col-duration DURATION] [--col-screenshot SCREENSHOT] [--col-hash HASH]

Automatically archive social media posts, videos, and images from a Google Sheets document. The command line arguments will always override the configurations in the provided JSON config
file (--config), only some high-level options are allowed via the command line and the JSON configuration file is the preferred method.

optional arguments:
  -h, --help            show this help message and exit
  --config CONFIG       the filename of the JSON configuration file (defaults to 'config.json')
  --storage {s3,local,gd}
                        which storage to use [execution.storage in config.json]
  --sheet SHEET         the name of the google sheets document [execution.sheet in config.json]
  --header HEADER       1-based index for the header row [execution.header in config.json]
  --s3-private          Store content without public access permission (only for storage=s3) [secrets.s3.private in config.json]
  --col-url URL         the name of the column to READ url FROM (default='link')
  --col-folder FOLDER   the name of the column to READ folder FROM (default='destination folder')
  --col-archive ARCHIVE
                        the name of the column to FILL WITH archive (default='archive location')
  --col-date DATE       the name of the column to FILL WITH date (default='archive date')
  --col-status STATUS   the name of the column to FILL WITH status (default='archive status')
  --col-thumbnail THUMBNAIL 
                        the name of the column to FILL WITH thumbnail (default='thumbnail')
  --col-thumbnail_index THUMBNAIL_INDEX
                        the name of the column to FILL WITH thumbnail_index (default='thumbnail index')
  --col-timestamp TIMESTAMP
                        the name of the column to FILL WITH timestamp (default='upload timestamp')
  --col-title TITLE     the name of the column to FILL WITH title (default='upload title')
  --col-duration DURATION
                        the name of the column to FILL WITH duration (default='duration')
  --col-screenshot SCREENSHOT
                        the name of the column to FILL WITH screenshot (default='screenshot')
  --col-hash HASH       the name of the column to FILL WITH hash (default='hash')
```

</details><br/>

#### Example invocations
All the configurations can be specified in the YAML config file, but sometimes it is useful to override only some of those like the sheet that we are running the archival on, here are some examples (possibly prepended by `pipenv run`):

```bash
# all the configurations come from config.yaml
python auto_archive.py

# all the configurations come from my_config.yaml
python auto_archive.py --config my_config.yaml

# reads the configurations but saves archived content to google drive instead
python auto_archive.py --config my_config.yaml --storage gd

# uses the configurations but for another google docs sheet 
# with a header on row 2 and with some different column names
python auto_archive.py --config my_config.yaml --sheet="use it on another sheets doc" --header=2 --col-link="put urls here"

# all the configurations come from config.yaml and specifies that s3 files should be private
python auto_archive.py --s3-private
```

### Extra notes on configuration
#### Google Drive
To use Google Drive storage you need the id of the shared folder in the `config.yaml` file which must be shared with the service account eg `autoarchiverservice@auto-archiver-111111.iam.gserviceaccount.com` and then you can use `--storage=gd`

#### Telethon (Telegrams API Library)
The first time you run, you will be prompted to do a authentication with the phone number associated, alternatively you can put your `anon.session` in the root.


## Running
The `--sheet name` property (or `execution.sheet` in the YAML file) is the name of the Google Sheet to check for URLs. 
This sheet must have been shared with the Google Service account used by `gspread`. 
This sheet must also have specific columns (case-insensitive) in the `header` row (see `COLUMN_NAMES` in [gworksheet.py](utils/gworksheet.py)), only the `link` and `status` columns are mandatory:
* `Link` (required): the location of the media to be archived. This is the only column that should be supplied with data initially
* `Archive status` (required): the status of the auto archiver script. Any row with text in this column will be skipped automatically.
* `Destination folder`: (optional) by default files are saved to a folder called `name-of-sheets-document/name-of-sheets-tab/` using this option you can organize documents into folder from the sheet. 
* `Archive location`: the location of the archived version. For files that were not able to be auto archived, this can be manually updated.
* `Archive date`: the date that the auto archiver script ran for this file
* `Upload timestamp`: the timestamp extracted from the video. (For YouTube, this unfortunately does not currently include the time)
* `Upload title`: the "title" of the video from the original source
* `Hash`: a hash of the first video or image found
* `Screenshot`: a screenshot taken with from a browser view of opening the page
* in case of videos
  * `Duration`: duration in seconds
  * `Thumbnail`: an image thumbnail of the video (resize row height to make this more visible)
  * `Thumbnail index`: a link to a page that shows many thumbnails for the video, useful for quickly seeing video content


For example, for use with this spreadsheet:

![A screenshot of a Google Spreadsheet with column headers defined as above, and several Youtube and Twitter URLs in the "Media URL" column](docs/demo-before.png)

```pipenv run python auto_archive.py --sheet archiver-test```

When the auto archiver starts running, it updates the "Archive status" column.

![A screenshot of a Google Spreadsheet with column headers defined as above, and several Youtube and Twitter URLs in the "Media URL" column. The auto archiver has added "archive in progress" to one of the status columns.](docs/demo-progress.png)

The links are downloaded and archived, and the spreadsheet is updated to the following:

![A screenshot of a Google Spreadsheet with videos archived and metadata added per the description of the columns above.](docs/demo-after.png)

Live streaming content is recorded in a separate thread.

Note that the first row is skipped, as it is assumed to be a header row. Rows with an empty URL column, or a non-empty archive column are also skipped. All sheets in the document will be checked.

## Automating

The auto-archiver can be run automatically via cron. An example crontab entry that runs the archiver every minute is as follows.

```* * * * * python auto_archive.py --sheet archiver-test```

With this configuration, the archiver should archive and store all media added to the Google Sheet every 60 seconds. Of course, additional logging information, etc. might be required.

# auto_auto_archiver

To make it easier to set up new auto-archiver sheets, the auto-auto-archiver will look at a particular sheet and run the auto-archiver on every sheet name in column A, starting from row 11. (It starts here to support instructional text in the first rows of the sheet, as shown below.) This script takes one command line argument, with `--sheet`, the name of the sheet. It must be shared with the same service account.

![A screenshot of a Google Spreadsheet configured to show instructional text and a list of sheet names to check with auto-archiver.](docs/auto-auto.png)

# Code structure
Code is split into functional concepts:
1. [Archivers](archivers/) - receive a URL that they try to archive
2. [Storages](storages/) - they deal with where the archived files go
3. [Utilities](utils/)
   1. [GWorksheet](utils/gworksheet.py) - facilitates some of the reading/writing tasks for a Google Worksheet

### Current Archivers
Archivers are tested in a meaningful order with Wayback Machine being the default, that can easily be changed in the code. 
```mermaid
graph TD
    A(Archiver) -->|parent of| B(YoutubeDLArchiver)
    A -->|parent of| C(TikTokArchiver)
    A -->|parent of| D(TwitterArchiver)
    A -->|parent of| E(TelegramArchiver)
    A -->|parent of| F(TelethonArchiver)
    A -->|parent of| G(WaybackArchiver)
```
### Current Storages
```mermaid
graph TD
    A(BaseStorage) -->|parent of| B(S3Storage)
    A(BaseStorage) -->|parent of| C(LocalStorage)
    A(BaseStorage) -->|parent of| D(GoogleDriveStorage)
```
Add readme 2021-02-09 14:19:46 +00:00			`# auto-archiver`
readme updates 2022-06-07 16:52:19 +00:00			`Python script to automatically archive social media posts, videos, and images from a Google Sheets document. Uses different archivers depending on the platform, and can save content to local storage, S3 bucket (Digital Ocean Spaces, AWS, ...), and Google Drive. The Google Sheets where the links come from is updated with information about the archived content. It can be run manually or on an automated basis.`
Add intro 2021-02-09 14:22:58 +00:00
Add readme 2021-02-09 14:19:46 +00:00			`## Setup`

			If you are using `pipenv` (recommended), `pipenv install` is sufficient to install Python prerequisites.

README updates 2022-06-07 16:41:43 +00:00			`You also need:`
			1. [A Google Service account is necessary for use with `gspread`.](https://gspread.readthedocs.io/en/latest/oauth2.html#for-bots-using-service-account) Credentials for this account should be stored in `service_account.json`, in the same directory as the script.
numbers in markdown 2022-06-14 18:55:30 +00:00			`2. [ffmpeg](https://www.ffmpeg.org/) must also be installed locally for this tool to work.`
			3. [firefox](https://www.mozilla.org/en-US/firefox/new/) and [geckodriver](https://github.com/mozilla/geckodriver/releases) on a path folder like `/usr/local/bin`.
			4. [fonts-noto](https://fonts.google.com/noto) to deal with multiple unicode characters during selenium/geckodriver's screenshots: `sudo apt install fonts-noto -y`.
			`5. Internet Archive credentials can be retrieved from https://archive.org/account/s3.php.`
README updates 2022-06-07 16:41:43 +00:00
			`### Configuration file`
json -> yaml 2022-06-14 19:18:18 +00:00			Configuration is done via a config.yaml file (see [example.config.yaml](example.config.yaml)) and some properties of that file can be overwritten via command line arguments. Here is the current result from running the `python auto_archive.py --help`:
README updates 2022-06-07 16:41:43 +00:00
			`<details><summary><code>python auto_archive.py --help</code></summary>`



			```js
			`usage: auto_archive.py [-h] [--config CONFIG] [--storage {s3,local,gd}] [--sheet SHEET] [--header HEADER] [--s3-private] [--col-url URL] [--col-folder FOLDER] [--col-archive ARCHIVE] [--col-date DATE] [--col-status STATUS] [--col-thumbnail THUMBNAIL] [--col-thumbnail_index THUMBNAIL_INDEX] [--col-timestamp TIMESTAMP] [--col-title TITLE] [--col-duration DURATION] [--col-screenshot SCREENSHOT] [--col-hash HASH]`

			`Automatically archive social media posts, videos, and images from a Google Sheets document. The command line arguments will always override the configurations in the provided JSON config`
			`file (--config), only some high-level options are allowed via the command line and the JSON configuration file is the preferred method.`

			`optional arguments:`
			`-h, --help show this help message and exit`
			`--config CONFIG the filename of the JSON configuration file (defaults to 'config.json')`
			`--storage {s3,local,gd}`
			`which storage to use [execution.storage in config.json]`
			`--sheet SHEET the name of the google sheets document [execution.sheet in config.json]`
			`--header HEADER 1-based index for the header row [execution.header in config.json]`
			`--s3-private Store content without public access permission (only for storage=s3) [secrets.s3.private in config.json]`
			`--col-url URL the name of the column to READ url FROM (default='link')`
			`--col-folder FOLDER the name of the column to READ folder FROM (default='destination folder')`
			`--col-archive ARCHIVE`
			`the name of the column to FILL WITH archive (default='archive location')`
			`--col-date DATE the name of the column to FILL WITH date (default='archive date')`
			`--col-status STATUS the name of the column to FILL WITH status (default='archive status')`
			`--col-thumbnail THUMBNAIL`
			`the name of the column to FILL WITH thumbnail (default='thumbnail')`
			`--col-thumbnail_index THUMBNAIL_INDEX`
			`the name of the column to FILL WITH thumbnail_index (default='thumbnail index')`
			`--col-timestamp TIMESTAMP`
			`the name of the column to FILL WITH timestamp (default='upload timestamp')`
			`--col-title TITLE the name of the column to FILL WITH title (default='upload title')`
			`--col-duration DURATION`
			`the name of the column to FILL WITH duration (default='duration')`
			`--col-screenshot SCREENSHOT`
			`the name of the column to FILL WITH screenshot (default='screenshot')`
			`--col-hash HASH the name of the column to FILL WITH hash (default='hash')`
			```
Add readme 2021-02-09 14:19:46 +00:00
README updates 2022-06-07 16:41:43 +00:00			`</details><br/>`
split into multiple files MVP 2022-02-21 13:19:09 +00:00
README updates 2022-06-07 16:41:43 +00:00			`#### Example invocations`
json -> yaml 2022-06-14 19:18:18 +00:00			All the configurations can be specified in the YAML config file, but sometimes it is useful to override only some of those like the sheet that we are running the archival on, here are some examples (possibly prepended by `pipenv run`):
adds README instructions for geckodriver 2022-03-09 10:44:05 +00:00
README updates 2022-06-07 16:41:43 +00:00			```bash
json -> yaml 2022-06-14 19:18:18 +00:00			`# all the configurations come from config.yaml`
README updates 2022-06-07 16:41:43 +00:00			`python auto_archive.py`
Update README.md 2022-05-03 13:45:18 +00:00
json -> yaml 2022-06-14 19:18:18 +00:00			`# all the configurations come from my_config.yaml`
			`python auto_archive.py --config my_config.yaml`
Add readme 2021-02-09 14:19:46 +00:00
README updates 2022-06-07 16:41:43 +00:00			`# reads the configurations but saves archived content to google drive instead`
json -> yaml 2022-06-14 19:18:18 +00:00			`python auto_archive.py --config my_config.yaml --storage gd`
README updates 2022-06-07 16:41:43 +00:00
			`# uses the configurations but for another google docs sheet`
			`# with a header on row 2 and with some different column names`
json -> yaml 2022-06-14 19:18:18 +00:00			`python auto_archive.py --config my_config.yaml --sheet="use it on another sheets doc" --header=2 --col-link="put urls here"`
README updates 2022-06-07 16:41:43 +00:00
json -> yaml 2022-06-14 19:18:18 +00:00			`# all the configurations come from config.yaml and specifies that s3 files should be private`
README updates 2022-06-07 16:41:43 +00:00			`python auto_archive.py --s3-private`
Add readme 2021-02-09 14:19:46 +00:00			```

README updates 2022-06-07 16:41:43 +00:00			`### Extra notes on configuration`
			`#### Google Drive`
json -> yaml 2022-06-14 19:18:18 +00:00			To use Google Drive storage you need the id of the shared folder in the `config.yaml` file which must be shared with the service account eg `autoarchiverservice@auto-archiver-111111.iam.gserviceaccount.com` and then you can use `--storage=gd`
Save to folders for S3 and GD. Google Drive (GD) storage 2022-05-11 14:39:44 +00:00
README updates 2022-06-07 16:41:43 +00:00			`#### Telethon (Telegrams API Library)`
more verbose about mandatory columns 2022-06-14 17:54:08 +00:00			The first time you run, you will be prompted to do a authentication with the phone number associated, alternatively you can put your `anon.session` in the root.
Update README.md 2021-06-04 10:07:03 +00:00
Add readme 2021-02-09 14:19:46 +00:00
README updates 2022-06-07 16:41:43 +00:00			`## Running`
json -> yaml 2022-06-14 19:18:18 +00:00			The `--sheet name` property (or `execution.sheet` in the YAML file) is the name of the Google Sheet to check for URLs.
README updates 2022-06-07 16:41:43 +00:00			This sheet must have been shared with the Google Service account used by `gspread`.
more verbose about mandatory columns 2022-06-14 17:54:08 +00:00			This sheet must also have specific columns (case-insensitive) in the `header` row (see `COLUMN_NAMES` in [gworksheet.py](utils/gworksheet.py)), only the `link` and `status` columns are mandatory:
README updates 2022-06-07 16:41:43 +00:00			* `Link` (required): the location of the media to be archived. This is the only column that should be supplied with data initially
Update auto archiver docs with new header declaration method 2021-05-12 07:01:45 +00:00			* `Archive status` (required): the status of the auto archiver script. Any row with text in this column will be skipped automatically.
more verbose about mandatory columns 2022-06-14 17:54:08 +00:00			* `Destination folder`: (optional) by default files are saved to a folder called `name-of-sheets-document/name-of-sheets-tab/` using this option you can organize documents into folder from the sheet.
README updates 2022-06-07 16:41:43 +00:00			* `Archive location`: the location of the archived version. For files that were not able to be auto archived, this can be manually updated.
Update auto archiver docs with new header declaration method 2021-05-12 07:01:45 +00:00			* `Archive date`: the date that the auto archiver script ran for this file
			* `Upload timestamp`: the timestamp extracted from the video. (For YouTube, this unfortunately does not currently include the time)
			* `Upload title`: the "title" of the video from the original source
README updates 2022-06-07 16:41:43 +00:00			* `Hash`: a hash of the first video or image found
			* `Screenshot`: a screenshot taken with from a browser view of opening the page
			`* in case of videos`
			* `Duration`: duration in seconds
			* `Thumbnail`: an image thumbnail of the video (resize row height to make this more visible)
			* `Thumbnail index`: a link to a page that shows many thumbnails for the video, useful for quickly seeing video content

Add readme 2021-02-09 14:19:46 +00:00
			`For example, for use with this spreadsheet:`

Update auto archiver docs with new header declaration method 2021-05-12 07:01:45 +00:00			`![A screenshot of a Google Spreadsheet with column headers defined as above, and several Youtube and Twitter URLs in the "Media URL" column](docs/demo-before.png)`
Add readme 2021-02-09 14:19:46 +00:00
Rename files to use consistent punctuation 2021-06-01 09:33:20 +00:00			```pipenv run python auto_archive.py --sheet archiver-test```
Add readme 2021-02-09 14:19:46 +00:00
Update auto archiver docs with new header declaration method 2021-05-12 07:01:45 +00:00			`When the auto archiver starts running, it updates the "Archive status" column.`
Add readme 2021-02-09 14:19:46 +00:00
Update auto archiver docs with new header declaration method 2021-05-12 07:01:45 +00:00			`![A screenshot of a Google Spreadsheet with column headers defined as above, and several Youtube and Twitter URLs in the "Media URL" column. The auto archiver has added "archive in progress" to one of the status columns.](docs/demo-progress.png)`
Add readme 2021-02-09 14:19:46 +00:00
Update auto archiver docs with new header declaration method 2021-05-12 07:01:45 +00:00			`The links are downloaded and archived, and the spreadsheet is updated to the following:`
Add readme 2021-02-09 14:19:46 +00:00
Update auto archiver docs with new header declaration method 2021-05-12 07:01:45 +00:00			`![A screenshot of a Google Spreadsheet with videos archived and metadata added per the description of the columns above.](docs/demo-after.png)`
Update README.md 2021-02-09 14:27:42 +00:00
Update auto archiver docs with new header declaration method 2021-05-12 07:01:45 +00:00			`Live streaming content is recorded in a separate thread.`
Add readme 2021-02-09 14:19:46 +00:00
Update auto archiver docs with new header declaration method 2021-05-12 07:01:45 +00:00			`Note that the first row is skipped, as it is assumed to be a header row. Rows with an empty URL column, or a non-empty archive column are also skipped. All sheets in the document will be checked.`
Add readme 2021-02-09 14:19:46 +00:00
Update auto archiver docs with new header declaration method 2021-05-12 07:01:45 +00:00			`## Automating`
Add readme 2021-02-09 14:19:46 +00:00
Update auto archiver docs with new header declaration method 2021-05-12 07:01:45 +00:00			`The auto-archiver can be run automatically via cron. An example crontab entry that runs the archiver every minute is as follows.`
Add readme 2021-02-09 14:19:46 +00:00
Rename files to use consistent punctuation 2021-06-01 09:33:20 +00:00			```* * * * * python auto_archive.py --sheet archiver-test```
Add readme 2021-02-09 14:19:46 +00:00
Update README.md 2021-05-14 12:06:41 +00:00			`With this configuration, the archiver should archive and store all media added to the Google Sheet every 60 seconds. Of course, additional logging information, etc. might be required.`

Rename files to use consistent punctuation 2021-06-01 09:33:20 +00:00			`# auto_auto_archiver`
Update README.md 2021-05-14 12:06:41 +00:00
			To make it easier to set up new auto-archiver sheets, the auto-auto-archiver will look at a particular sheet and run the auto-archiver on every sheet name in column A, starting from row 11. (It starts here to support instructional text in the first rows of the sheet, as shown below.) This script takes one command line argument, with `--sheet`, the name of the sheet. It must be shared with the same service account.

			`![A screenshot of a Google Spreadsheet configured to show instructional text and a list of sheet names to check with auto-archiver.](docs/auto-auto.png)`

cleanup and docs 2022-02-23 15:07:58 +00:00			`# Code structure`
			`Code is split into functional concepts:`
			`1. [Archivers](archivers/) - receive a URL that they try to archive`
			`2. [Storages](storages/) - they deal with where the archived files go`
creates utils module and moves gworkseet there 2022-02-23 15:24:59 +00:00			`3. [Utilities](utils/)`
			`1. [GWorksheet](utils/gworksheet.py) - facilitates some of the reading/writing tasks for a Google Worksheet`
cleanup and docs 2022-02-23 15:07:58 +00:00
			`### Current Archivers`
readme updates 2022-06-07 16:43:04 +00:00			`Archivers are tested in a meaningful order with Wayback Machine being the default, that can easily be changed in the code.`
cleanup and docs 2022-02-23 15:07:58 +00:00			```mermaid
			`graph TD`
README updates 2022-06-07 16:41:43 +00:00			`A(Archiver) -->\|parent of\| B(YoutubeDLArchiver)`
cleanup and docs 2022-02-23 15:07:58 +00:00			`A -->\|parent of\| C(TikTokArchiver)`
README updates 2022-06-07 16:41:43 +00:00			`A -->\|parent of\| D(TwitterArchiver)`
			`A -->\|parent of\| E(TelegramArchiver)`
			`A -->\|parent of\| F(TelethonArchiver)`
			`A -->\|parent of\| G(WaybackArchiver)`
cleanup and docs 2022-02-23 15:07:58 +00:00			```
			`### Current Storages`
			```mermaid
			`graph TD`
			`A(BaseStorage) -->\|parent of\| B(S3Storage)`
readme updates 2022-06-07 16:43:04 +00:00			`A(BaseStorage) -->\|parent of\| C(LocalStorage)`
			`A(BaseStorage) -->\|parent of\| D(GoogleDriveStorage)`
cleanup and docs 2022-02-23 15:07:58 +00:00			```
Save to folders for S3 and GD. Google Drive (GD) storage 2022-05-11 14:39:44 +00:00