Python script to automatically archive social media posts, videos, and images from a Google Sheets document. It uses different archivers depending on the platform, and can save content to local storage, an S3 bucket (Digital Ocean Spaces, AWS, ...), or Google Drive. The Google Sheets document that the links come from is updated with information about the archived content. It can be run manually or on an automated basis.
1. [A Google Service account is necessary for use with `gspread`.](https://gspread.readthedocs.io/en/latest/oauth2.html#for-bots-using-service-account) Credentials for this account should be stored in `service_account.json`, in the same directory as the script.
2. [ffmpeg](https://www.ffmpeg.org/) must also be installed locally for this tool to work.
3. [firefox](https://www.mozilla.org/en-US/firefox/new/) and [geckodriver](https://github.com/mozilla/geckodriver/releases) must be installed in a folder on the system `PATH`, such as `/usr/local/bin`.
4. [fonts-noto](https://fonts.google.com/noto) to handle non-Latin unicode characters in selenium/geckodriver screenshots: `sudo apt install fonts-noto -y`.
5. Internet Archive credentials can be retrieved from https://archive.org/account/s3.php.
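A quick way to sanity-check the external tools from the steps above is to look them up on `PATH`. This helper is illustrative, not part of the project:

```python
import shutil

def missing_tools(tools):
    """Return the subset of `tools` not found on the system PATH."""
    return [t for t in tools if shutil.which(t) is None]

if __name__ == "__main__":
    # Tool names taken from the setup steps above.
    missing = missing_tools(["ffmpeg", "firefox", "geckodriver"])
    if missing:
        print("Missing required tools: " + ", ".join(missing))
    else:
        print("All required tools found on PATH.")
```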
Configuration is done via a `config.yaml` file (see [example.config.yaml](example.config.yaml)); some properties of that file can be overridden via command line arguments. Here is the current output of `python auto_archive.py --help`:
```
Automatically archive social media posts, videos, and images from a Google Sheets document.
The command line arguments will always override the configurations in the provided YAML config
file (--config); only some high-level options are allowed via the command line, and the YAML
configuration file is the preferred method.

optional arguments:
  -h, --help            show this help message and exit
  --config CONFIG       the filename of the YAML configuration file (defaults to 'config.yaml')
  --storage {s3,local,gd}
                        which storage to use [execution.storage in config.yaml]
  --sheet SHEET         the name of the google sheets document [execution.sheet in config.yaml]
  --header HEADER       1-based index for the header row [execution.header in config.yaml]
  --s3-private          store content without public access permission (only for storage=s3)
                        [secrets.s3.private in config.yaml]
  --col-url URL         the name of the column to READ url FROM (default='link')
  --col-folder FOLDER   the name of the column to READ folder FROM (default='destination folder')
  --col-archive ARCHIVE
                        the name of the column to FILL WITH archive (default='archive location')
  --col-date DATE       the name of the column to FILL WITH date (default='archive date')
  --col-status STATUS   the name of the column to FILL WITH status (default='archive status')
  --col-thumbnail THUMBNAIL
                        the name of the column to FILL WITH thumbnail (default='thumbnail')
  --col-thumbnail_index THUMBNAIL_INDEX
                        the name of the column to FILL WITH thumbnail_index (default='thumbnail index')
  --col-timestamp TIMESTAMP
                        the name of the column to FILL WITH timestamp (default='upload timestamp')
  --col-title TITLE     the name of the column to FILL WITH title (default='upload title')
  --col-duration DURATION
                        the name of the column to FILL WITH duration (default='duration')
  --col-screenshot SCREENSHOT
                        the name of the column to FILL WITH screenshot (default='screenshot')
  --col-hash HASH       the name of the column to FILL WITH hash (default='hash')
```
All of the configurations can be specified in the YAML config file, but it is sometimes useful to override some of them from the command line, such as the sheet the archiver runs on. Here are some examples (possibly prepended by `pipenv run`):
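For instance (the sheet and file names below are placeholders; the flags are those listed in the `--help` output above):

```shell
# Run on a specific sheet, overriding the one in config.yaml
python auto_archive.py --sheet "My Awesome Sheet"

# Use a different config file and store media locally
python auto_archive.py --config my-config.yaml --storage=local

# Archive a sheet whose header is on the second row
python auto_archive.py --sheet "My Awesome Sheet" --header=2
```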
To use Google Drive storage, set the id of the shared folder in the `config.yaml` file. That folder must be shared with the service account (e.g. `autoarchiverservice@auto-archiver-111111.iam.gserviceaccount.com`); you can then use `--storage=gd`.
The first time you run the tool, you will be prompted to authenticate with the associated phone number; alternatively, you can place your `anon.session` file in the root directory.
This sheet must also have specific columns in the `header` row (matched case-insensitively; see `COLUMN_NAMES` in [gworksheet.py](utils/gworksheet.py)). Only the `link` and `status` columns are mandatory:
* `Destination folder`: (optional) by default, files are saved to a folder called `name-of-sheets-document/name-of-sheets-tab/`; with this column you can organize documents into folders from the sheet.
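As a sketch of how the column lookup could work, the mapping below copies the default column names from the `--help` output above and matches them case-insensitively against a header row. The real mapping lives in `COLUMN_NAMES` in [gworksheet.py](utils/gworksheet.py); this code is illustrative only:

```python
# Default column names, copied from the CLI defaults shown above.
DEFAULT_COLUMNS = {
    "url": "link",
    "status": "archive status",
    "folder": "destination folder",
    "archive": "archive location",
    "date": "archive date",
    "thumbnail": "thumbnail",
    "thumbnail_index": "thumbnail index",
    "timestamp": "upload timestamp",
    "title": "upload title",
    "duration": "duration",
    "screenshot": "screenshot",
    "hash": "hash",
}

def find_column(header_row, key):
    """Return the 0-based index of the column for `key`, or None if absent.

    Matching is case-insensitive, mirroring the behaviour described above.
    """
    wanted = DEFAULT_COLUMNS[key].lower()
    for i, cell in enumerate(header_row):
        if cell.strip().lower() == wanted:
            return i
    return None
```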


Note that the first row is skipped, as it is assumed to be a header row. Rows with an empty URL column, or a non-empty archive column are also skipped. All sheets in the document will be checked.
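The row-selection rule just described can be sketched as a small filter. Column positions are passed in explicitly here for illustration; the real tool resolves them from the header row by name:

```python
def rows_to_archive(rows, url_col, archive_col, header_rows=1):
    """Yield (1-based row number, row) pairs that still need archiving.

    Skips the header row(s), rows with an empty URL cell, and rows whose
    archive cell is already filled, as described above.
    """
    for i, row in enumerate(rows[header_rows:], start=header_rows + 1):
        url = row[url_col].strip() if url_col < len(row) else ""
        archived = row[archive_col].strip() if archive_col < len(row) else ""
        if url and not archived:
            yield i, row
```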
When run on a schedule in this way, the archiver should archive and store all media added to the Google Sheet every 60 seconds. Of course, additional logging configuration, etc. might be required.
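One way to schedule such runs is cron. A hypothetical crontab entry (the repository path and sheet name are placeholders), running once a minute, cron's finest granularity:

```shell
# Run the archiver every minute; adjust the path and sheet name to your setup.
* * * * * cd /home/user/auto-archiver && pipenv run python auto_archive.py --sheet "My Archive Sheet" >> archiver.log 2>&1
```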
To make it easier to set up new auto-archiver sheets, the auto-auto-archiver will look at a particular sheet and run the auto-archiver on every sheet name in column A, starting from row 11. (It starts there to leave room for instructional text in the first rows of the sheet.) This script takes one command line argument, `--sheet`, with the name of that sheet, which must be shared with the same service account.
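The "column A, from row 11" rule can be sketched as a pure function. The function name is an assumption for illustration; blank cells are skipped so instructional rows and gaps do no harm:

```python
def sheets_to_archive(column_a):
    """Given the values of column A (index 0 = row 1), return the sheet
    names to archive: every non-empty cell from row 11 onward."""
    return [name.strip() for name in column_a[10:] if name.strip()]
```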
