auto-archiver
This Python script looks for links to YouTube, Twitter, etc. in a specified column of a Google Sheet, uses YoutubeDL to download the media, stores the result in a Digital Ocean space or Google Drive, and updates the Google Sheet with the archive location, status, and date. It can be run manually or on an automated basis.
Setup
If you are using pipenv (recommended), pipenv install is sufficient to install Python prerequisites.
A Google Service account is necessary for use with gspread. Credentials for this account should be stored in service_account.json, in the same directory as the script.
ffmpeg must also be installed locally for this tool to work.
firefox and geckodriver must also be installed, in a folder on the PATH such as /usr/local/bin.
fonts-noto is needed so that selenium/geckodriver's screenshots render characters from multiple Unicode scripts: sudo apt install fonts-noto -y
A .env file is required for saving content to a Digital Ocean space and Google Drive, and for archiving pages to the Internet Archive. This file should also be in the script directory, and should contain the following variables:
DO_SPACES_REGION=
DO_BUCKET=
DO_SPACES_KEY=
DO_SPACES_SECRET=
INTERNET_ARCHIVE_S3_KEY=
INTERNET_ARCHIVE_S3_SECRET=
TELEGRAM_API_ID=
TELEGRAM_API_HASH=
FACEBOOK_COOKIE=
GD_ROOT_FOLDER_ID=
.example.env is an example of this file.
Internet Archive credentials can be retrieved from https://archive.org/account/s3.php.
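As a rough way to catch a misconfigured .env before a run, the sketch below parses KEY=VALUE lines and reports which of the variables listed above are absent or empty. The helper names (parse_env, missing_vars) are hypothetical and not part of auto_archive.py, and the sketch treats every listed variable as required, which may be stricter than the script itself.

```python
# Hypothetical pre-flight check; these helpers are illustrative only
# and are not taken from auto_archive.py.
REQUIRED_VARS = [
    "DO_SPACES_REGION", "DO_BUCKET", "DO_SPACES_KEY", "DO_SPACES_SECRET",
    "INTERNET_ARCHIVE_S3_KEY", "INTERNET_ARCHIVE_S3_SECRET",
    "TELEGRAM_API_ID", "TELEGRAM_API_HASH",
    "FACEBOOK_COOKIE", "GD_ROOT_FOLDER_ID",
]


def parse_env(text):
    """Parse KEY=VALUE lines, ignoring blank lines and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def missing_vars(values, required=REQUIRED_VARS):
    """Return the required variable names that are absent or empty."""
    return [name for name in required if not values.get(name)]
```

Calling missing_vars(parse_env(open(".env").read())) from the script directory would then list anything still to be filled in.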
Running
There is just one required command line flag, --sheet name, where name is the name of the Google Sheet to check for URLs. This sheet must have been shared with the Google Service account used by gspread. This sheet must also have specific columns in the first row:
- Media URL (required): the location of the media to be archived. This is the only column that should be supplied with data initially.
- Archive status (required): the status of the auto archiver script. Any row with text in this column will be skipped automatically.
- Archive location (required): the location of the archived version. For files that were not able to be auto archived, this can be manually updated.
- Archive date: the date that the auto archiver script ran for this file.
- Upload timestamp: the timestamp extracted from the video. (For YouTube, this unfortunately does not currently include the time.)
- Duration: the duration of the video.
- Upload title: the "title" of the video from the original source.
- Thumbnail: an image thumbnail of the video (resize row height to make this more visible).
- Thumbnail index: a link to a page that shows many thumbnails for the video, useful for quickly seeing video content.
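The column requirements above can be checked against a sheet's first row before running the archiver. This is a minimal sketch, assuming the header row is available as a list of strings (for example, the first row returned by gspread); the function name is hypothetical, and matching headers case-insensitively is an assumption rather than documented behaviour of the script.

```python
# Column names come from the list above; the helper itself is hypothetical.
REQUIRED_COLUMNS = ["Media URL", "Archive status", "Archive location"]


def missing_required_columns(header_row):
    """Return the required column names not found in the header row.

    Assumption: comparison ignores case and surrounding whitespace.
    """
    present = {cell.strip().lower() for cell in header_row}
    return [name for name in REQUIRED_COLUMNS if name.lower() not in present]
```

An empty result means all three required columns are present; the optional columns are simply left unfilled if they are missing.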
For example, for a spreadsheet named archiver-test:
pipenv run python auto_archive.py --sheet archiver-test
When the auto archiver starts running, it updates the "Archive status" column.
The links are then downloaded and archived, and the spreadsheet is updated with the results.
Live streaming content is recorded in a separate thread.
Note that the first row is skipped, as it is assumed to be a header row. Rows with an empty URL column or a non-empty archive column are also skipped. All sheets in the document will be checked.
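The skipping rules can be sketched as a small filter over the sheet's rows. This is an illustration only: it assumes rows arrive as lists of strings, that the "archive column" in question is Archive status, and that the column positions are passed in by the caller; rows_to_archive is a hypothetical name, not a function from the script.

```python
def rows_to_archive(rows, url_col=0, status_col=1):
    """Yield (row_number, url) for rows that still need archiving.

    Row numbers are 1-indexed to match the spreadsheet. The header row,
    rows without a URL, and rows that already have a status are skipped.
    """
    for row_number, row in enumerate(rows, start=1):
        if row_number == 1:
            continue  # first row is assumed to be the header
        # Short rows are treated as having empty trailing cells.
        url = row[url_col].strip() if len(row) > url_col else ""
        status = row[status_col].strip() if len(row) > status_col else ""
        if not url or status:
            continue
        yield row_number, url
```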
Automating
The auto-archiver can be run automatically via cron. An example crontab entry that runs the archiver every minute is as follows.
* * * * * python auto_archive.py --sheet archiver-test
With this configuration, the archiver should archive and store all media added to the Google Sheet every 60 seconds. Of course, additional logging information, etc. might be required.
auto_auto_archiver
To make it easier to set up new auto-archiver sheets, the auto-auto-archiver will look at a particular sheet and run the auto-archiver on every sheet name in column A, starting from row 11. (It starts here to leave room for instructional text in the first rows of the sheet.) This script takes one command line argument, --sheet, the name of the sheet. It must be shared with the same service account.
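The row-11 convention can be illustrated with a short sketch that takes the values of column A as a list and builds one auto-archiver command per sheet name. The function names are hypothetical, and skipping empty cells is an assumption about how blank rows are handled.

```python
import shlex


def sheet_names(column_a, start_row=11):
    """Return non-empty names from column A, from start_row (1-indexed) on."""
    return [v.strip() for v in column_a[start_row - 1:] if v and v.strip()]


def archive_commands(column_a):
    """Build one auto_archive.py command line per listed sheet name."""
    return [
        f"python auto_archive.py --sheet {shlex.quote(name)}"
        for name in sheet_names(column_a)
    ]
```

shlex.quote is used so that sheet names containing spaces survive being pasted into a shell.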
Code structure
Code is split into functional concepts:
- Archivers - receive a URL that they try to archive
- Storages - they deal with where the archived files go
- Utilities
- GWorksheet - facilitates some of the reading/writing tasks for a Google Worksheet
Current Archivers
graph TD
A(Archiver) -->|parent of| B(TelegramArchiver)
A -->|parent of| C(TikTokArchiver)
A -->|parent of| D(YoutubeDLArchiver)
A -->|parent of| E(WaybackArchiver)
A -->|parent of| F(TwitterArchiver)
Current Storages
graph TD
A(BaseStorage) -->|parent of| B(S3Storage)
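To make the parent/child relationships in the two graphs concrete, here is a minimal sketch of how such a hierarchy could be laid out. Only the class names Archiver and BaseStorage come from the graphs; the method names (download, upload), their signatures, and the InMemoryStorage and ExampleArchiver classes are assumptions made for illustration and may not match the repository's code.

```python
from abc import ABC, abstractmethod


class BaseStorage(ABC):
    """Decides where archived files go (method names are hypothetical)."""

    @abstractmethod
    def upload(self, key: str, data: bytes) -> str:
        """Store data under key and return a location string."""


class InMemoryStorage(BaseStorage):
    """Stand-in for a real backend such as S3Storage; keeps files in a dict."""

    def __init__(self):
        self.files = {}

    def upload(self, key, data):
        self.files[key] = data
        return f"memory://{key}"


class Archiver(ABC):
    """Receives a URL and tries to archive it into the given storage."""

    def __init__(self, storage: BaseStorage):
        self.storage = storage

    @abstractmethod
    def download(self, url: str):
        """Return the archive location, or None if the URL is not handled."""


class ExampleArchiver(Archiver):
    """Toy archiver that only accepts example.com URLs."""

    def download(self, url):
        if "example.com" not in url:
            return None
        key = url.rstrip("/").rsplit("/", 1)[-1] + ".txt"
        return self.storage.upload(key, url.encode())
```

Keeping storage behind its own base class means an archiver does not need to know whether files end up in a Digital Ocean space or elsewhere.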
Saving into Folders
To use the value of a spreadsheet column called File Number (e.g. SM001234) as a directory on the cloud storage, you need to pass in
python auto_archive.py --sheet 'Sheet Name' --use-filenumber-as-directory
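The effect of this flag can be pictured as prefixing each stored file's key with the row's File Number. The sketch below is an assumption about the resulting layout, not the script's actual code; storage_key is a hypothetical helper.

```python
import posixpath


def storage_key(filename, filenumber=None):
    """Return the storage key, placed under filenumber when one is given."""
    if filenumber and filenumber.strip():
        # Cloud storage keys use forward slashes regardless of the local OS.
        return posixpath.join(filenumber.strip(), filename)
    return filename
```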
Google Drive
To use Google Drive storage you need the id of the shared folder in the .env file. The folder must be shared with the service account, e.g. autoarchiverservice@auto-archiver-111111.iam.gserviceaccount.com
python auto_archive.py --sheet 'Sheet Name' --use-filenumber-as-directory --storage='gd'
Note that you must use --use-filenumber-as-directory for Google Drive storage.
Telethon (Telegram's API Library)
Put your anon.session in the root, so that it doesn't stall and ask for authentication.