initial commit

2019-10-18 23:16:17 +00:00 · 2019-10-18 23:16:17 +00:00 · f505b83f4b
commit f505b83f4b
--- a/README.md
+++ b/README.md
@ -1,2 +1,91 @@
 # YouTube2PeerTube
-A bot written in Python3 that mirrors YouTube channels to PeerTube channels as videos are released in a YouTube channel.
+
+YouTube2PeerTube is a bot written in Python3 that mirrors YouTube channels to PeerTube channels as videos are released in a YouTube channel.
+
+It checks YouTube channels periodically. When new videos are found, it downloads them with their metadata and uploads them to PeerTube instances.
+
+This tool supports multiple channels. Each YouTube channel can be mirrored to a user-defined PeerTube channel and instance, which can differ for each YouTube channel being mirrored.
+
+This tool does not use YouTube APIs. Instead, it subscribes to channels via RSS. This is a primary feature: this tool will always avoid the YouTube API, and no features will be implemented that require the YouTube API.
+
+If you need to archive a YouTube channel with lots of existing videos, this tool is not for you. This tool starts mirroring channels from the time they are added to the config and will not mirror all historical videos that exist in a YouTube channel. A tool that provides this functionality is available at https://github.com/Chocobozzz/PeerTube/blob/develop/support/doc/tools.md#peertube-import-videosjs
+
+## Dependencies
+
+This tool depends on:
+
+- pafy https://github.com/mps-youtube/pafy for downloading of YouTube content.
+
+- feedparser for parsing of RSS data
+
+- TOML for the configuration file
+
+- MultipartEncoder from requests_toolbelt
+
+- requests (a third-party package) for HTTP requests to the PeerTube API
+
+- urllib.request, mimetypes, shutil, time, json and os from the Python standard library
+
+It also contains heavily modified components from prismedia https://git.lecygnenoir.info/LecygneNoir/prismedia for uploading videos and metadata to PeerTube.
+
+## Configuration
+
+An example configuration file is found at example_config.toml. Copy this to config.toml and replace the fields with your information, and add channels as necessary.
+
+The configuration file is found at config.toml. It allows you to configure the poll frequency for all YouTube channels, download directory for videos and metadata, whether to keep the videos after upload (for archiving purposes) as well as per channel options such as YouTube channel info, corresponding PeerTube channel info and auth, and appendable tags and descriptions.
+
+Each channel is capable of mirroring to a different PeerTube account and instance, and is capable of appending tags and description information on a per channel basis.
+
+All videos and metadata are stored in <video_download_dir> as defined in the config, in a subdirectory with the same name as the channel <name> in the config, one directory per channel. All videos and metadata are named after the YouTube video ID. For each video, there should be 3 files: a video file, a thumbnail (jpg) and a text file containing metadata.
+
+If <delete_videos> is set to "true" (a lowercase string, as in the example config), videos and metadata will be deleted from the download directory after upload.
+
+## Running the bot
+
+To run the bot, simply run youtube2peertube.py. The bot will run indefinitely until stopped.
+
+If you run it inside of a virtual environment, all dependencies are in venv. If you run it outside of a virtual environment you will need to make sure all dependencies are met.
+
+The first time a channel is found in the config, the most recent videos returned by the YouTube RSS endpoint are mirrored, and a new entry is added to channels_timestamps.csv with the timestamp of the last video. Subsequently, each channel is checked for an entry in channels_timestamps.csv, and only videos published later than the last timestamp in the channel's entry are mirrored. The tool decides whether it is the first time a channel is found based on whether it has an entry in channels_timestamps.csv. It is designed this way so that the tool can be stopped and restarted without attempting to upload duplicate videos.
+
+After that, the tool polls all channels in config.toml periodically based on the parameter <poll_frequency> which is in minutes, and mirrors all new videos for each channel as they are found.
+
+Any time the configuration is changed, the bot must be restarted.
+
+## Future improvements
+
+- Auth info for PeerTube channels is currently stored in plaintext in config.toml. This is insecure and needs to be changed.
+
+- It would be better if the tool were to run as a cron job or daemon rather than an ongoing python process.
+
+- Implement logging.
+
+- Sometimes a YouTube video or its metadata is updated after release. It would be nice if the tool were able to update the mirrored video on PeerTube when this happens, and potentially archive or remove the previous version of the video.
+
+- Currently there is no way to abide by the upload cap for PeerTube instances, which can lead to errors. For now, it is recommended that you run your own PeerTube instance with no cap, or select PeerTube instances with caps high enough to account for the upload frequency of the YouTube accounts you are mirroring, and lower the upload resolution if needed (resolution preference is not implemented yet). I would like to implement queue functionality for videos when the cap is reached, but I am unsure when I will implement this.
+
+- Use the language from the YouTube video metadata; currently the tool uses the value from the config.
+
+- A TUI would be nice for adding channels and restarting the bot.
+
+- Transcoding might make the tool more useful. Changing the resolution, codec and container of a video might be something to consider implementing.
+
+- Optionally use the PeerTube API's http import functionality when not saving videos.
+
+See open issues for more details.
+
+Please open issues if you find any problems or if you have a feature request. Pull requests are always welcome; however, feature requests and pull requests will not be implemented if they are out of the scope of the project or if they cause issues with other existing features.
+
+## Thanks!
+
+Thanks to the mps-youtube project https://github.com/mps-youtube for pafy, and thanks to LecygneNoir https://git.lecygnenoir.info/LecygneNoir for the prismedia project. Thank you Tom for TOML and as always, Guido and the Python team.
+
+If you find this tool useful and would like to donate, the following donation options are available:
+
+XMR: 4AeufJrhpQn7LGW5dZ9tH4FFAtfmRwEDvhYrH5GQDbNxQ9VyWKmdycb5naWcvRTqbm3fkyqrDi23x453stDKzu5YVgPfcbj
+BTC legacy: 141HaN7bq781BaB2PRP8mkUndebZXjxiFU
+BTC segwit compatible: bc1qx2fa50av3j9hrjnszsnpflmtxqnz08936mq4xx
+BCH: qzr9gk7tv274x9u9sft243m729zrjnq0cvpzlelapt
+LTC: ltc1qa8re5eh2dklzfhg2x03tswsr5wae68qfxjzacd
+ETH: 0x18304c5ed37dacefc920b291f39b06545b5fc258
+ETC: 0xee3947eec103346ed42302221d99027a59bfa061
+
+Buy me a cup of coffee!
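
The README's description of channels_timestamps.csv (mirror everything in the feed the first time a channel is seen, then only videos published after the stored timestamp) can be sketched as two small pure functions. This is a minimal illustration and not the code in this commit: the names `parse_state` and `select_new` are hypothetical, entries are assumed to be dicts with an ISO 8601 `published` field as in the YouTube feed, and `datetime.fromisoformat` needs Python 3.7 or later.

```python
from datetime import datetime


def parse_state(text):
    """Map channel ID -> last published timestamp from the CSV text."""
    state = {}
    for line in text.splitlines():
        parts = line.strip().split(",")
        # skip blank or malformed lines
        if len(parts) >= 2 and parts[0]:
            state[parts[0]] = parts[1]
    return state


def select_new(entries, channel_id, state):
    """Return the entries to mirror, oldest first, and update state in place."""
    last = state.get(channel_id)
    # no stored timestamp means the channel is seen for the first time
    cutoff = datetime.fromisoformat(last) if last else None
    ordered = sorted(entries, key=lambda e: datetime.fromisoformat(e["published"]))
    queue = [
        e for e in ordered
        if cutoff is None or datetime.fromisoformat(e["published"]) > cutoff
    ]
    if queue:
        # remember the newest video that was queued
        state[channel_id] = queue[-1]["published"]
    return queue
```

Calling `select_new` twice with the same feed returns an empty queue the second time, which is the property that lets the bot be stopped and restarted without uploading duplicates.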
--- a/example_config.toml
+++ b/example_config.toml
@ -0,0 +1,65 @@
+# This is a TOML document
+# This document is a configuration file for yt2pt (Youtube to Peertube) mirror bot
+[global]
+video_download_dir = "/home/m/Desktop/yt2pt_videos/" # must be absolute path and user writable directory
+delete_videos = "false" # Delete videos and metadata after upload to peertube, lowercase string
+poll_frequency = 180 # poll frequency in minutes
+
+# For every channel, a new [channel.x] entry must be added in [channel] in sequential numerical order starting at 0
+# Each [channel] entry must have a name (does not have to match YT channel name),
+# YT channel ID, peertube instance URL, channel, username, password.
+# If you do not wish to append tags or descriptions then leave the quotes empty for those fields.
+[channel]
+    [channel.0]
+    name = "channel_name"
+    channel_id = "channel ID" # YT channel ID at the end of the url youtube.com/channel/<channel_id>
+    peertube_instance = "https://peertube.url" # URL of peertube instance
+    peertube_channel = "peertube_channel" # peertube channel handle to upload video to
+    peertube_username = "user" # peertube username
+    peertube_password = "password" # peertube password WARNING this file needs to be secure
+    pt_channel_category = "10" # category of channel contents. see yt_pt_languages_categories.txt for categories
+    pt_tags = "" # tags to be added to uploaded video in Peertube, comma separated, max 5, between 2 and 30 char each (incomplete)
+    default_lang = "en" # language of the channel, see yt_pt_languages_categories.txt for languages
+    nsfw = "false" # lowercase string, is this channel NSFW?
+    comments_enabled = "true" # lowercase string, do you want comments enabled in this channel?
+    pt_privacy = 1 # 1 = public, 2 = unlisted, 3 = private, privacy for entire channel, default public
+    description_prefix = "" # This description will be added to the beginning of the YT description
+    description_suffix = "" # This description will be appended to the end of the YT description
+    preferred_extension = "mp4" # preferred extension of download and upload
+    max_resolution = "360" # maximum resolution of videos to download (incomplete)
+
+    [channel.1]
+    name = "channel_name"
+    channel_id = "channel ID" # YT channel ID at the end of the url youtube.com/channel/<channel_id>
+    peertube_instance = "https://peertube.url" # URL of peertube instance
+    peertube_channel = "peertube_channel" # peertube channel handle to upload video to
+    peertube_username = "user" # peertube username
+    peertube_password = "password" # peertube password WARNING this file needs to be secure
+    pt_channel_category = "10" # category of channel contents. see yt_pt_languages_categories.txt for categories
+    pt_tags = "" # tags to be added to uploaded video in Peertube, comma separated, max 5, between 2 and 30 char each (incomplete)
+    default_lang = "en" # language of the channel, see yt_pt_languages_categories.txt for languages
+    nsfw = "false" # lowercase string, is this channel NSFW?
+    comments_enabled = "true" # lowercase string, do you want comments enabled in this channel?
+    pt_privacy = 1 # 1 = public, 2 = unlisted, 3 = private, privacy for entire channel, default public
+    description_prefix = "" # This description will be added to the beginning of the YT description
+    description_suffix = "" # This description will be appended to the end of the YT description
+    preferred_extension = "mp4" # preferred extension of download and upload
+    max_resolution = "360" # maximum resolution of videos to download (incomplete)
+
+    [channel.2]
+    name = "channel_name"
+    channel_id = "channel ID" # YT channel ID at the end of the url youtube.com/channel/<channel_id>
+    peertube_instance = "https://peertube.url" # URL of peertube instance
+    peertube_channel = "peertube_channel" # peertube channel handle to upload video to
+    peertube_username = "user" # peertube username
+    peertube_password = "password" # peertube password WARNING this file needs to be secure
+    pt_channel_category = "10" # category of channel contents. see yt_pt_languages_categories.txt for categories
+    pt_tags = "" # tags to be added to uploaded video in Peertube, comma separated, max 5, between 2 and 30 char each (incomplete)
+    default_lang = "en" # language of the channel, see yt_pt_languages_categories.txt for languages
+    nsfw = "false" # lowercase string, is this channel NSFW?
+    comments_enabled = "true" # lowercase string, do you want comments enabled in this channel?
+    pt_privacy = 1 # 1 = public, 2 = unlisted, 3 = private, privacy for entire channel, default public
+    description_prefix = "" # This description will be added to the beginning of the YT description
+    description_suffix = "" # This description will be appended to the end of the YT description
+    preferred_extension = "mp4" # preferred extension of download and upload
+    max_resolution = "360" # maximum resolution of videos to download (incomplete)
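
The comments in example_config.toml state a few constraints: boolean options are lowercase strings, and `pt_tags` is comma separated with at most 5 tags of 2 to 30 characters each. A minimal sketch of checking these values before upload follows. The helper names `parse_bool` and `parse_tags` are hypothetical and are not part of this commit, and raising `ValueError` on bad input is an assumption about how such a check might behave.

```python
def parse_bool(value):
    """Convert the config's lowercase "true"/"false" strings to a bool."""
    if value not in ("true", "false"):
        raise ValueError("expected the lowercase string 'true' or 'false', got %r" % (value,))
    return value == "true"


def parse_tags(pt_tags):
    """Split a comma separated tag string and enforce the documented limits."""
    tags = [t.strip() for t in pt_tags.split(",") if t.strip()]
    if len(tags) > 5:
        raise ValueError("at most 5 tags are allowed, got %d" % len(tags))
    for tag in tags:
        if not 2 <= len(tag) <= 30:
            raise ValueError("tag %r must be between 2 and 30 characters" % (tag,))
    return tags
```

An empty `pt_tags` string yields an empty list, which matches the instruction to leave the quotes empty when no tags should be appended.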
--- a/utils.py
+++ b/utils.py
@ -0,0 +1,16 @@
+import toml
+
+def read_conf(conf_file):
+    # parse the TOML configuration file into a dict; the file is closed automatically
+    with open(conf_file) as f:
+        return toml.loads(f.read())
+
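+def convert_timestamp(timestamp):
+    # turn "YYYY-MM-DDTHH:MM:SS+00:00" into the sortable integer YYYYMMDDHHMMSS
+    # (the UTC offset after "+" is discarded, not applied)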
+    timestamp = timestamp.split('T')
+    date = timestamp[0].split('-')
+    time = timestamp[1].split('+')
+    time = time[0].split(':')
+    timestamp = int(date[0] + date[1] + date[2] + time[0] + time[1] + time[2])
+    return timestamp
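
`convert_timestamp` above drops the UTC offset, so the resulting integers only compare correctly when every timestamp uses the same offset. A hedged alternative that normalizes to UTC first is sketched below. The name `convert_timestamp_utc` is hypothetical, treating a timestamp without an offset as UTC is an assumption, and `datetime.fromisoformat` requires Python 3.7 or later.

```python
from datetime import datetime, timezone


def convert_timestamp_utc(timestamp):
    """Return YYYYMMDDHHMMSS as an int after normalizing the timestamp to UTC."""
    parsed = datetime.fromisoformat(timestamp)
    if parsed.tzinfo is None:
        # assumption: a timestamp without an offset is already in UTC
        parsed = parsed.replace(tzinfo=timezone.utc)
    return int(parsed.astimezone(timezone.utc).strftime("%Y%m%d%H%M%S"))
```

For example, `2019-10-19T01:16:17+02:00` and `2019-10-18T23:16:17+00:00` denote the same instant and map to the same integer.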
--- a/youtube2peertube.py
+++ b/youtube2peertube.py
@ -0,0 +1,245 @@
+#!/usr/bin/python3
+
+import pafy
+import feedparser as fp
+from urllib.request import urlretrieve
+import requests
+import json
+from time import sleep
+from os import mkdir, path
+from shutil import rmtree
+import mimetypes
+from requests_toolbelt.multipart.encoder import MultipartEncoder
+import utils
+
+def get_video_data(channel_id):
+    yt_rss_url = "https://www.youtube.com/feeds/videos.xml?channel_id=" + channel_id
+    feed = fp.parse(yt_rss_url)
+    entries = feed["entries"]
+    channels_timestamps = "channels_timestamps.csv"
+    # clear any existing queue before start
+    queue = []
+    # read contents of channels_timestamps.csv if it exists (it may not on a first run),
+    # create list object of contents
+    ctr = []
+    if path.exists(channels_timestamps):
+        with open(channels_timestamps, "r") as ct:
+            ctr = ct.read().split("\n")
+    ctr_line = []
+    channel_found = False
+    # check if channel ID is found in channels_timestamps.csv
+    for line in ctr:
+        line_list = line.split(',')
+        if channel_id == line_list[0]:
+            channel_found = True
+            ctr_line = line
+            break
+    if not channel_found:
+        print("new channel added to config: " + channel_id)
+    print(channel_id)
+    # iterate through video entries for channel, parse data into objects for use
+    for pos, i in enumerate(reversed(entries)):
+        published = i["published"]
+        updated = i["updated"]
+        if not channel_found:
+            # add the video to the queue
+            queue.append(i)
+            ctr_line = str(channel_id + "," + published + "," + updated + '\n')
+            # add the new line to ctr for adding to channels_timestamps later
+            ctr.append(ctr_line)
+            channel_found = True
+        # if the channel exists in channels_timestamps, update "published" time in the channel line
+        else:
+            published_int = utils.convert_timestamp(published)
+            ctr_line_list = ctr_line.split(",")
+            line_published_int = utils.convert_timestamp(ctr_line_list[1])
+            if published_int > line_published_int:
+                # update the timestamp in the line for the channel in channels_timestamps,
+                ctr.remove(ctr_line)
+                ctr_line = str(channel_id + "," + published + "," + updated + '\n')
+                ctr.append(ctr_line)
+                # and add current videos to queue.
+                queue.append(i)
+        print(published)
+    # write the new channels and timestamps line to channels_timestamps.csv
+    ct = open(channels_timestamps, "w")
+    for line in ctr:
+        if line != '':
+            ct.write(line + "\n")
+    ct.close()
+    return queue
+
+def download_yt_video(e, dl_dir, channel_conf):
+    url = e["link"]
+    dl_dir = dl_dir + channel_conf["name"]
+    try:
+        video = pafy.new(url)
+        streams = video.streams
+        #for s in streams:
+            #print(s.resolution, s.extension, s.get_filesize, s.url)
+        best = video.getbest(preftype=channel_conf["preferred_extension"])
+        filepath = dl_dir + "/"+ e["yt_videoid"] + "." + channel_conf["preferred_extension"]
+        #TODO: implement resolution logic from config, currently downloading best resolution
+        best.download(filepath=filepath, quiet=False)
+
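+    except Exception as err:
+        # report the failure instead of silently swallowing it
+        print("download failed for " + url + ": " + str(err))
+        # TODO: check YT alternate URL for video availability
+        # TODO: log exceptions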
+
+def save_metadata(e, dl_dir, channel_conf):
+    dl_dir = dl_dir + channel_conf["name"]
+    link = e["link"]
+    title = e["title"]
+    description = e["summary"]
+    author = e["author"]
+    published = e["published"]
+    metadata_file = dl_dir + "/" + e["yt_videoid"] + ".txt"
+    metadata = open(metadata_file, "w+")
+    # save relevant metadata as semicolon separated easy to read values to text file
+    metadata.write('title: "' + title + '";\n\nlink: "' + link + '";\n\nauthor: "' + author + '";\n\npublished: "' +
+                   published + '";\n\ndescription: "' + description + '"\n\n;')
+    # save raw metadata JSON string
+    metadata.write(str(e))
+    metadata.close()
+
+def save_thumbnail(e, dl_dir, channel_conf):
+    dl_dir = dl_dir + channel_conf["name"]
+    thumb = str(e["media_thumbnail"][0]["url"])
+    extension = thumb.split(".")[-1]
+    thumb_file = dl_dir + "/" + e["yt_videoid"] + "." + extension
+    # download the thumbnail
+    urlretrieve(thumb, thumb_file)
+    return extension
+
+def get_pt_auth(channel_conf):
+    # get variables from channel_conf
+    pt_api = channel_conf["peertube_instance"] + "/api/v1"
+    pt_uname = channel_conf["peertube_username"]
+    pt_passwd = channel_conf["peertube_password"]
+    # get client ID and secret from peertube instance
+    id_secret = requests.get(pt_api + "/oauth-clients/local").json()
+    client_id = id_secret["client_id"]
+    client_secret = id_secret["client_secret"]
+    # construct JSON for post request to get access token
+    auth_json = {'client_id': client_id,
+                 'client_secret': client_secret,
+                 'grant_type': 'password',
+                 'response_type': 'code',
+                 'username': pt_uname,
+                 'password': pt_passwd
+                 }
+    # get access token
+    auth_result = requests.post(pt_api + "/users/token", data=auth_json).json()
+    access_token = auth_result["access_token"]
+    return access_token
+
+def get_pt_channel_id(channel_conf):
+    pt_api = channel_conf["peertube_instance"] + "/api/v1"
+    post_url = pt_api + "/video-channels/" + channel_conf["peertube_channel"] + "/"
+    returned_json = json.loads(requests.get(post_url).content)
+    channel_id = returned_json["id"]
+    return channel_id
+
+def upload_to_pt(dl_dir, channel_conf, e, access_token, thumb_extension):
+    # Adapted from Prismedia https://git.lecygnenoir.info/LecygneNoir/prismedia
+    pt_api = channel_conf["peertube_instance"] + "/api/v1"
+    video_file = dl_dir + channel_conf["name"] + "/" + e["yt_videoid"] + "." + \
+                 channel_conf["preferred_extension"]
+    thumb_file = dl_dir + channel_conf["name"] + "/" + e["yt_videoid"] + "." + thumb_extension
+
+    def get_file(file_path):
+        mimetypes.init()
+        return (path.basename(file_path), open(path.abspath(file_path), 'rb'),
+                mimetypes.types_map[path.splitext(file_path)[1]])
+
+    description = channel_conf["description_prefix"] + "\n\n" + e["summary"] + "\n\n" + channel_conf["description_suffix"]
+    channel_id = str(get_pt_channel_id(channel_conf))
+    # We need to transform fields into tuples to deal with tags, as
+    # MultipartEncoder does not support lists; refer to
+    # https://github.com/requests/toolbelt/issues/190 and
+    # https://github.com/requests/toolbelt/issues/205
+    fields = [
+        ("name", e["title"]),
+        ("licence", "1"),
+        ("description", description),
+        ("nsfw", channel_conf["nsfw"]),
+        ("channelId", channel_id),
+        ("originallyPublishedAt", e["published"]),
+        ("category", channel_conf["pt_channel_category"]),
+        ("language", channel_conf["default_lang"]),
+        ("privacy", str(channel_conf["pt_privacy"])),
+        ("commentsEnabled", channel_conf["comments_enabled"]),
+        ("videofile", get_file(video_file)),
+        ("thumbnailfile", get_file(thumb_file)),
+        ("previewfile", get_file(thumb_file)),
+        ("waitTranscoding", 'false')
+    ]
+
+    if channel_conf["pt_tags"] != "":
+        fields.append(("tags", "[" + channel_conf["pt_tags"] + "]"))
+    else:
+        print("you have no tags in your configuration file for this channel")
+    multipart_data = MultipartEncoder(fields)
+    headers = {
+        'Content-Type': multipart_data.content_type,
+        'Authorization': "Bearer " + access_token
+    }
+    print(requests.post(pt_api + "/videos/upload", data=multipart_data, headers=headers).content)
+
+def run_steps(conf):
+    # TODO: logging
+    channel = conf["channel"]
+    # run loop for every channel in the configuration file
+    global_conf = conf["global"]
+    # the config stores this flag as a lowercase string
+    delete_videos = global_conf["delete_videos"] == "true"
+    dl_dir = global_conf["video_download_dir"]
+    if not path.exists(dl_dir):
+        mkdir(dl_dir)
+    channel_counter = 0
+    for c in channel:
+        print("\n")
+        # use the same key for the channel's config and its YouTube channel ID
+        channel_conf = channel[c]
+        channel_id = channel_conf["channel_id"]
+        queue = get_video_data(channel_id)
+        if len(queue) > 0:
+            if not path.exists(dl_dir + "/" + channel_conf["name"]):
+                mkdir(dl_dir + "/" + channel_conf["name"])
+            # download videos, metadata and thumbnails from youtube
+            for item in queue:
+                print("downloading " + item["yt_videoid"] + " from YouTube...")
+                download_yt_video(item, dl_dir, channel_conf)
+                print("done.")
+                # TODO: download closest to config specified resolution instead of best resolution
+                thumb_extension = save_thumbnail(item, dl_dir, channel_conf)
+                # only save metadata to text file if archiving videos
+                if not delete_videos:
+                    print("saving video metadata...")
+                    save_metadata(item, dl_dir, channel_conf)
+                    print("done.")
+            access_token = get_pt_auth(channel_conf)
+            # upload videos, metadata and thumbnails to peertube
+            for item in queue:
+                print("uploading " + item["yt_videoid"] + " to Peertube...")
+                upload_to_pt(dl_dir, channel_conf, item, access_token, thumb_extension)
+                print("done.")
+            if delete_videos:
+                print("deleting videos...")
+                rmtree(dl_dir + "/" + channel_conf["name"], ignore_errors=True)
+                print("done")
+        channel_counter += 1
+
+def run(run_once=True):
+    #TODO: turn this into a cron job
+    conf = utils.read_conf("config.toml")
+    if run_once:
+        run_steps(conf)
+    else:
+        while True:
+            poll_frequency = int(conf["global"]["poll_frequency"]) * 60
+            run_steps(conf)
+            sleep(poll_frequency)
+
+if __name__ == "__main__":
+    run(run_once=False)
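
youtube2peertube.py builds file paths by string concatenation, sometimes as `dl_dir + channel_conf["name"]` and sometimes as `dl_dir + "/" + channel_conf["name"]`, so it depends on `video_download_dir` ending in a slash. One possible way to make this independent of the trailing slash is sketched below. The helper `build_paths` and its return shape are hypothetical and are not part of this commit.

```python
import os


def build_paths(dl_dir, channel_name, video_id, video_ext, thumb_ext):
    """Return the per-video file paths inside the channel's download directory."""
    # normpath collapses a trailing or doubled separator in dl_dir
    channel_dir = os.path.normpath(os.path.join(dl_dir, channel_name))
    return {
        "channel_dir": channel_dir,
        "video": os.path.join(channel_dir, video_id + "." + video_ext),
        "thumbnail": os.path.join(channel_dir, video_id + "." + thumb_ext),
        "metadata": os.path.join(channel_dir, video_id + ".txt"),
    }
```

With this helper the download, metadata, thumbnail and upload steps would all read the same paths, whether or not the configured directory ends in a slash.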
--- a/yt_pt_languages_categories.txt
+++ b/yt_pt_languages_categories.txt
@ -0,0 +1,77 @@
+# Adapted from Prismedia https://git.lecygnenoir.info/LecygneNoir/prismedia
+# currently this does nothing. It is for user reference when setting peertube categories and languages in the config.
+# set the category and language in each channel config to the value corresponding to the language or category here.
+
+### CATEGORIES ###
+YOUTUBE_CATEGORY = {
+    "music": 10,
+    "films": 1,
+    "vehicles": 2,
+    "sport": 17,
+    "travels": 19,
+    "gaming": 20,
+    "people": 22,
+    "comedy": 23,
+    "entertainment": 24,
+    "news": 25,
+    "how to": 26,
+    "education": 27,
+    "activism": 29,
+    "science & technology": 28,
+    "science": 28,
+    "technology": 28,
+    "animals": 15
+}
+# for now, use these values in the config file corresponding to the category you want
+PEERTUBE_CATEGORY = {
+    "music": 1,
+    "films": 2,
+    "vehicles": 3,
+    "sport": 5,
+    "travels": 6,
+    "gaming": 7,
+    "people": 8,
+    "comedy": 9,
+    "entertainment": 10,
+    "news": 11,
+    "how to": 12,
+    "education": 13,
+    "activism": 14,
+    "science & technology": 15,
+    "science": 15,
+    "technology": 15,
+    "animals": 16
+}
+
+### LANGUAGES ###
+YOUTUBE_LANGUAGE = {
+    "arabic": 'ar',
+    "english": 'en',
+    "french": 'fr',
+    "german": 'de',
+    "hindi": 'hi',
+    "italian": 'it',
+    "japanese": 'ja',
+    "korean": 'ko',
+    "mandarin": 'zh-CN',
+    "portuguese": 'pt-PT',
+    "punjabi": 'pa',
+    "russian": 'ru',
+    "spanish": 'es'
+}
+# for now, use these values in the config for the language you want
+PEERTUBE_LANGUAGE = {
+    "arabic": "ar",
+    "english": "en",
+    "french": "fr",
+    "german": "de",
+    "hindi": "hi",
+    "italian": "it",
+    "japanese": "ja",
+    "korean": "ko",
+    "mandarin": "zh",
+    "portuguese": "pt",
+    "punjabi": "pa",
+    "russian": "ru",
+    "spanish": "es"
+}