Merge branch 'main' into feat/yt-dlp-pots

pull/222/head
erinhmclark 2025-03-17 16:54:29 +00:00
commit bbe25537c7
27 changed files with 595 additions and 50634 deletions

.gitignore

@ -34,4 +34,5 @@ docs/_build/
docs/source/autoapi/
docs/source/modules/autogen/
scripts/settings_page.html
scripts/settings/src/schema.json
.vite


@ -21,7 +21,7 @@ build:
# generate the config editor page. Schema then HTML
- VIRTUAL_ENV=$READTHEDOCS_VIRTUALENV_PATH poetry run python scripts/generate_settings_schema.py
# install node dependencies and build the settings
- cd scripts/settings && npm install && npm run build && yes | cp dist/index.html ../../docs/source/installation/settings_base.html && cd ../..
- cd scripts/settings && npm install && npm run build && yes | cp -v dist/index.html ../../docs/source/installation/settings.html && cd ../..
sphinx:


@ -29,7 +29,7 @@ View the [Installation Guide](https://auto-archiver.readthedocs.io/en/latest/ins
To get started quickly using Docker:
`docker pull bellingcat/auto-archiver && docker run --rm -v secrets:/app/secrets bellingcat/auto-archiver --config secrets/orchestration.yaml`
`docker pull bellingcat/auto-archiver && docker run -it --rm -v secrets:/app/secrets bellingcat/auto-archiver --config secrets/orchestration.yaml`
Or pip:


@ -36,3 +36,12 @@ open docs/_build/html/index.html
sphinx-autobuild docs/source docs/_build/html
```
### Managing Readthedocs (RTD) Versions
Version management is done at [https://app.readthedocs.org/projects/auto-archiver/](https://app.readthedocs.org/projects/auto-archiver/)
(login required). Once logged in, you can create new versions, delete old versions, or change the visibility of versions. More info in the
[RTD versions documentation](https://docs.readthedocs.com/platform/stable/versions.html).
Currently, the Auto Archiver project is set up to automatically create a new docs version for each `vX.Y.Z` release. For more on this,
see the RTD [instructions on automation](https://docs.readthedocs.com/platform/stable/guides/automation-rules.html) or edit the existing automation rule in the project settings.


@ -86,7 +86,7 @@ gsheet_feeder_db:
You can also pass these settings directly on the command line without having to edit the file. Here's an example of how to do that (using docker):
`docker run --rm -v $PWD/secrets:/app/secrets -v $PWD/local_archive:/app/local_archive bellingcat/auto-archiver:dockerize --gsheet_feeder_db.sheet "My Awesome Sheet 2"`.
`docker run -it --rm -v $PWD/secrets:/app/secrets -v $PWD/local_archive:/app/local_archive bellingcat/auto-archiver:dockerize --gsheet_feeder_db.sheet "My Awesome Sheet 2"`.
Here, the sheet name has been overridden/specified in the command line invocation.
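For reference, the same override can live in the configuration file rather than on the command line. A minimal sketch, assuming the `gsheet_feeder_db` block shown above accepts a `sheet` key as the CLI flag implies:
```yaml
# orchestration.yaml (sketch) - the same override kept in the config file
gsheet_feeder_db:
  sheet: "My Awesome Sheet 2"
```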


@ -0,0 +1,60 @@
# Frequently Asked Questions
### Q: What websites does the Auto Archiver support?
**A:** The Auto Archiver works for a large variety of sites. Firstly, the Auto Archiver can download
and archive any video website supported by YT-DLP, a powerful video-downloading tool ([full list of
supported sites here](https://github.com/yt-dlp/yt-dlp/blob/master/supportedsites.md)). Aside from these sites,
there are various 'Extractors' for specific websites. See the full list of available extractors
on the [extractors](../modules/extractor.md) page. Some supported sites include:
* Twitter
* Instagram
* Telegram
* VKontakte
* TikTok
* Bluesky
```{note} What websites the Auto Archiver can archive depends on what extractors you have enabled in
your configuration. See [configuration](./configurations.md) for more info.
```
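As a rough illustration, extractors are enabled under `steps` in your `orchestration.yaml`. The module names below are examples only; check the extractors page for the exact names available:
```yaml
# orchestration.yaml (sketch) - module names are illustrative
steps:
  extractors:
  - generic_extractor           # yt-dlp based extractor for video and social media sites
  - wacz_extractor_enricher     # full-page web archive (WACZ)
  - wayback_extractor_enricher  # snapshot via the Wayback Machine
```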
### Q: Does the Auto Archiver only work for social media posts?
**A:** No, the Auto Archiver can archive any web page on the internet, not just social media posts.
However, for social media posts the Auto Archiver can extract more relevant/useful information (such as
post comments, likes, author, etc.) which may not be available for a generic website. If you are looking
to archive webpages more generally, then you should make sure to enable the [](../modules/autogen/extractor/wacz_extractor_enricher.md)
and the [](../modules/autogen/extractor/wayback_extractor_enricher.md).
### Q: What kind of data is stored for each webpage that's archived?
**A:** This depends on the website being archived, but generally, for social media posts any videos and photos in
the post will be archived. For video sites, the video will be downloaded separately. For most of these sites, additional
metadata such as published date, uploader/author and ratings/comments will also be saved. Additionally, further data can be
saved depending on the enrichers that you have enabled. Some other types of data saved are timestamps if you have the
[](../modules/autogen/enricher/timestamping_enricher.md) or [](../modules/autogen/enricher/opentimestamps_enricher.md) enabled,
screenshots of the web page with the [](../modules/autogen/enricher/screenshot_enricher.md), and for videos, thumbnails of the
video with the [](../modules/autogen/enricher/thumbnail_enricher.md). You can also store things like hashes (SHA256 or pdq hashes)
with the various hash enrichers.
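To make this concrete, here is a hedged sketch of enabling a few of these enrichers in `orchestration.yaml`; `hash_enricher` is an assumed module name, while the others come from the module pages linked above:
```yaml
# orchestration.yaml (sketch)
steps:
  enrichers:
  - hash_enricher         # assumed name for the SHA256/pdq hashing module
  - screenshot_enricher   # screenshots of the web page
  - thumbnail_enricher    # thumbnails for downloaded videos
  - timestamping_enricher # trusted timestamps
```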
### Q: Where is my data stored?
**A:** With the default configuration, data is stored on your local computer in the `local_storage` folder. You can adjust these settings by
changing the [storage modules](../modules/storage.md) you have enabled. For example, you could choose to store your data in an S3 bucket or
on Google Drive.
```{note}
You can choose to store your data in multiple places, for example your local drive **and** an S3 bucket for redundancy.
```
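A minimal sketch of such a redundant setup, assuming `local_storage` and `s3_storage` are the module names (check the storage modules page for the exact names and their required settings):
```yaml
# orchestration.yaml (sketch) - keep a local copy and an S3 copy
steps:
  storages:
  - local_storage
  - s3_storage
```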
### Q: What should I do if something doesn't work?
**A:** First, read through the log files to see if you can find a specific reason why something isn't working. Learn more about logging
and how to enable debug logging in the [Logging Howto](../how_to/logging.md).
If you cannot find an answer in the logs, then try searching this documentation or existing / closed issues on the [Github Issue Tracker](https://github.com/bellingcat/auto-archiver/issues?q=is%3Aissue%20). If you still cannot find an answer, then consider opening an issue on the Github Issue Tracker or asking in the Bellingcat Discord
'Auto Archiver' group.
#### Common reasons why archiving might not work:
* The website may have temporarily adjusted its settings - sometimes sites like Telegram or Twitter adjust their scraping protection settings. Often,
waiting a day or two and then trying again can work.
* The site requires you to be logged in - you could try using cookies or authentication to bypass any blocks. See [](../installation/authentication.md) for more information, and the configuration sketch after this list.
* The website you're trying to archive has changed its settings/structure. Make sure you're using the latest version of Auto Archiver and try again.
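For the logged-in case above, credentials or cookies live under an `authentication` block in the configuration. The keys below are assumptions for illustration only; see [](../installation/authentication.md) for the real schema:
```yaml
# orchestration.yaml (sketch) - keys are assumptions, see the authentication page
authentication:
  twitter.com:
    cookies_file: secrets/twitter_cookies.txt
```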


@ -1,5 +1,11 @@
# Installation
```{toctree}
:maxdepth: 1
upgrading.md
```
There are 3 main ways to use the auto-archiver. We recommend the 'docker' method for most uses. This installs all the requirements in one command.
1. Easiest (recommended): [via docker](#installing-with-docker)

File diff suppressed because one or more lines are too long


@ -1,7 +1,6 @@
# Getting Started
```{toctree}
:maxdepth: 1
:hidden:
installation.md
@ -9,6 +8,7 @@ configurations.md
config_editor.md
authentication.md
requirements.md
faq.md
config_cheatsheet.md
```
@ -27,17 +27,18 @@ The way you run the Auto Archiver depends on how you installed it (docker instal
If you installed Auto Archiver using docker, open up your terminal, and copy-paste / type the following command:
```bash
docker run --rm -v $PWD/secrets:/app/secrets -v $PWD/local_archive:/app/local_archive bellingcat/auto-archiver
docker run -it --rm -v $PWD/secrets:/app/secrets -v $PWD/local_archive:/app/local_archive bellingcat/auto-archiver
```
Breaking this command down:
1. `docker run` tells docker to start a new container (an instance of the image)
2. `--rm` makes sure this container is removed after execution (less garbage locally)
3. `-v $PWD/secrets:/app/secrets` - your secrets folder with settings
2. `-it` tells docker to run in 'interactive mode' so that we get nice colour logs
3. `--rm` makes sure this container is removed after execution (less garbage locally)
4. `-v $PWD/secrets:/app/secrets` - your secrets folder with settings
1. `-v` is a volume flag which means a folder that you have on your computer will be connected to a folder inside the docker container
2. `$PWD/secrets` points to a `secrets/` folder in your current working directory (where your console points to); we use this folder as a best practice to hold all the secrets/tokens/passwords/... you use
3. `/app/secrets` points to the path inside the docker container where this folder will be available
4. `-v $PWD/local_archive:/app/local_archive` - (optional) if you use local_storage
5. `-v $PWD/local_archive:/app/local_archive` - (optional) if you use local_storage
1. `-v` same as above, this is a volume instruction
2. `$PWD/local_archive` is a folder `local_archive/` in case you want to archive locally and have the files accessible outside docker
3. `/app/local_archive` is a folder inside docker that you can reference in your orchestration.yml file
@ -48,14 +49,14 @@ The invocations below will run the auto-archiver Docker image using a configurat
```bash
# Have auto-archiver run with the default settings, generating a settings file in ./secrets/orchestration.yaml
docker run --rm -v $PWD/secrets:/app/secrets -v $PWD/local_archive:/app/local_archive bellingcat/auto-archiver
docker run -it --rm -v $PWD/secrets:/app/secrets -v $PWD/local_archive:/app/local_archive bellingcat/auto-archiver
# uses the same configuration, but with the `gsheet_feeder`, a header on row 2 and with some different column names
# Note this expects you to have followed the [Google Sheets setup](how_to/google_sheets.md) and added your service_account.json to the `secrets/` folder
# notice that columns is a dictionary so you need to pass it as JSON and it will override only the values provided
docker run --rm -v $PWD/secrets:/app/secrets -v $PWD/local_archive:/app/local_archive bellingcat/auto-archiver --feeders=gsheet_feeder --gsheet_feeder.sheet="use it on another sheets doc" --gsheet_feeder.header=2 --gsheet_feeder.columns='{"url": "link"}'
docker run -it --rm -v $PWD/secrets:/app/secrets -v $PWD/local_archive:/app/local_archive bellingcat/auto-archiver --feeders=gsheet_feeder --gsheet_feeder.sheet="use it on another sheets doc" --gsheet_feeder.header=2 --gsheet_feeder.columns='{"url": "link"}'
# Runs auto-archiver for the first time, but in 'full' mode, enabling all modules to get a full settings file
docker run --rm -v $PWD/secrets:/app/secrets -v $PWD/local_archive:/app/local_archive bellingcat/auto-archiver --mode full
docker run -it --rm -v $PWD/secrets:/app/secrets -v $PWD/local_archive:/app/local_archive bellingcat/auto-archiver --mode full
```
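The same overrides can be kept in the configuration file instead of being passed on the command line each time. A sketch, assuming each `--module.key` flag maps to a nested key in `orchestration.yaml`:
```yaml
# orchestration.yaml (sketch) - equivalent of the gsheet_feeder flags above
steps:
  feeders:
  - gsheet_feeder
gsheet_feeder:
  sheet: "use it on another sheets doc"
  header: 2
  columns:
    url: link
```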
------------


@ -0,0 +1,30 @@
# Upgrading
If an update is available, then you will see a message in the logs when you
run Auto Archiver. Here's what those logs look like:
```{code} bash
********* IMPORTANT: UPDATE AVAILABLE ********
A new version of auto-archiver is available (v0.13.6, you have 0.13.4)
Make sure to update to the latest version using: `pip install --upgrade auto-archiver`
```
How you upgrade Auto Archiver depends on the way you installed it.
## Docker
To upgrade using docker, update the docker image with:
```
docker pull bellingcat/auto-archiver:latest
```
## Pip
To upgrade the pip package, use:
```
pip install --upgrade auto-archiver
```


@ -59,4 +59,5 @@ output_schema = {
current_file_dir = os.path.dirname(os.path.abspath(__file__))
output_file = os.path.join(current_file_dir, "settings/src/schema.json")
with open(output_file, "w") as file:
print(f"Writing schema to {output_file}")
json.dump(output_schema, file, indent=4, cls=SchemaEncoder)


@ -12,7 +12,7 @@
"@dnd-kit/sortable": "^10.0.0",
"@emotion/react": "latest",
"@emotion/styled": "latest",
"@mui/icons-material": "latest",
"@mui/icons-material": "^6.4.7",
"@mui/material": "latest",
"react": "19.0.0",
"react-dom": "19.0.0",
@ -997,9 +997,9 @@
}
},
"node_modules/@mui/core-downloads-tracker": {
"version": "6.4.6",
"resolved": "https://registry.npmjs.org/@mui/core-downloads-tracker/-/core-downloads-tracker-6.4.6.tgz",
"integrity": "sha512-rho5Q4IscbrVmK9rCrLTJmjLjfH6m/NcqKr/mchvck0EIXlyYUB9+Z0oVmkt/+Mben43LMRYBH8q/Uzxj/c4Vw==",
"version": "6.4.7",
"resolved": "https://registry.npmjs.org/@mui/core-downloads-tracker/-/core-downloads-tracker-6.4.7.tgz",
"integrity": "sha512-XjJrKFNt9zAKvcnoIIBquXyFyhfrHYuttqMsoDS7lM7VwufYG4fAPw4kINjBFg++fqXM2BNAuWR9J7XVIuKIKg==",
"license": "MIT",
"funding": {
"type": "opencollective",
@ -1007,9 +1007,9 @@
}
},
"node_modules/@mui/icons-material": {
"version": "6.4.6",
"resolved": "https://registry.npmjs.org/@mui/icons-material/-/icons-material-6.4.6.tgz",
"integrity": "sha512-rGJBvIQQbQAlyKYljHQ8wAQS/K2/uYwvemcpygnAmCizmCI4zSF9HQPuiG8Ql4YLZ6V/uKjA3WHIYmF/8sV+pQ==",
"version": "6.4.7",
"resolved": "https://registry.npmjs.org/@mui/icons-material/-/icons-material-6.4.7.tgz",
"integrity": "sha512-Rk8cs9ufQoLBw582Rdqq7fnSXXZTqhYRbpe1Y5SAz9lJKZP3CIdrj0PfG8HJLGw1hrsHFN/rkkm70IDzhJsG1g==",
"license": "MIT",
"dependencies": {
"@babel/runtime": "^7.26.0"
@ -1022,7 +1022,7 @@
"url": "https://opencollective.com/mui-org"
},
"peerDependencies": {
"@mui/material": "^6.4.6",
"@mui/material": "^6.4.7",
"@types/react": "^17.0.0 || ^18.0.0 || ^19.0.0",
"react": "^17.0.0 || ^18.0.0 || ^19.0.0"
},
@ -1033,14 +1033,14 @@
}
},
"node_modules/@mui/material": {
"version": "6.4.6",
"resolved": "https://registry.npmjs.org/@mui/material/-/material-6.4.6.tgz",
"integrity": "sha512-6UyAju+DBOdMogfYmLiT3Nu7RgliorimNBny1pN/acOjc+THNFVE7hlxLyn3RDONoZJNDi/8vO4AQQr6dLAXqA==",
"version": "6.4.7",
"resolved": "https://registry.npmjs.org/@mui/material/-/material-6.4.7.tgz",
"integrity": "sha512-K65StXUeGAtFJ4ikvHKtmDCO5Ab7g0FZUu2J5VpoKD+O6Y3CjLYzRi+TMlI3kaL4CL158+FccMoOd/eaddmeRQ==",
"license": "MIT",
"dependencies": {
"@babel/runtime": "^7.26.0",
"@mui/core-downloads-tracker": "^6.4.6",
"@mui/system": "^6.4.6",
"@mui/core-downloads-tracker": "^6.4.7",
"@mui/system": "^6.4.7",
"@mui/types": "^7.2.21",
"@mui/utils": "^6.4.6",
"@popperjs/core": "^2.11.8",
@ -1061,7 +1061,7 @@
"peerDependencies": {
"@emotion/react": "^11.5.0",
"@emotion/styled": "^11.3.0",
"@mui/material-pigment-css": "^6.4.6",
"@mui/material-pigment-css": "^6.4.7",
"@types/react": "^17.0.0 || ^18.0.0 || ^19.0.0",
"react": "^17.0.0 || ^18.0.0 || ^19.0.0",
"react-dom": "^17.0.0 || ^18.0.0 || ^19.0.0"
@ -1143,9 +1143,9 @@
}
},
"node_modules/@mui/system": {
"version": "6.4.6",
"resolved": "https://registry.npmjs.org/@mui/system/-/system-6.4.6.tgz",
"integrity": "sha512-FQjWwPec7pMTtB/jw5f9eyLynKFZ6/Ej9vhm5kGdtmts1z5b7Vyn3Rz6kasfYm1j2TfrfGnSXRvvtwVWxjpz6g==",
"version": "6.4.7",
"resolved": "https://registry.npmjs.org/@mui/system/-/system-6.4.7.tgz",
"integrity": "sha512-7wwc4++Ak6tGIooEVA9AY7FhH2p9fvBMORT4vNLMAysH3Yus/9B9RYMbrn3ANgsOyvT3Z7nE+SP8/+3FimQmcg==",
"license": "MIT",
"dependencies": {
"@babel/runtime": "^7.26.0",


@ -13,7 +13,7 @@
"@dnd-kit/sortable": "^10.0.0",
"@emotion/react": "latest",
"@emotion/styled": "latest",
"@mui/icons-material": "latest",
"@mui/icons-material": "^6.4.7",
"@mui/material": "latest",
"react": "19.0.0",
"react-dom": "19.0.0",


@ -4,7 +4,7 @@ import Container from '@mui/material/Container';
import Typography from '@mui/material/Typography';
import Box from '@mui/material/Box';
import FileUploadIcon from '@mui/icons-material/FileUpload';
//
import {
DndContext,
closestCenter,
@ -204,7 +204,7 @@ function ModuleTypes({ stepType, setEnabledModules, enabledModules, configValues
{stepType}
</Typography>
<Typography variant="body1" >
Select the <a href="<a href={`https://auto-archiver.readthedocs.io/en/latest/modules/${stepType.slice(0,-1)}.html`}" target="_blank">{stepType}</a> you wish to enable. Drag to reorder.
Select the <a href={`https://auto-archiver.readthedocs.io/en/latest/modules/${stepType.slice(0,-1)}.html`} target="_blank">{stepType}</a> you wish to enable. Drag to reorder.
</Typography>
</Box>
{showError ? <Typography variant="body1" color="error" >Only one {stepType.slice(0,-1)} can be enabled at a time.</Typography> : null}


@ -6,7 +6,7 @@ import { viteSingleFile } from "vite-plugin-singlefile"
export default defineConfig({
plugins: [react(), viteSingleFile()],
build: {
minify: false,
sourcemap: true,
// minify: false,
// sourcemap: true,
}
});


@ -8,6 +8,7 @@ flexible setup in various environments.
import argparse
from ruamel.yaml import YAML, CommentedMap
import json
import os
from loguru import logger
@ -230,6 +231,10 @@ def read_yaml(yaml_filename: str) -> CommentedMap:
def store_yaml(config: CommentedMap, yaml_filename: str) -> None:
config_to_save = deepcopy(config)
## if the save path is the default location (secrets) then create the 'secrets' folder
if os.path.dirname(yaml_filename) == "secrets":
os.makedirs("secrets", exist_ok=True)
auth_dict = config_to_save.get("authentication", {})
if auth_dict and auth_dict.get("load_from_file"):
# remove all other values from the config, don't want to store it in the config file


@ -112,7 +112,7 @@ class ArchivingOrchestrator:
def check_steps(self, config):
for module_type in MODULE_TYPES:
if not config["steps"].get(f"{module_type}s", []):
if module_type == "feeder" or module_type == "formatter" and config["steps"].get(f"{module_type}"):
if (module_type == "feeder" or module_type == "formatter") and config["steps"].get(f"{module_type}"):
raise SetupError(
f"It appears you have '{module_type}' set under 'steps' in your configuration file, but as of version 0.13.0 of Auto Archiver, you must use '{module_type}s'. Change this in your configuration file and try again. \
Here's how that would look: \n\nsteps:\n {module_type}s:\n - [your_{module_type}_name_here]\n {'extractors:...' if module_type == 'feeder' else '...'}\n"
@ -377,7 +377,8 @@ Here's how that would look: \n\nsteps:\n extractors:\n - [your_extractor_name_
try:
loaded_module: BaseModule = self.module_factory.get_module(module, self.config)
except (KeyboardInterrupt, Exception) as e:
logger.error(f"Error during setup of modules: {e}\n{traceback.format_exc()}")
if not isinstance(e, KeyboardInterrupt) and not isinstance(e, SetupError):
logger.error(f"Error during setup of modules: {e}\n{traceback.format_exc()}")
if loaded_module and module_type == "extractor":
loaded_module.cleanup()
raise e


@ -2,13 +2,14 @@ from loguru import logger
from auto_archiver.core.feeder import Feeder
from auto_archiver.core.metadata import Metadata
from auto_archiver.core.consts import SetupError
class CLIFeeder(Feeder):
def setup(self) -> None:
self.urls = self.config["urls"]
if not self.urls:
raise ValueError(
raise SetupError(
"No URLs provided. Please provide at least one URL via the command line, or set up an alternative feeder. Use --help for more information."
)


@ -15,6 +15,9 @@ supported by `yt-dlp`, such as YouTube, Facebook, and others. It provides functi
for retrieving videos, subtitles, comments, and other metadata, and it integrates with
the broader archiving framework.
For a full list of video platforms supported by `yt-dlp`, see the
[official documentation](https://github.com/yt-dlp/yt-dlp/blob/master/supportedsites.md)
### Features
- Supports downloading videos and playlists.
- Retrieves metadata like titles, descriptions, upload dates, and durations.


@ -1,3 +1,4 @@
from typing import Type
from yt_dlp.extractor.common import InfoExtractor
from auto_archiver.core.metadata import Metadata
from auto_archiver.core.extractor import Extractor
@ -24,6 +25,8 @@ class GenericDropin:
"""
extractor: Type[Extractor] = None
def extract_post(self, url: str, ie_instance: InfoExtractor):
"""
This method should return the post data from the url.
@ -55,3 +58,10 @@ class GenericDropin:
This method should download any additional media from the post.
"""
return metadata
def is_suitable(self, url, info_extractor: InfoExtractor):
"""
Used to override the InfoExtractor's 'is_suitable' method. Dropins should override this method to return True if the url is suitable for the extractor
(based on being able to parse other URLs)
"""
return False


@ -1,17 +1,154 @@
import re
from .dropin import GenericDropin
from auto_archiver.core.metadata import Metadata
from yt_dlp.extractor.facebook import FacebookIE
# TODO: Remove if / when https://github.com/yt-dlp/yt-dlp/pull/12275 is merged
from yt_dlp.utils import (
clean_html,
get_element_by_id,
traverse_obj,
get_first,
merge_dicts,
int_or_none,
parse_count,
)
def _extract_metadata(self, webpage, video_id):
post_data = [
self._parse_json(j, video_id, fatal=False)
for j in re.findall(r"data-sjs>({.*?ScheduledServerJS.*?})</script>", webpage)
]
post = (
traverse_obj(
post_data,
(..., "require", ..., ..., ..., "__bbox", "require", ..., ..., ..., "__bbox", "result", "data"),
expected_type=dict,
)
or []
)
media = traverse_obj(
post,
(
...,
"attachments",
...,
lambda k, v: (k == "media" and str(v["id"]) == video_id and v["__typename"] == "Video"),
),
expected_type=dict,
)
title = get_first(media, ("title", "text"))
description = get_first(media, ("creation_story", "comet_sections", "message", "story", "message", "text"))
page_title = title or self._html_search_regex(
(
r'<h2\s+[^>]*class="uiHeaderTitle"[^>]*>(?P<content>[^<]*)</h2>',
r'(?s)<span class="fbPhotosPhotoCaption".*?id="fbPhotoPageCaption"><span class="hasCaption">(?P<content>.*?)</span>',
self._meta_regex("og:title"),
self._meta_regex("twitter:title"),
r"<title>(?P<content>.+?)</title>",
),
webpage,
"title",
default=None,
group="content",
)
description = description or self._html_search_meta(
["description", "og:description", "twitter:description"], webpage, "description", default=None
)
uploader_data = (
get_first(media, ("owner", {dict}))
or get_first(
post, ("video", "creation_story", "attachments", ..., "media", lambda k, v: k == "owner" and v["name"])
)
or get_first(post, (..., "video", lambda k, v: k == "owner" and v["name"]))
or get_first(post, ("node", "actors", ..., {dict}))
or get_first(post, ("event", "event_creator", {dict}))
or get_first(post, ("video", "creation_story", "short_form_video_context", "video_owner", {dict}))
or {}
)
uploader = uploader_data.get("name") or (
clean_html(get_element_by_id("fbPhotoPageAuthorName", webpage))
or self._search_regex(
(r'ownerName\s*:\s*"([^"]+)"', *self._og_regexes("title")), webpage, "uploader", fatal=False
)
)
timestamp = int_or_none(self._search_regex(r'<abbr[^>]+data-utime=["\'](\d+)', webpage, "timestamp", default=None))
thumbnail = self._html_search_meta(["og:image", "twitter:image"], webpage, "thumbnail", default=None)
# some webpages contain unretrievable thumbnail urls
# like https://lookaside.fbsbx.com/lookaside/crawler/media/?media_id=10155168902769113&get_thumbnail=1
# in https://www.facebook.com/yaroslav.korpan/videos/1417995061575415/
if thumbnail and not re.search(r"\.(?:jpg|png)", thumbnail):
thumbnail = None
info_dict = {
"description": description,
"uploader": uploader,
"uploader_id": uploader_data.get("id"),
"timestamp": timestamp,
"thumbnail": thumbnail,
"view_count": parse_count(
self._search_regex(
(r'\bviewCount\s*:\s*["\']([\d,.]+)', r'video_view_count["\']\s*:\s*(\d+)'),
webpage,
"view count",
default=None,
)
),
"concurrent_view_count": get_first(
post, (("video", (..., ..., "attachments", ..., "media")), "liveViewerCount", {int_or_none})
),
**traverse_obj(
post,
(
lambda _, v: video_id in v["url"],
"feedback",
{
"like_count": ("likers", "count", {int}),
"comment_count": ("total_comment_count", {int}),
"repost_count": ("share_count_reduced", {parse_count}),
},
),
get_all=False,
),
}
info_json_ld = self._search_json_ld(webpage, video_id, default={})
info_json_ld["title"] = (
re.sub(r"\s*\|\s*Facebook$", "", title or info_json_ld.get("title") or page_title or "")
or (description or "").replace("\n", " ")
or f"Facebook video #{video_id}"
)
return merge_dicts(info_json_ld, info_dict)
class Facebook(GenericDropin):
def extract_post(self, url: str, ie_instance):
video_id = ie_instance._match_valid_url(url).group("id")
ie_instance._download_webpage(url.replace("://m.facebook.com/", "://www.facebook.com/"), video_id)
webpage = ie_instance._download_webpage(url, ie_instance._match_valid_url(url).group("id"))
def extract_post(self, url: str, ie_instance: FacebookIE):
post_id_regex = r"(?P<id>pfbid[A-Za-z0-9]+|\d+|t\.(\d+\/\d+))"
post_id = re.search(post_id_regex, url).group("id")
webpage = ie_instance._download_webpage(url.replace("://m.facebook.com/", "://www.facebook.com/"), post_id)
# TODO: fix once https://github.com/yt-dlp/yt-dlp/pull/12275 is merged
post_data = ie_instance._extract_metadata(webpage)
# TODO: For long posts, this _extract_metadata only seems to return the first 100 or so characters, followed by ...
# TODO: If/when https://github.com/yt-dlp/yt-dlp/pull/12275 is merged, uncomment next line and delete the one after
# post_data = ie_instance._extract_metadata(webpage, post_id)
post_data = _extract_metadata(ie_instance, webpage, post_id)
return post_data
def create_metadata(self, post: dict, ie_instance, archiver, url):
metadata = archiver.create_metadata(url)
metadata.set_title(post.get("title")).set_content(post.get("description")).set_post_data(post)
return metadata
def create_metadata(self, post: dict, ie_instance: FacebookIE, archiver, url):
result = Metadata()
result.set_content(post.get("description", ""))
result.set_title(post.get("title", ""))
result.set("author", post.get("uploader", ""))
result.set_url(url)
return result
def is_suitable(self, url, info_extractor: FacebookIE):
regex = r"(?:https?://(?:[\w-]+\.)?(?:facebook\.com||facebookwkhpilnemxj7asaniu7vnjjbiltxjqhye3mhbshg7kx5tfyd\.onion)/)"
return re.match(regex, url)
def skip_ytdlp_download(self, url: str, is_instance: FacebookIE):
"""
Skip using the ytdlp download method for Facebook *photo* posts, they have a URL with an id of t.XXXXX/XXXXX
"""
if re.search(r"/t.\d+/\d+", url):
return True


@ -67,8 +67,18 @@ class GenericExtractor(Extractor):
"""
Returns a list of valid extractors for the given URL"""
for info_extractor in yt_dlp.YoutubeDL()._ies.values():
if info_extractor.suitable(url) and info_extractor.working():
if not info_extractor.working():
continue
# check if there's a dropin and see if that declares whether it's suitable
dropin = self.dropin_for_name(info_extractor.ie_key())
if dropin and dropin.is_suitable(url, info_extractor):
yield info_extractor
continue
if info_extractor.suitable(url):
yield info_extractor
continue
def suitable(self, url: str) -> bool:
"""
@ -188,9 +198,13 @@ class GenericExtractor(Extractor):
result = self.download_additional_media(video_data, info_extractor, result)
# keep both 'title' and 'fulltitle', but prefer 'title', falling back to 'fulltitle' if it doesn't exist
result.set_title(video_data.pop("title", video_data.pop("fulltitle", "")))
result.set_url(url)
if "description" in video_data:
if not result.get_title():
result.set_title(video_data.pop("title", video_data.pop("fulltitle", "")))
if not result.get("url"):
result.set_url(url)
if "description" in video_data and not result.get_content():
result.set_content(video_data["description"])
# extract comments if enabled
if self.comments:
@ -207,10 +221,10 @@ class GenericExtractor(Extractor):
)
# then add the common metadata
if timestamp := video_data.pop("timestamp", None):
if (timestamp := video_data.pop("timestamp", None)) and not result.get("timestamp"):
timestamp = datetime.datetime.fromtimestamp(timestamp, tz=datetime.timezone.utc).isoformat()
result.set_timestamp(timestamp)
if upload_date := video_data.pop("upload_date", None):
if (upload_date := video_data.pop("upload_date", None)) and not result.get("upload_date"):
upload_date = datetime.datetime.strptime(upload_date, "%Y%m%d").replace(tzinfo=datetime.timezone.utc)
result.set("upload_date", upload_date)
@ -240,7 +254,8 @@ class GenericExtractor(Extractor):
return False
post_data = dropin.extract_post(url, ie_instance)
return dropin.create_metadata(post_data, ie_instance, self, url)
result = dropin.create_metadata(post_data, ie_instance, self, url)
return self.add_metadata(post_data, info_extractor, url, result)
def get_metadata_for_video(
self, data: dict, info_extractor: Type[InfoExtractor], url: str, ydl: yt_dlp.YoutubeDL
@ -296,6 +311,7 @@ class GenericExtractor(Extractor):
def _load_dropin(dropin):
dropin_class = getattr(dropin, dropin_class_name)()
dropin.extractor = self
return self._dropins.setdefault(dropin_name, dropin_class)
try:
@ -340,7 +356,7 @@ class GenericExtractor(Extractor):
dropin_submodule = self.dropin_for_name(info_extractor.ie_key())
try:
if dropin_submodule and dropin_submodule.skip_ytdlp_download(info_extractor, url):
if dropin_submodule and dropin_submodule.skip_ytdlp_download(url, info_extractor):
logger.debug(f"Skipping using ytdlp to download files for {info_extractor.ie_key()}")
raise SkipYtdlp()
@ -359,7 +375,7 @@ class GenericExtractor(Extractor):
if not isinstance(e, SkipYtdlp):
logger.debug(
f'Issue using "{info_extractor.IE_NAME}" extractor to download video (error: {repr(e)}), attempting to use extractor to get post data instead'
f'Issue using "{info_extractor.IE_NAME}" extractor to download video (error: {repr(e)}), attempting to use dropin to get post data instead'
)
try:


@ -38,6 +38,9 @@ class Tiktok(GenericDropin):
api_data["video_url"] = video_url
return api_data
def keys_to_clean(self, video_data: dict, info_extractor):
return ["video_url", "title", "create_time", "author", "cover", "origin_cover", "ai_dynamic_cover", "duration"]
def create_metadata(self, post: dict, ie_instance, archiver, url):
# prepare result, start by downloading video
result = Metadata()
@ -54,17 +57,17 @@ class Tiktok(GenericDropin):
logger.error(f"failed to download video from {video_url}")
return False
video_media = Media(video_downloaded)
if duration := post.pop("duration", None):
if duration := post.get("duration", None):
video_media.set("duration", duration)
result.add_media(video_media)
# add remaining metadata
result.set_title(post.pop("title", ""))
result.set_title(post.get("title", ""))
if created_at := post.pop("create_time", None):
if created_at := post.get("create_time", None):
result.set_timestamp(datetime.fromtimestamp(created_at, tz=timezone.utc))
if author := post.pop("author", None):
if author := post.get("author", None):
result.set("author", author)
result.set("api_data", post)


@ -20,7 +20,7 @@
"save_absolute": {
"default": False,
"type": "bool",
"help": "whether the path to the stored file is absolute or relative in the output result inc. formatters (WARN: leaks the file structure)",
"help": "whether the path to the stored file is absolute or relative in the output result inc. formatters (Warning: saving an absolute path will show your computer's file structure)",
},
},
"description": """


@ -49,7 +49,7 @@ class CookieSettingDriver(webdriver.Firefox):
self.driver.add_cookie({"name": name, "value": value})
elif self.cookiejar:
domain = urlparse(url).netloc
regex = re.compile(f"(www)?\.?{domain}$")
regex = re.compile(f"(www)?.?{domain}$")
for cookie in self.cookiejar:
if regex.match(cookie.domain):
try:


@ -40,6 +40,22 @@ class TestGenericExtractor(TestExtractorBase):
path = os.path.join(dirname(dirname(__file__)), "data/")
assert self.extractor.dropin_for_name("dropin", additional_paths=[path])
@pytest.mark.parametrize(
"url, suitable_extractors",
[
("https://www.youtube.com/watch?v=5qap5aO4i9A", ["youtube"]),
("https://www.tiktok.com/@funnycats0ftiktok/video/7345101300750748970?lang=en", ["tiktok"]),
("https://www.instagram.com/p/CU1J9JYJ9Zz/", ["instagram"]),
("https://www.facebook.com/nytimes/videos/10160796550110716", ["facebook"]),
("https://www.facebook.com/BylineFest/photos/t.100057299682816/927879487315946/", ["facebook"]),
],
)
def test_suitable_extractors(self, url, suitable_extractors):
suitable_extractors = suitable_extractors + ["generic"] # the generic is valid for all
extractors = list(self.extractor.suitable_extractors(url))
assert len(extractors) == len(suitable_extractors)
assert [e.ie_key().lower() for e in extractors] == suitable_extractors
@pytest.mark.parametrize(
"url, is_suitable",
[
@ -55,7 +71,7 @@ class TestGenericExtractor(TestExtractorBase):
("https://google.com", True),
],
)
def test_suitable_urls(self, make_item, url, is_suitable):
def test_suitable_urls(self, url, is_suitable):
"""
Note: expected behaviour is to return True for all URLs, as YoutubeDLArchiver should be able to handle all URLs
This behaviour may be changed in the future (e.g. if we want the youtubedl archiver to just handle URLs it has extractors for,
@ -245,3 +261,32 @@ class TestGenericExtractor(TestExtractorBase):
self.assertValidResponseMetadata(post, title, timestamp)
assert len(post.media) == 1
assert post.media[0].hash == image_hash
@pytest.mark.download
def test_download_facebook_video(self, make_item):
post = self.extractor.download(make_item("https://www.facebook.com/bellingcat/videos/588371253839133"))
assert len(post.media) == 2
assert post.media[0].filename.endswith("588371253839133.mp4")
assert post.media[0].mimetype == "video/mp4"
assert post.media[1].filename.endswith(".jpg")
assert post.media[1].mimetype == "image/jpeg"
assert "Bellingchat Premium is with Kolina Koltai" in post.get_title()
@pytest.mark.download
def test_download_facebook_image(self, make_item):
post = self.extractor.download(
make_item("https://www.facebook.com/BylineFest/photos/t.100057299682816/927879487315946/")
)
assert len(post.media) == 1
assert post.media[0].filename.endswith(".png")
assert "Byline Festival - BylineFest Partner" == post.get_title()
@pytest.mark.download
def test_download_facebook_text_only(self, make_item):
url = "https://www.facebook.com/bellingcat/posts/pfbid02rzpwZxAZ8bLkAX8NvHv4DWAidFaqAUfJMbo9vWkpwxL7uMUWzWMiizXLWRSjwihVl"
post = self.extractor.download(make_item(url))
assert "Bellingcat researcher Kolina Koltai delves deeper into Clothoff" in post.get("content")
assert post.get_title() == "Bellingcat"