Mirror of https://github.com/bellingcat/auto-archiver
Merge branch 'main' into feat/yt-dlp-pots
commit
bbe25537c7
@@ -34,4 +34,5 @@ docs/_build/
docs/source/autoapi/
docs/source/modules/autogen/
scripts/settings_page.html
scripts/settings/src/schema.json
.vite

@@ -21,7 +21,7 @@ build:
# generate the config editor page. Schema then HTML
- VIRTUAL_ENV=$READTHEDOCS_VIRTUALENV_PATH poetry run python scripts/generate_settings_schema.py
# install node dependencies and build the settings
- cd scripts/settings && npm install && npm run build && yes | cp dist/index.html ../../docs/source/installation/settings_base.html && cd ../..
- cd scripts/settings && npm install && npm run build && yes | cp -v dist/index.html ../../docs/source/installation/settings.html && cd ../..

sphinx:

@@ -29,7 +29,7 @@ View the [Installation Guide](https://auto-archiver.readthedocs.io/en/latest/ins
To get started quickly using Docker:

`docker pull bellingcat/auto-archiver && docker run --rm -v secrets:/app/secrets bellingcat/auto-archiver --config secrets/orchestration.yaml`
`docker pull bellingcat/auto-archiver && docker run -it --rm -v secrets:/app/secrets bellingcat/auto-archiver --config secrets/orchestration.yaml`

Or pip:

@@ -36,3 +36,12 @@ open docs/_build/html/index.html
sphinx-autobuild docs/source docs/_build/html
```

### Managing Readthedocs (RTD) Versions

Version management is done at [https://app.readthedocs.org/projects/auto-archiver/](https://app.readthedocs.org/projects/auto-archiver/)
(login required). Once logged in, you can create new versions, delete old versions or change the visibility of versions. More info on
[RTD](https://docs.readthedocs.com/platform/stable/versions.html).

Currently, the Auto Archiver project is set up to automatically create a new docs version for each `vX.Y.Z` release. For more on this,
see the RTD [instructions on automation](https://docs.readthedocs.com/platform/stable/guides/automation-rules.html) or edit the existing automation rule in the project settings.

@@ -86,7 +86,7 @@ gsheet_feeder_db:

You can also pass these settings directly on the command line without having to edit the file. Here's an example of how to do that (using docker):

`docker run --rm -v $PWD/secrets:/app/secrets -v $PWD/local_archive:/app/local_archive bellingcat/auto-archiver:dockerize --gsheet_feeder_db.sheet "My Awesome Sheet 2"`.
`docker run -it --rm -v $PWD/secrets:/app/secrets -v $PWD/local_archive:/app/local_archive bellingcat/auto-archiver:dockerize --gsheet_feeder_db.sheet "My Awesome Sheet 2"`.

Here, the sheet name has been overridden/specified in the command line invocation.

@@ -0,0 +1,60 @@
# Frequently Asked Questions


### Q: What websites does the Auto Archiver support?
**A:** The Auto Archiver works for a large variety of sites. Firstly, the Auto Archiver can download
and archive any video website supported by `yt-dlp`, a powerful video-downloading tool ([full list of
sites here](https://github.com/yt-dlp/yt-dlp/blob/master/supportedsites.md)). Aside from these sites,
there are various 'Extractors' for specific websites. See the full list of extractors available on the
[extractors](../modules/extractor.md) page. Some supported sites include:

* Twitter
* Instagram
* Telegram
* VKontakte
* TikTok
* Bluesky

```{note} Which websites the Auto Archiver can archive depends on what extractors you have enabled in
your configuration. See [configuration](./configurations.md) for more info.
```

### Q: Does the Auto Archiver only work for social media posts?
**A:** No, the Auto Archiver can archive any web page on the internet, not just social media posts.
However, for social media posts the Auto Archiver can extract more relevant/useful information (such as
post comments, likes, author, etc.) which may not be available for a generic website. If you are looking
to archive webpages more generally, then you should make sure to enable the [](../modules/autogen/extractor/wacz_extractor_enricher.md)
and the [](../modules/autogen/extractor/wayback_extractor_enricher.md).

### Q: What kind of data is stored for each webpage that's archived?
**A:** This depends on the website archived, but in general, for social media posts any videos and photos in
the post will be archived. For video sites, the video will be downloaded separately. For most of these sites, additional
metadata such as the published date, uploader/author and ratings/comments will also be saved. Additionally, further data can be
saved depending on the enrichers that you have enabled. Some other types of data saved are timestamps if you have the
[](../modules/autogen/enricher/timestamping_enricher.md) or [](../modules/autogen/enricher/opentimestamps_enricher.md) enabled,
screenshots of the web page with the [](../modules/autogen/enricher/screenshot_enricher.md), and for videos, thumbnails of the
video with the [](../modules/autogen/enricher/thumbnail_enricher.md). You can also store things like hashes (SHA256 or PDQ hashes)
with the various hash enrichers.

### Q: Where is my data stored?
**A:** With the default configuration, data is stored on your local computer in the `local_storage` folder. You can adjust these settings by
changing the [storage modules](../modules/storage.md) you have enabled. For example, you could choose to store your data in an S3 bucket or
on Google Drive.

```{note}
You can choose to store your data in multiple places, for example your local drive **and** an S3 bucket for redundancy.
```

### Q: What should I do if something doesn't work?
**A:** First, read through the log files to see if you can find a specific reason why something isn't working. Learn more about logging
and how to enable debug logging in the [Logging Howto](../how_to/logging.md).

If you cannot find an answer in the logs, then try searching this documentation or existing or closed issues on the [Github Issue Tracker](https://github.com/bellingcat/auto-archiver/issues?q=is%3Aissue%20). If you still cannot find an answer, then consider opening an issue on the Github Issue Tracker or asking in the Bellingcat Discord
'Auto Archiver' group.

#### Common reasons why an archive might not work:

* The website may have temporarily adjusted its settings - sometimes sites like Telegram or Twitter adjust their scraping-protection settings. Often,
waiting a day or two and then trying again can work.
* The site requires you to be logged in - you could try using cookies or authentication to bypass any blocks. See [](../installation/authentication.md) for more information.
* The website you're trying to archive has changed its settings/structure. Make sure you're using the latest version of Auto Archiver and try again.

@@ -1,5 +1,11 @@
# Installation

```{toctree}
:maxdepth: 1

upgrading.md
```

There are 3 main ways to use the auto-archiver. We recommend the 'docker' method for most uses. This installs all the requirements in one command.

1. Easiest (recommended): [via docker](#installing-with-docker)

File diff suppressed because one or more lines are too long

@@ -1,7 +1,6 @@
# Getting Started

```{toctree}
:maxdepth: 1
:hidden:

installation.md
@@ -9,6 +8,7 @@ configurations.md
config_editor.md
authentication.md
requirements.md
faq.md
config_cheatsheet.md
```

@@ -27,17 +27,18 @@ The way you run the Auto Archiver depends on how you installed it (docker instal
If you installed Auto Archiver using docker, open up your terminal and copy-paste/type the following command:

```bash
docker run --rm -v $PWD/secrets:/app/secrets -v $PWD/local_archive:/app/local_archive bellingcat/auto-archiver
docker run -it --rm -v $PWD/secrets:/app/secrets -v $PWD/local_archive:/app/local_archive bellingcat/auto-archiver
```

Breaking this command down:
1. `docker run` tells docker to start a new container (an instance of the image)
2. `--rm` makes sure this container is removed after execution (less garbage locally)
3. `-v $PWD/secrets:/app/secrets` - your secrets folder with settings
2. `-it` tells docker to run in 'interactive mode' so that we get nice colour logs
3. `--rm` makes sure this container is removed after execution (less garbage locally)
4. `-v $PWD/secrets:/app/secrets` - your secrets folder with settings
   1. `-v` is a volume flag, which means a folder that you have on your computer will be connected to a folder inside the docker container
   2. `$PWD/secrets` points to a `secrets/` folder in your current working directory (where your console points to); we use this folder as a best practice to hold all the secrets/tokens/passwords/... you use
   3. `/app/secrets` points to the path inside the docker container where this folder can be found
4. `-v $PWD/local_archive:/app/local_archive` - (optional) if you use local_storage
5. `-v $PWD/local_archive:/app/local_archive` - (optional) if you use local_storage
   1. `-v` same as above, this is a volume instruction
   2. `$PWD/local_archive` is a folder `local_archive/` in case you want to archive locally and have the files accessible outside docker
   3. `/app/local_archive` is a folder inside docker that you can reference in your orchestration.yml file

@@ -48,14 +49,14 @@ The invocations below will run the auto-archiver Docker image using a configurat

```bash
# Have auto-archiver run with the default settings, generating a settings file in ./secrets/orchestration.yaml
docker run --rm -v $PWD/secrets:/app/secrets -v $PWD/local_archive:/app/local_archive bellingcat/auto-archiver
docker run -it --rm -v $PWD/secrets:/app/secrets -v $PWD/local_archive:/app/local_archive bellingcat/auto-archiver

# uses the same configuration, but with the `gsheet_feeder`, a header on row 2 and with some different column names
# Note this expects you to have followed the [Google Sheets setup](how_to/google_sheets.md) and added your service_account.json to the `secrets/` folder
# notice that columns is a dictionary so you need to pass it as JSON and it will override only the values provided
docker run --rm -v $PWD/secrets:/app/secrets -v $PWD/local_archive:/app/local_archive bellingcat/auto-archiver --feeders=gsheet_feeder --gsheet_feeder.sheet="use it on another sheets doc" --gsheet_feeder.header=2 --gsheet_feeder.columns='{"url": "link"}'
docker run -it --rm -v $PWD/secrets:/app/secrets -v $PWD/local_archive:/app/local_archive bellingcat/auto-archiver --feeders=gsheet_feeder --gsheet_feeder.sheet="use it on another sheets doc" --gsheet_feeder.header=2 --gsheet_feeder.columns='{"url": "link"}'
# Runs auto-archiver for the first time, but in 'full' mode, enabling all modules to get a full settings file
docker run --rm -v $PWD/secrets:/app/secrets -v $PWD/local_archive:/app/local_archive bellingcat/auto-archiver --mode full
docker run -it --rm -v $PWD/secrets:/app/secrets -v $PWD/local_archive:/app/local_archive bellingcat/auto-archiver --mode full
```

------------

@@ -0,0 +1,30 @@

# Upgrading

If an update is available, then you will see a message in the logs when you
run Auto Archiver. Here's what those logs look like:

```{code} bash
********* IMPORTANT: UPDATE AVAILABLE ********
A new version of auto-archiver is available (v0.13.6, you have 0.13.4)
Make sure to update to the latest version using: `pip install --upgrade auto-archiver`
```

How you upgrade Auto Archiver depends on how you installed it.

## Docker

To upgrade using docker, update the docker image with:

```
docker pull bellingcat/auto-archiver:latest
```

## Pip

To upgrade the pip package, use:

```
pip install --upgrade auto-archiver
```

@@ -59,4 +59,5 @@ output_schema = {
current_file_dir = os.path.dirname(os.path.abspath(__file__))
output_file = os.path.join(current_file_dir, "settings/src/schema.json")
with open(output_file, "w") as file:
print(f"Writing schema to {output_file}")
json.dump(output_schema, file, indent=4, cls=SchemaEncoder)

@@ -12,7 +12,7 @@
"@dnd-kit/sortable": "^10.0.0",
"@emotion/react": "latest",
"@emotion/styled": "latest",
"@mui/icons-material": "latest",
"@mui/icons-material": "^6.4.7",
"@mui/material": "latest",
"react": "19.0.0",
"react-dom": "19.0.0",

@@ -997,9 +997,9 @@
}
},
"node_modules/@mui/core-downloads-tracker": {
"version": "6.4.6",
"resolved": "https://registry.npmjs.org/@mui/core-downloads-tracker/-/core-downloads-tracker-6.4.6.tgz",
"integrity": "sha512-rho5Q4IscbrVmK9rCrLTJmjLjfH6m/NcqKr/mchvck0EIXlyYUB9+Z0oVmkt/+Mben43LMRYBH8q/Uzxj/c4Vw==",
"version": "6.4.7",
"resolved": "https://registry.npmjs.org/@mui/core-downloads-tracker/-/core-downloads-tracker-6.4.7.tgz",
"integrity": "sha512-XjJrKFNt9zAKvcnoIIBquXyFyhfrHYuttqMsoDS7lM7VwufYG4fAPw4kINjBFg++fqXM2BNAuWR9J7XVIuKIKg==",
"license": "MIT",
"funding": {
"type": "opencollective",
@@ -1007,9 +1007,9 @@
}
},
"node_modules/@mui/icons-material": {
"version": "6.4.6",
"resolved": "https://registry.npmjs.org/@mui/icons-material/-/icons-material-6.4.6.tgz",
"integrity": "sha512-rGJBvIQQbQAlyKYljHQ8wAQS/K2/uYwvemcpygnAmCizmCI4zSF9HQPuiG8Ql4YLZ6V/uKjA3WHIYmF/8sV+pQ==",
"version": "6.4.7",
"resolved": "https://registry.npmjs.org/@mui/icons-material/-/icons-material-6.4.7.tgz",
"integrity": "sha512-Rk8cs9ufQoLBw582Rdqq7fnSXXZTqhYRbpe1Y5SAz9lJKZP3CIdrj0PfG8HJLGw1hrsHFN/rkkm70IDzhJsG1g==",
"license": "MIT",
"dependencies": {
"@babel/runtime": "^7.26.0"
@@ -1022,7 +1022,7 @@
"url": "https://opencollective.com/mui-org"
},
"peerDependencies": {
"@mui/material": "^6.4.6",
"@mui/material": "^6.4.7",
"@types/react": "^17.0.0 || ^18.0.0 || ^19.0.0",
"react": "^17.0.0 || ^18.0.0 || ^19.0.0"
},
@@ -1033,14 +1033,14 @@
}
},
"node_modules/@mui/material": {
"version": "6.4.6",
"resolved": "https://registry.npmjs.org/@mui/material/-/material-6.4.6.tgz",
"integrity": "sha512-6UyAju+DBOdMogfYmLiT3Nu7RgliorimNBny1pN/acOjc+THNFVE7hlxLyn3RDONoZJNDi/8vO4AQQr6dLAXqA==",
"version": "6.4.7",
"resolved": "https://registry.npmjs.org/@mui/material/-/material-6.4.7.tgz",
"integrity": "sha512-K65StXUeGAtFJ4ikvHKtmDCO5Ab7g0FZUu2J5VpoKD+O6Y3CjLYzRi+TMlI3kaL4CL158+FccMoOd/eaddmeRQ==",
"license": "MIT",
"dependencies": {
"@babel/runtime": "^7.26.0",
"@mui/core-downloads-tracker": "^6.4.6",
"@mui/system": "^6.4.6",
"@mui/core-downloads-tracker": "^6.4.7",
"@mui/system": "^6.4.7",
"@mui/types": "^7.2.21",
"@mui/utils": "^6.4.6",
"@popperjs/core": "^2.11.8",
@@ -1061,7 +1061,7 @@
"peerDependencies": {
"@emotion/react": "^11.5.0",
"@emotion/styled": "^11.3.0",
"@mui/material-pigment-css": "^6.4.6",
"@mui/material-pigment-css": "^6.4.7",
"@types/react": "^17.0.0 || ^18.0.0 || ^19.0.0",
"react": "^17.0.0 || ^18.0.0 || ^19.0.0",
"react-dom": "^17.0.0 || ^18.0.0 || ^19.0.0"
@@ -1143,9 +1143,9 @@
}
},
"node_modules/@mui/system": {
"version": "6.4.6",
"resolved": "https://registry.npmjs.org/@mui/system/-/system-6.4.6.tgz",
"integrity": "sha512-FQjWwPec7pMTtB/jw5f9eyLynKFZ6/Ej9vhm5kGdtmts1z5b7Vyn3Rz6kasfYm1j2TfrfGnSXRvvtwVWxjpz6g==",
"version": "6.4.7",
"resolved": "https://registry.npmjs.org/@mui/system/-/system-6.4.7.tgz",
"integrity": "sha512-7wwc4++Ak6tGIooEVA9AY7FhH2p9fvBMORT4vNLMAysH3Yus/9B9RYMbrn3ANgsOyvT3Z7nE+SP8/+3FimQmcg==",
"license": "MIT",
"dependencies": {
"@babel/runtime": "^7.26.0",

@@ -13,7 +13,7 @@
"@dnd-kit/sortable": "^10.0.0",
"@emotion/react": "latest",
"@emotion/styled": "latest",
"@mui/icons-material": "latest",
"@mui/icons-material": "^6.4.7",
"@mui/material": "latest",
"react": "19.0.0",
"react-dom": "19.0.0",

@@ -4,7 +4,7 @@ import Container from '@mui/material/Container';
import Typography from '@mui/material/Typography';
import Box from '@mui/material/Box';
import FileUploadIcon from '@mui/icons-material/FileUpload';
//

import {
DndContext,
closestCenter,
@@ -204,7 +204,7 @@ function ModuleTypes({ stepType, setEnabledModules, enabledModules, configValues
{stepType}
</Typography>
<Typography variant="body1" >
Select the <a href="<a href={`https://auto-archiver.readthedocs.io/en/latest/modules/${stepType.slice(0,-1)}.html`}" target="_blank">{stepType}</a> you wish to enable. Drag to reorder.
Select the <a href={`https://auto-archiver.readthedocs.io/en/latest/modules/${stepType.slice(0,-1)}.html`} target="_blank">{stepType}</a> you wish to enable. Drag to reorder.
</Typography>
</Box>
{showError ? <Typography variant="body1" color="error" >Only one {stepType.slice(0,-1)} can be enabled at a time.</Typography> : null}

File diff is too large to display
@@ -6,7 +6,7 @@ import { viteSingleFile } from "vite-plugin-singlefile"
export default defineConfig({
plugins: [react(), viteSingleFile()],
build: {
minify: false,
sourcemap: true,
// minify: false,
// sourcemap: true,
}
});

@@ -8,6 +8,7 @@ flexible setup in various environments.
import argparse
from ruamel.yaml import YAML, CommentedMap
import json
import os

from loguru import logger

@@ -230,6 +231,10 @@ def read_yaml(yaml_filename: str) -> CommentedMap:
def store_yaml(config: CommentedMap, yaml_filename: str) -> None:
config_to_save = deepcopy(config)

## if the save path is the default location (secrets) then create the 'secrets' folder
if os.path.dirname(yaml_filename) == "secrets":
os.makedirs("secrets", exist_ok=True)

auth_dict = config_to_save.get("authentication", {})
if auth_dict and auth_dict.get("load_from_file"):
# remove all other values from the config, don't want to store it in the config file

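A minimal sketch of what the added check does when the default save location is used. The config fragment and filename below are illustrative, not taken from the codebase:

```python
import os
from copy import deepcopy

yaml_filename = "secrets/orchestration.yaml"      # the default location used elsewhere in the docs
config = {"steps": {"feeders": ["cli_feeder"]}}   # hypothetical config fragment

config_to_save = deepcopy(config)
if os.path.dirname(yaml_filename) == "secrets":
    # mirrors the new behaviour: make sure the folder exists before the YAML is written
    os.makedirs("secrets", exist_ok=True)
print(os.path.isdir("secrets"))  # True
```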
@@ -112,7 +112,7 @@ class ArchivingOrchestrator:
def check_steps(self, config):
for module_type in MODULE_TYPES:
if not config["steps"].get(f"{module_type}s", []):
if module_type == "feeder" or module_type == "formatter" and config["steps"].get(f"{module_type}"):
if (module_type == "feeder" or module_type == "formatter") and config["steps"].get(f"{module_type}"):
raise SetupError(
f"It appears you have '{module_type}' set under 'steps' in your configuration file, but as of version 0.13.0 of Auto Archiver, you must use '{module_type}s'. Change this in your configuration file and try again. \
Here's how that would look: \n\nsteps:\n {module_type}s:\n - [your_{module_type}_name_here]\n {'extractors:...' if module_type == 'feeder' else '...'}\n"

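The parentheses added above matter because Python's `and` binds more tightly than `or`. A small standalone sketch of the difference (the config dict and module names are made up for illustration):

```python
# `and` binds tighter than `or`, so the unparenthesised condition short-circuits to True
# whenever module_type is "feeder", even if the legacy singular key is absent.
steps = {"feeders": ["cli_feeder"], "formatters": []}  # hypothetical "steps" fragment

module_type = "feeder"
old_key_present = bool(steps.get(module_type))  # the singular "feeder" key is not set here

buggy = module_type == "feeder" or module_type == "formatter" and old_key_present
fixed = (module_type == "feeder" or module_type == "formatter") and old_key_present

print(buggy)  # True  -> would raise SetupError even though nothing is misconfigured
print(fixed)  # False -> only errors when the legacy singular key is actually present
```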
@@ -377,7 +377,8 @@ Here's how that would look: \n\nsteps:\n extractors:\n - [your_extractor_name_
try:
loaded_module: BaseModule = self.module_factory.get_module(module, self.config)
except (KeyboardInterrupt, Exception) as e:
logger.error(f"Error during setup of modules: {e}\n{traceback.format_exc()}")
if not isinstance(e, KeyboardInterrupt) and not isinstance(e, SetupError):
logger.error(f"Error during setup of modules: {e}\n{traceback.format_exc()}")
if loaded_module and module_type == "extractor":
loaded_module.cleanup()
raise e

@@ -2,13 +2,14 @@ from loguru import logger

from auto_archiver.core.feeder import Feeder
from auto_archiver.core.metadata import Metadata
from auto_archiver.core.consts import SetupError


class CLIFeeder(Feeder):
def setup(self) -> None:
self.urls = self.config["urls"]
if not self.urls:
raise ValueError(
raise SetupError(
"No URLs provided. Please provide at least one URL via the command line, or set up an alternative feeder. Use --help for more information."
)

@@ -15,6 +15,9 @@ supported by `yt-dlp`, such as YouTube, Facebook, and others. It provides functi
for retrieving videos, subtitles, comments, and other metadata, and it integrates with
the broader archiving framework.

For a full list of video platforms supported by `yt-dlp`, see the
[official documentation](https://github.com/yt-dlp/yt-dlp/blob/master/supportedsites.md)

### Features
- Supports downloading videos and playlists.
- Retrieves metadata like titles, descriptions, upload dates, and durations.

@@ -1,3 +1,4 @@
from typing import Type
from yt_dlp.extractor.common import InfoExtractor
from auto_archiver.core.metadata import Metadata
from auto_archiver.core.extractor import Extractor
@@ -24,6 +25,8 @@ class GenericDropin:

"""

extractor: Type[Extractor] = None

def extract_post(self, url: str, ie_instance: InfoExtractor):
"""
This method should return the post data from the url.
@@ -55,3 +58,10 @@ class GenericDropin:
This method should download any additional media from the post.
"""
return metadata

def is_suitable(self, url, info_extractor: InfoExtractor):
"""
Used to override the InfoExtractor's 'is_suitable' method. Dropins should override this method to return True if the url is suitable for the extractor
(based on being able to parse other URLs)
"""
return False

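A minimal sketch of how a dropin might use the new `is_suitable` hook. The `ExampleDropin` class, the URL pattern, and the import path are hypothetical, shown only to illustrate the override point:

```python
import re

from yt_dlp.extractor.common import InfoExtractor

# path assumed for illustration; adjust to wherever GenericDropin lives in your checkout
from auto_archiver.modules.generic_extractor.dropin import GenericDropin


class ExampleDropin(GenericDropin):
    """Hypothetical dropin that claims example.com photo permalinks even when the
    stock yt-dlp InfoExtractor.suitable() would reject them."""

    def is_suitable(self, url: str, info_extractor: InfoExtractor) -> bool:
        return bool(re.match(r"https?://(www\.)?example\.com/photos/", url))
```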
@@ -1,17 +1,154 @@
import re
from .dropin import GenericDropin
from auto_archiver.core.metadata import Metadata
from yt_dlp.extractor.facebook import FacebookIE

# TODO: Remove if / when https://github.com/yt-dlp/yt-dlp/pull/12275 is merged
from yt_dlp.utils import (
clean_html,
get_element_by_id,
traverse_obj,
get_first,
merge_dicts,
int_or_none,
parse_count,
)


def _extract_metadata(self, webpage, video_id):
post_data = [
self._parse_json(j, video_id, fatal=False)
for j in re.findall(r"data-sjs>({.*?ScheduledServerJS.*?})</script>", webpage)
]
post = (
traverse_obj(
post_data,
(..., "require", ..., ..., ..., "__bbox", "require", ..., ..., ..., "__bbox", "result", "data"),
expected_type=dict,
)
or []
)
media = traverse_obj(
post,
(
...,
"attachments",
...,
lambda k, v: (k == "media" and str(v["id"]) == video_id and v["__typename"] == "Video"),
),
expected_type=dict,
)
title = get_first(media, ("title", "text"))
description = get_first(media, ("creation_story", "comet_sections", "message", "story", "message", "text"))
page_title = title or self._html_search_regex(
(
r'<h2\s+[^>]*class="uiHeaderTitle"[^>]*>(?P<content>[^<]*)</h2>',
r'(?s)<span class="fbPhotosPhotoCaption".*?id="fbPhotoPageCaption"><span class="hasCaption">(?P<content>.*?)</span>',
self._meta_regex("og:title"),
self._meta_regex("twitter:title"),
r"<title>(?P<content>.+?)</title>",
),
webpage,
"title",
default=None,
group="content",
)
description = description or self._html_search_meta(
["description", "og:description", "twitter:description"], webpage, "description", default=None
)
uploader_data = (
get_first(media, ("owner", {dict}))
or get_first(
post, ("video", "creation_story", "attachments", ..., "media", lambda k, v: k == "owner" and v["name"])
)
or get_first(post, (..., "video", lambda k, v: k == "owner" and v["name"]))
or get_first(post, ("node", "actors", ..., {dict}))
or get_first(post, ("event", "event_creator", {dict}))
or get_first(post, ("video", "creation_story", "short_form_video_context", "video_owner", {dict}))
or {}
)
uploader = uploader_data.get("name") or (
clean_html(get_element_by_id("fbPhotoPageAuthorName", webpage))
or self._search_regex(
(r'ownerName\s*:\s*"([^"]+)"', *self._og_regexes("title")), webpage, "uploader", fatal=False
)
)
timestamp = int_or_none(self._search_regex(r'<abbr[^>]+data-utime=["\'](\d+)', webpage, "timestamp", default=None))
thumbnail = self._html_search_meta(["og:image", "twitter:image"], webpage, "thumbnail", default=None)
# some webpages contain unretrievable thumbnail urls
# like https://lookaside.fbsbx.com/lookaside/crawler/media/?media_id=10155168902769113&get_thumbnail=1
# in https://www.facebook.com/yaroslav.korpan/videos/1417995061575415/
if thumbnail and not re.search(r"\.(?:jpg|png)", thumbnail):
thumbnail = None
info_dict = {
"description": description,
"uploader": uploader,
"uploader_id": uploader_data.get("id"),
"timestamp": timestamp,
"thumbnail": thumbnail,
"view_count": parse_count(
self._search_regex(
(r'\bviewCount\s*:\s*["\']([\d,.]+)', r'video_view_count["\']\s*:\s*(\d+)'),
webpage,
"view count",
default=None,
)
),
"concurrent_view_count": get_first(
post, (("video", (..., ..., "attachments", ..., "media")), "liveViewerCount", {int_or_none})
),
**traverse_obj(
post,
(
lambda _, v: video_id in v["url"],
"feedback",
{
"like_count": ("likers", "count", {int}),
"comment_count": ("total_comment_count", {int}),
"repost_count": ("share_count_reduced", {parse_count}),
},
),
get_all=False,
),
}

info_json_ld = self._search_json_ld(webpage, video_id, default={})
info_json_ld["title"] = (
re.sub(r"\s*\|\s*Facebook$", "", title or info_json_ld.get("title") or page_title or "")
or (description or "").replace("\n", " ")
or f"Facebook video #{video_id}"
)
return merge_dicts(info_json_ld, info_dict)


class Facebook(GenericDropin):
def extract_post(self, url: str, ie_instance):
video_id = ie_instance._match_valid_url(url).group("id")
ie_instance._download_webpage(url.replace("://m.facebook.com/", "://www.facebook.com/"), video_id)
webpage = ie_instance._download_webpage(url, ie_instance._match_valid_url(url).group("id"))
def extract_post(self, url: str, ie_instance: FacebookIE):
post_id_regex = r"(?P<id>pfbid[A-Za-z0-9]+|\d+|t\.(\d+\/\d+))"
post_id = re.search(post_id_regex, url).group("id")
webpage = ie_instance._download_webpage(url.replace("://m.facebook.com/", "://www.facebook.com/"), post_id)

# TODO: fix once https://github.com/yt-dlp/yt-dlp/pull/12275 is merged
post_data = ie_instance._extract_metadata(webpage)
# TODO: For long posts, this _extract_metadata only seems to return the first 100 or so characters, followed by ...

# TODO: If/when https://github.com/yt-dlp/yt-dlp/pull/12275 is merged, uncomment next line and delete the one after
# post_data = ie_instance._extract_metadata(webpage, post_id)
post_data = _extract_metadata(ie_instance, webpage, post_id)
return post_data

def create_metadata(self, post: dict, ie_instance, archiver, url):
metadata = archiver.create_metadata(url)
metadata.set_title(post.get("title")).set_content(post.get("description")).set_post_data(post)
return metadata
def create_metadata(self, post: dict, ie_instance: FacebookIE, archiver, url):
result = Metadata()
result.set_content(post.get("description", ""))
result.set_title(post.get("title", ""))
result.set("author", post.get("uploader", ""))
result.set_url(url)
return result

def is_suitable(self, url, info_extractor: FacebookIE):
regex = r"(?:https?://(?:[\w-]+\.)?(?:facebook\.com||facebookwkhpilnemxj7asaniu7vnjjbiltxjqhye3mhbshg7kx5tfyd\.onion)/)"
return re.match(regex, url)

def skip_ytdlp_download(self, url: str, is_instance: FacebookIE):
"""
Skip using the ytdlp download method for Facebook *photo* posts, they have a URL with an id of t.XXXXX/XXXXX
"""
if re.search(r"/t.\d+/\d+", url):
return True

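A quick standalone check of the URL patterns used above. The sample URLs are taken from the tests later in this diff; the regexes are lightly tidied copies (escaped dot, single alternation bar) and the snippet is an illustrative sketch, not part of the module:

```python
import re

# photo posts carry an id of the form t.XXXXX/XXXXX, which skip_ytdlp_download keys on
photo_post = re.compile(r"/t\.\d+/\d+")
# simplified version of the is_suitable regex (facebook.com or its onion mirror)
facebook_url = re.compile(
    r"https?://(?:[\w-]+\.)?(?:facebook\.com|facebookwkhpilnemxj7asaniu7vnjjbiltxjqhye3mhbshg7kx5tfyd\.onion)/"
)

video_url = "https://www.facebook.com/bellingcat/videos/588371253839133"
image_url = "https://www.facebook.com/BylineFest/photos/t.100057299682816/927879487315946/"

print(bool(facebook_url.match(video_url)), bool(photo_post.search(video_url)))  # True False -> let yt-dlp try the download
print(bool(facebook_url.match(image_url)), bool(photo_post.search(image_url)))  # True True  -> skip yt-dlp, use the dropin
```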
@@ -67,8 +67,18 @@ class GenericExtractor(Extractor):
"""
Returns a list of valid extractors for the given URL"""
for info_extractor in yt_dlp.YoutubeDL()._ies.values():
if info_extractor.suitable(url) and info_extractor.working():
if not info_extractor.working():
continue

# check if there's a dropin and see if that declares whether it's suitable
dropin = self.dropin_for_name(info_extractor.ie_key())
if dropin and dropin.is_suitable(url, info_extractor):
yield info_extractor
continue

if info_extractor.suitable(url):
yield info_extractor
continue

def suitable(self, url: str) -> bool:
"""

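A small usage sketch of the generator above, mirroring how the test suite later in this diff exercises it. The URL is one of the test cases; constructing and configuring the extractor is assumed to have happened through the usual module factory and is not shown:

```python
# Sketch: list which yt-dlp InfoExtractors (including dropin overrides) claim a URL.
url = "https://www.facebook.com/BylineFest/photos/t.100057299682816/927879487315946/"

# `extractor` stands for an already set-up GenericExtractor instance (see the tests below)
names = [ie.ie_key().lower() for ie in extractor.suitable_extractors(url)]
print(names)  # expected per the tests: ["facebook", "generic"]
```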
@@ -188,9 +198,13 @@ class GenericExtractor(Extractor):
result = self.download_additional_media(video_data, info_extractor, result)

# keep both 'title' and 'fulltitle', but prefer 'title', falling back to 'fulltitle' if it doesn't exist
result.set_title(video_data.pop("title", video_data.pop("fulltitle", "")))
result.set_url(url)
if "description" in video_data:
if not result.get_title():
result.set_title(video_data.pop("title", video_data.pop("fulltitle", "")))

if not result.get("url"):
result.set_url(url)

if "description" in video_data and not result.get_content():
result.set_content(video_data["description"])
# extract comments if enabled
if self.comments:

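A sketch of the guard added above: yt-dlp metadata should only fill in fields that a dropin has not already set. `Metadata` is the class imported earlier in this diff; the sample values are hypothetical:

```python
from auto_archiver.core.metadata import Metadata

result = Metadata()
result.set_url("https://example.com/post/1")   # hypothetical values already set by a dropin
result.set_title("title from the dropin")

video_data = {"title": "title from yt-dlp", "fulltitle": "fulltitle from yt-dlp"}
if not result.get_title():
    result.set_title(video_data.pop("title", video_data.pop("fulltitle", "")))

print(result.get_title())  # "title from the dropin" - the dropin's value is not overwritten
```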
@@ -207,10 +221,10 @@ class GenericExtractor(Extractor):
)

# then add the common metadata
if timestamp := video_data.pop("timestamp", None):
if (timestamp := video_data.pop("timestamp", None)) and not result.get("timestamp"):
timestamp = datetime.datetime.fromtimestamp(timestamp, tz=datetime.timezone.utc).isoformat()
result.set_timestamp(timestamp)
if upload_date := video_data.pop("upload_date", None):
if (upload_date := video_data.pop("upload_date", None)) and not result.get("upload_date"):
upload_date = datetime.datetime.strptime(upload_date, "%Y%m%d").replace(tzinfo=datetime.timezone.utc)
result.set("upload_date", upload_date)

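The parentheses around the walrus assignments above matter: without them, `:=` captures the truth value of the whole `and` expression rather than the popped value. A standalone sketch with a made-up epoch value:

```python
import datetime

video_data = {"timestamp": 1700000000}
existing = None  # stands in for result.get("timestamp")

# Unparenthesised: timestamp becomes the boolean result of the `and`, not the epoch value.
if timestamp := video_data.pop("timestamp", None) and not existing:
    print(timestamp)  # True - useless for fromtimestamp()

video_data = {"timestamp": 1700000000}
# Parenthesised: timestamp keeps the popped epoch value, and the guard still applies.
if (timestamp := video_data.pop("timestamp", None)) and not existing:
    print(datetime.datetime.fromtimestamp(timestamp, tz=datetime.timezone.utc).isoformat())
```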
@@ -240,7 +254,8 @@ class GenericExtractor(Extractor):
return False

post_data = dropin.extract_post(url, ie_instance)
return dropin.create_metadata(post_data, ie_instance, self, url)
result = dropin.create_metadata(post_data, ie_instance, self, url)
return self.add_metadata(post_data, info_extractor, url, result)

def get_metadata_for_video(
self, data: dict, info_extractor: Type[InfoExtractor], url: str, ydl: yt_dlp.YoutubeDL

@@ -296,6 +311,7 @@ class GenericExtractor(Extractor):

def _load_dropin(dropin):
dropin_class = getattr(dropin, dropin_class_name)()
dropin.extractor = self
return self._dropins.setdefault(dropin_name, dropin_class)

try:

@@ -340,7 +356,7 @@ class GenericExtractor(Extractor):
dropin_submodule = self.dropin_for_name(info_extractor.ie_key())

try:
if dropin_submodule and dropin_submodule.skip_ytdlp_download(info_extractor, url):
if dropin_submodule and dropin_submodule.skip_ytdlp_download(url, info_extractor):
logger.debug(f"Skipping using ytdlp to download files for {info_extractor.ie_key()}")
raise SkipYtdlp()

@@ -359,7 +375,7 @@ class GenericExtractor(Extractor):

if not isinstance(e, SkipYtdlp):
logger.debug(
f'Issue using "{info_extractor.IE_NAME}" extractor to download video (error: {repr(e)}), attempting to use extractor to get post data instead'
f'Issue using "{info_extractor.IE_NAME}" extractor to download video (error: {repr(e)}), attempting to use dropin to get post data instead'
)

try:

@@ -38,6 +38,9 @@ class Tiktok(GenericDropin):
api_data["video_url"] = video_url
return api_data

def keys_to_clean(self, video_data: dict, info_extractor):
return ["video_url", "title", "create_time", "author", "cover", "origin_cover", "ai_dynamic_cover", "duration"]

def create_metadata(self, post: dict, ie_instance, archiver, url):
# prepare result, start by downloading video
result = Metadata()
@@ -54,17 +57,17 @@ class Tiktok(GenericDropin):
logger.error(f"failed to download video from {video_url}")
return False
video_media = Media(video_downloaded)
if duration := post.pop("duration", None):
if duration := post.get("duration", None):
video_media.set("duration", duration)
result.add_media(video_media)

# add remaining metadata
result.set_title(post.pop("title", ""))
result.set_title(post.get("title", ""))

if created_at := post.pop("create_time", None):
if created_at := post.get("create_time", None):
result.set_timestamp(datetime.fromtimestamp(created_at, tz=timezone.utc))

if author := post.pop("author", None):
if author := post.get("author", None):
result.set("author", author)

result.set("api_data", post)

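The switch from `dict.pop` to `dict.get` above keeps those fields inside the `api_data` blob that is stored at the end, and the new `keys_to_clean` hook then names what can be stripped separately. A toy illustration (the post dict is made up):

```python
post = {"title": "clip", "duration": 12, "author": "someone", "other": "kept"}

# pop() removes the key, so the api_data stored afterwards would lose it:
popped = dict(post)          # work on a copy so both behaviours can be shown
popped.pop("title", "")
print("title" in popped)     # False

# get() reads without mutating, matching the change above:
post.get("title", "")
print("title" in post)       # True -> the field is still available for api_data

# a keys_to_clean-style list can then strip duplicated fields in one place:
keys_to_clean = ["title", "duration", "author"]
api_data = {k: v for k, v in post.items() if k not in keys_to_clean}
print(api_data)              # {'other': 'kept'}
```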
@@ -20,7 +20,7 @@
"save_absolute": {
"default": False,
"type": "bool",
"help": "whether the path to the stored file is absolute or relative in the output result inc. formatters (WARN: leaks the file structure)",
"help": "whether the path to the stored file is absolute or relative in the output result inc. formatters (Warning: saving an absolute path will show your computer's file structure)",
},
},
"description": """

@@ -49,7 +49,7 @@ class CookieSettingDriver(webdriver.Firefox):
self.driver.add_cookie({"name": name, "value": value})
elif self.cookiejar:
domain = urlparse(url).netloc
regex = re.compile(f"(www)?\.?{domain}$")
regex = re.compile(f"(www)?.?{domain}$")
for cookie in self.cookiejar:
if regex.match(cookie.domain):
try:

@@ -40,6 +40,22 @@ class TestGenericExtractor(TestExtractorBase):
path = os.path.join(dirname(dirname(__file__)), "data/")
assert self.extractor.dropin_for_name("dropin", additional_paths=[path])

@pytest.mark.parametrize(
"url, suitable_extractors",
[
("https://www.youtube.com/watch?v=5qap5aO4i9A", ["youtube"]),
("https://www.tiktok.com/@funnycats0ftiktok/video/7345101300750748970?lang=en", ["tiktok"]),
("https://www.instagram.com/p/CU1J9JYJ9Zz/", ["instagram"]),
("https://www.facebook.com/nytimes/videos/10160796550110716", ["facebook"]),
("https://www.facebook.com/BylineFest/photos/t.100057299682816/927879487315946/", ["facebook"]),
],
)
def test_suitable_extractors(self, url, suitable_extractors):
suitable_extractors = suitable_extractors + ["generic"]  # the generic is valid for all
extractors = list(self.extractor.suitable_extractors(url))
assert len(extractors) == len(suitable_extractors)
assert [e.ie_key().lower() for e in extractors] == suitable_extractors

@pytest.mark.parametrize(
"url, is_suitable",
[
@@ -55,7 +71,7 @@ class TestGenericExtractor(TestExtractorBase):
("https://google.com", True),
],
)
def test_suitable_urls(self, make_item, url, is_suitable):
def test_suitable_urls(self, url, is_suitable):
"""
Note: expected behaviour is to return True for all URLs, as YoutubeDLArchiver should be able to handle all URLs
This behaviour may be changed in the future (e.g. if we want the youtubedl archiver to just handle URLs it has extractors for,
@@ -245,3 +261,32 @@ class TestGenericExtractor(TestExtractorBase):
self.assertValidResponseMetadata(post, title, timestamp)
assert len(post.media) == 1
assert post.media[0].hash == image_hash

@pytest.mark.download
def test_download_facebook_video(self, make_item):
post = self.extractor.download(make_item("https://www.facebook.com/bellingcat/videos/588371253839133"))
assert len(post.media) == 2
assert post.media[0].filename.endswith("588371253839133.mp4")
assert post.media[0].mimetype == "video/mp4"

assert post.media[1].filename.endswith(".jpg")
assert post.media[1].mimetype == "image/jpeg"

assert "Bellingchat Premium is with Kolina Koltai" in post.get_title()

@pytest.mark.download
def test_download_facebook_image(self, make_item):
post = self.extractor.download(
make_item("https://www.facebook.com/BylineFest/photos/t.100057299682816/927879487315946/")
)

assert len(post.media) == 1
assert post.media[0].filename.endswith(".png")
assert "Byline Festival - BylineFest Partner" == post.get_title()

@pytest.mark.download
def test_download_facebook_text_only(self, make_item):
url = "https://www.facebook.com/bellingcat/posts/pfbid02rzpwZxAZ8bLkAX8NvHv4DWAidFaqAUfJMbo9vWkpwxL7uMUWzWMiizXLWRSjwihVl"
post = self.extractor.download(make_item(url))
assert "Bellingcat researcher Kolina Koltai delves deeper into Clothoff" in post.get("content")
assert post.get_title() == "Bellingcat"