Mirror of https://github.com/bellingcat/auto-archiver
Finish how to on authentication
parent 4174285898
commit 0bec71d203

@@ -1,6 +1,110 @@

# Logging in to sites

This how-to guide shows you how you can use various authentication methods to allow you to login to a site you are trying to archive. This is useful for websites that require a user to be logged in to browse them, or for sites that restrict bots.

```{note} This page is still under construction 🚧
```

In this How-To, we will authenticate on Twitter/X.com using cookies, and on XXXX using username/password.
## Using cookies to authenticate on Twitter/X

It can be useful to archive tweets after logging in, since some tweets are only visible to authenticated users. One case is Tweets marked as 'Sensitive'.

Take this tweet as an example: [https://x.com/SozinhoRamalho/status/1876710769913450647](https://x.com/SozinhoRamalho/status/1876710769913450647)

This tweet has been marked as sensitive, so a normal run of Auto Archiver without a logged-in session will fail to extract the tweet:

```{code-block} console
:emphasize-lines: 3,4,5,6

>>> auto-archiver https://x.com/SozinhoRamalho/status/1876710769913450647 ✭ ✱
...
ERROR: [twitter] 1876710769913450647: NSFW tweet requires authentication. Use --cookies,
--cookies-from-browser, --username and --password, --netrc-cmd, or --netrc (twitter) to
provide account credentials. See https://github.com/yt-dlp/yt-dlp/wiki/FAQ#how-do-i-pass-cookies-to-yt-dlp
for how to manually pass cookies
[twitter] 1876710769913450647: Downloading guest token
[twitter] 1876710769913450647: Downloading GraphQL JSON
2025-02-20 15:06:13.362 | ERROR | auto_archiver.modules.generic_extractor.generic_extractor:download_for_extractor:248 - Error downloading metadata for post: NSFW tweet requires authentication. Use --cookies, --cookies-from-browser, --username and --password, --netrc-cmd, or --netrc (twitter) to provide account credentials. See https://github.com/yt-dlp/yt-dlp/wiki/FAQ#how-do-i-pass-cookies-to-yt-dlp for how to manually pass cookies
[generic] Extracting URL: https://x.com/SozinhoRamalho/status/1876710769913450647
[generic] 1876710769913450647: Downloading webpage
WARNING: [generic] Falling back on generic information extractor
[generic] 1876710769913450647: Extracting information
ERROR: Unsupported URL: https://x.com/SozinhoRamalho/status/1876710769913450647
2025-02-20 15:06:13.744 | INFO | auto_archiver.core.orchestrator:archive:483 - Trying extractor telegram_extractor for https://x.com/SozinhoRamalho/status/1876710769913450647
2025-02-20 15:06:13.744 | SUCCESS | auto_archiver.modules.console_db.console_db:done:23 - DONE Metadata(status='nothing archived', metadata={'_processed_at': datetime.datetime(2025, 2, 20, 15, 6, 12, 473979, tzinfo=datetime.timezone.utc), 'url': 'https://x.com/SozinhoRamalho/status/1876710769913450647'}, media=[])
...
```

To get around this limitation, we can use **cookies** (information about a logged-in user) to mimic being logged in to Twitter. There are two ways to pass cookies to Auto Archiver: one is from a file, and the other is from a browser profile on your computer.

In this tutorial, we will export the Twitter cookies from our browser and add them to Auto Archiver.

**1. Installing a cookie exporter extension**

First, we need to install an extension in our browser to export the cookies for a certain site. The [FAQ on yt-dlp](https://github.com/yt-dlp/yt-dlp/wiki/FAQ#how-do-i-pass-cookies-to-yt-dlp) provides some suggestions: Get [cookies.txt LOCALLY](https://chromewebstore.google.com/detail/get-cookiestxt-locally/cclelndahbckbenkjhflpdbgdldlbecc) for Chrome or [cookies.txt](https://addons.mozilla.org/en-US/firefox/addon/cookies-txt/) for Firefox.

**2. Export the cookies**

```{note} See the note [here](../installation/authentication.md#recommendations-for-authentication) on why you shouldn't use your own personal account for archiving.
```

Once the extension is installed in your preferred browser, login to Twitter in this browser, and then activate the extension and export the cookies. You can choose to export all your cookies for your browser, or just cookies for this specific site. In the image below, we're only exporting cookies for Twitter/x.com:

![](image)

**3. Adding the cookies file to Auto Archiver**

You will now have a file called `cookies.txt` (tip: name it `twitter_cookies.txt` if you only exported cookies for Twitter), which needs to be added to Auto Archiver.

Do this by going into your Auto Archiver configuration file, and editing the `authentication` section. We will add the `cookies_file` option for the site `x.com,twitter.com`.

```{note} For websites that have multiple URLs (like x.com and twitter.com) you can 'reuse' the same login information without duplicating it by using a comma-separated list of domain names.
```

I've saved my `twitter_cookies.txt` file in a `secrets` folder, so here's how my authentication section looks now:

```{code} yaml
:caption: orchestration.yaml

...
authentication:
  x.com,twitter.com:
    cookies_file: secrets/twitter_cookies.txt
...
```
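
Under the hood, the orchestrator expands a comma-separated site key into one entry per domain, so both domains share the same credentials. A minimal standalone sketch of that normalisation (the function name `expand_authentication` is ours, for illustration; the loop mirrors the orchestrator change included later in this commit):

```python
from copy import copy

def expand_authentication(authentication: dict) -> dict:
    # Split comma-separated site keys into one entry per domain,
    # e.g. "x.com,twitter.com" yields both "x.com" and "twitter.com"
    # pointing at the same credentials dict.
    for key, val in copy(authentication).items():
        if "," in key:
            for site in key.split(","):
                authentication[site.strip()] = val
            del authentication[key]
    return authentication

auth = expand_authentication(
    {"x.com,twitter.com": {"cookies_file": "secrets/twitter_cookies.txt"}}
)
# auth now has separate "x.com" and "twitter.com" entries
```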

**4. Re-run your archiving with the cookies enabled**

Now, the next time we re-run Auto Archiver, the cookies from our logged-in session will be used, and restricted/sensitive tweets can be downloaded!

```{code} console
>>> auto-archiver https://x.com/SozinhoRamalho/status/1876710769913450647 ✭ ✱ ◼
...
2025-02-20 15:27:46.785 | WARNING | auto_archiver.modules.console_db.console_db:started:13 - STARTED Metadata(status='no archiver', metadata={'_processed_at': datetime.datetime(2025, 2, 20, 15, 27, 46, 785304, tzinfo=datetime.timezone.utc), 'url': 'https://x.com/SozinhoRamalho/status/1876710769913450647'}, media=[])
2025-02-20 15:27:46.785 | INFO | auto_archiver.core.orchestrator:archive:483 - Trying extractor generic_extractor for https://x.com/SozinhoRamalho/status/1876710769913450647
[twitter] Extracting URL: https://x.com/SozinhoRamalho/status/1876710769913450647
...
2025-02-20 15:27:53.134 | INFO | auto_archiver.modules.local_storage.local_storage:upload:26 - ./local_archive/https-x-com-sozinhoramalho-status-1876710769913450647/06e8bacf27ac4bb983bf6280.html
2025-02-20 15:27:53.135 | SUCCESS | auto_archiver.modules.console_db.console_db:done:23 - DONE Metadata(status='yt-dlp_Twitter: success',
metadata={'_processed_at': datetime.datetime(2025, 2, 20, 15, 27, 48, 564738, tzinfo=datetime.timezone.utc), 'url':
'https://x.com/SozinhoRamalho/status/1876710769913450647', 'title': 'ignore tweet, testing sensitivity warning nudity https://t.co/t3u0hQsSB1',
...
```

### Finishing Touches

You've now successfully exported your cookies from a logged-in session in your browser, and used them to authenticate with Twitter and download a sensitive tweet. Congratulations!

Finally, some important things to remember:

1. It's best not to use your own personal account for archiving. [Here's why](../installation/authentication.md#recommendations-for-authentication).
2. Cookies can be short-lived, so they may need updating. Sometimes a website session may 'expire', or a website may force you to login again. In these instances, you'll need to repeat the export step (step 2) after logging in again to update your cookies.

## Authenticating on XXXX site with username/password

```{note} This section is still under construction 🚧
```

Binary file not shown.
After Width: | Height: | Size: 944 KiB

@@ -1,4 +1,4 @@

# Upgrading to v0.13

```{note} This how-to is only relevant for people who used Auto Archiver before February 2025 (versions prior to 0.13).

@@ -11,13 +11,16 @@ Version 0.13 of Auto Archiver has breaking changes in the configuration format,

There are two simple ways to check if you need to update your format:

1. When you try and run auto-archiver using your existing configuration file, you get an error about no feeders or formatters being configured, like:

```{code} console
AssertionError: No feeders were configured. Make sure to set at least one feeder in
your configuration file or on the command line (using --feeders)
```

2. Within your configuration file, you have a `feeder:` option. This is the old format. An example old format:

```{code} yaml

steps:
  feeder: gsheet_feeder
...

@@ -31,12 +34,12 @@ To update your configuration file, you can either:

This is recommended if you want to keep all your old settings. Follow the steps below to change the relevant settings:

#### a) Feeder & Formatter Steps Settings

The feeder and formatter settings have been changed from a single string to a list.

- `steps.feeder (string)` → `steps.feeders (list)`
- `steps.formatter (string)` → `steps.formatters (list)`

Example:
```{code} yaml

@@ -58,17 +61,18 @@ steps:

```{note} Auto Archiver still only supports one feeder and formatter, but from v0.13 onwards they must be added to the configuration file as a list.
```
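
Concretely, the single-string entries become one-element lists. A minimal sketch of the new form, reusing the `gsheet_feeder` from the old example above (the `html_formatter` module name here is an assumption for illustration):

```{code} yaml
steps:
  feeders:
  - gsheet_feeder
  formatters:
  - html_formatter
```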

#### b) Extractor (formerly Archiver) Steps Settings

With v0.13 of Auto Archiver, the `archivers` have been renamed to `extractors` to reflect the work they actually do - extract information from a URL. Change the configuration by renaming:

- `steps.archivers` → `steps.extractors`

The names of the actual modules have also changed, so for any extractor modules you have enabled, you will need to rename the `archiver` part to `extractor`. Some examples:

- `telethon_archiver` → `telethon_extractor`
- `wacz_archiver_enricher` → `wacz_extractor_enricher`
- `wayback_archiver_enricher` → `wayback_extractor_enricher`
- `vk_archiver` → `vk_extractor`

Additionally, the `youtube_archiver` has been renamed to `generic_extractor` and should be considered the default/fallback extractor. Read more about the [generic extractor](../modules/autogen/extractor/generic_extractor.md).
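
Putting the renames together, an updated `steps` section might look like the following sketch (which modules you enable is illustrative, not prescriptive):

```{code} yaml
steps:
  extractors:
  - generic_extractor
  - telethon_extractor
```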

@@ -91,16 +95,13 @@ steps:

```

#### c) Redundant / Obsolete Modules

With v0.13 of Auto Archiver, the following modules have been removed and their features have been built into the generic_extractor. You should remove them from the 'steps' section of your configuration file:

* `twitter_archiver` - use the `generic_extractor` for general extraction, or the `twitter_api_extractor` for API access.
* `tiktok_archiver` - use the `generic_extractor` to extract TikTok videos.

### 2. Auto-generate a new config, then copy over your settings.

@@ -1,7 +1,7 @@

# Installing Auto Archiver

```{toctree}
:maxdepth: 1
:hidden:

configurations.md

@@ -8,7 +8,7 @@ The default (enabled) databases are the CSV Database and the Console Database.
```

```{toctree}
:maxdepth: 1
:hidden:
:glob:
autogen/database/*

@@ -7,7 +7,7 @@ Enricher modules are used to add additional information to the items that have
```

```{toctree}
:maxdepth: 1
:hidden:
:glob:
autogen/enricher/*

@@ -11,7 +11,7 @@ Extractors that are able to extract content from a wide range of websites includ
```

```{toctree}
:maxdepth: 1
:hidden:
:glob:
autogen/extractor/*

@@ -13,7 +13,7 @@ auto-archiver [options] -- URL1 URL2 ...
```

```{toctree}
:maxdepth: 1
:glob:
:hidden:
autogen/feeder/*

@@ -6,7 +6,7 @@ Formatter modules are used to format the data extracted from a URL into a specif
```

```{toctree}
:maxdepth: 1
:hidden:
:glob:
autogen/formatter/*

@@ -8,7 +8,7 @@ The default is to store the files downloaded (e.g. images, videos) in a local di
```

```{toctree}
:maxdepth: 1
:hidden:
:glob:
autogen/storage/*

@@ -85,6 +85,8 @@ class BaseModule(ABC):

    * api_key: str - the API key to use for login\n
    * api_secret: str - the API secret to use for login\n
    * cookie: str - a cookie string to use for login (specific to this site)\n
    * cookies_file: str - the path to a cookies file to use for login (specific to this site)\n
    * cookies_from_browser: str - the name of the browser to extract cookies from (specific to this site)\n
    """
    # TODO: think about if/how we can deal with sites that have multiple domains (main one is x.com/twitter.com)
    # for now the user must enter them both, like "x.com,twitter.com" in their config. Maybe we just hard-code?

@@ -527,6 +527,7 @@ class ArchivingOrchestrator:

        for key, val in copy(authentication).items():
            if "," in key:
                for site in key.split(","):
                    site = site.strip()
                    authentication[site] = val
                del authentication[key]