Commit graph

21 Commits (58b6bcef87ed8e1877edcd590e89d412188ddd1a)

Author SHA1 Message Date
msramalho 629cd586db adds session_file for missing archivers 2022-11-08 13:59:09 +00:00
msramalho 22363cb8b9 adds information on browsertrix usage 2022-10-20 11:59:23 +01:00
msramalho 6c80a5b82d session file logic 2022-10-18 17:35:59 +01:00
msramalho 93be1af93f adds instagram post/profile 2022-10-18 15:45:10 +01:00
msramalho f0f844a569 improves browsertrix configurations 2022-10-18 11:21:10 +01:00
msramalho 57464f1506 refactors for edges in browsertrix and s3 upload, adds timeout parameter 2022-10-17 14:07:31 +01:00
Ed Summers c34fb9cf10
Add browsertrix profile config option
This commit adds a browsertrix profile option to the configuration. To
avoid passing the browsertrix config to every Archiver, the Archiver
constructors (including the base) were modified to accept a Storage and
a Config instance. Some of the constructors then pick out the pieces
they need from the Config, in addition to calling the parent
constructor. To avoid a circular import that this created, the Config
object now defines the default hash function to use, rather than having
it be a static property of the Archiver class.
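
A minimal sketch of the constructor refactor this describes, assuming
hypothetical class and attribute names (the real auto-archiver code may
differ):

```python
import hashlib

class Config:
    def __init__(self, hash_algorithm="SHA-256", browsertrix_profile=None):
        # The Config, not the Archiver, owns the default hash function,
        # which sidesteps the circular import mentioned above.
        self.hash_algorithm = hash_algorithm
        self.browsertrix_profile = browsertrix_profile

    def hash_fn(self):
        # hash_algorithm picks between SHA-256 and SHA3-512 (see the
        # "Added hash_algorithm to config" commit below).
        return hashlib.sha3_512 if self.hash_algorithm == "SHA3-512" else hashlib.sha256

class Archiver:
    def __init__(self, storage, config):
        # Every Archiver constructor (including the base) receives a
        # Storage and a Config instance...
        self.storage = storage
        self.hash_fn = config.hash_fn()

class BrowsertrixArchiver(Archiver):
    def __init__(self, storage, config):
        super().__init__(storage, config)
        # ...and picks out only the pieces it needs from the Config.
        self.browsertrix_profile = config.browsertrix_profile
```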
2022-10-11 16:21:42 -04:00
Ed Summers 3b87dffe6b Add browsertrix-crawler capture
The [browsertrix-crawler] utility is a browser-based crawler that can
crawl one or more pages. browsertrix-crawler creates archives in the
[WACZ] format, which is essentially a standardized ZIP file (similar to
DOCX, EPUB, JAR, etc.) that can then be replayed using the
[ReplayWeb.page] web component, or unzipped to get the original WARC
data (the ISO-standard format used by the Internet Archive Wayback
Machine).
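
Because a WACZ is a standard ZIP, the original WARC data can be pulled
out with the Python standard library alone; a quick sketch (the file
names are illustrative):

```python
import zipfile

# By WACZ convention, the raw crawl data lives under archive/*.warc.gz.
with zipfile.ZipFile("example.wacz") as wacz:
    warcs = [n for n in wacz.namelist() if n.endswith(".warc.gz")]
    wacz.extractall(path="unpacked", members=warcs)
```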

This PR adds browsertrix-crawler to the archiver classes where screenshots are made. The WACZ is uploaded to storage and then added to a new column in the spreadsheet. A column can be added that will display the WACZ, loaded from cloud storage (S3, DigitalOcean, etc.) using the client-side ReplayWeb.page component. You can see an example of the spreadsheet here:

https://docs.google.com/spreadsheets/d/1Tk-iJWzT9Sx2-YccuPttL9HcMdZEnhv_OR7Bc6tfeu8/edit#gid=0
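
As a sketch of how such a column's contents could be generated: the
`<replay-web-page>` web component and `ui.js` script follow the
ReplayWeb.page embedding documentation, and both URLs below are
placeholders:

```python
def replaywebpage_embed(wacz_url, page_url):
    # Load the ReplayWeb.page web component from a CDN and point it at
    # a WACZ hosted in cloud storage.
    return (
        '<script src="https://cdn.jsdelivr.net/npm/replaywebpage/ui.js"></script>\n'
        f'<replay-web-page source="{wacz_url}" url="{page_url}"></replay-web-page>'
    )

print(replaywebpage_embed(
    "https://my-bucket.s3.amazonaws.com/crawl.wacz",
    "https://example.com/archived-page",
))
```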

browsertrix-crawler requires Docker to be installed. If Docker is not
installed, an error message will be logged and things continue as normal.
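
One way that graceful fallback could look (a sketch, not the actual
auto-archiver code; `capture_wacz` is a hypothetical helper):

```python
import logging
import shutil

def capture_wacz(url):
    # browsertrix-crawler runs in Docker; if Docker is missing, log an
    # error and skip the capture so everything else continues as normal.
    if shutil.which("docker") is None:
        logging.error("Docker not found, skipping WACZ capture of %s", url)
        return None
    # ... run the browsertrix-crawler container against `url` here ...
```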

[browsertrix-crawler]: https://github.com/webrecorder/browsertrix-crawler
[WACZ]: https://specs.webrecorder.net/wacz/latest/
[ReplayWeb.page]: https://replayweb.page
2022-09-25 19:46:29 +00:00
msramalho 961dcdb4ef Merge branch 'dev' into oauth 2022-07-25 14:58:56 +01:00
msramalho 6124bc5f72 refactored and simplified obtaining credentials 2022-07-25 14:52:50 +01:00
msramalho 2d7d8c4e08 renaming and making SHA-256 the default 2022-07-25 12:12:43 +01:00
Dave Mateer 524b40b869 Added Google OAuth flow for Google Drive so a real user, rather than a service account, can be used to save files 2022-07-18 13:39:00 +01:00
Dave Mateer 363a8ef67a Added hash_algorithm to config to choose between SHA256 and SHA3_512 2022-07-18 13:15:48 +01:00
msramalho 90cb080c81 refactoring and renaming 2022-07-14 18:10:02 +02:00
Dave Mateer 42172566f2 Added whitelist and blacklist for worksheets (not spreadsheets) 2022-07-12 12:53:59 +01:00
msramalho 34536e7f14 added explanation for 2 twitter archivers 2022-06-27 11:17:23 +02:00
msramalho ffe1c425a0 new archiver, new hack, ready 2022-06-27 01:07:55 +02:00
msramalho 7ab8d0e825 tmp folder randomly created in folder 2022-06-16 19:58:26 +02:00
msramalho 2f02336403 example config 2022-06-15 17:18:47 +02:00
msramalho 6872d8e103 check if exists to configuration, save_logs to command line 2022-06-14 21:37:02 +02:00
msramalho dc60bb1558 json -> yaml 2022-06-14 21:18:18 +02:00