Commit graph

21 Commits (58b6bcef87ed8e1877edcd590e89d412188ddd1a)

Author SHA1 Message Date
msramalho 629cd586db adds session_file for missing archivers 2022-11-08 13:59:09 +00:00
msramalho 22363cb8b9 adds information on browsertrix usage 2022-10-20 11:59:23 +01:00
msramalho 6c80a5b82d session file logic 2022-10-18 17:35:59 +01:00
msramalho 93be1af93f adds instagram post/profile 2022-10-18 15:45:10 +01:00
msramalho f0f844a569 improves browsertrix configurations 2022-10-18 11:21:10 +01:00
msramalho 57464f1506 refactors for edges in browsertrix and s3 upload, adds timeout parameter 2022-10-17 14:07:31 +01:00
Ed Summers c34fb9cf10
Add browsertrix profile config option
This commit adds a browsertrix profile option to the configuration. To
avoid passing the browsertrix config to every Archiver, the Archiver
constructors (including the base) were modified to accept a Storage and
a Config instance. Some of the constructors then pick out the pieces
they need from the Config, in addition to calling the parent
constructor. To avoid a circular import that this created, the Config
object now defines the default hash function to use, rather than having
it be a static property of the Archiver class.
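
A minimal sketch of the constructor refactor this describes, assuming
hypothetical class and attribute names (the real auto-archiver code may
differ):

```python
import hashlib

class Config:
    def __init__(self, hash_algorithm="SHA-256", browsertrix_profile=None):
        # The Config, not the Archiver, owns the default hash function,
        # which sidesteps the circular import mentioned above.
        self.hash_algorithm = hash_algorithm
        self.browsertrix_profile = browsertrix_profile

    def hash_fn(self):
        # hash_algorithm picks between SHA-256 and SHA3-512 (see the
        # "Added hash_algorithm to config" commit below).
        return hashlib.sha3_512 if self.hash_algorithm == "SHA3-512" else hashlib.sha256

class Archiver:
    def __init__(self, storage, config):
        # Every Archiver constructor (including the base) receives a
        # Storage and a Config instance...
        self.storage = storage
        self.hash_fn = config.hash_fn()

class BrowsertrixArchiver(Archiver):
    def __init__(self, storage, config):
        super().__init__(storage, config)
        # ...and picks out only the pieces it needs from the Config.
        self.browsertrix_profile = config.browsertrix_profile
```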
2022-10-11 16:21:42 -04:00
Ed Summers 3b87dffe6b Add browsertrix-crawler capture
The [browsertrix-crawler] utility is a browser-based crawler that can
crawl one or more pages. browsertrix-crawler creates archives in the
[WACZ] format, which is essentially a standardized ZIP file (similar to
DOCX, EPUB, JAR, etc.) that can then be replayed using the
[ReplayWeb.page] web component, or unzipped to get the original WARC
data (the ISO-standard format used by the Internet Archive Wayback
Machine).
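
Because a WACZ is a standard ZIP, the original WARC data can be pulled
out with the Python standard library alone; a quick sketch (the file
names are illustrative):

```python
import zipfile

# By WACZ convention, the raw crawl data lives under archive/*.warc.gz.
with zipfile.ZipFile("example.wacz") as wacz:
    warcs = [n for n in wacz.namelist() if n.endswith(".warc.gz")]
    wacz.extractall(path="unpacked", members=warcs)
```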

This PR adds browsertrix-crawler to the archiver classes where screenshots are made. The WACZ is uploaded to storage and then added to a new column in the spreadsheet. A column can be added that will display the WACZ, loaded from cloud storage (S3, DigitalOcean, etc.) using the client-side ReplayWeb.page component. You can see an example of the spreadsheet here:

https://docs.google.com/spreadsheets/d/1Tk-iJWzT9Sx2-YccuPttL9HcMdZEnhv_OR7Bc6tfeu8/edit#gid=0
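
As a sketch of how such a column's contents could be generated: the
`<replay-web-page>` web component and `ui.js` script follow the
ReplayWeb.page embedding documentation, and both URLs below are
placeholders:

```python
def replaywebpage_embed(wacz_url, page_url):
    # Load the ReplayWeb.page web component from a CDN and point it at
    # a WACZ hosted in cloud storage.
    return (
        '<script src="https://cdn.jsdelivr.net/npm/replaywebpage/ui.js"></script>\n'
        f'<replay-web-page source="{wacz_url}" url="{page_url}"></replay-web-page>'
    )

print(replaywebpage_embed(
    "https://my-bucket.s3.amazonaws.com/crawl.wacz",
    "https://example.com/archived-page",
))
```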

browsertrix-crawler requires Docker to be installed. If Docker is not
installed, an error message will be logged and things continue as normal.
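
One way that graceful fallback could look (a sketch, not the actual
auto-archiver code; `capture_wacz` is a hypothetical helper):

```python
import logging
import shutil

def capture_wacz(url):
    # browsertrix-crawler runs in Docker; if Docker is missing, log an
    # error and skip the capture so everything else continues as normal.
    if shutil.which("docker") is None:
        logging.error("Docker not found, skipping WACZ capture of %s", url)
        return None
    # ... run the browsertrix-crawler container against `url` here ...
```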

[browsertrix-crawler]: https://github.com/webrecorder/browsertrix-crawler
[WACZ]: https://specs.webrecorder.net/wacz/latest/
[ReplayWeb.page]: https://replayweb.page
2022-09-25 19:46:29 +00:00
msramalho 961dcdb4ef Merge branch 'dev' into oauth 2022-07-25 14:58:56 +01:00
msramalho 6124bc5f72 refactored and simplified obtaining credentials 2022-07-25 14:52:50 +01:00
msramalho 2d7d8c4e08 renaming and making SHA-256 the default 2022-07-25 12:12:43 +01:00
Dave Mateer 524b40b869 Added Google OAuth flow for Google Drive so a real user, rather than a service account, can be used to save files 2022-07-18 13:39:00 +01:00
Dave Mateer 363a8ef67a Added hash_algorithm to config to choose between SHA256 and SHA3_512 2022-07-18 13:15:48 +01:00
msramalho 90cb080c81 refactoring and renaming 2022-07-14 18:10:02 +02:00
Dave Mateer 42172566f2 Added whitelist and blacklist for worksheets (not spreadsheets) 2022-07-12 12:53:59 +01:00
msramalho 34536e7f14 added explanation for 2 twitter archivers 2022-06-27 11:17:23 +02:00
msramalho ffe1c425a0 new archiver, new hack, ready 2022-06-27 01:07:55 +02:00
msramalho 7ab8d0e825 tmp folder randomly created in folder 2022-06-16 19:58:26 +02:00
msramalho 2f02336403 example config 2022-06-15 17:18:47 +02:00
msramalho 6872d8e103 check if exists to configuration, save_logs to command line 2022-06-14 21:37:02 +02:00
msramalho dc60bb1558 json -> yaml 2022-06-14 21:18:18 +02:00