Commit graph

89 Commits (rtd_docs)

Author SHA1 Message Date
msramalho 65dd155c90 WIP refactor logic 2022-11-15 15:00:52 +00:00
msramalho 22363cb8b9 adds information on browsertrix usage 2022-10-20 11:59:23 +01:00
msramalho ac4f1b6132 readme updates 2022-10-19 11:37:04 +01:00
msramalho 26903190fd adds wacz link 2022-10-17 14:41:34 +01:00
msramalho 57464f1506 refactors for edges in browsertrix and s3 upload, adds timeout parameter 2022-10-17 14:07:31 +01:00
Ed Summers c34fb9cf10 Add browsertrix profile config option
This commit adds a browsertrix profile option to the configuration. To
avoid passing the browsertrix config to every Archiver, the Archiver
constructors (including the base class) were modified to accept a Storage
and a Config instance. Some of the constructors then pick out the pieces
they need from the Config, in addition to calling the parent constructor.
To avoid the circular import that this created, the Config object now
defines the default hash function to use, rather than having it be a
static property of the Archiver class.
2022-10-11 16:21:42 -04:00
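The constructor change described in this commit message can be pictured roughly as follows. This is only a minimal sketch under assumptions: the class bodies, the field names `browsertrix_profile` and `hash_algorithm`, and the `ExampleArchiver` subclass are hypothetical illustrations, not the repository's actual code.

```python
import hashlib
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Config:
    # Hypothetical fields: the real Config class is not shown in this log.
    browsertrix_profile: Optional[str] = None
    # The default hash function lives on Config, so Config never has to
    # import Archiver (which is what would create a circular import).
    hash_algorithm: Callable = hashlib.sha256


class Storage:
    """Placeholder for a storage backend (S3, Google Drive, ...)."""


class Archiver:
    def __init__(self, storage: Storage, config: Config):
        # The base class keeps the pieces every archiver needs.
        self.storage = storage
        self.hash_algorithm = config.hash_algorithm
        self.browsertrix_profile = config.browsertrix_profile

    def get_hash(self, data: bytes) -> str:
        return self.hash_algorithm(data).hexdigest()


class ExampleArchiver(Archiver):
    def __init__(self, storage: Storage, config: Config):
        # Call the parent constructor, then pick out anything extra
        # this particular archiver needs from the Config.
        super().__init__(storage, config)
        self.uses_profile = config.browsertrix_profile is not None
```

With this shape, adding a new configuration option only touches `Config` and the constructors that care about it, rather than every call site that builds an archiver.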
Ed Summers 3b87dffe6b Add browsertrix-crawler capture
The [browsertrix-crawler] utility is a browser-based crawler that can
crawl one or more pages. browsertrix-crawler creates archives in the
[WACZ] format, which is essentially a standardized ZIP file (similar to
DOCX, EPUB, JAR, etc.) that can then be replayed using the [ReplayWeb.page]
web component, or unzipped to get the original WARC data (the ISO standard
format used by the Internet Archive Wayback Machine).

This PR adds browsertrix-crawler to the archiver classes where screenshots are made. The WACZ is uploaded to storage and then added to a new column in the spreadsheet. A further column can be added that displays the WACZ, loaded from cloud storage (S3, DigitalOcean, etc.) using the client-side ReplayWeb.page component. You can see an example of the spreadsheet here:

https://docs.google.com/spreadsheets/d/1Tk-iJWzT9Sx2-YccuPttL9HcMdZEnhv_OR7Bc6tfeu8/edit#gid=0

browsertrix-crawler requires Docker to be installed. If Docker is not
installed, an error message is logged and archiving continues as normal.

[browsertrix-crawler]: https://github.com/webrecorder/browsertrix-crawler
[WACZ]: https://specs.webrecorder.net/wacz/latest/
[ReplayWeb.page]: https://replayweb.page
2022-09-25 19:46:29 +00:00
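Two points from this commit message lend themselves to a short illustration: a WACZ file can be opened with any ZIP library, and the capture step is skipped when Docker is missing. The sketch below assumes nothing about the repository's real code; the function names `docker_available` and `list_wacz_members` are hypothetical, and the crawler invocation itself is deliberately left out.

```python
import io
import logging
import shutil
import zipfile
from typing import List

logger = logging.getLogger(__name__)


def docker_available() -> bool:
    """Return True if a `docker` executable is on PATH; log an error otherwise."""
    if shutil.which("docker") is None:
        # Mirrors the described behaviour: log the problem and let the
        # caller carry on without a WACZ capture.
        logger.error("Docker is not installed; skipping browsertrix-crawler capture")
        return False
    return True


def list_wacz_members(wacz_bytes: bytes) -> List[str]:
    """List the files inside a WACZ archive, which is an ordinary ZIP file."""
    with zipfile.ZipFile(io.BytesIO(wacz_bytes)) as zf:
        return zf.namelist()
```

Because WACZ is ZIP-based, the same approach extends to `zf.extractall(...)` to recover the WARC data mentioned above.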
Miguel Sozinho Ramalho 0bdd06f641 Update README.md 2022-09-22 15:58:41 +02:00
msramalho 34536e7f14 added explanation for 2 twitter archivers 2022-06-27 11:17:23 +02:00
msramalho ffe1c425a0 new archiver, new hack, ready 2022-06-27 01:07:55 +02:00
Miguel Sozinho Ramalho 3d9a2622c3 Update README.md 2022-06-16 16:23:53 +01:00
msramalho 14add43923 fixing auto_auto_archive 2022-06-16 17:17:25 +02:00
msramalho 3bffee41a0 README updates 2022-06-14 21:40:04 +02:00
msramalho dc60bb1558 json -> yaml 2022-06-14 21:18:18 +02:00
msramalho bd753b27ed numbers in markdown 2022-06-14 20:55:30 +02:00
msramalho c11a208253 more verbose about mandatory columns 2022-06-14 19:54:08 +02:00
msramalho 3019778b8f readme updates 2022-06-07 18:52:19 +02:00
msramalho 3791afc94c readme updates 2022-06-07 18:43:04 +02:00
msramalho d46b8e1157 README updates 2022-06-07 18:41:43 +02:00
msramalho 10f03cb888 Merge branch 'dev' into refactor-configs 2022-06-02 17:30:47 +02:00
msramalho 159adf9afe refactoring filenumber into subfolder 2022-05-26 19:18:29 +02:00
msramalho 03aa02e88b diagram 2022-05-25 12:23:59 +02:00
Dave Mateer dbac5accbd Save to folders for S3 and GD. Google Drive (GD) storage 2022-05-11 15:39:44 +01:00
msramalho b459f36dda C 2022-05-09 18:23:01 +02:00
Miguel Sozinho Ramalho 8f62e8b7c6 Update README.md 2022-05-03 14:45:18 +01:00
msramalho 07bbf443ca improves documentation 2022-03-13 12:05:09 +01:00
msramalho 39ec190e56 adds README instructions for geckodriver 2022-03-09 11:44:05 +01:00
msramalho 1d62009c4f creates utils module and moves gworksheet there 2022-02-23 16:24:59 +01:00
msramalho 9a264a7dfe cleanup and docs 2022-02-23 16:07:58 +01:00
msramalho f3ce226665 split into multiple files MVP 2022-02-21 14:19:09 +01:00
James Arnall 1e6b504c7a Changed instructions on creating credential files
The original instructions specified putting the gspread credentials file in the default location, but the code looks for a file in another place. The README now reflects the code's behavior.
2021-06-12 11:37:09 -05:00
Logan Williams 3aa2a083ba Update README.md 2021-06-04 12:07:03 +02:00
Logan Williams 5b0fe4212a Rename files to use consistent punctuation 2021-06-01 09:33:20 +00:00
Logan Williams 4d76867b5e Update README.md 2021-05-14 14:06:41 +02:00
Logan Williams 339f62fade Update auto archiver docs with new header declaration method 2021-05-12 09:01:45 +02:00
Logan Williams a87b9a7b30 Update README.md 2021-02-09 15:27:42 +01:00
Logan Williams ad883c9232 Add intro 2021-02-09 15:22:58 +01:00
Logan Williams 8c7a3387bd Use correct image for after picture 2021-02-09 15:20:19 +01:00
Logan Williams 0d1dc42654 Add readme 2021-02-09 15:19:46 +01:00