Commit graph

19 commits (c8fa077df7d01de3c62c2b64e3fc2830dc9b22fd)

Author SHA1 Message Date
Ed Summers c34fb9cf10
Add browsertrix profile config option
This commit adds a browsertrix profile option to the configuration. To
avoid having to pass the browsertrix config to every Archiver, the
Archiver constructors (including the base) were modified to accept a
Storage and a Config instance. Some of the constructors then pick out
the pieces they need from the Config, in addition to calling the parent
constructor. To avoid a circular import that this change created, the
Config object now defines the default hash function to use, rather than
having it be a static property of the Archiver class.
2022-10-11 16:21:42 -04:00
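The constructor change described above can be sketched roughly as follows. This is an illustrative reconstruction, not the project's actual API: the class and attribute names (`Config.browsertrix_profile`, `BrowsertrixArchiver`, `hash_algorithm`) are assumptions based on the commit message.

```python
import hashlib
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Config:
    # The default hash function now lives on Config rather than as a
    # static property of Archiver, breaking the circular import the
    # commit message describes. Names here are hypothetical.
    hash_algorithm: Callable = hashlib.sha256
    browsertrix_profile: Optional[str] = None


class Storage:
    """Stub for an upload target (S3, Google Drive, ...)."""


class Archiver:
    def __init__(self, storage: Storage, config: Config):
        # Every Archiver (including the base) now accepts a Storage and
        # a Config instance instead of individual settings.
        self.storage = storage
        self.config = config
        self.hash_algorithm = config.hash_algorithm


class BrowsertrixArchiver(Archiver):
    def __init__(self, storage: Storage, config: Config):
        super().__init__(storage, config)
        # Subclasses pick out only the pieces they need from the Config.
        self.profile = config.browsertrix_profile
```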
Ed Summers 3b87dffe6b Add browsertrix-crawler capture
The [browsertrix-crawler] utility is a browser-based crawler that can
crawl one or more pages. browsertrix-crawler creates archives in the
[WACZ] format, which is essentially a standardized ZIP file (similar to
DOCX, EPUB, JAR, etc.) that can then be replayed using the
[ReplayWeb.page] web component, or unzipped to get the original WARC
data (the ISO standard format used by the Internet Archive Wayback
Machine).

This PR adds browsertrix-crawler to the archiver classes where
screenshots are made. The WACZ is uploaded to storage and then added to
a new column in the spreadsheet. A column can be added that will display
the WACZ, loaded from cloud storage (S3, DigitalOcean, etc.) using the
client-side ReplayWeb.page component. You can see an example of the
spreadsheet here:

https://docs.google.com/spreadsheets/d/1Tk-iJWzT9Sx2-YccuPttL9HcMdZEnhv_OR7Bc6tfeu8/edit#gid=0

browsertrix-crawler requires Docker to be installed. If Docker is not
installed, an error message is logged and things continue as normal.

[browsertrix-crawler]: https://github.com/webrecorder/browsertrix-crawler
[WACZ]: https://specs.webrecorder.net/wacz/latest/
[ReplayWeb.page]: https://replayweb.page
2022-09-25 19:46:29 +00:00
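A minimal sketch of the Docker-based capture flow this commit describes: check for Docker, and if present run the `webrecorder/browsertrix-crawler` image against a URL. The helper names, mount path, and exact flag set are assumptions for illustration, not the project's actual code.

```python
import shutil
import subprocess
from typing import List


def build_crawl_cmd(url: str, crawls_dir: str = "./crawls") -> List[str]:
    # Assemble a docker invocation of browsertrix-crawler; --generateWACZ
    # asks the crawler to package the capture as a WACZ alongside the WARCs.
    return [
        "docker", "run", "--rm",
        "-v", f"{crawls_dir}:/crawls/",
        "webrecorder/browsertrix-crawler", "crawl",
        "--url", url,
        "--generateWACZ",
    ]


def crawl_to_wacz(url: str, crawls_dir: str = "./crawls") -> bool:
    # If Docker is not installed, log an error and continue as normal,
    # as the commit message describes.
    if shutil.which("docker") is None:
        print("docker not found; skipping browsertrix-crawler capture")
        return False
    subprocess.run(build_crawl_cmd(url, crawls_dir), check=True)
    return True
```

The resulting WACZ under `crawls_dir` would then be uploaded to cloud storage and linked from the spreadsheet column.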
msramalho f87acb6d1d refactor 2022-06-07 18:41:58 +02:00
msramalho 10f03cb888 Merge branch 'dev' into refactor-configs 2022-06-02 17:30:47 +02:00
msramalho 159adf9afe refactoring filenumber into subfolder 2022-05-26 19:18:29 +02:00
Dave Mateer dbac5accbd Save to folders for S3 and GD. Google Drive (GD) storage 2022-05-11 15:39:44 +01:00
msramalho d469967c03 fix index out of range for empty sheets 2022-05-10 22:24:21 +02:00
msramalho 07bbf443ca improves documentation 2022-03-13 12:05:09 +01:00
msramalho 4c54926548 offset fix 2022-03-12 20:29:43 +01:00
msramalho d8d9cf17dc fix offset 2022-03-12 20:25:52 +01:00
msramalho f121c9dab7 enable tolower 2022-03-12 20:14:16 +01:00
msramalho 67b16064bb offby1 2022-03-12 20:11:38 +01:00
msramalho ec4ae84487 case-insensitive is a bad idea 2022-03-12 20:06:31 +01:00
msramalho 69483d432c adds logs 2022-03-12 20:04:08 +01:00
msramalho 6e5e7212c2 fixes header offset 2022-03-12 19:56:00 +01:00
msramalho 6c5d6f521e implements fresh status retrieval if needed 2022-03-10 19:00:02 +01:00
msramalho ff874fe0d3 simplifies access to google sheets, single get_values 2022-03-09 12:17:51 +01:00
Logan Williams 63a2847ac9 Add header argument; set up webdriver 2022-02-25 16:09:35 +01:00
msramalho 1d62009c4f creates utils module and moves gworksheet there 2022-02-23 16:24:59 +01:00