auto-archiver

Wykres commitów

Autor	SHA1	Wiadomość	Data
R. Miles McCain	f603400d0d	Add direct Atlos integration (#137 ) * Add Atlos feeder * Add Atlos db * Add Atlos storage * Fix Atlos storages * Fix Atlos feeder * Only include URLs in Atlos feeder once they're processed * Remove print * Add Atlos documentation to README * Formatting fixes * Don't archive existing material * avoid KeyError in atlos_db * version bump --------- Co-authored-by: msramalho <19508417+msramalho@users.noreply.github.com>	2024-04-15 19:25:17 +01:00
Miguel Sozinho Ramalho	7a21ae96af	V0.9.0 - closes several open issues: new enrichers and bug fixes (#133 ) * clean orchestrator code, add archiver cleanup logic * improves documentation for database.py * telethon archivers isolate sessions into copied files * closes #127 * closes #125 * closes #84 * meta enricher applies to all media * closes #61 adds subtitles and comments * minor update * minor fixes to yt-dlp subtitles and comments * closes #17 but logic is imperfect. * closes #85 ssl enhancer * minimifies html, JS refactor for preview of certificates * closes #91 adds freetsa timestamp authority * version bump * simplify download_url method * skip ssl if nothing archived * html preview improvements * adds retrying lib * manual download archiver improvements * meta only runs when relevant data available * new metadata convenience method * html template improvements * removes debug message * does not close #91 yet, will need a few more certificate chaing logging * adds verbosity config * new instagram api archiver * adds proxy support we * adds proxy/end support and bug fix for yt-dlp * proxy support for webdriver * adds socks proxy to wacz_enricher * refactor recursivity in inner media and display * infinite recursive display * foolproofing timestamping authortities * version to 0.9.0 * minor fixes from code-review	2024-02-20 18:05:29 +00:00
Tomas Apodaca	590d3fe824	Fix typo in readme (#121 )	2024-01-24 21:17:31 +00:00
msramalho	b7889a182d	readme update	2023-06-26 18:18:46 +01:00
msramalho	04f827f183	Bump version to v0.5.25 for release	2023-06-26 18:15:45 +01:00
Miguel Sozinho Ramalho	cc03ad7c49	Update README.md	2023-05-11 13:55:28 +01:00
Logan Williams	6d2aa3dd7a	Add invocation example	2023-05-11 14:32:23 +02:00
Logan Williams	f2e580de4e	Update README images	2023-05-11 14:30:27 +02:00
Logan Williams	80ea912d0e	Update README	2023-05-11 11:32:46 +02:00
Logan Williams	26373d4545	Re-order README slightly	2023-05-10 11:48:34 +02:00
Miguel Sozinho Ramalho	b67a7b818a	Merge pull request #75 from bellingcat/feature/browsertrix	2023-05-10 10:14:40 +01:00
Logan Williams	2e63cb8411	Update README with new entrypoint	2023-05-10 11:13:47 +02:00
msramalho	e150370657	updates docker instructions	2023-05-10 09:51:53 +01:00
msramalho	ae3e607705	fix: depreacating thumbnail_index	2023-05-09 11:29:05 +01:00
msramalho	876988b587	detect invalid url messages instagram bot	2023-02-20 12:22:52 +00:00
msramalho	d1e4574c6c	readme updates	2023-02-17 16:30:50 +00:00
msramalho	f35875a94c	name fix	2023-02-17 15:46:05 +00:00
msramalho	224ebe7ee8	links	2023-02-08 22:27:56 +00:00
msramalho	54a1bc2172	update readme	2023-02-08 22:26:24 +00:00
msramalho	77948207d1	update	2023-02-08 22:24:40 +00:00
msramalho	60552ae0ea	update readme	2023-02-08 22:23:25 +00:00
msramalho	f255271ecb	update README	2023-02-08 22:17:22 +00:00
msramalho	2a7ece5dcc	cleanups and docs	2023-02-08 22:13:19 +00:00
msramalho	d31b3dda52	Bump version to v0.2.17 for release	2023-02-07 23:56:42 +00:00
msramalho	f81ff14faa	license to publish	2023-02-07 23:43:50 +00:00
msramalho	5ed38ffaab	clean readme	2023-02-07 23:37:53 +00:00
msramalho	9b4a41e654	Bump version to v0.2.0 for release	2023-02-07 22:07:23 +00:00
msramalho	b3860cfec1	telethon join channels working	2022-12-14 14:01:39 +00:00
msramalho	65dd155c90	WIP refactor logic	2022-11-15 15:00:52 +00:00
msramalho	22363cb8b9	adds information on browsertrix usage	2022-10-20 11:59:23 +01:00
msramalho	ac4f1b6132	readme updates	2022-10-19 11:37:04 +01:00
msramalho	26903190fd	adds wacz link	2022-10-17 14:41:34 +01:00
msramalho	57464f1506	refactors for edges in browsertrix and s3 upload, adds timeout parameter	2022-10-17 14:07:31 +01:00
Ed Summers	c34fb9cf10	Add browsertrix profile config option This commit adds a browsertrix profile option to the configuration. In order to not require the passing of the browsertrix config to every Archiver, the Archiver constructors (include the base) were modified to accept a Storage and Config instance. Some of the constructors them pick out the pieces they need from the Config, in addition to calling the parent constructor. In order to avoid a circular import that this created the Config object now defines the default hash function to use, rather than having it be a static property of the Archiver class.	2022-10-11 16:21:42 -04:00
Ed Summers	3b87dffe6b	Add browsertrix-crawler capture The [browsertrix-crawler] utility is a browser-based crawler that can crawl one or more pages. browsertrix-crawler creates archives in the [WACZ] format which is essentially a standardized ZIP file (similar to DOCX, EPUB, JAR, etc) which can then be replayed using the [ReplayWeb.page] web component, or unzipped to get the original WARC data (the ISO standard format used by the Internet Archive Wayback Machine). This PR adds browsertrix-crawler to archiver classes where screenshots are made made. The WACZ is uploaded to storage and then added to a new column in the spreadsheet. A column can be added that will display the WACZ, loaded from cloud storage (S3, digitalocean, etc) using the client side ReplayWeb page. You can see an example of the spreadsheet here: https://docs.google.com/spreadsheets/d/1Tk-iJWzT9Sx2-YccuPttL9HcMdZEnhv_OR7Bc6tfeu8/edit#gid=0 browsertrix-crawler requires Docker to be installed. If Docker is not installed an error message will be logged and things continue as normal. [browsertrix-crawler]: https://github.com/webrecorder/browsertrix-crawler [WACZ]: https://specs.webrecorder.net/wacz/latest/ [ReplayWeb.page]: https://replayweb.page	2022-09-25 19:46:29 +00:00
Miguel Sozinho Ramalho	0bdd06f641	Update README.md	2022-09-22 15:58:41 +02:00
msramalho	34536e7f14	added explanation for 2 twitter archivers	2022-06-27 11:17:23 +02:00
msramalho	ffe1c425a0	new archiver, new hack, ready	2022-06-27 01:07:55 +02:00
Miguel Sozinho Ramalho	3d9a2622c3	Update README.md	2022-06-16 16:23:53 +01:00
msramalho	14add43923	fixing auto_auto_archive	2022-06-16 17:17:25 +02:00
msramalho	3bffee41a0	README updates	2022-06-14 21:40:04 +02:00
msramalho	dc60bb1558	json -> yaml	2022-06-14 21:18:18 +02:00
msramalho	bd753b27ed	numbers in markdown	2022-06-14 20:55:30 +02:00
msramalho	c11a208253	more verbose about mandatory columns	2022-06-14 19:54:08 +02:00
msramalho	3019778b8f	readme updates	2022-06-07 18:52:19 +02:00
msramalho	3791afc94c	readme updates	2022-06-07 18:43:04 +02:00
msramalho	d46b8e1157	README updates	2022-06-07 18:41:43 +02:00
msramalho	10f03cb888	Merge branch 'dev' into refactor-configs	2022-06-02 17:30:47 +02:00
msramalho	159adf9afe	refactoring filenumber into subfolder	2022-05-26 19:18:29 +02:00
msramalho	03aa02e88b	diagram	2022-05-25 12:23:59 +02:00

1 2

67 Commity (main)