Commit graph

67 Commits (main)

Author SHA1 Message Date
R. Miles McCain f603400d0d
Add direct Atlos integration (#137)
* Add Atlos feeder

* Add Atlos db

* Add Atlos storage

* Fix Atlos storages

* Fix Atlos feeder

* Only include URLs in Atlos feeder once they're processed

* Remove print

* Add Atlos documentation to README

* Formatting fixes

* Don't archive existing material

* avoid KeyError in atlos_db

* version bump

---------

Co-authored-by: msramalho <19508417+msramalho@users.noreply.github.com>
2024-04-15 19:25:17 +01:00
Miguel Sozinho Ramalho 7a21ae96af
V0.9.0 - closes several open issues: new enrichers and bug fixes (#133)
* clean orchestrator code, add archiver cleanup logic

* improves documentation for database.py

* telethon archivers isolate sessions into copied files

* closes #127

* closes #125

* closes #84

* meta enricher applies to all media

* closes #61 adds subtitles and comments

* minor update

* minor fixes to yt-dlp subtitles and comments

* closes #17 but logic is imperfect.

* closes #85 ssl enhancer

* minifies HTML, JS refactor for preview of certificates

* closes #91 adds freetsa timestamp authority

* version bump

* simplify download_url method

* skip ssl if nothing archived

* html preview improvements

* adds retrying lib

* manual download archiver improvements

* meta only runs when relevant data available

* new metadata convenience method

* html template improvements

* removes debug message

* does not close #91 yet, will need some more certificate chain logging

* adds verbosity config

* new instagram api archiver

* adds proxy support we

* adds proxy/end support and bug fix for yt-dlp

* proxy support for webdriver

* adds socks proxy to wacz_enricher

* refactor recursivity in inner media and display

* infinite recursive display

* foolproofing timestamping authorities

* version to 0.9.0

* minor fixes from code-review
2024-02-20 18:05:29 +00:00
Tomas Apodaca 590d3fe824
Fix typo in readme (#121) 2024-01-24 21:17:31 +00:00
msramalho b7889a182d readme update 2023-06-26 18:18:46 +01:00
msramalho 04f827f183 Bump version to v0.5.25 for release 2023-06-26 18:15:45 +01:00
Miguel Sozinho Ramalho cc03ad7c49
Update README.md 2023-05-11 13:55:28 +01:00
Logan Williams 6d2aa3dd7a Add invocation example 2023-05-11 14:32:23 +02:00
Logan Williams f2e580de4e Update README images 2023-05-11 14:30:27 +02:00
Logan Williams 80ea912d0e Update README 2023-05-11 11:32:46 +02:00
Logan Williams 26373d4545 Re-order README slightly 2023-05-10 11:48:34 +02:00
Miguel Sozinho Ramalho b67a7b818a
Merge pull request #75 from bellingcat/feature/browsertrix 2023-05-10 10:14:40 +01:00
Logan Williams 2e63cb8411 Update README with new entrypoint 2023-05-10 11:13:47 +02:00
msramalho e150370657 updates docker instructions 2023-05-10 09:51:53 +01:00
msramalho ae3e607705 fix: deprecating thumbnail_index 2023-05-09 11:29:05 +01:00
msramalho 876988b587 detect invalid url messages instagram bot 2023-02-20 12:22:52 +00:00
msramalho d1e4574c6c readme updates 2023-02-17 16:30:50 +00:00
msramalho f35875a94c name fix 2023-02-17 15:46:05 +00:00
msramalho 224ebe7ee8 links 2023-02-08 22:27:56 +00:00
msramalho 54a1bc2172 update readme 2023-02-08 22:26:24 +00:00
msramalho 77948207d1 update 2023-02-08 22:24:40 +00:00
msramalho 60552ae0ea update readme 2023-02-08 22:23:25 +00:00
msramalho f255271ecb update README 2023-02-08 22:17:22 +00:00
msramalho 2a7ece5dcc cleanups and docs 2023-02-08 22:13:19 +00:00
msramalho d31b3dda52 Bump version to v0.2.17 for release 2023-02-07 23:56:42 +00:00
msramalho f81ff14faa license to publish 2023-02-07 23:43:50 +00:00
msramalho 5ed38ffaab clean readme 2023-02-07 23:37:53 +00:00
msramalho 9b4a41e654 Bump version to v0.2.0 for release 2023-02-07 22:07:23 +00:00
msramalho b3860cfec1 telethon join channels working 2022-12-14 14:01:39 +00:00
msramalho 65dd155c90 WIP refactor logic 2022-11-15 15:00:52 +00:00
msramalho 22363cb8b9 adds information on browsertrix usage 2022-10-20 11:59:23 +01:00
msramalho ac4f1b6132 readme updates 2022-10-19 11:37:04 +01:00
msramalho 26903190fd adds wacz link 2022-10-17 14:41:34 +01:00
msramalho 57464f1506 refactors for edges in browsertrix and s3 upload, adds timeout parameter 2022-10-17 14:07:31 +01:00
Ed Summers c34fb9cf10
Add browsertrix profile config option
This commit adds a browsertrix profile option to the configuration. In
order to not require the passing of the browsertrix config to every
Archiver, the Archiver constructors (including the base) were modified to
accept a Storage and Config instance. Some of the constructors then pick
out the pieces they need from the Config, in addition to calling the
parent constructor. In order to avoid a circular import that this
created, the Config object now defines the default hash function to use,
rather than having it be a static property of the Archiver class.
2022-10-11 16:21:42 -04:00
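
The constructor change described in the commit above can be pictured with a minimal sketch. This is not the repository's code: the `Storage`, `Config`, `Archiver`, and `ScreenshotArchiver` classes and the `hash_algorithm` and `browsertrix_profile` field names are hypothetical stand-ins, chosen only to show the default hash function living on the config object rather than on the archiver class.

```python
import hashlib
from dataclasses import dataclass
from typing import Callable, Optional


class Storage:
    """Hypothetical stand-in for a storage backend."""


@dataclass
class Config:
    # Hypothetical fields: the default hash function is defined here,
    # so archiver modules do not need to be imported by the config.
    hash_algorithm: Callable = hashlib.sha256
    browsertrix_profile: Optional[str] = None


class Archiver:
    """Base archiver: receives both a Storage and a Config instance."""

    def __init__(self, storage: Storage, config: Config):
        self.storage = storage
        self.hash_algorithm = config.hash_algorithm

    def get_hash(self, data: bytes) -> str:
        # Hash with whatever function the Config supplied.
        return self.hash_algorithm(data).hexdigest()


class ScreenshotArchiver(Archiver):
    """Subclass that picks the extra pieces it needs from the Config."""

    def __init__(self, storage: Storage, config: Config):
        super().__init__(storage, config)
        self.browsertrix_profile = config.browsertrix_profile


archiver = ScreenshotArchiver(Storage(), Config(browsertrix_profile="profile.tar.gz"))
print(archiver.browsertrix_profile)
print(archiver.get_hash(b"example"))
```

With this shape, adding a new option means adding a field to the config and reading it in the one constructor that needs it, instead of widening every constructor signature.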
Ed Summers 3b87dffe6b Add browsertrix-crawler capture
The [browsertrix-crawler] utility is a browser-based crawler that can
crawl one or more pages. browsertrix-crawler creates archives in the
[WACZ] format which is essentially a standardized ZIP file (similar to DOCX, EPUB, JAR, etc) which can then be replayed using the [ReplayWeb.page] web
component, or unzipped to get the original WARC data (the ISO standard
format used by the Internet Archive Wayback Machine).

This PR adds browsertrix-crawler to archiver classes where screenshots are made. The WACZ is uploaded to storage and then added to a new column in the spreadsheet. A column can be added that will display the WACZ, loaded from cloud storage (S3, DigitalOcean, etc) using the client-side ReplayWeb.page component. You can see an example of the spreadsheet here:

https://docs.google.com/spreadsheets/d/1Tk-iJWzT9Sx2-YccuPttL9HcMdZEnhv_OR7Bc6tfeu8/edit#gid=0

browsertrix-crawler requires Docker to be installed. If Docker is not
installed, an error message will be logged and processing continues as normal.

[browsertrix-crawler]: https://github.com/webrecorder/browsertrix-crawler
[WACZ]: https://specs.webrecorder.net/wacz/latest/
[ReplayWeb.page]: https://replayweb.page
2022-09-25 19:46:29 +00:00
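
Two points from the commit message above lend themselves to a short illustration: a WACZ is a ZIP file whose WARC data can be pulled out with ordinary ZIP tooling, and a missing Docker installation is logged rather than treated as fatal. The sketch below uses only the Python standard library. The function names `docker_available` and `list_warc_members` are hypothetical and are not taken from the repository, and the archive built in the usage section is a dummy with placeholder contents.

```python
import io
import logging
import shutil
import zipfile

logger = logging.getLogger(__name__)


def docker_available() -> bool:
    """Return True if a `docker` executable is found on PATH."""
    return shutil.which("docker") is not None


def list_warc_members(wacz_file) -> list:
    """List the WARC entries inside a WACZ (a ZIP archive)."""
    with zipfile.ZipFile(wacz_file) as zf:
        return [
            name for name in zf.namelist()
            if name.endswith((".warc", ".warc.gz"))
        ]


# Usage: log and carry on when Docker is missing, as the commit describes.
if not docker_available():
    logger.error("Docker not found; skipping WACZ capture")

# Build a dummy in-memory archive to show the ZIP handling.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("datapackage.json", "{}")
    zf.writestr("archive/data.warc.gz", b"placeholder bytes")
buffer.seek(0)
print(list_warc_members(buffer))
```

Because the check only looks for the executable on PATH, it does not confirm that the Docker daemon is running; a real integration would likely need to handle a failed container start as well.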
Miguel Sozinho Ramalho 0bdd06f641
Update README.md 2022-09-22 15:58:41 +02:00
msramalho 34536e7f14 added explanation for 2 twitter archivers 2022-06-27 11:17:23 +02:00
msramalho ffe1c425a0 new archiver, new hack, ready 2022-06-27 01:07:55 +02:00
Miguel Sozinho Ramalho 3d9a2622c3
Update README.md 2022-06-16 16:23:53 +01:00
msramalho 14add43923 fixing auto_auto_archive 2022-06-16 17:17:25 +02:00
msramalho 3bffee41a0 README updates 2022-06-14 21:40:04 +02:00
msramalho dc60bb1558 json -> yaml 2022-06-14 21:18:18 +02:00
msramalho bd753b27ed numbers in markdown 2022-06-14 20:55:30 +02:00
msramalho c11a208253 more verbose about mandatory columns 2022-06-14 19:54:08 +02:00
msramalho 3019778b8f readme updates 2022-06-07 18:52:19 +02:00
msramalho 3791afc94c readme updates 2022-06-07 18:43:04 +02:00
msramalho d46b8e1157 README updates 2022-06-07 18:41:43 +02:00
msramalho 10f03cb888 Merge branch 'dev' into refactor-configs 2022-06-02 17:30:47 +02:00
msramalho 159adf9afe refactoring filenumber into subfolder 2022-05-26 19:18:29 +02:00
msramalho 03aa02e88b diagram 2022-05-25 12:23:59 +02:00