ZipPy: Fast method to classify text as AI or human-generated

Jacob Torrey 28de021d03 Added crossplag results Signed-off-by: Jacob Torrey <jacob@thinkst.com>		2023-06-20 21:08:51 -06:00
.github/workflows	rename to zippy.py	2023-06-09 03:46:31 -06:00
inch	Create README.md	2023-06-09 03:46:31 -06:00
nlzmadetect	Added browser extension support to Nim code	2023-06-09 03:46:31 -06:00
samples	Added ChatGPT sample	2023-06-09 03:46:30 -06:00
.gitignore	Initial commit	2023-06-09 03:44:42 -06:00
.gitmodules	Added code to make Nim compile to CLI and web	2023-06-09 03:46:28 -06:00
LICENSE	Initial commit	2023-06-09 03:44:42 -06:00
README.md	Added crossplag results	2023-06-20 21:08:51 -06:00
ai-generated.txt	Completed a 500/set test with CHEAT	2023-06-09 03:46:30 -06:00
ai_detect_roc.png	Added crossplag results	2023-06-20 21:08:51 -06:00
burstiness.py	Initial commit of burstiness analysis	2023-06-09 03:46:29 -06:00
crossplag-report.xml	Added crossplag results	2023-06-20 21:08:51 -06:00
crossplag_detect.py	Added initial crossplag harness	2023-06-20 21:08:51 -06:00
gptzero-report.xml	Completed a 500/set test with CHEAT	2023-06-09 03:46:30 -06:00
gptzero_detect.py	Added GPTZero API for testing and comparison	2023-06-09 03:46:29 -06:00
openai-report.xml	Completed a 500/set test with CHEAT	2023-06-09 03:46:30 -06:00
openai_detect.py	Added OpenAI's detector and all the test run reports along with a ROC diagram	2023-06-09 03:46:29 -06:00
plot_rocs.py	Added crossplag results	2023-06-20 21:08:51 -06:00
roberta-report.xml	Rerun with fixed Roberta script	2023-06-09 03:46:31 -06:00
roberta_detect.py	Add CUDA support for Roberta (local) and fix an alignment issue	2023-06-15 10:47:50 -06:00
roberta_local.py	Add CUDA support for Roberta (local) and fix an alignment issue	2023-06-15 10:47:50 -06:00
test_crossplag_detect.py	Added crossplag results	2023-06-20 21:08:51 -06:00
test_gptzero_detect.py	Completed a 500/set test with CHEAT	2023-06-09 03:46:30 -06:00
test_openai_detect.py	Completed a 500/set test with CHEAT	2023-06-09 03:46:30 -06:00
test_roberta_detect.py	Fix typo in CHEAT tests	2023-06-09 03:46:30 -06:00
test_zippy_detect.py	rename to zippy.py	2023-06-09 03:46:31 -06:00
zippy-report.xml	rename to zippy.py	2023-06-09 03:46:31 -06:00
zippy.py	Fix preset back to 2	2023-06-10 18:46:14 -06:00

README.md

ZipPy: Fast method to classify text as AI or human-generated

This is a research repo for fast AI detection using compression. While there are a number of existing LLM detection systems, they all use a large model, trained on either an LLM or its training data, to calculate the probability of each word given the preceding words, and then compute a score in which text with more high-probability tokens is judged more likely to be AI-originated. The techniques and tools in this repo look for faster approximations that are embeddable and more scalable.
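One common way to turn per-token probabilities into a single score is perplexity. The following is a minimal sketch, assuming the conditional probability of each token has already been obtained from some language model; the `perplexity` helper is hypothetical and is not part of this repo.

```python
import math


def perplexity(token_probs):
    """Perplexity of a sequence, given each token's conditional probability.

    Hypothetical helper for illustration; the probabilities are assumed to
    come from some language model that is not shown here.
    """
    if not token_probs:
        raise ValueError("need at least one token probability")
    # Average negative log-likelihood per token, then exponentiate.
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)
```

Text made mostly of high-probability tokens yields a low perplexity, which is the signal such detectors associate with AI-originated text.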

LZMA compression detector (`zippy.py` and `nlzmadetect`)

ZipPy uses LZMA compression ratios as a way to indirectly measure the perplexity of a text. Compression ratios have been used in the past to detect anomalies in network data for intrusion detection, so if perplexity is roughly a measure of anomalous tokens, it may be possible to use compression to detect low-perplexity text. LZMA builds a dictionary of seen tokens and then uses those in place of future tokens. The dictionary size, token length, etc. are all dynamic (though influenced by the 'preset' of 0-9, with 0 being the fastest but compressing worse than 9). The basic idea is to 'seed' an LZMA compression stream with a corpus of AI-generated text (`ai-generated.txt`) and then compare the compression ratio of just the seed data with that of the seed data with the sample appended. Samples that follow the seed more closely in word choice, structure, etc. will achieve a higher compression ratio due to the prevalence of similar tokens in the dictionary, while novel words, structures, etc. will appear anomalous to the seeded dictionary, resulting in a worse compression ratio.
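The seeding idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the code in `zippy.py`: the function names `compression_ratio` and `seeded_delta`, the newline used to join seed and sample, the definition of the ratio as original size over compressed size, and the default `preset=2` are all choices made for this example.

```python
import lzma


def compression_ratio(text: str, preset: int = 2) -> float:
    """Return original size / compressed size (higher means better compression).

    The ratio convention and the default preset are assumptions of this sketch.
    """
    raw = text.encode("utf-8")
    if not raw:
        raise ValueError("text must be non-empty")
    return len(raw) / len(lzma.compress(raw, preset=preset))


def seeded_delta(seed: str, sample: str, preset: int = 2) -> float:
    """Change in compression ratio when `sample` is appended to `seed`.

    Positive: the sample compressed well against the seed (seed-like text).
    Negative: the sample looked novel relative to the seed.
    """
    seed_only = compression_ratio(seed, preset)
    # Joining with a newline is an arbitrary choice for this sketch.
    seed_plus_sample = compression_ratio(seed + "\n" + sample, preset)
    return seed_plus_sample - seed_only
```

In this setting `seed` would be the contents of `ai-generated.txt`, so a larger delta suggests the sample resembles the AI-generated corpus. Where to place the decision threshold is left open here and would need to be chosen against labeled data.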

Current evaluation

The leading LLM detection tools are OpenAI's model detector (v2), GPTZero, and Roberta. The ROC plot in `ai_detect_roc.png` compares each of them with the LZMA detector across all the test datasets.
