Mirror of https://github.com/thinkst/zippy
Added ability to plot ROC curves for test results
Signed-off-by: Jacob Torrey <jacob@thinkst.com>
pull/6/head
parent 2474b058e6
commit 38a5160988
@@ -565,32 +565,8 @@ Another aspect of the book that I loved is the magical realism woven throughout
In conclusion, "The Alchemist" is a captivating tale of self-discovery and following one's dreams. Through Santiago's journey, we learn valuable life lessons about perseverance, listening to our hearts, and embracing the magic of the world around us. Coelho's simple yet evocative writing style and the book's themes of personal growth make it a must-read for anyone seeking inspiration and a deeper understanding of their own journey in life. So grab a copy, embark on this transformative adventure, and discover your own Personal Legend. Happy reading!
The IDG News Service's privacy team offers an in-depth guide to protecting your privacy when online. See answer text below for tips on how to stay safe.
Republicans narrowly defied a presidential veto, enabling the transportation bill to pass with strong House support on Tuesday in a nod to efforts by New Jersey Republican Rep. Frank Pallone to crack down on data breaches.
The $24 billion package would add new security measures to electronic ticketing systems, and it would also rewrite an electronic system that has converted various records into digital driver's license cards in times of increased rail use.
The bill cleared the House without objections, 113-304, but President Barack Obama vetoed it June 13.
Still, the House vote highlights increasing concern, especially by cybersecurity experts, about attacks on corporations, politicians and other more generalized online threats.
Cognizant of that reality, this week the House cybersecurity subcommittee pledged to reconvene next month to reconsider the fix for the bridge-ticketing system and the agency's use of electromagnetic spectrum for radar.
In the meantime, the Center for Strategic and International Studies (CSIS), the respected bipartisan policy and research outfit, announced that in the fifth annual "Cybersecurity in the United States: Insights and Recommendations" study it had added New Jersey Rep. Pallone as co-author of its 2013 report on cybersecurity.
Rep. Pallone's main House Intelligence and Oversight Subcommittee assigned the report short questionnaire responses and release copyright information for the 13-page publication.
The lengthy document offers an in-depth description of how the New Jersey legislator completed an assessment of the situation for his state, including information on compromise structures completed by the New Jersey National Security Agency (N.J. N.S.A.).
The United States should be "ready to act" with support from the "international community, if necessary," France's President Francois Hollande told parliament Wednesday.
PARIS (Sputnik) — Hollande said when he mentioned the possibility to France's lawmakers that he played a "role in the creation of Isis" and is "diagnosed as such," calling the frequent appearances by American President Donald Trump "incompetent" and "insulting."
Abstract This work explores the capacities of character-based Neural Machine Translation to translate noisy User-Generated Content (UGC) with a strong focus on exploring the limits of such approaches to handle productive UGC phenomena, which almost by definition, cannot be seen at training time. Within a strict zero-shot scenario, we first study the detrimental impact on translation performance of various user-generated content phenomena on a small annotated dataset we developed, and then show that such models are indeed incapable of handling unknown letters, which leads to catastrophic translation failure once such characters are encountered. We further confirm this behavior with a simple, yet insightful, copy task experiment and highlight the importance of reducing the vocabulary size hyper-parameter to increase the robustness of character-based models for machine translation.

Introduction. Neural Machine Translation (NMT) models fall far short from being able to translate noisy User-Generated Content (UGC): the quality of their translation is often even worse than that of a traditional phrase-based system (Khayrallah and Koehn, 2018; Rosales Núñez et al., 2019). In addition to ambiguous grammatical constructs and profusion of ellipsis, the main difficulty encountered when translating UGC is the high number of out-of-vocabulary tokens (OOVs) resulting from misspelled words, emoticons, hashtags, mentions, and all specific constructs used in online forums and social media (Foster, 2010; Seddah et al., 2012; Eisenstein, 2013; Sanguinetti et al., 2020). Some of those phenomena can be perceived as noise while the others are typical markers of language variation among speakers. Moreover, a certain amount of those same phenomena operate at the lexical level (either at the character, subword or word levels) (Sanguinetti et al., 2020). This is why, focusing more on the noise axis, char-based models appear to offer a natural solution to this problem (Luong and Manning, 2016; Ling et al., 2015): indeed these open vocabulary models are designed specifically to address the OOV problem. In this work, we explore the ability of out-of-the-box character-based NMT models (Lee et al., 2017) to address the challenges raised by UGC translation. While character-based models may seem promising for such a task, to the best of our knowledge, they have only been tested either on data sets in which noise has been artificially added through sampling an edited word error data set (Belinkov and Bisk, 2018; Ebrahimi et al., 2018a) or on canonical data sets, on which they prove to be very effective for translating morphologically-rich languages with a high number of OOVs (Luong and Manning, 2016). However, our starting-point experiments show that character-based systems are outperformed by BPE models even when translating noisy UGC. To understand this counter-intuitive result, we conduct several experiments and analyses. In particular, we manually annotated 400 sentences at the token level using a fine-grained typology, to perform our analyses. These sentences correspond to the worst and the best translated utterances of two MT systems (a char-based and a transformer-based model). Our results highlight the extreme sensibility of character-based models to the vocabulary size, a parameter often overlooked in the literature. Using a simple set of experiments, we thus show that these models are unable to perform an easy copy task due to their poor handling of unknown and rare characters. By adjusting the vocabulary size parameter, we drastically improve the robustness of our character-based model without causing a large drop in in-domain performance. Our contributions are as follows: • we provide an annotated data set that enables in-depth evaluations of the impact of UGC idiosyncrasies; • we demonstrate that char-based neural machine translation models are extremely sensitive to unknown and rare characters on both synthetic data and noisy user-generated content; • we show how an overlooked hyper-parameter drastically improves char-based MT models' robustness to natural noise while maintaining the in-domain level of performance.

Conclusion. We showed that in zero-shot scenarios, char-based models are not robust to UGC idiosyncrasies. We presented several experiments that explained this counter-intuitive result by an over-sensibility of these models to the vocabulary size. We demonstrated that drastically lowering this parameter improves the robustness of our char-based models when facing noisy UGC while maintaining almost the same level of performance for in-domain and out-of-domain canonical datasets. Interestingly, we noticed a lack in the literature regarding the importance of vocabulary considerations when training character-based NMT models.
Something Wicked This Way Comes was a band that put out two EPs, but disbanded before they could do anything else. The self-titled EP is a more metal oriented affair than One More Day, but it still has a polished, high quality sound. They're known for the metal/rock track Fear and a cover of Badlands by Bruce Springsteen. The latter track is very good. The former track has a somewhat weaker rock sound, and doesn't seem to fit with the other songs on the EP. While I prefer One More Day overall, this EP is definitely worth hearing.
Singer Chris Brown was released from a Los Angeles County jail earlier today. The 27-year-old had been held for several weeks after admitting to a probation violation in May. Brown had been on probation for an assault charge from 2009. He was released early due to overcrowding in the jail. Despite being released from jail, Brown still faces charges in Washington, DC. He was charged with misdemeanor assault in October 2013 after allegedly punching a man outside a hotel. Brown's trial in Washington has been delayed several times, with the latest date set for a few weeks from now. Brown has been in and out of legal trouble since 2009. In addition to the assault charge, he's faced charges for vandalism, hit-and-run, and attacking his former girlfriend, Rihanna. He's also faced several civil lawsuits. Despite his legal issues, Brown has continued to release music, with his most recent album, Royalty, coming out in 2015.
Former U.S. Senator Harry Reid's wife, Landra, was involved in a car accident on Thursday. The accident left her with multiple injuries, including a broken neck, a vertebra, and her nose. Following surgery, she has regained some of her abilities and is expected to undergo physical therapy before being released from the hospital. According to the attending physician, Landra Reid has made significant progress since her surgery. "She is able to get out of bed, and she's able to swallow some," the doctor said, adding that she can also move her arms and legs. Despite the positive news, the doctor emphasized that Mrs. Reid has a long recovery road ahead of her. The accident occurred on a quiet road in Virginia, according to authorities. Landra Reid was reportedly the only passenger in the car and was wearing her seatbelt at the time of the accident. There is no word yet on what caused the accident, and police are investigating. The Reid family released a statement thanking everyone for their support and asking for privacy during this difficult time. "We appreciate the overwhelming support and kindness shown to Landra and our family during the past few days. We ask that everyone continue to keep her in their thoughts and prayers as she recovers." Landra Reid is a prominent figure in the political world and has been married to former Senator Harry Reid for more than fifty years. The couple has five children and several grandchildren. Landra has been a strong advocate for healthcare reform and worked closely with her husband during his time in Congress. The news of Landra's accident has generated an outpouring of support from politicians and the public alike. Many have taken to social media to express their concern and offer their thoughts and prayers to the Reid family. Senator Catherine Cortez Masto, who served as Reid's former chief of staff, released a statement expressing her deep concern for Landra's well-being. "Landra Reid is one of the most gracious and caring people I've ever met," she said. "My thoughts are with her and her family at this difficult time." Former Secretary of State Hillary Clinton also expressed her support for the Reid family on Twitter, writing, "Bill and I are sending our thoughts and prayers to Harry Reid and his family, especially his wife Landra, as they deal with the aftermath of a serious accident. We are with you, and we are sending our love." As Landra Reid continues her recovery, the Reid family has asked for continued privacy and respect for their situation. The community continues to rally around the family, offering their support and well wishes during this difficult time.
@@ -34,7 +34,7 @@ class LzmaLlmDetector:
    '''Class providing functionality to attempt to detect LLM/generative AI generated text using the LZMA compression algorithm'''
    def __init__(self, prelude_file : Optional[str] = None, fuzziness_digits : int = 3, prelude_str : Optional[str] = None, prelude_ratio : Optional[float] = None) -> None:
        '''Initializes a compression with the passed prelude file, and optionally the number of digits to round to compare prelude vs. sample compression'''
-       self.PRESET : int = 3
+       self.PRESET : int = 2
        self.comp = lzma.LZMACompressor(preset=self.PRESET)
        self.c_buf : List[bytes] = []
        self.in_bytes : int = 0
@@ -114,7 +114,7 @@ class LzmaLlmDetector:
            determination = 'Human'
        #if abs(delta * 100) < .1 and determination == 'AI':
        #    print("Very low-confidence determination of: " + determination)
-       return (determination, abs(delta * 100))
+       return (determination, abs(delta * 1000))

def run_on_file(filename : str, fuzziness : int = 3) -> Optional[Tuple[str, float]]:
    '''Given a filename (and an optional number of decimal places to round to) returns the score for the contents of that file'''
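For context on the two lzma_detect.py changes above: the detector scores a sample by how much better (or worse) it compresses when appended to a known AI-generated prelude, so dropping the LZMA preset from 3 to 2 trades a little compression quality for speed and memory, and returning abs(delta * 1000) instead of abs(delta * 100) only rescales the reported confidence. A minimal sketch of the underlying ratio comparison (illustrative names, not the repository's API):

import lzma

def compression_ratio(text: str, preset: int = 2) -> float:
    '''Compressed size over raw size; lower means the text is more compressible.'''
    raw = text.encode('utf-8')
    return len(lzma.compress(raw, preset=preset)) / len(raw)

def score_against_prelude(prelude: str, sample: str) -> float:
    '''Delta between the prelude's ratio and the ratio of prelude + sample.
    A sample statistically similar to the prelude drags the combined ratio
    down, yielding a positive delta (more "AI-like" under this heuristic).'''
    return compression_ratio(prelude) - compression_ratio(prelude + sample)

The rescaling matters downstream: the tests below compare the score against CONFIDENCE_THRESHOLD and the new ROC script treats its magnitude as a confidence, so a tenfold change in scale shifts any fixed threshold accordingly.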
@@ -0,0 +1,64 @@
#!/usr/bin/env python3

import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc
from lzma_detect import run_on_file_chunked, PRELUDE_STR, LzmaLlmDetector
from pathlib import Path
from itertools import chain
from math import sqrt
from junitparser import JUnitXml

MODELS = ['lzma', 'roberta']

plt.figure()

for model in MODELS:
    xml = JUnitXml.fromfile(f'{model}-report.xml')
    cases = []
    for suite in xml:
        for case in suite:
            cases.append(case)

    truths = []
    scores = []
    for c in cases:
        # The recorded "score" property is the first child of the testcase's
        # <properties> element; values()[1] is the property's value attribute
        score = float(c._elem.getchildren()[0].getchildren()[0].values()[1])
        if 'human' in c.name:
            truths.append(1)
            if c.is_passed:
                scores.append(score)
            else:
                scores.append(score * -1.0)
        else:
            truths.append(-1)
            if c.is_passed:
                scores.append(score * -1.0)
            else:
                scores.append(score)

    y_true = np.array(truths)
    y_scores = np.array(scores)

    # Compute the false positive rate (FPR), true positive rate (TPR), and threshold values
    fpr, tpr, thresholds = roc_curve(y_true, y_scores)
    # Calculate the g-mean for each threshold and locate the index of the largest g-mean
    gmeans = np.sqrt(tpr * (1 - fpr))
    ix = np.argmax(gmeans)
    print('Best Threshold=%f, G-Mean=%.3f' % (thresholds[ix], gmeans[ix]))
    print(thresholds)

    # Calculate the area under the ROC curve (AUC)
    roc_auc = auc(fpr, tpr)

    # Plot the ROC curve for this model, marking the g-mean-optimal threshold
    plt.plot(fpr, tpr, lw=2, label=model.capitalize() + ': ROC curve (AUC = %0.2f)' % roc_auc)
    plt.scatter(fpr[ix], tpr[ix], marker='o', color='black', label='Best @ threshold = %0.2f' % thresholds[ix])

plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--', label="Random classifier")
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic for LLM detection')
plt.legend(loc="lower right")
plt.savefig('ai_detect_roc.png')
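A note on how this script gets its input (inferred from the test changes below, not spelled out in the diff): each report is pytest's JUnit XML output, e.g. pytest --junitxml=lzma-report.xml, and the score read above is the value pytest writes for each record_property("score", ...) call, roughly:

<testcase name="test_human_samples[some-file.txt]" time="0.12">
  <properties>
    <property name="score" value="0.01732"/>
  </properties>
</testcase>

(File name, time, and value here are illustrative.) The positional getchildren() indexing assumes <properties> is the testcase's first child and "score" its first property, and it relies on the lxml-backed element that junitparser exposes as _elem; getchildren() is deprecated in newer lxml/ElementTree, where list(elem) is the equivalent.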
@@ -7,6 +7,9 @@ from lzma_detect import run_on_file_chunked, run_on_text_chunked, PRELUDE_STR, LzmaLlmDetector
AI_SAMPLE_DIR = 'samples/llm-generated/'
HUMAN_SAMPLE_DIR = 'samples/human-generated/'

+MIN_LEN = 150
+NUM_JSONL_SAMPLES = 50
+
ai_files = os.listdir(AI_SAMPLE_DIR)
human_files = os.listdir(HUMAN_SAMPLE_DIR)
@@ -15,12 +18,15 @@ CONFIDENCE_THRESHOLD : float = 0.00 # What confidence to treat as error vs warning

PRELUDE_RATIO = LzmaLlmDetector(prelude_str=PRELUDE_STR).prelude_ratio

-def test_training_file():
-    assert run_on_file_chunked('ai-generated.txt')[0] == 'AI', 'The training corpus should always be detected as AI-generated... since it is'
+def test_training_file(record_property):
+    (classification, score) = run_on_file_chunked('ai-generated.txt')
+    record_property("score", str(score))
+    assert classification == 'AI', 'The training corpus should always be detected as AI-generated... since it is'

@pytest.mark.parametrize('f', human_files)
-def test_human_samples(f):
+def test_human_samples(f, record_property):
    (classification, score) = run_on_file_chunked(HUMAN_SAMPLE_DIR + f, fuzziness=FUZZINESS, prelude_ratio=PRELUDE_RATIO)
+    record_property("score", str(score))
    if score > CONFIDENCE_THRESHOLD:
        assert classification == 'Human', f + ' is a human-generated file, misclassified as AI-generated with confidence ' + str(round(score, 8))
    else:
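These record_property("score", ...) calls are what surface each classifier score into the JUnit XML that the plotting script above consumes. If the positional indexing there ever proves brittle, a more defensive lookup by property name is possible; a sketch (assuming only junitparser's fromfile/iteration and the same _elem attribute the script already uses):

from junitparser import JUnitXml

xml = JUnitXml.fromfile('lzma-report.xml')
for suite in xml:
    for case in suite:
        # Search the testcase element for the recorded property by name
        for prop in case._elem.iter('property'):
            if prop.get('name') == 'score':
                print(case.name, float(prop.get('value')))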
@@ -30,8 +36,9 @@ def test_human_samples(f):
        warn("Unable to confidently classify: " + f)

@pytest.mark.parametrize('f', ai_files)
-def test_llm_sample(f):
+def test_llm_sample(f, record_property):
    (classification, score) = run_on_file_chunked(AI_SAMPLE_DIR + f, fuzziness=FUZZINESS, prelude_ratio=PRELUDE_RATIO)
+    record_property("score", str(score))
    if score > CONFIDENCE_THRESHOLD:
        assert classification == 'AI', f + ' is an LLM-generated file, misclassified as human-generated with confidence ' + str(round(score, 8))
    else:
@@ -40,44 +47,46 @@ def test_llm_sample(f):
    else:
        warn("Unable to confidently classify: " + f)

-MIN_LEN = 150
-
-# HUMAN_JSONL_FILE = 'samples/webtext.test.jsonl'
-# human_samples = []
-# with jsonlines.open(HUMAN_JSONL_FILE) as reader:
-#     for obj in reader:
-#         if obj.get('length', 0) >= MIN_LEN:
-#             human_samples.append(obj)
+HUMAN_JSONL_FILE = 'samples/webtext.test.jsonl'
+human_samples = []
+with jsonlines.open(HUMAN_JSONL_FILE) as reader:
+    for obj in reader:
+        if obj.get('length', 0) >= MIN_LEN:
+            human_samples.append(obj)

-# @pytest.mark.parametrize('i', human_samples[0:250])
-# def test_human_jsonl(i):
-#     (classification, score) = run_on_text_chunked(i.get('text', ''), fuzziness=FUZZINESS, prelude_ratio=PRELUDE_RATIO)
-#     assert classification == 'Human', HUMAN_JSONL_FILE + ':' + str(i.get('id')) + ' (len: ' + str(i.get('length', -1)) + ') is a human-generated sample, misclassified as AI-generated with confidence ' + str(round(score, 8))
+@pytest.mark.parametrize('i', human_samples[0:NUM_JSONL_SAMPLES])
+def test_human_jsonl(i, record_property):
+    (classification, score) = run_on_text_chunked(i.get('text', ''), fuzziness=FUZZINESS, prelude_ratio=PRELUDE_RATIO)
+    record_property("score", str(score))
+    assert classification == 'Human', HUMAN_JSONL_FILE + ':' + str(i.get('id')) + ' (len: ' + str(i.get('length', -1)) + ') is a human-generated sample, misclassified as AI-generated with confidence ' + str(round(score, 8))

-# AI_JSONL_FILE = 'samples/xl-1542M.test.jsonl'
-# ai_samples = []
-# with jsonlines.open(AI_JSONL_FILE) as reader:
-#     for obj in reader:
-#         if obj.get('length', 0) >= MIN_LEN:
-#             ai_samples.append(obj)
+AI_JSONL_FILE = 'samples/xl-1542M.test.jsonl'
+ai_samples = []
+with jsonlines.open(AI_JSONL_FILE) as reader:
+    for obj in reader:
+        if obj.get('length', 0) >= MIN_LEN:
+            ai_samples.append(obj)

-# @pytest.mark.parametrize('i', ai_samples[0:250])
-# def test_gpt2_jsonl(i):
-#     (classification, score) = run_on_text_chunked(i.get('text', ''), fuzziness=FUZZINESS, prelude_ratio=PRELUDE_RATIO)
-#     assert classification == 'AI', AI_JSONL_FILE + ':' + str(i.get('id')) + ' (text: ' + i.get('text', "").replace('\n', ' ')[:50] + ') is an LLM-generated sample, misclassified as human-generated with confidence ' + str(round(score, 8))
+@pytest.mark.parametrize('i', ai_samples[0:NUM_JSONL_SAMPLES])
+def test_gpt2_jsonl(i, record_property):
+    (classification, score) = run_on_text_chunked(i.get('text', ''), fuzziness=FUZZINESS, prelude_ratio=PRELUDE_RATIO)
+    record_property("score", str(score))
+    assert classification == 'AI', AI_JSONL_FILE + ':' + str(i.get('id')) + ' (text: ' + i.get('text', "").replace('\n', ' ')[:50] + ') is an LLM-generated sample, misclassified as human-generated with confidence ' + str(round(score, 8))

-# GPT3_JSONL_FILE = 'samples/GPT-3-175b_samples.jsonl'
-# gpt3_samples = []
-# with jsonlines.open(GPT3_JSONL_FILE) as reader:
-#     for o in reader:
-#         for l in o.split('<|endoftext|>'):
-#             if len(l) >= MIN_LEN:
-#                 gpt3_samples.append(l)
+GPT3_JSONL_FILE = 'samples/GPT-3-175b_samples.jsonl'
+gpt3_samples = []
+with jsonlines.open(GPT3_JSONL_FILE) as reader:
+    for o in reader:
+        for l in o.split('<|endoftext|>'):
+            if len(l) >= MIN_LEN:
+                gpt3_samples.append(l)

-# @pytest.mark.parametrize('i', gpt3_samples)
-# def test_gpt3_jsonl(i):
-#     (classification, score) = run_on_text_chunked(i, fuzziness=FUZZINESS, prelude_ratio=PRELUDE_RATIO)
-#     assert classification == 'AI', GPT3_JSONL_FILE + ' is an LLM-generated sample, misclassified as human-generated with confidence ' + str(round(score, 8))
+@pytest.mark.parametrize('i', gpt3_samples[0:NUM_JSONL_SAMPLES])
+def test_gpt3_jsonl(i, record_property):
+    (classification, score) = run_on_text_chunked(i, fuzziness=FUZZINESS, prelude_ratio=PRELUDE_RATIO)
+    record_property("score", str(score))
+    assert classification == 'AI', GPT3_JSONL_FILE + ' is an LLM-generated sample, misclassified as human-generated with confidence ' + str(round(score, 8))

NEWS_JSONL_FILE = 'samples/news.jsonl'
news_samples = []
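One consequence of re-enabling these module-level jsonlines.open calls is that the sample files must exist at pytest collection time, or the whole test module fails to import. If that ever bites, a module-level skip placed after the constants are defined is one option; a sketch (this guard is a suggestion, not part of this commit):

import os
import pytest

if not os.path.exists(HUMAN_JSONL_FILE):
    pytest.skip('webtext samples not present', allow_module_level=True)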
@@ -85,12 +94,14 @@ with jsonlines.open(NEWS_JSONL_FILE) as reader:
    for obj in reader:
        news_samples.append(obj)

-@pytest.mark.parametrize('i', news_samples[0:250])
-def test_humannews_jsonl(i):
+@pytest.mark.parametrize('i', news_samples[0:NUM_JSONL_SAMPLES])
+def test_humannews_jsonl(i, record_property):
    (classification, score) = run_on_text_chunked(i.get('human', ''), fuzziness=FUZZINESS, prelude_ratio=PRELUDE_RATIO)
+    record_property("score", str(score))
    assert classification == 'Human', NEWS_JSONL_FILE + ' is a human-generated sample, misclassified as AI-generated with confidence ' + str(round(score, 8))

-@pytest.mark.parametrize('i', news_samples[0:250])
-def test_chatgptnews_jsonl(i):
+@pytest.mark.parametrize('i', news_samples[0:NUM_JSONL_SAMPLES])
+def test_chatgptnews_jsonl(i, record_property):
    (classification, score) = run_on_text_chunked(i.get('chatgpt', ''), fuzziness=FUZZINESS, prelude_ratio=PRELUDE_RATIO)
+    record_property("score", str(score))
    assert classification == 'AI', NEWS_JSONL_FILE + ' is an AI-generated sample, misclassified as human-generated with confidence ' + str(round(score, 8))