Added ChatGPT-generated news articles

Signed-off-by: Jacob Torrey <jacob@thinkst.com>
pull/6/head
Jacob Torrey 2023-05-18 10:28:55 -06:00
parent e0fc7f956e
commit 15766880c3
4 changed files with 5089 additions and 22 deletions

View file

@@ -565,4 +565,34 @@ Another aspect of the book that I loved is the magical realism woven throughout
In conclusion, "The Alchemist" is a captivating tale of self-discovery and following one's dreams. Through Santiago's journey, we learn valuable life lessons about perseverance, listening to our hearts, and embracing the magic of the world around us. Coelho's simple yet evocative writing style and the book's themes of personal growth make it a must-read for anyone seeking inspiration and a deeper understanding of their own journey in life. So grab a copy, embark on this transformative adventure, and discover your own Personal Legend. Happy reading!
Abstract This work explores the capacities of character-based Neural Machine Translation to translate noisy User-Generated Content (UGC) with a strong focus on exploring the limits of such approaches to handle productive UGC phenomena, which, almost by definition, cannot be seen at training time. Within a strict zero-shot scenario, we first study the detrimental impact on translation performance of various user-generated content phenomena on a small annotated dataset we developed, and then show that such models are indeed incapable of handling unknown letters, which leads to catastrophic translation failure once such characters are encountered. We further confirm this behavior with a simple, yet insightful, copy task experiment and highlight the importance of reducing the vocabulary size hyper-parameter to increase the robustness of character-based models for machine translation. Introduction. Neural Machine Translation (NMT) models fall far short of being able to translate noisy User-Generated Content (UGC): the quality of their translation is often even worse than that of a traditional phrase-based system (Khayrallah and Koehn, 2018; Rosales Núñez et al., 2019). In addition to ambiguous grammatical constructs and profusion of ellipsis, the main difficulty encountered when translating UGC is the high number of out-of-vocabulary tokens (OOVs) resulting from misspelled words, emoticons, hashtags, mentions, and all specific constructs used in online forums and social media (Foster, 2010; Seddah et al., 2012; Eisenstein, 2013; Sanguinetti et al., 2020). Some of those phenomena can be perceived as noise while the others are typical markers of language variation among speakers. Moreover, a certain amount of those same phenomena operate at the lexical level (either at the character, subword or word levels) (Sanguinetti et al., 2020). This is why, focusing more on the noise axis, char-based models appear to offer a natural solution to this problem (Luong and Manning, 2016; Ling et al., 2015): indeed these open vocabulary models are designed specifically to address the OOV problem. In this work, we explore the ability of out-of-the-box character-based NMT models (Lee et al., 2017) to address the challenges raised by UGC translation. While character-based models may seem promising for such a task, to the best of our knowledge, they have only been tested either on data sets in which noise has been artificially added through sampling an edited word error data set (Belinkov and Bisk, 2018; Ebrahimi et al., 2018a) or on canonical data sets, in which they prove to be very effective for translating morphologically-rich languages with a high number of OOVs (Luong and Manning, 2016). However, our starting-point experiments show that character-based systems are outperformed by BPE models even when translating noisy UGC. To understand this counter-intuitive result, we conduct several experiments and analyses. In particular, we manually annotated 400 sentences at the token level using a fine-grained typology to perform our analyses. These sentences correspond to the worst and the best translated utterances of two MT systems (a char-based and a transformer-based model). Our results highlight the extreme sensibility of character-based models to the vocabulary size, a parameter often overlooked in the literature. Using a simple set of experiments, we thus show that these models are unable to perform an easy copy task due to their poor handling of unknown and rare characters.
By adjusting the vocabulary size parameter, we drastically improve the robustness of our character-based model without causing a large drop in in-domain performance. Our contributions are as follows: • we provide an annotated data set that enables in-depth evaluations of the impact of UGC idiosyncrasies; • we demonstrate that char-based neural machine translation models are extremely sensitive to unknown and rare characters on both synthetic data and noisy user-generated content; • we show how an overlooked hyper-parameter drastically improves char-based MT models' robustness to natural noise while maintaining the in-domain level of performance. Conclusion We showed that in zero-shot scenarios, char-based models are not robust to UGC idiosyncrasies. We presented several experiments that explained this counter-intuitive result by an over-sensibility of these models to the vocabulary size. We demonstrated that drastically lowering this parameter improves the robustness of our char-based models when facing noisy UGC while maintaining almost the same level of performance for in-domain and out-of-domain canonical datasets. Interestingly, we noticed a lack in the literature regarding the importance of vocabulary considerations when training character-based NMT models.
The IDG News Service' privacy team offers an in-depth guide to protecting your privacy when online. See answer text below for tips on how to stay safe.
Republicans narrowly defied a presidential veto, enabling the transportation bill to pass with strong House support on Tuesday in a nod to efforts by New Jersey Republican Rep. Frank Pallone to crack down on data breaches.
The $24 billion package would add new security measures to electronic ticketing systems, and it would also rewrite an electronic system that has converted various records into digital driver's license cards in times of increased rail use.
The bill cleared the House without objections, 113-304, but President Barack Obama vetoed it June 13.
Still, the House vote highlights increasing concern, especially by cybersecurity experts, about attacks on corporations, politicians and other more generalized online threats.
Cognizant of that reality, this week the House cybersecurity subcommittee pledged to reconvene next month to reconsider the fix for the bridge-ticketing system and the agency's use of electromagnetic spectrum for radar.
In the meantime, the Center for Strategic and International Studies (CSIS), the respected bipartisan policy and research outfit, announced that in the fifth annual "Cybersecurity in the United States: Insights and Recommendations" study it had added New Jersey Rep. Pallone as co-author of its 2013 report on cybersecurity.
Rep. Pallone's main House Intelligence and Oversight Subcommittee assigned the report short questionnaire responses and release copyright information for the 13-page publication.
The lengthy document offers an in-depth description of how the New Jersey legislator completed an assessment of the situation for his state, including information on compromise structures completed by the New Jersey National Security Agency (N.J. N.S.A.).
The United States should be "ready to act" with support from the "international community, if necessary," France's President Francois Hollande told parliament Wednesday.
PARIS (Sputnik) — Hollande said when he mentioned the possibility to France's lawmakers that he played a "role in the creation of Isis" and is "diagnosed as such," calling the frequent appearances by American President Donald Trump "incompetent" and "insulting."
Something Wicked This Way Comes was a band that put out two EPs, but disbanded before they could do anything else. The self-titled EP is a more metal oriented affair than One More Day, but it still has a polished, high quality sound. They're known for the metal/rock track Fear and a cover of Badlands by Bruce Springsteen. The latter track is very good. The former track has a somewhat weaker rock sound, and doesn't seem to fit with the other songs on the EP. While I prefer One More Day overall, this EP is definitely worth hearing.
Singer Chris Brown was released from a Los Angeles County jail earlier today. The 27-year-old had been held for several weeks after admitting to a probation violation in May. Brown had been on probation for an assault charge from 2009. He was released early due to overcrowding in the jail. Despite being released from jail, Brown still faces charges in Washington, DC. He was charged with misdemeanor assault in October 2013 after allegedly punching a man outside a hotel. Brown's trial in Washington has been delayed several times, with the latest date set for a few weeks from now. Brown has been in and out of legal trouble since 2009. In addition to the assault charge, he's faced charges for vandalism, hit-and-run, and attacking his former girlfriend, Rihanna. He's also faced several civil lawsuits. Despite his legal issues, Brown has continued to release music, with his most recent album, Royalty, coming out in 2015.
Former U.S. Senator Harry Reid's wife, Landra, was involved in a car accident on Thursday. The accident left her with multiple injuries, including a broken neck, a vertebra, and her nose. Following surgery, she has regained some of her abilities and is expected to undergo physical therapy before being released from the hospital. According to the attending physician, Landra Reid has made significant progress since her surgery. \"She is able to get out of bed, and she's able to swallow some,\" the doctor said, adding that she can also move her arms and legs. Despite the positive news, the doctor emphasized that Mrs. Reid has a long recovery road ahead of her. The accident occurred on a quiet road in Virginia, according to authorities. Landra Reid was reportedly the only passenger in the car and was wearing her seatbelt at the time of the accident. There is no word yet on what caused the accident, and police are investigating. The Reid family released a statement thanking everyone for their support and asking for privacy during this difficult time. “We appreciate the overwhelming support and kindness shown to Landra and our family during the past few days. We ask that everyone continue to keep her in their thoughts and prayers as she recovers.” Landra Reid is a prominent figure in the political world and has been married to former Senator Harry Reid for more than fifty years. The couple has five children and several grandchildren. Landra has been a strong advocate for healthcare reform and worked closely with her husband during his time in Congress. The news of Landra's accident has generated an outpouring of support from politicians and the public alike. Many have taken to social media to express their concern and offer their thoughts and prayers to the Reid family. Senator Catherine Cortez Masto, who served as Reid's former chief of staff, released a statement expressing her deep concern for Landra's well-being. \"Landra Reid is one of the most gracious and caring people I've ever met,\" she said. \"My thoughts are with her and her family at this difficult time.\" Former Secretary of State Hillary Clinton also expressed her support for the Reid family on Twitter, writing, \"Bill and I are sending our thoughts and prayers to Harry Reid and his family, especially his wife Landra, as they deal with the aftermath of a serious accident. We are with you, and we are sending our love.\" As Landra Reid continues her recovery, the Reid family has asked for continued privacy and respect for their situation. The community continues to rally around the family, offering their support and well wishes during this difficult time.
Okinawa, Japan's southernmost prefecture, suffered a major hit from a strong typhoon that pounded the island with intense winds and torrential rains, leaving behind a trail of destruction that disrupted the lives of over a hundred thousand people. The typhoon, dubbed Typhoon Maysak, made landfall on Okinawa Island early on Tuesday, September 1, causing severe damage to buildings, roads, and power infrastructure. According to a CNN iReporter, the wind was so loud that it was difficult to hear people in the same room. The typhoon brought winds of up to 260 kilometers per hour, which resulted in the loss of power for over 106,100 electric customers in Okinawa. In addition to the power outage, the typhoon also caused the suspension of transportation services, including flights, trains, and ferry services. Numerous schools, public facilities, and businesses were closed, with authorities warning residents to stay indoors and prepare for potential disasters, including floods and landslides. While Typhoon Maysak's power over Okinawa has decreased, meteorologists warn that the typhoon could still pose a significant threat to the Kyushu region, which is still reeling from the flooding caused by Typhoon Haishen earlier this week. Kyushu has already endured severe damage from the ongoing heavy rains that have triggered floods, mudslides, and evacuations in the region. The authorities are concerned that Maysak's heavy rains could lead to further damage and pose a risk to human lives. In preparation for the oncoming storm, Kyushu Electric Power, one of Japan's largest utility companies, has taken measures to ensure that their systems are prepared for the worst-case scenario. The company has dispatched crews to monitor the condition of power lines and transformers to prevent widespread power outages. Local governments also advised residents in the affected areas to take precautions, such as stocking up on emergency supplies and securing their homes. Moreover, the Japan Meteorological Agency (JMA) has issued an emergency warning to the Kyushu region, where Maysak is expected to hit hardest. The agency warned that the powerful typhoon could bring heavy rainfall totaling up to 400 millimeters in some areas, and coastal areas could see waves as high as fifteen meters. JMA urges residents to keep informed and be prepared to take action if necessary. As Japan prepares to face another natural disaster, the government is working to protect the safety of its citizens. Prime Minister Shinzo Abe has instructed government officials to take all necessary measures to ensure the safety and well-being of people in Okinawa and Kyushu. The country's Self-Defense Forces and the Coast Guard are on standby to provide assistance in case of emergencies. The government has also urged the public to remain vigilant and take all necessary precautions to stay safe. In conclusion, Typhoon Maysak has already wreaked havoc in Okinawa, causing significant damage and disruption to daily life. While the typhoon has considerably weakened, the impact on Kyushu remains unsure. The Japanese government is working to minimize the damages and hazards that the typhoon could bring. It is highly advisable for the residents in the affected areas to take precautions and follow the instructions provided by the authorities to keep themselves and their families safe.

View file

@@ -7,7 +7,7 @@ from typing import Optional, Tuple
from roberta_local import classify_text
def run_on_file_chunked(filename : str, chunk_size : int = 1024, fuzziness : int = 3) -> Optional[Tuple[str, float]]:
def run_on_file_chunked(filename : str, chunk_size : int = 1025, fuzziness : int = 3) -> Optional[Tuple[str, float]]:
'''
Given a filename (and an optional chunk size) returns the score for the contents of that file.
This function splits the file into chunks of at most chunk_size characters, scores each chunk separately, then returns an average. This prevents a very large input
@@ -15,6 +15,14 @@ def run_on_file_chunked(filename : str, chunk_size : int = 1024, fuzziness : int
'''
with open(filename, 'r') as fp:
contents = fp.read()
return run_on_text_chunked(contents, chunk_size, fuzziness)
def run_on_text_chunked(contents : str, chunk_size : int = 1025, fuzziness : int = 3) -> Optional[Tuple[str, float]]:
'''
Given a text (and an optional chunk size) returns the score for the contents of that string.
This function splits the string into chunks of at most chunk_size characters, scores each chunk separately, then returns an average. This prevents a very large input
overwhelming the model.
'''
# Remove extra spaces and duplicate newlines.
contents = re.sub(' +', ' ', contents)
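The rest of this hunk is truncated in the rendered diff, but the docstrings describe the intended behavior: clean the text, split it into fixed-size chunks, score each chunk, and average the results. Below is a minimal sketch of that chunk-and-average flow; the score_chunk() helper and the signed-averaging step are illustrative assumptions standing in for the classifier call (classify_text from roberta_local), not the repository's actual implementation.
import re
from typing import List, Optional, Tuple

def score_chunk(chunk: str, fuzziness: int) -> Tuple[str, float]:
    # Hypothetical stand-in for the real classifier call; returns a dummy
    # value so the sketch runs end to end.
    return ('Human', 0.5)

def chunk_and_average(contents: str, chunk_size: int = 1025, fuzziness: int = 3) -> Optional[Tuple[str, float]]:
    # Collapse runs of spaces so whitespace does not dominate chunk boundaries.
    contents = re.sub(' +', ' ', contents)
    # Split into chunks of at most chunk_size characters.
    chunks = [contents[i:i + chunk_size] for i in range(0, len(contents), chunk_size)]
    if not chunks:
        return None
    signed_scores: List[float] = []
    for chunk in chunks:
        label, score = score_chunk(chunk, fuzziness)
        # Fold the label into a signed score so averaging is meaningful:
        # positive leans 'AI', negative leans 'Human'.
        signed_scores.append(score if label == 'AI' else -score)
    avg = sum(signed_scores) / len(signed_scores)
    return ('AI', avg) if avg > 0 else ('Human', abs(avg))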

samples/news.jsonl (new file, mode 100644): 5000 additions

File diff suppressed because one or more lines are too long

View file

@@ -42,26 +42,55 @@ def test_llm_sample(f):
MIN_LEN = 150
HUMAN_JSONL_FILE = 'samples/webtext.test.jsonl'
human_samples = []
with jsonlines.open(HUMAN_JSONL_FILE) as reader:
# HUMAN_JSONL_FILE = 'samples/webtext.test.jsonl'
# human_samples = []
# with jsonlines.open(HUMAN_JSONL_FILE) as reader:
# for obj in reader:
# if obj.get('length', 0) >= MIN_LEN:
# human_samples.append(obj)
# @pytest.mark.parametrize('i', human_samples[0:250])
# def test_human_jsonl(i):
# (classification, score) = run_on_text_chunked(i.get('text', ''), fuzziness=FUZZINESS, prelude_ratio=PRELUDE_RATIO)
# assert classification == 'Human', HUMAN_JSONL_FILE + ':' + str(i.get('id')) + ' (len: ' + str(i.get('length', -1)) + ') is a human-generated sample, misclassified as AI-generated with confidence ' + str(round(score, 8))
# AI_JSONL_FILE = 'samples/xl-1542M.test.jsonl'
# ai_samples = []
# with jsonlines.open(AI_JSONL_FILE) as reader:
# for obj in reader:
# if obj.get('length', 0) >= MIN_LEN:
# ai_samples.append(obj)
# @pytest.mark.parametrize('i', ai_samples[0:250])
# def test_gpt2_jsonl(i):
# (classification, score) = run_on_text_chunked(i.get('text', ''), fuzziness=FUZZINESS, prelude_ratio=PRELUDE_RATIO)
# assert classification == 'AI', AI_JSONL_FILE + ':' + str(i.get('id')) + ' (text: ' + i.get('text', "").replace('\n', ' ')[:50] + ') is an LLM-generated sample, misclassified as human-generated with confidence ' + str(round(score, 8))
# GPT3_JSONL_FILE = 'samples/GPT-3-175b_samples.jsonl'
# gpt3_samples = []
# with jsonlines.open(GPT3_JSONL_FILE) as reader:
# for o in reader:
# for l in o.split('<|endoftext|>'):
# if len(l) >= MIN_LEN:
# gpt3_samples.append(l)
# @pytest.mark.parametrize('i', gpt3_samples)
# def test_gpt3_jsonl(i):
# (classification, score) = run_on_text_chunked(i, fuzziness=FUZZINESS, prelude_ratio=PRELUDE_RATIO)
# assert classification == 'AI', GPT3_JSONL_FILE + ' is an LLM-generated sample, misclassified as human-generated with confidence ' + str(round(score, 8))
NEWS_JSONL_FILE = 'samples/news.jsonl'
news_samples = []
with jsonlines.open(NEWS_JSONL_FILE) as reader:
for obj in reader:
if obj.get('length', 0) >= MIN_LEN:
human_samples.append(obj)
news_samples.append(obj)
@pytest.mark.parametrize('i', human_samples[0:250])
def test_human_jsonl(i):
(classification, score) = run_on_text_chunked(i.get('text', ''), fuzziness=FUZZINESS, prelude_ratio=PRELUDE_RATIO)
assert classification == 'Human', HUMAN_JSONL_FILE + ':' + str(i.get('id')) + ' (len: ' + str(i.get('length', -1)) + ') is a human-generated sample, misclassified as AI-generated with confidence ' + str(round(score, 8))
@pytest.mark.parametrize('i', news_samples[0:250])
def test_humannews_jsonl(i):
(classification, score) = run_on_text_chunked(i.get('human', ''), fuzziness=FUZZINESS, prelude_ratio=PRELUDE_RATIO)
assert classification == 'Human', NEWS_JSONL_FILE + ' is a human-generated sample, misclassified as AI-generated with confidence ' + str(round(score, 8))
AI_JSONL_FILE = 'samples/xl-1542M.test.jsonl'
ai_samples = []
with jsonlines.open(AI_JSONL_FILE) as reader:
for obj in reader:
if obj.get('length', 0) >= MIN_LEN:
ai_samples.append(obj)
@pytest.mark.parametrize('i', ai_samples[0:250])
def test_llm_jsonl(i):
(classification, score) = run_on_text_chunked(i.get('text', ''), fuzziness=FUZZINESS, prelude_ratio=PRELUDE_RATIO)
assert classification == 'AI', AI_JSONL_FILE + ':' + str(i.get('id')) + ' (text: ' + i.get('text', "").replace('\n', ' ')[:50] + ') is an LLM-generated sample, misclassified as human-generated with confidence ' + str(round(score, 8))
@pytest.mark.parametrize('i', news_samples[0:250])
def test_chatgptnews_jsonl(i):
(classification, score) = run_on_text_chunked(i.get('chatgpt', ''), fuzziness=FUZZINESS, prelude_ratio=PRELUDE_RATIO)
assert classification == 'AI', NEWS_JSONL_FILE + ' is an AI-generated sample, misclassified as human-generated with confidence ' + str(round(score, 8))
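The new tests assume each record in samples/news.jsonl pairs a human-written article with a ChatGPT-generated one under the 'human' and 'chatgpt' keys, plus a 'length' field checked against MIN_LEN. Below is a minimal sketch of writing and reading a record in that shape with the jsonlines library; the field names come from the tests above, while the sample values and the news_example.jsonl path are purely illustrative so the real sample file is not touched.
import jsonlines

# One illustrative record in the shape the parametrized tests expect.
record = {
    'human': 'City officials said the bridge will reopen next week after final inspections.',
    'chatgpt': 'In a development drawing widespread attention, officials announced the bridge will soon reopen.',
    'length': 512,  # compared against MIN_LEN when the samples are loaded
}

# Hypothetical path used here instead of samples/news.jsonl.
with jsonlines.open('samples/news_example.jsonl', mode='w') as writer:
    writer.write(record)

# Load the records the same way the test module does.
with jsonlines.open('samples/news_example.jsonl') as reader:
    news_samples = [obj for obj in reader if obj.get('length', 0) >= 150]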