
Introducing a Dataset to Detect GPT-Generated Text



How to create datasets for ChatGPT detection models

Photo by Markus Spiske on Unsplash

With the breakthrough success of large language models like ChatGPT, people are finding innovative ways to use these models in their daily lives. However, this success has also led to unintended consequences: students cheating on homework and tests, ChatGPT being used to write research papers, and even scammers using these models to trick people. To address these issues, there is a growing need for models that can detect text generated by GPT models. One of the crucial requirements for building robust detection models is access to a large dataset of human-written and GPT-generated responses. This article introduces such a dataset, consisting of 150,000 pairs of human-written and GPT-generated responses to Wikipedia topics, and outlines a framework for generating similar datasets in the future.

Dataset

The GPT-wiki-intro dataset is available on Hugging Face. It contains human-written and GPT (Curie)-generated introductions for 150,000 Wikipedia topics. The prompt used to generate the GPT responses is as follows:

f"200 word wikipedia style introduction on '{title}'
{starter_text}"

Here `title` is the title of the Wikipedia page, and `starter_text` is the first seven words of the introduction paragraph. The dataset also includes useful metadata such as title_len, wiki_intro_len, generated_intro_len, prompt_tokens, and generated_text_tokens. The schema of the dataset is as follows:

| Column                | Datatype | Description                                 |
|-----------------------|----------|---------------------------------------------|
| id                    | int64    | ID                                          |
| url                   | string   | Wikipedia URL                               |
| title                 | string   | Title                                       |
| wiki_intro            | string   | Introduction paragraph from Wikipedia       |
| generated_intro       | string   | Introduction generated by GPT (Curie) model |
| title_len             | int64    | Number of words in title                    |
| wiki_intro_len        | int64    | Number of words in wiki_intro               |
| generated_intro_len   | int64    | Number of words in generated_intro          |
| prompt                | string   | Prompt used to generate intro               |
| generated_text        | string   | Text continued after the prompt             |
| prompt_tokens         | int64    | Number of tokens in the prompt              |
| generated_text_tokens | int64    | Number of tokens in generated_text          |

This dataset is shared under a Creative Commons license and can be distributed, remixed, adapted, and built upon. The code to generate this dataset can be found here.
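As a quick sanity check, the dataset can be loaded with the Hugging Face `datasets` library. A minimal sketch, assuming the dataset's single `train` split on the Hub:

from datasets import load_dataset

# Load GPT-wiki-intro from the Hugging Face Hub
dataset = load_dataset("aadityaubhat/GPT-wiki-intro", split="train")

# Each row pairs a human-written intro with a GPT (Curie) generated one
row = dataset[0]
print(row["title"])
print(row["wiki_intro"][:200])       # human-written introduction
print(row["generated_intro"][:200])  # GPT-generated introduction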

Framework

This dataset is a good starting point for the general use-case of detecting GPT-generated text. But if you have a specific use-case (for example, detecting ChatGPT-generated answers to test or homework questions, or detecting whether a message was sent by a human or a chatbot), or if you need a larger dataset in a specific domain, you can use the following framework to generate your own dataset. The dataset generation process involves three main steps:

  1. Get the anchor dataset
  2. Clean the anchor dataset
  3. Augment the dataset with human-written or GPT-generated data

Get the anchor dataset

In this step, we acquire the anchor dataset: the existing data that is readily available for the specific use-case. For the GPT-wiki-intro dataset, the anchor was the wikipedia dataset, which contains cleaned articles in all languages. For detecting cheating on tests and homework, the anchor dataset could be question-answer pairs submitted by previous students. If you don't have a well-defined anchor dataset, you can explore open-source datasets on Hugging Face and Kaggle that align well with your use-case. The anchor dataset doesn't have to be human-written; we can also use GPT-generated data as the anchor, for example, data from a ChatGPT prompt-response library. A loading sketch is shown below.
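For the Wikipedia case, the anchor data can be pulled from the Hugging Face Hub. A sketch, assuming the preprocessed "20220301.en" English snapshot (other snapshots and languages are available):

from datasets import load_dataset

# Load a preprocessed English Wikipedia snapshot as the anchor dataset
anchor = load_dataset("wikipedia", "20220301.en", split="train")

print(anchor.column_names)   # ['id', 'url', 'title', 'text']
print(anchor[0]["title"])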

Clean the anchor dataset

Once we have the anchor dataset, we need to clean it to keep the most relevant information. ChatGPT detection models are sensitive to the length of the text and perform poorly on short inputs, so we can set a threshold and filter out any responses shorter than it. For example, in the GPT-wiki-intro dataset, we filter out all rows where the introduction is shorter than 150 words or longer than 350 words, and all rows where the title is more than three words. At this step we also need to decide the total size of the dataset: since augmenting the data with either human-written or GPT-generated responses is expensive, we need to figure out the minimum dataset size required for our use-case.
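With the thresholds above, the cleaning step might look like the following sketch. It assumes the anchor data is a Hugging Face Dataset with `title` and `text` columns, approximates word counts by whitespace splitting, and assumes the introduction paragraph has already been extracted from the full article text:

def keep_row(row):
    intro_len = len(row["text"].split())   # word count of the introduction
    title_len = len(row["title"].split())  # word count of the title
    # Keep intros between 150 and 350 words, with titles of at most 3 words
    return 150 <= intro_len <= 350 and title_len <= 3

cleaned = anchor.filter(keep_row)
print(f"Kept {len(cleaned)} of {len(anchor)} rows")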

Augment the dataset with human-written or GPT-generated data

This is the final step of dataset generation. Here, we augment the anchor dataset with human-written or GPT-generated data. The most important part of this step is designing the prompt used to generate responses from GPT, or the question that will be answered by humans. To finalize the prompt, we can use the OpenAI Playground to test different prompts across models, temperatures, frequency penalties, and presence penalties. To increase the diversity of the dataset, we can finalize n prompts and use them uniformly to collect responses. For human responses, we can finalize the questions by giving different variations to a small survey population and checking the results. Once the prompts or questions are finalized, we can use the OpenAI API to generate the GPT responses, or a service such as Mechanical Turk to collect the human-written responses.
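A sketch of the GPT side of the augmentation, using the prompt format from earlier and the legacy completions endpoint that Curie-class models use. The parameter values are illustrative placeholders, not the settings used for the published dataset:

import openai

openai.api_key = "YOUR_API_KEY"  # placeholder; read from an env var in practice

def generate_intro(title, starter_text):
    # Same prompt format as used for GPT-wiki-intro
    prompt = f"200 word wikipedia style introduction on '{title}'\n{starter_text}"
    response = openai.Completion.create(
        model="text-curie-001",  # Curie, as used for this dataset
        prompt=prompt,
        max_tokens=300,          # illustrative value
        temperature=0.7,         # illustrative value
        frequency_penalty=0.4,   # illustrative value
        presence_penalty=0.1,    # illustrative value
    )
    # The model continues after the starter text, so prepend it
    return starter_text + response["choices"][0]["text"]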

Conclusion

In conclusion, with the widespread use of large language models like ChatGPT, there is a growing need for models that can detect text generated by them. This article introduced the GPT-wiki-intro dataset and outlined a framework for generating similar datasets. The availability of such datasets will play a critical role in developing robust models for detecting GPT-generated text and addressing the unintended consequences of these models' use.

Citation

If you find this work helpful, please consider citing:

@misc{aaditya_bhat_2023,
  author    = {Aaditya Bhat},
  title     = {GPT-wiki-intro (Revision 0e458f5)},
  year      = {2023},
  url       = {https://huggingface.co/datasets/aadityaubhat/GPT-wiki-intro},
  doi       = {10.57967/hf/0326},
  publisher = {Hugging Face}
}

Introducing a Dataset to Detect GPT-Generated Text was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.
