📚 Introducing a Dataset to Detect GPT-Generated Text
💡 News category: AI News
🔗 Quelle: towardsdatascience.com
How to create datasets for ChatGPT detection models
With the breakthrough success of large language models like ChatGPT, people are finding innovative ways to use these models in their daily lives. However, this has also led to unintended consequences, such as students cheating on homework and tests, the use of ChatGPT to write research papers, and even scammers using these models to trick people. To address these issues, there is a growing need for models that can detect text generated by GPT models. One of the crucial requirements for building robust detectors of GPT-generated text is access to a large dataset of human-written and GPT-generated responses. This article introduces such a dataset, consisting of 150k human-written and GPT-generated responses to Wikipedia topics, and outlines a framework for generating similar datasets in the future.
Dataset
The GPT-wiki-intro dataset is available on Hugging Face. It contains human-written and GPT-generated (Curie model) introductions for 150,000 Wikipedia topics. The prompt used to generate each GPT response is as follows:
f"200 word wikipedia style introduction on '{title}'
{starter_text}"
Here, `title` is the title of the Wikipedia page, and `starter_text` is the first seven words of the introduction paragraph. The dataset also includes useful metadata such as `title_len`, `wiki_intro_len`, `generated_intro_len`, `prompt_tokens`, and `generated_text_tokens`. The schema of the dataset is as follows:
----------------------------------------------------------------------------
|Column |Datatype|Description |
|---------------------|--------|-------------------------------------------|
|id |int64 |ID |
|url |string |Wikipedia URL |
|title |string |Title |
|wiki_intro           |string  |Introduction paragraph from Wikipedia      |
|generated_intro |string |Introduction generated by GPT (Curie) model|
|title_len |int64 |Number of words in title |
|wiki_intro_len |int64 |Number of words in wiki_intro |
|generated_intro_len |int64 |Number of words in generated_intro |
|prompt |string |Prompt used to generate intro |
|generated_text |string |Text continued after the prompt |
|prompt_tokens |int64 |Number of tokens in the prompt |
|generated_text_tokens|int64 |Number of tokens in generated text |
----------------------------------------------------------------------------
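The prompt template described above can be sketched in Python. This is a minimal, illustrative reconstruction: the helper name `build_prompt` and the example row are assumptions, not code from the dataset's repository.

```python
# Sketch of the prompt template used for GPT-wiki-intro: a 200-word
# request followed by the first seven human-written words as starter text.
# `build_prompt` is an illustrative helper name, not the dataset's own code.

def build_prompt(title: str, wiki_intro: str, n_starter_words: int = 7) -> str:
    """Build the generation prompt from a page title and its human-written intro."""
    starter_text = " ".join(wiki_intro.split()[:n_starter_words])
    return f"200 word wikipedia style introduction on '{title}'\n{starter_text}"

prompt = build_prompt(
    "Alan Turing",
    "Alan Mathison Turing was an English mathematician, computer "
    "scientist, logician, cryptanalyst, philosopher, and theoretical "
    "biologist.",
)
print(prompt)
```

Passing the starter text after the instruction nudges the model to continue the human-written opening, which keeps the generated and human introductions stylistically comparable.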
This dataset is shared under a Creative Commons license and can be distributed, remixed, adapted, and built upon. The code used to generate the dataset can be found here.
Framework
This dataset is a good starting point for the general use-case of detecting GPT-generated text. But if you have a specific use-case (for example, detecting ChatGPT-generated answers to test or homework questions, or determining whether a message was sent by a human or a chatbot), or if you need a larger dataset in a specific domain, you can use the following framework to generate your own dataset. The dataset generation process involves three main steps:
- Get the anchor dataset
- Clean the anchor dataset
- Augment the dataset with human-written or GPT-generated data
Get the anchor dataset
In this step, we acquire the anchor dataset: the existing data that is readily available for the specific use-case. For the GPT-wiki-intro dataset, the anchor was the Wikipedia dataset, which contains cleaned articles in all languages. For detecting cheating on tests and homework, the anchor dataset could be question-answer pairs submitted by previous students. If you don't have a well-defined anchor dataset, you can explore the various open-source datasets on Hugging Face and Kaggle that align well with your use-case. The anchor dataset doesn't have to be human-written; we can also use GPT-generated data as the anchor, for example data from a ChatGPT prompt-response library.
Clean the anchor dataset
Once we have the anchor dataset, we clean it to keep only the most relevant information. ChatGPT-detection models are sensitive to text length and perform poorly on short texts, so we can set a threshold and filter out any responses shorter than it. For example, in the GPT-wiki-intro dataset, we filter out all rows where the introduction is shorter than 150 words or longer than 350 words, as well as all rows where the title is more than three words. At this step we also need to decide the total size of the dataset: since augmenting the data with either human-written or GPT-generated responses is expensive, we need to determine the minimum dataset size required for our use-case.
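The cleaning step above can be sketched as a simple filter. This is a hypothetical sketch, assuming rows are plain dicts with `title` and `wiki_intro` fields; the thresholds match the GPT-wiki-intro choices (150–350 intro words, titles of at most three words), but the function name is illustrative.

```python
# Sketch of the cleaning step: drop rows whose intro is too short or too
# long, and rows whose title is too long. Field names and the helper name
# are assumptions for illustration.

def clean_anchor_rows(rows, min_words=150, max_words=350, max_title_words=3):
    """Keep only rows that pass the intro-length and title-length thresholds."""
    kept = []
    for row in rows:
        intro_len = len(row["wiki_intro"].split())
        title_len = len(row["title"].split())
        if min_words <= intro_len <= max_words and title_len <= max_title_words:
            kept.append(row)
    return kept

rows = [
    {"title": "Alan Turing", "wiki_intro": "word " * 200},   # kept
    {"title": "Alan Turing", "wiki_intro": "word " * 10},    # too short
    {"title": "List of towns in Upper Franconia",
     "wiki_intro": "word " * 200},                           # title too long
]
print(len(clean_anchor_rows(rows)))  # → 1
```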
Augment the dataset with human-written or GPT-generated data
This is the final step of dataset generation: we augment the anchor dataset with human-written or GPT-generated data. The most important part of this step is designing the prompt used to elicit responses from GPT, or the question that will be answered by humans. To finalize the prompt, we can use the OpenAI Playground to test different prompts across models, temperatures, frequency penalties, and presence penalties. To increase the diversity of the dataset, we can finalize n prompts and use them uniformly when collecting responses. For human responses, we can present different variations of the questions to a small survey population and check the results to finalize the n questions. Once the prompts or questions are finalized, we can use the OpenAI API to generate the GPT responses, or a service such as Mechanical Turk to collect the human-written responses.
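The augmentation step can be sketched as below. The round-robin scheme for spreading n finalized prompt templates uniformly over the rows, and the helper names, are illustrative assumptions; the REST endpoint and payload fields are the OpenAI completions API's public interface (the call requires a valid API key and is only defined, not executed, here).

```python
# Sketch of the augmentation step: cycle uniformly over n finalized prompt
# templates for diversity, then send each prompt to the OpenAI completions
# endpoint. Helper names and the round-robin scheme are illustrative.
import itertools
import json
import urllib.request

def assign_prompts(titles, templates):
    """Round-robin the finalized templates over the titles for prompt diversity."""
    return [tmpl.format(title=t) for t, tmpl in zip(titles, itertools.cycle(templates))]

def complete(prompt, api_key, model="text-curie-001", temperature=0.7):
    """Call the OpenAI completions REST API (requires a valid key; not run here)."""
    req = urllib.request.Request(
        "https://api.openai.com/v1/completions",
        data=json.dumps({
            "model": model,
            "prompt": prompt,
            "max_tokens": 300,
            "temperature": temperature,
            "frequency_penalty": 0.4,   # illustrative values tuned in the Playground
            "presence_penalty": 0.4,
        }).encode(),
        headers={"Authorization": f"Bearer {api_key}",
                 "Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["text"]

prompts = assign_prompts(
    ["Alan Turing", "Ada Lovelace", "Claude Shannon"],
    ["200 word wikipedia style introduction on '{title}'",
     "Write an encyclopedic introduction about {title}"],
)
print(prompts[2])
```

Cycling over the templates rather than sampling randomly guarantees each template appears an equal number of times, which keeps the prompt distribution balanced in the final dataset.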
Conclusion
In conclusion, with the widespread use of large language models like ChatGPT, there is a growing need for models that can detect text generated by them. This article introduced the GPT-wiki-intro dataset and outlined a framework for generating similar datasets. The availability of such datasets will play a critical role in developing robust models for detecting GPT-generated text and in addressing the unintended consequences of these models' use.
Citation
If you find this work helpful, please consider citing:
@misc{aaditya_bhat_2023,
  author    = {Aaditya Bhat},
  title     = {GPT-wiki-intro (Revision 0e458f5)},
  year      = {2023},
  url       = {https://huggingface.co/datasets/aadityaubhat/GPT-wiki-intro},
  doi       = {10.57967/hf/0326},
  publisher = {Hugging Face}
}
Introducing a Dataset to Detect GPT-Generated Text was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.